2. 06/22/15 2
Agenda
● What is BigData
● Hadoop and its Evolution
● Hadoop Architecture and Components
● Hadoop and GlusterFS (glusterfs-hadoop plugin)
● Advantages of using GlusterFS with Hadoop
● References
3. 06/22/15 3
What is BigData
● Software solutions mostly capture, maintain and manage data
● Storing data
● Processing data
● Data volumes keep growing; major big data generators include:
● Sensors
● CCTV cameras
● Social networks
● Online shopping portals
● Airlines
● Hospitality
5. 06/22/15 5
What is BigData
● About 90% of the data that exists today was generated in the last two years
● 1990
● HDD: 1-20 GB, RAM: 14-128 MB, Speed: 10 kbps
● 2014
● HDD: 0.5-1 TB, RAM: 1-16 GB, Speed: 100 Mbps
● Three factors that define BigData
● Volume
● Velocity
● Variety (unstructured and semi-structured data)
6. 06/22/15 6
What is BigData
● SAN (Storage Area Network)
● One option: store the data in data centers, fetch it when needed, and run the computation on it locally
● Computation is processor-bound, and a single machine's processing capacity is limited
● As the data grows, more and more computation is needed, and it is no longer possible to perform it on a local machine
● Solution: send the computation to the storage nodes and get back only the processed data (the computation is small compared to the data)
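A rough back-of-the-envelope calculation illustrates why sending the computation to the data is the better option. This is only a sketch: it assumes the "Speed" figure from the 2014 numbers earlier (100 Mbps) is network bandwidth, uses a 1 TB data set, and picks a hypothetical 10 MB job package. The function name is illustrative.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Time to push size_bytes over a link, ignoring protocol overhead."""
    return size_bytes * 8 / link_bits_per_second


TB = 10**12
MB = 10**6
LINK_BPS = 100 * 10**6  # 100 Mbps, assumed to be the network speed

# Option 1: pull the whole data set to the machine that computes.
data_seconds = transfer_seconds(1 * TB, LINK_BPS)

# Option 2: push a small job package (hypothetical 10 MB) to the storage node.
code_seconds = transfer_seconds(10 * MB, LINK_BPS)

print(f"move 1 TB of data : {data_seconds / 3600:.1f} hours")
print(f"move 10 MB of code: {code_seconds:.1f} seconds")
```

Under these assumptions, moving the data takes the better part of a day, while moving the code takes under a second.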
7. 06/22/15 7
Hadoop Evolution
● Started with Google white papers
● GFS (Google File System), 2003: storage
● MapReduce, 2004: computation
● Yahoo
● HDFS (Hadoop Distributed File System): 2006-07
● MapReduce (computation mechanism): 2007-08
● Created by Doug Cutting and Mike Cafarella; Cutting joined Yahoo in 2006
● Logo: an elephant
● Apache Software Foundation: Hadoop was split out of the Apache Nutch project in 2006, with Yahoo as a major contributor
8. 06/22/15 8
Hadoop Architecture / Components
● A framework of tools, not a single application
● Supports running applications on BigData
● Open-source set of tools distributed under the Apache License
● Traditional approach to handling huge data
● A powerful computer with large storage and computation capacity
● Limited by the computer's processing power as the data grows
● Hadoop approach
● Breaks the data into smaller pieces and distributes them to multiple computers
● Breaks the computation into smaller pieces as well and distributes them
● Combines the results and returns them
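The Hadoop approach can be sketched as a tiny word count. This is an illustration only, not Hadoop code: everything runs sequentially in one process, whereas Hadoop would run each piece on a different machine. The function names and sample lines are made up for the example.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(lines, n_chunks):
    """'Break up data': deal the lines round-robin into n_chunks pieces."""
    return [lines[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk):
    """'Break up computation': count words in one piece, independently."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def combine(partials):
    """'Combine results': merge the per-piece counts into one answer."""
    return reduce(lambda a, b: a + b, partials, Counter())


lines = [
    "big data needs big storage",
    "hadoop splits data",
    "hadoop splits computation",
]
chunks = split_into_chunks(lines, 2)
result = combine(map_chunk(chunk) for chunk in chunks)
print(dict(result))
```

Because each piece is counted without looking at the others, the per-piece work can be handed to as many machines as there are pieces.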
9. 06/22/15 9
Hadoop Architecture / Components
● Map Reduce
● Job Tracker
● Task Tracker
● HDFS
● Name Node
● Data Node
● Applications contact the master node; a job is formed and submitted to the Job Tracker
● The Job Tracker maintains a queue of jobs and gets them processed using the Task Trackers and Data Nodes
● It consolidates the results and sends them back to the application
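The submit, queue, process and consolidate flow can be sketched as follows. This is a toy model under stated assumptions, not Hadoop's actual classes: a "job" is just a list of numbers to add up, and the class and method names are chosen for illustration.

```python
from collections import deque


class TaskTracker:
    """Runs one small piece of a job (here: sums a slice of numbers)."""

    def __init__(self, name):
        self.name = name

    def run(self, task):
        return sum(task)


class JobTracker:
    """Queues submitted jobs, splits each into tasks, consolidates results."""

    def __init__(self, task_trackers):
        self.task_trackers = task_trackers
        self.queue = deque()  # FIFO queue of submitted jobs

    def submit(self, job):
        self.queue.append(job)

    def run_next(self):
        job = self.queue.popleft()
        n = len(self.task_trackers)
        # One task per Task Tracker: every n-th element of the job.
        tasks = [job[i::n] for i in range(n)]
        partials = [tt.run(task) for tt, task in zip(self.task_trackers, tasks)]
        return sum(partials)  # consolidated result for the application


job_tracker = JobTracker([TaskTracker("tt1"), TaskTracker("tt2"), TaskTracker("tt3")])
job_tracker.submit(list(range(1, 101)))
job_tracker.submit([5, 5])
first = job_tracker.run_next()
second = job_tracker.run_next()
print(first, second)
```

Jobs come out in the order they were submitted, and the application only ever sees the consolidated result.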
10. 06/22/15 10
Hadoop Architecture / Components
● Hadoop works on a distributed model
● Numerous low-cost computers (commodity hardware)
● Hadoop components
● Slaves
– Task Tracker – processes the smaller piece of the task assigned to it
– Data Node – manages the piece of data distributed to this node
● Master
– Job Tracker – tracks the overall job
– Name Node – maintains the index of the data blocks stored on different nodes
– Task Tracker
– Data Node
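The Name Node's role as an index of data blocks can be sketched like this. It is a deliberately simplified model: there is no replication, the block size is tiny, blocks are placed round-robin, and the data passes through the Name Node object only to keep the example short. All names are illustrative.

```python
from collections import defaultdict

BLOCK_SIZE = 4  # bytes; unrealistically small so the example stays readable


class DataNode:
    """Holds the blocks that were distributed to this node."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes


class NameNode:
    """Keeps the index: which blocks make up a file and where they live."""

    def __init__(self, data_nodes):
        self.data_nodes = data_nodes
        self.index = defaultdict(list)  # filename -> [(block_id, node_name)]

    def write(self, filename, data):
        for offset in range(0, len(data), BLOCK_SIZE):
            block_no = offset // BLOCK_SIZE
            node = self.data_nodes[block_no % len(self.data_nodes)]
            block_id = f"{filename}#{block_no}"
            node.blocks[block_id] = data[offset:offset + BLOCK_SIZE]
            self.index[filename].append((block_id, node.name))

    def read(self, filename):
        by_name = {node.name: node for node in self.data_nodes}
        return b"".join(
            by_name[node_name].blocks[block_id]
            for block_id, node_name in self.index[filename]
        )


name_node = NameNode([DataNode("dn1"), DataNode("dn2")])
name_node.write("log.txt", b"hello hadoop")
placement = [node_name for _, node_name in name_node.index["log.txt"]]
print(placement, name_node.read("log.txt"))
```

The index is the only place that knows how to reassemble a file, which is why the Name Node is a centralized metadata node.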
11. 06/22/15 11
Hadoop Architecture / Components
[Diagram: Applications, Queue, Master (Job Tracker, Name Node, Task Tracker, Data Node), Slaves (each with a Task Tracker and a Data Node)]
14. 06/22/15 14
Hadoop and GlusterFS
● GlusterFS is a general-purpose scale-out distributed file system supporting thousands of clients
● Aggregates storage exports over a network interconnect to provide a single unified namespace
● File system completely in userspace, runs on commodity hardware
● Layered on disk file systems that support extended attributes
15. 06/22/15 15
Hadoop and GlusterFS
● Hadoop consists of a set of daemons running in the system
● Name Node – centralized metadata node
● Job Tracker – overall task distribution across data nodes
● Task Tracker – runs on data nodes to manage the tasks assigned there
● Data Node – stores the data
● Hadoop = Map Reduce framework + HDFS
● GlusterFS can be a replacement for HDFS
● glusterfs-hadoop-plugin
● Java module which implements the Hadoop file system interface
● Simply a JAR file that is placed in the Hadoop libraries
● Replaces HDFS with GlusterFS
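The idea of swapping the storage layer behind a common file system interface can be sketched as below. This is a conceptual illustration in Python, not the plugin's Java code: the `FileSystem` class, the `BACKENDS` registry and the dict-backed stores are all hypothetical stand-ins, and the URIs are made up.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class FileSystem(ABC):
    """Stand-in for the file system interface that jobs are written against."""

    @abstractmethod
    def read(self, path):
        ...


class DictBackedFileSystem(FileSystem):
    """Toy backend: files live in a dict instead of on real storage."""

    def __init__(self, files):
        self.files = files

    def read(self, path):
        return self.files[path]


# Hypothetical registry: URI scheme -> backend instance.
BACKENDS = {
    "hdfs": DictBackedFileSystem({"/in.txt": b"a\nb\n"}),
    "glusterfs": DictBackedFileSystem({"/in.txt": b"a\nb\nc\n"}),
}


def open_uri(uri):
    """Pick the backend from the URI scheme, then read the path from it."""
    parsed = urlparse(uri)
    return BACKENDS[parsed.scheme].read(parsed.path)


def count_lines(uri):
    """The 'job': identical code whichever backend the URI selects."""
    return len(open_uri(uri).splitlines())


hdfs_lines = count_lines("hdfs:///in.txt")
gluster_lines = count_lines("glusterfs:///in.txt")
print(hdfs_lines, gluster_lines)
```

The job code never names a storage system; only the URI scheme decides which implementation serves the read, which is what lets a drop-in module replace the default one.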
16. 06/22/15 16
Hadoop and GlusterFS
● In Hadoop, data locality is ensured by the Job Tracker
● The glusterfs-hadoop-plugin ensures data locality by having the gluster volumes mounted as FUSE mounts
● Effectively no name node is involved
● Only clients, where the map-reduce jobs run
● And data nodes to store the data
● The glusterfs-hadoop-plugin talks to GlusterFS through FUSE mounts
● In the absence of a name node, the plugin uses the extended attributes (xattrs) mechanism to get the file location details from the volume and consolidates the data using the same
● Reads the data directly from the bricks, bypassing the volume, for improved performance
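Data locality boils down to running each task on a node that already stores the data it needs. The sketch below shows one simple way to do that; it is not the Job Tracker's or the plugin's actual algorithm. The `block_locations` mapping is hard-coded here, whereas a real setup would obtain the locations from the storage layer, and all host names are hypothetical.

```python
def assign_tasks(block_locations, workers):
    """Map each block to a worker, preferring workers that hold the block.

    block_locations: block_id -> list of hosts storing that block
    workers: hosts available to run tasks
    Ties and non-local fallbacks go to the least-loaded candidate.
    """
    load = {worker: 0 for worker in workers}
    assignment = {}
    for block, hosts in block_locations.items():
        local = [host for host in hosts if host in load]
        candidates = local or workers  # fall back to any worker if none is local
        chosen = min(candidates, key=lambda host: load[host])
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


block_locations = {
    "b0": ["node1", "node2"],
    "b1": ["node2"],
    "b2": ["node1", "node2"],
    "b3": ["node9"],  # stored on a host that runs no tasks
}
assignment = assign_tasks(block_locations, ["node1", "node2", "node3"])
print(assignment)
```

Only `b3` is read over the network here, because no available worker stores it; every other task runs where its data already is.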
17. 06/22/15 17
Hadoop and GlusterFS
● As simple as starting the map-reduce daemons and then submitting the Hadoop job with GlusterFS as the storage
● Analytics use case: with HDFS, files have to be moved around between nodes, whereas with GlusterFS the volume only needs to be FUSE-mounted and no files are moved
18. 06/22/15 18
Advantages
● Elimination of the centralized metadata server (name node)
● Compatibility with MapReduce and Hadoop-based applications
● No code rewrites needed to run Hadoop applications on GlusterFS
● Fault-tolerant file system
● Allows co-location of compute and data nodes, and the ability to run Hadoop jobs across multiple namespaces using multiple GlusterFS volumes
● Data access through several different mechanisms / protocols (FUSE, NFS, SMB, Swift ... and of course Hadoop)