Hadoop is an open source software framework for distributed storage and processing of large datasets across clusters of commodity servers. It has several related projects including Pig, Hive, Mahout, Avro, ZooKeeper, and Chukwa. Large companies like Yahoo, Facebook, and Amazon use Hadoop to process petabytes of data daily on clusters of thousands of servers.
2. Hadoop – What is it?
● An open-source framework developed in Java
● Supports very large data sets
● Supports large clusters of servers
● Designed to run on pre-existing, low-cost hardware
● Distributes processing work across the cluster
● Distributes data storage across the cluster
● Provides resilience via automatic failure handling
3. Hadoop – Architecture
Hadoop consists of
● Hadoop Common
Common utilities for Hadoop module support
● Hadoop MapReduce
Parallel processing of Hadoop data
● Hadoop YARN
Job scheduling and cluster resource management
● Hadoop Distributed File System (HDFS)
A master/slave file system that spreads Hadoop data over a very
large cluster of slave data nodes, coordinated by a single master name node.
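To make the MapReduce model above concrete, here is a minimal sketch of the map/shuffle/reduce flow in plain Java. This is a conceptual illustration, not the real Hadoop API: the class and method names are illustrative, and the in-memory grouping stands in for Hadoop's distributed shuffle.

```java
import java.util.*;
import java.util.stream.*;

// Conceptual word-count sketch of the MapReduce model (not the Hadoop API).
// Map emits (word, 1) pairs, the shuffle groups pairs by key,
// and reduce sums the counts for each word.
public class WordCountSketch {

    // Map phase: split one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce phase: sum all counts emitted for one word.
    static int reduce(List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data on big clusters", "big data");

        // Shuffle: group intermediate pairs by key, as the framework
        // does between the map and reduce phases.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input)
            for (Map.Entry<String, Integer> kv : map(line))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());

        grouped.forEach((word, counts) ->
                System.out.println(word + "\t" + reduce(counts)));
    }
}
```

In a real Hadoop job, many map tasks run in parallel on the data nodes holding each block of input, and the shuffle moves intermediate pairs across the network to the reduce tasks.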
5. Hadoop – Related Projects
● Pig – for analysing large data sets
● Hive – data warehouse system for Hadoop
● Mahout – machine learning and data mining
● Avro – a data serialization system
● ZooKeeper – helps build distributed applications
● Chukwa – data collection and analysis
6. Hadoop – Related Projects
● Hue – Hadoop user interface
● Oozie – workflow scheduler
● Hama – bulk synchronous parallel framework
– For massive scientific computations
● Nutch – web crawler
● HBase – non-relational database
7. Hadoop – Large Users
● Yahoo
– 10,000 core Linux cluster
● Facebook
– 100 petabytes, growing at 0.5 petabytes a day
● Amazon
– It's possible to run Hadoop on Amazon's EC2 and S3
8. Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– info@semtech-solutions.co.nz
● We offer IT project consultancy
● We are happy to hear about your problems
● You pay only for the hours you need to solve your problems