Next generation technologies
(The best way to jump into a parade is to jump in front of one that is already going)
We are going to talk about the framework that backs the technological infrastructure of the biggest
players in the internet world, some of which are shown in the following image:
These are just some of the biggest names; there are many more on this list.
Here we are talking about next-generation computing technology, which offers scalability, fault
tolerance, and many more features. The term “cloud” will not be new to you, but here I am going to
talk about a technology that serves as a backbone of cloud and distributed computing. Now you may be
thinking: what is that technology? The technology we are going to discuss is called “Hadoop”. The
best thing about it is that it is open source and readily available, so you can contribute to it,
experiment with it, and use it.
As the Apache website says: “The Apache™ Hadoop™ project develops open-source software for reliable,
scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of
large data sets across clusters of computers using a simple programming model. It is designed to
scale up from single servers to thousands of machines, each offering local computation and storage.
Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and
handle failures at the application layer, so delivering a highly-available service on top of a cluster of
computers, each of which may be prone to failures.”
Let’s start with some of its best features:
• High scalability.
• High availability.
• High performance.
• Handles multi-dimensional data storage.
• Handles distributed storage.
Let’s first look at some scenarios in the internet world:
How much data do you think an internet player needs to process? Do you know how much data Twitter
processes daily? It is about 7 TB per day. How long would it take a general-purpose computer to get
through that much data? About 4 hours, and that is just for reading it, not processing it. Now think
about actually processing all of Twitter's data: wouldn't it take years?
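The read-time figure above can be checked with simple arithmetic. The sketch below assumes a single machine reading sequentially at about 500 MB/s; that rate is an assumption chosen for illustration, not a measured number, and the function name is hypothetical.

```python
def read_time_hours(data_bytes: float, bytes_per_second: float) -> float:
    """Hours needed to read data_bytes once at a constant rate."""
    return data_bytes / bytes_per_second / 3600


TB = 10**12  # decimal terabyte
MB = 10**6   # decimal megabyte

# Assumed inputs: 7 TB of daily data, one machine reading at ~500 MB/s.
hours = read_time_hours(7 * TB, 500 * MB)
print(f"{hours:.1f} hours")  # prints "3.9 hours"
```

Under that assumed rate, one pass over the data takes close to the 4 hours mentioned above, before any real processing happens.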
This is where Hadoop comes into play: it has sorted a petabyte of data in 16.25 hours and a terabyte
in 62 seconds. Isn't that a good choice? It surely is. Likewise, think about the amount of data that
Facebook, Google, and Amazon need to process daily.
The best thing about a Hadoop setup is that you don’t need special, costly, high-end servers; you
can build a Hadoop cluster out of commodity computers. Keep adding computers to keep increasing
storage and processing power.
So, ultimately, here are some points for “Why Hadoop?”
• Need to process multi-petabyte datasets
• Expensive to build reliability into each application
• Nodes fail every day
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not constant.
• Need common infrastructure
– Efficient, reliable, Open Source Apache License
• The above goals are the same as Condor's, but
– Workloads are I/O bound rather than CPU bound
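One common way to tolerate the node failures listed above is to keep several copies of each piece of data on different machines. The toy sketch below illustrates that idea only: the round-robin placement, the node names, and the replication factor of 3 are assumptions made for the example, and it does not reproduce Hadoop's actual placement policy.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        # Start each block on a different node so the load spreads out.
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def readable_blocks(placement, failed):
    """Blocks that still have at least one replica on a live node."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    ]


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
placement = place_replicas(num_blocks=6, nodes=nodes, replication=3)

# Two of the four nodes fail, yet every block still has a live copy.
alive = readable_blocks(placement, failed={"node-a", "node-b"})
print(len(alive), "of", len(placement), "blocks readable")
```

With three copies spread over four nodes, losing any two nodes still leaves at least one copy of every block, which is why a failure can be treated as routine rather than exceptional.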
Hadoop is basically built on the following components:
1. Hadoop Common (Base)
Hadoop Common is a set of utilities that support the Hadoop subprojects. Hadoop
Common includes FileSystem, RPC, and serialization libraries.
2. HDFS (Hadoop Distributed File System) (File System)
Hadoop Distributed File System (HDFS™) is the primary storage system used by
Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on
compute nodes throughout a cluster to enable reliable, extremely rapid computations.
3. Map-Reduce (Code)
Hadoop Map-Reduce is a programming model and software framework for writing
applications that rapidly process vast amounts of data in parallel on large clusters of
compute nodes.
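To make the Map-Reduce programming model concrete, here is a minimal word-count sketch in plain Python. It runs on one machine and does not use the Hadoop API; the function names are hypothetical and only illustrate the map, shuffle, and reduce steps that a framework like Hadoop would distribute across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


lines = ["hadoop stores data", "hadoop processes data in parallel"]

# Each line could be mapped on a different node; here it happens in a loop.
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["hadoop"], counts["data"], counts["parallel"])  # prints "2 2 1"
```

Because each map call looks at only one line and each reduce call at only one key, both steps can be spread over many machines without the calls needing to talk to each other.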
So what is it used for?
1. Internet-scale data:
a. Web logs: years of logs, terabytes per day.
b. Web search: all the web pages present on this earth.
c. Social data: all the data, messages, images, tweets, scraps, and wall posts generated on
Facebook, Twitter, and other social media.
2. Cutting edge analytics:
a. Machine learning, data mining.
3. Enterprise applications:
a. Network instrumentation, mobile logs.
b. Video and audio processing.
c. Text mining.
4. And lots more.
Let's see the timeline:
References:
http://hadoop.apache.org, http://developer.yahoo.com/hadoop/
These are the best places to find information about Hadoop. On these websites you'll find many wiki
pages and links to ongoing work, from which you can learn a lot about Hadoop: how to get started,
along with answers to the usual what, where, and how-to questions.
Just visit these sites, explore them, and experiment with the next-generation technology that is
going to be the backbone of the Internet.
In the coming articles, we'll talk about some other technologies related to Hadoop, like HBase,
Hive, Avro, Cassandra, Chukwa, Mahout, Pig, and ZooKeeper.
∞
Shashwat Shriparv