2. 90% OF THE WORLD’S DATA HAS BEEN GENERATED IN THE LAST
THREE YEARS ALONE, AND IT IS GROWING
AT AN EVEN MORE RAPID RATE.
BIG DATA
The world has seen exponential data growth, driven by social media,
mobility, e-commerce and other factors.
• Volume
• Variety
• Velocity
3. “Big Data is like teenage sex;
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is doing it,
so everyone claims they are doing it”
Dan Ariely, Duke University
7. The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using
simple programming models. It is designed to scale up from single servers to
thousands of machines, each offering local computation and storage. Rather
than rely on hardware to deliver high-availability, the library itself is designed
to detect and handle failures at the application layer, so delivering a highly-available
service on top of a cluster of computers, each of which may be prone
to failures.
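As a hedged illustration of those “simple programming models”, the sketch below is the classic word-count job written against the Hadoop MapReduce Java API; the class names and the command-line input/output paths are illustrative, not something taken from these slides.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in a line of input.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: wires the mapper and reducer into a job and submits it to the cluster.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}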
11. Configuring Hadoop (typical settings are sketched below):
a. hadoop-env.sh
b. core-site.xml
c. mapred-site.xml
d. hdfs-site.xml
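hadoop-env.sh sets environment variables such as JAVA_HOME, while the three *-site.xml files hold key/value properties. As a rough sketch (the property names are from the Hadoop 1.x generation these slides describe; the values are placeholders for a single-node setup), the same properties can be read or overridden through the Configuration API:

import org.apache.hadoop.conf.Configuration;

public class ShowHadoopConfig {
  public static void main(String[] args) {
    // Configuration loads core-site.xml (and, for the relevant daemons,
    // hdfs-site.xml / mapred-site.xml) from the classpath.
    Configuration conf = new Configuration();

    // core-site.xml: the default file system URI (placeholder single-node value).
    conf.set("fs.default.name", "hdfs://localhost:9000");
    // mapred-site.xml: the JobTracker address (placeholder single-node value).
    conf.set("mapred.job.tracker", "localhost:9001");
    // hdfs-site.xml: the HDFS block replication factor.
    conf.set("dfs.replication", "1");

    System.out.println("fs.default.name = " + conf.get("fs.default.name"));
    System.out.println("dfs.replication = " + conf.get("dfs.replication"));
  }
}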
12. Hadoop comes with several web interfaces which are by
default available at these locations:
• http://localhost:50070/ – web UI of the NameNode daemon
• http://localhost:50030/ – web UI of the JobTracker daemon
• http://localhost:50060/ – web UI of the TaskTracker daemon
Hadoop Web Interfaces
14. • Scalable – New nodes can be added as needed, without needing to
change data formats, how data is loaded, how jobs are written, or the
applications on top.
• Economical – Hadoop brings massively parallel computing to
commodity servers. The result is a sizeable decrease in the cost per
terabyte of storage, which in turn makes it affordable to model all
your data.
15. • Flexible – Hadoop is schema-less, and can absorb any type of data,
structured or not, from any number of sources. Data from multiple
sources can be joined and aggregated in arbitrary ways enabling
deeper analyses than any one system can provide.
• Reliable – When you lose a node, the system redirects work to
another copy of the data and continues processing without missing
a beat.
18. • HDFS is designed to store a very large amount of information
(terabytes or petabytes). This requires spreading the data across a
large number of machines.
• HDFS stores data reliably. If individual machines in the cluster fail,
data remains available thanks to data redundancy.
Hadoop Distributed File System (HDFS):
19. • HDFS provides fast, scalable access to the information loaded on the
clusters. It is possible to serve a larger number of clients by simply
adding more machines to the cluster.
• HDFS integrates well with Hadoop MapReduce, allowing data to be
read and computed upon locally whenever needed.
• HDFS was originally built as infrastructure for the Apache Nutch
web search engine project.
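As a small, non-authoritative sketch of what client access to HDFS looks like, the code below writes and re-reads a file through the Hadoop FileSystem Java API; the path is hypothetical and the cluster address is picked up from the configuration files described earlier.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);            // the HDFS client

    Path file = new Path("/user/demo/hello.txt");    // hypothetical HDFS path

    // Write a small file; HDFS replicates its blocks across DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello from HDFS");
    }

    // Read it back; the client is served from one of the block replicas.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }

    fs.close();
  }
}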
20. Hadoop does not require expensive, highly reliable hardware. It is
designed to run on clusters of commodity hardware; an HDFS instance
may consist of hundreds or thousands of server machines, each storing
part of the file system’s data. The fact that there are a huge number of
components and that each component has a non-trivial probability of
failure means that some component of HDFS is always non-functional.
Therefore, detection of faults and quick, automatic recovery from them
is a core architectural goal of HDFS.
Commodity Hardware Failure:
21. Applications that run on HDFS need continuous access to their data
sets. HDFS is designed more for batch processing rather than interactive
use by users. The emphasis is on high throughput of data access rather
than low latency of data access.
Continuous Data Access:
22. Applications that run on HDFS have large data sets. A typical file in
HDFS is gigabytes to terabytes in size. So, HDFS is tuned to support
large files.
It is also worth examining the applications for which using HDFS does
not work so well. While this may change in the future, these are areas
where HDFS is not a good fit today:
Very Large Data Files:
23. • Low-latency data access
• Lots of small files
• Multiple writers, arbitrary file modifications
24. • Pig is an open-source high-level dataflow
system.
• It provides a simple language for queries and data manipulation,
Pig Latin, which is compiled into MapReduce jobs that run on Hadoop.
• Why is it important?
- Companies like Yahoo, Google and Microsoft are collecting vast
data sets in the form of click streams, search logs, and web crawls.
- Some form of ad-hoc processing and analysis
of all of this information is required.
What is Pig
25. • An ad-hoc way of creating and executing MapReduce jobs on very
large data sets
• Rapid Development
• No Java is required
• Developed by Yahoo!
Why was Pig created?
27. • Pig is a data flow language. It sits on top of Hadoop and makes it
possible to create complex jobs to process large volumes of data
quickly and efficiently.
• It will consume any data that you feed it: structured, semi-structured,
or unstructured.
• Pig provides the common data operations (filters, joins, ordering) and
nested data types (tuples, bags, and maps) that are missing from
MapReduce.
• Pig scripts are easier and faster to write than standard Java Hadoop
jobs, and Pig has a lot of clever optimizations, such as multi-query
execution, which can make complex queries run faster (a short sketch
follows this slide).
Where should I use Pig
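A hedged sketch of the kind of data flow described above: a few Pig Latin statements (load, filter, group, count) embedded through the PigServer Java API, which compiles them into MapReduce jobs. The input file name and its field layout are made up for the example.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFilterAndGroup {
  public static void main(String[] args) throws Exception {
    // Run Pig in local mode for the sketch; use ExecType.MAPREDUCE on a cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);

    // A tiny Pig Latin data flow: load, filter, group, aggregate.
    // "clicks.log" and its field layout are hypothetical.
    pig.registerQuery("clicks = LOAD 'clicks.log' AS (user:chararray, url:chararray, time:long);");
    pig.registerQuery("recent = FILTER clicks BY time > 1000000L;");
    pig.registerQuery("byUser = GROUP recent BY user;");
    pig.registerQuery("counts = FOREACH byUser GENERATE group AS user, COUNT(recent) AS hits;");

    // Compiles the data flow into MapReduce jobs and writes the result.
    pig.store("counts", "click_counts");

    pig.shutdown();
  }
}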
28. • Hive is a data warehouse infrastructure built
on top of Hadoop.
• It facilitates querying large datasets residing
in distributed storage.
• It provides a mechanism to project structure
onto the data and query the data using a
SQL-like query language called “HiveQL”.
What is Hive
29. • Hive was developed by Facebook and was open sourced in 2008.
• Data stored in Hadoop is inaccessible to business users.
• High-level languages like Pig, Cascading, etc. are geared towards
developers.
• SQL is a common language that is known to many. Hive was
developed to give access to data stored in Hadoop, translating
SQL-like queries into MapReduce jobs (see the sketch below).
Why Hive was developed
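As a non-authoritative sketch of that idea, the example below submits a HiveQL query over JDBC; it assumes a HiveServer2 endpoint at localhost:10000 and a hypothetical page_views table. Hive turns the query into MapReduce jobs behind the scenes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; the endpoint below is an assumed local default.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection con = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "", "");
         Statement stmt = con.createStatement()) {

      // A HiveQL query over a hypothetical table; Hive compiles it into
      // MapReduce jobs before returning the rows.
      ResultSet rs = stmt.executeQuery(
          "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url");
      while (rs.next()) {
        System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
      }
    }
  }
}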