3. Outline
Introduction to Big Data
Traditional Data Processing
Introduction to Hadoop
Hadoop Architecture
Hadoop and RDBMS
Hadoop Distributions
4. What is Big Data?
Collection of large datasets that cannot be processed using traditional
computing techniques.
Big data is not a single technique or tool but a complete subject, involving
various tools, techniques and frameworks.
It’s not the amount of data that’s important. It’s what organisations do with
the data that matters.
Big data can be analysed for insights that lead to better decisions and
strategic business moves.
5. Traditional Data Processing
For storage, programmers rely on database vendors of their choice, such as
Oracle, IBM and others.
In this approach, an enterprise has a single computer to store and process the data.
The user interacts with the application, which in turn handles data storage
and analysis.
6. Characteristics - Vs of Big Data
• Volume: Terabytes of data, transactions, files
• Velocity: Periodic, near-time, real-time
• Variety: Structured, unstructured, semi-structured
• Veracity: Quality of the data
7. Characteristics - Vs of Big Data
Volume - Size of data
• Researchers have predicted that 40 Zettabytes (40,000 Exabytes) will be
generated by 2020, which is an increase of 300 times from 2005.
Velocity - Speed at which data is generated
• Facebook has 1.03 billion Daily Active Users on mobile, an increase of 22%
year-over-year.
8. Characteristics - Vs of Big Data
Variety
• Data can be structured, semi-structured or unstructured, in the form of
images, audio, video and sensor data.
• Variety creates problems in capturing, storing, mining and analyzing the
data.
Veracity – data uncertainty, data inconsistency and incompleteness
• Due to uncertainty of data, 1 in 3 business leaders don’t trust the information they
use to make decisions.
9. Characteristics - Vs of Big Data
Veracity
• Poor data quality costs the US economy around $3.1 trillion a year.
Value - the benefit the data adds to the organization
• Is the organization working on Big Data achieving a high ROI?
11. What Comes Under Big Data?
• Black Box Data: Voice recordings of the flight crew and performance information of the aircraft.
• Social Media Data: Posts and views by millions of people across the globe.
• Stock Exchange Data: Information about the ‘buy’ and ‘sell’ decisions made on the
stock exchange.
• Power Grid Data: Electricity consumed by a particular node with respect to a base
station.
• Transport Data: Shipping / freight data.
12. Examples of Big Data
• Walmart handles more than 1 million customer transactions every hour.
• Facebook stores, accesses, and analyzes 30+ Petabytes of user generated data.
• 230+ million tweets are created every day.
• YouTube users upload 48 hours of new video every minute of the day.
• Amazon handles 15 million customer clickstream records per day to recommend
products.
• 294 billion emails are sent every day. Email services analyze this data to
filter spam.
• Modern cars have close to 100 sensors.
13. Applications of Big Data
• Smarter Healthcare (EHR): Predict the patient’s deteriorating condition in advance.
• Telecom: Reduce data packet loss. Offer personalized plans.
• Retail: Recommendation engines - Suggestion based on the browsing history of the
consumer.
• Traffic control: Managing traffic better via effective use of data and sensors.
• Manufacturing: Reduce component defects, improve product quality, increase
efficiency, and save time and money.
• Search Quality: Personalised search results based on previous searches.
14. AlphaGo is a narrow AI developed by Alphabet Inc. to play the board game Go.
AlphaGo's algorithm uses tree search to choose its moves, guided by previously
learned knowledge.
It gains that knowledge through machine learning, specifically an artificial
neural network trained extensively on both human and computer play.
In March 2016, it defeated the professional Go player Lee Sedol in a five-game
match, the first time a computer program had beaten a 9-dan professional.
15. Google Photos
Google applies different forms of machine learning in the Photos service,
particularly for recognizing photo contents.
People: The Photos app collects all the photos containing faces. It doesn’t
identify these people, but just groups them for quick access.
Places: This feature relies on landmarks. It can correctly identify well-known
places like the Taj Mahal.
Things: This feature aggregates pictures of things like flowers, cars, sky,
birthdays and cats. There are many more categories, including screenshots,
posters and castles.
Not everything in the garden is rosy!
16. Challenges with Big Data
• Data Quality: Messy, inconsistent and incomplete data. Dirty data costs US
companies an estimated $600 billion each year.
• Discovery: Analyzing petabytes of data with powerful algorithms to find
patterns and insights is very difficult.
• Storage: The more data an organization has, the more complex the problem of
managing it.
• Lack of Talent: A shortage of developers, data scientists and analysts who
also have sufficient domain knowledge.
17. Summary of Big Data
Big Data is a term used to describe a collection of data that is huge in size
and yet growing exponentially with time.
Examples of Big Data generation include stock exchanges, social media sites,
jet engines, etc.
Big Data can be 1) structured, 2) unstructured, or 3) semi-structured.
Volume, Variety, Velocity and Veracity are a few characteristics of Big Data.
18. Introduction to Hadoop
Hadoop is an open source framework from Apache.
It is used to store, process and analyze data that is very huge in volume.
Hadoop is written in Java and is not an OLAP (online analytical processing)
system.
It is used for batch/offline processing.
20. Modules of Hadoop
Hadoop Distributed File System (HDFS): Files are broken into blocks and
stored on nodes across the distributed architecture.
Yet Another Resource Negotiator (YARN): Used for job scheduling and managing
the cluster.
MapReduce: A framework that helps programs perform parallel computation on
data using key-value pairs.
21. Modules of Hadoop
MapReduce: The Map task takes input data and converts it into a data set of
key-value pairs. The output of the Map task is consumed by the Reduce task,
and the output of the reducer gives the desired result; see the word-count
sketch below.
Hadoop Common: These Java libraries are used to start Hadoop and are used by
the other Hadoop modules.
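To make the Map and Reduce steps concrete, here is a minimal word-count
sketch against the standard org.apache.hadoop.mapreduce Java API. The class
names and command-line input/output paths are illustrative, not part of the
slides.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: turn each input line into (word, 1) key-value pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce task: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Such a class would be packaged into a jar and launched with hadoop jar,
passing the HDFS input and output directories as arguments.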
26. Hadoop Architecture
NameNode
• Stores metadata for the files, like the directory structure of a typical FS.
• The server holding the NameNode instance is quite crucial, as there is only one.
• Keeps a transaction log for file deletes/adds, etc. Transactions cover only
metadata, not whole blocks or file streams.
• Handles creation of more replica blocks when necessary after a DataNode
failure.
27. Hadoop Architecture
DataNode
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc)
• Notifies NameNode of what blocks it has
• NameNode replicates blocks 2x in local rack, 1x elsewhere
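As a small sketch of how a client touches this architecture, the following
writes and reads a file through the org.apache.hadoop.fs.FileSystem API: the
NameNode serves only the metadata, while the bytes stream to and from
DataNodes. The hdfs://namenode:9000 address and /user/demo path are assumed
for illustration.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address

    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/user/demo/hello.txt"); // illustrative path

      // Writing: the NameNode records the metadata; the data blocks
      // themselves are streamed to DataNodes and replicated.
      try (FSDataOutputStream out = fs.create(file, true)) {
        out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
      }

      // Reading: the client asks the NameNode for block locations,
      // then pulls the bytes directly from a DataNode.
      try (FSDataInputStream in = fs.open(file)) {
        byte[] buf = new byte[32];
        int n = in.read(buf);
        System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
      }
    }
  }
}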
28. Hadoop Architecture
Job Tracker
The Job Tracker accepts MapReduce jobs from the client and processes the data
with the help of the NameNode. In response, the NameNode provides metadata to
the Job Tracker.
Task Tracker
It works as a slave node to the Job Tracker. It receives tasks and code from
the Job Tracker and applies that code to the data. This process can also be
called a Mapper. A job-submission sketch follows below.
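A minimal sketch of handing a job to the Job Tracker through the classic
(MRv1) org.apache.hadoop.mapred API. The jobtracker:8021 host:port is an
assumption, and the identity mapper/reducer simply copy the input through the
cluster.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitToJobTracker {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitToJobTracker.class);
    conf.setJobName("identity-copy");
    conf.set("mapred.job.tracker", "jobtracker:8021"); // assumed Job Tracker host:port
    conf.setOutputKeyClass(LongWritable.class);  // TextInputFormat's key type
    conf.setOutputValueClass(Text.class);        // TextInputFormat's value type
    conf.setMapperClass(IdentityMapper.class);   // the code Task Trackers will run
    conf.setReducerClass(IdentityReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    // JobClient hands the job to the Job Tracker, which schedules map and
    // reduce tasks onto Task Trackers close to the data blocks.
    JobClient.runJob(conf);
  }
}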
29. Hadoop vs RDBMS
Until recently many applications utilized RDBMS for batch processing – Oracle,
Sybase, MySQL, Microsoft SQL Server, etc.
Hadoop doesn’t fully replace relational products; many architectures would
benefit from both Hadoop and relational products.
30. Hadoop vs RDBMS
RDBMS products scale up
• Expensive to scale for larger installations
• Hits a ceiling when storage reaches 100s of terabytes
Hadoop clusters can scale out to 100s of machines and petabytes of storage.
31. Comparison to RDBMS
Hadoop was not designed for real-time or low-latency queries; products that
do provide low-latency queries, such as HBase, have limited query
functionality.
Hadoop performs best for offline batch processing on large amounts of data;
an RDBMS is best for online transactions and low-latency queries.
Hadoop is designed to stream large files and large amounts of data; an RDBMS
works best with small records.
32. Hadoop Distributions
Let’s say you go download Hadoop’s HDFS and MapReduce.
At first it works great, but then you decide to start using HBase.
No problem, just download HBase and point it to your existing HDFS.
But you find that HBase can only work with a previous version of HDFS.
You go downgrade HDFS and everything still works great.
33. Hadoop Distributions
Hadoop distributions aim to resolve version incompatibilities.
A distribution vendor will do the following:
1. Integration-test a set of Hadoop products
2. Provide additional scripts to execute Hadoop
3. Package Hadoop products in various installation formats (Linux packages,
tarballs)
35. Distribution Vendors
Cloudera Hadoop Distribution
MapR Hadoop Distribution
Amazon Web Services Elastic MapReduce Hadoop Distribution
Hortonworks Hadoop Distribution
IBM Infosphere BigInsights Hadoop Distribution
Microsoft Azure's HDInsight Cloud based Hadoop Distribution
36. Cloudera Distribution for Hadoop
The most popular distribution
Cloudera has taken the lead in providing a Hadoop distribution
CDH is provided in various formats:
Linux packages, virtual machine images, and tarballs
Integrates HDFS, MapReduce, HBase, Hive, Mahout, Oozie, Pig, Sqoop, Whirr,
ZooKeeper, Flume