With the surge in Big Data, organizations have begun to implement Big Data technologies as part of their systems. This has led to a huge need to update existing skill sets with Hadoop, and Java professionals in particular need to add Hadoop skills to their toolkit.
Hadoop for Java Professionals
1. Slide 1
Hadoop for
Java Professionals
View Hadoop Courses at : www.edureka.in/hadoop
Twitter @edurekaIN, Facebook /edurekaIN, use #AskEdureka for Questions
2. Slide 2
Objectives of this Session
• Big Data and Hadoop
• Why Hadoop?
• Job Trends: Hadoop and Java
• Hadoop ecosystem
• MapReduce Programming and Java
• User Defined Functions (UDF) in Pig and Hive
• HBase and Java
For Queries during the session and class recording:
Post on Twitter @edurekaIN: #askEdureka
Post on Facebook /edurekaIN
3. Slide 3
Big Data
Lots of Data (Terabytes or Petabytes)
Big data is the term for a collection of data sets
so large and complex that it becomes difficult to
process using on-hand database management
tools or traditional data processing applications.
The challenges include capture, curation,
storage, search, sharing, transfer, analysis, and
visualization.
[Word-cloud graphic: cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, processing, mobile, Big Data]
4. Slide 4
Unstructured Data is Exploding
2,500 exabytes of new information in 2012 with internet as primary driver
“Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 zettabytes” this year
5. Slide 5
Big Data - Challenges
Increasing data volumes and new data sources and types:
Email and documents
Social media, web logs
Machine/device data (scientific)
Transactions, OLTP, OLAP
8. Slide 8
Jobs in Hadoop
Big Data has opened the door to new job opportunities, to name a few:
Hadoop Developer
Hadoop Architect
Hadoop Engineer
Hadoop Application Developer
Data Analyst
Data Scientist
Business Intelligence (BI) Architect
Big Data Engineer
9. Slide 9
Hadoop for Java Professionals
Hadoop is red-hot because it:
allows distributed processing of large data sets across clusters of computers using a simple programming model.
has become the de facto standard for storing, processing, and analyzing hundreds of terabytes and even petabytes of data.
is cheaper than traditional proprietary technologies such as Oracle and IBM, and can run on low-cost commodity hardware.
can handle all types of data from disparate systems, such as server logs, emails, sensor data, pictures, and videos.
10. Slide 10
Hadoop for Java Professionals (Contd.)
Hadoop is a natural career progression for Java professionals.
It is a Java-based framework, written entirely in Java.
The combination of Hadoop and Java skills is the number one combination in demand among all Hadoop jobs.
Java skills come in handy when writing code for the following in Hadoop:
MapReduce programming using Java
User Defined Functions (UDFs) in Pig and Hive scripts of Hadoop applications
Client applications in HBase
11. Slide 11
Hadoop for Big Data
Apache Hadoop is a framework that allows for the distributed processing of large data sets across
clusters of commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
12. Slide 12
Hadoop and MapReduce
Hadoop is a system for large scale data processing.
It has two main components:
HDFS – Hadoop Distributed File System (Storage)
highly fault-tolerant
high-throughput access to application data
suitable for applications that have large data sets
natively redundant
MapReduce (Processing)
software framework for easily writing applications which process
vast amounts of data (multi-terabyte data-sets) in-parallel on
large clusters (thousands of nodes) in a reliable, fault-tolerant
manner
Splits a task across processors
13. Slide 13
Important Hadoop Eco-System components:
HDFS (Hadoop Distributed File System) – Storage
MapReduce Framework – Processing
Pig Latin – Data Analysis
Hive – DW System
HBase
14. Slide 14
What is Map - Reduce?
Map-Reduce is a programming model
It is neither platform- nor language-specific
Record-oriented data processing (keys and values)
Tasks are distributed across multiple nodes
Where possible, each node processes data stored on that node
Consists of two phases
Map
Reduce
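The two phases above can be illustrated with a word count in plain Java. This is a conceptual simulation, not the actual Hadoop API: the map step emits (word, 1) pairs, and the reduce step sums the values per key.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordCountSketch {
    public static Map<String, Integer> mapReduce(List<String> lines) {
        // Map phase: split every line into words, conceptually emitting (word, 1) pairs
        Stream<String> words = lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")));
        // Shuffle + Reduce phase: group the pairs by key and sum the values
        return words.collect(
                Collectors.toMap(w -> w, w -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("big data", "big clusters");
        System.out.println(mapReduce(input)); // {big=2, clusters=1, data=1}
    }
}
```

In real Hadoop MapReduce the same logic is split across a Mapper and a Reducer class, and the framework performs the grouping step across the cluster.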
15. Slide 15
What is Map - Reduce? (Contd.)
The process can be considered similar to a Unix pipeline:
cat /my/log | grep '.html' | sort | uniq -c > /my/outfile
MAP SORT REDUCE
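The same pipeline can be mirrored in plain Java streams (a sketch for illustration, not Hadoop code): the filter plays the role of grep (Map), and grouping into a sorted map with a count plays the role of sort plus uniq -c (Sort/Reduce).

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PipelineSketch {
    // Mirrors: cat /my/log | grep '.html' | sort | uniq -c
    public static SortedMap<String, Long> grepSortUniqC(List<String> logLines) {
        return logLines.stream()
                .filter(line -> line.contains(".html"))       // grep '.html'  (Map)
                .collect(Collectors.groupingBy(               // sort + uniq -c (Sort/Reduce)
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "/index.html", "/a.css", "/index.html", "/about.html");
        System.out.println(grepSortUniqC(log)); // {/about.html=1, /index.html=2}
    }
}
```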
18. Slide 18
Problem - Data Processing
Huge raw XML files with unstructured data, like reviews, are loaded into HDFS and processed with MapReduce.
Output: for each review category, a row of (category, hash, url, +tive, -tive, total).
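The tallying step can be sketched in plain Java, assuming a hypothetical pre-parsed record format of category, url, and sentiment separated by tabs (in the real job, a MapReduce mapper would first extract these fields from the raw XML):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReviewTally {
    // Hypothetical record format: "category<TAB>url<TAB>sentiment",
    // where sentiment is "positive" or "negative".
    public static Map<String, int[]> tally(List<String> records) {
        Map<String, int[]> byCategory = new TreeMap<>();
        for (String record : records) {
            String[] fields = record.split("\t");
            // counts = {+tive, -tive, total}
            int[] counts = byCategory.computeIfAbsent(fields[0], k -> new int[3]);
            if ("positive".equals(fields[2])) counts[0]++;
            else counts[1]++;
            counts[2]++;
        }
        return byCategory;
    }

    public static void main(String[] args) {
        Map<String, int[]> t = tally(Arrays.asList(
                "books\tu1\tpositive", "books\tu2\tnegative", "music\tu3\tpositive"));
        System.out.println(Arrays.toString(t.get("books"))); // [1, 1, 2]
    }
}
```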
20. Slide 20
User Defined Functions (UDFs) in Pig
Pig is a high-level data flow language.
It sits on top of Hadoop and makes it possible to create complex jobs to process large volumes of data quickly and efficiently.
It is similar to a SQL query in that the user specifies the “what” and leaves the “how” to the underlying processing engine.
21. Slide 21
Pig Latin – Creating UDF
A program to create a filter UDF:

import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

public class IsOfAge extends FilterFunc {
    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() == 0) {
            return false;
        }
        try {
            Object object = tuple.get(0);
            if (object == null) {
                return false;
            }
            int i = (Integer) object;
            return i == 18 || i == 19 || i == 21 || i == 23 || i == 27;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}

The compiled UDF is then registered in a Pig script with REGISTER and applied in a FILTER ... BY clause.
23. Slide 23
Questions?
Buy Complete Course at : www.edureka.in/hadoop
Interested in learning “Big-Data & Hadoop”?
Let us know by mailing us at sales@edureka.in