With the surge in Big Data, organizations have begun to implement Big Data technologies as part of their systems. This has led to a huge need to update existing skill sets with Hadoop, and Java professionals in particular need to add Hadoop skills to their toolkit.
Hadoop for Java Professionals
1. Slide 1
Hadoop for
Java Professionals
View Hadoop Courses at : www.edureka.in/hadoop
Twitter @edurekaIN, Facebook /edurekaIN, use #AskEdureka for Questions
2. Slide 2
Objectives of this Session
• Big Data and Hadoop
• Why Hadoop?
• Job Trends: Hadoop and Java
• Hadoop ecosystem
• MapReduce Programming and Java
• User Defined Functions (UDF) in Pig and Hive
• HBase and Java
For Queries during the session and class recording:
Post on Twitter @edurekaIN: #askEdureka
Post on Facebook /edurekaIN
3. Slide 3
Big Data
Lots of Data (Terabytes or Petabytes)
Big data is the term for a collection of data sets
so large and complex that it becomes difficult to
process using on-hand database management
tools or traditional data processing applications.
The challenges include capture, curation,
storage, search, sharing, transfer, analysis, and
visualization.
[Word-cloud graphic: cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, processing, mobile, Big Data]
4. Slide 4
Unstructured Data is Exploding
2,500 exabytes of new information in 2012 with internet as primary driver
“Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 zettabytes” this year
5. Slide 5
Big Data - Challenges
Increasing data volumes and new data sources and types:
Email and documents
Social media, web logs
Machine/device data (scientific)
Transactions, OLTP, OLAP
8. Slide 8
Jobs in Hadoop
Big Data has opened the door to new job opportunities, to name a few:
Hadoop Developer
Hadoop Architect
Hadoop Engineer
Hadoop Application Developer
Data Analyst
Data Scientist
Business Intelligence (BI) Architect
Big Data Engineer
9. Slide 9
Hadoop for Java Professionals
Hadoop is red-hot because it:
allows distributed processing of large data sets across clusters of computers using a simple programming model.
has become the de facto standard for storing, processing, and analyzing hundreds of terabytes and even petabytes of data.
is cheaper than traditional proprietary technologies such as Oracle and IBM, and can run on low-cost commodity hardware.
can handle all types of data from disparate systems, such as server logs, emails, sensor data, pictures, and videos.
10. Slide 10
Hadoop for Java Professionals (Contd.)
Hadoop is a natural career progression for Java professionals.
It is a Java-based framework, written entirely in Java.
The combination of Hadoop and Java skills is the number one combination in demand among all Hadoop jobs.
Java skills come in handy when writing code for the following in Hadoop:
MapReduce programming using Java
User Defined Functions (UDFs) in Pig and Hive scripts of Hadoop applications
Client applications in HBase
11. Slide 11
Hadoop for Big Data
Apache Hadoop is a framework that allows for the distributed processing of large data sets across
clusters of commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
12. Slide 12
Hadoop and MapReduce
Hadoop is a system for large scale data processing.
It has two main components:
HDFS – Hadoop Distributed File System (Storage)
highly fault-tolerant
high-throughput access to application data
suitable for applications that have large data sets
natively redundant
MapReduce (Processing)
software framework for easily writing applications which process
vast amounts of data (multi-terabyte data-sets) in-parallel on
large clusters (thousands of nodes) in a reliable, fault-tolerant
manner
Splits a task across processors
13. Slide 13
Important Hadoop Eco-System components:
HDFS (Hadoop Distributed File System) – Storage
MapReduce Framework – Processing
Pig Latin – Data Analysis
Hive – DW System
HBase
14. Slide 14
What is Map - Reduce?
Map-Reduce is a programming model
It is neither platform- nor language-specific
Record-oriented data processing (keys and values)
Tasks are distributed across multiple nodes
Where possible, each node processes data stored on that node
Consists of two phases
Map
Reduce
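The two phases above can be illustrated with a word count in plain Java. This is a conceptual simulation, not the actual Hadoop API: the map step emits (word, 1) pairs, and the reduce step sums the values per key.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordCountSketch {
    public static Map<String, Integer> mapReduce(List<String> lines) {
        // Map phase: split every line into words, conceptually emitting (word, 1) pairs
        Stream<String> words = lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")));
        // Shuffle + Reduce phase: group the pairs by key and sum the values
        return words.collect(
                Collectors.toMap(w -> w, w -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("big data", "big clusters");
        System.out.println(mapReduce(input)); // {big=2, clusters=1, data=1}
    }
}
```

In real Hadoop MapReduce the same logic is split across a Mapper and a Reducer class, and the framework performs the grouping step across the cluster.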
15. Slide 15
What is Map - Reduce? (Contd.)
The process can be considered similar to a Unix pipeline:
cat /my/log | grep '.html' | sort | uniq -c > /my/outfile
MAP SORT REDUCE
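The same pipeline can be mirrored in plain Java streams (a sketch for illustration, not Hadoop code): the filter plays the role of grep (Map), and grouping into a sorted map with a count plays the role of sort plus uniq -c (Sort/Reduce).

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PipelineSketch {
    // Mirrors: cat /my/log | grep '.html' | sort | uniq -c
    public static SortedMap<String, Long> grepSortUniqC(List<String> logLines) {
        return logLines.stream()
                .filter(line -> line.contains(".html"))       // grep '.html'  (Map)
                .collect(Collectors.groupingBy(               // sort + uniq -c (Sort/Reduce)
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "/index.html", "/a.css", "/index.html", "/about.html");
        System.out.println(grepSortUniqC(log)); // {/about.html=1, /index.html=2}
    }
}
```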
18. Slide 18
Problem - Data Processing
Huge raw XML files with unstructured data, like reviews, are loaded into HDFS and processed with MapReduce.
Output: for each review category, a row of (category, hash, url, +tive, -tive, total).
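The tallying step can be sketched in plain Java, assuming a hypothetical pre-parsed record format of category, url, and sentiment separated by tabs (in the real job, a MapReduce mapper would first extract these fields from the raw XML):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReviewTally {
    // Hypothetical record format: "category<TAB>url<TAB>sentiment",
    // where sentiment is "positive" or "negative".
    public static Map<String, int[]> tally(List<String> records) {
        Map<String, int[]> byCategory = new TreeMap<>();
        for (String record : records) {
            String[] fields = record.split("\t");
            // counts = {+tive, -tive, total}
            int[] counts = byCategory.computeIfAbsent(fields[0], k -> new int[3]);
            if ("positive".equals(fields[2])) counts[0]++;
            else counts[1]++;
            counts[2]++;
        }
        return byCategory;
    }

    public static void main(String[] args) {
        Map<String, int[]> t = tally(Arrays.asList(
                "books\tu1\tpositive", "books\tu2\tnegative", "music\tu3\tpositive"));
        System.out.println(Arrays.toString(t.get("books"))); // [1, 1, 2]
    }
}
```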
20. Slide 20
User Defined Functions (UDFs) in Pig
Pig is a high-level data flow language.
It sits on top of Hadoop and makes it possible to create complex jobs to process large volumes of data quickly and efficiently.
It is similar to a SQL query in that the user specifies the “what” and leaves the “how” to the underlying processing engine.
21. Slide 21
Pig Latin – Creating UDF
A program to create a filter UDF:

import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

public class IsOfAge extends FilterFunc {
    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() == 0) {
            return false;
        }
        try {
            Object object = tuple.get(0);
            if (object == null) {
                return false;
            }
            int i = (Integer) object;
            return i == 18 || i == 19 || i == 21 || i == 23 || i == 27;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}

The compiled UDF is then registered in a Pig script with REGISTER and applied in a FILTER ... BY clause.
23. Slide 23
Questions?
Buy Complete Course at : www.edureka.in/hadoop
Interested in learning “Big-Data & Hadoop”?
Let us know by mailing us at sales@edureka.in