Unit 3
SERIALIZATION
 The process of converting an object into a byte stream is known as serialization.
 Serialization is the process of turning structured
objects into a byte stream for transmission over a
network or for writing to persistent storage.
Deserialization is the reverse process of turning a
byte stream back into a series of structured objects.
Serialization appears in two quite distinct areas of
distributed data processing: for interprocess
communication and for persistent storage. In Hadoop,
interprocess communication between nodes in the
system is implemented using remote procedure calls
(RPCs). The RPC protocol uses serialization to
render the message into a binary stream to be sent to
the remote node, which then deserializes the binary
stream into the original message. In general, it is
desirable that an RPC serialization format is:
 Compact A compact format makes the best use of
network bandwidth, which is the most scarce resource in
a data center.
 Fast Interprocess communication forms the backbone for
a distributed system, so it is essential that there is as little
performance overhead as possible for the serialization
and deserialization process.
 Extensible Protocols change over time to meet new
requirements, so it should be straightforward to evolve
the protocol in a controlled manner for clients and
servers.
 Interoperable For some systems, it is desirable to be
able to support clients that are written in different
languages to the server, so the format needs to be
designed to make this possible. Hadoop uses its own
serialization format, called Writables, which is described next.
The Writable Interface
 The Writable interface defines two methods: one for writing the object's state
to a DataOutput binary stream by using write(), and one for reading
its state from a DataInput binary stream by using readFields(), as shown
below:
package org.apache.hadoop.io;
import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;
public interface Writable
{
void write(DataOutput out) throws IOException;
void readFields(DataInput in) throws IOException;
}
We will use IntWritable, a wrapper for a Java int. We can create one
and set its value using the set() method:
IntWritable writable = new IntWritable();
writable.set(163);
Equivalently, we can use the constructor that takes the integer value:
IntWritable writable = new IntWritable(163);
To examine the serialized form of the IntWritable, we write a small helper
method that wraps a java.io.ByteArrayOutputStream in a
java.io.DataOutputStream to capture the bytes in the serialized stream:
public static byte[] serialize(Writable writable) throws IOException
{
ByteArrayOutputStream out = new ByteArrayOutputStream();
DataOutputStream dataOut = new DataOutputStream(out);
writable.write(dataOut);
dataOut.close();
return out.toByteArray();
}
An integer is written using four bytes:
byte[] bytes = serialize(writable);
assertThat(bytes.length, is(4));
The bytes are written in big-endian order and we can see their hexadecimal
representation by using a method on Hadoop’s StringUtils:
assertThat(StringUtils.byteToHexString(bytes), is("000000a3"));
Let’s try deserialization. Again, we create a helper
method to read a Writable object from a byte array:
public static byte[] deserialize(Writable writable, byte[] bytes) throws IOException
{
ByteArrayInputStream in = new ByteArrayInputStream(bytes);
DataInputStream dataIn = new DataInputStream(in);
writable.readFields(dataIn);
dataIn.close();
return bytes;
}
We construct a new, value-less IntWritable, then call deserialize() to read from
the output data that we just wrote.
Then we check that its value, retrieved using the get() method, is the original
value, 163:
IntWritable newWritable = new IntWritable();
deserialize(newWritable, bytes);
assertThat(newWritable.get(), is(163));
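The checks above use JUnit's assertThat with Hamcrest matchers. For a quick experiment without a test framework, here is a minimal, self-contained sketch of the same round trip (the class name RoundTripDemo is just an illustrative choice; it assumes hadoop-common is on the classpath):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

public class RoundTripDemo {
    public static void main(String[] args) throws IOException {
        // Serialize: write the IntWritable's state to an in-memory byte stream
        IntWritable writable = new IntWritable(163);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut);
        dataOut.close();
        byte[] bytes = out.toByteArray();   // 4 bytes, big-endian: 00 00 00 a3

        // Deserialize: read the bytes back into a fresh, value-less IntWritable
        IntWritable copy = new IntWritable();
        DataInputStream dataIn = new DataInputStream(new ByteArrayInputStream(bytes));
        copy.readFields(dataIn);
        dataIn.close();

        System.out.println(bytes.length + " bytes, value = " + copy.get());   // 4 bytes, value = 163
    }
}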
 Writable: Defines the serialization/deserialization protocol.
Every data type in Hadoop is a Writable.
 WritableComparable:
Defines a sort order; all keys must be of this type (but
not values). A small sketch of this ordering follows below.
 IntWritable, LongWritable, Text, etc.:
These are concrete Writable classes for the different Java primitive
types (and for String).
 SequenceFile:
A binary encoding of a sequence of key/value pairs.
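As a small illustration of the sort order that WritableComparable defines (a hypothetical snippet, not from the slides), two IntWritable instances can be compared directly:

import org.apache.hadoop.io.IntWritable;

public class CompareDemo {
    public static void main(String[] args) {
        IntWritable a = new IntWritable(67);
        IntWritable b = new IntWritable(163);
        // IntWritable implements WritableComparable, so compareTo()
        // gives the order MapReduce uses when it sorts keys.
        System.out.println(a.compareTo(b));   // negative: 67 sorts before 163
    }
}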
Writable is used for data serialization in MapReduce. It
has two methods: one for writing an object's state to a DataOutput and
another for reading it back from a DataInput. Hadoop provides Writables
for the Java primitive types and for some Java collections
as well. Before we move ahead, we need to be clear about
where they are used in a MapReduce program: Writables
wrap the keys and values emitted by mappers and the keys
and values received by reducers. In other words, the
mapper serializes its output with the help of Writables, and the
reducer receives its input from the network with the help of Writables.
Hadoop provides a sufficient number of Writables, but
sometimes it is necessary to write your own custom Writable.
Below is the mapping between Java types and Hadoop Writables
(a small sketch contrasting the fixed- and variable-length encodings follows the table).

Java type    Writable wrapper    Serialized size (bytes)
boolean      BooleanWritable     1
byte         ByteWritable        1
int          IntWritable         4
             VIntWritable        1–5
float        FloatWritable       4
long         LongWritable        8
             VLongWritable       1–9
String       Text                decided at runtime
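To see the difference between the fixed-width and variable-length encodings, the same value can be serialized with both, reusing the serialize() helper method from the earlier slide (a fragment; the variable-length size depends on the magnitude of the value):

IntWritable fixed = new IntWritable(163);
VIntWritable variable = new VIntWritable(163);
System.out.println(serialize(fixed).length);      // always 4
System.out.println(serialize(variable).length);   // between 1 and 5, depending on the value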
One important thing we cannot ignore here is that
Writables are mutable, unlike Java types such as
String, which is immutable; the sketch below illustrates this. The
serialized sizes used by the Writables are listed in the
table above; from a programmer's perspective, whether
there is a need to go into this level of detail depends on
the requirement.
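Because Writables are mutable, a single instance can be reused instead of allocating a new object for every record. A rough sketch (the variable names are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class ReuseDemo {
    public static void main(String[] args) {
        // One Text and one IntWritable are reused for every record,
        // the usual pattern inside mappers and reducers.
        Text word = new Text();
        IntWritable count = new IntWritable();
        String[] words = {"hadoop", "writable", "hadoop"};
        for (String w : words) {
            word.set(w);     // mutated in place; a java.lang.String could not be changed like this
            count.set(1);
            System.out.println(word + "\t" + count);
        }
    }
}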
NullWritable:
NullWritable is a special type of Writable with a
zero-length serialization: no bytes are written to, or
read from, the stream. It is used as a placeholder. For
example, in MapReduce a key or a value can be
declared as a NullWritable when you don't need to
use that position; it effectively stores a constant
empty value. NullWritable can also be useful as the key
in a SequenceFile when you want to store a list of
values, as opposed to key-value pairs. It is an
immutable singleton; the instance is retrieved by
calling NullWritable.get(). The term "placeholder" above
means that you want to retain the position
in the stream but do not want to store anything at that
position. Having covered the theory of NullWritable, let us
move on to its purpose in practice.
Purpose:
There are cases when the mappers are supposed to output data in an
unsorted way, or the user wants to merge the data from all splits into a single file.
This can be achieved by setting the same output key for all mappers, and that
can be done in two ways: either set the same non-null key everywhere or use
NullWritable as the key. Since NullWritable takes no space in the
stream, it seems beneficial to use it. But this will slow down
the MapReduce job.
If you increase the number of reducers, the processing speed is inversely
proportional to that number; that is, it will take more time to complete the same
task.
Why does this happen with NullWritable? We know
that the mapper output goes to the partitioner for sorting and shuffling, and the
partitioner is responsible for sending values with the same key to the same
reducer. Since all records have NullWritable as the key, they all
produce the same hash, zero, so every record
reaches a single reducer for processing. What are the other
reducers doing? Nothing: they just
start a JVM and stop it, which is why the job gets slower as the
number of reducers goes up. There is one more important point:
why does the sorting take so much time even though every record carries the same
duplicate key? It is because MapReduce uses quicksort for
sorting the data, and input in which every key is equal is a worst-case scenario for quicksort.
So do not use the same key, or NullWritable, as the key for all
mappers.
Ex:
Mapper output Key = NullWritable
Mapper output Value = Text
Reducer output = default
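A rough sketch of such a mapper (the class name and the assumption of line-oriented Text input are illustrative, not from the slides):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Every record is emitted with the singleton NullWritable as the key,
// so the key contributes zero bytes to the map output.
public class NullKeyMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(NullWritable.get(), line);
    }
}

As explained above, all of these records hash to the same partition, so a single reducer ends up processing everything no matter how many reducers are configured.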
ObjectWritable and GenericWritable
Use Cases:
1. Reducing records whose keys are of the same type but whose value
types can vary.
2. When two sequence files, with the same key type but different value types,
are mapped out to be reduced.
So far we have learnt how to write a mapper in which all keys are of the
same type and all values are of the same type, and the reducers iterate over
values of that single type. But now the situation is different: the mappers have to
emit pairs with the same key type while the values for a key can be of
different types, and the reducers have to merge these kinds of key-value
pairs.
Example pairs:
Pair1...<KeyType1, ValueType1>
Pair2...<KeyType1, ValueType2>
Pair3...<KeyType1, ValueType2>
This can be achieved by wrapping the primitive Writables. There are two
ways to do this.
1. Wrap all value Writables in ObjectWritable.
2. Extend GenericWritable.
ObjectWritable: simply wraps each value in an
ObjectWritable. This is the simplest and most general approach,
but it appends the class declaration as a
String to the output for every key-value
pair. From a space point of view this is a
drawback, because it costs extra space to write the
type name on every serialization.
Extending GenericWritable: uses a static array
of types to save that space. GenericWritable
implements the Configurable interface, so it will
be configured by the framework; the
configuration is passed to the wrapped objects
implementing the Configurable interface before
deserialization. It is best used when all of the value types are known in advance.
1. Write your own class, such as MyGenericWritable, which extends
GenericWritable.
2. Implement the abstract method getTypes(), which defines the classes that may be
wrapped by MyGenericWritable in the application. Only classes that
implement the Writable interface can be listed in getTypes(). An example class,
followed by a short usage sketch, is shown below.
import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.Writable;

public class MyGenericWritable extends GenericWritable
{
    // Only Writable implementations may be listed here.
    @SuppressWarnings("unchecked")
    private static Class<? extends Writable>[] types = (Class<? extends Writable>[]) new Class[] {
        TypeType1.class,
        TypeType2.class,
        TypeType3.class,
        TypeType4.class
        // ........
    };

    @Override
    protected Class<? extends Writable>[] getTypes()
    {
        return types;
    }
}
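A rough usage sketch in a mapper (hypothetical variable names; TypeType1 etc. stand for whatever Writable types were registered in getTypes()), wrapping a value before emitting it:

// Inside a mapper whose output value class is MyGenericWritable:
// the wrapper serializes only a small type index, not the full class name.
MyGenericWritable wrapped = new MyGenericWritable();
wrapped.set(new Text("some value"));      // assuming Text is one of the registered types
context.write(outputKey, wrapped);

// On the reducer side, the original Writable is recovered with get():
// Writable inner = wrapped.get();

The job also has to be told about the wrapper, for example with job.setMapOutputValueClass(MyGenericWritable.class).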
Example for ObjectWritable:
// Assume input is coming in the form of "int String int", tab-separated
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
    String[] tokens = value.toString().split("\t");
    IntWritable outKey = new IntWritable(Integer.parseInt(tokens[0]));
    String valueType1 = tokens[1];
    IntWritable valueType2 = new IntWritable(Integer.parseInt(tokens[2]));
    context.write(outKey, new ObjectWritable(valueType1));
    context.write(outKey, new ObjectWritable(valueType2));
    ..........
}
public void reduce(IntWritable key, Iterable<ObjectWritable> values, Context context) throws IOException, InterruptedException
{
    int sum = 0;
    String result = "";
    for (ObjectWritable wrapper : values) {
        Object value = wrapper.get();
        if (value instanceof String) {
            result += (String) value;
        } else if (value instanceof IntWritable) {
            sum += ((IntWritable) value).get();
        }
    }
    ......
}
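For the reducer to receive ObjectWritable values, the job has to declare ObjectWritable as the map output value class. A minimal driver sketch (the class name and job name are illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class ObjectWritableDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "object-writable-demo");
        job.setJarByClass(ObjectWritableDriver.class);
        // The map output value class must be ObjectWritable so that the
        // wrapped String and IntWritable values can travel to the reducer.
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(ObjectWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);   // whatever the reducer finally emits
        // mapper/reducer classes and input/output paths would be set here as usual
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}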
Implementing a Custom Writable
Before we move on to implementing a custom Writable, consider a question: when you hear the
word "writable", what picture comes to mind? If the concept of a Writable is clear,
the picture should be this: something that can be serialized and deserialized, and whose
values follow some particular order. In other words, we are going to define a new data type
for Hadoop MapReduce.
To define a new Writable (that is, a new data type for the MapReduce architecture) without hurting
MapReduce performance, we have to pay attention to the following:
1. Binary representation
Hadoop has a good stock of Writables, and it internally converts Writable values into binary form for
serialization. That built-in binary conversion is efficient, because the binary form of the data
held in a Writable is what gets transferred over the network. Since we are introducing our own custom Writable,
we have to be just as careful about its binary conversion: the binary representation is now under our own
control.
2. Serialize
Define a good way of serializing your custom Writable; for example, you can write your data with the help of a Java
output stream (DataOutput).
3. Deserialize
Define a good way of deserializing your custom Writable; for example, you can read your data with the help of a Java
input stream (DataInput).
4. compareTo
Hadoop's stock Writables implement WritableComparable because Hadoop
MapReduce sorts records. For sorting, records should follow some order, and this is
achieved by implementing WritableComparable in your custom Writable and defining a
compareTo() method in it.
5. Data partitioning by MapReduce
Hadoop partitions data with the help of a hash of each key. For calculating the hash
of its stock Writables it relies on their hashCode() implementations. Since you are defining your own
custom Writable, it is your responsibility to define a good hashCode() for your
Writable that helps Hadoop partition the data evenly.
6. Define a string representation
Some input and output formats use the string representation (toString()) of a Writable for their
processing; for example, TextOutputFormat calls
toString() on keys and values for their output representation. So define toString() in your
custom Writable.
7. Define equals()
Define equals(), consistent with hashCode() and compareTo(), so that instances can be
compared for equality. The TextPair class below illustrates all of these points:
import java.io.*;
import org.apache.hadoop.io.*;

public class TextPair implements WritableComparable<TextPair> {

    private Text first;
    private Text second;

    public TextPair() {
        set(new Text(), new Text());
    }

    public TextPair(String first, String second) {
        set(new Text(first), new Text(second));
    }

    public TextPair(Text first, Text second) {
        set(first, second);
    }

    public void set(Text first, Text second) {
        this.first = first;
        this.second = second;
    }

    public Text getFirst() {
        return first;
    }

    public Text getSecond() {
        return second;
    }

    @Override // serialize
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    @Override // deserialize
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    @Override
    public int hashCode() {
        return first.hashCode() * 163 + second.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof TextPair) {
            TextPair tp = (TextPair) o;
            return first.equals(tp.first) && second.equals(tp.second);
        }
        return false;
    }

    @Override
    public String toString() {
        return first + "\t" + second;
    }

    @Override
    public int compareTo(TextPair tp) {
        int cmp = first.compareTo(tp.first);
        if (cmp != 0) {
            return cmp;
        }
        return second.compareTo(tp.second);
    }
}
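As a rough check, reusing the serialize() and deserialize() helper methods from the earlier slides (the pair values chosen here are just for illustration), TextPair can be round-tripped and compared like any other WritableComparable:

// Assumes the serialize()/deserialize() helpers shown earlier are available.
TextPair original = new TextPair("Hadoop", "Writable");
byte[] bytes = serialize(original);            // first and second are written back to back

TextPair copy = new TextPair();
deserialize(copy, bytes);                      // readFields() repopulates both Text fields

System.out.println(copy);                      // prints the two strings separated by a tab
System.out.println(original.equals(copy));     // true
System.out.println(original.compareTo(copy));  // 0: identical pairs compare as equal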
  • 22. public Text getSecond() { return second; }@Override - serialize public void write(DataOutput out) throws IOException { first.write(out); second.write(out); }@Override - deserialize public void readFields(DataInput in ) throws IOException { first.readFields( in ); second.readFields( in ); }@Override public int hashCode() { return first.hashCode() * 163 + second.hashCode(); }@Override public boolean equals(Object o) { if (o instanceof TextPair) { TextPair tp = (TextPair) o; return first.equals(tp.first) && second.equals(tp.second); } return false; }@Override public String toString() { return first + "t" + second; } }@Override public int compareTo(TextPair tp) { int cmp = first.compareTo(tp.first); if (cmp != 0) { return cmp; } return second.compareTo(tp.second); }