2. Serialization
Serialization is the process of turning structured
objects into a byte stream for transmission over a
network or for writing to persistent storage.
Deserialization is the reverse process of turning a
byte stream back into a series of structured objects.
Serialization appears in two quite distinct areas of
distributed data processing: for interprocess
communication and for persistent storage. In Hadoop,
interprocess communication between nodes in the
system is implemented using remote procedure calls
(RPCs). The RPC protocol uses serialization to
render the message into a binary stream to be sent to
the remote node, which then deserializes the binary
stream into the original message.
3. In general, an RPC serialization format should be:
Compact: A compact format makes the best use of network bandwidth, the scarcest resource in a data center.
Fast: Interprocess communication forms the backbone of a distributed system, so it is essential that the serialization and deserialization process has as little performance overhead as possible.
Extensible: Protocols change over time to meet new requirements, so it should be straightforward to evolve the protocol in a controlled manner for clients and servers.
Interoperable: For some systems, it is desirable to be able to support clients that are written in different languages from the server, so the format needs to be designed to make this possible.
Hadoop uses its own serialization format, Writables.
4. The Writable Interface
The Writable interface defines two methods: one for writing an object's state
to a DataOutput binary stream using write(), and one for reading its state
from a DataInput binary stream using readFields(), as shown below:
package org.apache.hadoop.io;

import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;

public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}
We will use IntWritable, a wrapper for a Java int. We can create one
and set its value using the set() method:
IntWritable writable = new IntWritable();
writable.set(163);
Equivalently, we can use the constructor that takes the integer value:
IntWritable writable = new IntWritable(163);
5. To examine the serialized form of the IntWritable, we write a small helper
method that wraps a java.io.ByteArrayOutputStream in a
java.io.DataOutputStream to capture the bytes in the serialized stream:
public static byte[] serialize(Writable writable) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DataOutputStream dataOut = new DataOutputStream(out);
    writable.write(dataOut);
    dataOut.close();
    return out.toByteArray();
}
An integer is written using four bytes:
byte[] bytes = serialize(writable);
assertThat(bytes.length, is(4));
The bytes are written in big-endian order and we can see their hexadecimal
representation by using a method on Hadoop’s StringUtils:
assertThat(StringUtils.byteToHexString(bytes), is("000000a3"));
6. Let's try deserialization. Again, we create a helper
method to read a Writable object from a byte array:
public static byte[] deserialize(Writable writable, byte[] bytes) throws IOException {
    ByteArrayInputStream in = new ByteArrayInputStream(bytes);
    DataInputStream dataIn = new DataInputStream(in);
    writable.readFields(dataIn);
    dataIn.close();
    return bytes;
}
We construct a new, value-less IntWritable and call deserialize() to read from
the output data that we just wrote.
Then we check that its value, retrieved using the get() method, is the original
value, 163:
IntWritable newWritable = new IntWritable();
deserialize(newWritable, bytes);
assertThat(newWritable.get(), is(163));
7. Writable: Defines a de/serialization protocol.
Every key and value type in Hadoop MapReduce is a Writable.
WritableComparable:
Defines a sort order. All keys must be of this type (but
not values).
IntWritable, LongWritable, Text, etc.:
These are concrete Writable classes for the different Java primitive
types (with Text wrapping String).
SequenceFile:
A binary encoding of a sequence of key/value pairs.
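To make the SequenceFile idea concrete, here is a minimal sketch (our own example, not from the slides) that writes Writable key/value pairs; the path and values are arbitrary, and the builder-style options assume the Hadoop 2.x API:
// Minimal sketch: write key/value pairs to a SequenceFile (Hadoop 2.x API).
// Classes come from org.apache.hadoop.conf, org.apache.hadoop.fs and org.apache.hadoop.io.
Configuration conf = new Configuration();
Path path = new Path("numbers.seq"); // arbitrary example path
SequenceFile.Writer writer = SequenceFile.createWriter(conf,
    SequenceFile.Writer.file(path),
    SequenceFile.Writer.keyClass(IntWritable.class),
    SequenceFile.Writer.valueClass(Text.class));
try {
    writer.append(new IntWritable(1), new Text("one"));
    writer.append(new IntWritable(2), new Text("two"));
} finally {
    writer.close();
}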
8. Writable is used for data serialization in MapReduce. It
has two methods: one for writing its state to a DataOutput and
another for reading its state from a DataInput. Hadoop provides
Writables for the Java primitive types and for some Java collections
as well. Before we move ahead, we need to be clear about
where they are used in a MapReduce program. Writables are
used for wrapping the keys and values emitted by mappers and the keys
and values received by reducers. So we can think of it this way: a
mapper serializes its output with the help of Writables, and a
reducer gets its input from the network with the help of Writables.
Hadoop provides a sufficient number of Writables, but
sometimes it is required to write your own custom Writable.
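To make this concrete, here is an illustrative sketch of a mapper whose input and output types are all Writables (the class name and tokenizing logic are our own, not from the slides; it assumes the new API's org.apache.hadoop.mapreduce.Mapper):
// Hypothetical word-count style mapper: every key/value type is a Writable.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE); // serialized as Writables on the way to the reducers
        }
    }
}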
9. Below is a mapping between Java
primitive types and Hadoop Writables.
Java primitive    Writable           Serialized size (bytes)
boolean           BooleanWritable    1
byte              ByteWritable       1
int               IntWritable        4
int               VIntWritable       1–5
float             FloatWritable      4
long              LongWritable       8
long              VLongWritable      1–9
String            Text               decided at runtime (variable length)
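As a quick check of the variable-length encoding, reusing the serialize() helper from earlier (the expected lengths follow from the VInt encoding rules):
// 163 needs two bytes as a VInt (one length byte plus one value byte),
// versus a fixed four bytes as an IntWritable.
assertThat(serialize(new IntWritable(163)).length, is(4));
assertThat(serialize(new VIntWritable(163)).length, is(2));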
10. One important thing which we cannot ignore here
is that Writables are mutable, unlike some Java types, say
java.lang.String, which is immutable in nature. The
table above gives the byte-level details of the Writables,
but from a programmer's perspective it
depends on the requirements whether there is a need
to go into these details or not.
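Mutability is what lets a single Writable instance be reused across records instead of allocating a new object each time; for example:
// The same IntWritable instance is reused by calling set() again.
IntWritable writable = new IntWritable(163);
writable.set(164); // same object, new value
assertThat(writable.get(), is(164));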
11. NullWritable:
NullWritable is a special type of Writable, as it has a
zero-length serialization. No bytes are written to, or
read from, the stream. It is used as a placeholder: for
example, in MapReduce, a key or a value can be
declared as a NullWritable when you don't need to
use that position; it effectively stores a constant
empty value. NullWritable can also be useful as a key
in a SequenceFile when you want to store a list of
values, as opposed to key-value pairs. It is an
immutable singleton; the instance can be retrieved by
calling NullWritable.get(). Above we used the term
"placeholder": it means you want to retain the position
in the stream but do not want to store anything in that
position. After this much theory about NullWritable,
it is time to move on to the practical side and the
purpose behind NullWritable.
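For instance, a mapper that only cares about keys can fill the value slot with the NullWritable singleton (a sketch with our own class name, assuming the new-API Mapper):
// Hypothetical mapper that emits only keys; NullWritable fills the value slot.
public class KeysOnlyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, NullWritable.get()); // singleton; zero bytes serialized
    }
}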
12. Purpose:
There can be cases where mappers are supposed to output data in an un-
sorted way, or the user wants to merge data from all splits into a single file. This
can be achieved by setting the same output key for all mappers. It
can be done in two ways: either set the same non-null key, or set
NullWritable as the key. We know that NullWritable does not take space in the
stream, so NullWritable seems beneficial to use. But this will slow down
the MapReduce process.
If you increase the number of reducers, the process speed will be inversely
proportional to it; that is, it will take more time to complete the same
task.
Now the question arises: why does this happen with NullWritable? We know
that a mapper's output goes to the partitioner for sorting and shuffling, and the
partitioner is responsible for sending values with the same key to the same
reducer. Since all our records have NullWritable as the key, they all
generate the same hash, which is zero (0), so all records
reach a single reducer for processing. The next question is
what the other reducers are doing: they are doing nothing, they just
start a JVM and stop it, which is why the process gets slower as the
number of reducers goes up. We are still missing one very important thing:
why does sorting take so much time even though there is a duplicate key for
all records? This happens because MapReduce uses QuickSort for
sorting data, and as we know, all-equal keys are a worst-case scenario for QuickSort.
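For reference, the default partitioner's logic is essentially the following (a simplified sketch of org.apache.hadoop.mapreduce.lib.partition.HashPartitioner), which makes the single-reducer effect easy to see:
// Simplified sketch of the default HashPartitioner:
public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
// NullWritable.hashCode() is a constant 0, so every record lands in
// partition 0 no matter how many reducers are configured.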
13. Conclusion: do not use the same key, or NullWritable, as the key for all
mapper output.
Ex (the problematic setup described above):
Mapper output key = NullWritable
Mapper output value = Text
Reducer output = default
14. ObjectWritable and
GenericWritable
Use Cases:
1. Reducing records when the keys in the records are of the same type but the value
types can vary.
2. When two sequence files, with the same key type but different value types,
are mapped out to be reduced.
So far we have learned how to write a mapper whose keys are all of the
same type and whose values are all of the same type, with reducers iterating over
values of the same type. But now the situation is different: mappers have to
emit pairs with the same type of key, but the values for a key can be of
different types, and reducers have to merge these kinds of key-value
pairs.
Example pairs:
Pair1...<KeyType1, ValueType1>
Pair2...<KeyType1, ValueType2>
Pair3...<KeyType1, ValueType2>
This can be achieved by wrapping primitive Writables. There are two
ways to do this:
1. Wrap all value Writables in ObjectWritable
2. Write a custom wrapper class that extends GenericWritable
15. ObjectWritable: just wrap each value in an
ObjectWritable. This class is simple to use,
but it appends the class declaration as a
String to the output in every key-value
pair. From a space point of view this is a
drawback, because it costs extra space to write the
type name with every serialization.
Extending GenericWritable: this uses a static array
of types to save space. GenericWritable
implements the Configurable interface, so it will
be configured by the framework. The
configuration is passed to the wrapped objects
implementing the Configurable interface before
deserialization. It is best used when all value types are known in advance.
16. 1. Write your own class, such as MyGenericWritable, which extends
GenericWritable.
2. Implement the abstract method getTypes(), which defines the classes that will be
wrapped by MyGenericWritable in the application. Only classes that
implement the Writable interface can be listed in getTypes().
public class MyGenericWritable extends GenericWritable {
    private static Class[] types = {
        TypeType1.class,
        TypeType2.class,
        TypeType3.class,
        TypeType4.class,
        // ........
    };

    @Override
    protected Class[] getTypes() {
        return types;
    }
}
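Usage is then straightforward; a hedged sketch (outputKey and the TypeTypeN classes are the placeholders from above), relying on GenericWritable's set() and get():
// Wrap a value before emitting it; its class must appear in getTypes().
MyGenericWritable wrapped = new MyGenericWritable();
wrapped.set(new TypeType1());
context.write(outputKey, wrapped);
// On the reduce side, unwrap with get() and inspect the concrete type.
Writable raw = wrapped.get();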
17. Example for ObjectWritable:
// Assume input is coming in the form of "int String int"
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] tokens = value.toString().split("\t");
    IntWritable outKey = new IntWritable(Integer.parseInt(tokens[0]));
    String valueType1 = tokens[1];
    IntWritable valueType2 = new IntWritable(Integer.parseInt(tokens[2]));
    context.write(outKey, new ObjectWritable(valueType1));
    context.write(outKey, new ObjectWritable(valueType2));
    // ..........
}
18. public void reduce(IntWritable key, Iterable<ObjectWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    String result = "";
    for (ObjectWritable wrapper : values) {
        Object value = wrapper.get();
        if (value instanceof String) {
            result += (String) value;
        } else if (value instanceof IntWritable) {
            sum += ((IntWritable) value).get();
        }
    }
    // ......
}
19. Implementing a Custom Writable
Before we move to implementing a custom Writable, I have a question for you. The question is: when you hear
the word "writable", what kind of image comes to your mind? If you are pretty clear about the concept of a
Writable, the picture should be something like this: it is something which can be serialized and de-serialized, and
values of this type will follow some particular order. Yes, you are right: we are going to define a new data type
for Hadoop MapReduce.
So, to define a new Writable (like a data type for the MapReduce architecture), we have to emphasize the
following in order to maintain MapReduce performance:
1. Binary representation
Hadoop has a good stock of Writables, and Hadoop internally converts Writable values into binary form for
serialization. We can be sure that binary conversion by Hadoop is very efficient, because this binary form of the data
stored in a Writable is what is transferred over the network. But here we are introducing our own custom Writable, so
we have to be very careful about the binary conversion of our custom Writable. Now you have your own
control over the binary representation.
2. Serialize
Define a good way of serializing your custom Writable; for example, you can write your data with the help of a Java
output stream.
3. Deserialize
Define a good way of de-serializing your custom Writable; for example, you can read your data with the help of a Java
input stream.
20. 4. compareTo
You know Hadoop's stock Writables implement WritableComparable; that is because Hadoop
MapReduce sorts records. For sorting, records should follow some order, and this can be
achieved by implementing WritableComparable in your custom Writable and by defining a
compareTo() method in your custom Writable.
5. Data partitioning by MapReduce
Hadoop partitions data with the help of a hash of each comparable value. For calculating the hash
of its stock Writables it has default hashCode() functions. But here you are defining your own
custom Writable, so it is your responsibility to define a good hashCode() function for your
Writable which can help Hadoop with data partitioning.
6. Define a string representation
Some input and output formats use the string representation (toString()) of a Writable for their internal
processing; for example, TextOutputFormat calls
toString() on keys and values for their output representation. So define toString() in your
Writable.
7. Define equals()
21. import java.io.*;
import org.apache.hadoop.io.*;

public class TextPair implements WritableComparable<TextPair> {

    private Text first;
    private Text second;

    public TextPair() {
        set(new Text(), new Text());
    }

    public TextPair(String first, String second) {
        set(new Text(first), new Text(second));
    }

    public TextPair(Text first, Text second) {
        set(first, second);
    }

    public void set(Text first, Text second) {
        this.first = first;
        this.second = second;
    }

    public Text getFirst() {
        return first;
    }

    public Text getSecond() {
        return second;
    }

    @Override // serialize
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    @Override // deserialize
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    @Override
    public int hashCode() {
        return first.hashCode() * 163 + second.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof TextPair) {
            TextPair tp = (TextPair) o;
            return first.equals(tp.first) && second.equals(tp.second);
        }
        return false;
    }

    @Override
    public String toString() {
        return first + "\t" + second;
    }

    @Override
    public int compareTo(TextPair tp) {
        int cmp = first.compareTo(tp.first);
        if (cmp != 0) {
            return cmp;
        }
        return second.compareTo(tp.second);
    }
}
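As a quick sanity check of TextPair's round trip (our own example, reusing the serialize() and deserialize() helpers from earlier):
TextPair pair = new TextPair("London", "Paris");
byte[] bytes = serialize(pair); // two length-prefixed Text fields, back to back
TextPair copy = new TextPair();
deserialize(copy, bytes);
assertThat(copy, is(pair)); // equals() compares both fields
assertThat(copy.compareTo(new TextPair("London", "Zurich")), lessThan(0));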