1. BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF
HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH
(Big) Data Serialization with Avro and
Protobuf
Guido Schmutz
Munich – 7.11.2018
@gschmutz guidoschmutz.wordpress.com
2. Guido Schmutz
Working at Trivadis for more than 21 years
Oracle ACE Director for Fusion Middleware and SOA
Consultant, Trainer, Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
6. What is Serialization / Deserialization ?
Serialization is the process of turning structured in-memory objects into a byte stream
for transmission over a network or for writing to persistent storage
Deserialization is the reverse process, turning a byte stream back into structured
in-memory objects
When selecting a data serialization format, the following characteristics should be
evaluated:
• Schema support and Schema evolution
• Code generation
• Language support / Interoperability
• Transparent compression
• Splittability
• Support in Big Data / Fast Data Ecosystem
7. Where do we need Serialization / Deserialization ?
[Diagram: serialization and deserialization happen at every system boundary. A service/client exposing a REST API, an event hub (publish/subscribe), data flow/integration pipelines, stream analytics producing results, and a data lake with raw and refined storage plus parallel processing all serialize data on write and deserialize it on read.]
8. Sample Data Structure used in this presentation
Person (1.0)
• id : integer
• firstName : text
• lastName : text
• title : enum(unknown,mr,mrs,ms)
• emailAddress : text
• phoneNumber : text
• faxNumber : text
• dateOfBirth : date
• addresses : array<Address>
Address (1.0)
• streetAndNr : text
• zipAndCity : text
{
"id":"1",
"firstName":"Peter",
"lastName":"Sample",
"title":"mr",
"emailAddress":"peter.sample@somecorp.com",
"phoneNumber":"+41 79 345 34 44",
"faxNumber":"+41 31 322 33 22",
"dateOfBirth":"1995-11-10",
"addresses":[
{
"id":"1",
"streetAndNr":"Somestreet 10",
"zipAndCity":"9332 Somecity"
}
]
}
https://github.com/gschmutz/various-demos/tree/master/avro-vs-protobuf
10. Google Protocol Buffers
https://developers.google.com/protocol-buffers/
Protocol buffers (protobuf) are Google's language-neutral, platform-neutral, extensible
mechanism for serializing structured data
• like XML, but smaller, faster, and simpler
A schema is needed to generate code and to read/write data
Code generation is supported for Java, Python, Objective-C, C++, Go, Ruby, and C#
Two different versions exist: proto2 and proto3
This presentation is based on proto3
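As an illustrative sketch, the Person sample data structure from this presentation could be declared in proto3 as follows (field names, numbers, and the package are invented for illustration, not taken from the demo project; proto3 has no built-in date type, so an ISO string is used):

```protobuf
syntax = "proto3";
package com.trivadis.protobuf.person.v1;

// proto3 requires the first enum value to be zero
enum TitleEnum {
  UNKNOWN = 0;
  MR = 1;
  MRS = 2;
  MS = 3;
}

// multiple message types can be defined in a single .proto file
message Address {
  int32 id = 1;
  string street_and_nr = 2;
  string zip_and_city = 3;
}

message Person {
  int32 id = 1;
  string first_name = 2;
  string last_name = 3;
  TitleEnum title = 4;
  string email_address = 5;
  string phone_number = 6;
  string fax_number = 7;
  string date_of_birth = 8;  // ISO date string, e.g. "1995-11-10"
  repeated Address addresses = 9;
}
```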
11. Apache Avro
http://avro.apache.org/docs/current/
Apache Avro™ is a compact, fast, binary data serialization system invented by the
makers of Hadoop
Avro relies on schemas: when data is read, the schema used when writing it is always present
Provides a container file format for storing persistent data
Works both with code generation and in a dynamic manner
Latest version: 1.8.2
15. Defining Schema - IDL
@namespace("com.trivadis.avro.person.v1")
protocol PersonIdl {
import idl "Address-v1.avdl";
enum TitleEnum {
Unknown, Mr, Ms, Mrs
}
record Person {
int id;
string firstName;
string lastName;
TitleEnum title;
union { null, string } emailAddress;
union { null, string } phoneNumber;
union { null, string } faxNumber;
date dateOfBirth;
array<com.trivadis.avro.address.v1.Address> addresses;
}
}
@namespace("com.trivadis.avro.address.v1")
protocol AddressIdl {
record Address {
int id;
string streetAndNr;
string zipAndCity;
}
}
Note: a JSON schema can be generated from an IDL schema using Avro Tools
Address-v1.avdl
Person-v1.avdl
https://avro.apache.org/docs/current/idl.html
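For illustration, the Address record above expressed as a JSON (.avsc) schema, roughly what Avro Tools would emit from the IDL (a sketch, not the actual generated output):

```json
{
  "type": "record",
  "name": "Address",
  "namespace": "com.trivadis.avro.address.v1",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "streetAndNr", "type": "string" },
    { "name": "zipAndCity", "type": "string" }
  ]
}
```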
16. Defining Schema - Specification
Protobuf:
• Multiple message types can be defined in a single .proto file
• Field numbers: each field in the message has a unique number
• used to identify the fields in the binary message format
• should not be changed once the message type is in use
• field numbers 1 – 15 take a single byte to encode, 16 – 2047 take two bytes
• Default values are type-specific
Avro:
• Schema can be represented either as JSON or by using the IDL
• Avro specifies two serialization encodings: binary and JSON
• Encoding is done in the order of the fields defined in the record
• The schema used to write the data always needs to be available when the data is read
• the schema can be serialized with the data, or
• the schema is made available through a registry
18. Defining Schema - Style Guides
Protobuf:
• Use CamelCase (with an initial capital) for message names
• Use underscore_separated_names for field names
• Use CamelCase (with an initial capital) for enum type names
• Use CAPITALS_WITH_UNDERSCORES for enum value names
• Use Java-style comments for documenting
Avro:
• Use CamelCase (with an initial capital) for record names
• Use camelCase for field names
• Use CamelCase (with an initial capital) for enum type names
• Use CAPITALS_WITH_UNDERSCORES for enum value names
• Use Java-style comments (IDL) or the doc property (JSON) for documenting
20. Using Protobuf and Avro from Java
If you are using Maven, add the following dependencies to your POM:
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.8.2</version>
</dependency>
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
<version>3.6.1</version>
</dependency>
21. With Code Generation – Generate the code
Protobuf: run the protocol buffer compiler (one compiler for all supported languages), which produces classes for the given language:
protoc -I=$SRC_DIR --java_out=$DST_DIR $SRC_DIR/person-v1.proto
Avro: run the specific tool for the given language
• For Java:
java -jar /path/to/avro-tools-1.8.2.jar compile schema Person-v1.avsc .
• For C++:
avrogencpp -i cpx.json -o cpx.hh -n c
• For C#:
Microsoft.Hadoop.Avro.Tools codegen /i:C:\SDK\src\Microsoft.Hadoop.Avro.Tools\SampleJSON\SampleJSONSchema.avsc /o:
22. With Code Generation – Using Maven
Protobuf: use the protobuf-maven-plugin for generating code at Maven build time
• Generates to target/generated-sources
• Scans all project dependencies for .proto files
• protoc has to be installed on the machine
Avro: use the avro-maven-plugin for generating code at Maven build time
• Generates to target/generated-sources
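A minimal avro-maven-plugin configuration might look like this (a sketch; the source and output paths are illustrative, not the demo project's actual POM):

```xml
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.8.2</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
        <outputDirectory>${project.basedir}/target/generated-sources/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```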
24. With Code Generation – Serializing
Avro:
FileOutputStream fos = new FileOutputStream(BIN_FILE_NAME_V1);
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<Person> writer = new SpecificDatumWriter<Person>(Person.getClassSchema());
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(person, encoder);
encoder.flush();
out.close();
byte[] serializedBytes = out.toByteArray();
fos.write(serializedBytes);

Protobuf:
FileOutputStream output = new FileOutputStream(BIN_FILE_NAME_V2);
person.writeTo(output);
25. With Code Generation – Deserializing
DatumReader<Person> datumReader = new
SpecificDatumReader<Person>(Person.class);
byte[] bytes = Files.readAllBytes(new File(BIN_FILE_NAME_V1).toPath());
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
Person person = datumReader.read(null, decoder);
System.out.println(person.getFirstName());
PersonWrapper.Person person =
PersonWrapper.Person.parseFrom(new
FileInputStream(BIN_FILE_NAME_V1));
System.out.println(person.getFirstName());
26. Encoding
Protobuf:
• Field numbers (tags) are used as keys
• Variable-length encoding for int32 and int64
• additionally zig-zag encoding for sint32 and sint64
• Data is serialized in the field order of the schema
Avro:
• Variable-length, zig-zag encoding for int and long; fixed length for float and double
Variable-length encoding: a method of serializing integers using one or more bytes
Zig-zag encoding: maps signed integers to unsigned ones so that small negative numbers are encoded more efficiently
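Both encodings can be sketched in a few lines of plain Java to make the byte savings concrete (a self-contained illustration of the technique, not the actual protobuf or Avro implementation):

```java
import java.io.ByteArrayOutputStream;

public class VarintDemo {

    // Variable-length encoding: 7 payload bits per byte,
    // high bit set on every byte except the last
    static byte[] encodeVarint(long v) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
        return out.toByteArray();
    }

    // Zig-zag encoding: maps -1 -> 1, 1 -> 2, -2 -> 3, ...
    // so small negative numbers also get short varints
    static long zigzag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    public static void main(String[] args) {
        System.out.println(encodeVarint(1).length);           // 1 byte
        System.out.println(encodeVarint(300).length);         // 2 bytes
        System.out.println(encodeVarint(zigzag(-1)).length);  // 1 byte
        // without zig-zag, -1 as an unsigned 64-bit varint needs 10 bytes
        System.out.println(encodeVarint(-1).length);          // 10 bytes
    }
}
```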
27. Without code generation
final String schemaLoc = "src/main/avro/Person-v1.avsc";
final File schemaFile = new File(schemaLoc);
final Schema schema = new Schema.Parser().parse(schemaFile);
GenericRecord person1 = new GenericData.Record(schema);
person1.put("id", 1);
person1.put("firstName", "Peter");
person1.put("lastName", "Muster");
person1.put("title", "Mr");
person1.put("emailAddress", "peter.muster@somecorp.com");
person1.put("phoneNumber", "+41 79 345 34 44");
person1.put("faxNumber", "+41 31 322 33 22");
person1.put("dateOfBirth", new LocalDate("1995-11-10"));
28. Serializing to Object Container File
The file has the schema embedded, and all objects stored in the file must conform to that schema. Objects are stored in blocks that may be compressed.

final DatumWriter<Person> datumWriter = new SpecificDatumWriter<>(Person.class);
final DataFileWriter<Person> dataFileWriter = new DataFileWriter<>(datumWriter);
// use snappy compression
dataFileWriter.setCodec(CodecFactory.snappyCodec());
// specify block size (must be set before create)
dataFileWriter.setSyncInterval(1000);
dataFileWriter.create(persons.get(0).getSchema(), new File(CONTAINER_FILE_NAME_V1));
persons.forEach(person -> {
    try {
        dataFileWriter.append(person);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
});
30. Schema Evolution
Person (1.0)
• id : integer
• firstName : text
• lastName : text
• title : enum(unknown,mr,mrs,ms)
• emailAddress : text
• phoneNumber : text
• faxNumber : text
• dateOfBirth : date
• addresses : array<Address>
Address (1.0)
• streetAndNr : text
• zipAndCity : text
Person (1.1)
• id : integer
• firstName : text
• middleName : text
• lastName : text
• title : enum(unknown,mr,mrs,ms)
• emailAddress : text
• phoneNumber : text
• faxNumber : text
• addresses : array<Address>
• birthDate : date
Address (1.0)
• streetAndNr : text
• zipAndCity : text
V1.0 to V1.1
• Adding middleName
• Renaming dateOfBirth to birthDate
• Moving birthDate after addresses
• Removing faxNumber
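In Avro, the additions and renames above can be made compatible through default values and field aliases in the reader schema; a sketch of the relevant v1.1 field declarations (illustrative, not the demo project's actual schema):

```json
{ "name": "middleName", "type": ["null", "string"], "default": null },
{ "name": "birthDate",
  "type": { "type": "int", "logicalType": "date" },
  "aliases": ["dateOfBirth"] }
```

The default lets a v1.1 reader consume v1.0 data that lacks middleName, and the alias lets it resolve the old dateOfBirth field name.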
38. Avro and Kafka – Producing Avro to Kafka
@Configuration
public class KafkaConfig {
private String bootstrapServers;
private String schemaRegistryURL;
@Bean
public Map<String, Object> producerConfigs() {
Map<String, Object> props = new HashMap<>();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
props.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryURL);
return props;
}
@Bean
public ProducerFactory<String, Person> producerFactory() { .. }
@Bean
public KafkaTemplate<String, Person> kafkaTemplate() {
return new KafkaTemplate<>(producerFactory());
}
@Component
public class CustomerEventProducer {
@Autowired
private KafkaTemplate<String, Person> kafkaTemplate;
@Value("${kafka.topic.person}")
String kafkaTopic;
public void produce(Person person) {
kafkaTemplate.send(kafkaTopic, person.getId().toString(), person);
}
}
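On the consuming side, the Confluent deserializer is configured analogously; a sketch of the relevant consumer properties (host names and group id are placeholders):

```properties
bootstrap.servers=broker-1:9092
group.id=person-consumer
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=io.confluent.kafka.serializers.KafkaAvroDeserializer
schema.registry.url=http://schema-registry:8081
# return generated SpecificRecord classes instead of GenericRecord
specific.avro.reader=true
```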
39. Avro and Big Data
Avro is widely supported by Big Data Frameworks: Hadoop MapReduce, Pig, Hive,
Sqoop, Apache Spark, …
The spark-avro library supports using Avro as a source for
Spark DataFrames: https://github.com/databricks/spark-avro
import com.databricks.spark.avro._
val personDF = spark.read.avro("person-v1.avro")
personDF.createOrReplaceTempView("Person")
val subPersonDF =
spark.sql("select * from Person where firstName like 'G%'")
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
40. Column-oriented: Apache Parquet and ORC
A logical table can be translated using
• Row-based layout (Avro, Protobuf,
JSON, …)
• Column-oriented layout (Parquet,
ORC, …)
Apache Parquet
• collaboration between Twitter and Cloudera
• Support in Hadoop, Hive, Spark, Apache
NiFi, StreamSets, Apache Pig, …
Apache ORC
• was created by Facebook and Hortonworks
• Support in Hadoop, Hive, Spark, Apache
NiFi, Apache Pig, Presto, …
Example table with columns A, B, C and rows (A1, B1, C1), (A2, B2, C2), (A3, B3, C3):
• Row-based layout: A1 B1 C1 A2 B2 C2 A3 B3 C3
• Column-oriented layout: A1 A2 A3 B1 B2 B3 C1 C2 C3
42. Protobuf and gRPC
https://grpc.io/
Google's high-performance, open-source, universal RPC framework,
layering on top of HTTP/2 and using protocol buffers to define messages
Support for Java, C#, C++, Python, Go, Ruby, Node.js, Objective-C, …
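To give an idea of how messages and services combine, a minimal gRPC service definition for the Person example might look like this (the service, method, and message names are invented for illustration):

```protobuf
syntax = "proto3";
package com.trivadis.protobuf.person.v1;

message PersonRequest {
  int32 id = 1;
}

message PersonReply {
  int32 id = 1;
  string first_name = 2;
  string last_name = 3;
}

service PersonService {
  // unary RPC: one request, one response
  rpc GetPerson (PersonRequest) returns (PersonReply);
}
```

The protobuf compiler with the gRPC plugin then generates both the message classes and client/server stubs for the supported languages.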
44. Serialization / Deserialization
[Diagram, repeated from slide 7: serialization and deserialization happen at every system boundary, from service/client REST APIs, the event hub (publish/subscribe), data flow/integration and stream analytics, to raw and refined storage in the data lake.]
45. Summary - Avro vs. Protobuf

                          Protobuf                Avro
Schema Evolution          Field Tag               Schema
Compatibility Support     great                   good
Code Generation           yes                     yes
Dynamic Support           no                      yes
Compactness of Encoding   good                    great
Persistence Support       no                      yes
Supports Compression      no                      yes
RPC Support               no (yes with gRPC)      yes
Big Data Support          no                      yes
Supported Languages       Java, C++, C#, Python,  Java, C++, C#, Python, Go, …
                          Objective-C, Go, …
46. Technology on its own won't help you.
You need to know how to use it properly.