1. BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF
HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH
(Big) Data Serialization with Avro and
Protobuf
Guido Schmutz
Munich – 7.11.2018
@gschmutz guidoschmutz.wordpress.com
2. Guido Schmutz
Working at Trivadis for more than 21 years
Oracle ACE Director for Fusion Middleware and SOA
Consultant, Trainer, Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
6. What is Serialization / Deserialization ?
Serialization is the process of turning structured in-memory objects into a byte stream
for transmission over a network or for writing to persistent storage
Deserialization is the reverse process, turning a byte stream back into structured
in-memory objects
When selecting a data serialization format, the following characteristics should be
evaluated:
• Schema support and Schema evolution
• Code generation
• Language support / Interoperability
• Transparent compression
• Splittability
• Support in Big Data / Fast Data Ecosystem
7. Where do we need Serialization / Deserialization ?
[Diagram: serialization and deserialization happen at every system boundary. A service/client exposing a REST API, an event hub (publish/subscribe), data flow/integration pipelines, stream analytics producing results, and a data lake with raw and refined storage plus parallel processing all serialize data on write and deserialize it on read.]
8. Sample Data Structure used in this presentation
Person (1.0)
• id : integer
• firstName : text
• lastName : text
• title : enum(unknown,mr,mrs,ms)
• emailAddress : text
• phoneNumber : text
• faxNumber : text
• dateOfBirth : date
• addresses : array<Address>
Address (1.0)
• streetAndNr : text
• zipAndCity : text
{
"id":"1",
"firstName":"Peter",
"lastName":"Sample",
"title":"mr",
"emailAddress":"peter.sample@somecorp.com",
"phoneNumber":"+41 79 345 34 44",
"faxNumber":"+41 31 322 33 22",
"dateOfBirth":"1995-11-10",
"addresses":[
{
"id":"1",
"streetAndNr":"Somestreet 10",
"zipAndCity":"9332 Somecity"
}
]
}
https://github.com/gschmutz/various-demos/tree/master/avro-vs-protobuf
10. Google Protocol Buffers
https://developers.google.com/protocol-buffers/
Protocol buffers (protobuf) are Google's language-neutral, platform-neutral, extensible
mechanism for serializing structured data
• like XML, but smaller, faster, and simpler
A schema is needed to generate code and to read/write data
Code generation is supported for Java, Python, Objective-C, C++, Go, Ruby, and C#
Two different versions exist: proto2 and proto3
This presentation is based on proto3
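As an illustrative sketch, the Person sample data structure from this presentation could be declared in proto3 as follows (field names, numbers, and the package are invented for illustration, not taken from the demo project; proto3 has no built-in date type, so an ISO string is used):

```protobuf
syntax = "proto3";
package com.trivadis.protobuf.person.v1;

// proto3 requires the first enum value to be zero
enum TitleEnum {
  UNKNOWN = 0;
  MR = 1;
  MRS = 2;
  MS = 3;
}

// multiple message types can be defined in a single .proto file
message Address {
  int32 id = 1;
  string street_and_nr = 2;
  string zip_and_city = 3;
}

message Person {
  int32 id = 1;
  string first_name = 2;
  string last_name = 3;
  TitleEnum title = 4;
  string email_address = 5;
  string phone_number = 6;
  string fax_number = 7;
  string date_of_birth = 8;  // ISO date string, e.g. "1995-11-10"
  repeated Address addresses = 9;
}
```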
11. Apache Avro
http://avro.apache.org/docs/current/
Apache Avro™ is a compact, fast, binary data serialization system invented by the
makers of Hadoop
Avro relies on schemas: when data is read, the schema used when writing it is always present
Provides a container file format for storing persistent data
Works both with code generation and in a dynamic manner
Latest version: 1.8.2
15. Defining Schema - IDL
@namespace("com.trivadis.avro.person.v1")
protocol PersonIdl {
import idl "Address-v1.avdl";
enum TitleEnum {
Unknown, Mr, Ms, Mrs
}
record Person {
int id;
string firstName;
string lastName;
TitleEnum title;
union { null, string } emailAddress;
union { null, string } phoneNumber;
union { null, string } faxNumber;
date dateOfBirth;
array<com.trivadis.avro.address.v1.Address> addresses;
}
}
@namespace("com.trivadis.avro.address.v1")
protocol AddressIdl {
record Address {
int id;
string streetAndNr;
string zipAndCity;
}
}
Note: a JSON schema can be generated from an IDL schema using Avro Tools
Address-v1.avdl
Person-v1.avdl
https://avro.apache.org/docs/current/idl.html
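For illustration, the Address record above expressed as a JSON (.avsc) schema, roughly what Avro Tools would emit from the IDL (a sketch, not the actual generated output):

```json
{
  "type": "record",
  "name": "Address",
  "namespace": "com.trivadis.avro.address.v1",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "streetAndNr", "type": "string" },
    { "name": "zipAndCity", "type": "string" }
  ]
}
```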
16. Defining Schema - Specification
Protobuf:
• Multiple message types can be defined in a single .proto file
• Field numbers: each field in the message has a unique number
• used to identify the fields in the binary message format
• should not be changed once the message type is in use
• field numbers 1 – 15 take a single byte to encode, 16 – 2047 take two bytes
• Default values are type-specific
Avro:
• Schema can be represented either as JSON or by using the IDL
• Avro specifies two serialization encodings: binary and JSON
• Encoding is done in the order of the fields defined in the record
• The schema used to write the data always needs to be available when the data is read
• the schema can be serialized with the data, or
• the schema is made available through a registry
18. Defining Schema - Style Guides
Protobuf:
• Use CamelCase (with an initial capital) for message names
• Use underscore_separated_names for field names
• Use CamelCase (with an initial capital) for enum type names
• Use CAPITALS_WITH_UNDERSCORES for enum value names
• Use Java-style comments for documenting
Avro:
• Use CamelCase (with an initial capital) for record names
• Use camelCase for field names
• Use CamelCase (with an initial capital) for enum type names
• Use CAPITALS_WITH_UNDERSCORES for enum value names
• Use Java-style comments (IDL) or the doc property (JSON) for documenting
20. Using Protobuf and Avro from Java
If you are using Maven, add the following dependencies to your POM:
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.8.2</version>
</dependency>
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
<version>3.6.1</version>
</dependency>
21. With Code Generation – Generate the code
Protobuf: run the protocol buffer compiler (one compiler for all supported languages), which produces classes for the given language:
protoc -I=$SRC_DIR --java_out=$DST_DIR $SRC_DIR/person-v1.proto
Avro: run the specific tool for the given language
• For Java:
java -jar /path/to/avro-tools-1.8.2.jar compile schema Person-v1.avsc .
• For C++:
avrogencpp -i cpx.json -o cpx.hh -n c
• For C#:
Microsoft.Hadoop.Avro.Tools codegen /i:C:\SDK\src\Microsoft.Hadoop.Avro.Tools\SampleJSON\SampleJSONSchema.avsc /o:
22. With Code Generation – Using Maven
Protobuf: use the protobuf-maven-plugin for generating code at Maven build time
• Generates to target/generated-sources
• Scans all project dependencies for .proto files
• protoc has to be installed on the machine
Avro: use the avro-maven-plugin for generating code at Maven build time
• Generates to target/generated-sources
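A minimal avro-maven-plugin configuration might look like this (a sketch; the source and output paths are illustrative, not the demo project's actual POM):

```xml
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.8.2</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
        <outputDirectory>${project.basedir}/target/generated-sources/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```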
24. With Code Generation – Serializing
Avro:
FileOutputStream fos = new FileOutputStream(BIN_FILE_NAME_V1);
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<Person> writer = new SpecificDatumWriter<Person>(Person.getClassSchema());
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(person, encoder);
encoder.flush();
out.close();
byte[] serializedBytes = out.toByteArray();
fos.write(serializedBytes);

Protobuf:
FileOutputStream output = new FileOutputStream(BIN_FILE_NAME_V2);
person.writeTo(output);
25. With Code Generation – Deserializing
DatumReader<Person> datumReader = new
SpecificDatumReader<Person>(Person.class);
byte[] bytes = Files.readAllBytes(new File(BIN_FILE_NAME_V1).toPath());
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
Person person = datumReader.read(null, decoder);
System.out.println(person.getFirstName());
PersonWrapper.Person person =
PersonWrapper.Person.parseFrom(new
FileInputStream(BIN_FILE_NAME_V1));
System.out.println(person.getFirstName());
26. Encoding
Protobuf:
• Field numbers (tags) are used as keys
• Variable-length encoding for int32 and int64
• additionally zig-zag encoding for sint32 and sint64
• Data is serialized in the field order of the schema
Avro:
• Variable-length, zig-zag encoding for int and long; fixed length for float and double
Variable-length encoding: a method of serializing integers using one or more bytes
Zig-zag encoding: maps signed integers to unsigned ones so that small negative numbers are encoded more efficiently
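Both encodings can be sketched in a few lines of plain Java to make the byte savings concrete (a self-contained illustration of the technique, not the actual protobuf or Avro implementation):

```java
import java.io.ByteArrayOutputStream;

public class VarintDemo {

    // Variable-length encoding: 7 payload bits per byte,
    // high bit set on every byte except the last
    static byte[] encodeVarint(long v) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
        return out.toByteArray();
    }

    // Zig-zag encoding: maps -1 -> 1, 1 -> 2, -2 -> 3, ...
    // so small negative numbers also get short varints
    static long zigzag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    public static void main(String[] args) {
        System.out.println(encodeVarint(1).length);           // 1 byte
        System.out.println(encodeVarint(300).length);         // 2 bytes
        System.out.println(encodeVarint(zigzag(-1)).length);  // 1 byte
        // without zig-zag, -1 as an unsigned 64-bit varint needs 10 bytes
        System.out.println(encodeVarint(-1).length);          // 10 bytes
    }
}
```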
27. Without code generation
final String schemaLoc = "src/main/avro/Person-v1.avsc";
final File schemaFile = new File(schemaLoc);
final Schema schema = new Schema.Parser().parse(schemaFile);
GenericRecord person1 = new GenericData.Record(schema);
person1.put("id", 1);
person1.put("firstName", "Peter");
person1.put("lastName", "Muster");
person1.put("title", "Mr");
person1.put("emailAddress", "peter.muster@somecorp.com");
person1.put("phoneNumber", "+41 79 345 34 44");
person1.put("faxNumber", "+41 31 322 33 22");
person1.put("dateOfBirth", new LocalDate("1995-11-10"));
28. Serializing to Object Container File
The file has the schema embedded, and all objects stored in the file must conform to that schema. Objects are stored in blocks that may be compressed.

final DatumWriter<Person> datumWriter = new SpecificDatumWriter<>(Person.class);
final DataFileWriter<Person> dataFileWriter = new DataFileWriter<>(datumWriter);
// use snappy compression
dataFileWriter.setCodec(CodecFactory.snappyCodec());
// specify block size (must be set before create)
dataFileWriter.setSyncInterval(1000);
dataFileWriter.create(persons.get(0).getSchema(), new File(CONTAINER_FILE_NAME_V1));
persons.forEach(person -> {
    try {
        dataFileWriter.append(person);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
});
30. Schema Evolution
Person (1.0)
• id : integer
• firstName : text
• lastName : text
• title : enum(unknown,mr,mrs,ms)
• emailAddress : text
• phoneNumber : text
• faxNumber : text
• dateOfBirth : date
• addresses : array<Address>
Address (1.0)
• streetAndNr : text
• zipAndCity : text
Person (1.1)
• id : integer
• firstName : text
• middleName : text
• lastName : text
• title : enum(unknown,mr,mrs,ms)
• emailAddress : text
• phoneNumber : text
• faxNumber : text
• addresses : array<Address>
• birthDate : date
Address (1.0)
• streetAndNr : text
• zipAndCity : text
V1.0 to V1.1
• Adding middleName
• Renaming dateOfBirth to birthDate
• Moving birthDate after addresses
• Removing faxNumber
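In Avro, the additions and renames above can be made compatible through default values and field aliases in the reader schema; a sketch of the relevant v1.1 field declarations (illustrative, not the demo project's actual schema):

```json
{ "name": "middleName", "type": ["null", "string"], "default": null },
{ "name": "birthDate",
  "type": { "type": "int", "logicalType": "date" },
  "aliases": ["dateOfBirth"] }
```

The default lets a v1.1 reader consume v1.0 data that lacks middleName, and the alias lets it resolve the old dateOfBirth field name.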
38. Avro and Kafka – Producing Avro to Kafka
@Configuration
public class KafkaConfig {
private String bootstrapServers;
private String schemaRegistryURL;
@Bean
public Map<String, Object> producerConfigs() {
Map<String, Object> props = new HashMap<>();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
props.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryURL);
return props;
}
@Bean
public ProducerFactory<String, Person> producerFactory() { .. }
@Bean
public KafkaTemplate<String, Person> kafkaTemplate() {
return new KafkaTemplate<>(producerFactory());
}
@Component
public class CustomerEventProducer {
@Autowired
private KafkaTemplate<String, Person> kafkaTemplate;
@Value("${kafka.topic.person}")
String kafkaTopic;
public void produce(Person person) {
kafkaTemplate.send(kafkaTopic, person.getId().toString(), person);
}
}
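On the consuming side, the Confluent deserializer is configured analogously; a sketch of the relevant consumer properties (host names and group id are placeholders):

```properties
bootstrap.servers=broker-1:9092
group.id=person-consumer
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=io.confluent.kafka.serializers.KafkaAvroDeserializer
schema.registry.url=http://schema-registry:8081
# return generated SpecificRecord classes instead of GenericRecord
specific.avro.reader=true
```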
39. Avro and Big Data
Avro is widely supported by Big Data Frameworks: Hadoop MapReduce, Pig, Hive,
Sqoop, Apache Spark, …
The spark-avro library supports using Avro as a source for
Spark DataFrames: https://github.com/databricks/spark-avro
import com.databricks.spark.avro._
val personDF = spark.read.avro("person-v1.avro")
personDF.createOrReplaceTempView("Person")
val subPersonDF =
spark.sql("select * from Person where firstName like 'G%'")
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
40. Column-oriented: Apache Parquet and ORC
A logical table can be translated using
• Row-based layout (Avro, Protobuf,
JSON, …)
• Column-oriented layout (Parquet,
ORC, …)
Apache Parquet
• collaboration between Twitter and Cloudera
• Support in Hadoop, Hive, Spark, Apache
NiFi, StreamSets, Apache Pig, …
Apache ORC
• was created by Facebook and Hortonworks
• Support in Hadoop, Hive, Spark, Apache
NiFi, Apache Pig, Presto, …
Example table with columns A, B, C and rows (A1, B1, C1), (A2, B2, C2), (A3, B3, C3):
• Row-based layout: A1 B1 C1 A2 B2 C2 A3 B3 C3
• Column-oriented layout: A1 A2 A3 B1 B2 B3 C1 C2 C3
42. Protobuf and gRPC
https://grpc.io/
Google's high-performance, open-source, universal RPC framework,
layering on top of HTTP/2 and using protocol buffers to define messages
Support for Java, C#, C++, Python, Go, Ruby, Node.js, Objective-C, …
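To give an idea of how messages and services combine, a minimal gRPC service definition for the Person example might look like this (the service, method, and message names are invented for illustration):

```protobuf
syntax = "proto3";
package com.trivadis.protobuf.person.v1;

message PersonRequest {
  int32 id = 1;
}

message PersonReply {
  int32 id = 1;
  string first_name = 2;
  string last_name = 3;
}

service PersonService {
  // unary RPC: one request, one response
  rpc GetPerson (PersonRequest) returns (PersonReply);
}
```

The protobuf compiler with the gRPC plugin then generates both the message classes and client/server stubs for the supported languages.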
44. Serialization / Deserialization
[Diagram, repeated from slide 7: serialization and deserialization happen at every system boundary, from service/client REST APIs, the event hub (publish/subscribe), data flow/integration and stream analytics, to raw and refined storage in the data lake.]
45. Summary - Avro vs. Protobuf

                          Protobuf                Avro
Schema Evolution          Field Tag               Schema
Compatibility Support     great                   good
Code Generation           yes                     yes
Dynamic Support           no                      yes
Compactness of Encoding   good                    great
Persistence Support       no                      yes
Supports Compression      no                      yes
RPC Support               no (yes with gRPC)      yes
Big Data Support          no                      yes
Supported Languages       Java, C++, C#, Python,  Java, C++, C#, Python, Go, …
                          Objective-C, Go, …
46. Technology on its own won't help you.
You need to know how to use it properly.