The reality of software systems is that the business needs they are intended to serve will change over time. So applications must be built so that they can evolve and follow these changing needs.
Welcome to the world where we are building high-volume distributed streaming applications using systems like Apache Flink, Spark and Kafka: applications that are assumed to run 'forever' and never go down.
So what happens if a business need changes? How can you make a streaming application that can evolve without breaking all the downstream applications that depend on it? Roll out the new producers first? Roll out the new consumers first? How do you avoid downtime? But wait! Systems like Kafka persist records for weeks, so how do you handle the fact that several different schemas can be present in a Kafka topic at the same time? Can you deploy a new application that reads both formats?
In this presentation Niels Basjes (Avro PMC) will go into the ways bol.com has chosen to handle these effects in a practical way. He will describe how the "Message" format and the schema evolution features of Apache Avro are used in conjunction with Apache Flink to make applications truly 'evolvable'.
How do we make sure all applications are able to find the schema specifications, what can we do to ensure schemas stay 'evolvable,' and what were the pitfalls we ran into? Join us and find out.
6. > 18 million products for sale
~ 60 million in catalog
> 8.9 million active customers
> 55 million visits per month
> 6 billion pageviews/year
Season 2017: ~16 million presents
14. Future of services
• Many will do what Measuring 2.0 is doing today:
• Streaming interfaces
• Low latency
• Very large (extreme) volume
• Today: ~10,000 messages/sec
• Next year: >100,000 messages/sec
16. Streaming applications
[Diagram: a single data producer writes to a streaming interface that is read by many data consumers. The real payload is a “byte array”.]
17. Kafka producer API
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");

// Both key and value are plain Strings in this example.
Producer<String, String> producer = new KafkaProducer<>(props);
for (int i = 0; i < 100; i++) {
    producer.send(new ProducerRecord<String, String>("my-topic",
                                                     Integer.toString(i),
                                                     Integer.toString(i)));
}
producer.close();
https://kafka.apache.org/10/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html
… instruct how to turn the key and value objects … into bytes.
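To make that concrete, here is a minimal sketch of such a serializer (a hypothetical class, not from the talk); Kafka only asks you to turn an object into a byte[]:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.serialization.Serializer;

// Hypothetical record type, standing in for any structured value.
class Person {
    String name;
    String occupation;
}

public class PersonSerializer implements Serializer<Person> {
    // On Kafka 2.x+ configure() and close() have default implementations,
    // so only serialize() needs to be written.
    @Override
    public byte[] serialize(String topic, Person person) {
        // Naive: flatten the record into a ';' delimited string and encode it.
        // Doing this reliably (delimiters! encodings! versions!) is exactly
        // the problem the rest of this talk is about.
        return (person.name + ";" + person.occupation)
                .getBytes(StandardCharsets.UTF_8);
    }
}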
19. So we need something to
• serialize records into bytes
20. Records: never “just a string”
Person = {
  name = ‘Niels Basjes <"Hacker">’
  occupation = ‘IT-Architect’
  home = {
    city = ‘Amstelveen’
  }
  company = {
    name = ‘bol.com’
    city = ‘Utrecht’
  }
}
23. CSV?
Real production example: the Omniture datafeed:
• <tab> separated record (~635 columns)
• The product_list column is a , separated list of products.
• A product is a ; separated record of fields.
• One of those fields is a | separated list of = separated key=value entries.
• If the key is eVar8 then the value is a _ separated pair of product id and title.

;9200000010474211;;;;eVar27=not shown|eVar3=tools|eVar35=1-1|eVar39=2:BB:P|eVar47=PGT|eVar60=d:P|eVar72=10003747|eVar73=80001655|eVar8=9200000010474211_Product title|eVar9=seller 1019104 MaQui
25. The real mess we have:

;9200000010474211;;;;eVar27=not shown|eVar3=tools|eVar35=1-1|eVar39=2:BB:P|eVar47=PGT|eVar60=d:P|eVar72=10003747|eVar73=80001655|eVar8=9200000010474211_Heller borenset - 25-delig - 1|15|2|25|3|35|4|45|5|55|6|65|7|75|8|85|9|95|10|105|11|115|12|125|13 mm - niet voor intensief gebruik|eVar9=seller 1019104 MaQui

Somebody forgot the escaping!
26. Putting a string into a byte[]
• Did you assume US-ASCII?
• Or the MS-DOS 3.3 codepage 437?
• Or was it codepage 850?
• EBCDIC
• ASCII
• CP-1252
• ISO 8859-1 (Latin1)
• ISO 8859-5
• Unicode
• UTF-7
• UTF-8
• UTF-16
• UTF-32
• Big endian UCS-2
• Little endian UCS-2
• UCS-4

Bol.com standard: UTF-8
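A minimal sketch of why this matters: in Java, the no-argument String.getBytes() silently uses the JVM's platform default charset, which differs per machine.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) {
        String city = "Amstelveen – café";
        byte[] platformDefault = city.getBytes();  // depends on the JVM default charset
        byte[] explicitUtf8 = city.getBytes(StandardCharsets.UTF_8);
        // Prints 'false' on any machine whose default charset is not UTF-8.
        System.out.println(Arrays.equals(platformDefault, explicitUtf8));
    }
}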
29. So we need something to
• serialize records into bytes
• make serializing records easy and reliable
30. Data types
• String
• Integer
• Floating point
• Collection
  • List
  • Map
• Enumeration
31. So we need something to
• serialize records into bytes
• make serializing records easy and reliable
• support data types (and expose them in the API)
33. Defining a schema
• CSV
  • Too bad: there is no schema.
  • Manually write schema code
• Json
  • Too bad: there is no schema (JSON Schema, https://json-schema.org/, was still a draft in Q4 2018)
  • Manually write schema code
• XML
  • XSD
35. Defining a schema
• CSV
  • Too bad: there is no schema.
  • Manually write serde
• Json
  • Too bad: there is no schema
  • Manually write serde
• XML
  • XSD
• Protobuf
  • IDL
37. Defining a schema
• CSV
  • Too bad: there is no schema.
  • Manually write serde
• Json
  • Too bad: there is no schema
  • Manually write serde
• XML
  • XSD
• Protobuf
  • IDL
• Avro
  • Json
  • IDL
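As a minimal sketch (the record itself is an assumed example, not from the talk), this is what the JSON flavour of an Avro schema looks like when parsed with the Avro Java library:

import org.apache.avro.Schema;

public class PersonSchemaDemo {
    public static void main(String[] args) {
        // The Person record from earlier, written as an Avro JSON schema.
        Schema schema = new Schema.Parser().parse(
            "{ \"type\": \"record\", \"name\": \"Person\", \"fields\": [" +
            "  { \"name\": \"name\",       \"type\": \"string\" }," +
            "  { \"name\": \"occupation\", \"type\": \"string\" }" +
            "] }");
        System.out.println(schema.toString(true)); // pretty-printed schema
    }
}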
41. So we need something to
• serialize records into bytes
• make serializing records easy and reliable
• support data types (and expose them in the API)
• make defining and using records easy
42. Applications change!
• New business
• New insights
• New wishes
• New scope
• New …
The records will
• get new fields
• have obsolete fields
43. So we need something to
• serialize records into bytes
• make serializing records easy
• support data types (and expose them in the API)
• make defining and using records easy
• make defining and distributing new versions easy
44. Kafka persists messages
• A message is retained until its TTL expires.
• So a topic will contain several message versions!
• With different fields

[Diagram: one topic holding messages in versions V1, V2, V3 and V4 at the same time.]
45. Rolling upgrades
• During a producer upgrade
  • New data in multiple versions is created at the same time
• During a consumer upgrade
  • Multiple ‘expected’ versions in a single consumer
  • Multiple consumers, multiple versions
46. Creating a new version of a schema
• Assume a separate jar library with the compiled schema code.
• Scenario 1:
  • Producer gets upgraded to V2 and produces
  • Consumer (compiled against V1) reads a V2 message.
• Scenario 2:
  • Kafka with existing V1 and V2 records
  • New consumer (compiled against V2) reads from the start.
• Requirement:
  • V1 and V2 must be 2-way compatible.
47. So we need something to
• serialize records into bytes
• make serializing records easy
• support data types (and expose them in the API)
• make defining and using records easy
• make defining and distributing new versions easy
• make evolving to new schema versions easy
48. Evolving Protobuf
• Fields are tagged with a number
• Evolution is ‘number’ based.
• Schema evolution is ‘easy’ if you can work with that.
• Making it 2-way compatible is TOO HARD.
49. Evolving Avro
• Fields are tagged by NAME
• Evolution is ‘name’ based.
• You can add new fields anywhere
• Making it 2-way compatible is easy
51. Simple rules for evolving a schema
1. Field is mandatory and will never be removed
   • type field;
2. Field is optional and will never be removed
   • union { null, type } field;
3. Field is newly defined and/or can be changed/removed
   • type field = “default”;
4. Field is optional and newly defined and/or can be changed/removed
   • union { null, type } field = “default”;
   • union { null, type } field = null;
5. Enum
   • Avoid enums because these are number based
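A minimal sketch of rule 3 in action (field names are assumed examples): a record written with schema V1 is read with a newer schema V2 whose added field has a default, which is what keeps the two versions 2-way compatible.

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class EvolutionDemo {
    static final Schema V1 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Person\",\"fields\":[" +
        "{\"name\":\"name\",\"type\":\"string\"}]}");
    // V2 adds a field WITH a default, as rule 3 requires.
    static final Schema V2 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Person\",\"fields\":[" +
        "{\"name\":\"name\",\"type\":\"string\"}," +
        "{\"name\":\"city\",\"type\":\"string\",\"default\":\"unknown\"}]}");

    public static void main(String[] args) throws Exception {
        // Produce a record with the old (V1) schema.
        GenericRecord old = new GenericData.Record(V1);
        old.put("name", "Niels");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(V1).write(old, encoder);
        encoder.flush();

        // Consume it with the new (V2) schema: 'city' gets its default value.
        BinaryDecoder decoder =
            DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord upgraded =
            new GenericDatumReader<GenericRecord>(V1, V2).read(null, decoder);
        System.out.println(upgraded); // {"name": "Niels", "city": "unknown"}
    }
}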
52. Avro Message format
• Needs a schema registry
• The message carries a fingerprint of the (JSON) schema
• Only the 64-bit id of the schema is in the message
• So we need a schema database/registry
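As a minimal sketch (assumed record), Avro's Java library can compute that 64-bit id directly: it is the CRC-64-AVRO fingerprint of the schema's canonical form, which a registry can use as its lookup key.

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class FingerprintDemo {
    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Person\",\"fields\":[" +
            "{\"name\":\"name\",\"type\":\"string\"}]}");
        // CRC-64-AVRO fingerprint of the schema's parsing canonical form.
        long fingerprint = SchemaNormalization.parsingFingerprint64(schema);
        System.out.printf("Schema id: %016x%n", fingerprint);
    }
}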
59. In practice
• Avro schema evolution works great for streaming
• There is no NEED to upgrade all consumers
• Schema evolution is also used to limit the loaded fields
  • Avoids needless garbage collection
• Applicable to any ‘single record’ storage
  • HBase columns.
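A minimal sketch of that field-limiting trick (field names are assumed examples): hand the reader a schema that declares only the fields you need, and Avro's schema resolution skips the rest while decoding, so the unused values are never materialized.

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class ProjectionDemo {
    static final Schema FULL = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Person\",\"fields\":[" +
        "{\"name\":\"name\",\"type\":\"string\"}," +
        "{\"name\":\"city\",\"type\":\"string\"}]}");
    // The reader schema declares only 'name'; 'city' is skipped while decoding.
    static final Schema ONLY_NAME = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Person\",\"fields\":[" +
        "{\"name\":\"name\",\"type\":\"string\"}]}");

    public static void main(String[] args) throws Exception {
        GenericRecord full = new GenericData.Record(FULL);
        full.put("name", "Niels");
        full.put("city", "Amstelveen");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(FULL).write(full, encoder);
        encoder.flush();

        BinaryDecoder decoder =
            DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord slim =
            new GenericDatumReader<GenericRecord>(FULL, ONLY_NAME).read(null, decoder);
        System.out.println(slim); // {"name": "Niels"} – city was never loaded
    }
}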
To help developers and acceptance testers validate that the measurements have been done correctly, we intend to create a plugin or overlay so that, when in the bol.com office (or connected via VPN), you can see which measurements have actually been recorded for the page you are looking at right now (i.e. ONLY the page YOU are looking at).