The reality of software systems is that the business needs they are intended to serve will change over time. So applications must be built so that they can evolve and follow these changing needs.
Welcome to the world where we are building high-volume distributed streaming applications using systems like Apache Flink, Spark and Kafka: applications that are assumed to run 'forever' and never go down.
So what happens if a business need changes? How can you make a streaming application that can evolve without breaking all the downstream applications that depend on it? Roll out the new producers first? Roll out the new consumers first? How do you avoid downtime? But wait! Systems like Kafka persist records for weeks, so how do you handle the fact that several different schemas can be present in a Kafka topic at the same time? Can you deploy a new application that reads both formats?
In this presentation Niels Basjes (Avro PMC) will go into the ways bol.com has chosen to handle these effects in a practical way. He will describe how the "Message" format and the schema evolution features of Apache Avro are used in conjunction with Apache Flink to make applications truly 'evolvable'.
How do we make sure all applications are able to find the schema specifications, what can we do to ensure schemas stay 'evolvable,' and what were the pitfalls we ran into? Join us and find out.
6. > 18 million products for sale
~ 60 million in catalog
> 8.9 million active customers
> 55 million visits per month
> 6 billion pageviews/year
Season 2017: ~16 million presents
14. Future of services
• Many will do what Measuring 2.0 is doing today:
• Streaming interfaces
• Low latency
• Very large (extreme) volume
• Today: ~10,000 messages/sec
• Next year: >100,000 messages/sec
16. Streaming applications
[Diagram: a single data producer writes to a streaming interface that is read by many data consumers. The real payload is a “byte array”.]
17. Kafka producer API
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");

// Both key and value are plain Strings in this example.
Producer<String, String> producer = new KafkaProducer<>(props);
for (int i = 0; i < 100; i++) {
    producer.send(new ProducerRecord<String, String>("my-topic",
                                                     Integer.toString(i),
                                                     Integer.toString(i)));
}
producer.close();
https://kafka.apache.org/10/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html
… instruct how to turn the key and value objects … into bytes.
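To make that concrete, here is a minimal sketch of such a serializer (a hypothetical class, not from the talk); Kafka only asks you to turn an object into a byte[]:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.serialization.Serializer;

// Hypothetical record type, standing in for any structured value.
class Person {
    String name;
    String occupation;
}

public class PersonSerializer implements Serializer<Person> {
    // On Kafka 2.x+ configure() and close() have default implementations,
    // so only serialize() needs to be written.
    @Override
    public byte[] serialize(String topic, Person person) {
        // Naive: flatten the record into a ';' delimited string and encode it.
        // Doing this reliably (delimiters! encodings! versions!) is exactly
        // the problem the rest of this talk is about.
        return (person.name + ";" + person.occupation)
                .getBytes(StandardCharsets.UTF_8);
    }
}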
19. So we need something to
• serialize records into bytes
20. Records: never “just a string”
Person = {
  name = ‘Niels Basjes <"Hacker">’
  occupation = ‘IT-Architect’
  home = {
    city = ‘Amstelveen’
  }
  company = {
    name = ‘bol.com’
    city = ‘Utrecht’
  }
}
23. CSV?
Real production example: the Omniture datafeed:
• <tab> separated record (~635 columns)
• The product_list column is a , separated list of products.
• A product is a ; separated record of fields.
• One of those fields is a | separated list of = separated key=value entries.
• If the key is eVar8 then the value is a _ separated pair of product id and title.

;9200000010474211;;;;eVar27=not shown|eVar3=tools|eVar35=1-1|eVar39=2:BB:P|eVar47=PGT|eVar60=d:P|eVar72=10003747|eVar73=80001655|eVar8=9200000010474211_Product title|eVar9=seller 1019104 MaQui
25. The real mess we have:

;9200000010474211;;;;eVar27=not shown|eVar3=tools|eVar35=1-1|eVar39=2:BB:P|eVar47=PGT|eVar60=d:P|eVar72=10003747|eVar73=80001655|eVar8=9200000010474211_Heller borenset - 25-delig - 1|15|2|25|3|35|4|45|5|55|6|65|7|75|8|85|9|95|10|105|11|115|12|125|13 mm - niet voor intensief gebruik|eVar9=seller 1019104 MaQui

Somebody forgot the escaping!
26. Putting a string into a byte[]
• Did you assume US-ASCII?
• Or the MS-DOS 3.3 codepage 437?
• Or was it codepage 850?
• EBCDIC
• ASCII
• CP-1252
• ISO 8859-1 (Latin1)
• ISO 8859-5
• Unicode
• UTF-7
• UTF-8
• UTF-16
• UTF-32
• Big endian UCS-2
• Little endian UCS-2
• UCS-4

Bol.com standard: UTF-8
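A minimal sketch of why this matters: in Java, the no-argument String.getBytes() silently uses the JVM's platform default charset, which differs per machine.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) {
        String city = "Amstelveen – café";
        byte[] platformDefault = city.getBytes();  // depends on the JVM default charset
        byte[] explicitUtf8 = city.getBytes(StandardCharsets.UTF_8);
        // Prints 'false' on any machine whose default charset is not UTF-8.
        System.out.println(Arrays.equals(platformDefault, explicitUtf8));
    }
}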
29. So we need something to
• serialize records into bytes
• make serializing records easy and reliable
30. Data types
• String
• Integer
• Floating point
• Collection
  • List
  • Map
• Enumeration
31. So we need something to
• serialize records into bytes
• make serializing records easy and reliable
• support data types (and expose them in the API)
33. Defining a schema
• CSV
  • Too bad: there is no schema.
  • Manually write schema code
• Json
  • Too bad: there is no schema (JSON Schema, https://json-schema.org/, was still a draft in Q4 2018)
  • Manually write schema code
• XML
  • XSD
35. Defining a schema
• CSV
  • Too bad: there is no schema.
  • Manually write serde
• Json
  • Too bad: there is no schema
  • Manually write serde
• XML
  • XSD
• Protobuf
  • IDL
37. Defining a schema
• CSV
  • Too bad: there is no schema.
  • Manually write serde
• Json
  • Too bad: there is no schema
  • Manually write serde
• XML
  • XSD
• Protobuf
  • IDL
• Avro
  • Json
  • IDL
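As a minimal sketch (the record itself is an assumed example, not from the talk), this is what the JSON flavour of an Avro schema looks like when parsed with the Avro Java library:

import org.apache.avro.Schema;

public class PersonSchemaDemo {
    public static void main(String[] args) {
        // The Person record from earlier, written as an Avro JSON schema.
        Schema schema = new Schema.Parser().parse(
            "{ \"type\": \"record\", \"name\": \"Person\", \"fields\": [" +
            "  { \"name\": \"name\",       \"type\": \"string\" }," +
            "  { \"name\": \"occupation\", \"type\": \"string\" }" +
            "] }");
        System.out.println(schema.toString(true)); // pretty-printed schema
    }
}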
41. So we need something to
• serialize records into bytes
• make serializing records easy and reliable
• support data types (and expose them in the API)
• make defining and using records easy
42. Applications change!
• New business
• New insights
• New wishes
• New scope
• New …
The records will
• get new fields
• have obsolete fields
43. So we need something to
• serialize records into bytes
• make serializing records easy
• support data types (and expose them in the API)
• make defining and using records easy
• make defining and distributing new versions easy
44. Kafka persists messages
• A message is retained until its TTL expires.
• So a topic will contain several message versions!
• With different fields

[Diagram: one topic holding messages in versions V1, V2, V3 and V4 at the same time.]
45. Rolling upgrades
• During a producer upgrade
  • New data in multiple versions is created at the same time
• During a consumer upgrade
  • Multiple ‘expected’ versions in a single consumer
  • Multiple consumers, multiple versions
46. Creating a new version of a schema
• Assume a separate jar library with the compiled schema code.
• Scenario 1:
  • Producer gets upgraded to V2 and produces
  • Consumer (compiled against V1) reads a V2 message.
• Scenario 2:
  • Kafka with existing V1 and V2 records
  • New consumer (compiled against V2) reads from the start.
• Requirement:
  • V1 and V2 must be 2-way compatible.
47. So we need something to
• serialize records into bytes
• make serializing records easy
• support data types (and expose them in the API)
• make defining and using records easy
• make defining and distributing new versions easy
• make evolving to new schema versions easy
48. Evolving Protobuf
• Fields are tagged with a number
• Evolution is ‘number’ based.
• Schema evolution is ‘easy’ if you can work with that.
• Making it 2-way compatible is TOO HARD.
49. Evolving Avro
• Fields are tagged by NAME
• Evolution is ‘name’ based.
• You can add new fields anywhere
• Making it 2-way compatible is easy
51. Simple rules for evolving a schema
1. Field is mandatory and will never be removed
   • type field;
2. Field is optional and will never be removed
   • union { null, type } field;
3. Field is newly defined and/or can be changed/removed
   • type field = “default”;
4. Field is optional and newly defined and/or can be changed/removed
   • union { null, type } field = “default”;
   • union { null, type } field = null;
5. Enum
   • Avoid enums because these are number based
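A minimal sketch of rule 3 in action (field names are assumed examples): a record written with schema V1 is read with a newer schema V2 whose added field has a default, which is what keeps the two versions 2-way compatible.

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class EvolutionDemo {
    static final Schema V1 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Person\",\"fields\":[" +
        "{\"name\":\"name\",\"type\":\"string\"}]}");
    // V2 adds a field WITH a default, as rule 3 requires.
    static final Schema V2 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Person\",\"fields\":[" +
        "{\"name\":\"name\",\"type\":\"string\"}," +
        "{\"name\":\"city\",\"type\":\"string\",\"default\":\"unknown\"}]}");

    public static void main(String[] args) throws Exception {
        // Produce a record with the old (V1) schema.
        GenericRecord old = new GenericData.Record(V1);
        old.put("name", "Niels");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(V1).write(old, encoder);
        encoder.flush();

        // Consume it with the new (V2) schema: 'city' gets its default value.
        BinaryDecoder decoder =
            DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord upgraded =
            new GenericDatumReader<GenericRecord>(V1, V2).read(null, decoder);
        System.out.println(upgraded); // {"name": "Niels", "city": "unknown"}
    }
}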
52. Avro Message format
• Needs a schema registry
• The message carries a fingerprint of the (JSON) schema
• Only the 64-bit id of the schema is in the message
• So we need a schema database/registry
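As a minimal sketch (assumed record), Avro's Java library can compute that 64-bit id directly: it is the CRC-64-AVRO fingerprint of the schema's canonical form, which a registry can use as its lookup key.

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class FingerprintDemo {
    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Person\",\"fields\":[" +
            "{\"name\":\"name\",\"type\":\"string\"}]}");
        // CRC-64-AVRO fingerprint of the schema's parsing canonical form.
        long fingerprint = SchemaNormalization.parsingFingerprint64(schema);
        System.out.printf("Schema id: %016x%n", fingerprint);
    }
}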
59. In practice
• Avro schema evolution works great for streaming
• There is no NEED to upgrade all consumers
• Schema evolution is also used to limit the loaded fields
  • Avoids needless garbage collection
• Applicable to any ‘single record’ storage
  • HBase columns.
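A minimal sketch of that field-limiting trick (field names are assumed examples): hand the reader a schema that declares only the fields you need, and Avro's schema resolution skips the rest while decoding, so the unused values are never materialized.

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class ProjectionDemo {
    static final Schema FULL = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Person\",\"fields\":[" +
        "{\"name\":\"name\",\"type\":\"string\"}," +
        "{\"name\":\"city\",\"type\":\"string\"}]}");
    // The reader schema declares only 'name'; 'city' is skipped while decoding.
    static final Schema ONLY_NAME = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Person\",\"fields\":[" +
        "{\"name\":\"name\",\"type\":\"string\"}]}");

    public static void main(String[] args) throws Exception {
        GenericRecord full = new GenericData.Record(FULL);
        full.put("name", "Niels");
        full.put("city", "Amstelveen");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(FULL).write(full, encoder);
        encoder.flush();

        BinaryDecoder decoder =
            DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord slim =
            new GenericDatumReader<GenericRecord>(FULL, ONLY_NAME).read(null, decoder);
        System.out.println(slim); // {"name": "Niels"} – city was never loaded
    }
}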
To help developers and acceptance testers validate that the measurements have been done correctly, we intend to create a plugin or overlay so that, when in the bol.com office (or connected via VPN), you can see which measurements have actually been recorded for the page you are looking at right now (i.e. ONLY the page YOU are looking at).