Amihay Zer-Kavod discusses how LivePerson uses Apache Avro to maintain consistent data across services. Avro provides a unified event schema and tools for serialization, enabling events to be sent between services and stored in Hadoop. LivePerson's use of an event-driven system with a common Avro schema allows over 320,000 events per second to be processed and over 2TB of data to be stored daily.
Apache Avro in LivePerson [Hebrew]
1. Apache Avro in LivePerson
Collecting and saving data is easy; keeping it consistent is tough
Sandwich club, Sep 2014
Amihay Zer-Kavod, Software Architect
2. Who am I?
Amihay Zer-Kavod
Software Architect
Been in software since 1989
4. Communication & Meaning
● Consistent but decoupled communication
between services, such as:
o Monitoring, Interaction
o Predictive, Sentiment
o RT Reporting & Analysis
o Visitor History
event: evento (Spanish), 事件 (Chinese), घटना (Hindi), حدث (Arabic), ארוע (Hebrew), событие (Russian)
● Consistent meaning over time
o BigData Store (Hadoop)
o Offline Reporting & Analysis
5. What shouldn’t we use?
Don’t use direct APIs!
They are the wrong tool for this purpose:
• They produce too much coupling between services
• APIs are synchronous by nature
• They add irrelevant complexity to the called service
6. What is needed?
The Message is the API!
● A unified event model (schema) for all reported events
● Management tools for the unified schema
● Tools for sending events over the wire
● Tools for reading/writing events in big data storage
● Backward and forward compatibility
7. The Event model
From generic to specific structure, with:
• Common header - data common to all events
• Logical entities - a common header shared by all logical entities
(such as Visitor)
• Dynamic specific headers
• Specific event body
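The layered model above can be sketched with plain Python dataclasses. This is purely illustrative; the class and field names here are hypothetical, not LivePerson's actual types:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CommonHeader:
    """Data common to all events."""
    event_id: str = "Unknown"
    time: int = -1

@dataclass
class Participant:
    """A logical entity taking part in the event (Visitor, Agent, ...)."""
    kind: str = "Visitor"
    entity_id: str = "Unknown"

@dataclass
class LPEvent:
    """Generic envelope: common header, entities, then the specific parts."""
    header: CommonHeader = field(default_factory=CommonHeader)
    participants: List[Participant] = field(default_factory=list)
    specific_header: Optional[dict] = None   # dynamic, event-type specific
    body: Optional[dict] = None              # specific event body
```

Consumers that only understand the common layers can still route and store every event; only consumers that care about a given event type need to look at the specific header and body.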
8. Apache Avro to the rescue
● Avro - a schema-based serialization/deserialization framework
● Avro IDL - schema definition language
● Avro file - Hadoop integration
● Avro schema resolution
● Apache Avro was created by Doug Cutting
11. Avro 101 - Avro IDL Schema
@namespace("com.liveperson.example")
enum Color { NO_COLOR, BLUE, BLACK, WHITE, PINK }

/** Example event */
@namespace("com.liveperson.example")
record Event {
  string id = "Unknown";
  long time = -1;
  Color color = "NO_COLOR";
}
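For reference, the IDL record above corresponds to roughly the following Avro JSON schema (.avsc), which is what actually travels with the data; this is a hand-written sketch of the compiler's output, not taken from the talk:

```json
{
  "type": "record",
  "name": "Event",
  "namespace": "com.liveperson.example",
  "doc": "Example event",
  "fields": [
    {"name": "id", "type": "string", "default": "Unknown"},
    {"name": "time", "type": "long", "default": -1},
    {"name": "color",
     "type": {"type": "enum", "name": "Color",
              "symbols": ["NO_COLOR", "BLUE", "BLACK", "WHITE", "PINK"]},
     "default": "NO_COLOR"}
  ]
}
```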
12. Avro 101 - Serialization
● JSON Serialization
● Binary serialization
○ int, long - variable length, Zig-zag encoding
○ float, double - 4,8 bytes respectively
○ string - long followed by UTF-8 bytes
○ map, array - unlimited size, use blocks
○ Unions - a long index selecting the type, followed by its value
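The integer and string rules above are simple enough to sketch in a few lines of Python. This illustrates the encoding described in the Avro specification (zig-zag plus variable-length base-128 for longs, length-prefixed UTF-8 for strings); it is not LivePerson's code:

```python
def zigzag(n: int) -> int:
    """Zig-zag map signed to unsigned: 0->0, -1->1, 1->2, -2->3, ..."""
    return (n << 1) ^ (n >> 63)

def encode_long(n: int) -> bytes:
    """Avro binary long: zig-zag value, 7 bits per byte, high bit = 'more'."""
    z = zigzag(n)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)   # continuation bit set
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s: str) -> bytes:
    """Avro string: byte length encoded as a long, then UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data
```

Small signed values stay small on the wire: -1 encodes to a single byte 0x01, and 64 is the first value that needs two bytes.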
14. Avro 101 - Schema Resolution
● The writer's schema must always be provided for decoding
● The reader can use its own schema
● Allows the reader and writer schemas to evolve
independently
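A toy illustration of the resolution idea, assuming the binary data has already been decoded into a dict. This is a simplification, not Avro's actual implementation: fields the writer did not produce fall back to the reader's defaults, and fields the reader's schema does not mention are dropped.

```python
def resolve(record: dict, reader_defaults: dict) -> dict:
    """Project a decoded writer record onto the reader's schema.

    reader_defaults maps each reader field name to its default value.
    - field in both schemas: take the writer's value
    - field only in the reader schema: use the reader's default
    - field only in the writer's data: ignored by this reader
    """
    return {name: record.get(name, default)
            for name, default in reader_defaults.items()}

# A writer on an old schema produced this record:
written = {"id": "e1", "time": 123, "legacy": True}

# A reader on a newer schema that dropped "legacy" and added "color":
reader_defaults = {"id": "Unknown", "time": -1, "color": "NO_COLOR"}

view = resolve(written, reader_defaults)
```

Here `view` is `{"id": "e1", "time": 123, "color": "NO_COLOR"}`: the reader sees its own shape of the data even though the writer never knew about `color`.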
15. Avro vs...

                     Protobuf               Thrift                Avro
Created              2001 (open-sourced     2007                  2009
                     2008)
Creator/Maintainer   Google / Google        Facebook / Apache     Doug Cutting / Apache
Schema evolution     Field tag              Field tag             Schema
Static/Dynamic       Yes/No                 Yes/No                Yes/Yes
Hadoop support       No                     No                    Yes
RPC                  No                     Yes                   Yes
Used by              Google                 Facebook, Cassandra   Hadoop, LivePerson
Lang support         Good                   Great                 Good
16. Backward & Forward Compatibility
Avro schema evolution
● Avro supports resolution between two schemas
● Need to follow a set of rules:
o Every field must have a default value
o A field can be added (make sure to give it a default value)
o Field types cannot be changed (add a new field instead)
o Enum symbols can be added but never removed
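The field-level rules above can be turned into a simple mechanical check. This is a hedged sketch of such a checker, assuming each record schema is given as `{field_name: (type, default)}`, with `None` standing for "no default"; real Avro compatibility checking (e.g. in a schema registry) covers many more cases:

```python
def compatible(old_fields: dict, new_fields: dict) -> bool:
    """Apply the evolution rules to two record schemas:
    - every field in the new schema must carry a default value
    - new fields may be added (the default covers old data)
    - an existing field's type must not change
    """
    for name, (ftype, default) in new_fields.items():
        if default is None:
            return False          # every field needs a default
        if name in old_fields and old_fields[name][0] != ftype:
            return False          # a type change breaks resolution
    return True

old = {"id": ("string", "Unknown"), "time": ("long", -1)}

# Adding a field with a default is fine:
added = dict(old, color=("Color", "NO_COLOR"))

# Changing an existing field's type is not:
retyped = {"id": ("long", -1), "time": ("long", -1)}
```

With these inputs, `compatible(old, added)` holds while `compatible(old, retyped)` does not, matching the rules on the slide.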
17. Avro IDL - LivePerson Event
/** Base for all LivePerson Events */
@namespace("com.liveperson.global")
record LPEvent {
  /** Common header of the event */
  CommonHeader header = null;
  /** Logical entities participating in this event - Visitor, Agent, etc. */
  array<Participant> participants = null;
  /** Platform-specific info such as node name (machine), cluster id, etc. */
  PlatformHeader platformSpecificHeader = null;
  /** Auditing header, optional - adds data for auditing the event's flow through the platform */
  union { null, AuditingHeader } auditingHeader = null;
  /** The event body */
  EventBody eventBody = null;
}
19. How well does it work?
● Cyber Monday 2013 (one day):
o More than 320,000 events per second
o 7 Storm topologies consuming the events within seconds of real time
o 2TB of data saved to Hadoop
● 2014 preparation:
o Double the events per second, to ~640,000
20. So how did we do it?
1. Use an event-driven system, don’t use direct APIs
2. Create a unified schema for all events
3. Use Avro to implement the schema
4. Add some supporting infrastructure