mypipe: Buffering and consuming MySQL changes via Kafka
with -=[ Scala - Avro - Akka ]=-
Hisham Mardam-Bey
Github: mardambey
Twitter: @codewarrior
Overview
● Who is this guy? + Quick Mate1 Intro
● Quick Tech Intro
● Motivation and History
● Features
● Design and Architecture
● Practical applications and usages
● System diagram
● Future work
● Q&A
Who is this guy?
● Linux and OpenBSD user and developer
since 1996
● Started out with C followed by Ruby
● Working with the JVM since 2007
● “Lately” building and running distributed
systems, and doing Scala
Mate1: quick intro
● Online dating, since 2003, based in Montreal
● Initially a team of 3, around 30 now
● Engineering team has 12 geeks / geekettes
○ Always looking for talent!
● We own and run our own hardware
○ fun!
○ mostly…
https://github.com/mate1
Super Quick Tech Intro
● MySQL: relational database
● Avro: data serialization system
● Kafka: publish-subscribe messaging
rethought as a distributed commit log
● Akka: toolkit and runtime simplifying the
construction of concurrent and distributed
applications
● Actors: universal primitives of concurrent
computation using message passing
● Schema repo / registry: holds versioned
Avro schemas
Motivation
● Initially, wanted:
○ MySQL triggers outside the DB
○ MySQL fan-in or fan-out replication (data cubes)
○ MySQL to “Hadoop”
● And then:
○ Cache or data store consistency with DB
○ Direct integration with big-data systems
○ Data schema evolution support
○ Turning MySQL inside out
■ Bootstrapping downstream data systems
History
● 2010: Custom Perl scripts to parse binlogs
● 2011/2012: Guzzler
○ Written in Scala, uses mysqlbinlog command
○ Simple to start with, difficult to maintain and control
● 2014: Enter mypipe!
○ Initial prototyping begins
Feature Overview (1/2)
● Emulates MySQL slave via binary log
○ Writes MySQL events to Kafka
● Uses Avro to serialize and deserialize data
○ Generically via a common schema for all tables
○ Specifically via per-table schema
● Modular by design
○ State saving / loading (files, MySQL, ZK, etc.)
○ Error handling
○ Event filtering
○ Connection sources
Feature Overview (2/2)
● Transaction and ALTER TABLE support
○ Includes transaction information within events
○ Refreshes schema as needed
● Can publish to any downstream system
○ Currently, we have Kafka
○ Initially, we started with Cassandra for the prototype
● Can bootstrap a MySQL table into Kafka
○ Transforms entire table into Kafka events
○ Useful with Kafka log compaction
● Configurable
○ Kafka topic names
○ whitelist / blacklist support
● Console consumer, Dockerized dev env
Project Structure
● mypipe-api: API for MySQL binlogs
● mypipe-avro: binary protocol, mutation
serialization and deserialization
● mypipe-producers: push data downstream
● mypipe-kafka: Serializer & Decoder
implementations
● mypipe-runner: pipes and console tools
● mypipe-snapshotter: import MySQL tables
(beta)
MySQL Binary Logging
● Foundation of MySQL replication
● Statement or Row based
● Represents a journal / change log of data
● Allows applications to spy / tune in on
MySQL changes
MySQLBinaryLogConsumer
● Uses behavior from abstract class
● Modular design; in this case, uses config-based
implementations
● Uses HOCON for ease of use and availability
case class MySQLBinaryLogConsumer(config: Config)
  extends AbstractMySQLBinaryLogConsumer
  with ConfigBasedConnectionSource
  with ConfigBasedErrorHandlingBehaviour
  with ConfigBasedEventSkippingBehaviour
  with CacheableTableMapBehaviour
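The mixin composition above can be sketched in isolation. This is a minimal illustration, not mypipe's code: `Settings`, `SettingsBasedConnectionSource`, `SettingsBasedEventSkipping`, `AbstractConsumer` and `SettingsBasedConsumer` are hypothetical stand-ins, and a plain case class replaces the HOCON `Config` so the example has no dependencies.

```scala
// Hypothetical, simplified stand-ins for the behaviour traits named on the slide.
trait ConnectionSource { def hostPort: (String, Int) }
trait EventSkippingBehaviour { def skipTable(db: String, table: String): Boolean }

// Settings stands in for the HOCON Config object (assumption for this sketch).
final case class Settings(source: String, blacklist: Set[String])

trait SettingsBasedConnectionSource extends ConnectionSource {
  def settings: Settings
  // Parses "host:port:user:pass" and keeps only host and port.
  def hostPort: (String, Int) = {
    val parts = settings.source.split(":")
    (parts(0), parts(1).toInt)
  }
}

trait SettingsBasedEventSkipping extends EventSkippingBehaviour {
  def settings: Settings
  // Skips any table listed as "db.table" in the blacklist.
  def skipTable(db: String, table: String): Boolean =
    settings.blacklist.contains(s"$db.$table")
}

// The abstract class only relies on the behaviour interfaces.
abstract class AbstractConsumer extends ConnectionSource with EventSkippingBehaviour {
  def describe(db: String, table: String): String = {
    val (host, port) = hostPort
    val action = if (skipTable(db, table)) "skip" else "process"
    s"$host:$port $action $db.$table"
  }
}

// Concrete consumer: behaviours are chosen by mixing in traits.
final case class SettingsBasedConsumer(settings: Settings)
  extends AbstractConsumer
  with SettingsBasedConnectionSource
  with SettingsBasedEventSkipping
```

Swapping one behaviour, for example a connection source backed by service discovery, would only require mixing in a different trait; the abstract class stays unchanged.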
AbstractMySQLBinaryLogConsumer
● Maintains connection to MySQL
● Primarily handles
○ TABLE_MAP
○ QUERY (BEGIN, COMMIT, ROLLBACK, ALTER)
○ XID
○ Mutations (INSERT, UPDATE, DELETE)
● Provides an enriched binary log API
○ Looks up table metadata and includes it
○ Scala friendly case class and option-driven(*) API for
speaking MySQL binlogs
(*) constant work in progress (=
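As a rough illustration of the QUERY handling listed above, the sketch below classifies the SQL text carried by a QUERY event. The `QueryKind` hierarchy and `classify` are hypothetical names invented for this example, and the regular expression only copes with simple, unquoted or singly backquoted table names.

```scala
object QueryEventSketch {
  sealed trait QueryKind
  case object Begin extends QueryKind
  case object Commit extends QueryKind
  case object Rollback extends QueryKind
  final case class AlterTable(table: String) extends QueryKind
  case object Other extends QueryKind

  // Case-insensitive match on "ALTER TABLE <name> ..."; captures the table name.
  private val Alter = """(?is)^\s*ALTER\s+TABLE\s+`?([\w.]+)`?.*""".r

  // Maps the SQL text of a QUERY event to the kinds the consumer cares about.
  def classify(sql: String): QueryKind = sql.trim.toUpperCase match {
    case "BEGIN"    => Begin
    case "COMMIT"   => Commit
    case "ROLLBACK" => Rollback
    case _ =>
      sql match {
        case Alter(table) => AlterTable(table)
        case _            => Other
      }
  }
}
```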
TABLE_MAP and table metadata
● Provides table metadata
○ Precedes mutation events
○ But no column names!
● MySQLMetadataManager
○ One actor per database
○ Uses “information_schema”
○ Determines column metadata and primary key
● TableCache
○ Wraps metadata actor providing a cache
○ Refreshes tables “when needed”
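One way to read "refreshes tables when needed" is sketched below. It is an assumption-laden simplification: `TableInfo`, `TableCache`, `onTableMap` and `invalidate` are hypothetical names, the `lookup` function stands in for the information_schema query, and no actors are involved.

```scala
import scala.collection.mutable

object TableCacheSketch {
  final case class TableInfo(id: Long, db: String, name: String, columns: List[String])

  // `lookup` stands in for the information_schema query done by the metadata actor.
  final class TableCache(lookup: (String, String) => List[String]) {
    private val byId = mutable.Map.empty[Long, TableInfo]
    var lookups = 0 // counts metadata fetches, for illustration only

    // Called on each TABLE_MAP event; fetches column names only on a cache miss
    // or when the column count reported by the binlog no longer matches.
    def onTableMap(id: Long, db: String, name: String, columnCount: Int): TableInfo =
      byId.get(id) match {
        case Some(t) if t.db == db && t.name == name && t.columns.size == columnCount => t
        case _ =>
          lookups += 1
          val t = TableInfo(id, db, name, lookup(db, name))
          byId.update(id, t)
          t
      }

    // Called after an ALTER TABLE so the next TABLE_MAP triggers a fresh lookup.
    def invalidate(db: String, name: String): Unit = {
      val stale = byId.collect { case (id, t) if t.db == db && t.name == name => id }.toList
      stale.foreach(id => byId.remove(id))
    }
  }
}
```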
Mutations
case class ColumnMetadata(name: String, colType: ColumnType.EnumVal, isPrimaryKey: Boolean)
case class PrimaryKey(columns: List[ColumnMetadata])
case class Column(metadata: ColumnMetadata, value: java.io.Serializable)
case class Table(id: Long, name: String, db: String, columns: List[ColumnMetadata], primaryKey: Option[PrimaryKey])
case class Row(table: Table, columns: Map[String, Column])
case class InsertMutation(timestamp: Long, table: Table, rows: List[Row], txid: UUID)
case class UpdateMutation(timestamp: Long, table: Table, rows: List[(Row, Row)], txid: UUID)
case class DeleteMutation(timestamp: Long, table: Table, rows: List[Row], txid: UUID)
● Fully enriched with table metadata
● Contain column types, data and txid
● Mutations can be serialized to and
deserialized from Avro
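A downstream consumer would typically pattern match on these mutations. The sketch below uses cut-down versions of the case classes (column types and primary keys are omitted, and values are plain `Any`), so `Mutation`, `changedColumns` and `describe` are illustrative assumptions rather than mypipe's API.

```scala
import java.util.UUID

object MutationSketch {
  // Simplified versions of the slide's case classes.
  final case class Table(id: Long, name: String, db: String)
  final case class Row(table: Table, columns: Map[String, Any])

  sealed trait Mutation { def table: Table; def txid: UUID }
  final case class InsertMutation(timestamp: Long, table: Table, rows: List[Row], txid: UUID) extends Mutation
  final case class UpdateMutation(timestamp: Long, table: Table, rows: List[(Row, Row)], txid: UUID) extends Mutation
  final case class DeleteMutation(timestamp: Long, table: Table, rows: List[Row], txid: UUID) extends Mutation

  // Names of the columns whose value differs between the old and new row image.
  def changedColumns(old: Row, updated: Row): Set[String] =
    updated.columns.collect { case (name, v) if !old.columns.get(name).contains(v) => name }.toSet

  // Produces a one-line summary; updates carry (old, new) row pairs.
  def describe(m: Mutation): String = m match {
    case InsertMutation(_, t, rows, _) => s"insert ${rows.size} row(s) into ${t.db}.${t.name}"
    case DeleteMutation(_, t, rows, _) => s"delete ${rows.size} row(s) from ${t.db}.${t.name}"
    case UpdateMutation(_, t, rows, _) =>
      val cols = rows.flatMap { case (o, n) => changedColumns(o, n) }.distinct.sorted
      s"update ${t.db}.${t.name} columns ${cols.mkString(",")}"
  }
}
```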
Kafka Producers
● Two modes of operation:
○ Generic Avro beans
○ Specific Avro beans
● Producers decoupled from SerDe
○ Recently started supporting Kafka serializers and
decoders
○ Currently we only support: http://schemarepo.org/
○ Very soon we can integrate with systems such as
Confluent Platform’s schema registry.
Kafka Message Format
-------------------
| MAGIC | 1 byte  |
|-----------------|
| MTYPE | 1 byte  |
|-----------------|
| SCMID | N bytes |
|-----------------|
| DATA  | N bytes |
-------------------
● MAGIC: magic byte, for protocol version
● MTYPE: mutation type, a single byte
○ indicating insert (0x1), update (0x2), or delete (0x3)
● SCMID: Avro schema ID, N bytes
● DATA: the actual mutation data as N bytes
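The framing above can be sketched with `java.nio.ByteBuffer`. Two details are assumptions made only for this example: the slide gives the schema ID as "N bytes", so a 2-byte big-endian ID is used here, and the magic byte value 0x0 is a placeholder. `encode` and `decode` are hypothetical helper names.

```scala
import java.nio.ByteBuffer

object MessageFormatSketch {
  val Magic: Byte = 0x0 // assumed protocol version
  val Insert: Byte = 0x1
  val Update: Byte = 0x2
  val Delete: Byte = 0x3

  // Layout: MAGIC (1 byte) | MTYPE (1 byte) | SCMID (assumed 2 bytes, big-endian) | DATA
  def encode(mtype: Byte, schemaId: Short, data: Array[Byte]): Array[Byte] = {
    val buf = ByteBuffer.allocate(1 + 1 + 2 + data.length)
    buf.put(Magic).put(mtype).putShort(schemaId).put(data)
    buf.array()
  }

  // Returns (mutation type, schema ID, payload), or None if the header is malformed.
  def decode(bytes: Array[Byte]): Option[(Byte, Short, Array[Byte])] =
    if (bytes.length < 4 || bytes(0) != Magic) None
    else {
      val buf = ByteBuffer.wrap(bytes)
      buf.get() // skip the magic byte, already checked above
      val mtype = buf.get()
      val schemaId = buf.getShort()
      val data = new Array[Byte](buf.remaining())
      buf.get(data)
      if (mtype >= Insert && mtype <= Delete) Some((mtype, schemaId, data)) else None
    }
}
```

A consumer would use the decoded schema ID to fetch the writer schema from the schema repository before decoding the Avro payload.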
Generic Message Format
● 3 Avro beans
○ InsertMutation, DeleteMutation, UpdateMutation
○ Hold data for new and old columns (for updates)
○ Group data by type into Avro maps
{
  "name": "old_integers",
  "type": {"type": "map", "values": "int"}
},
{
  "name": "new_integers",
  "type": {"type": "map", "values": "int"}
},
{
  "name": "old_strings",
  "type": {"type": "map", "values": "string"}
},
{
  "name": "new_strings",
  "type": {"type": "map", "values": "string"}
} ...
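Grouping a row's columns into per-type maps, as the generic schema does, might look like the sketch below. `TypedColumns` and `groupByType` are hypothetical names, and only the two types shown on the slide (integers and strings) are handled; other column types are silently dropped here and would need their own maps.

```scala
object GenericFormatSketch {
  // Mirrors fields such as "new_integers" and "new_strings" in the generic schema.
  final case class TypedColumns(integers: Map[String, Int], strings: Map[String, String])

  // Splits column name -> value pairs by runtime type.
  def groupByType(columns: Map[String, Any]): TypedColumns =
    TypedColumns(
      integers = columns.collect { case (k, v: Int) => k -> v },
      strings = columns.collect { case (k, v: String) => k -> v }
    )
}
```

For an update, the same function would be applied twice, once to the old row image and once to the new one, to fill the `old_*` and `new_*` maps.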
Specific Message Format
● Requires 3 Avro beans per table
○ Insert, Update, Delete
○ Specific fields can be used in the schema
{
  "name": "UserInsert",
  "fields": [
    {
      "name": "id",
      "type": ["null", "int"]
    },
    {
      "name": "username",
      "type": ["null", "string"]
    },
    {
      "name": "login_date",
      "type": ["null", "long"]
    },...
  ]
},
ALTER TABLE support
● ALTER TABLE queries are intercepted
○ Producers can handle this event specifically
● Kafka serializer and deserializer
○ They inspect Avro beans and refresh schema if
needed
● Avro evolution rules must be respected
○ Or mypipe can’t properly encode / decode data
Pipes
● Join consumers to producers
● Use configurable time-based checkpointing
and flushing
○ File-based, MySQL-based, ZK-based, Kafka-based
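Time-based checkpointing can be sketched as below. This is not mypipe's implementation: `BinlogPosition`, `Checkpointer`, `onEvent` and `flush` are hypothetical names, the `save` callback stands in for a file, MySQL, ZooKeeper or Kafka backed position store, and the caller passes in the clock so the behaviour is deterministic.

```scala
object CheckpointSketch {
  final case class BinlogPosition(file: String, offset: Long)

  // Saves the latest binlog position at most once per `intervalMs`.
  final class Checkpointer(intervalMs: Long, save: BinlogPosition => Unit) {
    private var lastSavedAt = 0L
    private var pending: Option[BinlogPosition] = None

    // Records the newest position and flushes if the interval has elapsed.
    def onEvent(pos: BinlogPosition, nowMs: Long): Unit = {
      pending = Some(pos)
      if (nowMs - lastSavedAt >= intervalMs) flush(nowMs)
    }

    // Persists the pending position, if any, and restarts the interval.
    def flush(nowMs: Long): Unit = {
      pending.foreach(save)
      pending = None
      lastSavedAt = nowMs
    }
  }
}
```

In a real pipe the producer would presumably be flushed before the position is saved, so that a restart never skips events that were buffered but not yet delivered.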
schema-repo-client = "mypipe.avro.schema.SchemaRepo"
consumers {
  localhost {
    # database "host:port:user:pass" array
    source = "localhost:3306:mypipe:mypipe"
  }
}
producers {
  stdout {
    class = "mypipe.kafka.producer.stdout.StdoutProducer"
  }
  kafka-generic {
    class = "mypipe.kafka.producer.KafkaMutationGenericAvroProducer"
  }
}
pipes {
  stdout {
    consumers = ["localhost"]
    producer { stdout {} }
    binlog-position-repo {
      #class="mypipe.api.repo.ConfigurableMySQLBasedBinaryLogPositionRepository"
      class = "mypipe.api.repo.ConfigurableFileBasedBinaryLogPositionRepository"
      config {
        file-prefix = "stdout-00" # required if binlog-position-repo is specified
        data-dir = "/tmp/mypipe/data"
      }
    }
  }
  kafka-generic {
    enabled = true
    consumers = ["localhost"]
    producer {
      kafka-generic {
        metadata-brokers = "localhost:9092"
      }
    }
  }
}
Practical Applications
● Cache coherence
● Change logging and auditing
● MySQL to:
○ HDFS
○ Cassandra
○ Spark
● Once Confluent Schema Registry integrated
○ Kafka Connect
○ KStreams
● Other reactive applications
○ Real-time notifications
System Diagram
[Figure: system diagram. Labels recovered from the slide: MySQL, Binary Logs, Select; Pipe 1, Pipe 2, Pipe N; MySQL BinaryLog Consumer, Select Consumer, Kafka Producer; Schema Registry; Kafka (db1_tbl1, db1_tbl2, db2_tbl1, db2_tbl2); Event Consumers; Hadoop, Cassandra, Dashboards, Users.]
Future Work
● Finish MySQL -> Kafka snapshot support
● Move to Kafka 0.10
● MySQL global transaction identifier (GTID)
support
● Publish to Maven
● More tests, we have a good amount, but you
can’t have enough!
Fin!
That’s all folks (=
Thanks!
Questions?
https://github.com/mardambey/mypipe
