Introduction to Apache Kafka and why it matters - Madrid
1.
Introduction to Apache Kafka
... and why it matters!
Wednesday, 18th July 2018
Auditorio Google Campus Madrid
Calle de Manzanares, 1 - Madrid
https://www.meetup.com/apachekafkamadrid
2.
Events
A sale, an invoice, a trade, a customer experience
An immutable record that something has happened at some point
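To make "immutable record" concrete, here is a minimal JDK-only sketch. The `SaleEvent` class, its fields and the `corrected` helper are hypothetical names chosen for illustration and do not come from the talk. The point it tries to show: an event is never edited in place, and a correction is recorded as a further event.

```java
import java.time.Instant;

// Hypothetical event type for illustration; names and fields are assumptions.
final class SaleEvent {
    // All fields are final: once the event exists, it cannot change.
    private final String saleId;
    private final long amountCents;
    private final Instant occurredAt;

    SaleEvent(String saleId, long amountCents, Instant occurredAt) {
        this.saleId = saleId;
        this.amountCents = amountCents;
        this.occurredAt = occurredAt;
    }

    String saleId() { return saleId; }
    long amountCents() { return amountCents; }
    Instant occurredAt() { return occurredAt; }

    // A correction produces a new event; the original stays untouched.
    SaleEvent corrected(long newAmountCents, Instant at) {
        return new SaleEvent(saleId, newAmountCents, at);
    }

    public static void main(String[] args) {
        SaleEvent sale = new SaleEvent("sale-1", 1999, Instant.parse("2018-07-18T17:00:00Z"));
        SaleEvent fix = sale.corrected(1899, Instant.parse("2018-07-18T17:05:00Z"));
        System.out.println(sale.amountCents() + " then " + fix.amountCents());
    }
}
```

Both events would be kept, so the full history of what happened remains available.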
4.
Top 10 Strategic Technology Trends for 2018: Event Driven Model
Digital businesses are event-driven
According to Gartner, legacy application architectures lack support for continuous innovation and global scale.
Organizations need to invest in event-centric design practices and technologies to exploit digital business moments.
https://www.confluent.io/resources/top-10-strategic-technology-trends-for-2018-event-driven-model
5.
How Organizations Handle Data Flows: a Giant Mess
[Diagram. Sources and sinks shown: Data Warehouse, Hadoop, NoSQL, Oracle, SFDC, Logging, Bloomberg, OLTP, ActiveMQ, Caches, Apps, …any sink/source. Consumers shown: Web, Custom Apps, Microservices, Monitoring, Analytics, …and more.]
6.
Apache Kafka™: A Distributed Streaming Platform
[Diagram. Central label: Apache Kafka. Latency tiers: Offline Batch (+1 Hour), Near-Real Time (>100s ms), Real Time (0-100 ms). Sources and sinks shown: Data Warehouse, Hadoop, NoSQL, Oracle, SFDC, Twitter, Bloomberg, …any sink/source. Consumers shown: Web, Custom Apps, Microservices, Monitoring, Analytics, …and more.]
7.
More than 1 petabyte of data in Kafka
Over 1.2 trillion messages per day
Thousands of data streams
Source of all data warehouse & Hadoop data
Over 300 billion user-related events per day
8.
Jay Kreps: Co-Creator of Apache Kafka and Co-Founder of Confluent
Apache Kafka and the four challenges
of production machine learning
systems
Untangling data pipelines with a
streaming platform.
[…] This is our driving vision behind Kafka: to
make a central nervous system for data that
allows intelligent applications to be easily
integrated with normal event-driven
microservices and other data pipelines.
https://www.oreilly.com/ideas/apache-kafka-and-the-four-challenges-of-production-machine-learning-systems
9.
Over 35% of Fortune 500 companies are using Apache Kafka™
Travel: 6 of top 10
Global banks: 7 of top 10
Insurance: 8 of top 10
Telecom: 9 of top 10
10.
Royal Bank of Canada (RBC)
https://www.confluent.io/customers/rbc
■ 16 Million Clients
■ 35 Countries
■ 30+ Use-cases
■ 50+ apps
■ 10+ lines of business
“Kafka transforms even the most basic of initiatives. Adoption of Kafka at
RBC has been massive and organic. Within the first six weeks after our
launch of Kafka, we had 37 teams asking to use Kafka for various projects
and initiatives.”
-- Kerry Joel, Senior Director,
Product Innovation, Data and Analytics
[Diagram labels: Digital Marketing, Security, Consumer Credit Services, SaaS, Corporate Real Estate, Investor Services, Treasury Services, …, Fraud, Data Warehouse, Microservices]
https://www.youtube.com/watch?v=WTxmHHJcHRc
11.
Industry Trends… and why Apache Kafka matters!
1. From ‘big data’ (batch) to ‘fast data’ (stream processing)
2. Internet of Things (IoT) and sensor data
3. Microservices and asynchronous communication (coordination messages and data streams) between loosely coupled and fine-grained services
14.
Apache Kafka API – ETL Analogy
Source → Connect API (Extract) → Streams API (Transform) → Connect API (Load) → Sink
15.
The Connect API of Apache Kafka®
✓ Centralized management and configuration
✓ Support for hundreds of technologies including RDBMS, Elasticsearch, HDFS, S3
✓ Supports CDC ingest of events from RDBMS
✓ Preserves data schema
✓ Fault tolerant and automatically load balanced
✓ Extensible API
✓ Single Message Transforms
✓ Part of Apache Kafka, included in Confluent Open Source
Reliable and scalable integration of Kafka
with other systems – no coding required.
{
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
"table.whitelist": "sales,orders,customers"
}
https://docs.confluent.io/current/connect/
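A configuration like the one above is typically submitted to a Kafka Connect worker through its REST interface (`POST /connectors` with a `name` and a `config` object). The sketch below only builds that request with the JDK 11 HTTP client and does not send it. The worker address `http://localhost:8083`, the connector name and the `ConnectorRequest` class are assumptions for illustration, and a real JDBC source connector will probably need further settings than the three shown on the slide.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper; the class name, worker URL and connector name are assumptions.
final class ConnectorRequest {
    // Assumed address of a Connect worker running locally.
    static final String CONNECT_URL = "http://localhost:8083/connectors";

    // Wraps the slide's three properties in the {"name", "config"} envelope.
    static String payload(String name) {
        return "{\n"
            + "  \"name\": \"" + name + "\",\n"
            + "  \"config\": {\n"
            + "    \"connector.class\": \"io.confluent.connect.jdbc.JdbcSourceConnector\",\n"
            + "    \"connection.url\": \"jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo\",\n"
            + "    \"table.whitelist\": \"sales,orders,customers\"\n"
            + "  }\n"
            + "}";
    }

    // Builds the POST request; sending it is left to the caller.
    static HttpRequest build(String name) {
        return HttpRequest.newBuilder(URI.create(CONNECT_URL))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(payload(name)))
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("jdbc-sales-source");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(payload("jdbc-sales-source"));
    }
}
```

Sending the request would be a matter of passing it to `HttpClient.newHttpClient().send(...)` against a running worker.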
16.
The Streams API of Apache Kafka®
✓ No separate processing cluster required
✓ Develop on Mac, Linux, Windows
✓ Deploy to containers, VMs, bare metal, cloud
✓ Powered by Kafka: elastic, scalable, distributed, battle-tested
✓ Perfect for small, medium, large use cases
✓ Fully integrated with Kafka security
✓ Exactly-once processing semantics
✓ Part of Apache Kafka, included in Confluent Open Source
Write standard Java applications and microservices
to process your data in real-time
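// Count page views per user session; a session closes after 5 minutes of inactivity.
// Uses windowedBy(); the older count(SessionWindows, String) overload is deprecated as of Kafka 1.0.
KStream<User, PageViewEvent> pageViews = builder.stream("pageviews-topic");
KTable<Windowed<User>, Long> viewsPerUserSession = pageViews
    .groupByKey()
    .windowedBy(SessionWindows.with(TimeUnit.MINUTES.toMillis(5)))
    .count(Materialized.as("session-views"));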
https://docs.confluent.io/current/streams/
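To illustrate what the five-minute session window in the snippet above computes, here is a plain-Java sketch with no Kafka dependency. `SessionCounter` and its method are hypothetical names, and the sketch assumes events arrive in timestamp order and that a gap strictly greater than `gapMs` starts a new session. Kafka Streams itself also merges sessions when records arrive out of order, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of per-key session counting; not the Kafka Streams implementation.
final class SessionCounter {
    // Returns, for each user, the number of events in each successive session.
    // Assumes users[i] and timestampsMs[i] describe one event and events are time-ordered.
    static Map<String, List<Long>> countPerSession(String[] users, long[] timestampsMs, long gapMs) {
        Map<String, List<Long>> counts = new LinkedHashMap<>();
        Map<String, Long> lastSeen = new HashMap<>();
        for (int i = 0; i < users.length; i++) {
            String user = users[i];
            long ts = timestampsMs[i];
            List<Long> sessions = counts.computeIfAbsent(user, u -> new ArrayList<>());
            Long previous = lastSeen.get(user);
            if (previous == null || ts - previous > gapMs) {
                sessions.add(1L);                           // inactivity gap exceeded: open a new session
            } else {
                int last = sessions.size() - 1;
                sessions.set(last, sessions.get(last) + 1); // still within the current session
            }
            lastSeen.put(user, ts);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] users = {"alice", "bob", "alice", "alice"};
        long[] times = {0L, 30_000L, 60_000L, 600_000L};
        System.out.println(countPerSession(users, times, 5 * 60_000L));
    }
}
```

In the sample data the third "alice" event arrives nine minutes after the second, so it falls into a second session.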
17.
KSQL: a Streaming SQL Engine for Apache Kafka® from Confluent
✓ No coding required, all you need is SQL
✓ No separate processing cluster required
✓ Powered by Kafka: elastic, scalable, distributed, battle-tested
✓ Perfect for streaming ETL, anomaly detection, event monitoring, and more
✓ Part of Confluent Open Source
KSQL is the simplest way to process streams of data in real-time
CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING count(*) > 3;
CREATE STREAM vip_actions AS
  SELECT c.userid, page, action
  FROM clickstream c
  LEFT JOIN users u
  ON c.userid = u.userid
  WHERE u.level = 'Platinum';
https://github.com/confluentinc/ksql
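The `possible_fraud` query groups authorization attempts into fixed, non-overlapping 5-second windows and keeps the cards with more than 3 attempts in a window. A plain-Java sketch of that windowing logic follows. `FraudWindows` and its method are hypothetical names, the result key format `card@windowStart` is an assumption made for readability, and timestamps are assumed to be non-negative milliseconds. It is an illustration of tumbling-window counting, not how KSQL is implemented.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of tumbling-window counting with a HAVING-style filter.
final class FraudWindows {
    // Returns "card@windowStart" -> attempt count, only where count > threshold.
    static Map<String, Long> possibleFraud(String[] cards, long[] timestampsMs,
                                           long windowMs, long threshold) {
        Map<String, Long> counts = new TreeMap<>();
        for (int i = 0; i < cards.length; i++) {
            // Each timestamp belongs to exactly one window, aligned to multiples of windowMs.
            long windowStart = timestampsMs[i] - (timestampsMs[i] % windowMs);
            counts.merge(cards[i] + "@" + windowStart, 1L, Long::sum);
        }
        // Equivalent of HAVING count(*) > threshold.
        counts.values().removeIf(count -> count <= threshold);
        return counts;
    }

    public static void main(String[] args) {
        String[] cards = {"4111", "4111", "4111", "4111", "4111", "5500", "5500", "5500"};
        long[] times = {0L, 1_000L, 2_000L, 4_999L, 5_000L, 100L, 200L, 300L};
        System.out.println(possibleFraud(cards, times, 5_000L, 3L));
    }
}
```

In the sample data only card 4111 exceeds three attempts inside a single window; its fifth attempt at 5,000 ms already belongs to the next window.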
18.
KSQL: in less than 5 minutes!
https://www.youtube.com/watch?v=A45uRzJiv7I
21.
Physical Architecture
Rack 1 (2 ToR switches): Kafka Broker #1, Kafka Broker #4, Zookeeper #1, Zookeeper #4, Schema Registry #1, Kafka Connect #1, REST Proxy #1
Rack 2 (2 ToR switches): Kafka Broker #2, Kafka Broker #5, Zookeeper #2, Zookeeper #5, Schema Registry #2, Kafka Connect #2
Rack 3 (2 ToR switches): Kafka Broker #3, Zookeeper #3, Kafka Connect #3
Also shown: 2 core switches, 2 load balancers, REST Proxy #2, Control Center #1, Control Center #2
22.
Big Data and Fast Data Ecosystems
[Diagram. Latency tiers: Synchronous Req/Response (0 – 100s ms), Near Real Time (> 100s ms), Offline Batch (> 1 hour).]
Apache Kafka (Stream Data Platform): Search, RDBMS, Apps, Monitoring, Real-time Analytics, NoSQL, Stream Processing
Apache Hadoop (Data Lake): Impala, DWH, Hive, Spark, Map-Reduce
Confluent HDFS Connector (exactly once semantics)
https://www.confluent.io/blog/the-value-of-apache-kafka-in-big-data-ecosystem/
23.
Building a Microservices Ecosystem with Kafka Streams and KSQL
https://www.confluent.io/blog/building-a-microservices-ecosystem-with-kafka-streams-and-ksql/
https://github.com/confluentinc/kafka-streams-examples/tree/master/src/main/java/io/confluent/examples/streams/microservices
24.
Apache Kafka: PMC members and committers
https://kafka.apache.org/committers
25.
10 Hot IT Skills in 2018
https://business.udemy.com/blog/10-hot-it-skills-2018/
Udemy has analysed patterns of 20+ million learners to see what hot IT skills are trending in 2018.
#1 Apache Kafka is at the top of the list!
26.
Books: Available for Free in PDF
https://www.confluent.io/apache-kafka-stream-processing-book-bundle
27.
Linux is not just Linux,
it’s Red Hat
Hadoop is not just Hadoop,
it’s Cloudera or Hortonworks
Kafka is not just Kafka,
it’s Confluent
28.
Confluent Completes Kafka
Feature: Benefit (availability columns on the slide: Apache Kafka, Confluent Open Source, Confluent Enterprise)
Apache Kafka: High throughput, low latency, high availability, secure distributed streaming platform
Kafka Connect API: Advanced API for connecting external sources/destinations into Kafka
Kafka Streams API: Simple library that enables streaming application development within the Kafka framework
Additional Clients: Supports non-Java clients; C, C++, Python, .NET and several others
REST Proxy: Provides universal access to Kafka from any network connected device via HTTP
Schema Registry: Central registry for the format of Kafka data – guarantees all data is always consumable
Pre-Built Connectors: HDFS, JDBC, Elasticsearch, Amazon S3 and other connectors fully certified and supported by Confluent
JMS Client: Support for legacy Java Message Service (JMS) applications consuming and producing directly from Kafka
Confluent Control Center: Enables easy connector management, monitoring and alerting for a Kafka cluster
Auto Data Balancer: Rebalancing data across cluster to remove bottlenecks
Replicator: Multi-datacenter replication simplifies and automates MDC Kafka clusters
Support: Enterprise class support to keep your Kafka environment running at top performance (Apache Kafka: Community; Confluent Open Source: Community; Confluent Enterprise: 24x7x365)
29.
Confluent: Deploy Apache Kafka without the Hassle
Steps we take:
• Bundled for easy script-driven installation
Pre-built packages:
• RPM
• Deb
• Tar.gz
• Docker Images
• Mesos
Tests we run:
• Regressions
• Cluster performance
• Stress tests
• Broker death
• Upgrade tests
• Compatibility tests
30.
Confluent Platform Demo
• Process real-time edits to Wikipedia pages with Confluent, KSQL, Kafka Connect, Elasticsearch and Kibana.
• Use Docker to deploy a Kafka streaming ETL using KSQL for stream processing and Confluent Control Center for monitoring.
• All the components in the Confluent platform have security enabled end-to-end.
• Run the demo yourself with the playbook and video tutorials.
• Do not use this demo in production, it's a demo!
Monitoring Kafka streaming ETL deployments
https://docs.confluent.io/current/tutorials/cp-demo/docs/index.html
https://github.com/confluentinc/cp-demo