The Many Faces of Apache Kafka: Leveraging Real-time Data at Scale
Neha Narkhede, Confluent
• Mission: Make the Kafka-based stream data platform a practical reality everywhere
• Product:
  • Stream processing integration for Kafka
  • Connectors for streaming data flow to and from common systems
  • Monitoring of end-to-end data flow
  • Schemas and metadata management
• First release: Confluent Platform 1.0
Thank you
@nehanarkhede
http://confluent.io/careers

Editor's notes
  1. When I joined LinkedIn, I started on the Search team, where we were building distributed real-time search infrastructure. In my spare time, I dabbled in understanding LinkedIn's data infrastructure and got involved part-time in the Who's Viewed My Profile (WVMP) product. I realized that the way data got into the search indexes was very different from the way data got generated for serving WVMP results, so I got very interested in looking into how data flows through all the different systems we had. And it turned out that we had millions of different data pipelines between each different type of system and a different way of handling each data source.
  2. As with any website, we had a few relational databases. We had one way of loading relational database data into our relational data warehouse.
  3. And a different way to load data from relational databases into Hadoop, which happened to be fragile CSV dumps.
  4. And yet another way of subscribing to relational database changes and updating various caches, which happened to be a tool we wrote at LinkedIn.
  5. And that was just relational databases. We also had NoSQL stores that, over time, got used pretty heavily at LinkedIn. Since the ETL scripts were custom to the relational databases, we wrote separate ones to pull data from the NoSQL stores into Hadoop. The interesting thing to note here is that there wasn't just a use case for pulling NoSQL data into Hadoop for analytics, but also a use case for publishing data from Hadoop back into the NoSQL stores. Basically, there were a bunch of LinkedIn products that built derived data models in Hadoop and published them to the NoSQL store for serving. For example, People You May Know (PYMK), where the relationship ranking between people needed to be updated with updates to LinkedIn's connection graph, and WVMP, where profile views needed to be aggregated. And so there was a different mechanism for publishing this sort of derived data from Hadoop into the NoSQL stores.
  6. The data in data stores or databases is resident data, but a lot of useful data has to do with finding out "what happened". This is especially true for relevance, where you want to find out how users are using the products. Event data like user activity is critical for building the right products. Back then, our pipeline for collecting activity data was pretty ad hoc too: user activity data got logged into minutely or hourly files on individual frontend machines. These files were then aggregated, synced through NFS and loaded into HDFS as well as the relational data warehouse using custom scripts.
  7. A bunch of our real-time use cases were built on top of ActiveMQ. The problem with these legacy messaging systems is that they aren't really scalable, and so
  8. every new use case ends up adding a separate ActiveMQ cluster, making it very operationally expensive.
  9. Then there was a separate pipeline for getting application log4j logs into Splunk.
  10. And yet another separate pipeline for getting our application JMX metrics to our company-wide monitoring system.
  11. And pretty soon, this is what our data pipeline looked like. Each pipeline was sufficient if you just focused on that part of the picture, but any use case where you required access to broader data was very cumbersome. This was a problem, since there were a lot of data-rich products that we aspired to build, and the process of getting anything like that off the ground was more like a research project, since there wasn't a uniform way to get access to all the disparate data sources. For example, the business activity data was available in the data warehouse but not in the company-wide real-time monitoring system. So there was a way to get daily reports but no way to do real-time alerting on important business metrics. There was a way to monitor application metrics in real time but no way to ask deeper questions, like what are the top applications with the largest latencies affecting page load times. And so on...
  12. We really had millions of little disparate data pipelines between every type of specialized system and every data source, and a different way of getting every new type of data to each. Making any change to one part broke many other parts of the pipeline. In short, a giant mess.
  13. So we took a step back and thought hard about the problem. Definitely adding point to point connections between every new data source and system in an ad-hoc way wasn’t a great long-term solution. What we really envisioned was a stream data platform that acted as a central warehouse for streaming data, where every new data source was registered and stored and any system that needed access to data sources could just plug in and subscribe to it. And if this happened, then we could build a rich ecosystem of near real-time applications around such a stream data platform.
  14. And so being infrastructure engineers, we thought that there ought to be an infrastructure solution to this problem. It was basically the problem of ensuring that data ends up in all the right places. And this wasn’t a new problem. Enterprise messaging systems existed to move data around, so that’s where we started. 
  15. We were running ActiveMQ in some parts of the website and we also looked into RabbitMQ at that time. It turned out that there were a bunch of problems.
  16. The throughput required for large-volume data sources like logs turned out to be way too expensive on these messaging systems. The integration with batch systems just wasn't what most of these messaging systems were designed for: supporting batch systems requires the ability to maintain a large backlog of unconsumed data, and these messaging systems got slower and slower as the amount of unconsumed data accumulated over time, and eventually died. This was mostly a result of persistence not being taken seriously in these systems; in fact, it is an afterthought, an option you can turn on, not something that is expected to be on by default for high performance. We also wanted to be able to do stream processing on our data, and for that, being able to rewind and reprocess data was a huge requirement. Messaging systems delete data the moment it has been consumed by a subscriber, so they really weren't designed to handle reprocessing of any kind. And then there was the problem of ensuring ordering guarantees. Why do you need ordering guarantees? If you publish profile updates to a search index, you don't want the later update to precede the older one, leading to a permanently inconsistent index.
  17. As we thought more about these problems, we found ourselves re-architecting these systems, so we started from scratch, which led to Kafka.
  18. Kafka is essentially a publish-subscribe messaging system. You have producers, which send messages or events or records to a central Kafka cluster, and you have consumers that subscribe to these events. It is different from a lot of messaging systems out there in a number of ways: (1) it is distributed from the ground up, so your producers are spread over many machines, your Kafka brokers are spread over many machines, and the consumer itself may be a distributed system, like Hadoop, also distributed over many machines; (2) it is persistent; (3) it is multi-subscriber, which means you can have zero consumers of a feed or n consumers of a feed.
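  A rough sketch of what publishing looks like in code, using today's Java producer client rather than the 0.8-era API from the time of this talk; the broker address, topic name, key and payload are all illustrative:

     import java.util.Properties;
     import org.apache.kafka.clients.producer.KafkaProducer;
     import org.apache.kafka.clients.producer.ProducerRecord;

     public class ProfileUpdateProducer {
         public static void main(String[] args) {
             Properties props = new Properties();
             props.put("bootstrap.servers", "localhost:9092");  // illustrative broker address
             props.put("key.serializer",
                     "org.apache.kafka.common.serialization.StringSerializer");
             props.put("value.serializer",
                     "org.apache.kafka.common.serialization.StringSerializer");
             try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                 // Key by member id so all updates for one member stay ordered in one partition
                 producer.send(new ProducerRecord<>("ProfileUpdates", "member-42",
                         "{\"headline\":\"Software Engineer\"}"));
             }
         }
     }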
  19. The central abstraction that Kafka provides is a log, and it is worth talking a little bit about what I mean by log. A log is an abstract data structure with some properties. It is like a structured array or feed of messages, so it is ordered. It is append-only and hence immutable, so the records don't change once they are written, and the records are ordered by time: newer records are appended towards the end, and each record is addressable using an index, sometimes known as the log sequence number, or in Kafka, an offset. Each record in this log has data, and I haven't talked about how the data is structured; for this talk, I'm going to ignore that detail. Imagine that the data is just some byte array that was serialized from JSON, Avro or protocol buffers. So that is the logical view of a log.
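  Because every record is addressable by its offset, a consumer can rewind and reprocess the log, which is exactly what the older messaging systems couldn't do. A minimal sketch with the Java consumer, reusing the illustrative topic from above:

     import java.time.Duration;
     import java.util.List;
     import java.util.Properties;
     import org.apache.kafka.clients.consumer.ConsumerRecord;
     import org.apache.kafka.clients.consumer.KafkaConsumer;
     import org.apache.kafka.common.TopicPartition;

     public class RewindExample {
         public static void main(String[] args) {
             Properties props = new Properties();
             props.put("bootstrap.servers", "localhost:9092");
             props.put("key.deserializer",
                     "org.apache.kafka.common.serialization.StringDeserializer");
             props.put("value.deserializer",
                     "org.apache.kafka.common.serialization.StringDeserializer");
             try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                 TopicPartition p0 = new TopicPartition("ProfileUpdates", 0);
                 consumer.assign(List.of(p0));
                 consumer.seek(p0, 0L);  // rewind to the beginning of the log and reprocess
                 for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                     System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
                 }
             }
         }
     }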
  20. So you might think, hmm, I understand this log abstraction, but how does this help a messaging system? It turns out that logs are actually a fantastic mechanism for pub-sub systems. I've shown a logical view of the log here, with a source system that appends new data to the logical log and two destination systems, system A and system B, that have subscribed to the log. There is only one log no matter how many subscribing systems there are, so I'm not maintaining a queue per reader, and everybody sees the data in the same order, which is the log order. Most importantly, the log has a log sequence number. It applies a notion of time to the log, which is not physical time, like it's 9am, but logical time. So if system A has read up to record 6, it is at time 6, and system B is at time 2. This ends up being really important, since you can reason about the state of system A and the state of system B by how much of this log they have consumed, that is, which log sequence number they are up to.
  21. Physically, if you wanted to scale out a log, you could shard it into multiple partitions. And if you did that, then that is exactly the backend of Kafka: logically, a log is a topic or category of data, and physically, the log or topic lives in partitions on several machines, or brokers. As events occur, we continually append them to the partition's log. And we have a policy for maintaining a window of the log, either based on time (say, retain for a week or two) or based on size (retain 1 TB or less).
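  As a concrete illustration, here is how a sharded topic with a time-based retention window might be created with the Java Admin API (an API that postdates this talk; the topic name, partition count and retention value are made up):

     import java.util.List;
     import java.util.Map;
     import java.util.Properties;
     import org.apache.kafka.clients.admin.Admin;
     import org.apache.kafka.clients.admin.NewTopic;

     public class CreateTopicExample {
         public static void main(String[] args) throws Exception {
             Properties props = new Properties();
             props.put("bootstrap.servers", "localhost:9092");
             try (Admin admin = Admin.create(props)) {
                 // 6 partitions spread the log across brokers; replication factor 3 for fault tolerance
                 NewTopic topic = new NewTopic("user-activity", 6, (short) 3)
                         // time-based retention: keep roughly one week, then delete old segments
                         .configs(Map.of("retention.ms", "604800000"));
                 admin.createTopics(List.of(topic)).all().get();
             }
         }
     }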
  22. As I mentioned before, Kafka is multi-subscriber: the same topic can be consumed by multiple groups of consumers, and each consumer group subscribes to read a full copy of the data. Furthermore, every consumer group can have multiple consumer processes distributed over several machines, and Kafka takes care of assigning the partitions of the subscribed topics evenly amongst the consumer processes in a group, so that at all times every partition of a subscribed topic is being consumed by some consumer process within the group.
  23. If one fails, the others automatically rebalance to pick up the load of the failed consumer instance, so it is operationally cheap to consume large amounts of data, as the sketch below shows.
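  A sketch of one consumer-group member: start several copies of this process with the same group.id and Kafka divides the topic's partitions among them, rebalancing whenever a member joins or dies. Topic and group names are illustrative:

     import java.time.Duration;
     import java.util.List;
     import java.util.Properties;
     import org.apache.kafka.clients.consumer.ConsumerRecord;
     import org.apache.kafka.clients.consumer.KafkaConsumer;

     public class ActivityConsumer {
         public static void main(String[] args) {
             Properties props = new Properties();
             props.put("bootstrap.servers", "localhost:9092");
             props.put("group.id", "activity-loaders");  // all members share this id
             props.put("key.deserializer",
                     "org.apache.kafka.common.serialization.StringDeserializer");
             props.put("value.deserializer",
                     "org.apache.kafka.common.serialization.StringDeserializer");
             try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                 consumer.subscribe(List.of("user-activity"));
                 while (true) {
                     // Each member only sees records from the partitions assigned to it
                     for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                         System.out.printf("partition=%d offset=%d%n", r.partition(), r.offset());
                     }
                 }
             }
         }
     }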
  24. Our Kafka adoption at LinkedIn initially started with user activity data, log data and application metrics, and later expanded to include database data. Essentially all data sources ended up being published to Kafka.
  25. It fed the Hadoop clusters and the relational data warehouse.
  26. And any system that needed access to a data source merely subscribed to Kafka and asked for the related topic. What's cool is that for any new data source to be accessible company-wide, the investment involved was just a one-time integration with Kafka, instead of the previous manual integration work for every new system that needed access to that data source.
  27. With Kafka becoming a hub for all streaming data at LinkedIn, it also led to people writing a lot of stream processing applications using Kafka as the backend. Kafka became not only an input but also an output for streaming data.
  28. Today, practically all data at LinkedIn is available as a Kafka stream. In total, LinkedIn's Kafka clusters take 500 billion writes and 2.5 trillion reads of messages per day.
  29. In the last couple of years, Kafka has been adopted beyond LinkedIn at hundreds of companies worldwide, from web companies like LinkedIn, Netflix and Uber to larger enterprises like Cerner, Cisco and Goldman Sachs. Kafka is used in a variety of different ways, but if you were just starting off, here are a few examples of how Kafka is adopted in real production deployments, and some practical use cases with which you can start your own Kafka adoption.
  30. The most obvious one is making data available in real-time applications as well as in batch systems, similar to what we started with at LinkedIn. There are quite a few examples of this pattern of adopting Kafka, but if I were to give a few, I'd start with monitoring data.
  31. You typically have monitoring data, like JMX metrics from all your web apps and OS metrics from all machines, that needs to be routed to your monitoring application so that you can monitor the health of your IT infrastructure.
  32. But really, monitoring is more than just reactive analysis. What you also want to do is longer look-back analytics, like spotting week-over-week trends, and to be able to do that at scale, you also want to make sure the data reaches Hadoop for querying through Hive or other Hadoop query engines. The older, clunky way of doing that is to write a custom pipeline for getting the metrics data from every application machine to the monitoring system, and another one, say Flume, for piping data into Hadoop.
  33. But the obvious problem with that is that you end up doing these double writes. Double writes are bad for several reasons. First, the operational overhead of having two different pipelines for the same data is high. Second, you no longer have the same view of the data across the various systems that consume it; for example, the data that reaches the monitoring application may not be exactly what reaches Hadoop, especially when there are intermediate failures.
  34. Foursquare, LinkedIn and a bunch of other companies have moved their monitoring data pipelines over to Kafka, where every machine has an agent that sends all metrics data directly to Kafka topics. The monitoring application merely consumes from the Kafka topic in real time, while a Kafka-HDFS connector, like Camus, loads data from the Kafka topics into HDFS for querying.
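  A toy version of such an agent, assuming a hypothetical "metrics" topic: it samples one OS metric via the standard JMX bean and keys each record by hostname, so a given host's metrics stay ordered within one partition:

     import java.lang.management.ManagementFactory;
     import java.net.InetAddress;
     import java.util.Properties;
     import org.apache.kafka.clients.producer.KafkaProducer;
     import org.apache.kafka.clients.producer.ProducerRecord;

     public class MetricsAgent {
         public static void main(String[] args) throws Exception {
             Properties props = new Properties();
             props.put("bootstrap.servers", "localhost:9092");
             props.put("key.serializer",
                     "org.apache.kafka.common.serialization.StringSerializer");
             props.put("value.serializer",
                     "org.apache.kafka.common.serialization.StringSerializer");
             String host = InetAddress.getLocalHost().getHostName();
             try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                 while (true) {
                     // Sample the system load average via the standard JMX bean
                     double load = ManagementFactory.getOperatingSystemMXBean()
                             .getSystemLoadAverage();
                     producer.send(new ProducerRecord<>("metrics", host,
                             String.format("{\"host\":\"%s\",\"load\":%.2f}", host, load)));
                     Thread.sleep(10_000);  // one sample every 10 seconds
                 }
             }
         }
     }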
  35. The same is true for log collection. Logs are a critical part of any system; they give you insight into what a system is doing as well as what happened. Virtually every process in a company generates logs in some form or another.
  36. Usually, these logs are written to files on local disks. When your system grows to multiple hosts, managing the logs and accessing them can get complicated. A common processing step is to load the logs into some sort of search system, to be able to search for a particular error across hundreds of log files, especially for quick troubleshooting.
  37. And Elasticsearch is great for searching through recent history, but if you wanted to do analytics on a longer time window, you'd want the same logs to also reach Hadoop. And you'd probably end up with Flume or something similar to pull data from all your machines into Hadoop.
  38. Double writes, again.
  39. The advantage of moving to Kafka is that it helps integrate all such use cases that require shipping data to two systems with very different consumption profiles.
  40. However, this pattern of copying the same data multiple times isn't just limited to monitoring or log data; it is a general and widespread problem within a company, where over time applications evolve in a way where you end up double-writing their state, mainly for the purposes of denormalization.
  41. Let's take an example to make this denormalization problem and its effect on data integration more concrete. Say you had this stereotypical web app architecture for serving a LinkedIn profile. When the user updates her profile, the web app writes those updates to the database. This could be a relational database, HBase, Cassandra or something else. This is great, but as the number of users on the website increases, you want to be able to serve the user profile data much faster without overloading your database.
  42. So you throw up a cache, something like memcached. And now when the user updates their profile, you add some more logic to the web app to not only write updates to the database, but also update the cache. This is also great, but then you later decide that you also want to enable the search feature on the website.
  43. So you throw a search system up, this could be Elasticsearch or Solr or a custom one. And now the web app has to route the profile updates to yet a third place, which is the search index in addition to the database and the cache.
  44. And pretty soon, you realize that your web app is slowing down. With every new system the data has to reach, adding custom logic to the web app eventually slows down the process of synchronously updating the user's profile. So you want to be able to feed these various systems the profile update asynchronously in the background. And so you introduce some sort of queuing system, and since the data needs to be subscribed to by multiple systems, you need a pub-sub messaging system, not just a queue.
  45. This is where Kafka comes in. Kafka is not just scalable, with sharded topics consumed in parallel by multiple subscribers; the key value you get from using Kafka for data integration is that it decouples the process of applying the primary updates from updating the various derived, denormalized views. With Kafka in the picture, your web app is very simple: it merely writes to Kafka and waits for the write to be committed, and the write can be applied to each one of these systems asynchronously in the background, as the sketch below shows.
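  For example, the cache updater becomes a small standalone consumer, sketched below with an in-memory map standing in for memcached; the search indexer and database writer would look the same, each with its own group.id, so that each independently receives every update:

     import java.time.Duration;
     import java.util.List;
     import java.util.Properties;
     import java.util.concurrent.ConcurrentHashMap;
     import org.apache.kafka.clients.consumer.ConsumerRecord;
     import org.apache.kafka.clients.consumer.KafkaConsumer;

     public class CacheUpdater {
         // Stand-in for memcached: member id -> serialized profile
         static final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();

         public static void main(String[] args) {
             Properties props = new Properties();
             props.put("bootstrap.servers", "localhost:9092");
             props.put("group.id", "cache-updater");  // distinct group per downstream system
             props.put("key.deserializer",
                     "org.apache.kafka.common.serialization.StringDeserializer");
             props.put("value.deserializer",
                     "org.apache.kafka.common.serialization.StringDeserializer");
             try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                 consumer.subscribe(List.of("ProfileUpdates"));
                 while (true) {
                     for (ConsumerRecord<String, String> u : consumer.poll(Duration.ofMillis(500))) {
                         // Log order per key means the cache never applies updates out of order
                         cache.put(u.key(), u.value());
                     }
                 }
             }
         }
     }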
  46. Once you have solved the problem of making data easily accessible through all the systems in the company in real time, people find ways of processing it in real time.
  47. What we've observed is that since Kafka is a storage layer for streaming data, it is a foundational building block for scalable real-time processing.
  48. Let's take a concrete example to see why. This is a simplistic credit card fraud protection example. Let's say you have a stream of credit card transactions where every message is a credit card transaction made by some user, and let's say you have some code that determines if a transaction should be flagged, maybe because the size of the transaction doesn't match the user's profile. Depending on the result of that check, you either want to put the user's transaction on hold until verified or you want to pass it along for further processing. So you write some code that subscribes to the CreditCardTxns topic and produces two topics: ValidTxns and TxnsOnHold. And then you might have some further processing code that subscribes to the TxnsOnHold topic and triggers the verification steps. This is great for protecting individual credit card users, but you also want some ability to flag large-scale fraud.
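  A sketch of that routing step as plain consumer-plus-producer code; the fraud check itself is a toy placeholder:

     import java.time.Duration;
     import java.util.List;
     import java.util.Properties;
     import org.apache.kafka.clients.consumer.ConsumerRecord;
     import org.apache.kafka.clients.consumer.KafkaConsumer;
     import org.apache.kafka.clients.producer.KafkaProducer;
     import org.apache.kafka.clients.producer.ProducerRecord;

     public class FraudRouter {
         public static void main(String[] args) {
             Properties cProps = new Properties();
             cProps.put("bootstrap.servers", "localhost:9092");
             cProps.put("group.id", "fraud-router");
             cProps.put("key.deserializer",
                     "org.apache.kafka.common.serialization.StringDeserializer");
             cProps.put("value.deserializer",
                     "org.apache.kafka.common.serialization.StringDeserializer");
             Properties pProps = new Properties();
             pProps.put("bootstrap.servers", "localhost:9092");
             pProps.put("key.serializer",
                     "org.apache.kafka.common.serialization.StringSerializer");
             pProps.put("value.serializer",
                     "org.apache.kafka.common.serialization.StringSerializer");
             try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
                  KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
                 consumer.subscribe(List.of("CreditCardTxns"));
                 while (true) {
                     for (ConsumerRecord<String, String> txn : consumer.poll(Duration.ofMillis(500))) {
                         // Route each transaction to one of the two derived streams
                         String out = shouldFlag(txn.value()) ? "TxnsOnHold" : "ValidTxns";
                         producer.send(new ProducerRecord<>(out, txn.key(), txn.value()));
                     }
                 }
             }
         }

         // Placeholder for the real check, e.g. transaction size vs. the user's profile
         static boolean shouldFlag(String txnJson) {
             return txnJson.contains("\"amount\":9999");  // toy rule
         }
     }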
  49. Let's say that, for that purpose, monitoring the rate at which multiple users' transactions get flagged is a good metric for detecting large-scale fraud. For computing a rate on infinite streams, you need to window the data. Let's say the window here is 15 seconds. So you have some code that subscribes to the TxnsOnHold stream, buffers for 15 seconds and publishes a new stream. Then you might have some code which, for every window, counts the number of transactions by unique users and determines if the number of flagged transactions is higher than the determined safe threshold. If so, it publishes a message to an alert stream.
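  A sketch of the windowed count in plain consumer code: assign each flagged transaction to a 15-second tumbling window by its timestamp, count distinct users per window, and publish to a hypothetical FraudAlerts topic once the count crosses the threshold. The threshold value is made up, and a real version would also expire old windows:

     import java.time.Duration;
     import java.util.HashMap;
     import java.util.HashSet;
     import java.util.List;
     import java.util.Map;
     import java.util.Properties;
     import java.util.Set;
     import org.apache.kafka.clients.consumer.ConsumerRecord;
     import org.apache.kafka.clients.consumer.KafkaConsumer;
     import org.apache.kafka.clients.producer.KafkaProducer;
     import org.apache.kafka.clients.producer.ProducerRecord;

     public class FraudRateMonitor {
         public static void main(String[] args) {
             long windowMs = 15_000;  // 15-second tumbling windows
             int threshold = 100;     // illustrative safe threshold

             Properties cProps = new Properties();
             cProps.put("bootstrap.servers", "localhost:9092");
             cProps.put("group.id", "fraud-rate-monitor");
             cProps.put("key.deserializer",
                     "org.apache.kafka.common.serialization.StringDeserializer");
             cProps.put("value.deserializer",
                     "org.apache.kafka.common.serialization.StringDeserializer");
             Properties pProps = new Properties();
             pProps.put("bootstrap.servers", "localhost:9092");
             pProps.put("key.serializer",
                     "org.apache.kafka.common.serialization.StringSerializer");
             pProps.put("value.serializer",
                     "org.apache.kafka.common.serialization.StringSerializer");

             Map<Long, Set<String>> flaggedUsers = new HashMap<>();  // window start -> users
             try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
                  KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
                 consumer.subscribe(List.of("TxnsOnHold"));
                 while (true) {
                     for (ConsumerRecord<String, String> txn : consumer.poll(Duration.ofMillis(500))) {
                         long window = (txn.timestamp() / windowMs) * windowMs;
                         Set<String> users =
                                 flaggedUsers.computeIfAbsent(window, w -> new HashSet<>());
                         users.add(txn.key());  // record key = user id
                         if (users.size() == threshold) {  // fire exactly once per window
                             producer.send(new ProducerRecord<>("FraudAlerts",
                                     "window=" + window + " flaggedUsers=" + users.size()));
                         }
                     }
                 }
             }
         }
     }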
  50. Notice that this stream processing example is nothing but some code that processes streams of data to produce more streams of data. With Kafka as the means of representing these streams, this often just boils down to writing some code that uses a Kafka consumer to subscribe to topics, process data, and produce resulting data to one or more Kafka topics.
  51. However, if you have enough stream processing applications that do this, you might consider using one of several frameworks that make it easy to deploy and manage your stream processing code on a cluster of machines. There are pros and cons of each that I won't get into the details of. Each one of these frameworks integrates closely with Kafka as the backend for deploying user-written processing code on streaming data.
  52. Putting all of this together, this really boils down to a storage layer for streaming data, along with various applications and systems that evolve around it to process streaming data. And real-time or streaming data means different things for different companies: retail, finance, IoT, websites. The point of the stream data platform is to make disparate data sources available in a uniform way, no matter how they are generated. And any place in the company that has all your data ends up building an ecosystem around it. For example, Hadoop has a bunch of useful systems that integrate with it for higher-level processing and analytics. There is an equivalent ecosystem evolving around Kafka, making it the central hub for streaming data in your company. We were able to realize this vision of building a stream data platform at LinkedIn, and several companies have followed suit since then, making it a trend in data processing.
  53. The trends in data platforms have evolved over time. Back in the 1990s, data warehousing was the most important ecosystem within the company. Several analytical and reporting tools were built around it.
  54. Around 2008, a few years after Google wrote about MapReduce, there were two fundamental changes in the data world: distributed computing became much more mainstream, and that, coupled with the re-emergence of open source, led to a number of very successful open source big data technologies. Hadoop became the new gold standard for cheap data warehousing, and a number of companies have deployed Hadoop to complement their relational data warehouses.
  55. From then until now, the most interesting change is that the variety of data people are interested in is rapidly changing and is real-time in nature: event data, IoT, logs. People realized that there is value in moving away from daily reports of business metrics to more continuous real-time reporting, and in complementing batch processing with real-time stream processing, making Kafka the next big wave in data processing trends.
  56. A few months ago, some of us who created Kafka left LinkedIn to start Confluent. Our mission is to make the stream data platform and Kafka a practical reality everywhere. In terms of the product, this means having first-class stream processing integration for Kafka, connectors to common systems to enable streaming data flow, a product to help you monitor end-to-end data flow, and a meaningful metadata management and schemas story that works with the connectors. We had a first release of our product, Confluent Platform 1.0, and will have a second release coming up soon.
  56. A few months ago, some of us who created Kafka left LinkedIn to start Confluent. Our mission is to make the stream data platform and Kafka a practical reality everywhere. In terms of the product, this means - having first class stream processing integration for Kafka connectors to common systems to enable streaming data flow also a product to help you monitor end-to-end data flow and a meaningful metadata management and schemas story that works with the connectors We had a first release of our product, the Confluent Platform 1.0 and will have a second release coming up soon.