Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Kafka Summit NYC 2019
In mission-critical real time applications, using machine learning to analyze streaming data is gaining momentum. In those applications, Apache Kafka is the most widely used framework for processing the data streams. It typically works with other machine learning frameworks for model inference and training purposes. In this talk, our focus is the KafkaDataset module in TensorFlow. KafkaDataset feeds Kafka streaming data directly into TensorFlow’s graph. As part of TensorFlow (in ‘tf.contrib’), the implementation of KafkaDataset is mostly written in C++. The module exposes a machine-learning-friendly Python interface through TensorFlow’s ‘tf.data’ API, so it can be fed directly to ‘tf.keras’ and other TensorFlow modules for training and inference. Combined with Kafka streaming itself, the KafkaDataset module in TensorFlow removes the need for an intermediate data processing infrastructure. This helps many mission-critical real time applications adopt machine learning more easily. At the end of the talk we will walk through a concrete example with a demo to showcase the usage we described.

  1. Machine Learning for Real Time Streaming Data with Kafka and TensorFlow. Yong Tang, Maintainer & SIG IO Lead, TensorFlow. GitHub: yongtang
  2. Stream Data Processing & Machine Learning
  3. Apache Kafka & Event Driven Microservices (diagram of services communicating in an event-driven architecture)
  4. TensorFlow & 2.0 • Most popular machine learning framework • 1.x • 2.0 • Keras high-level API • Eager execution • Data processing: tf.data • Distribution Strategy
  5. Training Workflow (TensorFlow 2.0): Read and Preprocess Data (tf.data) → Model Building (tf.keras, tf.estimator) → Training (eager execution, autograph, distribution strategy) → Saving (saved model)
  6. # <= read data into tf.data
     dataset = …. , test_dataset = ….
     # <= build keras model
     model = tf.keras.models.Sequential([
         tf.keras.layers.Flatten(input_shape=(28, 28)),
         tf.keras.layers.Dense(512, activation=tf.nn.relu),
         tf.keras.layers.Dropout(0.2),
         tf.keras.layers.Dense(10, activation=tf.nn.softmax)
     ])
     model.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
     # <= train the model
     model.fit(dataset, epochs=5)
     # <= inference
     predictions = model.predict(test_dataset, callbacks=[callback])
     (a sketch of one way the elided dataset lines could be filled in follows the transcript)
  7. Stream Data + Machine Learning: Challenges. Streaming & big data: • Hadoop/Zookeeper • Kafka, Spark, Flink • Java/Scala • Parquet/Avro/Arrow. Machine learning: • CPU/GPU/TPU • Python (C++, CUDA) • NumPy • TFRecord/CSV/Text/JPEG/PNG
  8. TensorFlow TFRecord Format • Sequence of binary strings • Protocol buffer for serialization • Flexible and simple • ONLY used by TensorFlow • NO ecosystem support, unlike Parquet, etc. (see the TFRecord read/write sketch after the transcript)
  9. Infrastructure Solution (diagram: services produce to Kafka, messages are converted to TFRecord files on persistent storage / a file system for TensorFlow, and results are written back to Kafka)
  10. tf.data in TensorFlow • Part of the graph in TensorFlow • Supports Eager and Graph mode • Focus on data transformation • Examples: tf.data.TFRecordDataset, tf.data.TextLineDataset, tf.data.FixedLengthRecordDataset (see the tf.data pipeline sketch after the transcript)
  11. KafkaDataset for TensorFlow • Subclass of tf.data.Dataset • Written in C++ (linked with librdkafka) • Part of the graph in TensorFlow • Supports Eager and Graph mode
  12. KafkaDataset for TensorFlow • Part of the tensorflow package in TF 1.x (tf.contrib.kafka) • Part of the tensorflow-io package in TF 2.0 • KafkaDataset (Read) • KafkaOutputSequence (Write) • Python and R (other language support possible)
  13. $ pip install tensorflow-io

      import tensorflow_io.kafka as kafka_io
      dataset = kafka_io.KafkaDataset('topic', servers='localhost', group='')
      # <= pre-process data (if needed), batch/repeat/etc.
      dataset = dataset.map(lambda x: ……)
      # <= build keras model
      model = tf.keras.models…
      model.compile(…)
      # <= train the model
      model.fit(dataset, epochs=5)
  14. # continue with inference…
      # build keras callback
      class OutputCallback(tf.keras.callbacks.Callback):
          def __init__(self, batch_size, topic, servers):
              self._sequence = kafka_io.KafkaOutputSequence(
                  topic=topic, servers=servers)
              self._batch_size = batch_size
          def on_predict_batch_end(self, batch, logs=None):
              ...
              self._sequence.setitem(index, class_names[np.argmax(output)])
      # <= inference with callback for streaming input and output
      model.predict(
          test_dataset, callbacks=[OutputCallback(32, 'topic', 'localhost')])
      (an end-to-end sketch filling in the elided pieces of slides 13-14 follows the transcript)
  15. KafkaDataset for TensorFlow (diagram: services and TensorFlow exchange streaming data directly through Kafka, with no intermediate storage)
  16. TensorFlow SIG I/O • https://github.com/tensorflow/io • Special Interest Group under the TensorFlow org • Focus on I/O, streaming, and data formats • Streaming: Apache Kafka, AWS Kinesis, Google Cloud PubSub • Memory, caching and serialization: Apache Ignite, Apache Arrow, Apache Parquet • File formats: MNIST, WebP/TIFF/etc., Video, Audio • Cloud storage: GCS, S3, Azure
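
The dataset lines on slide 6 are elided on the slide itself. Below is a minimal sketch of one way they could be filled in; the choice of the MNIST digits dataset and the batch size of 32 are assumptions made here for illustration, not something stated in the talk.

import tensorflow as tf

# Load MNIST (assumed here; the slide does not say which dataset is used)
# and scale pixel values to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Wrap the arrays in tf.data pipelines and batch them, matching the
# model.fit(dataset, ...) / model.predict(test_dataset, ...) calls on slide 6.
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(32)
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(32)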
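
Slide 8 describes TFRecord as a sequence of binary strings serialized with protocol buffers. Here is a small sketch of writing and reading such a file; the file name, feature layout, and dummy records are assumptions chosen for illustration.

import tensorflow as tf

def make_example(image_bytes, label):
    # Pack one (image, label) pair into a tf.train.Example protocol buffer.
    return tf.train.Example(features=tf.train.Features(feature={
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))

# Each record written to the file is just the serialized proto, i.e. a binary string.
with tf.io.TFRecordWriter('images.tfrecord') as writer:
    for label in (0, 1):
        writer.write(make_example(b'\x00' * 784, label).SerializeToString())

# Read the records back as a stream of binary strings.
dataset = tf.data.TFRecordDataset('images.tfrecord')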
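
For slide 10, a sketch of the kind of data-transformation pipeline tf.data focuses on: parsing the records written in the previous sketch, then shuffling and batching. The feature spec matches the assumed Example layout above, so the image/label names and shapes are illustrative rather than prescribed by the talk.

import tensorflow as tf

feature_spec = {
    'image': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64),
}

def parse(serialized):
    # Deserialize one Example proto and decode the raw image bytes.
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_raw(parsed['image'], tf.uint8)
    image = tf.cast(tf.reshape(image, (28, 28)), tf.float32) / 255.0
    return image, parsed['label']

# The same pipeline object can be consumed in both eager and graph mode.
dataset = (tf.data.TFRecordDataset('images.tfrecord')
           .map(parse)
           .shuffle(buffer_size=1024)
           .batch(32))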
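
Finally, a rough end-to-end sketch filling in the pieces elided on slides 13-14. The topic names, broker address, raw-uint8 message encoding, Fashion-MNIST class names, saved-model path, and the use of logs['outputs'] in the predict callback are all assumptions tied to the tensorflow-io 0.x API and TF 2.0 behavior from the time of the talk, not something stated verbatim on the slides.

import numpy as np
import tensorflow as tf
import tensorflow_io.kafka as kafka_io

# Assumed Fashion-MNIST class names for the predictions written back to Kafka.
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

def decode(message):
    # Assumption: each Kafka message carries a 28x28 image as 784 raw uint8 bytes.
    image = tf.io.decode_raw(message, tf.uint8)
    return tf.cast(tf.reshape(image, (28, 28)), tf.float32) / 255.0

# Read the input stream directly from Kafka into the TensorFlow graph.
# Subscriptions may also be given as 'topic:partition:offset:length'.
test_dataset = kafka_io.KafkaDataset(['input-topic'], servers='localhost', group='')
test_dataset = test_dataset.map(decode).batch(32)

class OutputCallback(tf.keras.callbacks.Callback):
    """Streams each batch of predictions back to a Kafka topic."""
    def __init__(self, batch_size, topic, servers):
        self._sequence = kafka_io.KafkaOutputSequence(topic=topic, servers=servers)
        self._batch_size = batch_size

    def on_predict_batch_end(self, batch, logs=None):
        # Assumption: in TF 2.0 the predict callback receives this batch's
        # model outputs in logs['outputs']; some versions wrap them in a list.
        outputs = logs['outputs']
        if isinstance(outputs, (list, tuple)):
            outputs = outputs[0]
        index = batch * self._batch_size
        for output in outputs:
            self._sequence.setitem(index, class_names[np.argmax(output)])
            index += 1

    def flush(self):
        # Make sure all produced messages reach the broker.
        self._sequence.flush()

model = tf.keras.models.load_model('trained_model')  # assumed path to a saved model
callback = OutputCallback(32, 'output-topic', 'localhost')
model.predict(test_dataset, callbacks=[callback])
callback.flush()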