Over the past few months, the Apache Flink and Apache Beam communities have been busy developing an industry-leading solution for authoring batch and streaming pipelines in Python. This was made possible by a significant effort to revamp Beam’s portability framework, build the corresponding Flink Runner, and simplify Flink’s artifact distribution and deployment mechanisms.
The “killer big-data app” enabled by this integration is production TensorFlow pipelines. Building production machine learning pipelines that process large distributed data sets can get complex. In this talk, we will describe a set of open source libraries developed at Google that simplify and unify the pre- and post-processing stages of a production TensorFlow pipeline. These libraries are built on Beam’s Python SDK and can be run on Apache Flink at scale.
Last but not least, we will describe how Beam and Flink aim to bring the power of big data to new audiences, in particular developers using the Go programming language.