5. What is Akka Streams?
A toolkit for processing data streams on top of Akka actors
Describes a processing pipeline as a graph
Makes complex pipelines easy to define
Graph components: Source, Flow, Sink, Broadcast, Merge
Input (Source)
Generates stream elements
Fetches stream elements from outside
Processing (Flow)
Processes stream elements sent from upstream, one by one
Output (Sink)
To a file
To external resources
6. Sample code!
import akka.Done
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, ClosedShape}
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, Merge, RunnableGraph, Sink, Source}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

implicit val system = ActorSystem()
implicit val dispatcher = system.dispatcher
implicit val mat = ActorMaterializer()
// stream elements
val s3Keys = List("key1", "key2")
// a sink that prints the stream elements it receives
val sinkForeach = Sink.foreach(println)
val blueprint: RunnableGraph[Future[Done]] = RunnableGraph.fromGraph(GraphDSL.create(sinkForeach) {
  implicit builder: GraphDSL.Builder[Future[Done]] =>
  sink: Sink[String, Future[Done]]#Shape =>
    import GraphDSL.Implicits._
    // a source that emits the elements defined above
    val src = Source(s3Keys)
    // a flow that maps each received element to its URL in bucket A
    val flowA = Flow[String].map(key => s"s3://bucketA/$key")
    // a flow that maps each received element to its URL in bucket B
    val flowB = Flow[String].map(key => s"s3://bucketB/$key")
    // a junction that broadcasts received elements to 2 outlets
    val broadcast = builder.add(Broadcast[String](2))
    // a junction that merges received elements from 2 inlets
    val merge = builder.add(Merge[String](2))
    // THIS IS THE GREAT FEATURE OF GraphDSL: the graph is easy to describe
    src ~> broadcast ~> flowA ~> merge ~> sink
           broadcast ~> flowB ~> merge
    ClosedShape
})
// run the graph!
blueprint.run() onComplete { _ =>
  // terminate the actor system when the graph has completed
  Await.ready(system.terminate(), 10.seconds)
}
7. Easy to use without knowing the details of Akka actors
GOOD!
8. Akka Streams does everything implicitly
implicit val system = ActorSystem()
implicit val dispatcher = system.dispatcher
implicit val mat = ActorMaterializer()
val s3Keys = List("key1", "key2")
val sinkForeach = Sink.foreach(println)
val blueprint: RunnableGraph[Future[Done]] = RunnableGraph.fromGraph(GraphDSL.create(sinkForeach) {
implicit builder: GraphDSL.Builder[Future[Done]] =>
sink: Sink[String, Future[Done]]#Shape =>
import GraphDSL.Implicits._
val src = Source(s3Keys)
val flowA = Flow[String].map(key => s"s3://bucketA/$key")
val flowB = Flow[String].map(key => s"s3://bucketB/$key")
val broadcast = builder.add(Broadcast[String](2))
val merge = builder.add(Merge[String](2))
src ~> broadcast ~> flowA ~> merge ~> sink
broadcast ~> flowB ~> merge
ClosedShape
})
blueprint.run() onComplete { _ =>
Await.ready(system.terminate(), 10 seconds)
}
// the dispatcher dispatches threads to actors
// the materializer creates actors
When RunnableGraph#run is called, the materializer creates Akka actors
based on the blueprint, and processing starts!
9. Conclusion
Build a graph with
Source, Flow, Sink, etc.
Declare a materializer as an implicit value
RunnableGraph -> ActorMaterializer -> Actors
The graph then runs on actors
almost automatically!
10. Tips
implicit val system = ActorSystem()
implicit val dispatcher = system.dispatcher
implicit val mat = ActorMaterializer()
val s3Keys = List("key1", "key2")
val sinkForeach = Sink.foreach(println)
val blueprint: RunnableGraph[Future[Done]] = RunnableGraph.fromGraph(GraphDSL.create(sinkForeach) {
implicit builder: GraphDSL.Builder[Future[Done]] =>
sink: Sink[String, Future[Done]]#Shape =>
import GraphDSL.Implicits._
val src = Source(s3Keys)
val flowA = Flow[String].map(key => s"s3://bucketA/$key")
val flowB = Flow[String].map(key => s"s3://bucketB/$key")
val broadcast = builder.add(Broadcast[String](2))
val merge = builder.add(Merge[String](2))
src ~> broadcast ~> flowA ~> merge ~> sink
broadcast ~> flowB ~> merge
ClosedShape
})
blueprint.run() onComplete { _ =>
Await.ready(system.terminate(), 10 seconds)
}
To return a materialized value from GraphDSL, the graph
component that creates the materialized value has to
be passed to GraphDSL.create, so it must be defined
outside the GraphDSL builder block.
The process will not exit until
the ActorSystem is terminated.
Don't forget to terminate it!
If no materialized value is defined, the blueprint does not
return a completion future.
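The tips above can be sketched as follows. This is a minimal illustration, assuming Akka Streams 2.5.x is on the classpath; the object name `MaterializedValueTip`, the helper `runAndWait`, and the sample keys are made up for this example and are not part of the original slides.

```scala
import akka.{Done, NotUsed}
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, ClosedShape}
import akka.stream.scaladsl.{GraphDSL, RunnableGraph, Sink, Source}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object MaterializedValueTip {
  private val keys = List("key1", "key2") // sample elements (assumption)

  // Sink created inside the builder block: its Future[Done] is discarded,
  // so the graph only materializes NotUsed and completion cannot be observed.
  val withoutMat: RunnableGraph[NotUsed] =
    RunnableGraph.fromGraph(GraphDSL.create() { implicit builder =>
      import GraphDSL.Implicits._
      Source(keys) ~> Sink.foreach[String](println)
      ClosedShape
    })

  // Sink defined outside and passed to GraphDSL.create:
  // its Future[Done] becomes the materialized value of the whole graph.
  val sink: Sink[String, Future[Done]] = Sink.foreach[String](println)
  val withMat: RunnableGraph[Future[Done]] =
    RunnableGraph.fromGraph(GraphDSL.create(sink) { implicit builder => s =>
      import GraphDSL.Implicits._
      Source(keys) ~> s
      ClosedShape
    })

  // Runs the graph, waits for completion, then terminates the ActorSystem
  // so that the JVM process can exit.
  def runAndWait(): Done = {
    implicit val system: ActorSystem = ActorSystem("tips")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try Await.result(withMat.run(), 10.seconds)
    finally Await.result(system.terminate(), 10.seconds)
  }

  def main(args: Array[String]): Unit = println(runAndWait())
}
```

Only `withMat` gives the caller something to wait on; `withoutMat` would run the same elements but offers no completion signal.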
12. Remarkable features of Akka Streams
Asynchronous message passing
Efficient use of CPU
Back pressure
Sink -> Source: ① Request the next element
Source -> Sink: ② Send an element
Upstreams send elements only when they have
received requests from downstream.
The downstream's buffer will not overflow
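The request/send cycle can be illustrated with a toy model in plain Scala. This is only a sketch of the idea, not how Akka Streams is implemented internally; the classes `Upstream` and `Downstream` and the object `BackpressureModel` are hypothetical names introduced for this example.

```scala
import scala.collection.mutable

// Upstream side: emits at most as many elements as were requested.
final class Upstream(elements: Iterator[Int]) {
  def request(demand: Int): List[Int] = {
    val batch = List.newBuilder[Int]
    var sent = 0
    while (sent < demand && elements.hasNext) {
      batch += elements.next()
      sent += 1
    }
    batch.result()
  }
}

// Downstream side: a bounded buffer that only asks for what it can hold.
final class Downstream(capacity: Int) {
  private val buffer = mutable.Queue.empty[Int]
  var maxBuffered = 0 // high-water mark, to check that the buffer never overflows

  def freeSlots: Int = capacity - buffer.size
  def receive(batch: List[Int]): Unit = {
    buffer ++= batch
    maxBuffered = math.max(maxBuffered, buffer.size)
  }
  def processOne(): Option[Int] =
    if (buffer.isEmpty) None else Some(buffer.dequeue())
}

object BackpressureModel {
  // Returns the processed elements and the largest buffer size observed.
  def run(total: Int, capacity: Int): (List[Int], Int) = {
    val up = new Upstream(Iterator.range(0, total))
    val down = new Downstream(capacity)
    val processed = List.newBuilder[Int]
    var finished = false
    while (!finished) {
      val batch = up.request(down.freeSlots) // (1) request, (2) send
      down.receive(batch)
      down.processOne() match {
        case Some(x) => processed += x
        case None    => finished = true // nothing buffered and upstream is exhausted
      }
    }
    (processed.result(), down.maxBuffered)
  }

  def main(args: Array[String]): Unit = {
    val (out, maxBuffered) = run(total = 10, capacity = 3)
    println(s"processed=$out maxBuffered=$maxBuffered")
  }
}
```

Because the upstream never sends more than `freeSlots` elements, the buffer stays within its capacity no matter how fast the upstream could produce.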
13. What is GraphStage?
Sink -> Source: ① Request the next element
Source -> Sink: ② Send an element
Every graph component is
a GraphStage!
Not found in the Akka Streams standard library?
But still want backpressure?
Implement a custom GraphStage!
14. A source stage that emits Fibonacci numbers
import akka.stream.{Attributes, Outlet, SourceShape}
import akka.stream.stage.{GraphStage, GraphStageLogic, OutHandler}

class FibonacciSource(to: Int) extends GraphStage[SourceShape[Int]] {
  // define the shape of the graph:
  // a SourceShape with one outlet that emits Int elements
  val out: Outlet[Int] = Outlet("Fibonacci.out")
  override val shape = SourceShape(out)
  // a new GraphStageLogic instance is created every time RunnableGraph#run is called,
  // so mutable state must be initialized within the GraphStageLogic
  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      var fn_2 = 0
      var fn_1 = 0
      var n = 0
      setHandler(out, new OutHandler {
        // called every time a request arrives from downstream (backpressure)
        override def onPull(): Unit = {
          val fn =
            if (n == 0) 0
            else if (n == 1) 1
            else fn_2 + fn_1
          // terminate this stage with completion
          if (fn >= to) completeStage()
          // send an element to the downstream
          else push(out, fn)
          fn_2 = fn_1
          fn_1 = fn
          n += 1
        }
      })
    }
}
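A minimal sketch of how such a stage might be used, assuming Akka Streams 2.5.x on the classpath. The stage is repeated in compact form so the snippet stands alone; the object `FibonacciDemo` and its helper `runTwice` are names introduced here for illustration. Materializing the same source twice shows that each run gets a fresh `GraphStageLogic`.

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, Attributes, Outlet, SourceShape}
import akka.stream.scaladsl.{Sink, Source}
import akka.stream.stage.{GraphStage, GraphStageLogic, OutHandler}
import scala.concurrent.Await
import scala.concurrent.duration._

// Same logic as the stage on the slide, repeated here for self-containment.
class FibonacciSource(to: Int) extends GraphStage[SourceShape[Int]] {
  val out: Outlet[Int] = Outlet("Fibonacci.out")
  override val shape: SourceShape[Int] = SourceShape(out)
  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      var fn_2 = 0
      var fn_1 = 0
      var n = 0
      setHandler(out, new OutHandler {
        override def onPull(): Unit = {
          val fn = if (n == 0) 0 else if (n == 1) 1 else fn_2 + fn_1
          if (fn >= to) completeStage() else push(out, fn)
          fn_2 = fn_1
          fn_1 = fn
          n += 1
        }
      })
    }
}

object FibonacciDemo {
  // Materializes the same source blueprint twice and collects both outputs.
  def runTwice(to: Int): (Seq[Int], Seq[Int]) = {
    implicit val system: ActorSystem = ActorSystem("fibonacci")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try {
      val source = Source.fromGraph(new FibonacciSource(to))
      val first = Await.result(source.runWith(Sink.seq[Int]), 10.seconds)
      val second = Await.result(source.runWith(Sink.seq[Int]), 10.seconds)
      (first, second)
    } finally Await.result(system.terminate(), 10.seconds)
  }

  def main(args: Array[String]): Unit = {
    val (first, second) = runTwice(100)
    println(first.mkString(", "))  // 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89
    println(first == second)       // true: state is not shared between runs
  }
}
```

If the counters lived in the `GraphStage` class instead of the `GraphStageLogic`, the second run would continue from where the first one stopped.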
16. Connect S3 with Kafka
Docker Container
Direct connect
Put 2.5 TB/day! Must be scalable
17. Our architecture
Direct connect
① Notify Created Events
② Receive object keys to ingest
… ③ Download
④ Produce
Distributes object keys to containers
(works as a load balancer)
18. Amazon SQS
At least once
= Sometimes duplicated
Load balancing
Once an event is read, it becomes invisible, and
basically no other consumer receives
the same event until the visibility timeout has passed
Elements are not deleted until an Ack is sent
A message can be retried by not sending an Ack when a failure occurs
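The delivery semantics described above can be sketched with a toy in-memory queue in plain Scala. This is not the AWS SDK or the Alpakka connector; `ToyQueue`, `AtLeastOnceDemo`, and the use of a plain number as the clock are assumptions made only for illustration.

```scala
import scala.collection.mutable

// Toy model: a message stays in the queue until it is acked, and is hidden
// from consumers for `visibilityTimeout` time units after each delivery.
final class ToyQueue(visibilityTimeout: Long) {
  private final class Entry(val body: String, var invisibleUntil: Long)
  private val entries = mutable.LinkedHashMap.empty[Int, Entry]
  private var nextId = 0

  def send(body: String): Unit = {
    entries(nextId) = new Entry(body, 0L)
    nextId += 1
  }

  // Returns the first visible message and hides it until now + visibilityTimeout.
  def receive(now: Long): Option[(Int, String)] =
    entries.find { case (_, e) => e.invisibleUntil <= now }.map { case (id, e) =>
      e.invisibleUntil = now + visibilityTimeout
      (id, e.body)
    }

  // Ack deletes the message; without an Ack it becomes visible again later.
  def ack(id: Int): Unit = { entries -= id; () }

  def size: Int = entries.size
}

object AtLeastOnceDemo {
  // Returns what each receive call saw, plus the final queue size.
  def run(): (Option[String], Option[String], Option[String], Int) = {
    val queue = new ToyQueue(visibilityTimeout = 30L)
    queue.send("key1")
    val first = queue.receive(now = 0L)   // delivered; processing fails, so no Ack is sent
    val hidden = queue.receive(now = 10L) // still invisible to every consumer
    val retry = queue.receive(now = 30L)  // visible again: the same message is redelivered
    retry.foreach { case (id, _) => queue.ack(id) } // success this time: Ack deletes it
    (first.map(_._2), hidden.map(_._2), retry.map(_._2), queue.size)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The redelivery at `now = 30` is the "sometimes duplicated" case: a consumer that is merely slow, rather than failed, would see the same event processed twice.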
19. Alpakka (Implementation of GraphStages)
SQS Connector
• Read events from SQS
• Ack
S3 Connector
• Download the content of an S3 object
Reactive Kafka
• Produce content to Kafka
Various connector libraries!!
https://github.com/akka/alpakka/tree/master/sqs
https://github.com/akka/alpakka/tree/master/s3
https://github.com/akka/reactive-kafka
20. S3 → Kafka
// Alpakka S3 connector
val src: Source[ByteString, NotUsed] =
  S3Client().download(bucket, key)
// a built-in flow that decompresses gzipped content
val decompress: Flow[ByteString, ByteString, NotUsed] =
  Compression.gunzip()
// a built-in flow that divides file content into lines
val lineFraming: Flow[ByteString, ByteString, NotUsed] =
  Framing.delimiter(delimiter = ByteString("\n"),
    maximumFrameLength = 65536, allowTruncation = false)
// Reactive Kafka producer sink (Producer.plainSink consumes ProducerRecord elements)
val sink: Sink[ProducerRecord[Array[Byte], Array[Byte]], Future[Done]] =
  Producer.plainSink(producerSettings)
val blueprint: RunnableGraph[Future[String]] = src
  .via(decompress)
  .via(lineFraming)
  // convert each binary line to a Kafka ProducerRecord
  .via(Flow[ByteString]
    .map(_.toArray)
    .map { record =>
      new ProducerRecord[Array[Byte], Array[Byte]](conf.topic, record)
    })
  .toMat(sink)(Keep.right)
  // return a future of the completed object key when blueprint.run() is called
  .mapMaterializedValue { done =>
    done.map(_ => objectLocation)
  }
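The decompress-and-frame part of this pipeline can be tried without S3 or Kafka. The sketch below assumes Akka Streams 2.5.x on the classpath and replaces the S3 download with an in-memory string that is gzipped on the fly; `GunzipLinesDemo` and `run` are names introduced for this example.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Compression, Flow, Framing, Sink, Source}
import akka.util.ByteString
import scala.concurrent.Await
import scala.concurrent.duration._

object GunzipLinesDemo {
  // Same framing settings as on the slide: split on "\n", fail on frames over 64 KiB.
  val lineFraming: Flow[ByteString, ByteString, NotUsed] =
    Framing.delimiter(delimiter = ByteString("\n"),
      maximumFrameLength = 65536, allowTruncation = false)

  // Gzips `text` in memory (a stand-in for a gzipped S3 object),
  // then decompresses it and splits it into lines.
  def run(text: String): Seq[String] = {
    implicit val system: ActorSystem = ActorSystem("gunzip-lines")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try {
      val lines = Source.single(ByteString(text))
        .via(Compression.gzip)
        .via(Compression.gunzip())
        .via(lineFraming)
        .map(_.utf8String)
        .runWith(Sink.seq[String])
      Await.result(lines, 10.seconds)
    } finally Await.result(system.terminate(), 10.seconds)
  }

  def main(args: Array[String]): Unit =
    run("line1\nline2\nline3\n").foreach(println)
}
```

With `allowTruncation = false`, input whose last line lacks a trailing newline would fail the stream, so the sample text ends with `\n`.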
21. Overall
// on failure: log the error, skip the failed element, and resume
implicit val mat: Materializer = ActorMaterializer(
  ActorMaterializerSettings(system).withSupervisionStrategy( ex => ex match {
    case ex: Throwable =>
      system.log.error(ex, "an error occurs - skip and resume")
      Supervision.Resume
  })
)
// Alpakka SqsSource
val src = SqsSource(queueUrl)
// Alpakka SqsAckSink
val sink = SqsAckSink(queueUrl)
val blueprint: RunnableGraph[Future[Done]] =
  src
    // parse an SQS message into the keys of the S3 objects to consume
    .via(Flow[Message].map(parse)
      .mapAsyncUnordered(concurrency) { case (msg, events) =>
        Future.sequence(
          events.collect {
            case event: S3Created =>
              // run the S3 -> Kafka graph
              S3KafkaGraph(event.location).run() map { completedLocation =>
                // delete the successfully produced file
                s3.deleteObject(completedLocation.bucket, completedLocation.key)
              }
          }
        // Ack a successfully handled message
        ) map (_ => msg -> Ack())
      })
    .toMat(sink)(Keep.right)
Workaround for duplication in SQS: with the Resume supervision strategy,
the app keeps going and ignores failed messages
(such messages become visible again after the
visibility timeout, but are deleted after the retention period)
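The effect of `Supervision.Resume` on `mapAsyncUnordered` can be seen in isolation. The sketch below assumes Akka Streams 2.5.x on the classpath; the numbers stand in for SQS messages, the failing element stands in for a failed S3 -> Kafka run, and `ResumeDemo` is a name introduced for this example.

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, ActorMaterializerSettings, Supervision}
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object ResumeDemo {
  def run(): Seq[Int] = {
    implicit val system: ActorSystem = ActorSystem("resume-demo")
    // Resume on any failure: the failed element is dropped and the stream continues.
    val decider: Supervision.Decider = { case _: Throwable => Supervision.Resume }
    implicit val mat: ActorMaterializer =
      ActorMaterializer(ActorMaterializerSettings(system).withSupervisionStrategy(decider))
    import system.dispatcher // ExecutionContext for the Futures below
    try {
      val result = Source(1 to 5)
        .mapAsyncUnordered(parallelism = 2) { n =>
          Future {
            if (n == 3) throw new RuntimeException(s"failed to handle $n")
            else n * 10
          }
        }
        .runWith(Sink.seq[Int])
      // mapAsyncUnordered does not preserve order, so sort before returning
      Await.result(result, 10.seconds).sorted
    } finally Await.result(system.terminate(), 10.seconds)
  }

  def main(args: Array[String]): Unit = println(run().mkString(", "))
}
```

Element 3 never reaches the sink, which mirrors how a failed message is never acked and is left for SQS to redeliver or expire.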
22. Efficiency
Handles 3 TB/day of data with 24 cores!
Direct connect
① Notify Created Events
② Receive object locations to ingest
… ③ Download
④ Produce
25. Gists
Sample code of GraphDSL (first example)
FibonacciSource
FlowStage with buffer (not in these slides)
https://gist.github.com/Saint1991/d2737721551bc908f48b08e15f0b12d4
https://gist.github.com/Saint1991/2aa5841eea5669e8b86a5eb2df8ecb15
https://gist.github.com/Saint1991/29d097f83942d52b598cda20372ad671