4. Medialets
• Largest deployment of rich media ads for mobile devices
• Installed on hundreds of millions of devices
• 3-4 TB of new data every day
• Thousands of services in production
• Hundreds of thousands of events received every second
• Response times are measured in microseconds
• Languages
– 55% JVM (70% Scala & 30% Java)
– 20% C/C++
– 13% Python
– 10% Ruby
– 2% Bash
6. A distributed publish-subscribe messaging system
• Originally created by LinkedIn; contributed to Apache in
July 2011 and currently in incubation
• Kafka is written in Scala
• Multi-language support for the producer/consumer APIs
(Scala, Java, Ruby, Python, C++, Go, PHP, etc.)
8. Offline log aggregation and real-time messaging
Other “log-aggregation only” systems (e.g. Scribe and Flume) are
architected around “push” to drive the data.
– High performance and scale, however:
• The expected end points are large systems (e.g. Hadoop)
• End points can’t run much business logic in real time,
because they have to consume as fast as data is pushed to
them… unless consuming the data is their main job
• Messaging systems (e.g. RabbitMQ, ActiveMQ)
– Do not scale:
• No API for batching; delivery is transactional (the broker
retains each consumer’s stream position)
• No message persistence, so multiple consumers over
time are impossible, limiting the architecture
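To make the contrast concrete, here is a toy sketch of the pull model Kafka uses instead (this is not the Kafka API; all names here are invented for illustration): messages persist in an append-only log, and each consumer tracks its own offset, so any number of consumers can read the same data at their own pace.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy append-only log: the broker keeps messages and does not delete
// them on delivery, so consumption is just a read at an offset.
class ToyLog {
  private val entries = ArrayBuffer[String]()
  def append(msg: String): Unit = entries += msg
  // One pull request: fetch up to `max` messages starting at `offset`.
  def fetch(offset: Int, max: Int): Seq[String] =
    entries.slice(offset, offset + max).toSeq
}

val log = new ToyLog
Seq("a", "b", "c", "d").foreach(log.append)

// Two independent consumers, each owning its own position in the log.
var fast = 0
var slow = 0
fast += log.fetch(fast, 4).size // catches up immediately
slow += log.fetch(slow, 2).size // lags behind, yet nothing is lost
println(s"fast at offset $fast, slow at offset $slow")
```

Because the consumer owns its position, a slow or restarted consumer simply resumes fetching from its last offset; a push-based system without persistence cannot offer this.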
9. All-in-one system with one architecture and one API
• Kafka is a specialized system that covers the overlapping use
cases of both offline and real-time log processing.
11. Performance & Scale
• Producer Test:
– LinkedIn configured the broker in all systems to asynchronously
flush messages to its persistence store.
– For each system, they ran a single producer to publish a total of 10
million messages, each of 200 bytes.
– They configured the Kafka producer to send messages in batches of
size 1 and 50. ActiveMQ and RabbitMQ don’t seem to have an easy
way to batch messages, so they were assumed to use a batch size
of 1.
– In the next slide, the x-axis represents the amount of data sent to
the broker over time in MB, and the y-axis corresponds to the
producer throughput in messages per second. On average, Kafka
can publish messages at rates of 50,000 and 400,000 messages
per second for batch sizes of 1 and 50, respectively.
http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
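For reference, client-side batching in the 0.7-era producer is driven by configuration. The property names below follow the Kafka 0.7 documentation, but verify them against your version:

```properties
# Producer properties sketch (Kafka 0.7-era names; verify per version)
serializer.class=kafka.serializer.StringEncoder
zk.connect=localhost:2181
# async mode buffers messages client-side and sends them in batches
producer.type=async
# messages per batch -- the knob varied between 1 and 50 in the test above
batch.size=50
# maximum time (ms) a message may sit in the buffer before a send
queue.time=5000
```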
13. Performance & Scale
• Consumer Test:
– In the second experiment, LinkedIn tested the performance
of the consumer. Again, for all systems, they used a single
consumer to retrieve a total of 10 million messages.
– They configured all systems so that each pull request
would prefetch approximately the same amount of data, up
to 1,000 messages or about 200KB.
– For both ActiveMQ and RabbitMQ, they set the consumer
acknowledge mode to be automatic. Since all messages fit
in memory, all systems were serving data from the page
cache of the underlying file system or some in-memory
buffers.
http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
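The prefetch size on the consumer side is likewise a configuration knob; a sketch with 0.7-era property names (verify against your version):

```properties
# Consumer properties sketch (Kafka 0.7-era names; verify per version)
zk.connect=localhost:2181
groupid=test-consumer-group
# bytes prefetched per pull request (~200KB, matching the test setup)
fetch.size=204800
```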
16. Performance & Scale
• Producing:
– The Kafka producer currently doesn’t wait for acknowledgements from the broker
and sends messages as fast as the broker can handle.
• This is a valid optimization for the log aggregation case, as data must be
sent asynchronously to avoid introducing any latency into the live serving of
traffic. Note that without acknowledgements to the producer, there is no
guarantee that every published message is actually received by the broker.
• For many types of log data, it is desirable to trade durability for throughput,
as long as the number of dropped messages is relatively small.
• Durability through replication is being addressed in 0.8
– Kafka has a very efficient storage format.
http://incubator.apache.org/kafka/design.html
– Batching
• Consuming:
– Kafka does not maintain per-message delivery state on the broker, which
means zero writes for each consumed message.
– Kafka uses the sendfile API, so bytes are transferred from disk to socket
entirely within kernel space, avoiding the extra copies and switches
between kernel and user space.
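On the JVM, sendfile is exposed as FileChannel.transferTo, which is the call Kafka’s broker relies on to move log data to consumer sockets. A minimal self-contained sketch (a temp-file sink stands in for the socket so it runs anywhere):

```scala
import java.io.{File, FileOutputStream, RandomAccessFile}

// Write a small "log segment" to a temp file.
val segment = File.createTempFile("kafka-segment", ".log")
val writer = new FileOutputStream(segment)
writer.write("hello, consumer".getBytes("UTF-8"))
writer.close()

// transferTo maps to sendfile(2) on Linux: bytes move from the page
// cache to the destination channel without entering user space.
val in = new RandomAccessFile(segment, "r").getChannel
val sink = File.createTempFile("socket-stand-in", ".bin")
val out = new FileOutputStream(sink).getChannel
val transferred = in.transferTo(0, in.size, out)
in.close(); out.close()
println(s"transferred $transferred bytes with zero user-space copies")
```

In the broker the destination is a SocketChannel rather than a file, but the zero-copy path is the same.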
18. Producer
core/src/main/scala/kafka/tools/ProducerShell.scala
/**
* Interactive shell for producing messages from the command line
*/
// config setup
val propsFile = options.valueOf(producerPropsOpt)
val producerConfig = new ProducerConfig(Utils.loadProps(propsFile))
val topic = options.valueOf(topicOpt)
val producer = new Producer[String, String](producerConfig)
val input = new BufferedReader(new InputStreamReader(System.in))
var done = false
while (!done) {
  val line = input.readLine()
  if (line == null) {
    done = true
  } else {
    val message = line.trim
    producer.send(new ProducerData[String, String](topic, message))
    println("Sent: %s (%d bytes)".format(line, message.getBytes.length))
  }
}
producer.close()
19. Consumer
core/src/main/scala/kafka/consumer/ConsoleConsumer.scala
/**
* Consumer that dumps messages out to standard out.
*/
val connector = Consumer.create(config) // kafka.consumer.ConsumerConnector
val stream = connector.createMessageStreamsByFilter(filterSpec).get(0)
val iter = if (maxMessages >= 0)
  stream.slice(0, maxMessages)
else
  stream
val formatter: MessageFormatter = messageFormatterClass.newInstance().asInstanceOf[MessageFormatter]
formatter.init(formatterArgs)
try {
  for (messageAndTopic <- iter) {
    try {
      formatter.writeTo(messageAndTopic.message, System.out)
    } catch {
      case e =>
        if (skipMessageOnError)
          error("Error processing message, skipping this message: ", e)
        else
          throw e
    }
    if (System.out.checkError()) {
      // This means no one is listening to our output stream any more, time to shut down
      System.err.println("Unable to write to standard out, closing consumer.")
      formatter.close()
      connector.shutdown()
      System.exit(1)
    }
  }
} catch {
  case e => error("Error processing message, stopping consumer: ", e)
}
20. Running everything
Download Kafka Source
• http://incubator.apache.org/kafka/downloads.html
Open a Terminal
• cp ~/Downloads/kafka-0.7.1-incubating-src.tgz .
• tar -xvf kafka-0.7.1-incubating-src.tgz
• cd kafka-0.7.1-incubating
• ./sbt update
• ./sbt package
Open 3 more terminals (see http://incubator.apache.org/kafka/quickstart.html)
• Terminal 1
– bin/zookeeper-server-start.sh config/zookeeper.properties
• Terminal 2
– bin/kafka-server-start.sh config/server.properties
• Terminal 3
– bin/kafka-producer-shell.sh --props config/producer.properties --topic scalathon
– start typing
• Terminal 4
– bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic scalathon
• Terminal 3
– Type some more
• Terminal 4
– See what you just typed
– Ctrl+c
– bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic scalathon --from-beginning
– See EVERYTHING you have typed
21. We are hiring!
/*
Joe Stein, Chief Architect
http://www.medialets.com
Twitter: @allthingshadoop
*/
Medialets
The rich media ad platform for mobile.
connect@medialets.com
www.medialets.com/showcase