5. Who Am I?
Member of Netflix’s Platform
Engineering team, working on
large scale data infrastructure
(@g9yuayon)
Built and operated Netflix’s
cloud crypto service
Worked with Jae Bae on
querying multi-dimensional data
in real time
Friday, March 1, 13 2
I built and operated Netflix’s cloud crypto service, which manages pretty much all the keys Netflix uses in the cloud; that translates to billions of requests per day.
6. Use Cases
Real-time Operational Metrics
Business or Product Insights
We’re going to discuss two types of use cases today: real-time operational metrics, and business or product insights. By the way, who would have guessed that Canadians’ number one search query would be 90210?
9. What Are Log Events?
Field Name          Field Value
ClientApplication   “API”
ServerApplication   “Cryptex”
StatusCode          200
ResponseTime        73
Before we dive into the use cases, let me explain what our log data looks like. A lot of Netflix’s log data can be represented as “events”, and Netflix applications send hundreds of different types of log events every day.
A log event is really just a set of fields. A field has a name and a value, and the value itself can be a string, a number, or a nested set of fields.
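To make the table above concrete, here is a sketch of such an event in Python. The real events inside Netflix’s platform are Java objects, so this is purely illustrative; the nested field is an invented example.

```python
# A log event sketched as a set of named fields. The flat field names
# mirror the slide's table; "Nested" is a made-up example showing that
# a value can itself be a set of fields.
event = {
    "ClientApplication": "API",
    "ServerApplication": "Cryptex",
    "StatusCode": 200,
    "ResponseTime": 73,          # e.g. milliseconds
    "Nested": {                  # a value can itself be a set of fields
        "Region": "us-east-1",
    },
}

# A field has a name and a value; values are strings, numbers,
# or nested sets of fields.
assert event["ServerApplication"] == "Cryptex"
assert isinstance(event["Nested"], dict)
```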
11. Tens of Thousands of Servers Come and Go
12. Highly Reliable Collectors Collect Log Events from All Servers
13. Dynamically Configurable Destinations
Server Farm
Server Farm   →   Log Collectors   →   Hadoop / Kafka / HTTP Endpoints
Server Farm
Inside Netflix, hundreds of applications run on tens of thousands of machines. Machines come and go all the time, but they all generate tons of application log events and send them to highly reliable data collectors. The collectors, in turn, send the data to various destinations.
14. Netflix is a log-generating company
that also happens to stream movies
- Adrian Cockcroft
As Adrian used to say, Netflix is a log-generating company that also happens to stream movies. Because we have a vast amount of logs from different applications, we also have a treasure trove: numerous teams (BI, operations, product development, data science) mine this data all the time. To put that into perspective, let me share some numbers.
15. 1,500,000
During peak hours, our data pipeline collects over 1.5 million log events per second.
17. Making Sense of Billions of Events
Making sense of such a vast amount of information is a continuing challenge for Netflix. After all, most of the time it is not feasible to look at individual log events to get anything useful out of them. We’ve got to have intelligent ways to digest our data.
19. We’ve Got Tools
Over the past couple of years Netflix has built numerous tools to help us.
We have Turbine, a real-time dashboard for application metrics on live machines; it is also open sourced, by the way.
We have Atlas, our monitoring solution, which handles millions of application metrics every second.
We have CSI, which uses a number of machine-learning algorithms to identify correlations and trends in monitored data.
We have Biopsys, which searches logs on multiple live servers and streams the results back to a user’s browser.
We also have Hadoop and Hive, of course. The DSE team has built a number of tools to make Hadoop as easy to use as possible, and we even have DSE Sting, which visualizes the results of Hive queries.
And we had a log-summarization service that alerts people about the top error-generating services.
These tools, however, give static snapshots of data that we can’t easily drill into, and they are usually half an hour late.
30. What Is Missing?
Why do we need yet another tool, then? The key question is: what is missing?
31. Interactive Exploration
For one thing: interactive exploration. Sometimes we want the data in real time so we can act quickly; some data is only useful in a small time window, after all. Sometimes we want to run lots of experimental queries just to find the right insight, and if we wait too long for each query to come back, we can’t iterate fast enough. Either way, we need query results back in seconds.
32. Getting Results Back in Seconds
Because aggregation is out of the way, we can simply de-duplicate the error messages and index them in a search engine. So you get the best of both worlds: an instant error report and an instant error search engine.
34. Getting Results Back in Seconds
150,000
Here is one example: we process more than 150,000 device-activity events per second. What if we’d like to know, geographically, how many users started playing videos in the past five minutes? I submit my query, and in a few seconds...
The globe is divided into a 1600x800 grid; each client activity’s coordinates are mapped to a grid cell, and the activity is counted there.
But this is an aggregated view. What if I want to drill down into the data immediately along different dimensions? In this particular case, to find failed attempts on our Silverlight players that run on PCs and Macs.
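The grid mapping described above can be sketched like this. The talk does not spell out the projection, so the simple linear longitude/latitude mapping below is an assumption, not Netflix’s actual implementation.

```python
# Map each client activity's (longitude, latitude) onto a 1600x800
# global grid and count activities per cell. The linear mapping is an
# assumed projection for illustration.
from collections import Counter

GRID_W, GRID_H = 1600, 800

def to_cell(lon, lat):
    """Map a (longitude, latitude) pair to an integer grid cell."""
    x = int((lon + 180.0) / 360.0 * GRID_W)
    y = int((lat + 90.0) / 180.0 * GRID_H)
    # Clamp the boundary case lon == 180 or lat == 90 into the last cell.
    return min(x, GRID_W - 1), min(y, GRID_H - 1)

counts = Counter()
for lon, lat in [(-122.4, 37.8), (-122.4, 37.8), (2.35, 48.86)]:
    counts[to_cell(lon, lat)] += 1   # two plays near SF, one near Paris
```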
38. Querying Data Along Different Dimensions
And from the same events, we can get answers to different questions. How many people started viewing House of Cards in the past six hours?
42. Discover Outstanding Data
HTTP 500
There are three fundamental questions we usually want to answer from a large amount of data. The first is finding the outstanding data. For a small number of rows, a summary table is enough, but for a large amount of data even the summary table can be huge, and a lot of it may be noise. A Top N query really helps here. For example, wouldn’t you want to know which applications generated most of the errors in the past few seconds? Now that is timely feedback. Let me share a more complete example.
Hundreds of thousands of requests captured.
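The Top N question above can be sketched in a few lines of Python; the events and application names here are made up for illustration.

```python
# Top N sketch: which applications generated the most HTTP 5xx errors
# in the current window? The sample events are invented.
from collections import Counter

events = [
    {"ServerApplication": "Cryptex", "StatusCode": 500},
    {"ServerApplication": "API",     "StatusCode": 200},
    {"ServerApplication": "Cryptex", "StatusCode": 500},
    {"ServerApplication": "Gateway", "StatusCode": 503},
]

errors = Counter(
    e["ServerApplication"] for e in events if e["StatusCode"] >= 500
)
top_n = errors.most_common(2)   # [('Cryptex', 2), ('Gateway', 1)]
```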
49. See Trends Over Time
The second fundamental question is: what are the trends over time? Moreover, how does the trend compare with the same data in a different time window? Again, slicing and dicing is very important here because it helps us narrow down our view.
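A minimal sketch of the trend question: bucket event timestamps into fixed windows and compare one window against the previous one. The one-minute bucket size and sample timestamps are arbitrary choices for illustration.

```python
# Bucket timestamps (in seconds) into one-minute windows and compare
# a window's event count against the previous window's.
from collections import Counter

def bucket_counts(timestamps, bucket_secs=60):
    """Count events per fixed-size time bucket."""
    return Counter(ts // bucket_secs for ts in timestamps)

timestamps = [5, 10, 61, 62, 63, 130]
counts = bucket_counts(timestamps)
# minute 0 -> 2 events, minute 1 -> 3 events, minute 2 -> 1 event
delta = counts[1] - counts[0]   # trend versus the previous window: +1
```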
53. See Data Distributions
The third fundamental question is: what is the distribution of my data? The average alone is not enough; sometimes it can even be deceiving. Percentiles paint a more accurate picture.
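Here is a small sketch of why the average can deceive: a couple of pathological requests barely move the mean but dominate the high percentiles. The latency numbers are invented, and the nearest-rank percentile below is just one common definition.

```python
# Mean versus percentiles over a latency sample with two outliers.
def percentile(samples, p):
    """Nearest-rank percentile of a list of numbers (0 < p <= 100)."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 98 fast responses and two pathological ones, in milliseconds.
latencies = [10] * 98 + [5000, 5000]
mean = sum(latencies) / len(latencies)   # 109.8 ms: looks almost fine
p50 = percentile(latencies, 50)          # 10 ms
p99 = percentile(latencies, 99)          # 5000 ms: the real pain
```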
57. Problem:
Minimizing programming effort
Solution:
- Homogeneous architecture
- Separating producing logs from consuming logs
Even though we instrument code to death, people don’t want to write more code just for a nascent tool. Luckily for us, we already have a homogeneous architecture in place, and we have already separated producing logs from consuming logs: applications share a common build and continuous-integration environment, an identical deployment base, and a shared platform runtime.
60. A Single Data Pipeline
Log data → Log Filter → Collector Agent → Log Collectors
LogManager.logEvent(anEvent)
Every application shares the same design and the same underlying runtime. The logic of delivering log events is completely hidden from programmers; all they need to do is construct a log event and hand it to LogManager.
Photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/m/in/photostream/
63. Isolated Log Processing
Log data → Dispatcher → Log Filter → Sink Plugin → Hadoop
                      → Log Filter → Sink Plugin → Kafka
                      → Log Filter → Sink Plugin → Druid
                      → Log Filter → Sink Plugin → ElasticSearch
Since producing log events is dead simple, we moved all the processing logic to the backend. We introduced a plugin design that is flexible enough to filter, transform, and dispatch log events to different destinations with high throughput.
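The plugin design above can be sketched as a dispatcher that runs each event through a filter and, if it passes, hands it to a sink. The class and sink names below are hypothetical stand-ins for the Hadoop, Kafka, Druid, and ElasticSearch writers; the real implementation is part of Netflix’s Java platform.

```python
# Dispatcher sketch: each route pairs a filter predicate with a sink.
class Route:
    def __init__(self, accepts, sink):
        self.accepts = accepts      # predicate over an event dict
        self.sink = sink            # callable that ships the event

class Dispatcher:
    def __init__(self, routes):
        self.routes = routes

    def dispatch(self, event):
        # Fan the event out to every sink whose filter accepts it.
        for route in self.routes:
            if route.accepts(event):
                route.sink(event)

druid_buffer = []                   # stand-in for the Druid sink plugin
dispatcher = Dispatcher([
    Route(lambda e: e.get("StatusCode", 0) >= 500, druid_buffer.append),
])
dispatcher.dispatch({"StatusCode": 500})   # routed to the Druid sink
dispatcher.dispatch({"StatusCode": 200})   # filtered out
```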
66. Problem:
Not All Logs Are Worth Processing
Solution:
Dynamic Filtering
Storing and processing log events takes time, requires resources, and ultimately costs money, and lots of events are useful only when they are needed. Therefore, we built filtering into the platform itself.
67.
We created both a fluent API and a corresponding infix mini-language for filtering any JavaBean-like object.
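To give a flavor of such a fluent filter API, here is a hypothetical Python sketch. The names (`Field`, `eq`, `gte`, `and_`) are invented for illustration and are not Netflix’s actual API, which targets Java beans.

```python
# A toy fluent filter API over attribute-bearing ("bean-like") objects.
class Field:
    def __init__(self, name):
        self.name = name

    def eq(self, value):
        return lambda obj: getattr(obj, self.name, None) == value

    def gte(self, value):
        return lambda obj: getattr(obj, self.name, 0) >= value

def and_(*preds):
    return lambda obj: all(p(obj) for p in preds)

class Event:
    def __init__(self, app, status):
        self.app, self.status = app, status

# Roughly the spirit of an infix expression like:
#   app == "Cryptex" and status >= 500
matches = and_(Field("app").eq("Cryptex"), Field("status").gte(500))
assert matches(Event("Cryptex", 503))
assert not matches(Event("API", 503))
```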
70. Problem:
JSON Payload Is Tedious
Solution:
Build a parser
It is just inhumane to ask people to write the JSON payload directly. Remember, our goal is to let users get query results back in seconds; it makes no sense to ask a user to spend half an hour constructing a query and another half an hour debugging it.
71. curl -X POST http://druid -d @data
An added benefit of using a parser up front is that it catches semantic errors early.
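To see why a parser is worth building, compare a one-line query with the JSON it expands into. The payload shape below follows Druid’s documented topN native query format, but the data source name, dimension, and interval are made up; the one-line syntax is likewise an invented example, not Netflix’s actual mini-language.

```python
# Expand a Top N request into a Druid topN query payload, the kind of
# JSON a user would otherwise write by hand and POST via
#   curl -X POST http://druid -d @data
import json

def top_n_payload(datasource, dimension, n, interval):
    return {
        "queryType": "topN",
        "dataSource": datasource,
        "dimension": dimension,
        "threshold": n,
        "metric": "count",
        "granularity": "all",
        "aggregations": [{"type": "count", "name": "count"}],
        "intervals": [interval],
    }

# What a user might type:  top 5 ServerApplication
payload = top_n_payload(
    "request_events", "ServerApplication", 5,
    "2013-03-01T00:00/2013-03-01T01:00",
)
data = json.dumps(payload)   # the @data file in the curl command above
```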
75. Problem:
Managing data sources can be hairy
Solution:
Use cell-like deployment
This is a nascent system with quite a few moving parts. We need to add new data sources, remove data sources, update schemas, and debug individual data sources, and such operations should be easy and have minimal impact on the production system.
76. Druid Druid Druid
Kafka Kafka Kafka
Log Data Pipeline
We use a cell-like architecture: each data source has its own persistent queue, its own configuration, and its own indexing cluster. Adding a new data source requires only adding a new set of ASGs (auto-scaling groups), and tuning also becomes isolated.
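The cell idea can be sketched as each data source carrying its own isolated stack of queue, indexing cluster, and scaling group. Every name below (topic and cluster naming scheme, data source names) is hypothetical; the sketch only shows that adding a data source means adding one more self-contained cell.

```python
# One cell per data source: its own queue, config, and indexing cluster.
def make_cell(datasource):
    return {
        "datasource": datasource,
        "kafka_topic": f"log-{datasource}",          # its own queue
        "druid_cluster": f"druid-{datasource}",      # its own indexers
        "asg": f"asg-{datasource}-indexer",          # its own scaling group
    }

cells = {ds: make_cell(ds) for ds in ["playback", "search"]}

# Adding a new data source is just adding another cell; the existing
# cells are untouched, so tuning stays isolated.
cells["crypto"] = make_cell("crypto")
```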
77. Integrating with Netflix’s Infrastructure
Integration with Netflix’s infrastructure is essential: we need insight to operate this system, and we need it to run smoothly.
78.
For example, the current deployment handles 380,000 messages per second, or close to 2 TB/hour, at peak. Without integration with our monitoring system, we would not notice system glitches like the one shown in this chart.
79. On the Netflix Side
• Integrating Kafka with the Netflix cloud
• A real-time plug-in on Netflix’s data pipeline
• User-configurable event filtering
80. On the Druid Side
• Integration with Netflix’s monitoring system (Emitter + Servo)
• Integration with Netflix’s platform library
• Handling ZooKeeper session interruptions
• Tuning the sharding spec for linear scalability
There are lots of injection points in Druid where we can introduce our own implementations, which greatly helped our integration.
82. Open Source Plan
Druid
Event Filter
Collector Agent
Log Collectors
rtexplorer
We built our tool set on top of many excellent open-source tools, and it is our pleasure to contribute back. We are therefore going to open source all the tools we built sometime this year.