Real-Time Data Insights In Netflix
                       Danny Yuan (@g9yuayon)
                       Jae Bae




Friday, March 1, 13                                 1
Who Am I?
    Member of Netflix’s Platform
    Engineering team, working on
    large scale data infrastructure
    (@g9yuayon)


   Built and operated Netflix’s
   cloud crypto service

   Worked with Jae Bae on
   querying multi-dimensional data
   in real time




The crypto service manages pretty much all the keys Netflix uses in the cloud, which translates to billions of requests per day.
Use Cases
      Real-time Operational Metrics
      Business or Product Insights

We're going to discuss two types of use cases today: real-time operational metrics, and business or product insights. By the way, who would have guessed that Canadians' number one search query would be 90210?
What Are Log Events?

                      Field Name             Field Value
                      ClientApplication      "API"
                      ServerApplication      "Cryptex"
                      StatusCode             200
                      ResponseTime           73



Before we dive into the use cases, let me explain what our log data looks like. Much of Netflix's log data can be represented as "events"; Netflix applications send hundreds of different types of log events every day.
A log event is really just a set of fields. A field has a name and a value, and the value itself can be a string, a number, or a nested set of fields.
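As a rough illustration (not Netflix's actual classes, just a sketch of the idea), such an event can be modeled as an ordered set of named fields whose values may be strings, numbers, or nested structures:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A minimal sketch of a log event: an ordered collection of named fields.
// Values may be strings, numbers, or nested maps of fields.
public class LogEvent {
    private final Map<String, Object> fields = new LinkedHashMap<>();

    // Fluent setter so events can be built up field by field.
    public LogEvent with(String name, Object value) {
        fields.put(name, value);
        return this;
    }

    public Object get(String name) {
        return fields.get(name);
    }

    public static void main(String[] args) {
        LogEvent event = new LogEvent()
            .with("ClientApplication", "API")
            .with("ServerApplication", "Cryptex")
            .with("StatusCode", 200)
            .with("ResponseTime", 73);
        System.out.println(event.get("ServerApplication")); // prints Cryptex
    }
}
```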
Tens of Thousands of Servers Come and Go
Highly Reliable Collectors Collect Log Events from All Servers
Dynamically Configurable Destinations

           Server Farms -> Log Collectors -> Hadoop / Kafka / HTTP Endpoints

Inside Netflix, hundreds of applications run on tens of thousands of machines. Machines come and go all the time, but they all generate tons of application log events and send them to highly reliable data collectors. The collectors in turn send the data to various destinations.
Netflix is a log generating company
  that also happens to stream movies

                                                              - Adrian Cockcroft




As Adrian used to say, Netflix is a log-generating company that also happens to stream movies. With vast amounts of logs from different applications, we also get a treasure trove. In fact, numerous teams, including BI, operations, product development, and data science, mine this data all the time. To put this into perspective, let me share some numbers.
1,500,000

During peak hours, our data pipeline collects over 1.5 million log events per second.
70,000,000,000

Or 70 billion a day on average.
Making Sense of Billions of Events




Making sense of such a vast amount of information is a continuing challenge for Netflix. After all, most of the time it is not feasible to look into individual log events to get anything useful out. We've got to have intelligent ways to digest our data.
We've Got Tools

And over the past couple of years Netflix has built numerous tools to help us.
We have Turbine, a real-time dashboard for application metrics on live machines. It is also open sourced, by the way.
We have Atlas, our monitoring solution, which handles millions of application metrics every second.
We have CSI, which uses a number of machine learning algorithms to identify correlations and trends in monitored data.
We have Biopsys, which searches logs on multiple live servers and streams the results back to a user's browser.
We also have Hadoop and Hive, of course. The DSE team has built a number of tools to make using Hadoop as easy as possible, and we even have DSE Sting, which visualizes the results of Hive queries.
And we had a log summarization service that alerts people to the top error-generating services.
These tools, however, give static snapshots of data that we can't easily drill down into, and they are usually half an hour late.
What Is Missing?



Why do we need yet another tool then? The key question is, what is missing?
Interactive Exploration




For one thing: interactive exploration. Sometimes we want to get data in real time so we can act quickly; some data is only useful in a small time window, after all. Sometimes we want to run lots of experimental queries just to find the right insights. If we wait too long for a query to come back, we won't be able to iterate fast enough. Either way, we need query results back in seconds.
Getting Results Back in Seconds

Because aggregation is out of the way, we can simply de-dup the error messages and index them in a search engine. So you get the best of both worlds: an instant error report, and an instant error search engine.
Getting Results Back in Seconds

                       150,000

Here is one example: we process more than 150 thousand events per second about device activities. What if we'd like to know, geographically, how many users started playing videos in the past 5 minutes? So I submit my query, and in a few seconds...

The globe is divided into a 1600x800 grid; each client activity's coordinates are mapped to a grid cell, and the activity is counted there.

But this is an aggregated view. What if I want to drill down into the data immediately along different dimensions? In this particular case, to find failed attempts on our Silverlight players that run on PCs and Macs.
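The grid mapping described above can be sketched as follows. This is an illustration under stated assumptions (longitude in [-180, 180), latitude in [-90, 90), a simple linear projection), not Netflix's actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: map each client activity's coordinates onto a 1600x800 world grid
// and count activities per cell.
public class GeoGrid {
    static final int COLS = 1600, ROWS = 800;
    final Map<Long, Integer> counts = new HashMap<>();

    // Linear projection of (longitude, latitude) to a single cell index.
    static long cellOf(double lon, double lat) {
        int col = (int) Math.min(COLS - 1, Math.floor((lon + 180.0) / 360.0 * COLS));
        int row = (int) Math.min(ROWS - 1, Math.floor((lat + 90.0) / 180.0 * ROWS));
        return (long) row * COLS + col;
    }

    void record(double lon, double lat) {
        counts.merge(cellOf(lon, lat), 1, Integer::sum);
    }

    public static void main(String[] args) {
        GeoGrid grid = new GeoGrid();
        grid.record(-122.4, 37.8);  // San Francisco
        grid.record(-122.4, 37.8);  // same cell, counted again
        grid.record(151.2, -33.9);  // Sydney
        System.out.println(grid.counts.get(cellOf(-122.4, 37.8)));  // 2
        System.out.println(grid.counts.get(cellOf(151.2, -33.9)));  // 1
    }
}
```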
Querying Data Along Different Dimensions

And from the same events, we may get answers to different questions. How many people started viewing House of Cards in the past 6 hours?
Discover Outstanding Data

                                                      HTTP 500

There are three fundamental questions we usually want answered from a large amount of data. The first is finding the outstanding data. For a small number of rows, we can read a summary table, but for a large amount of data even the summary table itself can be huge, and much of the information may be noise. So a Top-N query really helps here. For example, don't you want to know which applications generated most of the errors in the last 5 seconds? Now that's timely feedback. Let me share a more complete example.

Hundreds of thousands of requests captured.
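The Top-N idea can be shown with a small, self-contained sketch: count HTTP 500s per application within a window and keep the N worst offenders. The application names here are made up for illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of a Top-N aggregation: given the applications that emitted
// HTTP 500s in a window, return the N applications with the most errors.
public class TopNErrors {
    public static List<Map.Entry<String, Long>> topN(List<String> appsWith500s, int n) {
        return appsWith500s.stream()
            .collect(Collectors.groupingBy(app -> app, Collectors.counting()))
            .entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(n)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> events =
            Arrays.asList("api", "api", "cryptex", "api", "gateway", "cryptex");
        List<Map.Entry<String, Long>> top = topN(events, 2);
        System.out.println(top.get(0).getKey() + "=" + top.get(0).getValue()); // api=3
        System.out.println(top.get(1).getKey() + "=" + top.get(1).getValue()); // cryptex=2
    }
}
```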
See Trends Over Time

The second fundamental question is: what are the trends over time? Moreover, what is the trend compared to that of the same data in a different time window? Again, slicing and dicing is very important here, because it helps us narrow down our view.
See Data Distributions

The third fundamental question is: what is the distribution of my data? The average alone is not enough; sometimes it can even be deceiving. Percentiles paint a more accurate picture.
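A small worked example of why the average deceives: a handful of very slow requests barely move the mean, while a high percentile exposes them. The numbers are invented for illustration, and the percentile uses the simple nearest-rank method:

```java
import java.util.Arrays;

// Illustration: mean vs. 99th percentile on a latency sample where a few
// requests are pathologically slow.
public class Percentiles {
    // Nearest-rank percentile on a sorted copy of the data.
    static long percentile(long[] data, double p) {
        long[] sorted = data.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] responseTimesMs = new long[100];
        Arrays.fill(responseTimesMs, 50);                    // 95 fast requests...
        for (int i = 95; i < 100; i++) responseTimesMs[i] = 5000; // ...and 5 very slow ones
        long mean = Arrays.stream(responseTimesMs).sum() / responseTimesMs.length;
        System.out.println(mean);                            // 297 -- looks tolerable
        System.out.println(percentile(responseTimesMs, 99)); // 5000 -- the real story
    }
}
```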
Technical Challenges





I’d like to share some technical challenges we encountered when integrating Druid.
Problem:
             Minimizing programming effort


        Solution:
             -Homogeneous architecture
             -Separating producing logs from
                  consuming logs


Friday, March 1, 13                                                                    21

Even though we instrument code to death, people don’t want to write more code just for a
nascent tool. Luckily for us, though, we’ve got a homogeneous architecture in place, and
we’ve already separated producing logs from consuming logs. Applications share a common
build and continuous-integration environment, an identical deployment base, and the same
platform runtime.
A Single Data Pipeline



                       Log data   Log Filter   Collector
                                                Agent
                                                                 Log Collectors




                      LogManager.logEvent(anEvent)


Friday, March 1, 13                                                                       22

Every application shares the same design and the same underlying runtime. The logic of
delivering log events is completely hidden from programmers. All they need to do is
construct a log event and hand it to LogManager.

Photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/m/in/
photostream/
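The hand-off described above can be sketched as follows. This is a minimal illustration, not Netflix's implementation: only `LogManager.logEvent` comes from the slide; the buffer, the `pending` helper, and the field names are made up to show the idea that the producer's responsibility ends at the hand-off.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch (hypothetical names except LogManager.logEvent): the application
// only constructs an event and enqueues it; a platform agent drains the
// queue and ships events to the log collectors, so delivery logic never
// leaks into application code.
public class LogManager {
    private static final Queue<Map<String, String>> BUFFER =
            new ConcurrentLinkedQueue<>();

    public static void logEvent(Map<String, String> event) {
        BUFFER.offer(event); // the producer's responsibility ends here
    }

    static int pending() {
        return BUFFER.size();
    }

    public static void main(String[] args) {
        logEvent(Map.of("esn", "DEVICE-123", "country", "US"));
        logEvent(Map.of("esn", "DEVICE-456", "country", "CA"));
        System.out.println(pending()); // prints 2
    }
}
```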
Isolated Log Processing

                                                 Log Filter          Sink Plugin             Hadoop




                         Log                                                                   Kafka
   Log data                                      Log Filter          Sink Plugin                          Druid
                      Dispatcher




                                                 Log Filter          Sink Plugin          ElasticSearch




photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/m/in/photostream/
Friday, March 1, 13                                                                                               23

Since producing log events is dead simple, we moved all the processing logic to the backend.
We introduced a plugin design that is flexible enough to filter, transform, and dispatch log
events to different destinations with high throughput.

Photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/m/in/
photostream/
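One plausible shape for the dispatcher-plus-plugins design in the diagram is sketched below. All interface and class names here are illustrative, not Netflix's: the point is only that each (filter, sink) pair is configured independently, so Hadoop, Kafka/Druid, and ElasticSearch consumers can evolve separately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of the plugin design: a dispatcher fans each event
// out to independently configured (filter, sink) routes.
public class LogDispatcher {
    interface Sink {
        void write(Map<String, String> event);
    }

    record Route(Predicate<Map<String, String>> filter, Sink sink) {}

    private final List<Route> routes;

    LogDispatcher(List<Route> routes) {
        this.routes = routes;
    }

    void dispatch(Map<String, String> event) {
        for (Route r : routes) {
            if (r.filter().test(event)) {
                r.sink().write(event); // e.g. a Kafka->Druid or HDFS sink
            }
        }
    }

    public static void main(String[] args) {
        List<String> delivered = new ArrayList<>();
        LogDispatcher d = new LogDispatcher(List.of(
                new Route(e -> "500".equals(e.get("status")),
                          e -> delivered.add("druid:" + e.get("esn")))));
        d.dispatch(Map.of("esn", "A", "status", "500")); // matches the filter
        d.dispatch(Map.of("esn", "B", "status", "200")); // dropped
        System.out.println(delivered); // prints [druid:A]
    }
}
```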
Problem:
             Not All Logs Are Worth Processing


        Solution:
              Dynamic Filtering



Friday, March 1, 13                                                                         24

Storing and processing log events takes time, requires resources, and ultimately costs
money. Lots of events are useful only when they are needed. Therefore, we built this filtering
capability into our platform.
Friday, March 1, 13                                                                      25

We created both a fluent API and a corresponding infix mini-language to filter any JavaBean-like
object.
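The fluent form the notes describe might look roughly like the sketch below (method names and the example infix expression are made up for illustration; this is not the actual API). The infix mini-language would parse down to a predicate like this one.

```java
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical fluent-style event filter. The commented infix form is
// what a mini-language front end could compile into the same predicate:
//   country == "US" && status == "500"
public class EventFilter {
    static Predicate<Map<String, String>> eq(String field, String value) {
        return event -> value.equals(event.get(field));
    }

    public static void main(String[] args) {
        Predicate<Map<String, String>> filter =
                eq("country", "US").and(eq("status", "500"));

        System.out.println(filter.test(Map.of("country", "US", "status", "500"))); // true
        System.out.println(filter.test(Map.of("country", "CA", "status", "500"))); // false
    }
}
```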
Problem:
             JSON Payload Is Tedious


        Solution:
              Build a parser



Friday, March 1, 13                                                                             26

It’s just inhumane to ask people to write the JSON payload directly. Remember, our goal is to let
users get query results back in seconds. It doesn’t make sense to ask a user to spend half
an hour just to construct a query, and then another half hour to debug it.
curl -X POST http://druid -d @data




Friday, March 1, 13                                                                 27



An added benefit of using a parser upfront is catching semantic errors early.
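To make the tedium concrete: the JSON below follows Druid's public groupBy query format, but the data source, dimensions, and interval are made up for illustration, as is the one-line query syntax. A small parser can turn the one-liner into this payload, and reject malformed queries before they ever reach Druid.

```java
// Illustration only: a simple "requests with status 500, grouped by
// country" needs a hand-written payload several times the size of the
// query a parser could accept.
public class QueryExample {
    static final String ONE_LINER =
            "select count(*) from request_trace where status = '500' group by country";

    static final String DRUID_JSON = """
            {
              "queryType": "groupBy",
              "dataSource": "request_trace",
              "granularity": "all",
              "dimensions": ["country"],
              "filter": {"type": "selector", "dimension": "status", "value": "500"},
              "aggregations": [{"type": "count", "name": "count"}],
              "intervals": ["2013-03-01T00:00/2013-03-01T01:00"]
            }""";

    public static void main(String[] args) {
        // The raw JSON is more than three times the size of the one-liner.
        System.out.println(DRUID_JSON.length() > 3 * ONE_LINER.length()); // prints true
    }
}
```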
Problem:
             Managing data sources can be hairy


        Solution:
              Use cell-like deployment



Friday, March 1, 13                                                                          28

This is a nascent system with quite a few moving parts. We needed to add new data sources,
remove data sources, update schemas for data sources, or debug individual data sources.
Such operations should be easy and should have minimal impact on a production system.
Druid                  Druid                 Druid




                        Kafka                Kafka                   Kafka




                                 Log Data Pipeline

Friday, March 1, 13                                                                       29

We use a cell-like architecture. Each data source has its own persistent queue, its own
configuration, and its own indexing cluster. Adding a new data source requires only adding a
new set of ASGs (Auto Scaling Groups).

Tuning also becomes isolated.
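The cell layout above can be pictured as a registry mapping each data source to its own queue and indexing cluster. This is a conceptual sketch with made-up names, not the deployment tooling itself: the point is that adding or tuning one source never touches another source's cell.

```java
import java.util.Map;

// Minimal sketch of the cell idea: one Kafka queue and one Druid cluster
// per data source. All names below are illustrative.
public class CellRegistry {
    record Cell(String kafkaTopic, String druidCluster) {}

    static final Map<String, Cell> CELLS = Map.of(
            "request_trace", new Cell("request_trace-kafka", "request_trace-druid"),
            "playback",      new Cell("playback-kafka",      "playback-druid"));

    public static void main(String[] args) {
        // Adding a data source = adding one entry (one new set of ASGs);
        // existing cells are untouched, so tuning stays isolated.
        System.out.println(CELLS.get("playback").druidCluster()); // prints playback-druid
    }
}
```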
Integrating with Netflix’s Infrastructure




Friday, March 1, 13                                                                               30

Integration with Netflix’s infrastructure is essential. We need insights to operate this system,
and we need smooth operations.
Friday, March 1, 13                                                                          31

For example, the current deployment handles 380,000 messages per second, or close to
2 TB/hour at peak. Without integration with our monitoring system, we wouldn’t know
about system glitches like the one shown in this chart.
On Netflix Side
         • Integrating Kafka with Netflix cloud
         • Real-time plug-in on Netflix’s data
                 pipeline
         • User-configurable event filtering


Friday, March 1, 13                              32
On Druid Side
   • Integration with Netflix’s monitoring
           system − Emitter+Servo

   • Integration with Netflix’s platform library
   • Handling of Zookeeper’s session
           interruption
   • Tuning sharding spec for linear scalability
Friday, March 1, 13                                                                           33

Emitter integration with Servo

There are lots of injection points in Druid where we can introduce our own implementations.
This greatly helped our integration.
Open Source Plan

                                                                                      Druid
         Event Filter   Collector
                         Agent
                                            Log Collectors




                                                                           rtexplorer



Friday, March 1, 13                                                                            34

We built our tool set on top of many excellent open-source tools, and it’s our pleasure to
contribute back. Therefore, we’re going to open source all the tools we built sometime this
year.

Atmosphere Conference 2015: Oktawave Horizon Project: the future of real-time...
 
Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011
 
TerraEchos Kairos on IBM PowerLinux servers
TerraEchos Kairos on IBM PowerLinux serversTerraEchos Kairos on IBM PowerLinux servers
TerraEchos Kairos on IBM PowerLinux servers
 
Automating Big Data with the Automic Hadoop Agent
Automating Big Data with the Automic Hadoop AgentAutomating Big Data with the Automic Hadoop Agent
Automating Big Data with the Automic Hadoop Agent
 
Couchbase & HPCC Systems – A complete mobile & data platform in the enterprise
Couchbase & HPCC Systems – A complete mobile & data platform in the enterpriseCouchbase & HPCC Systems – A complete mobile & data platform in the enterprise
Couchbase & HPCC Systems – A complete mobile & data platform in the enterprise
 
What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)
 
Overview and Opentracing in theory by Gianluca Arbezzano
Overview and Opentracing in theory by Gianluca ArbezzanoOverview and Opentracing in theory by Gianluca Arbezzano
Overview and Opentracing in theory by Gianluca Arbezzano
 
High Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for SupercomputingHigh Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for Supercomputing
 
Open Blueprint for Real-Time Analytics in Retail: Strata Hadoop World 2017 S...
Open Blueprint for Real-Time  Analytics in Retail: Strata Hadoop World 2017 S...Open Blueprint for Real-Time  Analytics in Retail: Strata Hadoop World 2017 S...
Open Blueprint for Real-Time Analytics in Retail: Strata Hadoop World 2017 S...
 
Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)
 
RTI Data-Distribution Service (DDS) Master Class 2011
RTI Data-Distribution Service (DDS) Master Class 2011RTI Data-Distribution Service (DDS) Master Class 2011
RTI Data-Distribution Service (DDS) Master Class 2011
 
How to Gain Visibility into Containers, VM’s and Multi-Cloud Environments Usi...
How to Gain Visibility into Containers, VM’s and Multi-Cloud Environments Usi...How to Gain Visibility into Containers, VM’s and Multi-Cloud Environments Usi...
How to Gain Visibility into Containers, VM’s and Multi-Cloud Environments Usi...
 
Microsoft Dryad
Microsoft DryadMicrosoft Dryad
Microsoft Dryad
 
Monitoring in 2017 - TIAD Camp Docker
Monitoring in 2017 - TIAD Camp DockerMonitoring in 2017 - TIAD Camp Docker
Monitoring in 2017 - TIAD Camp Docker
 
Streaming analytics
Streaming analyticsStreaming analytics
Streaming analytics
 
7_considerations_final
7_considerations_final7_considerations_final
7_considerations_final
 
Data Tells the Story - Greenplum Summit 2018
Data Tells the Story - Greenplum Summit 2018Data Tells the Story - Greenplum Summit 2018
Data Tells the Story - Greenplum Summit 2018
 
Distributed Data Processing for Real-time Applications
Distributed Data Processing for Real-time ApplicationsDistributed Data Processing for Real-time Applications
Distributed Data Processing for Real-time Applications
 
BCO 117 IT Software for Business Lecture Reference Notes.docx
BCO 117 IT Software for Business Lecture Reference Notes.docxBCO 117 IT Software for Business Lecture Reference Notes.docx
BCO 117 IT Software for Business Lecture Reference Notes.docx
 
Data Driven Security
Data Driven SecurityData Driven Security
Data Driven Security
 

Kürzlich hochgeladen

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 

Kürzlich hochgeladen (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

netflix-real-time-data-strata-talk

  • 1. Real-Time Data Insights In Netflix Danny Yuan (@g9yuayon) Jae Bae Friday, March 1, 13 1
  • 2. Who Am I? Friday, March 1, 13 2 crypto service, that manages pretty much all the keys Netflix uses in the cloud, which translates to billions of requests per day.
  • 3. Who Am I? Member of Netflix’s Platform Engineering team, working on large scale data infrastructure (@g9yuayon) Friday, March 1, 13 2
  • 4. Who Am I? Member of Netflix’s Platform Engineering team, working on large scale data infrastructure (@g9yuayon) Built and operated Netflix’s cloud crypto service Friday, March 1, 13 2
  • 5. Who Am I? Member of Netflix’s Platform Engineering team, working on large scale data infrastructure (@g9yuayon) Built and operated Netflix’s cloud crypto service Worked with Jae Bae on querying multi-dimensional data in real time Friday, March 1, 13 2
  • 6. Use Cases Friday, March 1, 13 3 We’re going to discuss two types of use cases today: Real-time operational metrics, and business or product insights. By the way, who would know Canadians’ number 1 search query would be 90210?
  • 7. Use Cases Real-time Operational Metrics Friday, March 1, 13 3
  • 8. Use Cases Business or Product Insights Friday, March 1, 13 3
  • 9. What Are Log Events? Field Name Field Value ClientApplication “API” ServerApplication “Cryptex” StatusCode 200 ResponseTime 73 Friday, March 1, 13 4 Before we dive into use cases, let me explain what our log data looks like. Lots of Netflix’s log data can be represented by “events”. Netflix applications send hundreds of different types of log events every day. A log event is really just a set of fields. A field has a name and a value. The value itself can be a string, a number, or a set of fields.
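A log event of this kind can be modeled as a flat map of named fields. A minimal sketch, assuming a plain dict representation (the field names come from the slide; the dict layout itself is illustrative, not Netflix's actual wire format):

```python
# A log event is just a set of named fields; a value can be a string,
# a number, or a nested set of fields. The dict below mirrors the
# example table on the slide.
event = {
    "ClientApplication": "API",
    "ServerApplication": "Cryptex",
    "StatusCode": 200,
    "ResponseTime": 73,
}

# Fields are accessed by name, so events of different shapes can flow
# through the same pipeline without a fixed schema.
status = event["StatusCode"]
latency_ms = event["ResponseTime"]
```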
  • 10. Friday, March 1, 13 5 Inside Netflix, hundreds of applications run on tens of thousands of machines. Machines come and go all the time, but they all generate tons of application log events, and send them to highly reliable data collectors. The collectors in turn send data to various destinations.
  • 11. Tens of Thousands of Servers Come and Go Server Farm Server Farm Server Farm Friday, March 1, 13 5
  • 12. Highly Reliable Collectors Collect Log Events from All Servers Server Farm Server Farm Log Collectors Server Farm Friday, March 1, 13 5
  • 13. Dynamically Configurable Destinations Server Farm Hadoop Server Farm Kafka Log Collectors HTTP Endpoints Server Farm Friday, March 1, 13 5
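The collector tier described on these slides, with destinations that can be reconfigured at runtime, might be sketched like this. This is a toy model under stated assumptions: the sink names (`hadoop`, `kafka`, `http`) and the routing-config shape are illustrative, not Netflix's actual implementation:

```python
# Illustrative collector with dynamically configurable destinations:
# each incoming event is fanned out to whichever sinks the current
# routing config names. Real sinks would be Hadoop, Kafka, or HTTP
# endpoints; lists stand in for them here.
sinks = {"hadoop": [], "kafka": [], "http": []}

# Routing config can be swapped at runtime without touching producers.
routing = {"destinations": ["hadoop", "kafka"]}

def collect(event):
    """Fan an event out to all currently configured destinations."""
    for name in routing["destinations"]:
        sinks[name].append(event)

collect({"ClientApplication": "API", "StatusCode": 200})
routing["destinations"] = ["http"]  # reconfigure on the fly
collect({"ClientApplication": "API", "StatusCode": 500})
```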
  • 14. Netflix is a log generating company that also happens to stream movies - Adrian Cockroft Friday, March 1, 13 6 As Adrian used to say, Netflix is a log generating company that also happens to stream movies. When we have vast amounts of logs from different applications, we also get a treasure trove. In fact, numerous teams (BI, operations, product development, data science) mine such data all the time. To put this into perspective, let me share some numbers.
  • 15. 1,500,000 Friday, March 1, 13 7 During peak hours, our data pipeline collects over 1.5 million log events per second
  • 16. 70,000,000,000 Friday, March 1, 13 8 Or 70 billion a day on average.
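The two headline numbers hold up as a quick sanity check: 70 billion events a day averages out to roughly 800,000 events per second, so a 1.5 million/second peak is about twice the daily average. The arithmetic below is ours, not from the slides:

```python
# Back-of-the-envelope: 70 billion events/day vs. a 1.5 million/sec peak.
events_per_day = 70_000_000_000
seconds_per_day = 24 * 60 * 60  # 86,400

average_rate = events_per_day / seconds_per_day
print(round(average_rate))  # ~810,185 events/sec on average

peak_rate = 1_500_000
print(round(peak_rate / average_rate, 1))  # peak is ~1.9x the average
```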
  • 17. Making Sense of Billions of Events Friday, March 1, 13 9 Making sense of such a vast amount of information is a continuing challenge for Netflix. After all, most of the time it is not feasible to look into individual log events to get anything useful out. We’ve got to have intelligent ways to digest our data.
  • 18. Friday, March 1, 13 10 And over the past couple of years Netflix has built numerous tools to help us. We have the Turbine real-time dashboard for application metrics on live machines; it is also open sourced, by the way. We have Atlas, our monitoring solution, which handles millions of application metrics every second. We have CSI, which uses a number of machine learning algorithms to identify correlations and trends in monitored data. We have Biopsys, which searches logs on multiple live servers and streams the results back to a user’s browser. We also have Hadoop and Hive, of course; the DSE team has built a number of tools to make using Hadoop as easy as possible, and we even have DSE Sting, which visualizes the results of Hive queries. And we have a log summarization service that alerts people to the top error-generating services. These tools, however, give static snapshots of data that we can’t easily drill down into, and they are usually half an hour late.
  • 19. We’ve Got Tools Friday, March 1, 13 10
• 30. What Is Missing? Friday, March 1, 13 11 Why do we need yet another tool, then? The key question is: what is missing?
• 31. Interactive Exploration Friday, March 1, 13 12 For one thing: interactive exploration. Sometimes we want data in real time so we can act quickly; some data is only useful in a small time window, after all. Sometimes we want to run lots of experimental queries just to find the right insight. If we wait too long for a query to come back, we won't be able to iterate fast enough. Either way, we need query results back in seconds.
• 32–33. Getting Results Back in Seconds Friday, March 1, 13 13 Because aggregation is out of the way, we can simply de-dup the error messages and index them in a search engine. So you get the best of both worlds: an instant error report and an instant error search engine.
• 34–37. Getting Results Back in Seconds 150,000 Friday, March 1, 13 14 Here is one example: we process more than 150 thousand events per second about device activities. What if we'd like to know, geographically, how many users started playing videos in the past 5 minutes? So I submit my query, and in a few seconds... The globe is divided into a 1600x800 grid; each client activity's coordinate is mapped to a grid cell, and the activity is then counted. But this is an aggregated view. What if I want to drill down into the data immediately along different dimensions? In this particular case, to find failed attempts on our Silverlight players that run on PCs and Macs?
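The grid mapping described above can be sketched as follows. The 1600x800 dimensions come from the talk, but the code and names are illustrative, not Netflix's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class GeoGrid {
    static final int COLS = 1600, ROWS = 800;

    // Map longitude [-180, 180] and latitude [-90, 90] to a grid cell index.
    static int cellOf(double lon, double lat) {
        int col = (int) Math.floor((lon + 180.0) / 360.0 * COLS);
        int row = (int) Math.floor((lat + 90.0) / 180.0 * ROWS);
        // Clamp the boundary case (lon == 180 or lat == 90) into the last cell.
        col = Math.min(col, COLS - 1);
        row = Math.min(row, ROWS - 1);
        return row * COLS + col;
    }

    public static void main(String[] args) {
        // Count "playback started" events per cell for the current window.
        Map<Integer, Integer> counts = new HashMap<>();
        double[][] events = { {-122.4, 37.8}, {-122.4, 37.8}, {2.35, 48.85} };
        for (double[] e : events) {
            counts.merge(cellOf(e[0], e[1]), 1, Integer::sum);
        }
        System.out.println(counts.size() + " distinct cells"); // 2 distinct cells
    }
}
```

Two San Francisco events land in the same cell; the Paris event lands in another, so the aggregated view has two non-zero cells.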
• 38–41. Querying Data Along Different Dimensions Friday, March 1, 13 15 And from the same events, we can get answers to different questions, such as: how many people started watching House of Cards in the past 6 hours?
• 42–47. Discover Outstanding Data HTTP 500 Friday, March 1, 13 16 There are three fundamental questions we usually want to answer about a large amount of data. The first is finding the data that stands out. For a small number of rows, we can build a summary table; but for a large amount of data, even the summary table itself can be huge, and much of it can be noise. This is where a Top-N query really helps. For example, wouldn't you want to know which applications generated most of the errors in the last 5 seconds? Now that's timely feedback. Let me share a more complete example: hundreds of thousands of requests captured.
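The kind of Top-N aggregation described here can be sketched with a bounded min-heap over per-application error counts. This is illustrative code, not the actual implementation:

```java
import java.util.*;

public class TopN {
    // Return the N keys with the highest counts, largest first.
    static List<String> topN(Map<String, Long> counts, int n) {
        PriorityQueue<Map.Entry<String, Long>> heap =
            new PriorityQueue<>(Map.Entry.comparingByValue()); // min-heap by count
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) heap.poll(); // evict the current smallest
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll().getKey());
        Collections.reverse(result); // drain yields smallest first, so reverse
        return result;
    }

    public static void main(String[] args) {
        // HTTP 500 counts per application over the last window (made-up numbers).
        Map<String, Long> errors =
            Map.of("api", 500L, "billing", 42L, "edge", 1200L, "crypto", 7L);
        System.out.println(topN(errors, 2)); // [edge, api]
    }
}
```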
• 49–52. See Trends Over Time Friday, March 1, 13 18 The second fundamental question is: what are the trends over time? Moreover, how does the trend compare with that of the same data in a different time window? Again, slicing and dicing is very important here, because it helps us narrow our view.
• 53–54. See Data Distributions Friday, March 1, 13 19 The third fundamental question is: what is the distribution of my data? The average alone is not enough; sometimes it can even be deceiving. Percentiles paint a more accurate picture.
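A small illustration of why percentiles beat the average: one slow outlier drags the mean far above what almost every user actually experienced. The latency numbers are made up:

```java
import java.util.Arrays;

public class Percentiles {
    // Nearest-rank percentile over a sorted copy of the samples.
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    static double average(double[] samples) {
        double sum = 0;
        for (double s : samples) sum += s;
        return sum / samples.length;
    }

    public static void main(String[] args) {
        // Nine fast responses and one very slow one: the average hides the shape.
        double[] latenciesMs = {10, 10, 10, 10, 10, 10, 10, 10, 10, 2000};
        System.out.println(average(latenciesMs));        // 209.0
        System.out.println(percentile(latenciesMs, 50)); // 10.0
        System.out.println(percentile(latenciesMs, 99)); // 2000.0
    }
}
```

The median says the typical request took 10 ms; the average of 209 ms describes no request at all.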
  • 55. Technical Challenges Friday, March 1, 13 20 I’d like to share some technical challenges we encountered when integrating Druid.
• 56–58. Problem: Minimizing programming effort Solution: -Homogeneous architecture -Separating producing logs from consuming logs Friday, March 1, 13 21 Even though we instrument code to death, people don't want to write more code just for a nascent tool. Luckily for us, we already have a homogeneous architecture in place, and we have already separated producing logs from consuming logs: applications share a common build and continuous-integration environment, an identical deployment base, and a shared platform runtime.
• 59–60. A Single Data Pipeline Log data Log Filter Collector Agent Log Collectors LogManager.logEvent(anEvent) Friday, March 1, 13 22 Every application shares the same design and the same underlying runtime. The logic of delivering a log event is completely hidden from programmers; all they need to do is construct a log event and hand it to LogManager. Photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/m/in/photostream/
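A minimal sketch of that fire-and-forget hand-off: the application constructs an event and offers it to a buffer, while delivery to the collector agents happens elsewhere. Only the `LogManager.logEvent` name comes from the slide; everything else here is an assumption:

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch, not the actual Netflix platform code: the caller
// never deals with batching, collector agents, or retries.
public class LogManager {
    private static final BlockingQueue<Map<String, Object>> BUFFER =
        new LinkedBlockingQueue<>(100_000);

    // Never block the caller: drop (and ideally count the drop) when full.
    public static boolean logEvent(Map<String, Object> event) {
        return BUFFER.offer(event);
    }

    static int pending() { return BUFFER.size(); }

    public static void main(String[] args) {
        logEvent(Map.of("type", "playback.start",
                        "device", "silverlight",
                        "country", "US"));
        System.out.println(pending() + " event(s) buffered"); // 1 event(s) buffered
    }
}
```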
• 61–63. Isolated Log Processing Log Filter Sink Plugin Hadoop Log Kafka Log data Log Filter Sink Plugin Druid Dispatcher Log Filter Sink Plugin ElasticSearch photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/m/in/photostream/ Friday, March 1, 13 23 Since producing log events is dead simple, we moved all the processing logic to the backend. We introduced a plugin design that is flexible enough to filter, transform, and dispatch log events to different destinations with high throughput.
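The filter-plus-sink plugin design might look like this sketch: a dispatcher fans each event out to every sink whose filter accepts it. The sink names (Hadoop, Druid, ElasticSearch) come from the slide; the filters and code structure are illustrative:

```java
import java.util.*;
import java.util.function.Predicate;

public class LogDispatcher {
    // A sink plugin pairs a filter with a named destination.
    record SinkPlugin(String name, Predicate<Map<String, String>> filter) {}

    static final List<SinkPlugin> PLUGINS = List.of(
        new SinkPlugin("hadoop", e -> true),                                  // keep everything
        new SinkPlugin("druid", e -> "playback".equals(e.get("type"))),       // playback events
        new SinkPlugin("elasticsearch", e -> "ERROR".equals(e.get("level")))  // errors only
    );

    // Return the names of every sink whose filter accepts the event.
    static List<String> route(Map<String, String> event) {
        List<String> matched = new ArrayList<>();
        for (SinkPlugin p : PLUGINS) {
            if (p.filter().test(event)) matched.add(p.name());
        }
        return matched;
    }

    public static void main(String[] args) {
        System.out.println(route(Map.of("type", "playback", "level", "INFO")));
        // [hadoop, druid]
        System.out.println(route(Map.of("type", "crypto", "level", "ERROR")));
        // [hadoop, elasticsearch]
    }
}
```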
• 64–66. Problem: Not All Logs Are Worth Processing Solution: Dynamic Filtering Friday, March 1, 13 24 Storing and processing log events takes time, requires resources, and ultimately costs money, and lots of events are useful only when they are needed. Therefore, we built this filtering capability into our platform.
• 67. Friday, March 1, 13 25 We created both a fluent API and a corresponding infix mini-language to filter any JavaBean-like object.
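A hedged sketch of what such a fluent filter API could look like. The actual Netflix API and mini-language syntax are not shown in the talk, so all names here are assumptions; the infix form of the same filter might read something like `type == "playback" and (country == "US" or country == "CA")`:

```java
import java.util.Map;
import java.util.function.Predicate;

// Illustrative only: composes field predicates over map-like events.
public class EventFilter {
    public static Predicate<Map<String, Object>> field(String name, Object expected) {
        return event -> expected.equals(event.get(name));
    }

    public static void main(String[] args) {
        // Fluent form: type == "playback" AND (country == "US" OR country == "CA")
        Predicate<Map<String, Object>> filter =
            field("type", "playback")
                .and(field("country", "US").or(field("country", "CA")));

        Map<String, Object> canada = Map.of("type", "playback", "country", "CA");
        Map<String, Object> france = Map.of("type", "playback", "country", "FR");
        System.out.println(filter.test(canada)); // true
        System.out.println(filter.test(france)); // false
    }
}
```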
• 68–70. Problem: JSON Payload Is Tedious Solution: Build a parser Friday, March 1, 13 26 It's just inhumane to ask people to write JSON payloads directly. Remember, our goal is to let users get query results back in seconds; it doesn't make sense to ask a user to spend half an hour constructing a query and another half an hour debugging it.
• 71–72. curl -X POST http://druid -d @data Friday, March 1, 13 27 An added benefit of using a parser up front is catching semantic errors early.
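An illustrative sketch of that benefit: validate a query against the data source's schema before emitting the Druid groupBy JSON, so a typoed dimension fails immediately on the client instead of server-side. The query shape follows Druid's groupBy JSON format; the schema, dimension names, and data source name are assumptions:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class QueryValidator {
    // Assumed schema for a hypothetical "playback_events" data source.
    static final Set<String> DIMENSIONS = Set.of("country", "device", "title", "status");

    // Reject unknown dimensions before the query ever reaches the cluster.
    static String toDruidJson(String dataSource, List<String> dims, String interval) {
        for (String d : dims) {
            if (!DIMENSIONS.contains(d)) {
                throw new IllegalArgumentException("unknown dimension: " + d);
            }
        }
        String dimJson = dims.stream()
            .map(d -> "\"" + d + "\"")
            .collect(Collectors.joining(", ", "[", "]"));
        return "{\"queryType\": \"groupBy\", \"dataSource\": \"" + dataSource + "\", "
            + "\"granularity\": \"all\", \"dimensions\": " + dimJson + ", "
            + "\"aggregations\": [{\"type\": \"count\", \"name\": \"count\"}], "
            + "\"intervals\": [\"" + interval + "\"]}";
    }

    public static void main(String[] args) {
        // The payload that `curl -X POST http://druid -d @data` would send.
        System.out.println(toDruidJson("playback_events", List.of("country", "device"),
                                       "2013-03-01T00:00/2013-03-01T01:00"));
    }
}
```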
• 73–75. Problem: Managing data sources can be hairy Solution: Use cell-like deployment Friday, March 1, 13 28 This is a nascent system with quite a few moving parts. We need to add new data sources, remove data sources, update schemas, or debug particular data sources. Such operations should be easy and should have minimal impact on a production system.
• 76. Druid Druid Druid Kafka Kafka Kafka Log Data Pipeline Friday, March 1, 13 29 We use a cell-like architecture: each data source has its own persistent queue, its own configuration, and its own indexing cluster. Adding a new data source requires only adding a new set of ASGs, and tuning becomes isolated as well.
  • 77. Integrating with Netflix’s Infrastructure Friday, March 1, 13 30 Integration with Netflix’s infrastructure is essential. We need insights to operate this system, and we need smooth operations.
• 78. Friday, March 1, 13 31 For example, the current deployment handles 380,000 messages per second, or close to 2TB/hour at peak. Without integration into our monitoring system, we wouldn't know about system glitches like the one shown in this chart.
  • 79. On Netflix Side • Integrating Kafka with Netflix cloud • Real-time plug-in on Netflix’s data pipeline • User-configurable event filtering Friday, March 1, 13 32
• 80. On Druid Side • Integration with Netflix's monitoring system − Emitter+Servo • Integration with Netflix's platform library • Handling of Zookeeper's session interruption • Tuning sharding spec for linear scalability Friday, March 1, 13 33 For example, we integrated Druid's Emitter with Servo. There are lots of injection points in Druid where we can introduce our own implementations, which greatly helped our integration.
• 81–82. Open Source Plan Druid Event Filter Collector Agent Log Collectors rtexplorer Friday, March 1, 13 34 We built our tool sets on top of many excellent open-source tools, and it's our pleasure to contribute back. Therefore, we're going to open source all the tools we built sometime this year.