Data Insights in Netflix
                      Danny Yuan (@g9yuayon)
                      Jae Bae




Friday, March 1, 13                            1
Who Am I?
    Member of Netflix’s Platform
    Engineering team, working on
    very large scale data
    infrastructure (@g9yuayon)

    Built and operated Netflix’s
    cloud crypto service

    Worked with Jae Bae on
    querying multi-dimensional data
    in real time

No Monitoring Metrics Today

Developers usually think of monitoring metrics when “real-time” data is
mentioned. We have powerful monitoring systems that track millions of metrics
per second, but I’m not going to talk about them today. Monitoring metrics are
crucial data, and they would warrant another multi-hour talk by our monitoring
team. :-)
photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/o/in/photostream/

Instead, I’m going to talk about logs. Why are they interesting at all?
1,500,000

During peak hours, our data pipeline collects over 1.5 million log events per second.

70,000,000,000

Or 70 billion a day on average.
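
To put the two numbers side by side, the short calculation below converts the daily average into a per-second rate. It uses only the two figures quoted in the talk; the variable names are illustrative. The average works out to roughly 810,000 events per second, so the quoted peak is about 1.85 times the average.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

events_per_day = 70_000_000_000      # daily average quoted in the talk
peak_events_per_second = 1_500_000   # peak rate quoted in the talk

# Spread the daily total evenly over the day to get an average rate.
average_events_per_second = events_per_day / SECONDS_PER_DAY

# How much higher the peak is than that average.
peak_to_average = peak_events_per_second / average_events_per_second

print(f"average: {average_events_per_second:,.0f} events/s")
print(f"peak is {peak_to_average:.2f}x the average")
```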
Highly Reliable Data Pipeline

[Diagram: three server farms send logs to the Log Collectors. Three branches follow, each with a
Log Filter and a Sink Plugin, delivering to Hadoop, to Kafka and Druid, and to ElasticSearch.]

We have tens of thousands of machines, all of which send log data over a robust data
pipeline to highly reliable data collectors. The collectors then filter the data, transform
it, and dispatch it to different destinations for further processing.

Photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/m/in/photostream/
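
The filter, transform, and dispatch steps that the collectors perform can be pictured with a small in-memory sketch. This is not Netflix's collector code: the `Route` class, the `dispatch` function, the filter rules, and the list-backed sinks are all hypothetical stand-ins, chosen only to show one event fanning out to several destinations.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Event = Dict[str, object]


@dataclass
class Route:
    """One branch of a collector: filter, then transform, then sink (hypothetical)."""
    name: str
    log_filter: Callable[[Event], bool]
    transform: Callable[[Event], Event]
    sink: Callable[[Event], None]


def dispatch(event: Event, routes: List[Route]) -> List[str]:
    """Send the event to every route whose filter accepts it; return the route names used."""
    delivered = []
    for route in routes:
        if route.log_filter(event):
            route.sink(route.transform(event))
            delivered.append(route.name)
    return delivered


# Plain lists stand in for the real destinations (assumption for illustration).
hadoop_sink: List[Event] = []
kafka_sink: List[Event] = []
search_sink: List[Event] = []

routes = [
    # Archive everything unchanged.
    Route("hadoop", lambda e: True, lambda e: e, hadoop_sink.append),
    # Forward only server errors, tagged with a severity field.
    Route("kafka", lambda e: e.get("StatusCode", 200) >= 500,
          lambda e: {**e, "severity": "error"}, kafka_sink.append),
    # Index only events that carry a message.
    Route("elasticsearch", lambda e: "message" in e, lambda e: e, search_sink.append),
]

delivered = dispatch({"StatusCode": 503, "message": "boom"}, routes)
print(delivered)
```

Keeping the filter and the sink as separate callables per route mirrors the Log Filter and Sink Plugin boxes in the diagram: a new destination is one more route, with no change to the dispatch loop.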
A Humble Beginning

We didn’t build everything in one night. Actually, we had a humble start. I did a lot of log
scraping like this. I also used R to analyze logs. But these are specific tasks, and at some
point
[Slide graphic: many boxes, each labeled “Application”]

Something happened. Our traffic turned into a hockey stick, and the number of applications
exploded. So, log traffic also exploded. Simple log scraping wouldn’t cut it any more.
So We Evolved

hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket

So we evolved. One thing we built was a Hadoop grep. This tool searches terabytes of data. It is
much more useful than the one provided by the Apache Hadoop distribution, because it supports
many more grep options, such as context lines and sorting by columns. And DSE’s
Hadoop-as-a-service greatly helps each team.
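
The two options named above, context lines and sorting by columns, can be illustrated on a small in-memory log. The sketch below is not the hgrep implementation and does not run on Hadoop; `grep_with_context`, `sort_by_columns`, and the sample log lines are hypothetical, and the column sort is a plain string comparison on whitespace-separated fields.

```python
import re
from typing import Iterable, List, Sequence


def grep_with_context(lines: Sequence[str], pattern: str, context: int = 0) -> List[str]:
    """Return matching lines plus `context` lines on each side, in file order, without repeats."""
    regex = re.compile(pattern)
    keep = set()
    for i, line in enumerate(lines):
        if regex.search(line):
            first = max(0, i - context)
            last = min(len(lines), i + context + 1)
            keep.update(range(first, last))
    return [lines[i] for i in sorted(keep)]


def sort_by_columns(lines: Iterable[str], columns: Sequence[int]) -> List[str]:
    """Sort lines by the given 1-based whitespace-separated columns, compared as strings."""
    def key(line: str):
        fields = line.split()
        # A missing column sorts as an empty string.
        return tuple(fields[c - 1] if c <= len(fields) else "" for c in columns)
    return sorted(lines, key=key)


# Made-up log lines: method, path, status, user.
log = [
    "GET /users/456 500 bob",
    "GET /health 200 probe",
    "GET /users/123 200 alice",
    "GET /movies/9 200 carol",
]

hits = grep_with_context(log, r"users.*[1-9]{3}")
ordered = sort_by_columns(hits, columns=[3, 4])  # by status, then user
for line in ordered:
    print(line)
```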
A search tool that searches the logs of live instances was also developed.
Field Name      Field Value
Client          “API”
Server          “Cryptex”
StatusCode      200
ResponseTime    73

Hive becomes indispensable.
DSE Sting is a blessing.
Still, We Have a Real-Time Itch

So we built yet another tool to scratch it with the help of Druid.
Error summary for the past 10 seconds. You get to slice and dice through arbitrary
combinations of dimensions across multiple time series.

Trends in search queries for “90210” by Canadians

How many people started streaming any episode of House of Cards in the past hour, grouped
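
To make “slice and dice” concrete, here is a minimal sketch of the kind of question such a tool answers: count events in a time window, filtered on some dimensions and grouped by others. It is a plain in-memory loop, not Druid's query API or the tool from the talk. The `slice_counts` function, the `ts` field, the sample events, and the `Web` client value are assumptions for illustration; the other field names follow the log fields shown earlier.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Sequence

Event = Dict[str, object]


def slice_counts(events: Iterable[Event], dimensions: Sequence[str],
                 since: int, until: int,
                 where: Optional[Dict[str, object]] = None) -> Counter:
    """Count events with since <= ts < until that match `where`, grouped by `dimensions`."""
    where = where or {}
    counts: Counter = Counter()
    for event in events:
        if not (since <= event["ts"] < until):
            continue  # outside the time window
        if any(event.get(k) != v for k, v in where.items()):
            continue  # fails one of the dimension filters
        counts[tuple(event.get(d) for d in dimensions)] += 1
    return counts


# Made-up events; `ts` is a timestamp in seconds.
events = [
    {"ts": 100, "Client": "API", "Server": "Cryptex", "StatusCode": 200},
    {"ts": 103, "Client": "API", "Server": "Cryptex", "StatusCode": 500},
    {"ts": 105, "Client": "Web", "Server": "Cryptex", "StatusCode": 500},
    {"ts": 108, "Client": "API", "Server": "Cryptex", "StatusCode": 500},
    {"ts": 80, "Client": "API", "Server": "Cryptex", "StatusCode": 500},
]

# Error summary for a 10-second window, grouped by client.
errors = slice_counts(events, ["Client"], since=100, until=110,
                      where={"StatusCode": 500})
print(errors)
```

Changing `dimensions` or `where` gives a different cut of the same events, which is the interactive part; a real system has to answer the same question over billions of events in seconds rather than by scanning a list.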
A query for all the users who started streaming House of Cards in the past three hours.
Results came back in seconds.
Interested?

See You Tomorrow

If you’re interested in how we did the real-time interactive queries with the help of Druid, do
come to our talk. See you tomorrow.

Strata lightning talk

  • 1. Data Insights in Netflix. Danny Yuan (@g9yuayon), Jae Bae. Friday, March 1, 13
  • 2-5. Who Am I? Member of Netflix’s Platform Engineering team, working on very large scale data infrastructure (@g9yuayon). Built and operated Netflix’s cloud crypto service. Worked with Jae Bae on querying multi-dimensional data in real time.
  • 6-7. No Monitoring Metrics Today. Developers usually think about monitoring metrics when “real-time” data is mentioned. We have powerful monitoring systems that track millions of metrics per second, but I’m not going to talk about them today. Monitoring metrics are crucial data, and they would warrant a separate multi-hour talk by our monitoring team. :-)
  • 8. Instead, I’m going to talk about logs. Why are they interesting at all? Photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/o/in/photostream/
  • 9. 1,500,000. During peak hours, our data pipeline collects over 1.5 million log events per second.
  • 10. 70,000,000,000. Or 70 billion a day on average.
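To put the two figures side by side, the daily average can be converted to a per-second rate and compared with the peak. This is only arithmetic on the numbers quoted on the slides; the variable names are illustrative.

```python
# Figures quoted on the slides.
PEAK_EVENTS_PER_SECOND = 1_500_000
AVERAGE_EVENTS_PER_DAY = 70_000_000_000

SECONDS_PER_DAY = 24 * 60 * 60

# Average per-second rate implied by the daily total.
average_events_per_second = AVERAGE_EVENTS_PER_DAY / SECONDS_PER_DAY

# How much higher the peak rate is than the average rate.
peak_to_average = PEAK_EVENTS_PER_SECOND / average_events_per_second

print(f"average: {average_events_per_second:,.0f} events/s")
print(f"peak is {peak_to_average:.2f}x the average")
```

The average works out to roughly 810,000 events per second, so the quoted peak is a little under twice the average rate.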
  • 11-12. Highly Reliable Data Pipeline. Diagram labels: Server Farm (three), Kafka, Log Collectors, Log Filter and Sink Plugin (three of each), Hadoop, Druid, ElasticSearch. We have tens of thousands of machines, all of which send log data over a robust data pipeline to highly reliable data collectors. The collectors then filter the data, transform it, and dispatch it to different destinations for further processing. Photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/m/in/photostream/
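The filter, transform, and dispatch steps described for the collectors can be sketched in a few lines. This is a toy model, not Netflix's collector: the `Collector` class, the in-memory sinks, and the routing policy are all assumptions made up for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, object]


class Collector:
    """Toy log collector: filter, transform, then fan out to named sinks."""

    def __init__(
        self,
        keep: Callable[[Event], bool],
        transform: Callable[[Event], Event],
        route: Callable[[Event], List[str]],
    ) -> None:
        self.keep = keep
        self.transform = transform
        self.route = route
        # Each sink is an in-memory list here; a real sink plugin
        # would write to an external system instead.
        self.sinks: Dict[str, List[Event]] = defaultdict(list)

    def handle(self, event: Event) -> None:
        if not self.keep(event):
            return  # dropped by the filter
        event = self.transform(event)
        for sink_name in self.route(event):
            self.sinks[sink_name].append(event)


# Hypothetical policy: drop DEBUG events, upper-case the level field,
# send everything to "hadoop" and errors additionally to "druid".
collector = Collector(
    keep=lambda e: e.get("level") != "DEBUG",
    transform=lambda e: {**e, "level": str(e.get("level", "")).upper()},
    route=lambda e: ["hadoop", "druid"] if e["level"] == "ERROR" else ["hadoop"],
)

for raw in [
    {"level": "DEBUG", "msg": "cache warm"},
    {"level": "info", "msg": "request served"},
    {"level": "error", "msg": "timeout"},
]:
    collector.handle(raw)

print({name: len(events) for name, events in collector.sinks.items()})
```

Keeping the three steps as separate callables means a new destination only needs a change to the routing function, which is one plausible reading of the "sink plugin" boxes in the diagram.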
  • 13-16. A Humble Beginning. We didn’t build everything in one night. Actually, we had a humble start. I did a lot of log scraping like this. I also used R to analyze logs. But these were specific tasks, and at some point
  • 17-19. Something happened. Our traffic turned into a hockey stick, and the number of applications exploded. So log traffic also exploded. Simple log scraping wouldn’t cut it any more. (Slide 19 repeats the label “Application” ten times.)
  • 20-23. So We Evolved. hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket. So we evolved. One thing we built was a Hadoop grep. This tool searches TBs of data. It is much more useful than the one provided by the Apache Hadoop distribution, because it supports many more grep options, such as context and sorting by columns. And DSE’s Hadoop-as-a-service greatly helps each team.
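The two options the notes single out, context lines and sorting by columns, can be illustrated on a small in-memory log. hgrep itself is an internal tool and its exact semantics are not given in the talk, so this sketch assumes that -C behaves like grep's context option and that -k names 1-based whitespace-separated columns; the function names and sample log lines are invented.

```python
import re
from typing import Iterable, List, Sequence


def grep_with_context(lines: Sequence[str], pattern: str, context: int) -> List[str]:
    """Return matching lines plus `context` lines before and after each match."""
    regex = re.compile(pattern)
    keep = set()
    for i, line in enumerate(lines):
        if regex.search(line):
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))
    # Emit each kept line once, in its original order.
    return [lines[i] for i in sorted(keep)]


def sort_by_columns(lines: Iterable[str], columns: Sequence[int]) -> List[str]:
    """Sort lines by 1-based whitespace-separated columns, compared as strings."""

    def key(line: str) -> List[str]:
        fields = line.split()
        # A missing column is treated as an empty string, so it sorts first.
        return [fields[c - 1] if c <= len(fields) else "" for c in columns]

    return sorted(lines, key=key)


log = [
    "10:00:01 api GET /users/17 200",
    "10:00:02 api GET /health 200",
    "10:00:03 api GET /users/523 500",
    "10:00:04 api GET /health 200",
    "10:00:05 api GET /users/941 200",
]

# Same regular expression as on the slide, with no context lines.
matches = grep_with_context(log, r"users.*[1-9]{3}", context=0)
for line in sort_by_columns(matches, columns=[5, 1]):
    print(line)
```

On a real cluster the matching would run in parallel over files in distributed storage; the sketch only shows what the options mean for a single file.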
  • 24-29. A search tool that searches live instances’ logs was also developed.
  • 30. Hive becomes indispensable. Fields shown on the slide:
        Field Name      Field Value
        Client          “API”
        Server          “Cryptex”
        StatusCode      200
        ResponseTime    73
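Once log events carry named fields like the ones on this slide, SQL-style aggregation becomes natural, which is the kind of query Hive runs over data in Hadoop. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for Hive; the table name, the column names, the query, and every row except the first are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE request_log ("
    " client TEXT, server TEXT, status_code INTEGER, response_time INTEGER)"
)

rows = [
    ("API", "Cryptex", 200, 73),   # the event shown on the slide
    ("API", "Cryptex", 200, 91),   # the remaining rows are invented
    ("API", "Cryptex", 500, 240),
    ("Web", "Cryptex", 200, 55),
]
conn.executemany("INSERT INTO request_log VALUES (?, ?, ?, ?)", rows)

# Request count and mean response time per client and status code.
query = """
    SELECT client, status_code, COUNT(*) AS requests,
           AVG(response_time) AS avg_response_time
    FROM request_log
    WHERE server = 'Cryptex'
    GROUP BY client, status_code
    ORDER BY client, status_code
"""
summary = conn.execute(query).fetchall()
for row in summary:
    print(row)
```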
  • 31-33. DSE Sting is a blessing.
  • 34-35. Still, We Have a Real-Time Itch. So we built yet another tool to scratch it with the help of Druid.
  • 36-38. Error summary in the past 10 seconds. You get to slice and dice through arbitrary combinations of different dimensions across multiple time series. Trends over search queries for “90210” by Canadians. How many people started streaming any episode of House of Cards in the past hour, grouped
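The "slice and dice" queries in these notes share one shape: pick a time window, filter on some dimensions, and group by others. The sketch below shows that shape with a plain linear scan over in-memory events. It says nothing about how Druid executes such queries, and the `count_events` function, the field names, and the sample events are assumptions made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List

Event = Dict[str, object]


def count_events(
    events: Iterable[Event],
    start: int,
    end: int,
    filters: Dict[str, object],
    group_by: List[str],
) -> Counter:
    """Count events with start <= timestamp < end that match every filter,
    grouped by the values of the `group_by` dimensions."""
    counts: Counter = Counter()
    for event in events:
        if not (start <= event["timestamp"] < end):
            continue  # outside the time window
        if any(event.get(dim) != value for dim, value in filters.items()):
            continue  # fails at least one dimension filter
        key = tuple(event.get(dim) for dim in group_by)
        counts[key] += 1
    return counts


# Invented events; timestamps are seconds on an arbitrary clock.
events = [
    {"timestamp": 95, "server": "API", "status": "error", "country": "US"},
    {"timestamp": 100, "server": "Cryptex", "status": "error", "country": "CA"},
    {"timestamp": 104, "server": "Cryptex", "status": "ok", "country": "US"},
    {"timestamp": 107, "server": "API", "status": "error", "country": "CA"},
    {"timestamp": 109, "server": "Cryptex", "status": "error", "country": "US"},
]

now = 110
# Error summary for the past 10 seconds, grouped by server.
errors_by_server = count_events(events, now - 10, now, {"status": "error"}, ["server"])
# The same window and filter, sliced by an extra dimension.
errors_by_server_country = count_events(
    events, now - 10, now, {"status": "error"}, ["server", "country"]
)

print(errors_by_server)
print(errors_by_server_country)
```

Changing the filter or the group-by list is all it takes to ask a different question of the same events, which is the flexibility the notes describe.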
  • 39-41. A query for all the users who started streaming House of Cards in the past three hours; the results came back in seconds.
  • 43. See You Tomorrow. If you’re interested in how we did the real-time interactive queries with the help of Druid, do come to our talk. See you tomorrow.