YouTube Data Warehouse
Biswapesh Chattopadhyay
biswapesh@google.com
XLDB 2011




                                 Google Confidential and Proprietary
YTDW - Motivation & History
● Consolidated warehouse of YouTube data
● Videos, playbacks, summarized logs, etc.
● Very large (X PB uncompressed, trillion-row tables)
● High-volume ETL (XXX TB processed / day)
● 100% Google tech stack (arrows show the migration history):
  ○ Query: Oracle -> MySQL -> ColumnIO
  ○ ETL: Python -> Sawzall + Tenzing + Python
  ○ Reporting: MicroStrategy -> ABI
● Key technologies: Sawzall, Tenzing, Dremel, ABI




YTDW - Overall Architecture

[Architecture diagram]
  Sources: Logs, MySQL DB, Bigtable
    -> ETL (Sawzall, Tenzing, Python, C++ MR)
    -> YTDW (ColumnIO / GFS)
    -> Dremel Service -> ABI Reports
    -> Tenzing Service

YTDW - About Sawzall
●   Scripting language on Google MR framework
●   Sawzall vs MR-Saw
●   Built-in security for accessing sensitive logs data
●   Strong support for aggregation and complex
    computations
●   Read/write various formats
●   Procedural language
●   Open sourced!
●   YTDW Usage:
    ○ ETL of Youtube logs
    ○ Complex one-off logs analysis
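
The Sawzall programming model described above (per-record logic that emits values into aggregators) can be sketched in Python. This is only an illustration under stated assumptions, not Sawzall code and not YTDW's actual ETL: the record fields, the `SumTable` class, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical log records; the field names are illustrative only.
LOG_RECORDS = [
    {"video_id": "a", "watch_seconds": 30},
    {"video_id": "b", "watch_seconds": 12},
    {"video_id": "a", "watch_seconds": 45},
]


class SumTable:
    """Keyed aggregator, similar in spirit to a Sawzall sum table."""

    def __init__(self):
        self.totals = defaultdict(int)

    def emit(self, key, value):
        # Aggregation happens here, outside the per-record logic.
        self.totals[key] += value


def process_record(record, playbacks, watch_time):
    """Per-record logic: sees one record at a time and only emits values."""
    playbacks.emit(record["video_id"], 1)
    watch_time.emit(record["video_id"], record["watch_seconds"])


def run(records):
    """Drive the per-record logic over all records and return the totals."""
    playbacks, watch_time = SumTable(), SumTable()
    for record in records:
        process_record(record, playbacks, watch_time)
    return dict(playbacks.totals), dict(watch_time.totals)


if __name__ == "__main__":
    print(run(LOG_RECORDS))
```

Because the per-record function never reads another record, a framework is free to run it on many machines in parallel and merge the aggregators afterwards.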
YTDW - About Tenzing
● SQL on MR (think Hive, HadoopSQL)
● Key strengths:
   ○ Strong SQL support
   ○ Highly scalable - built on Google MR
   ○ Read / write many formats
● Weaknesses:
   ○ Not ideal for complex procedural code
   ○ Higher latency than Dremel
   ○ Limited support for nested-repeated structures
● YTDW Usage:
   ○ ETL for non-logs data, denormalizations
   ○ Medium complexity analysis on YTDW data
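
To make "SQL on MR" concrete, the sketch below shows how a query such as `SELECT country, SUM(views) FROM t GROUP BY country` could be split into map, shuffle, and reduce phases. It is a minimal single-process illustration, not Tenzing's planner: the table, column names, and function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical input rows; the column names are illustrative only.
ROWS = [
    {"country": "US", "views": 10},
    {"country": "IN", "views": 7},
    {"country": "US", "views": 5},
]


def map_phase(rows):
    """Map: emit one (group key, value) pair per input row."""
    for row in rows:
        yield row["country"], row["views"]


def shuffle_phase(pairs):
    """Shuffle: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: apply the aggregate function (SUM) to each group."""
    return {key: sum(values) for key, values in groups.items()}


def group_by_sum(rows):
    """Run the three phases in order, as one MapReduce job would."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(group_by_sum(ROWS))
```

Each phase here is a full pass over the data, which is one plausible way to read the higher latency relative to Dremel noted on the slide.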
YTDW - About Dremel
● Current use in YTDW:
  ○ Reporting query engine
  ○ Interactive simple logs analysis
● Key Strengths
  ○ Very low latency
  ○ SQL support
  ○ Strong nested-relational support
  ○ Access to logs
● Limitations
  ○ Limited support for more complex SQL constructs (joins, set operations, ...)
  ○ Limited library of functions
  ○ Does not scale as well as MR
YTDW - Technology Comparison

              Sawzall   Tenzing   Dremel
Latency       High      Medium    Low
Scalability   High      High      Medium
SQL support   None      High      Medium
Power         High      Medium    Low

YTDW Future: Query Engines
 ● Adding MR capabilities to Dremel
   ○ Scalable reliable shuffle
   ○ Materializing large result sets
   ○ Read / write multiple data formats
 ● Easier / more powerful analysis in Dremel
   ○ User-defined scalar and table-valued functions
   ○ More SQL features:
      ■ Better support for joins
      ■ Analytic functions, set operators, etc.
 ● Long term for Dremel:
   ○ Completely replace Tenzing MR backend
   ○ Extend BigQuery service capabilities
YTDW - About ABI
●   Complete reporting and dashboarding solution
●   Built on Google stack
●   Tight integration with Dremel and ColumnIO
●   Google Visualizations, some Flash
●   Current use in YTDW:
    ○ Most reports and dashboards




YTDW - Misc Technologies
● Python
  ○ Glue code - drivers, wrappers, etc.
  ○ Simple small scale extracts
● Scheduler
  ○ In-house scheduling framework
  ○ Built internally by YTDW engineers
● C++ MapReduce
  ○ Used sparingly for complex cases not possible
     using Sawzall / Tenzing
● Query Rewriter
  ○ Sits between ABI and Dremel
  ○ Rewrites queries to be faster / cheaper
YTDW - Performance
Performance improvement strategies:
   ● Data model:
       ○ De-normalized aggregated materialized views
       ○ Range partitioning
   ● Query rewrite layer:
       ○ Use the right aggregated materialized view
       ○ Prune partitions based on data knowledge
   ● Reporting front end:
       ○ Aggressive result caching (memcache)
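
The three strategies above can be sketched together: choose the smallest aggregated view that covers a query, prune range partitions by date, and cache the result. This is a hedged illustration only, not the actual YTDW query rewriter. The view names, row counts, partition boundaries, and function names are hypothetical, and a plain dict stands in for memcache.

```python
from datetime import date

# Hypothetical catalog of aggregated materialized views:
# name -> (available dimensions, approximate row count).
VIEWS = {
    "playbacks_by_day": ({"day"}, 1_000),
    "playbacks_by_day_country": ({"day", "country"}, 200_000),
    "playbacks_raw": ({"day", "country", "video_id"}, 1_000_000_000),
}

# Hypothetical monthly range partitions as (first_day, last_day).
PARTITIONS = [
    (date(2011, 8, 1), date(2011, 8, 31)),
    (date(2011, 9, 1), date(2011, 9, 30)),
    (date(2011, 10, 1), date(2011, 10, 31)),
]


def pick_view(needed_dimensions):
    """Return the smallest view whose dimensions cover the query."""
    candidates = [
        (rows, name)
        for name, (dims, rows) in VIEWS.items()
        if needed_dimensions <= dims
    ]
    if not candidates:
        raise ValueError("no view covers the requested dimensions")
    return min(candidates)[1]


def prune_partitions(start, end):
    """Keep only partitions whose date range overlaps [start, end]."""
    return [(lo, hi) for lo, hi in PARTITIONS if lo <= end and hi >= start]


_cache = {}  # Stand-in for memcache in this sketch.


def plan(needed_dimensions, start, end):
    """Return (view name, partitions to scan), reusing cached plans."""
    key = (frozenset(needed_dimensions), start, end)
    if key not in _cache:
        _cache[key] = (
            pick_view(needed_dimensions),
            prune_partitions(start, end),
        )
    return _cache[key]


if __name__ == "__main__":
    print(plan({"day", "country"}, date(2011, 9, 15), date(2011, 10, 2)))
```

In this sketch a report that needs only `day` is answered from the smallest view, and a query over mid-September touches a single partition instead of all three.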


YTDW - Q & A





Weitere ähnliche Inhalte

Was ist angesagt?

Steam Learn: An introduction to Redis
Steam Learn: An introduction to RedisSteam Learn: An introduction to Redis
Steam Learn: An introduction to Redisinovia
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Icebergkbajda
 
Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Ian Pointer
 
SysAdmin Working from Home? Tips to Automate MySQL, MariaDB, Postgres & MongoDB
SysAdmin Working from Home? Tips to Automate MySQL, MariaDB, Postgres & MongoDBSysAdmin Working from Home? Tips to Automate MySQL, MariaDB, Postgres & MongoDB
SysAdmin Working from Home? Tips to Automate MySQL, MariaDB, Postgres & MongoDBSeveralnines
 
«NoSQL Databases and Polyglot Persistence»
«NoSQL Databases and Polyglot Persistence»«NoSQL Databases and Polyglot Persistence»
«NoSQL Databases and Polyglot Persistence»Olga Lavrentieva
 
Google BigQuery 101 & What’s New
Google BigQuery 101 & What’s NewGoogle BigQuery 101 & What’s New
Google BigQuery 101 & What’s NewDoiT International
 
TechEvent Time Seriesd Databases
TechEvent Time Seriesd DatabasesTechEvent Time Seriesd Databases
TechEvent Time Seriesd DatabasesTrivadis
 
Webinar slides: How to Migrate from Oracle DB to MariaDB
Webinar slides: How to Migrate from Oracle DB to MariaDBWebinar slides: How to Migrate from Oracle DB to MariaDB
Webinar slides: How to Migrate from Oracle DB to MariaDBSeveralnines
 
Apache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack WorldApache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack WorldJihoon Son
 
Presto Summit 2018 - 10 - Qubole
Presto Summit 2018  - 10 - QubolePresto Summit 2018  - 10 - Qubole
Presto Summit 2018 - 10 - Qubolekbajda
 
Presto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 BostonPresto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 Bostonkbajda
 
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...Modern Data Stack France
 
20160811 s301 e_prabhat
20160811 s301 e_prabhat20160811 s301 e_prabhat
20160811 s301 e_prabhatKumar Prabhat
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudShubham Tagra
 
Data- How Does It Work-
Data- How Does It Work-Data- How Does It Work-
Data- How Does It Work-Boyang Niu
 
Working with the Moodle Database: The Basics
Working with the Moodle Database: The BasicsWorking with the Moodle Database: The Basics
Working with the Moodle Database: The BasicsSeveralnines
 
How QBerg scaled to store data longer, query it faster
How QBerg scaled to store data longer, query it fasterHow QBerg scaled to store data longer, query it faster
How QBerg scaled to store data longer, query it fasterMariaDB plc
 

Was ist angesagt? (20)

Steam Learn: An introduction to Redis
Steam Learn: An introduction to RedisSteam Learn: An introduction to Redis
Steam Learn: An introduction to Redis
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916
 
NoSQL Databases
NoSQL DatabasesNoSQL Databases
NoSQL Databases
 
SysAdmin Working from Home? Tips to Automate MySQL, MariaDB, Postgres & MongoDB
SysAdmin Working from Home? Tips to Automate MySQL, MariaDB, Postgres & MongoDBSysAdmin Working from Home? Tips to Automate MySQL, MariaDB, Postgres & MongoDB
SysAdmin Working from Home? Tips to Automate MySQL, MariaDB, Postgres & MongoDB
 
NoSQL Databases
NoSQL DatabasesNoSQL Databases
NoSQL Databases
 
«NoSQL Databases and Polyglot Persistence»
«NoSQL Databases and Polyglot Persistence»«NoSQL Databases and Polyglot Persistence»
«NoSQL Databases and Polyglot Persistence»
 
Google BigQuery 101 & What’s New
Google BigQuery 101 & What’s NewGoogle BigQuery 101 & What’s New
Google BigQuery 101 & What’s New
 
TechEvent Time Seriesd Databases
TechEvent Time Seriesd DatabasesTechEvent Time Seriesd Databases
TechEvent Time Seriesd Databases
 
Webinar slides: How to Migrate from Oracle DB to MariaDB
Webinar slides: How to Migrate from Oracle DB to MariaDBWebinar slides: How to Migrate from Oracle DB to MariaDB
Webinar slides: How to Migrate from Oracle DB to MariaDB
 
Apache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack WorldApache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack World
 
Presto Summit 2018 - 10 - Qubole
Presto Summit 2018  - 10 - QubolePresto Summit 2018  - 10 - Qubole
Presto Summit 2018 - 10 - Qubole
 
RubiX
RubiXRubiX
RubiX
 
Presto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 BostonPresto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 Boston
 
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
 
20160811 s301 e_prabhat
20160811 s301 e_prabhat20160811 s301 e_prabhat
20160811 s301 e_prabhat
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the Cloud
 
Data- How Does It Work-
Data- How Does It Work-Data- How Does It Work-
Data- How Does It Work-
 
Working with the Moodle Database: The Basics
Working with the Moodle Database: The BasicsWorking with the Moodle Database: The Basics
Working with the Moodle Database: The Basics
 
How QBerg scaled to store data longer, query it faster
How QBerg scaled to store data longer, query it fasterHow QBerg scaled to store data longer, query it faster
How QBerg scaled to store data longer, query it faster
 

Andere mochten auch

Home care in wa state with audio take 3 show
Home care in wa state with audio take 3 showHome care in wa state with audio take 3 show
Home care in wa state with audio take 3 showKatheryn Howell
 
Home Care In W Atate With Audio Take 3 Show
Home  Care In  W  Atate With Audio Take 3 ShowHome  Care In  W  Atate With Audio Take 3 Show
Home Care In W Atate With Audio Take 3 ShowKatheryn Howell
 
Cell organelles
Cell organellesCell organelles
Cell organellesTownview
 
Xldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalyticsXldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalyticsliqiang xu
 
Xldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocksXldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocksliqiang xu
 
Xldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastnerXldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastnerliqiang xu
 

Andere mochten auch (7)

Cellresp.
Cellresp.Cellresp.
Cellresp.
 
Home care in wa state with audio take 3 show
Home care in wa state with audio take 3 showHome care in wa state with audio take 3 show
Home care in wa state with audio take 3 show
 
Home Care In W Atate With Audio Take 3 Show
Home  Care In  W  Atate With Audio Take 3 ShowHome  Care In  W  Atate With Audio Take 3 Show
Home Care In W Atate With Audio Take 3 Show
 
Cell organelles
Cell organellesCell organelles
Cell organelles
 
Xldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalyticsXldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalytics
 
Xldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocksXldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocks
 
Xldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastnerXldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastner
 

Ähnlich wie Xldb2011 tue 1120_youtube_datawarehouse

Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 
DataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdf
DataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdfDataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdf
DataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdfMiguel Angel Fajardo
 
MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017Severalnines
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
Introducing the ultimate MariaDB cloud, SkySQL
Introducing the ultimate MariaDB cloud, SkySQLIntroducing the ultimate MariaDB cloud, SkySQL
Introducing the ultimate MariaDB cloud, SkySQLMariaDB plc
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned Omid Vahdaty
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dan Lynn
 
SlamData Overview 9-1-2014
SlamData Overview 9-1-2014SlamData Overview 9-1-2014
SlamData Overview 9-1-2014carrjc2
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSKimmo Kantojärvi
 
Red Hat Gluster Storage - Direction, Roadmap and Use-Cases
Red Hat Gluster Storage - Direction, Roadmap and Use-CasesRed Hat Gluster Storage - Direction, Roadmap and Use-Cases
Red Hat Gluster Storage - Direction, Roadmap and Use-CasesRed_Hat_Storage
 
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with SchlumbergerGet Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumbergerinside-BigData.com
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the CloudAmihay Zer-Kavod
 
Big data on google platform dev fest presentation
Big data on google platform   dev fest presentationBig data on google platform   dev fest presentation
Big data on google platform dev fest presentationPrzemysław Pastuszka
 
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...Pôle Systematic Paris-Region
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Eric Sun
 

Ähnlich wie Xldb2011 tue 1120_youtube_datawarehouse (20)

Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
DataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdf
DataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdfDataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdf
DataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdf
 
MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Introducing the ultimate MariaDB cloud, SkySQL
Introducing the ultimate MariaDB cloud, SkySQLIntroducing the ultimate MariaDB cloud, SkySQL
Introducing the ultimate MariaDB cloud, SkySQL
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
 
SlamData Overview 9-1-2014
SlamData Overview 9-1-2014SlamData Overview 9-1-2014
SlamData Overview 9-1-2014
 
Running MySQL in AWS
Running MySQL in AWSRunning MySQL in AWS
Running MySQL in AWS
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWS
 
BigData Hadoop
BigData Hadoop BigData Hadoop
BigData Hadoop
 
Red Hat Gluster Storage - Direction, Roadmap and Use-Cases
Red Hat Gluster Storage - Direction, Roadmap and Use-CasesRed Hat Gluster Storage - Direction, Roadmap and Use-Cases
Red Hat Gluster Storage - Direction, Roadmap and Use-Cases
 
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with SchlumbergerGet Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the Cloud
 
Big data on google platform dev fest presentation
Big data on google platform   dev fest presentationBig data on google platform   dev fest presentation
Big data on google platform dev fest presentation
 
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 

Mehr von liqiang xu

浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909liqiang xu
 
Csrf攻击原理及防御措施
Csrf攻击原理及防御措施Csrf攻击原理及防御措施
Csrf攻击原理及防御措施liqiang xu
 
Xldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inXldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inliqiang xu
 
Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)liqiang xu
 
大话Php之性能
大话Php之性能大话Php之性能
大话Php之性能liqiang xu
 
Nginx internals
Nginx internalsNginx internals
Nginx internalsliqiang xu
 
1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)liqiang xu
 
1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)liqiang xu
 

Mehr von liqiang xu (9)

浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909
 
Csrf攻击原理及防御措施
Csrf攻击原理及防御措施Csrf攻击原理及防御措施
Csrf攻击原理及防御措施
 
Hdfs comics
Hdfs comicsHdfs comics
Hdfs comics
 
Xldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inXldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_in
 
Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)
 
大话Php之性能
大话Php之性能大话Php之性能
大话Php之性能
 
Nginx internals
Nginx internalsNginx internals
Nginx internals
 
1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)
 
1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)
 

Kürzlich hochgeladen

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 

Kürzlich hochgeladen (20)

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Xldb2011 tue 1120_youtube_datawarehouse

  • 1. Youtube Data Warehouse Biswapesh Chattopadhyay biswapesh@google.com XLDB 2011 Google Confidential and Proprietary
  • 2. YTDW - Motivation & History
    ● Consolidated warehouse of Youtube data
    ● Videos, playbacks, summarized logs, etc.
    ● Very large (X PB uncompressed, Trillion row tables)
    ● High volume ETL (XXX TB processed / day)
    ● 100% Google Tech Stack:
      ○ Query: Oracle -> MySQL -> ColumnIO
      ○ ETL: Python -> Sawzall + Tenzing + Python
      ○ Reporting: Microstrategy -> ABI
    ● Key technologies: Sawzall, Tenzing, Dremel, ABI
  • 3. YTDW - Overall Architecture
    [Architecture diagram. Components shown: Logs; MySQL DB; Bigtable; ETL (Sawzall, Tenzing, Python, C++ MR); YTDW (ColumnIO / GFS); Dremel Service; Tenzing Service; ABI Reports]
  • 4. YTDW - About Sawzall
    ● Scripting language on Google MR framework
    ● Sawzall vs MR-Saw
    ● Built-in security for accessing sensitive logs data
    ● Strong support for aggregation and complex computations
    ● Read/write various formats
    ● Procedural language
    ● Open sourced!
    ● YTDW Usage:
      ○ ETL of Youtube logs
      ○ Complex one-off logs analysis
  • 5. YTDW - About Tenzing
    ● SQL on MR - Think HIVE, HadoopSQL
    ● Key strengths:
      ○ Strong SQL support
      ○ Highly scalable - built on Google MR
      ○ Read / write many formats
    ● Weaknesses:
      ○ Not ideal for complex procedural code
      ○ Higher latency than Dremel
      ○ Limited support for nested-repeated structures
    ● YTDW Usage:
      ○ ETL for non-logs data, denormalizations
      ○ Medium complexity analysis on YTDW data
  • 6. YTDW - About Dremel
    ● Current use in YTDW:
      ○ Reporting query engine
      ○ Interactive simple logs analysis
    ● Key Strengths
      ○ Very low latency
      ○ SQL support
      ○ Strong nested-relational support
      ○ Access to logs
    ● Limitations
      ○ More complex SQL constructs (joins, setops, ...)
      ○ Limited library of functions
      ○ Doesn't scale as much as MR
  • 7. YTDW - Technology Comparison
                   Sawzall   Tenzing   Dremel
    Latency        High      Medium    Low
    Scalability    High      High      Medium
    SQL            None      High      Medium
    Power          High      Medium    Low
  • 8. YTDW Future: Query Engines
    ● Adding MR capabilities to Dremel
      ○ Scalable reliable shuffle
      ○ Materializing large result sets
      ○ Read / write multiple data formats
    ● Easier / more powerful analysis in Dremel
      ○ User-defined scalar and table-valued functions
      ○ More SQL features:
        ■ Better support for joins
        ■ Analytic functions, set operators, etc.
    ● Long term for Dremel:
      ○ Completely replace Tenzing MR backend
      ○ Extend BigQuery service capabilities
  • 9. YTDW - About ABI
    ● Complete reporting and dashboarding solution
    ● Built on Google stack
    ● Tight integration with Dremel and ColumnIO
    ● Google Visualizations, some Flash
    ● Current use in YTDW:
      ○ Most reports and dashboards
  • 10. YTDW - Misc Technologies
    ● Python
      ○ Glue code - drivers, wrappers, etc.
      ○ Simple small scale extracts
    ● Scheduler
      ○ In-house scheduling framework
      ○ Built internally by YTDW engineers
    ● C++ MapReduce
      ○ Used sparingly for complex cases not possible using Sawzall / Tenzing
    ● Query Rewriter
      ○ Sits between ABI and Dremel
      ○ Rewrites queries to be faster / cheaper
  • 11. YTDW - Performance
    Performance improvement strategies:
    ● Data model:
      ○ De-normalized aggregated materialized views
      ○ Range partitioning
    ● Query rewrite layer:
      ○ Use the right aggregated materialized view
      ○ Prune partitions based on data knowledge
    ● Reporting front end
      ○ Aggressive result caching (memcache)
  • 12. YTDW - Q & A
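
The query rewrite strategies listed on slide 11 (use the right aggregated materialized view, prune partitions) can be illustrated with a minimal Python sketch. This is not the YTDW query rewriter, whose implementation the slides do not describe. The view names, row counts, daily partitioning scheme, and the helpers `pick_view` and `prune_partitions` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class MaterializedView:
    """Hypothetical description of one aggregated materialized view."""
    name: str
    dimensions: frozenset  # columns the view is grouped by
    approx_rows: int       # rough size, used as a cost estimate


def pick_view(views, needed_dimensions):
    """Return the smallest view that still contains every needed dimension."""
    needed = frozenset(needed_dimensions)
    candidates = [v for v in views if needed <= v.dimensions]
    if not candidates:
        raise LookupError(f"no view covers dimensions {sorted(needed)}")
    return min(candidates, key=lambda v: v.approx_rows)


def prune_partitions(partition_days, start, end):
    """Keep only the partitions whose day falls inside [start, end]."""
    return [d for d in partition_days if start <= d <= end]


# Assumed catalog: three views at decreasing granularity (made-up sizes).
VIEWS = [
    MaterializedView("playbacks_by_day_country_video",
                     frozenset({"day", "country", "video_id"}), 10**12),
    MaterializedView("playbacks_by_day_country",
                     frozenset({"day", "country"}), 10**7),
    MaterializedView("playbacks_by_day",
                     frozenset({"day"}), 10**4),
]

# Assumed layout: one range partition per day of 2011.
ALL_PARTITIONS = [date(2011, 1, 1) + timedelta(days=i) for i in range(365)]

# A report grouped by day and country, for the first week of October 2011.
chosen = pick_view(VIEWS, {"day", "country"})
kept = prune_partitions(ALL_PARTITIONS, date(2011, 10, 1), date(2011, 10, 7))
print(chosen.name, len(kept))
```

In this sketch the report is served from the coarsest view that still has the requested grouping columns, and it scans 7 of the 365 daily partitions. A real rewriter would also have to check that the requested measures can be re-aggregated from the chosen view.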