Dremel:
Interactive Analysis of Web-Scale Datasets
Carl Adler
IDSL - Dep. IM - NTUST
Outline
• About Dremel
• Main Features
• Record-Oriented vs. Column-Oriented
• Data Model
• Nested Columnar Storage
• Query Execution
• Experiments
• Conclusions
About Dremel
A scalable, interactive ad-hoc query system for analysis of read-only nested
data. By combining multi-level execution trees and columnar data layout, it is
capable of running aggregation queries over trillion-row tables in seconds.
Main Features
• Dremel is a large-scale system
• A complement to MapReduce (MR) for interactive queries
• The nested data model
• Builds on ideas from web search and parallel DBMSs
• Column-striped storage representation
Main Features
Dremel is a large-scale system:
• Reading 1TB of compressed data in 1 sec
-> Needs tens of thousands of disks, concurrently reading.
-> Fault tolerance is critical.
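A back-of-the-envelope check of the disk count, as a rough sketch. The per-disk throughput of 50 MB/s is an assumed, illustrative figure and does not come from the slides; `disks_needed` is a hypothetical helper.

```python
import math

def disks_needed(total_bytes, seconds, bytes_per_sec_per_disk):
    """Number of disks that must read concurrently to scan total_bytes in `seconds`."""
    return math.ceil(total_bytes / (seconds * bytes_per_sec_per_disk))

TB = 10**12
MB = 10**6

# Assumption: one commodity disk sustains about 50 MB/s of sequential reads.
print(disks_needed(1 * TB, 1, 50 * MB))  # 20000
```

Under that assumption the scan needs on the order of tens of thousands of disks, which is why a single slow or failed disk becomes a routine event rather than an exception.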
Main Features
A complement to MapReduce (MR) for interactive queries:
• Unlike traditional DBs, it is capable of operating on in situ nested data.
• Not a replacement for MR.
Main Features
The nested data model:
• Data used on the web is often non-relational.
• A flexible data model, such as JSON, is needed.
Main Features
Builds on ideas from web search and parallel DBMSs:
• Serving tree:
Divide a huge and complicated query into several small queries.
• SQL-like interface:
Like Hive and Pig.
Main Features
Column-striped storage representation:
• Reads less data from secondary storage and reduces CPU cost thanks to
cheaper compression.
• Column stores have been adopted for analyzing relational data but to
the best of our knowledge have not been extended to nested data
models.
Record-Oriented vs. Column-Oriented
[Figure: record-oriented layout (left) and column-oriented layout (right)]
Record-Oriented vs. Column-Oriented
• We can retrieve A.B.C without reading A.E, A.B.D, etc.
• Challenge: how to scan an arbitrary subset of fields efficiently and run
the analysis at the same time.
Data Model
• The data model originated in the context of distributed systems (Protocol
Buffers), is used widely at Google, and is available as an open source
implementation.
• The data model is based on strongly-typed nested records.
Its abstract syntax is given by:
$\tau = \mathbf{dom} \mid \langle A_1 : \tau[*|?], \ldots, A_n : \tau[*|?] \rangle$
Data Model
$\tau = \mathbf{dom} \mid \langle A_1 : \tau[*|?], \ldots, A_n : \tau[*|?] \rangle$
• 𝝉: An atomic type or a record type.
• Atomic type: Integers, floating-point numbers, strings, etc.
• Record: consists of one or more fields.
• Repeated fields (*) may occur multiple times in a record.
• Optional fields (?) may be missing from the record.
• Otherwise, a field is required.
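The grammar above can be sketched as a small set of Python types. This is only an illustration: the class names (`Atomic`, `Field`, `Record`), the helper `leaf_paths`, and the sample document schema are assumptions made for the example, not an API of Dremel or Protocol Buffers.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Atomic:
    """An atomic type (the `dom` case), e.g. int64 or string."""
    name: str

@dataclass(frozen=True)
class Field:
    """A named field; label is 'required', 'optional' (?) or 'repeated' (*)."""
    name: str
    type: Union["Atomic", "Record"]
    label: str = "required"

@dataclass(frozen=True)
class Record:
    """A record type: an ordered group of fields."""
    fields: Tuple[Field, ...]

def leaf_paths(record, prefix=""):
    """List the dotted paths of all atomic (leaf) fields; each becomes one column."""
    paths = []
    for f in record.fields:
        path = prefix + f.name
        if isinstance(f.type, Record):
            paths.extend(leaf_paths(f.type, path + "."))
        else:
            paths.append(path)
    return paths

# Hypothetical schema for a web document, used only for illustration.
document = Record((
    Field("DocId", Atomic("int64")),
    Field("Links", Record((
        Field("Backward", Atomic("int64"), "repeated"),
        Field("Forward", Atomic("int64"), "repeated"),
    )), "optional"),
    Field("Name", Record((
        Field("Language", Record((
            Field("Code", Atomic("string")),
            Field("Country", Atomic("string"), "optional"),
        )), "repeated"),
        Field("Url", Atomic("string"), "optional"),
    )), "repeated"),
))

print(leaf_paths(document))
```

Each leaf path, such as Name.Language.Code, corresponds to one column in the column-striped representation.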
Data Model
This type of data model is language independent and platform-neutral, so an MR
program written in Java can consume records from a data source exposed via a
C++ library.
Nested Columnar Storage
• Values alone do not convey the structure of a record.
• Given two values of a repeated field, we do not know at what ‘level’ the value
was repeated (e.g., whether the values come from two different records or are
two repeated values in the same record).
-> Repetition Levels
• Given a missing optional field, we do not know which enclosing records were
defined explicitly.
-> Definition Levels
Nested Columnar Storage: Repetition Levels
• Repetition Levels:
It tells us at what repeated field in the field’s path the value has repeated.
• The field path Name.Language.Code contains two repeated fields, Name
and Language. Hence, the repetition level of Code ranges between 0 and 2;
level 0 denotes the start of a new record.
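The range of levels for a column follows directly from the labels along its path. A minimal sketch, assuming a path is given as a list of (name, label) pairs; the helper `max_levels` and the label of Code ("required") are assumptions made for this example.

```python
def max_levels(path):
    """Return (max repetition level, max definition level) for a field path.

    path: list of (field_name, label), label in {"required", "optional", "repeated"}.
    """
    # Only repeated fields can raise the repetition level.
    max_r = sum(1 for _, label in path if label == "repeated")
    # Optional and repeated fields may be absent, so both raise the definition level.
    max_d = sum(1 for _, label in path if label in ("optional", "repeated"))
    return max_r, max_d

code_path = [("Name", "repeated"), ("Language", "repeated"), ("Code", "required")]
print(max_levels(code_path))  # (2, 2)
```

For Name.Language.Code the repetition level therefore ranges over 0, 1, and 2: 0 for a new record, 1 for a repeat at Name, 2 for a repeat at Language.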
Nested Columnar Storage: Definition Levels
• Definition Levels:
Each value of a field with path p, esp. every NULL, has a definition level
specifying how many fields in p that could be undefined (because they are
optional or repeated) are actually present in the record.
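Both kinds of level can be computed in one recursive pass over a record. The following is a simplified sketch for a single column, not the system's actual striping algorithm; the function `stripe_column`, the dict-based record encoding, and the sample record are assumptions made for illustration.

```python
def stripe_column(record, path):
    """Return the (value, repetition_level, definition_level) entries of one column.

    record: nested dicts; repeated fields hold lists, missing fields are absent.
    path:   list of (field_name, label), label in {"required", "optional", "repeated"}.
    """
    out = []

    def walk(node, i, r, d, rep_depth):
        if i == len(path):
            out.append((node, r, d))          # reached the leaf value
            return
        name, label = path[i]
        child = node.get(name)
        if label == "repeated":
            items = child or []
            if not items:
                out.append((None, r, d))      # NULL: path stops being defined here
                return
            level = rep_depth + 1
            for k, item in enumerate(items):
                # The first item inherits r; later items repeat at this field's level.
                walk(item, i + 1, r if k == 0 else level, d + 1, level)
        elif label == "optional":
            if child is None:
                out.append((None, r, d))
            else:
                walk(child, i + 1, r, d + 1, rep_depth)
        else:                                  # required: assumed present
            walk(child, i + 1, r, d, rep_depth)

    walk(record, 0, 0, 0, 0)
    return out

# Hypothetical record with three Name entries; the second has no Language.
r1 = {
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}]},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ]
}
code_path = [("Name", "repeated"), ("Language", "repeated"), ("Code", "required")]
print(stripe_column(r1, code_path))
# [('en-us', 0, 2), ('en', 2, 2), (None, 1, 1), ('en-gb', 1, 2)]
```

The NULL entry has definition level 1 because only Name is present on that path; its repetition level 1 records that the repeat happened at Name rather than at Language.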
Nested Columnar Storage
• Splitting Records into Columns:
With this data model, writing is straightforward; the main concern is reading.
A query does not need to read entire records: it reads only the columns it
needs and assembles a partial record from them.
Nested Columnar Storage
Complete record assembly automaton. Edges are labeled with repetition levels.
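The full automaton is not reproduced here, but one of its building blocks is easy to sketch: because repetition level 0 marks the start of a new record, a reader can recover record boundaries from a single column alone. The helper `split_records` and the sample column below are assumptions made for illustration.

```python
def split_records(column):
    """Group the (value, r, d) entries of one column by record.

    An entry with r == 0 starts a new record; a well-formed column
    always begins with such an entry.
    """
    records = []
    for value, r, d in column:
        if r == 0:
            records.append([])
        records[-1].append((value, r, d))
    return records

# Hypothetical Name.Language.Code column covering two records.
column = [("en-us", 0, 2), ("en", 2, 2), (None, 1, 1), ("en-gb", 1, 2), (None, 0, 1)]
for rec in split_records(column):
    print(rec)
```

Full record assembly additionally uses the non-zero repetition levels to decide which nesting level to return to when moving between columns.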
Query Execution
• Dremel’s query language is based on SQL and is designed to be efficiently
implementable on columnar nested storage.
• Each SQL statement takes as input one or multiple nested tables and their
schemas and produces a nested table and its output schema.
Query Execution
Sample query, its result, and output schema.
Query Execution
Architecture:
• Dremel uses a multi-level serving tree to execute queries.
• A root server receives incoming queries, reads metadata from the tables,
and routes the queries to the next level in the serving tree. The leaf servers
communicate with the storage layer or access the data on local disk.
Query Execution
System architecture and execution inside a server node.
Query Execution
• Consider a simple aggregation query below:
SELECT A, COUNT(B) FROM T GROUP BY A
• When the root server receives the above query, it determines all tablets, i.e.,
horizontal partitions of the table, that comprise T and rewrites the query as
follows:
SELECT A, SUM(c) FROM (R^1_1 UNION ALL ... R^1_n) GROUP BY A
• Tables R^1_1, …, R^1_n are the results of queries sent to nodes 1, …, n at
level 1 of the serving tree.
Query Execution
• Tables R^1_1, …, R^1_n are the results of queries sent to nodes 1, …, n at
level 1 of the serving tree:
R^1_i = SELECT A, COUNT(B) AS c FROM T^1_i GROUP BY A
• T^1_i is a disjoint partition of tablets in T processed by server i at level 1.
• Each server therefore works on a dataset smaller than the original table, so
each partition can be processed faster.
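The rewrite can be simulated in a few lines: each level-1 server counts over its own tablets, and the root sums the partial counts. This is a toy sketch of the idea, not the actual system; the functions `leaf_query` and `root_query` and the sample rows are assumptions made for illustration.

```python
from collections import Counter

def leaf_query(tablets):
    """R_i = SELECT A, COUNT(B) AS c FROM T_i GROUP BY A, over one server's tablets."""
    counts = Counter()
    for tablet in tablets:
        for row in tablet:
            # COUNT(B) counts only rows where B is not NULL.
            counts[row["A"]] += 1 if row.get("B") is not None else 0
    return counts

def root_query(partials):
    """SELECT A, SUM(c) FROM (R_1 UNION ALL ... R_n) GROUP BY A."""
    total = Counter()
    for partial in partials:
        for a, c in partial.items():
            total[a] += c
    return dict(total)

# Hypothetical table T made of three tablets.
tablets = [
    [{"A": "x", "B": 1}, {"A": "y", "B": None}],
    [{"A": "x", "B": 2}, {"A": "y", "B": 3}],
    [{"A": "x", "B": None}, {"A": "z", "B": 4}],
]
# Disjoint partitions of the tablets, one per level-1 server.
partitions = [tablets[:2], tablets[2:]]
partials = [leaf_query(p) for p in partitions]
print(root_query(partials))  # {'x': 2, 'y': 1, 'z': 1}
```

The rewrite works because COUNT decomposes into partial counts followed by a SUM, so the partitions can be aggregated independently and in parallel.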
Query Execution
• Dremel is a multi-user system: usually several queries are executed
simultaneously.
• A query dispatcher schedules queries based on their priorities and balances
the load. Its other important role is to provide fault tolerance when one
server becomes much slower than others or a tablet replica becomes
unreachable.
Query Execution
• A system with 3000 leaf servers
• Each leaf server using 8 threads
• 3000 * 8 = 24000 (slots)
• A table spanning 100,000 tablets
• Assigning about 5 tablets / slot
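The arithmetic behind these numbers, as a short sketch. The helper name `tablets_per_slot` is an assumption; the inputs are the figures from the slide.

```python
import math

def tablets_per_slot(leaf_servers, threads_per_server, tablets):
    """Return (slots, average tablets per slot, upper bound per slot)."""
    slots = leaf_servers * threads_per_server
    average = tablets / slots
    return slots, average, math.ceil(average)

slots, average, upper = tablets_per_slot(3000, 8, 100_000)
print(slots)              # 24000
print(round(average, 2))  # 4.17
print(upper)              # 5
```

The average is a little over 4 tablets per slot, so with an even assignment no slot needs more than 5, which matches the "about 5" on the slide.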
Experiments
• The basic data access characteristics on a single machine
• How columnar storage benefits MR execution
• Dremel’s performance
Experiments
Table   Number of     Size (unrepl.,   Number      Data     Repl.
name    records       compressed)      of fields   center   factor
T1      85 billion    87 TB            270         A        3×
T2      24 billion    13 TB            530         A        3×
T3      4 billion     70 TB            1200        A        3×
T4      1+ trillion   105 TB           50          B        3×
T5      1+ trillion   20 TB            30          B        2×
Datasets used in the experimental study
Experiments – Single Machine
Performance breakdown when reading from a local disk
(300K-record fragment of Table T1)
Dataset T1: 85 billion records, 87 TB, 270 fields, data center A, 3× replication
Experiments – MR and Dremel
Q1: SELECT SUM(CountWords(txtField)) / COUNT(*) FROM T1
Dataset T1: 85 billion records, 87 TB, 270 fields, data center A, 3× replication
Experiments – Serving Tree Topology
Q2: SELECT country, SUM(item.amount) FROM T2 GROUP BY country
Q3: SELECT domain, SUM(item.amount) FROM T2 WHERE domain CONTAINS '.net' GROUP BY domain
Dataset T2: 24 billion records, 13 TB, 530 fields, data center A, 3× replication
Experiments – Per-tablet Histograms
The area under each histogram corresponds to 100%. As the figure
indicates, 99% of Q2 (or Q3) tablets are processed under one second
(or two seconds).
Experiments – Scalability
In each run, the total expended CPU time is nearly identical, at about
300K seconds, whereas the user-perceived time decreases near-linearly
with the growing size of the system.
Experiments – Stragglers
Q6: SELECT COUNT(DISTINCT a) FROM T5
In contrast to the other datasets, T5 is two-way replicated. Hence, the
likelihood of stragglers slowing the execution is higher since there are
fewer opportunities to reschedule the work.
Dataset T5: 1+ trillion records, 20 TB, 30 fields, data center B, 2× replication
Conclusions
• Dremel is a custom, scalable data management solution built from simpler
components. It complements the MR paradigm.
• We outlined the key aspects of Dremel, including its storage format, query
language, and execution.
• Its two key ingredients are multi-level execution trees and a columnar data
layout.
• It may see wide adoption in the future.
Reference
• Dremel: Interactive Analysis of Web-Scale Datasets
END
