SlideShare a Scribd company logo
1 of 26
Daniel Abadi -- Yale University
Column-Stores vs. Row-Stores:
How Different Are They
Really?
Daniel Abadi (Yale),
Samuel Madden (MIT),
Nabil Hachem (AvantGarde Consulting)
June 12th
, 2008
Row vs. Column-Stores
Last
Name
First
Name
E-
m
ai
l
Phone
#
Street
Address
Last
Name
First
Name E-mail Phone #
Street
Address
Row-Store Column-Store
− Might read in
unnecessary data
+ Only need to read
in relevant data
+ Easy to add a new record
− Tuple writes might require
multiple seeks
Column-Stores
• Really good for read-mostly data
warehouses
 Lot’s of column scans and aggregations
 Writes tend to be in batch
 [CK85], [SAB+05], [ZBN+05], [HLA+06],
[SBC+07] all verify this
 Top 3 in TPC-H rankings (Exasol, ParAccel,
and Kickfire) are column-stores
 Factor of 5 faster on performance
 Factor of 2 superior on price/performance
Data Warehouse DBMS
Software
• $4.5 billion industry (out of total $16 billion
DBMS software industry)
• Growing 10% annually
Momentum
• Right solution for growing market  $$$$
• Vertica, ParAccel, Kickfire, Calpont,
Infobright, and Exasol new entrants
• Sybase IQ’s profits rapidly increasing
• Yahoo’s world largest (multi-petabyte)
data warehouse is a column-store (from
Mahat Technologies acquisition)
Paper Looks At Key Question
• How much of the buzz around column-
stores just marketing hype?
 Do you really need to buy Sybase
IQ/Vertica/ParAccel?
 How far will your current row-store take you?
 Can you get column-store performance from a row-
store?
 Can you simulate a column-store in a row-store
Paper Methodology
• Comparing row-store vs. column-store is
dangerous/borderline meaningless
• Instead, compare row-store vs. row-store
and column-store vs. column-store
 Simulate a column-store inside of a row-store
 Remove column-oriented features from
column-store until it behaves like a row-store
Simulate Column-Store
Inside Row-Store
Last
Name
First
Name
E-
m
ai
l
Phone
#
Street
Address
Last
Name
First
Name E-mail
1
2
3
1
2
3
1
2
3
Option A:
Vertical
Partitioning
…
Option B:
Index Every
Column
Last Name
Index
First Name Index
Star Schema Benchmark
• Fact table contains 17 columns and
60,000,000 rows
• 4 dimension tables, biggest one has
80,000 rows
• Queries touch 3-4 foreign keys in fact
table, 1-2 numeric columns
SSBM Averages
0.0
50.0
100.0
150.0
200.0
250.0
Time(seconds)
Average 25.7 79.9 221.2
Normal Row-Store
Vertically Partitioned
Row-Store
Row-Store With All
Indexes
What’s Going On?
• Vertically Partitioned Case
 Tuple Sizes
 Horizontal Partitioning
• All Indexes Case
 Tuple Reconstruction
Star Schema Benchmark
• Fact table contains 17 columns and
60,000,000 rows
• 4 dimension tables, biggest one has
80,000 rows
• Queries touch 3-4 foreign keys in fact
table, 1-2 numeric columns
Tuple Size
1
2
3
Column
Data
TID
1
2
3
TID Column
Data
1
2
3
TID Column
Data
Tuple
Header
•Queries touch 3-4 foreign keys in fact table, 1-2 numeric
columns
•Complete fact table takes up ~4 GB (compressed)
•Vertically partitioned tables take up 0.7-1.1 GB
(compressed)
Horizontal Partitioning
• Fact table horizontally partitioned on year
 Year is an element of the ‘Date’ dimension
table
 Most queries in SSBM have a predicate on
year
 Since vertically partitioned tables do not
contain the ‘Date’ foreign key, row-store could
not similarly partition them
What’s Going On?
• Vertically Partitioned Case
 Tuple Sizes
 Horizontal Partitioning
• All Indexes Case
 Tuple Construction
Tuple Construction
• Pretty much all queries require a column
to be extracted (in the SELECT clause)
that has not yet been accessed, e.g.:
 SELECT store_name, SUM(revenue)
FROM Facts, Stores
WHERE fact.store_id = stores.store_id
AND stores.area = “NEW ENGLAND”
GROUP BY store_name
Tuple Construction
• Result of lower part of query plan is a set
of TIDs that passed all predicates
• Need to extract SELECT attributes at
these TIDs
 BUT: index maps value to TID
 You really want to map TID to value (i.e., a
vertical partition)
  Tuple construction is SLOW
So….
• All indexes approach is pretty obviously a
poor way to simulate a column-store
• Problems with vertical partitioning are
NOT fundamental
 Store tuple header in a separate partition
 Allow virtual TIDs
 Allow HP using a foreign key on a different VP
• So can row-stores simulate column-
stores?
Row-Store vs. Column-Store
0.0
5.0
10.0
15.0
20.0
25.0
30.0
Time(seconds)
Average 25.7 11.7 4.4
Row-Store Row-Store (M V) C-Store
Column-Store Experiments
• Start with column-store (C-Store)
• Remove column-store-specific
performance optimizations
• End with column-store that behaves like a
row-store
Compression
• Higher data value locality
in column-stores
 Better ratio  reduced I/O
• Can use schemes like
run-length encoding
 Easy to operate on directly
for improved performance
([AMF06])
Q1
Q1
Q1
Q1
Q1
Q1
Q1
Q2
Q2
Q2
Q2
…
…
Quarter
(Q1, 1, 300)
Quarter
(Q2, 301, 350)
(Q3, 651, 500)
(Q4, 1151, 600)
• Early Materialization:
create rows first. But:
 Poor memory bandwidth
utilization
 Lose opportunity for
vectorized operation
2
1
3
1
2
3
3
3
7
13
42
80
Construct
2
3
3
3
7
13
42
80
Select + Aggregate
2
1
3
1
4
4
4
4
prodID storeIDcustID price
QUERY:
SELECT custID,SUM(price)
FROM table
WHERE (prodID = 4) AND
(storeID = 1) AND
GROUP BY custID
Early vs. Late Materialization
4
4
4
4
Other Column-Store
Optimizations
• Invisible join
 Column-store specific join
 Optimizations for star schemas
 Similar to a semi-join
• Block Processing
Simplified Version of Results
0.0
10.0
20.0
30.0
40.0
50.0
Time(seconds)
Average 4.4 14.9 40.7
Original C-Store
C-Store,No
Compression
C-Store,Early
Materialization
Conclusion
• Might be possible to simulate a row-store
in a column-store, BUT:
 Need better support for vertical partitioning at
the storage layer
 Need support for column-specific
optimizations at the executer level
• Working with HP-Labs to find out
Come Join the Yale DB
Group!

More Related Content

What's hot

Data cube computation
Data cube computationData cube computation
Data cube computation
Rashmi Sheikh
 

What's hot (20)

Seven step model of migration into the cloud
Seven step model of migration into the cloudSeven step model of migration into the cloud
Seven step model of migration into the cloud
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
 
Apriori Algorithm
Apriori AlgorithmApriori Algorithm
Apriori Algorithm
 
Vision of cloud computing
Vision of cloud computingVision of cloud computing
Vision of cloud computing
 
Supervised learning and Unsupervised learning
Supervised learning and Unsupervised learning Supervised learning and Unsupervised learning
Supervised learning and Unsupervised learning
 
Intro to LLMs
Intro to LLMsIntro to LLMs
Intro to LLMs
 
Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
 
Ai complete note
Ai complete noteAi complete note
Ai complete note
 
Data lake ppt
Data lake pptData lake ppt
Data lake ppt
 
Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big Data
 
Presentation on K-Means Clustering
Presentation on K-Means ClusteringPresentation on K-Means Clustering
Presentation on K-Means Clustering
 
Introduction to Deep learning
Introduction to Deep learningIntroduction to Deep learning
Introduction to Deep learning
 
Machine learning with ADA Boost
Machine learning with ADA BoostMachine learning with ADA Boost
Machine learning with ADA Boost
 
Neural Networks: Support Vector machines
Neural Networks: Support Vector machinesNeural Networks: Support Vector machines
Neural Networks: Support Vector machines
 
Data cube computation
Data cube computationData cube computation
Data cube computation
 
Data scientist roadmap
Data scientist roadmapData scientist roadmap
Data scientist roadmap
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Smart Data Slides: Machine Learning - Case Studies
Smart Data Slides: Machine Learning - Case StudiesSmart Data Slides: Machine Learning - Case Studies
Smart Data Slides: Machine Learning - Case Studies
 
What is Deep Learning?
What is Deep Learning?What is Deep Learning?
What is Deep Learning?
 
Production system in ai
Production system in aiProduction system in ai
Production system in ai
 

Viewers also liked

Introduction to column oriented databases
Introduction to column oriented databasesIntroduction to column oriented databases
Introduction to column oriented databases
ArangoDB Database
 
NENUG Apr14 Talk - data modeling for netezza
NENUG Apr14 Talk - data modeling for netezzaNENUG Apr14 Talk - data modeling for netezza
NENUG Apr14 Talk - data modeling for netezza
Biju Nair
 
Netezza fundamentals for developers
Netezza fundamentals for developersNetezza fundamentals for developers
Netezza fundamentals for developers
Biju Nair
 

Viewers also liked (20)

Intro to column stores
Intro to column storesIntro to column stores
Intro to column stores
 
Row or Columnar Database
Row or Columnar DatabaseRow or Columnar Database
Row or Columnar Database
 
Introduction to column oriented databases
Introduction to column oriented databasesIntroduction to column oriented databases
Introduction to column oriented databases
 
VLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-StoresVLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-Stores
 
NENUG Apr14 Talk - data modeling for netezza
NENUG Apr14 Talk - data modeling for netezzaNENUG Apr14 Talk - data modeling for netezza
NENUG Apr14 Talk - data modeling for netezza
 
Netezza workload management
Netezza workload managementNetezza workload management
Netezza workload management
 
Using Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve PerformaceUsing Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve Performace
 
Netezza fundamentals for developers
Netezza fundamentals for developersNetezza fundamentals for developers
Netezza fundamentals for developers
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for Architects
 
Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)
 
Rise of Column Oriented Database
Rise of Column Oriented DatabaseRise of Column Oriented Database
Rise of Column Oriented Database
 
Hbase: Introduction to column oriented databases
Hbase: Introduction to column oriented databasesHbase: Introduction to column oriented databases
Hbase: Introduction to column oriented databases
 
Data mining & column stores
Data mining & column storesData mining & column stores
Data mining & column stores
 
Session initiation protocol
Session initiation protocolSession initiation protocol
Session initiation protocol
 
MonetDB :column-store approach in database
MonetDB :column-store approach in databaseMonetDB :column-store approach in database
MonetDB :column-store approach in database
 
Efficient transaction processing in sap hana
Efficient transaction processing in sap hanaEfficient transaction processing in sap hana
Efficient transaction processing in sap hana
 
Conquering "big data": An introduction to shard query
Conquering "big data": An introduction to shard queryConquering "big data": An introduction to shard query
Conquering "big data": An introduction to shard query
 
Invisible loading
Invisible loadingInvisible loading
Invisible loading
 
Beckman abadi-5min-pres
Beckman abadi-5min-presBeckman abadi-5min-pres
Beckman abadi-5min-pres
 
Daniel Abadi: VLDB 2009 Panel
Daniel Abadi: VLDB 2009 PanelDaniel Abadi: VLDB 2009 Panel
Daniel Abadi: VLDB 2009 Panel
 

Similar to Column-Stores vs. Row-Stores: How Different are they Really?

SQL Explore 2012 - Michael Zilberstein: ColumnStore
SQL Explore 2012 - Michael Zilberstein: ColumnStoreSQL Explore 2012 - Michael Zilberstein: ColumnStore
SQL Explore 2012 - Michael Zilberstein: ColumnStore
sqlserver.co.il
 

Similar to Column-Stores vs. Row-Stores: How Different are they Really? (20)

2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_upload2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_upload
 
2017 biological databasespart2
2017 biological databasespart22017 biological databasespart2
2017 biological databasespart2
 
2016 02 23_biological_databases_part2
2016 02 23_biological_databases_part22016 02 23_biological_databases_part2
2016 02 23_biological_databases_part2
 
Sql rally 2013 columnstore indexes
Sql rally 2013   columnstore indexesSql rally 2013   columnstore indexes
Sql rally 2013 columnstore indexes
 
BI Knowledge Sharing Session 2
BI Knowledge Sharing Session 2BI Knowledge Sharing Session 2
BI Knowledge Sharing Session 2
 
MySql
MySqlMySql
MySql
 
In memory databases presentation
In memory databases presentationIn memory databases presentation
In memory databases presentation
 
Tuning Apache Phoenix/HBase
Tuning Apache Phoenix/HBaseTuning Apache Phoenix/HBase
Tuning Apache Phoenix/HBase
 
Sloupcové uložení dat a použití in-memory technologií u řešení Exadata
Sloupcové uložení dat a použití in-memory technologií u řešení ExadataSloupcové uložení dat a použití in-memory technologií u řešení Exadata
Sloupcové uložení dat a použití in-memory technologií u řešení Exadata
 
Excel Tips 101
Excel Tips 101Excel Tips 101
Excel Tips 101
 
Excel tips
Excel tipsExcel tips
Excel tips
 
SQL Explore 2012 - Michael Zilberstein: ColumnStore
SQL Explore 2012 - Michael Zilberstein: ColumnStoreSQL Explore 2012 - Michael Zilberstein: ColumnStore
SQL Explore 2012 - Michael Zilberstein: ColumnStore
 
Deep Dive: Amazon Redshift (March 2017)
Deep Dive: Amazon Redshift (March 2017)Deep Dive: Amazon Redshift (March 2017)
Deep Dive: Amazon Redshift (March 2017)
 
Deep Dive Redshift, with a focus on performance
Deep Dive Redshift, with a focus on performanceDeep Dive Redshift, with a focus on performance
Deep Dive Redshift, with a focus on performance
 
Tunning overview
Tunning overviewTunning overview
Tunning overview
 
FOUNDATION OF DATA SCIENCE SQL QUESTIONS
FOUNDATION OF DATA SCIENCE SQL QUESTIONSFOUNDATION OF DATA SCIENCE SQL QUESTIONS
FOUNDATION OF DATA SCIENCE SQL QUESTIONS
 
Excel Tips.pptx
Excel Tips.pptxExcel Tips.pptx
Excel Tips.pptx
 
Building better SQL Server Databases
Building better SQL Server DatabasesBuilding better SQL Server Databases
Building better SQL Server Databases
 
Redshift deep dive
Redshift deep diveRedshift deep dive
Redshift deep dive
 
IT301-Datawarehousing (1) and its sub topics.pptx
IT301-Datawarehousing (1) and its sub topics.pptxIT301-Datawarehousing (1) and its sub topics.pptx
IT301-Datawarehousing (1) and its sub topics.pptx
 

More from Daniel Abadi

From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...
Daniel Abadi
 
CAP, PACELC, and Determinism
CAP, PACELC, and DeterminismCAP, PACELC, and Determinism
CAP, PACELC, and Determinism
Daniel Abadi
 

More from Daniel Abadi (9)

Leopard: Lightweight Partitioning and Replication for Dynamic Graphs
Leopard: Lightweight Partitioning and Replication  for Dynamic Graphs Leopard: Lightweight Partitioning and Replication  for Dynamic Graphs
Leopard: Lightweight Partitioning and Replication for Dynamic Graphs
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
The Power of Determinism in Database Systems
The Power of Determinism in Database SystemsThe Power of Determinism in Database Systems
The Power of Determinism in Database Systems
 
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...
 
Shared slides-edbt-keynote-03-19-13
Shared slides-edbt-keynote-03-19-13Shared slides-edbt-keynote-03-19-13
Shared slides-edbt-keynote-03-19-13
 
Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012
 
Hadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and OpportunitiesHadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and Opportunities
 
CAP, PACELC, and Determinism
CAP, PACELC, and DeterminismCAP, PACELC, and Determinism
CAP, PACELC, and Determinism
 
Daniel Abadi HadoopWorld 2010
Daniel Abadi HadoopWorld 2010Daniel Abadi HadoopWorld 2010
Daniel Abadi HadoopWorld 2010
 

Recently uploaded

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Recently uploaded (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Column-Stores vs. Row-Stores: How Different are they Really?

  • 1. Daniel Abadi -- Yale University Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale), Samuel Madden (MIT), Nabil Hachem (AvantGarde Consulting) June 12th , 2008
  • 2. Row vs. Column-Stores Last Name First Name E- m ai l Phone # Street Address Last Name First Name E-mail Phone # Street Address Row-Store Column-Store − Might read in unnecessary data + Only need to read in relevant data + Easy to add a new record − Tuple writes might require multiple seeks
  • 3. Column-Stores • Really good for read-mostly data warehouses  Lot’s of column scans and aggregations  Writes tend to be in batch  [CK85], [SAB+05], [ZBN+05], [HLA+06], [SBC+07] all verify this  Top 3 in TPC-H rankings (Exasol, ParAccel, and Kickfire) are column-stores  Factor of 5 faster on performance  Factor of 2 superior on price/performance
  • 4. Data Warehouse DBMS Software • $4.5 billion industry (out of total $16 billion DBMS software industry) • Growing 10% annually
  • 5. Momentum • Right solution for growing market  $$$$ • Vertica, ParAccel, Kickfire, Calpont, Infobright, and Exasol new entrants • Sybase IQ’s profits rapidly increasing • Yahoo’s world largest (multi-petabyte) data warehouse is a column-store (from Mahat Technologies acquisition)
  • 6. Paper Looks At Key Question • How much of the buzz around column- stores just marketing hype?  Do you really need to buy Sybase IQ/Vertica/ParAccel?  How far will your current row-store take you?  Can you get column-store performance from a row- store?  Can you simulate a column-store in a row-store
  • 7. Paper Methodology • Comparing row-store vs. column-store is dangerous/borderline meaningless • Instead, compare row-store vs. row-store and column-store vs. column-store  Simulate a column-store inside of a row-store  Remove column-oriented features from column-store until it behaves like a row-store
  • 8. Simulate Column-Store Inside Row-Store Last Name First Name E- m ai l Phone # Street Address Last Name First Name E-mail 1 2 3 1 2 3 1 2 3 Option A: Vertical Partitioning … Option B: Index Every Column Last Name Index First Name Index
  • 9. Star Schema Benchmark • Fact table contains 17 columns and 60,000,000 rows • 4 dimension tables, biggest one has 80,000 rows • Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns
  • 10. SSBM Averages 0.0 50.0 100.0 150.0 200.0 250.0 Time(seconds) Average 25.7 79.9 221.2 Normal Row-Store Vertically Partitioned Row-Store Row-Store With All Indexes
  • 11. What’s Going On? • Vertically Partitioned Case  Tuple Sizes  Horizontal Partitioning • All Indexes Case  Tuple Reconstruction
  • 12. Star Schema Benchmark • Fact table contains 17 columns and 60,000,000 rows • 4 dimension tables, biggest one has 80,000 rows • Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns
  • 13. Tuple Size 1 2 3 Column Data TID 1 2 3 TID Column Data 1 2 3 TID Column Data Tuple Header •Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns •Complete fact table takes up ~4 GB (compressed) •Vertically partitioned tables take up 0.7-1.1 GB (compressed)
  • 14. Horizontal Partitioning • Fact table horizontally partitioned on year  Year is an element of the ‘Date’ dimension table  Most queries in SSBM have a predicate on year  Since vertically partitioned tables do not contain the ‘Date’ foreign key, row-store could not similarly partition them
  • 15. What’s Going On? • Vertically Partitioned Case  Tuple Sizes  Horizontal Partitioning • All Indexes Case  Tuple Construction
  • 16. Tuple Construction • Pretty much all queries require a column to be extracted (in the SELECT clause) that has not yet been accessed, e.g.:  SELECT store_name, SUM(revenue) FROM Facts, Stores WHERE fact.store_id = stores.store_id AND stores.area = “NEW ENGLAND” GROUP BY store_name
  • 17. Tuple Construction • Result of lower part of query plan is a set of TIDs that passed all predicates • Need to extract SELECT attributes at these TIDs  BUT: index maps value to TID  You really want to map TID to value (i.e., a vertical partition)   Tuple construction is SLOW
  • 18. So…. • All indexes approach is pretty obviously a poor way to simulate a column-store • Problems with vertical partitioning are NOT fundamental  Store tuple header in a separate partition  Allow virtual TIDs  Allow HP using a foreign key on a different VP • So can row-stores simulate column- stores?
  • 20. Column-Store Experiments • Start with column-store (C-Store) • Remove column-store-specific performance optimizations • End with column-store that behaves like a row-store
  • 21. Compression • Higher data value locality in column-stores  Better ratio  reduced I/O • Can use schemes like run-length encoding  Easy to operate on directly for improved performance ([AMF06]) Q1 Q1 Q1 Q1 Q1 Q1 Q1 Q2 Q2 Q2 Q2 … … Quarter (Q1, 1, 300) Quarter (Q2, 301, 350) (Q3, 651, 500) (Q4, 1151, 600)
  • 22. • Early Materialization: create rows first. But:  Poor memory bandwidth utilization  Lose opportunity for vectorized operation 2 1 3 1 2 3 3 3 7 13 42 80 Construct 2 3 3 3 7 13 42 80 Select + Aggregate 2 1 3 1 4 4 4 4 prodID storeIDcustID price QUERY: SELECT custID,SUM(price) FROM table WHERE (prodID = 4) AND (storeID = 1) AND GROUP BY custID Early vs. Late Materialization 4 4 4 4
  • 23. Other Column-Store Optimizations • Invisible join  Column-store specific join  Optimizations for star schemas  Similar to a semi-join • Block Processing
  • 24. Simplified Version of Results 0.0 10.0 20.0 30.0 40.0 50.0 Time(seconds) Average 4.4 14.9 40.7 Original C-Store C-Store,No Compression C-Store,Early Materialization
  • 25. Conclusion • Might be possible to simulate a row-store in a column-store, BUT:  Need better support for vertical partitioning at the storage layer  Need support for column-specific optimizations at the executer level • Working with HP-Labs to find out
  • 26. Come Join the Yale DB Group!