
How BigQuery broke my heart

8,049 views


BigQuery is Google's columnar, massively parallel data querying solution. This talk explores using it as an ad-hoc reporting solution and the limitations present in May 2013.

Published in: Technology
  • Hey Gabriel - I know this is a pretty old post, but I wanted to know if you've had a chance to look into Snowflake?
  • @Gilberto Torrezan Filho Yeah, BigQuery keeps getting more amazing. I did try rerunning the queries in spring 2014 and still was not able to.

    Redshift and Teradata were the finalists of our evaluation. Redshift has great flexibility and was the cheapest (it's even cheaper now), Teradata had the fastest performance and some nice features (like aggregate indexes). I would have gone with Redshift but the company chose Teradata.
  • Well, BigQuery as of today (2014) is pretty different from when you tested it. The prices are lower, queries can now return large results, there are no differences in cost between batch and interactive queries... After 1 year, what did you choose to solve your analytics problems?
  • Patrick,
    I think I glanced at it but didn't know enough about it to put it on the short list of ones to evaluate. That's the funny thing about evaluating software, you never know what's lurking just outside of your search space.
  • Gabe,

    Curious as to why you didn't look at Vertica?

How BigQuery broke my heart

  1. How BigQuery broke my heart (Gabe Hamilton)
  2. Reporting Solutions Smackdown. We are evaluating replacements for SQL Server for our Reporting & Business Intelligence backend. Many TBs of data. The closer to SQL, the less report migration we need to do. We like saving money.
  3. Solutions we've been testing: Redshift, BigQuery, CouchDB, MongoDB, Cassandra, Teradata, Oracle.
  4. Plus various changes to our design. Some of these are necessary for certain technologies: denormalization, sharding strategies, nested data, tuning our existing star schema and tables.
  5. BigQuery is a massively parallel, columnar datastore. Queries are SQL SELECT statements. It uses a tree structure to distribute work across nodes.
  6. How many nodes? 10,000 nodes!
  7. And what price? 3.5 cents/GB. Query cost is per GB in the columns processed. Interactive queries: $0.035/GB. Batch queries: $0.02/GB. Storage: $0.12 per GB/month.
  8. Which is great for our big queries. A gnarly query that looks at 200 GB of data costs $7.50 in BigQuery. If that takes 2 hours to run on a $60/hr cluster of a competing technology... It's a little more complicated because in theory several of those queries could run simultaneously on the competing tech. Still, that's 4X cheaper, plus the speed improvement.
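The arithmetic behind this comparison can be sketched quickly. The per-GB rate comes from the pricing slide; the 4-way concurrency on the competing cluster is an assumption chosen here to reproduce the slide's "4X cheaper" figure:

```python
# Back-of-the-envelope check of the slide's cost comparison.
gb_scanned = 200
interactive_rate = 0.035                       # $/GB, from the pricing slide
bigquery_cost = gb_scanned * interactive_rate  # about $7 (the slide quotes $7.50)

cluster_cost = 2 * 60                          # 2 hours on a $60/hr cluster = $120
concurrent_queries = 4                         # assumed: several queries share the run
per_query_cluster_cost = cluster_cost / concurrent_queries  # $30 per query

# Ratio is roughly the "4X cheaper" claim from the slide.
print(per_query_cluster_cost / bigquery_cost)
```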
  9. Example: GitHub data from the past year (3.5 GB table).
     SELECT type, count(*) as num FROM [publicdata:samples.github_timeline] group by type order by num desc;
     Query complete (1.1s elapsed, 75.0 MB processed).
     PushEvent 2,686,723; CreateEvent 964,830; WatchEvent 581,029; IssueCommentEvent 507,724; GistEvent 366,643; IssuesEvent 305,479; ForkEvent 180,712; PullRequestEvent 173,204; FollowEvent 156,427; GollumEvent 104,808.
     Cost: $0.0026, or 5 for a penny.
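The quoted cost checks out against the interactive rate (a rough sketch using decimal megabytes, not official billing math):

```python
# Sanity check of the GitHub example's quoted cost.
mb_processed = 75.0                    # from the query stats on the slide
interactive_rate = 0.035               # $/GB, from the pricing slide
cost = (mb_processed / 1000) * interactive_rate
print(f"${cost:.4f}")                  # about $0.0026, matching the slide
```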
  10. It was love at first type.
  11. But then... reality.
  12. Uploaded our test dataset, which is 250 GB. Docs are good, tools are good. Hurdle 1: only one join per query. OK, rewrite as ugly nested selects...
  13. Result
  14. Round 2. No problem, I had seen that joins were somewhat experimental. Try the denormalized version of the data:
      SELECT ProductId, StoreId, ProductSizeId, InventoryDate, avg(InventoryQuantity) as InventoryQuantity
      FROM BigDataTest.denorm
      GROUP EACH BY ProductId, StoreId, ProductSizeId, InventoryDate
      The first error message helpfully says: try GROUP EACH BY.
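What that denormalized query computes can be sketched in plain Python. The column names follow the slide; the rows are made-up toy data:

```python
from collections import defaultdict

# Toy inventory rows mirroring the slide's denormalized table (made-up data).
rows = [
    {"ProductId": 1, "StoreId": 10, "ProductSizeId": 2, "InventoryDate": "2013-05-01", "InventoryQuantity": 4},
    {"ProductId": 1, "StoreId": 10, "ProductSizeId": 2, "InventoryDate": "2013-05-01", "InventoryQuantity": 6},
    {"ProductId": 1, "StoreId": 11, "ProductSizeId": 2, "InventoryDate": "2013-05-01", "InventoryQuantity": 3},
]

# GROUP [EACH] BY over four columns: collect quantities per composite key...
groups = defaultdict(list)
for r in rows:
    key = (r["ProductId"], r["StoreId"], r["ProductSizeId"], r["InventoryDate"])
    groups[key].append(r["InventoryQuantity"])

# ...then average per group, like avg(InventoryQuantity).
averages = {key: sum(qs) / len(qs) for key, qs in groups.items()}
print(averages)
```

The number of distinct composite keys is what made the real query blow up: with four grouping columns, the group count can approach the product of the individual column cardinalities.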
  15. Final Result
  16. It's not you, it's me. The documentation had some semi-useful information: "Because the system is interactive, queries that produce a large number of groups might fail. The use of the TOP function instead of GROUP BY might solve the problem." However, the BigQuery TOP function only operates on one column. At this point I had jumped through enough hoops. I posted on Stack Overflow, the official support channel according to the docs, and have gotten no response.
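Why TOP didn't help can be illustrated conceptually in Python (toy data, not BigQuery syntax): TOP(field, n) is essentially a most-frequent-values query over a single column, so a query grouped on a composite key has no direct TOP equivalent:

```python
from collections import Counter

# Toy (ProductId, StoreId) pairs -- made-up data for illustration.
rows = [
    ("p1", "s1"), ("p1", "s1"), ("p1", "s2"),
    ("p2", "s1"), ("p2", "s1"), ("p2", "s1"),
]

# TOP over ONE column: most common ProductId values. This is roughly
# what BigQuery's TOP(field, n) could express at the time.
top_products = Counter(p for p, _ in rows).most_common(2)

# The talk's query needed the equivalent over a COMPOSITE key, which
# TOP could not express; it requires GROUP BY over both columns.
top_pairs = Counter(rows).most_common(2)

print(top_products)
print(top_pairs)
```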
  17. Epilogue. Simplifying my query down to two grouping columns did cause it to run with a limit statement:
      SELECT ProductId, StoreId, avg(InventoryQuantity) as InventoryQuantity
      FROM BigDataTest.denorm
      GROUP EACH BY ProductId, StoreId
      LIMIT 1000
      Query complete (4.5s elapsed, 28.1 GB processed). Without a limit it gives "Error: Response too large to return." Perhaps there is still hope for me and BigQuery...
  18. Me. Like this talk? @gabehamilton. My twitter feed is just technical stuff. Or slideshare.net/gabehamilton
