BigQuery is Google's columnar, massively parallel data querying solution. This talk explores using it as an ad-hoc reporting solution and the limitations present in May 2013.
2. Reporting Solutions Smackdown
We are evaluating replacements for SQL Server for our
Reporting & Business Intelligence backend.
Many TBs of data.
The closer we stay to SQL, the less report migration we need to do.
We like saving money.
4. Plus various changes to our design
Some of these changes are required by certain technologies:
Denormalization
Sharding strategies
Nested data
Tune our existing Star Schema and Tables
5. BigQuery is
A massively parallel datastore
Columnar
Queries are SQL Select statements
Uses a tree structure to distribute queries across nodes
7. And what price?
Query cost is per GB in the columns processed.
Resource             Pricing
Interactive Queries  $0.035 per GB (3.5 cents/GB)
Batch Queries        $0.02 per GB
Storage              $0.12 per GB/month
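The pricing above turns into a quick back-of-the-envelope calculator. This is a sketch using the May 2013 rates quoted on the slide; `query_cost` is a hypothetical helper, not part of any BigQuery API:

```python
# Hypothetical cost helper based on the May 2013 prices above; not a BigQuery API.
INTERACTIVE_PER_GB = 0.035     # $ per GB of column data processed
BATCH_PER_GB = 0.02            # $ per GB, batch queries
STORAGE_PER_GB_MONTH = 0.12    # $ per GB per month

def query_cost(gb_processed, batch=False):
    """Cost of one query, charged per GB in the columns it touches."""
    rate = BATCH_PER_GB if batch else INTERACTIVE_PER_GB
    return gb_processed * rate

print(round(query_cost(200), 2))              # 7.0  (interactive)
print(round(query_cost(200, batch=True), 2))  # 4.0  (batch)
```

Note that only the columns a query actually reads are billed, which is why the pricing is per GB processed rather than per GB stored.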
8. Which is great for our big queries
A gnarly query that scans 200 GB of data costs $7.00 in
BigQuery (200 GB × $0.035/GB).
If that query takes 2 hours to run on a $60/hr cluster of a
competing technology, it costs $120.
It's a little more complicated because, in theory, several of
those queries could run simultaneously on the competing
tech; if four share the cluster, each effectively costs $30.
Still, that's roughly 4× cheaper, plus the speed improvement.
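The arithmetic behind the comparison can be sketched as follows; the four-queries-at-once figure is an assumption standing in for the slide's "several":

```python
# Back-of-the-envelope comparison using the slide's figures.
bigquery_cost = 200 * 0.035        # 200 GB at $0.035/GB (interactive)
cluster_cost = 2 * 60              # 2 hours at $60/hr
concurrent = 4                     # assumption: 4 queries share the cluster
per_query_on_cluster = cluster_cost / concurrent

print(round(bigquery_cost, 2))                          # 7.0
print(round(per_query_on_cluster / bigquery_cost, 1))   # 4.3, i.e. roughly 4x
```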
9. Example: Github data from past year
3.5 GB Table
SELECT type, COUNT(*) AS num
FROM [publicdata:samples.github_timeline]
GROUP BY type
ORDER BY num DESC;
Query complete (1.1s elapsed, 75.0 MB processed)
Event Type num
PushEvent 2,686,723
CreateEvent 964,830
WatchEvent 581,029
IssueCommentEvent 507,724
GistEvent 366,643
IssuesEvent 305,479
ForkEvent 180,712
PullRequestEvent 173,204
FollowEvent 156,427
GollumEvent 104,808
Cost $0.0026
or about four for a penny
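The cost figure checks out against the pricing slide. This is a sketch; whether BigQuery bills in binary or decimal GB is an assumption here (both round to the same cost at this scale):

```python
# Sanity-check the GitHub query cost from "75.0 MB processed".
mb_processed = 75.0
gb = mb_processed / 1024           # assuming binary GB
cost = gb * 0.035                  # interactive rate, $/GB

print(round(cost, 4))              # 0.0026
print(round(0.01 / cost, 1))       # 3.9 such queries per penny
```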
14. Round 2
No problem, I had seen that joins were
somewhat experimental.
Try the denormalized version of the data.
SELECT ProductId, StoreId, ProductSizeId, InventoryDate,
  AVG(InventoryQuantity) AS InventoryQuantity
FROM BigDataTest.denorm
GROUP EACH BY ProductId, StoreId, ProductSizeId, InventoryDate;
The first error message helpfully said to try GROUP EACH BY, which the query above already uses.
16. It's not you, it's me
The documentation had some semi-useful information:
Because the system is interactive, queries that produce a large number of
groups might fail. The use of the TOP function instead of GROUP BY might
solve the problem.
However, the BigQuery TOP function only operates on one column.
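The mismatch can be illustrated with a toy sketch (hypothetical data; `Counter.most_common` plays the role of TOP): TOP-style aggregation ranks the values of a single column, while the failing query needs one group per distinct combination of four columns, and it is that combination count that explodes.

```python
from collections import Counter

# Toy inventory rows: (ProductId, StoreId, ProductSizeId, InventoryDate)
rows = [(1, 10, 5, "2013-05-01"),
        (1, 10, 5, "2013-05-01"),
        (2, 11, 5, "2013-05-02")]

# What TOP can do: rank the values of ONE column.
top_products = Counter(r[0] for r in rows).most_common(10)

# What the query needs: one group per distinct 4-column combination.
groups = Counter(rows)

print(top_products)   # [(1, 2), (2, 1)]
print(len(groups))    # 2 distinct combinations
```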
At this point I had jumped through enough hoops. I posted
on Stack Overflow, the official support channel according to
the docs, and have gotten no response.
17. Epilogue
Simplifying my query down to two grouping columns did
cause it to run, with a LIMIT clause.
SELECT ProductId, StoreId,
  AVG(InventoryQuantity) AS InventoryQuantity
FROM BigDataTest.denorm
GROUP EACH BY ProductId, StoreId
LIMIT 1000;
Query complete (4.5s elapsed, 28.1 GB processed)
Without a LIMIT it fails with "Error: Response too large to return."
Perhaps there is still hope for me and BigQuery...