
How BigQuery broke my heart

8,049 views


BigQuery is Google's columnar, massively parallel data querying solution. This talk explores using it as an ad-hoc reporting solution and the limitations present in May 2013.

Published in: Technology
  • Hey Gabriel - I know this is a pretty old post, but I wanted to know if you've had a chance to look into Snowflake?
  • @Gilberto Torrezan Filho Yeah, BigQuery keeps getting more amazing. I did try rerunning the queries in spring 2014 and still was not able to.

    Redshift and Teradata were the finalists of our evaluation. Redshift has great flexibility and was the cheapest (it's even cheaper now), Teradata had the fastest performance and some nice features (like aggregate indexes). I would have gone with Redshift but the company chose Teradata.
  • Well, BigQuery as of today (2014) is pretty different from when you tested it. The prices are lower, queries can now return large results, there are no differences in cost between batch and interactive queries... After 1 year, what did you choose to solve your analytics problems?
  • Patrick,
    I think I glanced at it but didn't know enough about it to put it on the short list of ones to evaluate. That's the funny thing about evaluating software, you never know what's lurking just outside of your search space.
  • Gabe,

    Curious as to why you didn't look at Vertica?

How BigQuery broke my heart

  1. How BigQuery broke my heart (Gabe Hamilton)
  2. Reporting Solutions Smackdown. We are evaluating replacements for SQL Server for our Reporting & Business Intelligence backend. Many TBs of data. The closer to SQL, the less report migration we need to do. We like saving money.
  3. Solutions we've been testing: Redshift, BigQuery, CouchDB, MongoDB, Cassandra, Teradata, Oracle.
  4. Plus various changes to our design. Some of these are necessary for certain technologies: denormalization, sharding strategies, nested data, tuning our existing star schema and tables.
  5. BigQuery is a massively parallel, columnar datastore. Queries are SQL SELECT statements. It uses a tree structure to distribute work across nodes.
  6. How many nodes? 10,000 nodes!
  7. And what price? 3.5 cents/GB. Query cost is per GB in the columns processed. Interactive queries: $0.035/GB. Batch queries: $0.02/GB. Storage: $0.12 per GB/month.
  8. Which is great for our big queries. A gnarly query that looks at 200 GB of data costs $7.50 in BigQuery. If that takes 2 hours to run on a $60/hr cluster of a competing technology... It's a little more complicated because in theory several of those queries could run simultaneously on the competing tech. Still, that's 4X cheaper, plus the speed improvement.
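The arithmetic behind this comparison can be sketched quickly. The per-GB rate comes from the pricing slide; the 4-way concurrency on the competing cluster is an assumption chosen here to reproduce the slide's "4X cheaper" figure:

```python
# Back-of-the-envelope check of the slide's cost comparison.
gb_scanned = 200
interactive_rate = 0.035                       # $/GB, from the pricing slide
bigquery_cost = gb_scanned * interactive_rate  # about $7 (the slide quotes $7.50)

cluster_cost = 2 * 60                          # 2 hours on a $60/hr cluster = $120
concurrent_queries = 4                         # assumed: several queries share the run
per_query_cluster_cost = cluster_cost / concurrent_queries  # $30 per query

# Ratio is roughly the "4X cheaper" claim from the slide.
print(per_query_cluster_cost / bigquery_cost)
```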
  9. Example: GitHub data from the past year (3.5 GB table).
     SELECT type, count(*) as num FROM [publicdata:samples.github_timeline] group by type order by num desc;
     Query complete (1.1s elapsed, 75.0 MB processed).
     PushEvent 2,686,723; CreateEvent 964,830; WatchEvent 581,029; IssueCommentEvent 507,724; GistEvent 366,643; IssuesEvent 305,479; ForkEvent 180,712; PullRequestEvent 173,204; FollowEvent 156,427; GollumEvent 104,808.
     Cost: $0.0026, or 5 for a penny.
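The quoted cost checks out against the interactive rate (a rough sketch using decimal megabytes, not official billing math):

```python
# Sanity check of the GitHub example's quoted cost.
mb_processed = 75.0                    # from the query stats on the slide
interactive_rate = 0.035               # $/GB, from the pricing slide
cost = (mb_processed / 1000) * interactive_rate
print(f"${cost:.4f}")                  # about $0.0026, matching the slide
```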
  10. It was love at first type.
  11. But then... reality.
  12. Uploaded our test dataset, which is 250 GB. Docs are good, tools are good. Hurdle 1: only one join per query. OK, rewrite as ugly nested selects...
  13. Result
  14. Round 2. No problem, I had seen that joins were somewhat experimental. Try the denormalized version of the data:
      SELECT ProductId, StoreId, ProductSizeId, InventoryDate, avg(InventoryQuantity) as InventoryQuantity
      FROM BigDataTest.denorm
      GROUP EACH BY ProductId, StoreId, ProductSizeId, InventoryDate
      The first error message helpfully says: try GROUP EACH BY.
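What that denormalized query computes can be sketched in plain Python. The column names follow the slide; the rows are made-up toy data:

```python
from collections import defaultdict

# Toy inventory rows mirroring the slide's denormalized table (made-up data).
rows = [
    {"ProductId": 1, "StoreId": 10, "ProductSizeId": 2, "InventoryDate": "2013-05-01", "InventoryQuantity": 4},
    {"ProductId": 1, "StoreId": 10, "ProductSizeId": 2, "InventoryDate": "2013-05-01", "InventoryQuantity": 6},
    {"ProductId": 1, "StoreId": 11, "ProductSizeId": 2, "InventoryDate": "2013-05-01", "InventoryQuantity": 3},
]

# GROUP [EACH] BY over four columns: collect quantities per composite key...
groups = defaultdict(list)
for r in rows:
    key = (r["ProductId"], r["StoreId"], r["ProductSizeId"], r["InventoryDate"])
    groups[key].append(r["InventoryQuantity"])

# ...then average per group, like avg(InventoryQuantity).
averages = {key: sum(qs) / len(qs) for key, qs in groups.items()}
print(averages)
```

The number of distinct composite keys is what made the real query blow up: with four grouping columns, the group count can approach the product of the individual column cardinalities.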
  15. Final Result
  16. It's not you, it's me. The documentation had some semi-useful information: "Because the system is interactive, queries that produce a large number of groups might fail. The use of the TOP function instead of GROUP BY might solve the problem." However, the BigQuery TOP function only operates on one column. At this point I had jumped through enough hoops. I posted on Stack Overflow, the official support channel according to the docs, and have gotten no response.
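Why TOP didn't help can be illustrated conceptually in Python (toy data, not BigQuery syntax): TOP(field, n) is essentially a most-frequent-values query over a single column, so a query grouped on a composite key has no direct TOP equivalent:

```python
from collections import Counter

# Toy (ProductId, StoreId) pairs -- made-up data for illustration.
rows = [
    ("p1", "s1"), ("p1", "s1"), ("p1", "s2"),
    ("p2", "s1"), ("p2", "s1"), ("p2", "s1"),
]

# TOP over ONE column: most common ProductId values. This is roughly
# what BigQuery's TOP(field, n) could express at the time.
top_products = Counter(p for p, _ in rows).most_common(2)

# The talk's query needed the equivalent over a COMPOSITE key, which
# TOP could not express; it requires GROUP BY over both columns.
top_pairs = Counter(rows).most_common(2)

print(top_products)
print(top_pairs)
```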
  17. Epilogue. Simplifying my query down to two grouping columns did cause it to run with a limit statement:
      SELECT ProductId, StoreId, avg(InventoryQuantity) as InventoryQuantity
      FROM BigDataTest.denorm
      GROUP EACH BY ProductId, StoreId
      LIMIT 1000
      Query complete (4.5s elapsed, 28.1 GB processed). Without a limit it gives "Error: Response too large to return." Perhaps there is still hope for me and BigQuery...
  18. Me. Like this talk? @gabehamilton. My twitter feed is just technical stuff. Or slideshare.net/gabehamilton
