SlideShare ist ein Scribd-Unternehmen logo
1 von 19
Downloaden Sie, um offline zu lesen
Google Dremel
Vicente Orjales - vorjales@gmail.com
http://www.linkedin.com/in/vorjales
Index
● Introduction.
● Google and Big Data. History.
● Columnar representation of Nested Data.
● Query Execution.
● Google Dremel vs. Map-Reduce.
● Implementations: Google BigQuery and Apache Drill
● Case Studies.
● Demo - BigQuery.
● Appendix (References and Screenshots): SLIDES 16-26
Introduction.What is Dremel?
Google Dremel is a system developed by Google for interactively querying
large data sets.
● Near real time interactive analysis (instead batch processing). SQL-
like query language.
● Nested data (most of unstructured data supported) with a column
storage representation.
● Tree architecture (as in web search): multi-level execution trees for
query processing.
It is widely used inside Google and available for the public as Google BigQuery.
There is also an open source implementation named Apache Drill (formerly
Open Dremel) still in beta phase.
Google and Big Data. History.
● 2003: Google File System. Scalable distributed file system for
data intensive applications built over cheap commodity hardware.
● 2004: Map-Reduce. Programming model for large data sets.
Automatically parallelizes jobs in large cluster of commodity
machines. Inspiration for Hadoop.
● 2006: Bigtable. Distributed storage system for structured data.
Inspiration for NoSQL databases
● 2010: Percolator. Adds transactions and locks at table and row
level.
● 2010: Pregel. Scalable Graph Computing. Mining data from social
graphs.
● 2010: Dremel. Ner real time solution with SQL-like language.
Model for Google BigQuery.
Columnar Representation of Nested Data
All values of a nested field such as A.B.C are stored contiguously. Therefore,
A.B.C can be retrieved without reading A.E, A.B.D, etc.
Columnar storage proved has been successfully used for flat relational data,
but making it work for the nested data within Dremel was one the challenges
in the Dremel design: how to preserve all structural information and be able
to reconstruct records from an arbitrary subset of fields.
Columnar Representation of Nested Data
Columnar Representation of Nested Data
record (r=0) has repeated
Language (r=0) has repeated
Query execution
Multi-level serving tree for executing queries:
● Root Server: receive incoming queries, read metadata from tables, route
query to next level.
● Leaf Server: Communication and access to the data layer for retrieving the
results.
● Each server has an internal execution tree that corresponds with a physical
query execution plan.
Query execution - Example: count()
Google Dremel vs Map-Reduce.
M-R
Batch Processing
Not query language. May
be added with extra
modules (e.g.: Hive in
Hadoop)
Data organization depends
on the source
Hadoop implementation:
Considerable admin effort
Dremel
Interactive query
SQL-like query language
Nested data. Column
oriented data organization.
BigQuery implementation:
Admin transparent to the
user.
Google Dremel vs Map-Reduce.
● Emerged at different times and with different motivations: MR focused on
distributed processing for large data sets, while Gremel driver was ad hoc.
queries in near real time.
● Some authors argue that MR approach is not enough for the new big data
requirements companies are confronting today (real time, complex joins,
ACID...).
● While web companies has moved forward with their own products (e.g.
Google with Dremel or Pregel), most of "conventional" entreprises has
decided to buy instead build, selecting and investing in Hadoop based
products in most of cases (Cloudera, Hortonworks, EMC).
The market tendency is to keep building Hadoop based solutions and enrich
them with extra layers providing better real time support and scalability, while
web companies keep innovating with new products. In this scenario Dremel
and MR may be complementary instead of alternative approachs.
Dremel Implementations
● Google BigQuery
○ It is the externalization of the Dremel product, making it available to
the public as part of the Google Cloud.
○ Data upload: directly to BigQuery or through Google Cloud Storage
(better performance for big large data sets).
○ Data manipulation: web interface / Language API / REST API
● Apache Drill (formerly Open Dremel)
○ Open Source implementation of Google Dremel.
○ Apache License (Apache Incubator). http://incubator.apache.org/drill/
○ Still Beta (Alpha?) Phase
BigQuery - Case studies
● Top advertiser network in Brazil.
● Challenge: process millions of records generated daily
(impressions, click-through, views) to be sure system is targeting
ads effectively.
● Solution: BigQuery to gain real time insights into more than 3
billion ads in 350K blogs and webs.
● Travel agency with bus ticketing in India.
● Challenge: Analyze booking and inventory from systems of
hundreds of bus operators serving more than 10,000 routes.
Hadoop takes much time and requires specialized staff.
● Solution: BigQuery allows to quickly detect improvement
opportunities (e.g.: routes with available seats)
● its customer has 16 bungalow villages in Europe.
● Challenge: Forecast number of vacationers and determine
optimal pricing for accommodations. Limitations of current
systems. Several architectures has failed (MS, SAS, BO, Excel).
● Solution: Their own application based in BigQuery performs
analysis in seconds instead hours and without extra people.
Source: https://developers.google.com/bigquery/case_studies/
BigQuery - Demo - Conclusions
● No any installation or configuration of infrastructure is needed. Everything
is in the Google Cloud.
● Interactive system.
● The amount of processed data depends on the selected columns (column
oriented storage)
● ...and the billing depends on the amount of data we've processed!!
Appendix
Demo Screens
BigQuery - Demo - Project Creation
1. Select existing
project or create a
new one
2. Choose the API
we plan to use, e.g.:
BigQuery API
https://code.google.com/apis/console/
3. Go to BigQuery
section
First of all, we have to create
a new project into the
Google API Console and add
the BigQuery API to that
project.
BigQuery - Demo - Web Console
Data sets
https://bigquery.cloud.google.com/
Query results /
Data set records
Query executed over
the data sets
Query completion
measures
● BigQuery offers a web console to upload
any data set and execute any query.
● Also test data sets are available to start
using the system immediately.
BigQuery - Demo - Loading Data
● It will not allow to create data sets neither upload information until we activate the billing for the
project.
● Two ways for uploading data:
○ Coming directly from our machine
○ Or we can use Google Cloud Storage as intermediary (better performance for big data sets).
○ Data uploaded as CSV or JSON (flat or nested) files.
● Simple example: Upload list of stations from our machine
○ Create a new dataset named "hubway" and create a new table "stations"
○ Table schema: id:integer, name:string, des:string, install:boolean, locked:boolean, temporary:
boolean, lat:float, lng:float
List of steps to
upload data from
local disk
Data source for the example: http://hubwaydatachallenge.
org/submission/leaderboard/
END

Weitere ähnliche Inhalte

Was ist angesagt?

Mongodb basics and architecture
Mongodb basics and architectureMongodb basics and architecture
Mongodb basics and architectureBishal Khanal
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisArnab Mitra
 
Introduction to MongoDB.pptx
Introduction to MongoDB.pptxIntroduction to MongoDB.pptx
Introduction to MongoDB.pptxSurya937648
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Amazon Web Services
 
An Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBAn Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBLee Theobald
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...Edureka!
 
9. Document Oriented Databases
9. Document Oriented Databases9. Document Oriented Databases
9. Document Oriented DatabasesFabio Fumarola
 
[pgday.Seoul 2022] PostgreSQL with Google Cloud
[pgday.Seoul 2022] PostgreSQL with Google Cloud[pgday.Seoul 2022] PostgreSQL with Google Cloud
[pgday.Seoul 2022] PostgreSQL with Google CloudPgDay.Seoul
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkDatabricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQLkristinferrier
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Presentation Lucene / Solr / Datafari - Nantes JUG
Presentation Lucene / Solr / Datafari - Nantes JUGPresentation Lucene / Solr / Datafari - Nantes JUG
Presentation Lucene / Solr / Datafari - Nantes JUGfrancelabs
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Edureka!
 

Was ist angesagt? (20)

Mongodb basics and architecture
Mongodb basics and architectureMongodb basics and architecture
Mongodb basics and architecture
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Introduction to MongoDB.pptx
Introduction to MongoDB.pptxIntroduction to MongoDB.pptx
Introduction to MongoDB.pptx
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
 
An Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBAn Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDB
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
9. Document Oriented Databases
9. Document Oriented Databases9. Document Oriented Databases
9. Document Oriented Databases
 
[pgday.Seoul 2022] PostgreSQL with Google Cloud
[pgday.Seoul 2022] PostgreSQL with Google Cloud[pgday.Seoul 2022] PostgreSQL with Google Cloud
[pgday.Seoul 2022] PostgreSQL with Google Cloud
 
Key-Value NoSQL Database
Key-Value NoSQL DatabaseKey-Value NoSQL Database
Key-Value NoSQL Database
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQL
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Presentation Lucene / Solr / Datafari - Nantes JUG
Presentation Lucene / Solr / Datafari - Nantes JUGPresentation Lucene / Solr / Datafari - Nantes JUG
Presentation Lucene / Solr / Datafari - Nantes JUG
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 

Ähnlich wie Google Dremel. Concept and Implementations.

Google Cloud lightning talk @MHacks
Google Cloud lightning talk @MHacksGoogle Cloud lightning talk @MHacks
Google Cloud lightning talk @MHackswesley chun
 
Exploring BigData with Google BigQuery
Exploring BigData with Google BigQueryExploring BigData with Google BigQuery
Exploring BigData with Google BigQueryDharmesh Vaya
 
Machine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud PlatformMachine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud PlatformMatthias Feys
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Alluxio, Inc.
 
Google App Engine 7 9-14
Google App Engine 7 9-14Google App Engine 7 9-14
Google App Engine 7 9-14Tony Frame
 
Google Cloud @ Hackathons (2020)
Google Cloud @ Hackathons (2020)Google Cloud @ Hackathons (2020)
Google Cloud @ Hackathons (2020)wesley chun
 
A fresh look at Google’s Cloud by Mandy Waite
A fresh look at Google’s Cloud by Mandy Waite A fresh look at Google’s Cloud by Mandy Waite
A fresh look at Google’s Cloud by Mandy Waite Codemotion
 
Unlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWSUnlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWSAmazon Web Services
 
Hd insight overview
Hd insight overviewHd insight overview
Hd insight overviewvhrocca
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2Aswini Ashu
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2aswini pilli
 
Executive Intro to BigQuery
Executive Intro to BigQueryExecutive Intro to BigQuery
Executive Intro to BigQueryWilliam M. Cohee
 
Google Cloud Platform for Data Science teams
Google Cloud Platform for Data Science teamsGoogle Cloud Platform for Data Science teams
Google Cloud Platform for Data Science teamsBarton Rhodes
 
[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)Steve Min
 
Fundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvediFundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvediAnimesh Chaturvedi
 
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)Ido Green
 
Google not all clouds are created equal - sap sapphire 2014 (1)
Google not all clouds are created equal - sap sapphire 2014 (1)Google not all clouds are created equal - sap sapphire 2014 (1)
Google not all clouds are created equal - sap sapphire 2014 (1)David Torres
 
Gluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with HadoopGluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with Hadoopgluent.
 

Ähnlich wie Google Dremel. Concept and Implementations. (20)

Google Cloud lightning talk @MHacks
Google Cloud lightning talk @MHacksGoogle Cloud lightning talk @MHacks
Google Cloud lightning talk @MHacks
 
Exploring BigData with Google BigQuery
Exploring BigData with Google BigQueryExploring BigData with Google BigQuery
Exploring BigData with Google BigQuery
 
GCP Slide.pptx
GCP Slide.pptxGCP Slide.pptx
GCP Slide.pptx
 
Machine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud PlatformMachine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud Platform
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
 
Big Data & Hadoop
Big Data & HadoopBig Data & Hadoop
Big Data & Hadoop
 
Google App Engine 7 9-14
Google App Engine 7 9-14Google App Engine 7 9-14
Google App Engine 7 9-14
 
Google Cloud @ Hackathons (2020)
Google Cloud @ Hackathons (2020)Google Cloud @ Hackathons (2020)
Google Cloud @ Hackathons (2020)
 
A fresh look at Google’s Cloud by Mandy Waite
A fresh look at Google’s Cloud by Mandy Waite A fresh look at Google’s Cloud by Mandy Waite
A fresh look at Google’s Cloud by Mandy Waite
 
Unlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWSUnlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWS
 
Hd insight overview
Hd insight overviewHd insight overview
Hd insight overview
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2
 
Executive Intro to BigQuery
Executive Intro to BigQueryExecutive Intro to BigQuery
Executive Intro to BigQuery
 
Google Cloud Platform for Data Science teams
Google Cloud Platform for Data Science teamsGoogle Cloud Platform for Data Science teams
Google Cloud Platform for Data Science teams
 
[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)
 
Fundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvediFundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvedi
 
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)
 
Google not all clouds are created equal - sap sapphire 2014 (1)
Google not all clouds are created equal - sap sapphire 2014 (1)Google not all clouds are created equal - sap sapphire 2014 (1)
Google not all clouds are created equal - sap sapphire 2014 (1)
 
Gluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with HadoopGluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with Hadoop
 

Kürzlich hochgeladen

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 

Kürzlich hochgeladen (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 

Google Dremel. Concept and Implementations.

  • 1. Google Dremel Vicente Orjales - vorjales@gmail.com http://www.linkedin.com/in/vorjales
  • 2. Index ● Introduction. ● Google and Big Data. History. ● Columnar representation of Nested Data. ● Query Execution. ● Google Dremel vs. Map-Reduce. ● Implementations: Google BigQuery and Apache Drill ● Case Studies. ● Demo - BigQuery. ● Appendix (References and Screenshots): SLIDES 16-26
  • 3. Introduction.What is Dremel? Google Dremel is a system developed by Google for interactively querying large data sets. ● Near real time interactive analysis (instead batch processing). SQL- like query language. ● Nested data (most of unstructured data supported) with a column storage representation. ● Tree architecture (as in web search): multi-level execution trees for query processing. It is widely used inside Google and available for the public as Google BigQuery. There is also an open source implementation named Apache Drill (formerly Open Dremel) still in beta phase.
  • 4. Google and Big Data. History. ● 2003: Google File System. Scalable distributed file system for data intensive applications built over cheap commodity hardware. ● 2004: Map-Reduce. Programming model for large data sets. Automatically parallelizes jobs in large cluster of commodity machines. Inspiration for Hadoop. ● 2006: Bigtable. Distributed storage system for structured data. Inspiration for NoSQL databases ● 2010: Percolator. Adds transactions and locks at table and row level. ● 2010: Pregel. Scalable Graph Computing. Mining data from social graphs. ● 2010: Dremel. Ner real time solution with SQL-like language. Model for Google BigQuery.
  • 5. Columnar Representation of Nested Data All values of a nested field such as A.B.C are stored contiguously. Therefore, A.B.C can be retrieved without reading A.E, A.B.D, etc. Columnar storage proved has been successfully used for flat relational data, but making it work for the nested data within Dremel was one the challenges in the Dremel design: how to preserve all structural information and be able to reconstruct records from an arbitrary subset of fields.
  • 7. Columnar Representation of Nested Data record (r=0) has repeated Language (r=0) has repeated
  • 8. Query execution Multi-level serving tree for executing queries: ● Root Server: receive incoming queries, read metadata from tables, route query to next level. ● Leaf Server: Communication and access to the data layer for retrieving the results. ● Each server has an internal execution tree that corresponds with a physical query execution plan.
  • 9. Query execution - Example: count()
  • 10. Google Dremel vs Map-Reduce. M-R Batch Processing Not query language. May be added with extra modules (e.g.: Hive in Hadoop) Data organization depends on the source Hadoop implementation: Considerable admin effort Dremel Interactive query SQL-like query language Nested data. Column oriented data organization. BigQuery implementation: Admin transparent to the user.
  • 11. Google Dremel vs Map-Reduce. ● Emerged at different times and with different motivations: MR focused on distributed processing for large data sets, while Gremel driver was ad hoc. queries in near real time. ● Some authors argue that MR approach is not enough for the new big data requirements companies are confronting today (real time, complex joins, ACID...). ● While web companies has moved forward with their own products (e.g. Google with Dremel or Pregel), most of "conventional" entreprises has decided to buy instead build, selecting and investing in Hadoop based products in most of cases (Cloudera, Hortonworks, EMC). The market tendency is to keep building Hadoop based solutions and enrich them with extra layers providing better real time support and scalability, while web companies keep innovating with new products. In this scenario Dremel and MR may be complementary instead of alternative approachs.
  • 12. Dremel Implementations ● Google BigQuery ○ It is the externalization of the Dremel product, making it available to the public as part of the Google Cloud. ○ Data upload: directly to BigQuery or through Google Cloud Storage (better performance for big large data sets). ○ Data manipulation: web interface / Language API / REST API ● Apache Drill (formerly Open Dremel) ○ Open Source implementation of Google Dremel. ○ Apache License (Apache Incubator). http://incubator.apache.org/drill/ ○ Still Beta (Alpha?) Phase
  • 13. BigQuery - Case studies ● Top advertiser network in Brazil. ● Challenge: process millions of records generated daily (impressions, click-through, views) to be sure system is targeting ads effectively. ● Solution: BigQuery to gain real time insights into more than 3 billion ads in 350K blogs and webs. ● Travel agency with bus ticketing in India. ● Challenge: Analyze booking and inventory from systems of hundreds of bus operators serving more than 10,000 routes. Hadoop takes much time and requires specialized staff. ● Solution: BigQuery allows to quickly detect improvement opportunities (e.g.: routes with available seats) ● its customer has 16 bungalow villages in Europe. ● Challenge: Forecast number of vacationers and determine optimal pricing for accommodations. Limitations of current systems. Several architectures has failed (MS, SAS, BO, Excel). ● Solution: Their own application based in BigQuery performs analysis in seconds instead hours and without extra people. Source: https://developers.google.com/bigquery/case_studies/
  • 14. BigQuery - Demo - Conclusions ● No any installation or configuration of infrastructure is needed. Everything is in the Google Cloud. ● Interactive system. ● The amount of processed data depends on the selected columns (column oriented storage) ● ...and the billing depends on the amount of data we've processed!!
  • 16. BigQuery - Demo - Project Creation 1. Select existing project or create a new one 2. Choose the API we plan to use, e.g.: BigQuery API https://code.google.com/apis/console/ 3. Go to BigQuery section First of all, we have to create a new project into the Google API Console and add the BigQuery API to that project.
  • 17. BigQuery - Demo - Web Console Data sets https://bigquery.cloud.google.com/ Query results / Data set records Query executed over the data sets Query completion measures ● BigQuery offers a web console to upload any data set and execute any query. ● Also test data sets are available to start using the system immediately.
  • 18. BigQuery - Demo - Loading Data ● It will not allow to create data sets neither upload information until we activate the billing for the project. ● Two ways for uploading data: ○ Coming directly from our machine ○ Or we can use Google Cloud Storage as intermediary (better performance for big data sets). ○ Data uploaded as CSV or JSON (flat or nested) files. ● Simple example: Upload list of stations from our machine ○ Create a new dataset named "hubway" and create a new table "stations" ○ Table schema: id:integer, name:string, des:string, install:boolean, locked:boolean, temporary: boolean, lat:float, lng:float List of steps to upload data from local disk Data source for the example: http://hubwaydatachallenge. org/submission/leaderboard/
  • 19. END