Using realtime SQL2003 to query
JSON on Hadoop with Apache Drill
               January 28, 2013
                    Jacques Nadeau
     Apache Drill Contributor @ MapR Technologies
Me
• Apache Drill and HBase Contributor
• Sponsored by MapR Technologies to lead Apache Drill
  contributions


   – Enterprise-grade high performance distribution for
     Hadoop
   – Open source plus standards-based extensions
   – Large number of Fortune 100 customers, startups too.
   – Free distribution for unlimited nodes
   – Partnered to provide on Google Compute Engine and
     Amazon Elastic MapReduce
Jane works as an Analyst at an ecommerce website

How does she figure out good targeting segments for the next
marketing campaign?

She has some ideas and lots of data:
• Transaction information
• User profiles
• Access logs
Let’s try using existing options
•   Use Oracle
     – Write flattening MongoDB query for export and generate giant CSV. Work with MapReduce
       team to build a MapReduce job that provides export. Contact DBA to import data exports. Use
       Oracle SQL to determine answers.
•   Use Hive
     – Pull up Hive. Start writing queries. Realize that Hive/Mongo interconnector doesn’t support
       nested data. Realize that Hive doesn’t have JDBC/ODBC storage handler. Query data from
       Oracle and copy to Hadoop. Query flattened Mongo data and copy into Hadoop. Write HiveQL
       query. Wait 30 minutes for result. Repeat until desired outcome. Avoid frustration along the
       way with the flattened Mongo data, portion of Oracle extraction, and the lack of major
       portions of SQL syntax.
•   Use Data Virtualization Solution
     – Write SQL query against virtualization interface. Realize that you still need to ETL Mongo data
       since it isn’t natively supported. Query runs slowly since virtualization solution doesn’t run
       locally against Hadoop data and fails to effectively distribute your query.
•   Use MapReduce
     – Work with Engineering to define a specification for needs. Use Sqoop to set up regular ETL
       from Oracle. Define a custom MapReduce job to import Mongo data.
     – Look at output and realize different analyses should be done, repeat cycle (or learn Java)
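
To make the "flattening" step in these options concrete, here is a minimal Python sketch of what flattening a nested document into CSV rows involves. The `user` record, its field names, and the `flatten` helper are all hypothetical and only illustrate the general idea; they do not come from the talk or from any particular export tool.

```python
import csv
import io

# Hypothetical user profile as it might be stored in a document database.
user = {
    "user_id": 42,
    "name": "Jane",
    "address": {"city": "Berlin", "country": "DE"},
    "orders": [
        {"sku": "A-1", "amount": 19.99},
        {"sku": "B-7", "amount": 5.00},
    ],
}


def flatten(doc, list_field):
    """Yield one flat dict per element of doc[list_field].

    Nested dicts become dotted column names, and the parent's fields
    are copied onto every emitted row.
    """
    def walk(prefix, value, out):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(f"{prefix}{key}.", child, out)
        else:
            # Drop the trailing dot that was added for the next level.
            out[prefix[:-1]] = value

    parent = {}
    walk("", {k: v for k, v in doc.items() if k != list_field}, parent)
    for item in doc[list_field]:
        row = dict(parent)
        walk(f"{list_field}.", item, row)
        yield row


rows = list(flatten(user, "orders"))

# Write the flat rows as CSV, the format a relational import would expect.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

One nested record becomes two rows with the parent fields repeated on each, which is why the export grows large and why the original structure is hard to recover afterwards.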
Why are things so hard?
• Slow
   – Virtualization solutions don’t support data locality and pushdown
   – MapReduce sacrifices performance to support long running jobs, recoverability, and
     ultimate flexibility
• Old
   – Most systems assume flat data with well-defined static schemas
• Hard
   – Write queries in multiple languages (Does anybody know MongoQL, CQL, HiveQL and
     SQL?)
   – Analysts often need custom development help
• Error Prone
   – ETL leads to data synchronization issues
   – Lack of query transparency leads to incorrect assumptions and bad business conclusions
• Expensive
   – Commercial solutions are very expensive
   – Typically provide poor compatibility with newer NoSQL technologies
Open Source Mantra: WWGD?
  Distributed    Datastore    Interactive    Batch
  File System                 analysis       processing

  GFS            BigTable     Dremel         MapReduce

  HDFS           HBase                       Hadoop MapReduce

Build Apache Drill to provide a true open source
   solution to interactive analysis of Big Data
Apache Drill Overview
• Drill overview
   –   Low latency interactive queries
   –   Standard ANSI SQL2003 support
   –   Domain Specific Languages / Your own QL
   –   Inspired by, compatible with Google BigQuery/Dremel
   –   Supports Nested/Hierarchical Data Formats
   –   Supports RDBMS, Hadoop and NoSQL alike

• Open-Source and Flexible
   – Apache Incubator
   – Hundreds of people involved across the US and Europe
   – Community consensus on API, functionality
Why do we need another tool?

Query classes, by latency:
• Point queries: 0-100ms
• Interactive Queries: 100ms – 3 minutes
• Data Analyst & Reporting Queries: 3 minutes – 20 minutes
• Data Mining and Major ETL: 20 minutes – 20 hours

Tools, from lowest latency to highest:
• Per system interfaces
• Apache Drill
• MapReduce, Hive and PIG
Why not improve Hive or Pig?
•   Different Goals
•   SQL should be first class concern
•   MapReduce severely hampers processing model and performance
     –   Startup cost is high
     –   Map:Reduce recoverability and barrier disadvantages
     –   Job:Job recoverability and barrier disadvantages (chained jobs)
•   Need to build from in-memory representation
     –   Two canonical in-memory formats (row-based and columnar)
     –   Support much larger memory sizes
     –   Smaller memory footprint per record
     –   Avoid serialization/deserialization and object creation costs between nodes and operations
•   Performance of interactive queries is critical
     –   Evaluation and Operator code generation & compilation
•   First class recognition of nested types without metadata requirement
     –   Schema Discovery and standard schema representation
•   Clear delineation between important stages
     –   Support for multiple optimizers and researcher experimentation
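
As a rough illustration of "schema discovery" for nested data without up-front metadata, the following Python sketch infers a schema by walking JSON-like records and merging what it sees. The function names, the sample `records`, and the `"a|b"` notation for conflicting types are assumptions made for this example; this is not Drill's actual schema representation.

```python
def discover_schema(value):
    """Return a nested description of the types found in a JSON-like value."""
    if isinstance(value, dict):
        return {key: discover_schema(child) for key, child in value.items()}
    if isinstance(value, list):
        merged = None
        for item in value:
            merged = merge(merged, discover_schema(item))
        return [merged]
    return type(value).__name__


def merge(left, right):
    """Combine two discovered schemas into one that covers both."""
    if left is None:
        return right
    if right is None:
        return left
    if isinstance(left, dict) and isinstance(right, dict):
        # Keep every field seen in either record.
        return {k: merge(left.get(k), right.get(k)) for k in {**left, **right}}
    if isinstance(left, list) and isinstance(right, list):
        return [merge(left[0], right[0])]
    if isinstance(left, str) and isinstance(right, str):
        # Conflicting scalar types are recorded as a sorted union.
        return "|".join(sorted(set(left.split("|")) | set(right.split("|"))))
    # A field that is a scalar in one record and nested in another.
    return "mixed"


# Hypothetical records with differing shapes, as schema-free data often has.
records = [
    {"user": {"id": 1, "tags": ["a", "b"]}, "amount": 10},
    {"user": {"id": 2, "email": "x@example.com"}, "amount": 12.5},
]

schema = None
for record in records:
    schema = merge(schema, discover_schema(record))
```

The schema is derived from the data itself, so a field that appears only in some records (`email`, `tags`) or changes type (`amount`) is still described without anyone declaring it in advance.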
How does it work?
• Drillbits run on each node to minimize
  network transfer
• Queries can be fed to any Drillbit.
• Coordination, query planning,
  optimization, scheduling, and
  execution are distributed

SELECT * FROM
  oracle.transactions,
  mongo.users,
  hdfs.events
LIMIT 1
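
The example query on this slide lists three sources in one FROM clause with no join condition, which in standard SQL means a cross join. The Python sketch below shows only that logical meaning, using small hypothetical in-memory tables in place of the Oracle, Mongo, and HDFS sources. It says nothing about how Drill plans or distributes the query.

```python
from itertools import islice, product

# Hypothetical stand-ins for the three sources named in the query.
tables = {
    "transactions": [{"txn_id": 1, "user_id": 42}, {"txn_id": 2, "user_id": 7}],
    "users": [{"user_id": 42, "name": "Jane"}],
    "events": [{"event": "login"}, {"event": "view"}],
}


def cross_join(sources):
    """Yield one combined row per combination of one row from each source.

    Column names are prefixed with the source name so that columns with
    the same name in different sources do not overwrite each other.
    """
    names = list(sources)
    for combo in product(*(sources[name] for name in names)):
        yield {
            f"{name}.{column}": value
            for name, part in zip(names, combo)
            for column, value in part.items()
        }


# LIMIT 1: take only the first combined row instead of all combinations.
first_row = list(islice(cross_join(tables), 1))
all_rows = list(cross_join(tables))
```

With these sample tables the full cross join has 2 x 1 x 2 = 4 rows, and `first_row` holds just the first of them, with columns from all three sources side by side.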
Flexibility with Strongly Defined Tiers and APIs
Apache Drill currently in development
• Heavy active development by multiple
  supporting organizations
• Available
  – Logical plan syntax and interpreter
  – Reference Interpreter
• In progress
  – SQL interpreter
  – Storage Engine implementations for Accumulo,
    Cassandra, HBase, and HDFS file formats
Conclusion & Questions
• Put Apache Drill on your roadmap, we’ll make your life
  easier

• Join the community
   – Code: http://github.com/apache/incubator-drill
   – Mailing List: drill-user@incubator.apache.org
   – Wiki: https://cwiki.apache.org/confluence/display/DRILL

• Access this presentation: http://bit.ly/Wo6DLd

• Contact Me:
   – jacques.drill@gmail.com

GBDC 2013-01-28
