How to select a modern data warehouse and get the most out of it?

In the first part of this talk, we will give a setup and definition of modern cloud data warehouses as well as outline problems with legacy and on-premise data warehouses.

We will cover how to select, technically justify, and practically use a modern data warehouse, including criteria for picking a cloud data warehouse, where to start, and how to use it optimally and cost-effectively.

In the second part of this talk, we discuss the challenges and where people are not getting a return on their investment. In this business-focused track, we cover how to get business engagement, how to identify the business cases/use cases, and how to leverage data as a service and consumption models.

Published in: Data & Analytics

  1. How to select a modern cloud data warehouse and get the most out of it
     NYC Advanced Analytics Meetup, 18th June 2019
     Slim Baltagi, Director, Big Data & ML
     Jim Leavitt, Vice President, co-CEO & Founder of Cervello
     mycervello.com
  2. Agenda
     PART I (Slim Baltagi, 35 minutes)
       1. Key Terms
       2. Traditional Data Warehouses vs. Modern Data Warehouses
       3. How to select a cloud data warehouse?
     PART II (Jim Leavitt, 20 minutes)
       1. Data era
       2. Business use cases
       3. How to get/make the most of a modern data warehouse?
     PART III (Discussion with attendees, 30 minutes)
     © 2019 Cervello, an A.T. Kearney company
  3. PART I: MODERN DATA WAREHOUSE
  4. 1. Key Terms: Cloud
     • What exactly is the cloud? According to Gartner, ‘A style of computing in which scalable and elastic IT-enabled capabilities are delivered as a service using Internet technologies’
     • What are the models within the cloud and how do they apply to data warehousing?
     Model | Hardware: Datacenter | Software: Data warehouse | Management: Optimizing, tuning, maintenance | Analogy: Kitchen, Cook/Order, Serve, Eat, Clean
     On-premises | You | You | You | At home
     Infrastructure (IaaS), example: EC2 | Vendor | You | You | At a vacation home
     Platform (PaaS), examples: Amazon Redshift, EMR | Vendor | Vendor | You | At a fast food restaurant
     Software (SaaS), example: Snowflake | Vendor | Vendor | Vendor | At a full service restaurant
  5. 1. Key Terms: Data Warehouse
     • What is a data warehouse? ‘A subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decision-making process.’ Source: Building the Data Warehouse, a book by Bill Inmon, who is considered to be the father of data warehousing
     • Since the beginning, data warehousing has been about the business making better and quicker decisions
     • Over the last 30 years, data warehouses evolved from traditional DBMS to specialized analytics DBMS, to out-of-the-box data warehouse appliances, to Hadoop/Spark-based data warehouses, to ‘cloudified’ data warehouses, and finally to built-for-the-cloud data warehouses.
     • A ‘cloudified’ data warehouse is a cloud-based representation of a traditional data warehouse (such as Amazon Redshift and Microsoft Azure SQL Data Warehouse), while a data warehouse ‘built for the cloud’ (such as Google BigQuery and Snowflake) is designed from the ground up to run natively on the cloud and fully deliver on the elasticity of the cloud.
     • A key factor driving the evolution of data warehousing is the cloud. Why?
     • In this modern era of cloud, social networks and mobile, the ‘original’ definition of a data warehouse might need an update! How about additional attributes such as elastic, scalable (all data, all users), agile, shareable, and conducive to exploratory analytics? What do you think?
  6. 1. Key Terms: Modern
     Aspect | Traditional Data Warehouse | Modern Data Warehouse
     Offering mode | Product | Service
     Deployment mode | On-premise, cloudified | Built for the cloud
     Data age | Historical | Historical, live
     Data sources | Traditional such as ERP, CRM, … | Additional such as social media, sensor data, …
     Storage & compute | Tightly coupled | Decoupled
     Data processing | Batch | Batch, micro-batch, streaming
     Data format | Structured data | Structured, semi-structured & unstructured data
     Data coverage | Only your most critical data | Possibly all of your available data
     Data transformation | ETL | ETL, ELT
     Usage | Primarily BI | BI, advanced analytics, service
     Cost model | Pay upfront | Pay-as-you-go
     Usability | More complex | Easier
     Management | Burdens on the DBA | Fully managed
     Time to market | Usually longer | Usually shorter
     Provisioning | Usual delays for infrastructure | Almost instant
     Time to insight | Usually longer | Usually shorter
     Agility | Schema-on-write | Schema-on-read, schema-on-write
     Flexibility | Static resources | On-demand resources
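The "Agility" row contrasts schema-on-write with schema-on-read. The following is a minimal Python sketch of that difference, not how any particular warehouse implements it. The sample events, the `REQUIRED_COLUMNS` tuple and all function names are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical raw events with a variable schema (semi-structured JSON).
RAW_EVENTS = [
    '{"user": "a", "amount": 10.5}',
    '{"user": "b", "amount": 3, "device": {"os": "ios"}}',
    '{"user": "c"}',
]

# Schema-on-write: a fixed set of columns is enforced before loading.
REQUIRED_COLUMNS = ("user", "amount")


def load_schema_on_write(lines):
    """Load only records whose keys match the predefined columns exactly."""
    loaded, rejected = [], []
    for line in lines:
        record = json.loads(line)
        if set(record) == set(REQUIRED_COLUMNS):
            loaded.append(record)
        else:
            rejected.append(record)
    return loaded, rejected


def load_schema_on_read(lines):
    """Store every record as-is; structure is applied only at query time."""
    return [json.loads(line) for line in lines]


def total_amount(records):
    """Query-time projection: a missing 'amount' field counts as 0."""
    return sum(record.get("amount", 0) for record in records)


loaded, rejected = load_schema_on_write(RAW_EVENTS)
raw = load_schema_on_read(RAW_EVENTS)
print(len(loaded), len(rejected), total_amount(raw))
```

With schema-on-write, records that do not fit the predefined columns have to be transformed or rejected before loading. With schema-on-read, everything is loaded and the query decides how to interpret missing or extra fields.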
  7. 2. Traditional Data Warehouses: Key Challenges
     1. Scalability: growth in data volume, number of users and applications
     2. Concurrency: as the number of users increases, they cannot all operate simultaneously
     3. Performance: slow-running queries, …
     4. Resilience: data backup/retention and node failure protection
     5. Complexity: from initial implementation to ongoing maintenance
     6. Lack of native support for semi-structured data: JSON and XML formats are popular for data exchange. In traditional data warehouses, data needs to be transformed first and a schema needs to be defined before loading
     7. High maintenance overhead in the form of constant indexing, tuning, sorting
     8. Handling workload fluctuation: sizing servers for workload peaks and valleys
     9. Waste of resources: expensive computing resources are wasted during off-peak usage
     10. Lack of support for processing streaming data
     11. Upfront costs: hardware, software licenses, staffing, …
     12. Project delays due to provisioning infrastructure
     Surely you are familiar with real-world scenarios such as End of Month Reporting, Intensive Load Process, Demanding Executive Dashboard Users, …
  8. 2. Modern Data Warehouses: How do they overcome the above challenges?
     Major modern data warehouses are cloud based. Most of them do overcome the key challenges of traditional data warehouses. Unfortunately, they overcome these challenges to different degrees, and it is not always a simple yes-or-no check!
     1. Scalability: Scale up, down, or off quickly without delay. Yes for Snowflake, Low for Redshift, Yes for Azure SQL Data Warehouse, Yes for BigQuery (but with limits)
     2. Concurrency: Lots of users can operate simultaneously
     3. Performance: processing bottlenecks and delays, slow-running queries, …
     4. Resilience: data backup/retention and node failure protection
     5. Complexity: Cloud data warehouses are easier to use than on-premise data warehouses
     6. Lack of native support for semi-structured data: Cloud data warehouses do support semi-structured data such as JSON, but this support varies from one data warehouse to another.
        o Concurrent throughput for JSON: Yes for Snowflake, No for Redshift, No for Azure SQL Data Warehouse, Moderate for BigQuery.
        o High performance for JSON scans: Yes for Snowflake, No for Redshift unless you use the additional Amazon Redshift Spectrum tool, No for Azure SQL Data Warehouse, Moderate for BigQuery.
  9. 2. Modern Data Warehouses
     7. High maintenance overhead is not a major issue with managed infrastructure for cloud data warehouses, but the level of maintenance varies from one data warehouse to another. It ranges from creating indexes and distribution keys to near-zero management
     8. Handling workload fluctuation: With elasticity, cloud data warehouses adapt to workload peaks and valleys
     9. Waste of resources: No more wasted resources during off-peak usage! Auto suspend, auto resume?
     10. Lack of support for streaming data: For example, loading and processing is possible with Snowpipe, a serverless compute service from the Snowflake data warehouse
     11. Upfront costs: No upfront costs, as the model of cloud data warehouses is pay as you go
     12. Project delays due to provisioning infrastructure: Not applicable for cloud data warehouses. With instant provisioning, projects don’t need to wait for infrastructure
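Points 8, 9 and 11 are about paying only for the hours a warehouse is actually busy. The arithmetic can be sketched as below. The hourly rate and the usage pattern are made-up assumptions, not vendor pricing, and the two function names are hypothetical.

```python
# Hypothetical figures for illustration only; not real vendor pricing.
HOURLY_RATE = 4.0          # assumed cost of one compute hour
HOURS_IN_MONTH = 30 * 24   # 720 hours in a 30-day month
BUSY_HOURS = 22 * 8        # assumed usage: 8 busy hours on 22 working days


def always_on_cost(hourly_rate, hours_in_period):
    """Cost of a server sized for peak load that runs around the clock."""
    return hourly_rate * hours_in_period


def auto_suspend_cost(hourly_rate, busy_hours):
    """Cost when compute is suspended whenever there is no workload."""
    return hourly_rate * busy_hours


fixed = always_on_cost(HOURLY_RATE, HOURS_IN_MONTH)
elastic = auto_suspend_cost(HOURLY_RATE, BUSY_HOURS)
print(f"always on: {fixed:.2f}, auto suspend: {elastic:.2f}, saved: {fixed - elastic:.2f}")
```

Under these assumptions the always-on server is billed for 720 hours and the auto-suspended one for 176. A real comparison would also have to account for storage, minimum billing increments, and resume latency.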
  10. 3. How to select a Cloud Data Warehouse?
      • You’ll come across some marketing fluff when you are researching and selecting cloud data warehouses. Example: One cloud data warehouse claims to be 100X faster than the other. Another ‘modern’ one claims 100,000% cost savings compared to a ‘legacy’ one!
      • Often cloud data warehouses are compared against only a handful of criteria, and sometimes even a single criterion: performance, using a workload derived from a so-called industry-standard benchmark.
      • A fit-for-purpose comparison across key categories and evaluation criteria, with a focus on your own use cases, would be more practical. Example of a category: Data & Service Availability. Related criteria: Failure recovery, Disaster recovery, Data protection, Service monitoring & alerting.
      • At Cervello, we came up with a 4-Step Tool Evaluation Framework that integrates input from analysts, vendors, our own perspective, and the client’s current requirements and future state.
      • Be aware that the most popular cloud data warehouse might not be the best fit for your organization!
      • Selecting a cloud data warehouse for your organization is all about figuring out what is the best fit for your business, your budget and your employees.
  11. Cervello Tool Evaluation Framework
      • CLIENT (70%): Prioritized and weighted selection criteria based on the desired future state. The client prioritizes and weights key criteria and scores the vendors based on those criteria
      • CERVELLO (15%): Implementation and in-depth expertise. Cervello provides our perspective based on experience and intelligence of the vendors in the marketplace
      • ECOSYSTEM (10%): Vendor ecosystem perspective. Further refine the list based on vendor responses to key criteria as well as information from the vendor ecosystem
      • MARKET (5%): Analyst reports and research. Leverage Gartner, Forrester, and other market analysts to establish a baseline list of vendors to evaluate
      The percentages are the weighting factor for driving consensus on a decision. The diagram is labeled from START to FINISH.
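The 70/15/10/5 weighting can be read as a weighted average of per-perspective scores. The sketch below assumes that reading. Only the four weights come from the slide. The vendor names, the 0 to 10 scores and the `weighted_score` helper are hypothetical.

```python
# Weights from the slide; everything else below is hypothetical.
WEIGHTS = {"client": 0.70, "cervello": 0.15, "ecosystem": 0.10, "market": 0.05}


def weighted_score(scores):
    """Combine one 0-10 score per perspective into a single weighted score."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly: " + ", ".join(WEIGHTS))
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up scores: vendor_b is favored by analysts, vendor_a by the client.
vendors = {
    "vendor_a": {"client": 8, "cervello": 6, "ecosystem": 7, "market": 9},
    "vendor_b": {"client": 6, "cervello": 9, "ecosystem": 9, "market": 10},
}

ranking = sorted(vendors, key=lambda name: weighted_score(vendors[name]), reverse=True)
for name in ranking:
    print(name, round(weighted_score(vendors[name]), 2))
```

In this made-up case vendor_b scores higher with analysts, the ecosystem and Cervello, yet vendor_a ranks first because the client's own criteria carry 70% of the weight. This matches the point that the most popular product is not necessarily the best fit.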
  12. Some Evaluation Criteria of a Cloud Data Warehouse
      Service:
      • Mono-cloud or multi-cloud
      • PaaS or SaaS?
      • ‘Cloudified’ or ‘built for the cloud’
      • Integration with cloud provider services & tools
      • Cost model & transparency
      • Usability
      • Management overhead
      • High availability
      • Service maintenance
      • Configuration
      • Service limits
      • Initialization time
      • Service updates without disruption
      • Service monitoring & alerting
      Architecture:
      • Separation of compute and storage
      • Elasticity
      • Concurrency
      • Performance
      • Scalability
      • Support for parallel data upload
      • Support for streaming data
      • Security & regulations
      • Built-in optimization
      • Pause/resume mechanism
      • Data ingestion
      • Support of mixed workloads
      • Integration with tools such as dataflow ones, BI, ML, …
      • Continuous data protection
      Database:
      • Support of ANSI SQL
      • Support of semi-structured data
      • Support of multiple file formats
      • Support of diverse data types
      • Handling of variable schema
      • Support of window functions
      • Support of stored procedures
      • Query optimization
      • Metadata & statistics maintenance
      • Tuning WHERE clauses
      • Tuning joins
      • Support of advanced analytics functions
      • Support for materialized views
      • Mechanisms for extensibility
  13. Example of a Modern Data Warehouse: Snowflake Data Warehouse
      • Snowflake is based on three key innovations:
        1. A unique architecture designed for the cloud and able to provide complete elasticity for all your concurrent users and applications
        2. A database engine that natively handles all your data, both semi-structured and structured, without sacrificing performance or flexibility
        3. Technology that eliminates the need for manual data warehouse management and tuning. No indexing, tuning, partitioning or vacuuming after loading data => effortless management
      • The Snowflake architecture consists of three layers. Each one is physically decoupled from the others and scales independently:
        1. The data storage layer uses cloud storage to store all data loaded into Snowflake in a scalable and inexpensive way
        2. The compute layer comprises virtual warehouses: compute resources that execute the data processing tasks required for queries. The virtual warehouses have access to all of the data in the storage layer
        3. The cloud services layer coordinates the entire system, managing security, optimization and metadata
  14. Snowflake’s multi-cluster, shared data architecture
  15. Snowflake reference architecture
      • Data Sources/Data Providers: On-premise or Cloud, Structured or Semi-Structured Data
      • Dataflow Tools: Streaming Data Platforms, ETL and ELT platforms. E: Extract, L: Load, T: Transform
      • Data Storage: S3/Azure Storage (missing in the architecture diagram!)
      • Analytics Services: BI, UI, Data Science, …
  16. Trying Snowflake data warehouse
      • A couple of unique features:
        • Multi-cloud: Snowflake is the only cloud data warehouse that is offered on Amazon AWS, Microsoft Azure and soon on Google Cloud Platform.
        • Zero copy cloning: A technique used to quickly replicate a database, without any physical data copy, to build a fully populated test environment. This can ease the burden on DevOps, as terabytes of data can be cloned within seconds, with subsequent inserts and updates allowed on the new data set.
        • Time Travel: Time Travel allows you to track the change of data over time. This feature is available to all accounts and enabled by default. It allows you to access the historical data of a table. One can find out how the table looked at any point in time within the retention period, which can be up to 90 days depending on the Snowflake edition.
        • Fail-Safe: Fail-Safe ensures historical data is protected in the event of disasters such as disk failures or any other hardware failures. Snowflake provides 7 days of Fail-Safe protection of data, which can be recovered only by Snowflake in the event of a disaster.
        • Data Sharing: Provides access to both compute and data resources to external partners or subsidiaries on a read-only basis. This avoids the need to build multiple Extract, Transform and Load (ETL) pipelines to external users, and avoids the need for Change Data Capture (CDC) routines when the warehouse data is updated, as users always view the latest data.
      • Try it yourself: most cloud data warehouses offer a free trial. Example:
        • Snowflake free trial! https://bit.ly/2WMVKp3
        • You can build a short-term POC in either AWS or Azure while using a US $400 credit for 30 days for both compute and storage. Snowflake on Google Cloud Platform (GCP) will be offered soon.
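To make zero copy cloning and Time Travel more concrete, the sketch below builds the corresponding SQL statements as Python strings. It relies on Snowflake's `CREATE DATABASE ... CLONE ...` and `AT(OFFSET => ...)` syntax. The database and table names, the helper functions and the simplified identifier check are hypothetical, and nothing is executed against a live account.

```python
import re

# Simplified check for unquoted identifiers, to avoid building unsafe SQL.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")


def _checked(name):
    """Return the name unchanged if it looks like a plain identifier."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"not a plain identifier: {name!r}")
    return name


def clone_database_sql(source, target):
    """Zero copy clone: create `target` as a clone of `source`."""
    return f"CREATE DATABASE {_checked(target)} CLONE {_checked(source)};"


def time_travel_sql(table, seconds_ago):
    """Time Travel: read the table as it was `seconds_ago` seconds in the past."""
    if seconds_ago <= 0:
        raise ValueError("seconds_ago must be positive")
    return f"SELECT * FROM {_checked(table)} AT(OFFSET => -{int(seconds_ago)});"


# Hypothetical object names.
print(clone_database_sql("prod_db", "test_db"))
print(time_travel_sql("orders", 3600))
```

The generated statements could be pasted into a worksheet during a trial. How far back the second query can reach depends on the retention period configured for the table.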
  17. Part II: Getting Value
  18. 1. Data Era: Challenges + Opportunities
  19. Great Analogy
  20. 2. Next Generation of Business Use Cases
      New Use Cases:
      • Data as a Service
      • Self-Service Analytics
      • Advanced Analytics Lab
      • Real-Time Insight
      • Single View of Customer
      Data Consumers:
      • Partners, Suppliers, and Vendors
      • Executives, Managers, and Analysts
      • Data Scientists
      • Embedded Applications
      • Many Business Users
  21. Medical Device Client: Data + Analytics Overview
  22. Medical Device Client: Data + Analytics Benefits
      Cost:
      • $500K IT expense reduction from cloud and Snowflake migration and consolidated BI toolset
      • Continuous process simplification
      • $1M annual savings from freight consolidation analytics
      • 500 annual hours shifted to higher-value activities because of end-to-end automated reporting (e.g., fill rates, open orders)
      Revenue:
      • $800K in annual revenue recovery through contract enforcement analytics
      • Improved patient + physician experience (in software product) with embedded reporting
      Quality:
      • Data validation checks built into daily processing
      Future:
      • Smart medical devices (IoT)
  23. 3. How To Drive Value
      Technical + business approach:
      • Business engagement
      • Best practices/advisory
      • System integration
      Data asset:
      • Think big (but organized)
      • Internal + external
      • Structured/unstructured, governed/loosely governed
      Modern technology stack:
      • Flexible and modular
      • Scalable on demand
      • Machine Learning + Artificial Intelligence
      Process + organization:
      • Supply (Data Engineers) + consumption (Scientists/Analysts)
      • Automated insights = decision-making opportunity
      • Organizational opportunity
  24. Cervello Profile
      We are a professional services firm focused on helping organizations win with data. We have a global presence with an ability to service customers across industries. We are organized around three practices that are underpinned by our expertise in modern data architectures.
      Practices:
      • PERFORMANCE MANAGEMENT
      • SUPPLIER + CUSTOMER RELATIONSHIPS
      • DATA MONETIZATION + PRODUCTS
      WHAT WE DELIVER:
      • DATA MANAGEMENT & MACHINE LEARNING: We embed data management processes and use machine learning to improve data quality
      • MODERN DATA ARCHITECTURE & PLATFORMS: Using leading cloud technology platforms we develop and design analytics-ready connected data solutions
      • BUSINESS DRIVEN INSIGHTS & ACTIONS (Analytics & Data Science, Modeling & Planning, Mobile Experience): We provide different methods to consume data to facilitate insights and actions inside and outside the organization
      HOW WE DELIVER:
      • STRATEGY
      • BUILD SERVICES
      • RUN & COE SERVICES
      Our teams are located in Boston, New York, Dallas, London and Bangalore
  25. Thank You!
      Learn more about Cervello at mycervello.com
      Boston | New York | Dallas | London
      Get in touch with contributors:
      • Slim Baltagi: sbaltagi@gmail.com, https://www.linkedin.com/in/slimbaltagi/, @SlimBaltagi
      • Jim Leavitt: jleavitt@mycervello.com, https://www.linkedin.com/in/jimleavitt/