
ADV Slides: Comparing the Enterprise Analytic Solutions

Data is the foundation of any meaningful corporate initiative. Fully master the necessary data, and you're more than halfway to success. That's why leverageable (i.e., multiple-use) artifacts of the enterprise data environment are so critical to enterprise success.

Build them once, keep them updated, and reuse them many times for diverse ends. The data warehouse remains strongly focused on this goal. That may be why, nearly 40 years after the first database was labeled a "data warehouse," analytic database products still target the data warehouse.



  1. Comparing the Enterprise Analytic Solutions. Presented by: William McKnight, President, McKnight Consulting Group | williammcknight | www.mcknightcg.com | (214) 514-1444
  2. Proprietary + Confidential. Powering Data Experiences to Drive Growth. Joel McKelvey, Looker Product Management, Google Cloud
  3. Proprietary + Confidential. Source: https://www.forrester.com/report/InsightsDriven+Businesses+Set+The+Pace+For+Global+Growth/-/E-RES130848 "Insights-driven businesses harness and implement digital insights strategically and at scale to drive growth and create differentiating experiences, products, and services." 7x faster growth than global GDP; 30% growth or more using advanced analytics in a transformational way; 2.3x more likely to succeed during disruption
  4. Proprietary + Confidential. *Source: https://emtemp.gcom.cloud/ngw/globalassets/en/information-technology/documents/trends/gartner-2019-cio-agenda-key-takeaways.pdf Rebalance Your Technology Portfolio Toward Digital Transformation. Gartner: Digital-fueled growth is the top investment priority for technology leaders.* Percent of respondents increasing vs. decreasing investment: Cyber/information security, 40% vs. 1%; Cloud services or solutions (SaaS, PaaS, etc.), 33% vs. 2%; Core system improvements/transformation, 31% vs. 10%; Business Intelligence or data analytics solution, 45% vs. 1%
  5. Proprietary + Confidential. Governed metrics | Best-in-class APIs | In-database | Git version-control | Security | Cloud. Integrated Insights: sales reps enter discussions equipped with more context and usage data embedded within Salesforce. Data-driven Workflows: reduce customer churn with automated email campaigns if customer health drops. Custom Applications: maintain optimal inventory levels and pricing with a merchandising and supply chain management application. Modern BI & Analytics: self-service analytics for install operations, sales pipeline management, and customer operations. SQL in, results back
  6. Proprietary + Confidential. 'API-first' extensibility. Technology layers: semantic modeling layer; in-database architecture; built on the cloud strategy of your choice
  7. Proprietary + Confidential. 1 in 2 customers integrate insights/experiences beyond Looker. 2000+ customers. 5000+ developers. Empower People with the Smarter Use of Data
  8. © 2020 Looker. All rights reserved. Confidential. BEACON Digital, Part III: BI Modernization, March 23, 9:30am PST; Embedded Analytics, March 24, 9:30am PST. looker.com/beacon
  9. Proprietary + Confidential. Thank you
  10. Comparing the Enterprise Analytic Solutions. Presented by: William McKnight, President, McKnight Consulting Group | williammcknight | www.mcknightcg.com | (214) 514-1444
  11. William McKnight, President, McKnight Consulting Group • Consulted to Pfizer, Scotiabank, Fidelity, TD Ameritrade, Teva Pharmaceuticals, Verizon, and many other Global 1000 companies • Frequent keynote speaker and trainer internationally • Hundreds of articles, blogs, and white papers in publication • Focused on delivering business value and solving business problems utilizing proven, streamlined approaches to information management • Former database engineer, Fortune 50 information technology executive, and Ernst & Young Entrepreneur of the Year finalist • Owner/consultant: data strategy and implementation consulting firm
  12. McKnight Consulting Group Client Portfolio
  13. Preparing the Organization for the Future
  14. Priorities
  15. Data Success Measurement: user satisfaction; business ROI and growth instigated; data maturity (long-term user satisfaction and business ROI); misc.
  16. Data Profile vs. Usage Profile
  17. Increasing probability that platform selection leads to success: Best Category and Top Tool Picked, 80%; Best Category Picked, 70%; Top 2 Category Picked, 60%; Same Ol' Platform, 50%
  18. Analytic Database Data Stores
  19. No More: • One Size Fits All • The DW for everything
  20. Modern Use Cases: diagram pairing the data lake and data warehouse, plus a machine learning flow in which historical transaction data is split into categorical data (trained and scored with a categorical model, e.g. a decision tree) and quantitative data (trained and scored with a quantitative model, e.g. regression), evaluated, and deployed to score real-time transactions and drive actions
  21. Analytics Reference Architecture: diagram in which sources (logs from apps, web, and devices; user tracking; operational metrics; offloaded data; sensors; and transactional/context data from OLTP/ODS systems) feed raw and processed data topics (JSON, Avro) via ETL, or EL with T in Spark; stream processing and batch paths serve low-latency applications and files; reach-through or ETL/ELT/import loads a distributed analytical warehouse and a governed data lake, all under data governance
  23. Balance of Analytics: diagram showing the shifting balance among analytic applications, the data warehouse, and the data lake, from a DW-only footprint to mixes weighted toward the data lake
  24. Cloud Analytic Databases
  25. Beyond Performance Checklist • Cost Predictability and Transparency • Multi-Cluster Costs • In-Database Machine Learning • SQL Compatibility • Provisioning Workloads with Security Controls • ML Security same as Database Security • Resource Elasticity • Automated Resource Elasticity • Granular Resource Elasticity • Licensing Structure • Cost-Conscious Features • Data Storage Alternatives • Unstructured and Semi-Structured Data Support • Streaming Data Support • Connectivity with standard ETL and Data Visualization software • Concurrency Scaling • Seamless Upgrades • Hot-Pluggable Components • Single Point of Entry for System Administration • Easy Administration • Optimizer Robustness • Disaster Recovery • Workload Isolation
  26. Enterprise Analytic Solutions Setup (i.e., EC2, S3, Redshift) • Login to AWS Console https://console.aws.amazon.com/ • Create and Launch EC2 Instance: choose your Amazon Machine Image, choose your instance type, add storage, configure security group, PEM key pair, connect/SSH to instance, access keys • Set up S3 Storage: create bucket • Set up Redshift: Identity & Access Management, create role and attach policies, configure cluster, launch cluster • Load Data • Query Data
  27. Other Concepts (Redshift example) • Create Statistics • Manual & Automatic Snapshots • Distribution Keys • Elastic Resize • Vacuum Tables • Cluster Parameter Group (for Workload Management) • Short Query Acceleration
  28. Different Terminology • Azure SQL Data Warehouse is scaled by Data Warehouse Units (DWUs), which are bundled combinations of CPU, memory, and I/O. According to Microsoft, DWUs are "abstract, normalized measures of compute resources and performance." • Amazon Redshift uses EC2-like instances with tightly coupled compute and storage, which is a "node" in a more conventional sense. • Snowflake "nodes" are loosely defined as a measure of virtual compute resources. Its architecture is described as "a hybrid of traditional shared-disk database architectures and shared-nothing database architectures." Thus, it is difficult to infer what a "node" actually is. • Google BigQuery does not use the concept of a node at all, but instead refers to "slots" as "a unit of computational capacity required to execute SQL queries."
  29. Sample Enterprise Analytic Platforms
  30. Sample Enterprise Analytic Solutions
  31. Enterprise Analytic Solutions • Actian Avalanche • AWS Redshift • Azure Synapse • Cloudera • Google BigQuery • IBM Db2 Warehouse on Cloud and Cloud Pak for Data (DISCLOSURE: past/current clients among those listed)
  32. Enterprise Analytic Solutions • Micro Focus Vertica • Oracle Autonomous Data Warehouse • Snowflake • Teradata • Yellowbrick (DISCLOSURE: past/current clients among those listed)
  33. Actian Avalanche • MPP relational columnar database built to deliver high performance at low TCO, both in the cloud and on-prem, for BI and operational analytics use cases. • Actian Avalanche is based on its underlying technology, known as Vector. The core of Avalanche is Actian's patented X100 engine, which uses "vectorized query execution": data is processed in chunks of cache-fitting vectors. • Avalanche performs "single instruction, multiple data" processing by applying the same operation to multiple data items simultaneously, exploiting the parallelism of modern hardware. This reduces the overhead of the conventional one-row-at-a-time processing found in other platforms. Additionally, the compressed column-oriented format uses a scan-optimized buffer manager. • The measure of Actian Avalanche compute power is the Avalanche Unit (AU). The price is per AU per hour and includes both compute and cluster storage. • It's a pure column store • Compression is typically 5:1 • Multi-core parallelism • CPU cache is used as execution memory: data is processed in chip cache, not RAM • Storage indexes are created automatically, quickly identifying candidate data blocks for solving queries • Fast and cost-effective
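The vectorized, chunk-at-a-time execution model described above can be sketched in plain Python. This is a conceptual illustration only, not Actian code; the chunk size stands in for a cache-fitting vector.

```python
# Conceptual sketch: vectorized (chunk-at-a-time) execution vs. conventional
# row-at-a-time processing. CHUNK_SIZE is a hypothetical cache-fitting
# vector length, not a real engine parameter.

CHUNK_SIZE = 1024

def scan_sum_row_at_a_time(rows):
    # Conventional model: one value pulled and processed per iteration,
    # paying per-row interpretation overhead every time.
    total = 0
    for value in rows:
        total += value
    return total

def scan_sum_vectorized(rows, chunk_size=CHUNK_SIZE):
    # Vectorized model: operate on a whole chunk per iteration, so the
    # per-row overhead is amortized across the chunk.
    total = 0
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        total += sum(chunk)  # one operation over many data items
    return total

data = list(range(10_000))
assert scan_sum_row_at_a_time(data) == scan_sum_vectorized(data)
```

Both functions compute the same result; the point of the vectorized form is that the inner operation runs over a cache-resident chunk, which is the shape real SIMD-friendly engines exploit.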
  34. Amazon Redshift • Amazon Redshift was the first managed data warehouse service and continues to get a high level of mindshare in this category. • One of the interesting features of Redshift is result set caching. • At the enterprise class, Redshift dense compute nodes (dc2.8xlarge) have 2.56TB per node of solid state drive (SSD) local storage. Dense storage nodes (ds2.8xlarge) have 16TB per node, but on spinning hard disks (HDD) with slower I/O performance. • Redshift has some future-proofing (like Spectrum and short query acceleration) that a modern data engineering approach might utilize. Short query acceleration uses machine learning to provide higher performance, faster results, and better predictability of query execution times. • Amazon Redshift is a fit for organizations needing a data warehouse with a clear, consistent pricing model. Amazon Web Services supports most of the databases in this report, and then some; Redshift is not the only analytic database on AWS, although this sometimes gets conflated.
  35. Azure Synapse • Azure SQL Data Warehouse made its debut for public use in mid-2016. It is a managed-service, dedicated data warehouse offering from the DATAllegro/PDW/APS legacy. Azure SQL Data Warehouse Gen 2, optimized for compute, is a massively parallel processing, shared-nothing architecture on cluster nodes each running Azure SQL Database, which shares the same codebase as Microsoft SQL Server. • Azure SQL Data Warehouse supports 128 concurrent queries, a relatively high number. • Microsoft also has a deep partnership with Databricks, which is becoming very popular in the data science community. The partnership uses Azure Active Directory to log into the database. • Overall, Azure SQL Data Warehouse continues to be an excellent choice for companies needing a high-performance, scalable analytical database in the cloud, or to augment a current on-premises offering with a hybrid architecture at a reasonable cost.
  36. Cloudera Data Warehouse Service • Cloudera Data Warehouse (CDW) boasts flexibility through support for both data center and multiple public cloud deployments, as well as capabilities across analytical, operational, data lake, data science, security, and governance needs. • CDW is part of CDP, a secure and governed cloud service platform that offers a broad set of enterprise data cloud services with the key data functionality for the modern enterprise. CDP was designed to address multi-faceted needs by offering multi-function data management and analytics to solve an enterprise's most pressing data and analytic challenges in a streamlined fashion. • The architecture and deployment of CDP begins with the Management Console, where several important tasks are performed. First, the preferred cloud environment (for example, AWS or Azure) is set up. Second, data warehouse clusters and machine learning (ML) workspaces are launched. Third, additional services, such as Data Catalog, Workload Experience Manager, and Replication Manager, are utilized if required. • The Cloudera Data Warehouse service provides self-service independent virtual warehouses running on top of the data kept in a cloud object store, such as S3.
  37. Google BigQuery • Google BigQuery has the most distinctive approach to cloud analytic databases, with an ecosystem of products for data ingestion and manipulation and a unique pricing apparatus. • The back end is abstracted: BigQuery acts as a RESTful front end to all the Google Cloud storage needed, with all data replicated geographically and Google managing where queries execute. (The customer can choose the jurisdictions of their storage according to their safe harbor and cross-border restrictions.) • Pricing is by data and query, including Data Definition Language (DDL), or flat-rate pricing by "slot," a unit of computational capacity required to execute SQL queries. The flat-rate model may make sense for high-usage customers. Google also lowers the cost of unused storage. • Google Marketing Platform data (including the former DoubleClick), Salesforce.com, AccuWeather, Dow Jones, and 70+ other public data sets can be included in a BigQuery dataset. • Billing is based on the amount of data you query and store. Customers can pre-purchase flat-rate computation "slots" in increments per month per 500 compute units. However, Google recently introduced Flex Slots, which allow slot reservations as short as one minute, billed by the hour. There is a separate charge for active storage of data.
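The bytes-processed pricing model above can be made concrete with a small cost calculator. The $/TB rate below is an illustrative placeholder, not an official price; check Google's current price list before relying on any number.

```python
# Sketch of BigQuery-style on-demand pricing: a query is billed for the
# bytes it scans. The rate is an assumed illustrative figure, not an
# official Google price.

ON_DEMAND_RATE_PER_TB = 5.00   # assumed $/TB scanned; verify current pricing
BYTES_PER_TB = 1024 ** 4

def query_cost(bytes_processed, rate_per_tb=ON_DEMAND_RATE_PER_TB):
    """Dollars billed for one query that scans `bytes_processed` bytes."""
    return rate_per_tb * bytes_processed / BYTES_PER_TB

# At the assumed rate, a query scanning 2 TB costs $10.
assert query_cost(2 * BYTES_PER_TB) == 10.0
```

This is why column pruning and partition filters matter so much on per-bytes-scanned platforms: cost tracks bytes read, not rows returned.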
  38. Micro Focus Vertica in Eon Mode • Vertica is owned by Micro Focus, which introduced the Vertica in Eon Mode deployment as the way to set up a Vertica cluster in the cloud. Vertica in Eon Mode is a fully ANSI SQL compliant relational database management system that separates compute from storage. • Vertica is built on a massively parallel processing (MPP), columnar-based architecture that scales and provides high-speed analytics. • Vertica offers two deployment modes: Vertica in Enterprise Mode and Vertica in Eon Mode. Vertica in Eon Mode uses a dedicated Amazon S3 bucket for storage, with a varying number of compute nodes spun up as necessary to meet the demands of the workloads. • Vertica in Eon Mode also allows the database to be turned "off" and later turned back "on" without cluster disruption. It also has workload management, and its compute nodes can access ORC and Parquet formats in other S3 buckets.
  39. Snowflake Data Warehouse • Snowflake Computing was founded in 2012 as the first data warehouse purpose-built for the cloud. Snowflake has seen tremendous adoption, including international accounts and deployment on the Azure cloud. • Snowflake's compute scales in full cluster increments, with node counts in powers of two. Spinning up or down is instant and requires no manual intervention, resulting in leaner operations. Snowflake scales linearly with the cluster (i.e., for a four-node cluster, moving to the next incremental size results in a four-node expansion). • Regarding billing, you pay per second, and only for the compute in use. • On Amazon AWS, Snowflake is architected to use Amazon S3 as its storage layer and has the native advantage of being able to reference an S3 bucket within the COPY command syntax. On Microsoft Azure, it uses Azure Blob storage. • The UI is well regarded.
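The powers-of-two scaling pattern described above can be sketched as follows. The size labels mirror Snowflake's t-shirt sizing, but treat the exact size-to-node mapping as illustrative of the doubling pattern rather than an official specification.

```python
# Sketch of powers-of-two warehouse sizing: each step up doubles the node
# count. The mapping below illustrates the pattern the slide describes;
# it is not an official Snowflake sizing table.

WAREHOUSE_SIZES = ["XS", "S", "M", "L", "XL", "2XL", "3XL", "4XL"]

def nodes_for(size):
    # XS = 1 node, and every step up doubles: 1, 2, 4, 8, ...
    return 2 ** WAREHOUSE_SIZES.index(size)

def expansion_on_resize(size):
    # Moving up one size adds as many nodes as the cluster already has,
    # e.g. a 4-node "M" warehouse grows by 4 nodes to an 8-node "L".
    return nodes_for(size)

assert nodes_for("M") == 4
assert expansion_on_resize("M") == 4                      # the 4-node example
assert nodes_for("L") == nodes_for("M") + expansion_on_resize("M")
```

This matches the slide's example: a four-node cluster moving to the next incremental size gains four nodes, i.e. the cluster doubles.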
  40. Teradata Vantage • Teradata is available on Amazon Web Services, Teradata Cloud, VMware, Microsoft Azure, on-premises, and IntelliFlex, Teradata's latest MPP architecture with separate storage and compute. • With Vantage, Teradata is still the gold standard in complex mixed-workload query situations for enterprise-level, worry-free concurrency, scaling requirements, and predictably excellent performance featuring top-notch non-functional requirements. • Dynamic resource prioritization and workload management.
  41. Understanding Pricing 1/2 • The price-performance metric is dollars per query-hour ($/query-hour). This is defined as the normalized cost of running a workload, calculated by multiplying the hourly rate offered by the cloud platform vendor by the number of computation nodes used in the cluster, and dividing that amount by the aggregate total of the execution time. • To determine pricing, each platform has different options; buyers should be aware of all of them. • For Azure SQL Data Warehouse, you pay for compute resources as a function of time. The hourly rate for SQL Data Warehouse varies slightly by region. Also add the separate storage charge to store the (compressed) data, at a rate of $ per TB per hour. • For Amazon Redshift, you also pay for compute resources (nodes) as a function of time. Redshift also has reserved instance pricing, which can be substantially cheaper than on-demand pricing; it is available with 1- or 3-year commitments and is cheapest when paid in full upfront.
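The $/query-hour definition above can be implemented literally as a small function. This is a sketch of the formula exactly as stated on the slide; the rates and times in the example are made-up illustrative numbers, not vendor prices.

```python
# Literal sketch of the price-performance metric defined above:
#   $/query-hour = (vendor hourly rate per node * node count)
#                  / aggregate execution time of the workload (hours)
# Inputs are illustrative; lower values mean better price-performance.

def dollars_per_query_hour(rate_per_node_hour, node_count, total_exec_hours):
    """Normalized cost of running a benchmark workload."""
    return (rate_per_node_hour * node_count) / total_exec_hours

# E.g. a 4-node cluster at a hypothetical $8/node-hour, with the workload
# taking 2 aggregate hours, scores $16 per query-hour.
assert dollars_per_query_hour(8.0, 4, 2.0) == 16.0
```

Because the node-hour rate and the execution time both enter the metric, a platform can win either by being cheap per node or by finishing the workload faster.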
  42. Understanding Pricing 2/2 • For Snowflake, you pay for compute resources as a function of time, just like SQL Data Warehouse and Redshift. However, you choose the hourly rate based on the enterprise features you need ("Standard", "Premier", "Enterprise"/multi-cluster, "Enterprise for Sensitive Data", and "Virtual Private Snowflake"). • With Google BigQuery, one option is to pay for bytes processed at $ per TB; there is also BigQuery flat-rate pricing. • Azure SQL Data Warehouse pricing: https://azure.microsoft.com/en-us/pricing/details/sql-data-warehouse/gen2/. • Amazon Redshift pricing: https://aws.amazon.com/redshift/pricing/. • Snowflake pricing: https://www.snowflake.com/pricing/. • Google BigQuery pricing: https://cloud.google.com/bigquery/pricing.
  43. Design Your Benchmark • What are you benchmarking? Query performance; load performance; query performance with concurrency; ease of use • Competition • Queries, schema, data • Scale • Cost • Query cut-off • Number of runs/cache • Number of nodes • Tuning allowed • Vendor involvement • Any free third-party, SaaS, or on-demand software (e.g., Apigee or SQL Server) • Any not-free third-party, SaaS, or on-demand software • Instance type of nodes • Measure price/performance!
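Several items on the checklist above (number of runs, caching effects, timing discipline) can be grounded in a tiny harness sketch. The callables below are stand-ins; a real benchmark would submit the fixed query set to the system under test.

```python
import time

# Toy benchmark harness reflecting the checklist: a fixed query set, a set
# number of runs per query (runs must be >= 2), and a first-run ("cold") vs.
# best-repeat-run ("warm") split so result-set caching can be observed.
# The "queries" are hypothetical stand-in callables, not real SQL.

def run_benchmark(queries, runs=3):
    results = {}
    for name, query_fn in queries.items():
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            query_fn()
            timings.append(time.perf_counter() - start)
        # Cold = first execution; warm = best of the repeats.
        results[name] = {"cold": timings[0], "warm": min(timings[1:])}
    return results

# Stand-in workload for illustration only.
report = run_benchmark({"q1": lambda: sum(range(1000))})
assert set(report["q1"]) == {"cold", "warm"}
```

Separating cold from warm runs matters when comparing platforms with result-set caching (e.g., Redshift) against those without; decide up front which number your benchmark reports.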
  44. Summary • Data professionals are sitting on the future of the organization • Data architecture is an essential organizational skill • Artificial intelligence will drive the organization of the future • All need a high-standard data warehouse • Cloud analytic databases suit most organizational workloads • Adopt a columnar orientation to data for analytic workloads • Data lakes are becoming essential • Use cloud storage or managed Hadoop for the data lake • Keep an eye on developments in information management and how they apply to your organization
  45. Comparing the Enterprise Analytic Solutions. Presented by: William McKnight, President, McKnight Consulting Group | williammcknight | www.mcknightcg.com | (214) 514-1444