ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture

  1. When and How Data Lakes Fit into a Modern Data Architecture Presented by: William McKnight “#1 Global Influencer in Data Warehousing” Onalytica President, McKnight Consulting Group An Inc. 5000 Company in 2018 and 2017 @williammcknight (214) 514-1444 Second Thursday of Every Month, at 2:00 ET #AdvAnalytics
  2. William McKnight President, McKnight Consulting Group • Frequent keynote speaker and trainer internationally • Consulted to Pfizer, Scotiabank, Fidelity, TD Ameritrade, Teva Pharmaceuticals, Verizon, and many other Global 1000 companies • Hundreds of articles, blogs, benchmarks and white papers in publication • Focused on delivering business value and solving business problems utilizing proven, streamlined approaches to information management • Former Database Engineer, Fortune 50 Information Technology executive and Ernst & Young Entrepreneur of the Year Finalist • Owner/consultant: 2018 & 2017 Inc. 5000 Data Strategy and Implementation consulting firm • Brings 25+ years of information management and DBMS experience
  3. McKnight Consulting Group Offerings • Strategy § Trusted Advisor § Action Plans § Roadmaps § Tool Selections § Program Management • Training § Classes § Workshops • Implementation § Data/Data Warehousing/Business Intelligence/Analytics § Master Data Management § Governance/Quality § Big Data 3
  4. Analytic Data Stores
  5. 3 Major Decisions • Decision #1: The Data Store Type – The largest factor distinguishing between databases and file-based scale-out systems is the data profile. The latter are best for data that fits the loose label of 'unstructured' (or semi-structured) data, while more traditional data -- and smaller volumes of all data -- still belong in a relational database. • Decision #2: Data Store Placement – You must also decide where to place your data store -- on-premises or in the cloud (and which cloud). In the past, the only clear choice for most organizations was an on-premises deployment. However, the costs of scale are gnawing away at the notion that this remains the best approach for a data platform. • Decision #3: The Workload Architecture – Finally, you must keep in mind the distinction between operational and analytical workloads. Short transactional requests and more complex (often longer) analytics requests demand different architectures. Analytics databases, though quite diverse, are the preferred platforms for the analytics workload. 5
  6. Whither the idea of the Data Warehouse? [Architecture diagram — Tier 1: intake and export (files; transactional app data; full, delta, and stream feeds; structured and big data); Tier 2: master data, reference data hub, transaction data hub, common summary and derived values, conformed dimension distribution; Tier 3: relational access layers 1..n (regional and departmental views, ADS, applications and engines, operational analytics and hot views, independent and dependent data marts)] 6
  7. Data Warehousing • Data Warehouses (still) have a lower total cost of ownership than data marts • A data warehouse is a SHARED platform – Build once, use many – Access at Data Warehouse – Access by creating a mart off the DW • Still A LOT cheaper than building from scratch “… a subject- oriented, integrated, non-volatile, time- variant collection of data, organized to support management needs.” — Bill Inmon
  8. Reasons for Analytic Architecture Change • Take Advantage Of… – Cloud Databases – Get into a Columnar Data Orientation – Get into the Data Architecture you want – Cloud Storage • Projects Requiring Consolidated Data 8
  9. The Key is Right-Fitting Platforms • THE Data Warehouse – Value-Added Components: Modeling for Access, Data Quality, Tooling, Conformed Dimensions, Data Governance, Etc. • A Dependent Data Mart (Fed from the Data Warehouse) • A Data Lake • A Big Data Cluster • An Independent Data Mart • An Operational Hub • An Operational Data Lake 9
  10. Sensible Divisions of Analytic Platforms [Diagram: data lake, data warehouse, and data mart plotted along two axes — usage understanding by the builders and data cultivation]
  11. The Post-Operational Ecosystem [Diagram: the data lake alongside the data warehouse (DW) and its data marts (DM)] 11
  12. Usage Understanding by the Builders [Diagram — "what if?": a combined data warehouse/lake and the data mart plotted along the same usage-understanding vs. data-cultivation axes]
  13. Deploying the Data Lake
  14. Data Lake: Data Scientist Workbench and Data Warehouse Staging [Diagram: OLTP systems (ERP, CRM, Supply Chain, MDM, …) feed the data lake via stream or batch updates; data scientists work directly in the lake; data integration (DI) populates the data warehouse and data mart; real-time, event-driven apps also consume from the lake] 14
  15. Data Lake Patterns • Data Refinery – Do Data Warehouse ETL in the Data Lake • Archive Storage • Data Science Lab • [Data Lake as the Data Warehouse] 15
  16. Data Lake Example Components [Diagram — Data sources: files, RDBMS, streaming; Ingest: Kafka, Pulsar, Snowball, Kinesis; Central data store: Hadoop, cloud storage; Process: EMR, Glue; Catalog & user interface: DynamoDB, ElasticSearch, web interface, API Gateway; Access management: IAM & Cognito; Governance; Analyze: Python, R, machine learning, QuickSight] 16
  17. Data Lake Setup • Managed deployments in the Hadoop family of products • External tables in Hive metastore that point at cloud storage (Amazon S3, Google Cloud Storage, Azure Data Lake Storage Gen 2) – To run SQL against the data – HiveQL and Spark SQL require entries in the metastore 17
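The metastore entry the slide describes can be sketched as a small DDL generator. This is a minimal illustration, not the deck's own code: the table, columns, and bucket name are hypothetical, and the generated statement follows standard HiveQL `CREATE EXTERNAL TABLE` syntax that Hive and Spark SQL both accept.

```python
# Minimal sketch: build the HiveQL DDL that registers data already sitting
# in cloud storage as an external table in the Hive metastore.
# Table, column, and bucket names below are hypothetical examples.

def external_table_ddl(table, columns, location, fmt="PARQUET"):
    """Return a CREATE EXTERNAL TABLE statement for the Hive metastore."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"STORED AS {fmt}\n"
        f"LOCATION '{location}'"
    )

ddl = external_table_ddl(
    "sales_events",
    [("event_id", "BIGINT"), ("amount", "DECIMAL(12,2)"), ("ts", "TIMESTAMP")],
    "s3://my-lake-bucket/sales_events/",  # hypothetical bucket path
)
print(ddl)
```

Once a statement like this is executed (via the Hive CLI or `spark.sql(ddl)`), SQL queries run against the cloud-storage files in place, with no data copied into the cluster.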
  18. Hadoop Instances and Object Storage • Hadoop instances/clusters have local storage—i.e., on the physical drives mounted to the instances themselves—which is used for HDFS and Hive • The managed Hadoop platforms access their cloud vendor’s respective object storage—viz.: – Amazon EMR accesses S3 – Dataproc accesses Google Cloud Storage – HDI accesses Azure Data Lake Storage Gen2 • Local storage is otherwise used by the platform for housekeeping 18
  19. The Data Warehouse of the Future • Pair a lake with an analytical engine that charges only by what you use • If you have a ton of data that can sit in cold storage and only needs to be accessed or analyzed occasionally, store it in Amazon S3/Azure Blob Storage/Google Cloud Storage – Use a database (on-premise or in the cloud) that can create external tables that point at the storage – Analysts can query directly against it, or draw down a subset for some deeper/intensive analysis – The GB/month storage fee plus data transfer/egress fees will be much cheaper than leaving it in a data warehouse 19
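The cost claim in the slide above can be made concrete with back-of-the-envelope arithmetic. All three rates below are illustrative assumptions (roughly in line with published cloud pricing tiers, but not current vendor list prices); the point is the shape of the comparison, not the exact dollars.

```python
# Back-of-the-envelope comparison for cold data: object storage plus
# occasional egress vs. keeping the data resident in a data warehouse.
# All rates are assumed, illustrative values — check current vendor pricing.

COLD_STORAGE_PER_GB_MONTH = 0.023  # assumed object-storage rate, $/GB-month
EGRESS_PER_GB = 0.09               # assumed data-transfer-out rate, $/GB
WAREHOUSE_PER_GB_MONTH = 0.25      # assumed warehouse storage rate, $/GB-month

def monthly_cost_in_lake(gb_stored, gb_egressed):
    """Storage fee plus egress for the subset actually drawn down."""
    return gb_stored * COLD_STORAGE_PER_GB_MONTH + gb_egressed * EGRESS_PER_GB

def monthly_cost_in_warehouse(gb_stored):
    """Cost of leaving the same data resident in the warehouse."""
    return gb_stored * WAREHOUSE_PER_GB_MONTH

lake = monthly_cost_in_lake(10_000, 200)   # 10 TB stored, 200 GB pulled out
wh = monthly_cost_in_warehouse(10_000)
print(f"lake: ${lake:,.2f}/mo  warehouse: ${wh:,.2f}/mo")
# Under these assumed rates the lake is roughly an order of magnitude cheaper
```

The gap only widens as the cold fraction grows, which is why occasionally-accessed data is the natural first candidate to move out to cloud storage.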
  20. Notes on the Data Warehouse of the Future • A separated compute and storage architecture becomes more achievable • Compute resources (MapReduce, Hive, Spark, etc.) can be taken down, scaled up or out, or interchanged without data movement • Storage can be centralized while compute is distributed • The major players have mechanisms to ensure consistency and achieve ACID-like compliance • Remote data replication ensures redundancy and recovery • Most query execution time is processing, not data transport, so if cloud compute and storage are in the same cloud vendor region, performance is hardly impacted 20
  21. Sample Cluster Configuration – Google BigQuery • Cloud Provider: Google Cloud Platform • Version: 3.6 • Hadoop Version: 2.7.3 • Hive Version: 1.2.1 • Spark Version: 2.3.2 • Instance Type: n1-highmem-16 • Head/Master Nodes: 1 • Worker Nodes: 16 and 32 • vCPUs (per node): 16 • RAM (per node): 104 GB • Compute Cost (per node per hour): $0.947 • Platform Premium (per node per hour): $0.160 21
  22. Tips • If possible, configure remote data to be stored in Parquet format, as opposed to comma-separated or other text formats • As new data sources are added to cloud storage, use a code distribution system—like GitHub—to distribute new table definitions to distributed teams • Use data partitioning to improve performance—but don’t forget new partitions have to be declared to the Hive metastore when they are added to the data • Co-locate compute and storage in the same region • Use AES-256 encryption on cloud storage buckets to ensure encryption at rest • Hold the remotely stored data to the same governance and data quality standards you would if it were on-premises—consider a data catalog or other metadata technique to keep the data organized and easy to find for new compute engines • Drop commonly used data, like master data from MDM, in the lake 22
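The partitioning tip above has a concrete mechanical step: when a new partition folder lands in the bucket, the metastore must be told about it before queries can see the data. A small sketch of generating that statement (table name, bucket, and `dt` partition key are all hypothetical; the output follows standard HiveQL `ALTER TABLE … ADD PARTITION` syntax):

```python
# Sketch: declare a newly-landed cloud-storage partition to the Hive
# metastore. The table, bucket, and dt= layout are hypothetical examples.

def add_partition_ddl(table, dt, bucket):
    """Return the HiveQL that registers one date partition of a table."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{dt}') "
        f"LOCATION 's3://{bucket}/{table}/dt={dt}/'"
    )

stmt = add_partition_ddl("sales_events", "2019-05-01", "my-lake-bucket")
print(stmt)
```

In practice this step is often automated by whatever job lands the data, or replaced with `MSCK REPAIR TABLE` (or a cloud crawler service) that scans the bucket layout and registers any missing partitions.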
  23. The Data Science Lab Role of the Data Lake
  24. Artificial Intelligence and Machine Learning • Looming on the horizon is an injection of AI/ML into every piece of software • Consider the domain of data integration – Predicting with high accuracy the steps ahead – Fixing its own bugs • Machine learning is being built into databases so the data can be analyzed as it is loaded – e.g., Python with TensorFlow, or Scala on Spark • The split of the necessary AI/ML between the "edge" of corporate users and the software itself is still to be determined 24
  25. Training Data for Machine Learning & Artificial Intelligence • You must have enough data to analyze to build models • Your data determines the depth of AI you can achieve -- for example, statistical modeling, machine learning, or deep learning -- and its accuracy 25
  26. AI Data • Call center recordings and chat logs • Streaming sensor data, historical maintenance records and search logs • Customer account data and purchase history • Email response metrics • Product catalogs and data sheets • Public references • YouTube video content audio tracks • User website behaviors • Sentiment analysis, user-generated content, social graph data, and other external data sources 26
  27. When and How Data Lakes Fit into a Modern Data Architecture Presented by: William McKnight “#1 Global Influencer in Data Warehousing” Onalytica President, McKnight Consulting Group An Inc. 5000 Company in 2018 and 2017 @williammcknight (214) 514-1444 Second Thursday of Every Month, at 2:00 ET #AdvAnalytics