
seven steps to dataops @ dataops.rocks conference Oct 2019


given at the dataops.rocks conference October 2019

Published in: Technology


  1. Seven Steps to DataOps (+3)
  2. Copyright © 2019 by DataKitchen, Inc. All Rights Reserved. Gartner Research estimates that 50% or more of data science/analytics projects fail. Question: Would you go to a restaurant where you had to send your food back half the time?
  3. If you own that restaurant, how do you fix the problem? • Buy a large high-tech stove and cook ‘Big Food’? • Recruit a new type of chef: a ‘Food’ Scientist? • Chase the latest gastronomic trend: AI (Asian Influenced) Cuisine?
  4. Why DataOps? We are failing at being data-driven • 87% of data science projects never make it into production. • Data analytics investment is up, yet “data driven” organizations are down from 37% to 31% since 2019. • 80% of AI projects resemble alchemy (Gartner) • 60% of all data analytics projects fail (BI, Data Lake, Warehouse, Science) • 79% of data projects have too many errors
  5. Strategic Trend: DataOps • DataOps Manifesto, 2017 • Gartner Hype Cycle in late 2018 • Increased market adoption of DataOps principles by leaders of data and analytics teams in 2019
  6. Seven Steps to DataOps (+3) 1. Orchestrate Two Journeys 2. Add Tests And Monitoring 3. Use a Version Control System 4. Branch and Merge 5. Use Multiple Environments 6. Reuse & Containerize 7. Parameterize Your Processing + Three Bonus Steps (Architecture, Inter/Intra Team Collaboration & Process Analytics)
  7. Orchestrate data to customer value. Analytic processes are like manufacturing: materials (data) flow through production steps to outputs (refined data, charts, graphs, models). Access: Python Code → Transform: SQL Code, ETL → Model: R Code → Visualize: Tableau Workbook → Report: Tableau Online ❶
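The Access → Transform → Model chain on the slide can be sketched as an orchestrated pipeline where each stage's output feeds the next. A minimal Python sketch; the stage names follow the slide, but the function bodies are illustrative placeholders, not DataKitchen's actual tooling:

```python
# Minimal orchestration sketch: each production stage is a step whose
# output feeds the next, mirroring the Access -> Transform -> Model chain.
# The stage bodies are placeholders, not real access/ETL/model code.

def access(source):
    # stand-in for pulling raw rows (the slide's Python access step)
    return list(source)

def transform(rows):
    # stand-in for the SQL/ETL step: keep rows above a threshold
    return [r for r in rows if r > 2]

def model(rows):
    # stand-in for the R model step: a single summary statistic
    return sum(rows) / len(rows)

def run_pipeline(source):
    # the orchestrator runs the stages in order, passing data down the line
    return model(transform(access(source)))

print(run_pipeline([1, 2, 3, 4, 5]))  # transform keeps [3, 4, 5] -> mean 4.0
```

The point of the sketch is the shape: one orchestrator owns the order of steps, so the whole journey from raw data to output can be run, monitored, and tested as a unit.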
  8. Speed deployment to production. Analytic processes are like software development: deliverables continually move from development to production. A diverse team (Data Engineers, Data Scientists, Data Analysts) with diverse tools serves diverse customers (business customers, products & systems). ❶
  9. Innovation and Value Pipeline together: focus on both orchestration and deployment while automating & monitoring quality. “I don’t want to break production when I deploy my changes.” “I don’t want to learn about data quality issues from my customers.” ❶
  10. Add Automated Monitoring And Tests. Move fast and count things. Monitoring: ensure that data quality remains high in the Value Pipeline. Tests: before promoting work, running new and old tests gives high confidence that the change did not break anything in the Innovation Pipeline. ❷
  11. Automate Monitoring & Tests In Production. Test every step and every tool in your Value Pipeline, and save test results! Are your data inputs free from issues? Is your business logic still correct? Are your outputs consistent? Access: Python Code → Transform: SQL Code, ETL → Model: R Code → Visualize: Tableau Workbook → Report: Tableau Online ❷
  12. Example Tests (Basic) ❷
  13. Example Test (Location Balance) ❷ Source: 1 million rows; database: 1 million rows (300K facts + 700K dimensions); report: 300K facts, 700K dimensions. Pipeline: Access: Python Code → Transform: SQL Code, ETL → Model: R Code → Visualize: Tableau Workbook → Report: Tableau Online
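A location balance test checks that record counts agree at every location the data lands in. A hypothetical sketch using the figures from the slide (the function and its signature are illustrative):

```python
# Hypothetical location balance test: verify that record counts match at
# each location the data lands in, using the figures from the slide.

def location_balance(source_rows, db_rows, db_facts, db_dims,
                     report_facts, report_dims):
    assert source_rows == db_rows, "source vs database row count mismatch"
    assert db_facts + db_dims == db_rows, "facts + dims must sum to database rows"
    assert (db_facts, db_dims) == (report_facts, report_dims), "report counts drifted"
    return True

# Slide figures: 1M source rows -> 1M database rows (300K facts + 700K dims) -> report
print(location_balance(1_000_000, 1_000_000, 300_000, 700_000,
                       300_000, 700_000))  # True
```

If any stage silently drops or duplicates rows, the corresponding assertion fails and the run can be stopped before the report publishes.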
  14. Example Test (Historical Balance) ❷ Production data, pipeline & environment: SKU1 (P1, G1) 100; SKU2 (P1) 50; SKU3 (P2) 75; SKU4 (P3, G2) 125; SKU5 (P4) 200; SKU6 (P5) 25; total 575; group totals G1 225, G2 350. Pre-production data, pipeline & environment: SKU1 (P1, G1) 101; SKU2 (P1) 55; SKU3 (P2) 76; SKU4 (P3) 126; SKU5 (P4, G2) 200; SKU6 (P5) 29; total 587; group totals G1 358, G2 229. Both run the same pipeline: Access: Python Code → Transform: SQL Code, ETL → Model: R Code → Visualize: Tableau Workbook → Report: Tableau Online
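A historic balance test compares aggregates from a new run against the production run. A hypothetical Python sketch using the slide's SKU volumes; the 10% tolerance is an assumed threshold, not a figure from the deck:

```python
# Hypothetical historic balance test: compare per-group volume totals from
# pre-production against production, flagging drift beyond a tolerance.
# The (group, volume) rows come from the slide's SKU table; the 10%
# tolerance is an assumption for illustration.

def group_totals(rows):
    totals = {}
    for group, volume in rows:
        totals[group] = totals.get(group, 0) + volume
    return totals

def historic_balance(prod_rows, preprod_rows, tolerance=0.10):
    prod, pre = group_totals(prod_rows), group_totals(preprod_rows)
    failures = []
    for group, total in prod.items():
        drift = abs(pre.get(group, 0) - total) / total
        if drift > tolerance:
            failures.append(group)
    return failures

prod = [("G1", 100), ("G1", 50), ("G1", 75), ("G2", 125), ("G2", 200), ("G2", 25)]
pre = [("G1", 101), ("G1", 55), ("G1", 76), ("G1", 126), ("G2", 200), ("G2", 29)]
print(historic_balance(prod, pre))  # both groups drift past 10% -> ['G1', 'G2']
```

Here the slide's shifted group assignment in pre-production (G1: 358 vs 225, G2: 229 vs 350) trips the test for both groups, exactly the kind of silent change this check is meant to catch before promotion.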
  15. Production Testing and Monitoring: lower your error rates and embarrassment! [HOW] Test and monitor: • Automatically, in production • Across the entire tool chain • Send alerts / notifications • Keep track of history • Make it easy to create tests. [WHAT] Test types: • Traditional Data Quality • Statistical Process Control • Location Balance Test • Historic Balance Test • Business Based Tests
  16. For the Innovation Pipeline, Tests Are Also For Code: keep data fixed, deploy the feature, and run all tests before promoting. ❷
  17. Duality of Tests. Automated ‘tests’ serve a dual purpose: 1. Data tests and monitoring in production 2. Regression, functional and performance tests in development. Data variable, code fixed: Value Pipeline. Data fixed, code variable: Innovation Pipeline. Quality your customer receives = f(data, code). https://medium.com/data-ops/disband-your-impact-review-board-automate-analytics-testing-42093d09fe11 ❷
  18. Use a Version Control System. At the end of the day, analytic work is all just code: Access: Python Code; Transform: SQL Code, ETL Code; Model: R Code; Visualize: Tableau Workbook XML; Report: Tableau Online. All under source code control. ❸
  19. Branch & Merge. Source code control branching & merging enables people to safely work on their own tasks. ❹
  20. Use Multiple Environments. Your analytic work requires coordinating tools and hardware in an Analytic Environment: Access: Python Code; Transform: SQL Code, ETL Code; Model: R Code; Visualize: Tableau Workbook XML; Report: Tableau Online. ❺
  21. Use Multiple Environments. Provide an Analytic Environment for each branch: • Analysts need a controlled environment for their experiments • Engineers need a place to develop outside of production • Update production only after all tests are run! ❺
  22. Sandboxes Are Complex. An Analytic Environment / development sandbox needs the right hardware and software versions, hardware & network configurations, the team’s analytic tools (R models, Alteryx business ETL, Redshift data, SQL ETL, Tableau workbooks, Python), test data sets, a code branch, and test result history. Creation is complex: it is hard to assemble the right set of data, tools, people, history and configuration for a fast build-test-debug cycle. ❺
  23. Reuse & Containerize. Containerize: 1. Manage the environment for each component (e.g. Docker, AMI) 2. Practice environment version control. Reuse: 1. Do not create one ‘monolith’ of code 2. Reuse the code and results. ❻
  24. Parameterize Your Processing. Think of your Value Pipeline like a big function: • Named sets of parameters will increase your velocity • With parameters, you can vary inputs, outputs, and steps in the workflow • You can make a time machine • Use secure storage for credentials ❼
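The "pipeline as one big function" idea can be sketched with named parameter sets; a hypothetical Python sketch in which the set names, paths, and step names are all illustrative, not from the deck:

```python
# Hypothetical sketch of "the pipeline as one big function": named parameter
# sets vary inputs, outputs, and workflow steps without editing pipeline code.
# The set names, paths, and step names are illustrative only.

PARAMETER_SETS = {
    "daily_prod": {"input": "raw/daily", "output": "warehouse.daily",
                   "steps": ["access", "transform", "model", "report"]},
    # a "time machine": re-run an earlier period by swapping the parameter set
    "backfill_q1": {"input": "raw/2019q1", "output": "warehouse.q1",
                    "steps": ["access", "transform"]},
}

def run_pipeline(params):
    # stand-in orchestrator: record which step ran against which input/output
    return [f"{step}: {params['input']} -> {params['output']}"
            for step in params["steps"]]

print(run_pipeline(PARAMETER_SETS["backfill_q1"]))
```

Because only the parameter set changes between a daily production run and a historical backfill, the pipeline code itself stays identical and fully tested; credentials would live in secure storage referenced by name, never in the parameter sets themselves.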
  25. The Seven Steps In Action: 1. Select story 2. Create branch 3. Create environment 4. Implement feature 5. Write new tests 6. Run new and existing tests 7. Check in to branch 8. Merge to parent 9. Delete environment. When the sprint ends: • Deliver all completed features to the customer • Merge the sprint branch to master • Roll un-merged features into the next sprint
  26. Three Bonus Steps! • DataOps Data Architecture • DataOps Collaboration (Inter and Intra Team) • DataOps Measurement
  27. Why a DataOps-Centric Architecture? • Canonical data architectures only consider production, not the process of making changes to production. • A DataOps data architecture makes the steps to change what is in production a central idea: changes over time to your code, servers, and tools, and monitoring for errors, are first-class citizens of the design. • Why? A canonical architecture is a little like designing a mobile phone with a fixed battery: you can end up with processes characterized by unplanned work, manual deployment, errors, and bureaucracy. • Think of it as a ‘Right to Repair’ data architecture.
  28. Canonical Data Architecture: does not reflect collaboration and operations. Production Environment: Source Data → Raw Lake → Data Engineering → Refined Data → Data Science → Data Viz → Data Customers, with Data Governance across the flow.
  29. DataOps Data Architecture: spans the tool chain & environments (cloud/on-prem). Dev, Test, and Production environments each run the same flow (Source Data → Raw Lake → Data Engineering → Refined Data → Data Science → Data Viz → Data Customers, with Data Governance), and each is orchestrated, monitored, and tested. The DataKitchen DataOps layer adds storage & version control, history & metadata, auth & permissions, environment secrets, DataOps metrics & reports, automated deployment, and environment creation and management for the DataOps team.
  30. DataOps and Intra-Team Coordination. Making a request a reality is a multi-step, multi-person, multi-environment process (Chris, DataOps Engineer; Eric, Production Engineer; Betty, Data Engineer; Pat, Data Scientist). Challenges: • How to leverage best practices and re-use? • How to collaborate and coordinate work? • How to ease movement between team members with many tools and environments? • How to maintain security? • How to automate work and reduce manual errors?
  31. Intra-Team Coordination. Production Environment: • Separate hardware/software environment • Secure • No access by developers • Managed by Eric (Production Engineer) • Separate credentials. Development Environment: • Separate hardware/software environment • Secure • Access by data engineers, data scientists, analysts and DataOps engineers • Set up by Chris (DataOps Engineer) • Separate credentials.
  32. Inter-Team Coordination: multiple locations, orgs, tasks.
  33. Inter-Team Coordination: Two Locations, Multiple Tools. Home Office Team (Boston): Data Engineer and Data Scientist; centralized, weekly cadence of changes. Local Office ‘Self-Service’ Team (New Jersey): Data Analyst and VP Marketing; distributed, daily/hourly cadence of changes.
  34. Challenges With Coordination. Make a change in schema: break reports? Add new data sets: not available for all? Change report calculations: inconsistencies? New data & schema, updated/new report: not working? (Home Office Team: Data Engineer, Data Scientist; Local Office ‘Self-Service’ Team: Data Analyst, VP Marketing)
  35. Shared Result, Separate Responsibilities. Home Office Team (Data Engineer, Data Scientist): Calculate: SQL; Segment: Python; Transform: SSIS; Load: SQLServer; Deploy: Tableau; Publish: T Server. Local Office ‘Self-Service’ Team (Data Analyst, VP Marketing): Add Data: Alteryx.
  36. Overall Orchestration. The same steps, orchestrated end to end across both teams: Calculate: SQL; Segment: Python; Transform: SSIS; Load: SQLServer; Deploy: Tableau; Publish: T Server; Add Data: Alteryx.
  37. DataOps Process Analytics. Analytic teams are not very analytic about measuring and improving their internal work. Prove your team’s value; measure: • Team and individual productivity • Production error rates • Data provider error rates • SLAs • Production deployment rates • Release environments • Test coverage • Customizable, with data export to fit your company’s needs
  38. DataOps: data about data. Statistical process control graphs monitoring “bad IDs” and raw row counts.
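A statistical process control check of the kind graphed here can be sketched in a few lines; a hypothetical Python sketch using three-sigma control limits on raw row counts (the history values are made up for illustration):

```python
# Hypothetical statistical process control check: flag a new raw row count
# that falls outside three standard deviations of the recent history.
# The history values are made up for illustration.

def control_limits(history):
    n = len(history)
    mean = sum(history) / n
    variance = sum((x - mean) ** 2 for x in history) / n
    sigma = variance ** 0.5
    return mean - 3 * sigma, mean + 3 * sigma

def in_control(history, new_value):
    lo, hi = control_limits(history)
    return lo <= new_value <= hi

history = [1000, 1020, 990, 1010, 1005]  # recent daily raw row counts
print(in_control(history, 1015))  # near the mean -> True
print(in_control(history, 1))     # collapsed row count -> False
```

Rather than a fixed threshold, the limits adapt to what the pipeline normally produces, so a sudden collapse or spike in row counts raises an alert even when no hard-coded rule anticipated it.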
  39. Per-project analytics: error rates decline in production; on-time delivery within SLA and decreasing build time; productivity (Recipe work) increasing; team collaboration, deploys between environments, and the number of automated tests all increasing.
  40. DataOps Benefit: Time Well Spent. Before DataOps, the team’s weekly time is dominated by errors & operational tasks, with the remainder split between new features & data for customers and improvements & debt.
  41. After DataOps, the share of weekly time spent on errors & operational tasks shrinks, new features & data for customers grows, and process improvements & tech debt reduction get dedicated time.
  42. DataOps Benefit: Faster, Better & Happier. Before DataOps: high production errors; deployment latency from dev to prod of weeks or months.
  43. After DataOps: low production errors; deployment latency from dev to prod of hours or minutes.
  44. To Learn More On DataOps • For these slides, contact me: cbergh@datakitchen.io • DataOps Manifesto: http://dataopsmanifesto.org • DataOps Blog: http://medium.com/data-ops • DataOps Book • Come visit our table today!
