
Measuring Data Quality with DataOps


Most organisations think that they have poor data quality, but don't know how to measure it or what to do about it. Teams of data scientists, analysts, and ETL developers either blindly take a "garbage in -> garbage out" approach or, worse still, "cleanse" data to fit their limited perspectives. DataOps is a systematic approach to measuring data quality and planning mitigations for bad data.

Published in: Data & Analytics


  1. Clarity Cloudworks: illuminating issues before they become problems
  2. Development and Operations are not the only groups in IT
  3. Data Teams • Are focused on urgent, unplanned work • Traditionally operate the systems they develop, because they don't perceive hand-off is possible • Scant theory: what little writing exists is technology-focused
  4. The DataOps Manifesto: Whether referred to as data science, data engineering, data management, big data, business intelligence, or the like, through our work we have come to value in analytics: https://www.dataopsmanifesto.org/
  5. Individuals and interactions over processes and tools
  6. Working analytics over comprehensive documentation
  7. Customer collaboration over contract negotiation
  8. Experimentation, iteration, and feedback over extensive upfront design
  9. Cross-functional ownership of operations over siloed responsibilities
  10. Do you have Bad Data? In the absence of information, rumour becomes widely believed. Rumour is biased toward emotion, which in workplaces tends to be negative.
  11. What problems does data quality cause? • Data / ETL pipelines crash, resulting in unavailable, stale, or incorrect data • > 80% of Data Scientists' time is spent collecting data • Incorrect data is used for decisions or published • Doubts about data hurt morale and discourage evidence-based decision making
  12. What is Data Quality? Data quality is good when people who inspect data see what they expect. Data quality is bad when people are surprised by the data they see.
  15. Document data characteristics and train people to know them. If you only learn one thing today: in the absence of training and documentation, most people will be surprised by the data even when nothing is wrong.
  16. What do we want? Evidence-Based Decision Making! When do we want it? After Peer Review!
  17. Data Testing • Accuracy, Consistency, Completeness Tests • On records and relationships • Relationship Consistency Tests
  18. Test Objectives • Accuracy - is it true? • Consistent - does it obey the rules? • Complete - what is missing?
  19. Data Test Scopes • Within a record (SQL row, NoSQL document, etc.) • Within a set (SQL table, etc.) • Within an application (HRIS, ERP, etc.)* • Across the organisation* (* combinatorial)
  20. Monitoring
  21. Monitor Data as if it is Infrastructure

      |                | When                           | Where      | Who                                        |
      |----------------|--------------------------------|------------|--------------------------------------------|
      | Code           | Event driven (commit / PR)     | Test       | Developers fix errors                      |
      | Infrastructure | Constantly, at tight intervals | Production | Automated repair; failover to Ops          |
      | Data           | Constantly                     | Production | Automated repair; failover to data steward |
  22. (Diagram: the Value Pipeline runs Data → Production → Value; the Innovation Pipeline runs Idea → Development → Production. Continuous data monitoring, continuous application monitoring, periodic code testing.)
  23. Pipelines • Monitor each step in the pipeline • If steps are idempotent, kill and retry once any step whose measures are anomalous • Raise an incident if the retry is also anomalous • Insert data quality gates between steps in test design and in response to incidents
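The kill-and-retry policy above can be sketched as follows. `is_anomalous` is a placeholder stand-in for a real anomaly check, and the step is assumed to be idempotent so a second run is safe:

```python
import logging

def is_anomalous(measures: dict) -> bool:
    # Placeholder anomaly check: flag any step that emits zero records.
    # A real pipeline would compare measures against learned baselines.
    return measures["records_out"] == 0

def run_with_retry(step, raise_incident):
    """Run an idempotent step; retry once if its measures are anomalous,
    and raise an incident if the retry is anomalous too."""
    measures = step()
    if not is_anomalous(measures):
        return measures
    logging.warning("anomalous measures %s; retrying once", measures)
    measures = step()  # safe only because the step is idempotent
    if is_anomalous(measures):
        raise_incident(measures)
    return measures
```

Data quality gates between steps would slot in as additional checks inside `is_anomalous`, added at test design time and after incidents.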
  24. Pipeline Measures. For each step in a data pipeline: • Duration • Cost (BUFFER_GETS, PAGE_READS, CPU seconds) • Records in • Records out
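A sketch of capturing these per-step measures by wrapping each step function. Only duration and record counts are shown; cost counters such as BUFFER_GETS or PAGE_READS come from the database engine and are omitted here:

```python
import time

def measured(step_name, step_fn, records_in):
    """Run one pipeline step and return (output, measures)."""
    start = time.monotonic()
    records_out = step_fn(records_in)
    measures = {
        "step": step_name,
        "duration_s": time.monotonic() - start,
        "records_in": len(records_in),
        "records_out": len(records_out),
    }
    return records_out, measures

# Illustrative step: drop records with a non-positive value.
rows = [{"x": 1}, {"x": -2}, {"x": 3}]
rows, m = measured("drop_negatives", lambda rs: [r for r in rs if r["x"] > 0], rows)
print(m)
```

Feeding each step's measures to the anomaly check from the previous slide gives the monitoring loop the deck describes.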
  25. Quality Measures • Accuracy and completeness checks are number of errors and error % for every scope and time period • Consistency checks are errors and error % for each rule and time period
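One way to roll individual failures up into errors and error % per rule and time period, as described above. The rule names and daily bucketing are illustrative assumptions:

```python
from collections import Counter

def error_rates(failure_events, total_by_period):
    """failure_events: iterable of (period, rule) tuples, one per failure.
    total_by_period: records checked in each period.
    Returns {(period, rule): {"errors": n, "error_pct": p}}."""
    counts = Counter(failure_events)
    return {
        (period, rule): {
            "errors": n,
            "error_pct": 100.0 * n / total_by_period[period],
        }
        for (period, rule), n in counts.items()
    }

failures = [
    ("2024-01-01", "missing email"),
    ("2024-01-01", "missing email"),
    ("2024-01-01", "bad date"),
]
print(error_rates(failures, {"2024-01-01": 200}))
```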
  26. How to Test

      |              | Real World Accuracy                       | Cache Accuracy              | Complete           | Consistent                 |
      |--------------|-------------------------------------------|-----------------------------|--------------------|----------------------------|
      | Record       | Talk to people (call centre verification) | Compare to system of record | Permissible values | Rules within the record    |
      | Set          | n/a                                       | Compare to system of record | Reconciliation     | Rules within the set       |
      | Application  | n/a                                       | n/a                         | n/a                | Rules between types        |
      | Organisation | n/a                                       | n/a                         | n/a                | Rules between applications |
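The set-scope "Reconciliation" cell can be illustrated with a row-count and control-total comparison between a cached set and its system of record. The `amount` column name is an assumption for the sketch:

```python
def reconcile(cache_rows, source_rows, amount_key="amount"):
    """Compare a cached data set to the system of record by row count
    and a control total over one numeric column."""
    cache_total = sum(r[amount_key] for r in cache_rows)
    source_total = sum(r[amount_key] for r in source_rows)
    return {
        "row_count_match": len(cache_rows) == len(source_rows),
        "control_total_match": cache_total == source_total,
    }

source = [{"amount": 100}, {"amount": 250}]
cache = [{"amount": 100}]  # one row was dropped in transit
print(reconcile(cache, source))
```

A mismatch on either measure marks the set incomplete without having to diff it row by row.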
  27. When to Test

      |              | Real World Accuracy | Cache Accuracy | Complete   | Consistent |
      |--------------|---------------------|----------------|------------|------------|
      | Record       | Infrequent          | Regular        | Every read | Every read |
      | Set          | n/a                 | Regular        | Regular    | Regular    |
      | Application  | n/a                 | n/a            | Regular    | Regular    |
      | Organisation | n/a                 | n/a            | n/a        | Regular    |
  28. The journey of a thousand applications starts with a single test.
  29. Steven Ensslen • steven@claritycloudworks.com • +64 27 620 1237 • claritycloudworks.com