Data Engineer's Lunch #85: Designing a Modern Data Stack

What are the design considerations that go into architecting a modern data warehouse? This presentation will cover some of the requirements analysis, design decisions, and execution challenges of building a modern data lake/data warehouse.

  1. Designing a Modern Data Stack. Will Angel - 23 Jan 2023 - Data Engineer’s Lunch
  2. Overview
     1. What is a Data Stack?
     2. Data Solution Design Process
     3. Data Stack Design Examples
     4. Demo
     5. Conclusion
  3. What is a data stack?
  4. What is a regular software stack? A software “stack” is the set of software or software components needed to run an application. Notable examples:
     ● LAMP
       ○ Linux
       ○ Apache
       ○ MySQL
       ○ PHP
     ● MERN
       ○ MongoDB
       ○ Express.js
       ○ React.js
       ○ Node.js
  5. Are data stacks just regular software stacks? Yes and no. Data engineering is a specialty within software engineering, and at the end of the day everything is software running on computers, so yes, data stacks are software stacks. But there are notable differences worth addressing, especially because every data tool company wants to market its tool as part of the “Modern Data Stack”.
  6. What is Modern about the “Modern” data stack? Four major trends make the ‘modern data stack’ make sense:
     1. Modern cloud platforms.
     2. Column-store data warehouses.
     3. Cost of disk trending to zero.
     4. Proliferation of managed data tools.
  7. Defining Characteristics of the Modern Data Stack
     1. Cloud & SQL based: a column-store cloud data warehouse at the center.
       ○ With an optional file / object store based data lake.
     2. Modular: managed SaaS tools for almost every part of the data lifecycle.
       ○ Optional: run open source components and write your own integrations.
  8. What is so special about cloud data warehouses? Modern column-store data warehouses running on a cloud computing platform have some great benefits for building data-intensive applications:
     ● Flexible & scalable pay-as-you-go compute:
       ○ No upfront hardware or major purchases required.
       ○ No outgrowing your data center at awkward times.
     ● Managed services:
       ○ Running your own infrastructure reliably and effectively is hard, so paying a cloud computing company to do it for you is usually a great deal.
       ○ Allows data teams to move quickly without needing as much specialized operational experience.
  9. The cost of storage. Cost per GB has fallen ~100,000x since the mid-90s. The cellphone in your pocket has more storage and processing power than a Cray-2 supercomputer from the mid-80s. The Big Data Revolution is mostly driven by this trend.
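To make the ~100,000x figure more tangible, the arithmetic below converts it into an implied yearly rate. This is a rough sketch, not something from the slides: it assumes "mid-90s" means 1995, measures up to the 2023 talk date, and treats the decline as a constant compound rate, which real storage prices do not follow exactly.

```python
import math


def implied_annual_factor(total_factor: float, years: float) -> float:
    """Yearly improvement factor that compounds to total_factor over the given years."""
    return total_factor ** (1.0 / years)


def halving_time_years(total_factor: float, years: float) -> float:
    """Years needed for the cost to halve, under a constant compound decline."""
    return years * math.log(2) / math.log(total_factor)


TOTAL_FACTOR = 100_000  # figure quoted on the slide
YEARS = 2023 - 1995     # assumption: "mid-90s" read as 1995, talk given in 2023

factor = implied_annual_factor(TOTAL_FACTOR, YEARS)
halving = halving_time_years(TOTAL_FACTOR, YEARS)

print(f"Implied improvement per year: {factor:.2f}x")
print(f"Implied cost halving time: {halving:.1f} years")
```

Under these assumptions the cost per GB improves by roughly 1.5x per year, which corresponds to halving roughly every 1.7 years.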
  10. Data Solution Design Process
  11. Data Solution design process
      1. Determine desired capabilities & design constraints.
      2. Create iteration plan.
      3. Execute plan.
      4. Evaluate delivered data solution.
      5. Return to 1.
      This is the same loop as the OODA (Observe, Orient, Decide, Act) and PDCA (Plan, Do, Check, Act) frameworks. Iteration cycle scale and length can be anywhere from minutes to years (I recommend shorter and smaller).
  12. Step 1. Problem Definition. The first step in developing a solution is to identify the problem. This step can include:
      ● Requirements gathering
      ● Software vision documentation
      ● User research & interviews
      ● Industry research
      ● Documentation
      ● More documentation…
  13. Step 2. Create an iteration plan. Create a plan to deliver a working system that has the capabilities to solve all of the necessary problems. This can include:
      ● System design diagrams & documents
      ● Jira tickets and work breakdown structure
      ● Doodles on a napkin
  14. Step 3. Execute the plan. Once you have a plan that looks good enough, build the thing! This should include:
      ● Software development
      ● Software development to improve the software development process
      ● Procurement - buying off-the-shelf tools.
      ● Testing - systems integration & technical tests.
      ● Testing - user / client demos.
  15. Step 4. Evaluate. After developing a functional data solution, it is important to evaluate whether you did an acceptable job. This includes:
      ● Requirements review - does the data solution meet the requirements?
      ● Capability value - do the data solution’s new capabilities actually provide value?
      ● Identify future improvement opportunities
      ● Identify future development process improvement opportunities
  16. Step 5. Repeat the cycle. Data platform development is an iterative process, and much of the value depends on the end users: unused data is worthless, so a system that goes unused was usually not worth building. Iteration is a great way to discover unknown requirements and opportunities, and to work with the end users of data to build good data systems that help cultivate a vibrant ecosystem.
  17. Design Example 1: Generic BI Data Stack
  18. The Modern Data Stack for Business Intelligence. Core Components:
      1. Storage - Cloud Data Warehouse (Snowflake, Redshift, BigQuery)
      2. Ingestion - Managed ETL (Stitch, Fivetran)
      3. Transformation - dbt / SQL
      4. Visualization - BI tool of choice
  19. Auxiliary Components. You’ll also want:
      ● Data Observability - tools like Monte Carlo & BigEye
      ● Data Cataloging - tools like Castor or Alation
      ● Systems Observability - ELK / Prometheus & Grafana
      A modern data platform is a large distributed system with numerous third-party vendors and constantly changing API integrations. Treat it with respect or it will break on you.
  20. Design Example 2: Personal Data Warehouse
  21. High Level Design - Personal Data Warehouse
      Primary design constraints:
      1. Low cost.
      2. Low maintenance.
      3. Data variety: lots of unstructured data.
      Notable freeing design characteristics:
      1. Low velocity - weekly update maximum for most bulk sources.
      2. Low volume - ~1-5 GB per source per update for a full refresh.
      3. Low user count - single user (me).
      Components:
      1. Raw storage in Google Cloud Storage.
      2. Data transformation pipelines in Dataflow (managed Apache Beam).
      3. BigQuery data warehouse for relational data.
      4. Looker Studio (formerly Google Data Studio) for BI.
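The flow on slide 21 (raw files, a transformation step, then a relational table that a BI tool reads) can be sketched locally with only the Python standard library. This is an illustrative stand-in, not the author's pipeline: `sqlite3` plays the role of the warehouse, an in-memory list plays the role of raw files in object storage, and the record fields, the `events` table, and the function names are all hypothetical.

```python
import json
import sqlite3

# Hypothetical raw export: newline-delimited JSON, as it might sit in an object store.
RAW_LINES = [
    '{"source": "bank", "date": "2023-01-02", "amount": "12.50", "note": " coffee "}',
    '{"source": "bank", "date": "2023-01-03", "amount": "n/a", "note": "refund"}',
    '{"source": "fitness", "date": "2023-01-03", "amount": "5", "note": null}',
]


def parse_record(line):
    """Transform step: turn one raw line into a typed row, or None if unusable."""
    rec = json.loads(line)
    try:
        amount = float(rec["amount"])
    except (TypeError, ValueError):
        return None  # skip rows whose amount cannot be parsed
    note = (rec.get("note") or "").strip()
    return (rec["source"], rec["date"], amount, note)


def full_refresh(conn, lines):
    """Load step: drop and rebuild the table, which stays cheap at low volume."""
    rows = [row for row in map(parse_record, lines) if row is not None]
    conn.execute("DROP TABLE IF EXISTS events")
    conn.execute(
        "CREATE TABLE events (source TEXT, date TEXT, amount REAL, note TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # local stand-in for the warehouse
loaded = full_refresh(conn, RAW_LINES)
totals = conn.execute(
    "SELECT source, SUM(amount) FROM events GROUP BY source ORDER BY source"
).fetchall()
print(loaded, totals)
```

A full refresh is the simplest load strategy and fits the stated low velocity and low volume. It is also idempotent: rerunning it on the same input produces the same table.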
  22. Detailed Design - Personal Data Warehouse
  23. Conclusion
  24. Caveats:
      1. Modern Data Stack, like many other terms, is mostly a marketing term / fad.
      2. The major components of modern data stacks have sharp edges.
         a. Costs can quickly spiral out of control if data access is overly democratic.
         b. Powerful configuration options - updates to data pipelines are easier to make, not necessarily more correct.
      3. There are still huge opportunities for tooling improvements.
         a. The last ~10 years have seen a huge unbundling of data tools and new ‘best-of-breed’ SaaS providers.
            i. Integrating all these components into a cohesive platform is a lot of work, so we will see bundled all-in-one data platforms become increasingly competitive.
         b. Metadata / data cataloging tools need improvement to support better data management.
  25. The best data stack is the one that works best for you.
      ● Data Stack Design is system design
        ○ The best systems are those that provide the desired capabilities.
          ■ Actually think about what the design goals of your data stack are.
      ● Data Stack Development is iterative
        ○ Sometimes everyone will be happiest with a simple solution like a cron job querying the production database (preferably a replica).
          ■ This can work well for years.
          ■ This can also turn into a hot mess operationally and require urgent replacement with a better solution.
        ○ Finding an optimal balance between planning and learning is hard.
          ■ Finding a close-enough-to-optimal balance is feasible.
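As a rough illustration of the "cron job querying a replica" option mentioned on slide 25, the sketch below runs one read-only report query and writes the result as CSV. It is a minimal example under stated assumptions: an in-memory `sqlite3` database stands in for the read replica, and the `orders` table, its columns, and the script path in the crontab comment are hypothetical.

```python
import csv
import io
import sqlite3

# A crontab entry for a script like this might look as follows (hypothetical path):
#   0 6 * * * python3 /path/to/daily_report.py

REPORT_SQL = """
    SELECT status, COUNT(*) AS orders, SUM(total) AS revenue
    FROM orders
    GROUP BY status
    ORDER BY status
"""


def run_report(conn, out):
    """Run the read-only report query and write it to `out` as CSV. Returns the row count."""
    cursor = conn.execute(REPORT_SQL)
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])  # header from column names
    rows = cursor.fetchall()
    writer.writerows(rows)
    return len(rows)


def demo_replica():
    """Build a tiny in-memory database standing in for the read replica."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "paid", 20.0), (2, "paid", 5.0), (3, "refunded", 7.5)],
    )
    return conn


buffer = io.StringIO()
row_count = run_report(demo_replica(), buffer)
report_csv = buffer.getvalue()
print(report_csv)
```

In a real deployment the connection would point at the replica rather than the primary, so that a slow report query cannot compete with production traffic.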
  26. Thank you! Have any data problems? I’m looking for new Data Engineering / Technical Product Manager roles.
      Email: Will@williamangel.net
      Website: www.williamangel.net | www.d8aeng.com
      Twitter: @DataDrivenAngel
      LinkedIn: https://www.linkedin.com/in/william-angel/
