Data Engineer's Lunch #85: Designing a Modern Data Stack

Designing a Modern Data Stack
Will Angel - 23 Jan 2023 - Data Engineer’s Lunch

Overview
1. What is a Data Stack?
2. Data Solution Design Process
3. Data Stack Design Examples
4. Demo
5. Conclusion

What is a regular software stack?
A software “stack” is the set of software or
software components needed to run an
application.
Notable examples:
● LAMP
○ Linux
○ Apache
○ MySQL
○ PHP
● MERN
○ MongoDB
○ Express.js
○ React.js
○ Node.js

Are data stacks just regular software stacks?
Yes and no.
Data engineering is a specialty within software engineering,
and everything is software running on computers at the end
of the day, so yes, data stacks are software stacks.
But, there are notable differences that are worth addressing…
Especially because every data tool company wants to market
their tool as part of the “Modern Data Stack”

What is Modern about the “Modern” data stack?
Four major trends explain why the ‘modern data stack’ makes sense:
1. Modern cloud platforms.
2. Column-store data warehouses.
3. Cost of disk trending to zero.
4. Proliferation of managed data tools.

Defining Characteristics of the Modern Data Stack
1. Cloud & SQL Based: Column-store based Cloud Data Warehouse at the center
○ With optional file / object store based data lake.
2. Modular: Managed SaaS tools for almost every part of the data lifecycle.
○ Optional: run open source components and write your own integrations.

What is so special about cloud data warehouses?
Modern column-store data warehouses that run on a cloud
computing platform have some great benefits for building
data-intensive applications:
● Flexible & scalable pay-as-you-go compute:
○ No upfront hardware or major purchases required.
○ No outgrowing your data center at awkward times.
● Managed services
○ Running your own infrastructure reliably and effectively is hard, so
paying for a cloud computing company to do it for you is usually a great
deal.
○ Allows for data teams to move quickly without needing as much
specialized operational experience.

The cost of storage
Cost per GB has fallen
~100,000x since the mid 90s.
The cellphone in your pocket has
more storage and processing
power than a Cray-2
supercomputer from the mid 80s.
The Big Data Revolution is
mostly driven by this trend.
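
As a rough back-of-the-envelope check, the ~100,000x figure can be turned into an implied yearly rate. The sketch below assumes "mid 90s" means 1995 and measures to 2023, the year of this talk; both endpoints are assumptions rather than figures from the slide, and the result is only as good as the ~100,000x estimate.

```python
import math

# Assumptions (not from the slides): "mid 90s" is taken as 1995 and the end
# point as 2023. The ~100,000x figure is the one quoted on the slide.
FOLD_DECREASE = 100_000
YEARS = 2023 - 1995


def implied_annual_rates(fold_decrease: float, years: float) -> tuple[float, float]:
    """Return (annual cost decline as a fraction, years for the cost to halve)."""
    yearly_factor = fold_decrease ** (1 / years)  # cost shrinks by this factor each year
    annual_decline = 1 - 1 / yearly_factor
    halving_time = years * math.log(2) / math.log(fold_decrease)
    return annual_decline, halving_time


decline, halving = implied_annual_rates(FOLD_DECREASE, YEARS)
print(f"Implied decline: {decline:.0%} per year; cost halves every {halving:.1f} years")
```

Under those assumptions the cost per GB drops by roughly a third every year, which is why keeping raw data around "just in case" became a reasonable default.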

Data Solution design process
1. Determine desired capabilities & design constraints
2. Create iteration plan
3. Execute plan.
4. Evaluate delivered data solution.
5. Return to 1.
Same idea as the OODA (Observe, Orient, Decide, Act) / PDCA (Plan, Do,
Check, Act) frameworks. An iteration cycle can last anywhere from
minutes to years (I recommend shorter and smaller).

Step 1. Problem Definition
The first step in developing a solution is to identify the problem.
This step can include:
● Requirements gathering
● Software vision documentation
● User research & interviews
● Industry research
● Documentation
● More documentation…

Step 2. Create an iteration plan
Create a plan to deliver a working system that has the capabilities to solve all of
the necessary problems.
This can include:
● System design diagrams & documents
● Jira tickets and work breakdown structure
● Doodles on a napkin

Step 3. Execute the plan
Once you have a plan that looks good enough, build the thing!
This should include:
● Software development
● Software development to improve the software development process
● Procurement - buying off the shelf tools.
● Testing - systems integration & technical tests.
● Testing - user / client demos.

Step 4. Evaluate
After developing a functional data solution, it is important to evaluate whether you
did an acceptable job.
This includes:
● Requirements review - does the data solution meet the requirements?
● Capability value - do the data solution’s new capabilities actually provide value?
● Identify future improvement opportunities
● Identify future development process improvement opportunities

Step 5. Repeat the cycle
Data platform development is an iterative process, and much of the value depends
on the end users: unused data is worthless, so a system that nobody uses
was usually not worth building.
Iteration is a great way to discover unknown requirements and opportunities, and
work with the end users of data to build good data systems that help cultivate a
vibrant ecosystem.

Design Example 1:
Generic BI Data Stack

The Modern Data Stack for Business Intelligence
Core Components:
1. Storage - Cloud Data Warehouse (Snowflake, Redshift, BigQuery)
2. Ingestion - Managed ETL (Stitch, Fivetran)
3. Transformation - dbt / SQL
4. Visualization - BI tool of choice
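
The four components above can be sketched end to end in a few lines. This is a minimal illustration, not a production setup: an in-memory SQLite database stands in for the cloud data warehouse, the `raw_orders` rows are made-up sample data, and the `daily_revenue` view is a hypothetical model of the kind dbt would manage.

```python
import sqlite3

# Stand-in for the cloud data warehouse; a real stack would use Snowflake,
# Redshift, or BigQuery instead of an in-memory SQLite database.
warehouse = sqlite3.connect(":memory:")

# 1-2. Storage and ingestion: land raw records untouched (made-up sample data).
raw_orders = [
    ("2023-01-20", "widget", 3, 4.50),
    ("2023-01-20", "gadget", 1, 20.00),
    ("2023-01-21", "widget", 2, 4.50),
]
warehouse.execute(
    "CREATE TABLE raw_orders "
    "(order_date TEXT, product TEXT, quantity INTEGER, unit_price REAL)"
)
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_orders)

# 3. Transformation: plain SQL executed inside the warehouse, the kind of
# model a tool like dbt would version and schedule.
warehouse.execute(
    """
    CREATE VIEW daily_revenue AS
    SELECT order_date, SUM(quantity * unit_price) AS revenue
    FROM raw_orders
    GROUP BY order_date
    """
)

# 4. Visualization: a BI tool would query the modeled view, not the raw table.
daily_revenue = warehouse.execute(
    "SELECT order_date, revenue FROM daily_revenue ORDER BY order_date"
).fetchall()
print(daily_revenue)
```

The point of the layout is that raw data is loaded first and transformed afterwards in SQL, so models can be rebuilt without re-ingesting anything.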

Auxiliary Components
You’ll also want:
● Data Observability - tools like Monte Carlo & BigEye
● Data Cataloging - tools like Castor or Alation
● Systems Observability - ELK / Prometheus & Grafana
A modern data platform is a large distributed system with
numerous third party vendors and constantly changing API
integrations. Treat it with respect or it will break on you.
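
To make the data observability idea concrete, here is a minimal sketch of the two most basic checks, freshness and volume. The function name, thresholds, and sample values are hypothetical; dedicated tools do far more than this, and nothing here reflects how any particular vendor implements it.

```python
from datetime import datetime, timedelta, timezone


def check_table_health(last_loaded_at, row_count, now, max_staleness, min_rows):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    age = now - last_loaded_at
    if age > max_staleness:
        problems.append(f"stale: last load was {age} ago")
    if row_count < min_rows:
        problems.append(f"low volume: {row_count} rows, expected at least {min_rows}")
    return problems


# Hypothetical table metadata, as it might be read from a load log.
now = datetime(2023, 1, 23, 12, 0, tzinfo=timezone.utc)
problems = check_table_health(
    last_loaded_at=datetime(2023, 1, 20, 6, 0, tzinfo=timezone.utc),
    row_count=120,
    now=now,
    max_staleness=timedelta(days=1),
    min_rows=1000,
)
for problem in problems:
    print(problem)
```

Even a check this small catches the common failure where an upstream API change silently stops a pipeline from loading.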

Design Example 2:
Personal Data Warehouse

High Level Design - Personal Data Warehouse
Primary Design constraints:
1. Low cost.
2. Low maintenance
3. Data Variety: lots of unstructured data.
Notable freeing design characteristics:
1. Low velocity - weekly update maximum for most bulk sources
2. Low volume - ~1-5 GB per source per update for full refresh
3. Low user count - single user (me)
Components:
1. Raw Storage in Google Cloud Storage
2. Data Transformation Pipelines in Dataflow (managed Apache Beam)
3. BigQuery Data Warehouse for relational data
4. Looker Studio (formerly Google Data Studio) for BI.
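
A quick estimate shows how much room the low velocity and low volume leave. The sketch below is illustrative only: the source names and per-source sizes are hypothetical, chosen to sit inside the ~1-5 GB and weekly-at-most ranges stated above, and no pricing is assumed.

```python
from dataclasses import dataclass


@dataclass
class BulkSource:
    name: str                  # hypothetical source name
    gb_per_refresh: float      # within the ~1-5 GB per full refresh range above
    refreshes_per_week: float  # weekly is the stated maximum cadence


def monthly_gb_processed(sources, weeks_per_month=52 / 12):
    """Total GB moved per month if every update is a full refresh."""
    return sum(
        s.gb_per_refresh * s.refreshes_per_week * weeks_per_month for s in sources
    )


sources = [
    BulkSource("photos_export", 5.0, 1.0),
    BulkSource("bank_statements", 1.0, 1.0),
    BulkSource("fitness_tracker", 2.0, 0.5),
]
total = monthly_gb_processed(sources)
print(f"~{total:.0f} GB processed per month across {len(sources)} sources")
```

At tens of GB per month, full refreshes stay practical, which avoids the extra complexity of incremental loads.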

Detailed Design - Personal Data Warehouse

Caveats:
1. Modern Data Stack – like many other terms – is mostly a marketing term / fad.
2. The major components of modern data stacks have sharp edges
a. Costs can quickly spiral out of control if data access is overly democratic.
b. Powerful configuration options - updates to data pipelines are easier to make, not necessarily
more correct.
3. There are still huge opportunities for tooling improvements.
a. The last ~10 years have seen a huge unbundling of data tools and new ‘best of breed’ SaaS providers.
i. Integrating all these components into a cohesive platform is a lot of work, so we will see
bundled all-in-one data platforms become increasingly competitive.
b. Metadata / data cataloging tools need improvement to support better data management.

The best data stack is the one that works best for you.
● Data Stack Design is system design
○ The best systems are those that provide the desired capabilities.
■ Actually think about what the design goals of your data stack are.
● Data Stack Development is iterative
○ Sometimes everyone will be happiest with a simple solution like a cron job querying the
production database (preferably a replica).
■ This can work well for years.
■ This can also turn into a hot mess operationally and require urgent replacement with a
better solution
○ Finding an optimal balance between planning and learning is hard.
■ Finding a close enough to optimal balance is feasible.
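
For a sense of how small the "cron job querying a replica" solution can be, here is a minimal sketch. An in-memory SQLite database stands in for the read replica, and the `orders` table, the query, and the `export_daily_snapshot` function are all hypothetical. A crontab entry such as `0 6 * * * python3 export_snapshot.py` (script name assumed) would run it daily.

```python
import csv
import io
import sqlite3


def export_daily_snapshot(replica, out_file):
    """Run one reporting query against the replica and write the rows as CSV."""
    cursor = replica.execute(
        "SELECT status, COUNT(*) AS orders FROM orders GROUP BY status ORDER BY status"
    )
    writer = csv.writer(out_file)
    writer.writerow([column[0] for column in cursor.description])  # header row
    rows = cursor.fetchall()
    writer.writerows(rows)
    return len(rows)


# Stand-in for a read replica of the production database (hypothetical schema).
replica = sqlite3.connect(":memory:")
replica.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
replica.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("shipped",), ("shipped",), ("pending",)],
)

# A real job would write to a file or object store; a buffer keeps this runnable.
buffer = io.StringIO()
rows_written = export_daily_snapshot(replica, buffer)
print(buffer.getvalue())
```

Pointing the job at a replica keeps reporting queries from competing with production traffic, which is the main operational risk of this approach.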

Thank you!
Have any data problems? I’m looking for new Data
Engineering / Technical Product Manager Roles.
Email: Will@williamangel.net
Website: www.williamangel.net | www.d8aeng.com
Twitter: @DataDrivenAngel
Linkedin: https://www.linkedin.com/in/william-angel/
