The document discusses 5 data trends that data engineers should know:
1. Rise of data pipelines to repeatedly move data around using code.
2. Use of compute engines to query cloud data without moving it by separating data and compute.
3. Data modeling to define metrics once for the entire organization.
4. Building of internal and external data products to extract insights from large amounts of data.
5. Ensuring data quality by developing tests and monitoring data flows to maintain data integrity.
6. Rise of Data Engineering as Craft
Why Has Data Become So Ubiquitous?
7. When a Single Monolithic Pipeline Worked, It Looked Like This
[Diagram labels: Data Produced (Logs, TX, Actions); Oracle, SAP; Aggregated into EDW; Output (Cognos, Tableau)]
8. But Everyone Wanted One
Exec Team
Marketing, Product, Sales
9. And They Each Need Data from the Others
Exec Team
Marketing, Product, Sales
10. This is a Data Mesh:
A Network of Data Producers & Consumers
16. What is a Data Engineer?
Data Engineers: the people who move, shape, and transform data from where it is generated to where it is needed, and do it
1. Consistently
2. Efficiently
3. Scalably
4. Accurately
5. Compliantly
17. aka Software Engineers Deep in Data
20. What is the Data Engineering Equivalent?
21. The Data Engineering Lifecycle
22. Each Step of the DELC Needs New Tools
23. Data Pipelines: Watermains of Data
Code in a modern language to repeatably move data around
Innovators: Airflow, Elementl, Prefect
25. Compute Engines: Access Cloud Data
Query data in the cloud, without moving it. Key insight: separation of data and compute.
Innovators: Dremio, Databricks
27. Data Modeling: Universal Metrics Library
Define metrics once for the entire organization
Innovators: Transform Data, Looker
29. Data Products: Stand on the Shoulders of Gigabytes
Build and deploy data products internally and externally
Innovators
BI: Preset
ML: Streamlit, Tecton
32. Data Quality: Harness & Tame Error
Develop tests and monitor data flows to ensure data integrity
Innovators: Monte Carlo, Great Expectations, Soda Data, Data Gravity
35. 5 Data Trends You Should Know
1. Data Pipelines – move data with code
2. Compute Engines – query cloud data
3. Modeling – define metrics once
4. Data Products – squeeze insight from data
5. Data Quality – keep data accurate
37. Five Data Trends You Should Know
Tomasz Tunguz, Managing Director, Redpoint Ventures
@ttunguz & tomtunguz.com
Editor’s notes
Thank you for the warm introduction, Jason.
I’m thrilled to be here.
My name is Tomasz Tunguz. I’m a managing director at Redpoint Ventures and I write a blog at tomtunguz.com.
It’s a data-infused collection of posts about startups.
Let me tell you about Redpoint.
Redpoint is a venture firm based in Silicon Valley.
We invest anywhere from 1m to 50m in companies, primarily in the US.
We’re a group of founders and operators who have founded startups, operated at hypergrowth companies, and helped startups scale to terrific heights.
We work or have worked with 26 Unicorns and some iconic companies with more than 25b in market cap.
Including Stripe, Hashicorp, Twilio, Duo Security and Zendesk.
We have deep domain experience in data. We were early investors in Looker, Snowflake and Dremio.
We evaluate about 7000 investment opportunities annually, and this presentation is meant to distill some of the trends we see in the market.
I’m passionate about data.
I was first exposed to the power of data studying machine learning at college. I studied control systems for satellites and saw how that technology could be used in the stock market.
Then went to Google. Google’s business is entirely predicated on data.
I saw first hand the impact and the leverage we could drive from great data if properly managed through the right systems and tools.
I have been deep in data ever since.
I co-authored a book on data called Winning with Data, which researched the challenges modern organizations face with data and how the best companies in the world mitigate those challenges and transform data into competitive advantage.
Like you all, I love data and the power & insight it can give businesses.
Today, I’ll share with you 5 trends we’re seeing in the data world. But you should know there is one megatrend, a huge wave, furthering these trends.
That trend is the rise of data engineering as a new craft. The term data engineer is new, and the idea is important.
Data engineers will define the next decade.
Ten years ago, the people working with data, moving it, shaping it, slicing it, came from many different backgrounds.
Some came from finance; others have statistics backgrounds; still others came from customer support (like me) and they all found themselves in data roles.
This convergence across disciplines occurred because data has become a critical part of every modern company’s technology stack.
Data has become essential. So companies invest in specialized people, processes and systems to maximize the benefit they can squeeze from data.
Data engineering has come about because data is everywhere. And every bit of a business’ data is valuable.
The reason data has become so ubiquitous is it costs much less to store than it did 20 years ago.
20 years ago, we stored data in Oracle databases that were expensive and required new licenses as data scaled. So we filtered it aggressively.
Today, we store exabytes of data in files on S3. Because we can afford it. For the price of two oat-milk macchiatos at Blue Bottle, I can store half a terabyte of data on Amazon for a month. So, we store data because we can afford it. And we store buckets, reams, mountains of it.
Since we have all that data at hand, we decided to use it. 20 years ago, IT bought the systems to extract value from data. They procured them, installed them, and managed these systems. But ten years ago, forward-thinking teams decided to do it for themselves. IT was too slow. A modern marketing team can’t wait 3 to 6 months to get the answers to their questions. They’ll be toast. So the marketing team bought their own system.
And, then the marketing team created data products. At first, these data products were dashboards. How many new clicks? How many leads? How many customers? How much ad spend? Then marketing operations teams became more sophisticated. They started to run scenarios to test different ideas, and experiment with new techniques. Today, marketing is a panoply of machine learning algorithms stuffed to the gills with first-party data, a quantitative hedge fund for buying online ads. All in 15 years.
Those predictive systems create data of their own, which is stored and processed. This is more than a process; it’s a flywheel that goes faster and faster and faster. A massive digital boulder of ones and zeros coming down the hill at top speed.
The problem is that this boulder isn’t just in marketing. It’s everywhere within a company.
Let me explain. 20 years ago this is how the data world worked at the highest level.
Systems produced data: system logs, transactions, customer actions on websites.
The data was filtered into an enterprise data warehouse and data cube because of cost.
And then pumped into a legacy output system like a Tableau or Cognos.
This worked for small data volumes. But it’s expensive, inflexible, closed and slow. Pop quiz: how long does it take to update a report in Cognos? Too long. Your business is dead.
But this was state of the art.
And everyone wanted one.
Each team manager saw success with data. The authority, the command of the business, the ideas that flowed from the data. It’s intoxicating when you can use data to see around corners, inspire confidence, and lead teams boldly into the future.
So each team developed their own data systems. IT couldn’t keep up. And consumerization of IT was born. For every $1 IT spent on technology, department heads spent an additional 47 cents to outfit their teams with the best kit.
At the outset, departments built small systems. But then each hired operations teams, doublespeak for data analysts and data engineers to help them understand the data, predict the future, and build data products on top.
A thousand digital flowers bloomed. And they grew and grew and grew.
And that garden quickly became overrun with complexity. Leaves and thorns everywhere.
The marketing team decided they needed data from other places, not just the central data store administered by IT. The marketing team needs access to the CRM database to understand customer value. Oh, and customer support data to understand customer lifecycles. Plus, billing data from the finance team. And a bit of product data: those web analytics inform customer conversion.
It wasn’t just marketing that was sapping data from other teams. Each department needed data from the other to operate their businesses best. Which created a completely new concept.
This idea has a name. It is called the data mesh. It is a network of data producers and consumers within an organization.
Each team is responsible for producing its own data, publishing data via some API or common format. It’s responsible for documenting the data, explaining the lineage, keeping it up to date, so other teams can use it and rely on it to decide.
In exchange, other teams do the same. This creates a mesh, and enables the organization to send the data, use APIs, and develop increasingly sophisticated data products at scale.
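To make the producer side of the mesh concrete, here is a minimal sketch of a shared catalog where each team publishes a documented dataset with its lineage. Everything in it is an assumption made for illustration: the `DataProduct` and `MeshCatalog` names, the fields, and the example datasets are hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    """One published dataset in the mesh (all field names are illustrative)."""
    name: str
    owner_team: str
    description: str      # documentation for consuming teams
    upstream: tuple = ()  # lineage: names of the products this one is built from
    columns: tuple = ()   # published schema


class MeshCatalog:
    """A shared registry where every team publishes what it produces."""

    def __init__(self):
        self._products = {}

    def publish(self, product):
        # Refuse undocumented data: other teams must be able to rely on it.
        if not product.description:
            raise ValueError(f"{product.name} must be documented before publishing")
        self._products[product.name] = product

    def find(self, name):
        return self._products[name]

    def lineage(self, name):
        """Return every upstream product, nearest first, without duplicates."""
        seen = []
        queue = list(self._products[name].upstream)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            queue.extend(self._products[current].upstream)
        return seen


catalog = MeshCatalog()
catalog.publish(DataProduct("billing.invoices", "finance", "One row per invoice",
                            columns=("customer_id", "amount")))
catalog.publish(DataProduct("crm.accounts", "sales", "One row per customer account",
                            columns=("customer_id", "segment")))
catalog.publish(DataProduct("marketing.customer_value", "marketing",
                            "Lifetime value per customer",
                            upstream=("crm.accounts", "billing.invoices"),
                            columns=("customer_id", "ltv")))

print(catalog.lineage("marketing.customer_value"))
```

The marketing team consumes what sales and finance publish, and in turn publishes its own product with the lineage recorded, so the next consumer can see where the numbers came from.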
And then, importantly, modern companies move this all to a cloud data lake. In the cloud, data is elastic, cheap, maintained by someone else, and accessible by everyone (with the right IAM permissions of course).
More importantly, teams stored data in these cloud data lakes in standard, open-source formats like Parquet and Arrow. These formats accelerate queries and create a single standard, which makes it easier to work with the tools you have today and tools that have not been invented yet.
That’s the vision. That’s where the industry is going. But we are all in different states of getting there. And the reality is more complicated than these beautiful diagrams.
In fact, today many companies don’t have a data mesh; they have a data mess.
Each team has their own tools, data storage depots and infrastructure. It’s a big bucket of Legos. Systems that don’t talk to each other. Confusion about three different definitions of revenue. Where is the customer support data table? Oh, that’s the old version. And that column that reads date_final_final is actually the wrong format. We moved it to a new column called dff..f. And to access that table you need to speak COBOL. But we lost the COBOL/NodeJS connector.
Data messes have 4 consistent problems:
Data breadlines: I have a question about the business. Let me go and ask the engineer I met at lunch if she’ll do me the favor of pulling the data, again. Data breadlines are the invisible people waiting around for answers to their data questions, who ask a question and go to the back of the line when they need a refinement.
Data obscurity or rogue databases: when I was at Google, I operated a rogue database. I asked an engineer to run a MapReduce job to help me understand the competition and dump that table on a server underneath my desk. Then I built reports on that table and we used it to prioritize customer acquisition techniques. No one knew it was there. No one validated the data.
Data fragmentation: the challenge of finding out where data is. You see the dashboard in front of you. You know the data is stored somewhere in the company, but where? Who owns it?
Data brawls: the fights between teams about the definition of payback period.
The vision, as it has always been with data systems, is to put it all together and develop a breathtaking machine that enables a company to grow significantly faster. I can tell you from working with some of the leading companies in the data world, when you do achieve this vision, it’s a transformation. It enables teams to move faster, execute better, and outperform the competition.
I saw it at Google. I saw it at Looker and many of Looker’s customers. And we’re seeing it at Dremio too.
Companies that can migrate to data meshes suddenly unlock hidden productivity. It’s a big leap and a challenge.
But getting there and building that machine is not easy. So the question is, when you put up the bat signal, who will come to save the day?
There’s a simple answer. It’s the data engineer. This is why this role has evolved. Because the complexity has gotten to a point that we need specialized people to manage this infrastructure and empower everyone within a company to use data effectively.
We believe that data engineering is the customer success of this decade: a new role, critically important to a company, that will champion a discipline of the future. Although I can’t see you, I’m confident many of you in the audience are exactly that superhero, maybe minus the Batmobile.
What is a data engineer? They are the people who move, shape, and transform data from where it is generated to where it is needed, and do it consistently, efficiently, scalably, accurately and compliantly.
Data engineers have many different skills. Some of them are infrastructure specialists. Others have focused on reporting and the tools associated with analytics. Still others develop, host, and maintain machine learning infrastructure. It is a broad discipline of very smart people who are going to be key to business success in the next 10 years.
In other words, these people are software engineers who are deep in data.
In researching this market, we had an insight. Software engineers have decades of experience writing software, building tooling and patterns of writing code.
The most recent example of this is the Cloud Native Computing Foundation’s software development lifecycle.
This is an ouroboros, a snake eating its tail, an infinite cycle.
It is a consistent process for how to manage software releases in the most modern way. Vendors within that ecosystem use this diagram within their pitches to customers and describe exactly which part of the processes they address.
Managers use this process to talk about tooling at different steps of the engineering process. It has 8 steps.
Plan the software you want to build
Code it
Build it and package it to ship
Test it with a testing harness
Release the software by pushing into the production environment
Deploy the software across your cloud
Operate it
Monitor it
And repeat
If data engineering really is software engineers deep in data, what is the data engineering equivalent of the software development lifecycle? I haven’t been able to find one. But, in talking to hundreds of potential buyers of this kind of software, we have a hypothesis of what it should look like.
This is what we observe in the market for the data engineering lifecycle. It has six steps:
Ingesting: taking in data from whatever data producer is spewing data into a storage system like Amazon S3.
Planning: this is the phase of deciding what it is that you want to do with this data.
Query: modern compute engines run over the data to filter and aggregate it in a way that’s useful to a particular product.
Data Modeling: the work of defining a metric once in a central place so that everyone within the company can benefit from it.
Developing Product: the work of actually building a product around the data and the insights contained within that data.
Monitoring: the act and process of ensuring data is flowing normally and is accurate at all times.
This cycle creates more data, which is then ingested, saved, and pumped back into the rest of the cycle.
In each of the steps of the data engineering lifecycle, new tools are emerging to support the work of the data engineer. These are the five major trends within the data world.
First, data pipelines. These are the watermains of data, moving data from where it’s produced to where it can be leveraged.
Data pipelines have been around forever. The main advances in these data pipelines are:
Using modern computing languages
Creating higher levels of abstractions to enable engineers to reuse code across different data pipelines to improve productivity
Monitoring within these data pipelines
Visualization of the DAG (directed acyclic graph), which shows all the steps involved
Here are screenshots of Prefect’s product, which ingests code and then creates a DAG visualization. You can see the different steps in the data process.
And on the right, there is a monitoring dashboard that shows the state of the data pipeline, the errors, and the activity.
The idea is to treat data pipelines as real code with true monitoring to ensure data is always accurate.
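As a rough illustration of a pipeline written as code, the sketch below declares three steps as a DAG, runs them in dependency order, and records a status per step. It uses only Python’s standard-library `graphlib`; the task names, the sample rows, and the `run_pipeline` helper are assumptions made for the example, and it is not how Airflow, Elementl, or Prefect are implemented.

```python
from graphlib import TopologicalSorter


def extract():
    # Stand-in for reading from a source system.
    return [
        {"customer": "a", "amount": 10},
        {"customer": "b", "amount": -1},
        {"customer": "a", "amount": 5},
    ]


def clean(rows):
    # Drop rows that fail a basic sanity check.
    return [r for r in rows if r["amount"] >= 0]


def aggregate(rows):
    totals = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount"]
    return totals


# The DAG: each task maps to the set of tasks it depends on.
DAG = {"extract": set(), "clean": {"extract"}, "aggregate": {"clean"}}
TASKS = {"extract": extract, "clean": clean, "aggregate": aggregate}


def run_pipeline(dag, tasks):
    """Run tasks in dependency order and record a status for each one."""
    results, status = {}, {}
    for name in TopologicalSorter(dag).static_order():
        upstream = [results[dep] for dep in sorted(dag[name])]
        try:
            results[name] = tasks[name](*upstream)
            status[name] = "success"
        except Exception as exc:
            status[name] = f"failed: {exc}"
            break  # do not run steps downstream of a failure
    return results, status


results, status = run_pipeline(DAG, TASKS)
print(results["aggregate"])
print(status)
```

The `status` dictionary is the seed of a monitoring dashboard: because the steps and their dependencies are plain data, they can be visualized, retried, and reused across pipelines.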
Some of these systems
Compute engines query the data within the cloud without moving it. This enables teams to get access to all the information they need from a single place in a cost-effective, compliant, and fast way.
These compute engines are the execution layer that sits on top of all of the open-format files. Compute engines accelerate queries. They make them faster not just for a single user but for everybody. They reduce cost because you’re not having to move data around. They eliminate data lock-in because any tool can talk to them, provided they use an open format like Arrow.
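The separation of data and compute can be sketched in a few lines, under heavy simplification. Below, a dictionary stands in for an object store such as S3 and a CSV string stands in for an open file format; the object path, the column names, and the `scan` and `total` helpers are all hypothetical. The point is only that two teams run different queries against one stored copy, and nothing is copied into a separate system first.

```python
import csv
import io

# Stand-in for an object store: one copy of the data, kept in an open format.
OBJECT_STORE = {
    "events/orders.csv": (
        "team,region,amount\n"
        "sales,us,100\n"
        "sales,eu,50\n"
        "marketing,us,30\n"
    ),
}


def scan(path, where=lambda row: True):
    """Stream rows straight from storage, filtering them as they are read."""
    reader = csv.DictReader(io.StringIO(OBJECT_STORE[path]))
    for row in reader:
        if where(row):
            yield row


def total(path, column, where=lambda row: True):
    """A stateless 'compute' step: aggregate over the scanned rows."""
    return sum(float(row[column]) for row in scan(path, where))


# Two different consumers query the same stored object in place.
us_total = total("events/orders.csv", "amount", where=lambda r: r["region"] == "us")
sales_total = total("events/orders.csv", "amount", where=lambda r: r["team"] == "sales")

print(us_total, sales_total)
```

Because the compute functions hold no data of their own, more of them can be added, or swapped for a different engine, without migrating anything.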
We’ve been lucky enough to work with Dremio from the beginning. It’s a company that saw this trend years ago and developed the infrastructure to enable you to achieve this vision.
The next step in the process is data modeling. The idea is to define revenue once so that the sales team and the marketing team both have the same definition and don’t get into data disputes with each other. Make sure that the entire company is aligned on a single number.
I’m sure we’ve all lived through a meeting where we are arguing about a topic, and we’ve each got a different number for revenue or lead count or payback period. Modeling is all about creating an owner of a metric, explaining what that metric is, and describing the lineage, so that everybody is on the same page and using the right number in the right column to make the best decision.
The other important part of data modeling is to ensure you understand what your data is telling you. Variations in data definitions can have a meaningful impact on how you interpret a number. So, companies like Transform develop systems where dimensions and metrics of data are defined once, in a central place. This code is checked into GitHub.
Then, whenever you need data you query the data modeling interface, which ensures you know the revenue metric you are asking for is the revenue metric that everybody else is using and the one that has been approved by finance.
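Here is a minimal sketch of what defining a metric once could look like. The `Metric` class, the `define_metric` and `query_metric` helpers, and the revenue rule itself are assumptions for illustration only; they are not the interface of Transform or Looker.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Metric:
    name: str
    owner: str        # the team accountable for the definition
    description: str  # what the number means, in plain language
    compute: Callable[[list], float]


METRICS = {}


def define_metric(metric):
    # A metric may be defined exactly once for the whole organization.
    if metric.name in METRICS:
        raise ValueError(
            f"{metric.name} is already defined by {METRICS[metric.name].owner}"
        )
    METRICS[metric.name] = metric


def query_metric(name, rows):
    """Every team asks for a metric by name and gets the same calculation."""
    return METRICS[name].compute(rows)


define_metric(Metric(
    name="revenue",
    owner="finance",
    description="Sum of paid invoice amounts, refunds excluded",
    compute=lambda rows: sum(r["amount"] for r in rows if r["status"] == "paid"),
))

invoices = [
    {"amount": 100.0, "status": "paid"},
    {"amount": 40.0, "status": "refunded"},
    {"amount": 60.0, "status": "paid"},
]

sales_view = query_metric("revenue", invoices)
marketing_view = query_metric("revenue", invoices)
print(sales_view, marketing_view)
```

Since both teams go through `query_metric`, a second, conflicting definition of revenue cannot be registered by accident, which is the source of most data brawls.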
Data products are the insights, analytics, and software built using data within a company. There are two big buckets of these I’ll talk about today:
There are next-generation data visualization companies like Preset that enable teams to visualize trends within the data, share this insight with others, and publish them on an ongoing basis to key stakeholders. Preset is a company commercializing open-source software called Superset, which was created at Airbnb. In fact, the founder of the project and the company, Max, spoke to you earlier today. Preset adopts many of the open principles that are consistent with the rest of this ecosystem and applies them to data visualization and data exploration.
In addition, there is a parallel world within machine learning tooling. This world is huge, with many key players in it. Streamlit enables machine learning engineers to share their models with non-technical users, either for direct consumption of those models, like a recommendation system within a customer support tool for recommending email responses, or for help training a model in an autonomous vehicle use case, for example.
To give you a sense, this is a screenshot of Preset’s mapping capability in San Francisco. This is entirely open-source software.
This is an example of a Streamlit data product. On the left is the code written in Python. On the right, you see the web UI that is created. In this case, it is an example that allows an end user to tweak and tune a data scientist’s machine learning model. And that user doesn’t have to be a technical user. It could be someone who operates autonomous vehicles helping data scientists tune the object avoidance algorithm.
Last, data quality. Data quality was a wave in the late 90s. But it disappeared for about 20 years, or at least hasn’t been adopted within modern data stacks until now.
Software engineering has many different systems to ensure new code operates well. There is a battery of performance tests, functional tests, unit tests, regression tests, concepts of test coverage, monitoring tools and anomaly detection tools. But we don’t have that in data today. And it manifests itself in the worst way.
Has your CEO ever looked at a report you showed him and said the numbers look way off? Has a customer ever called out incorrect data in your product’s dashboards? Data quality is meant to solve that issue and restore consistent credibility with people who use data.
There are two different approaches to data quality. The first is to write explicit tests. This is an expectation from Great Expectations.
It says the column room temperature should remain between 60 & 75 degrees for 95% of instances. This type of data integrity testing is like functional testing in software. If engineers know what to expect, this is an effective tool. It does require writing a huge battery of tests and having a test coverage metric similar to software.
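The room temperature rule can be written as a small check. This is a plain-Python stand-in for the idea, not the Great Expectations API; the `expect_values_between` function, its `mostly` parameter, and the sample readings are assumptions made for the example.

```python
def expect_values_between(values, low, high, mostly=1.0):
    """Pass if at least `mostly` of the non-null values fall within [low, high]."""
    observed = [v for v in values if v is not None]
    if not observed:
        return {"success": False, "fraction_in_range": 0.0}
    in_range = sum(1 for v in observed if low <= v <= high)
    fraction = in_range / len(observed)
    return {"success": fraction >= mostly, "fraction_in_range": fraction}


# 96 readings inside the range and 4 outside it.
room_temperature = [68] * 96 + [80] * 4

result = expect_values_between(room_temperature, 60, 75, mostly=0.95)
print(result["success"], result["fraction_in_range"])
```

Like a functional test, the check only catches what someone thought to write down, which is why this approach needs broad test coverage to be effective.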
There’s another approach using machine learning. Companies like Soda Data and Monte Carlo use ML to understand data patterns and then discover anomalies. These anomalies might be differences in data volumes: a data feed is broken. Or there’s a change in the distribution of the data. Instead of a Gaussian distribution in the data, now it’s a Zipf, which has implications for analysis downstream.
The machine learning approach comes from anomaly detection in security systems. And the benefit is the system is autonomous. The challenge is ensuring the signal-to-noise ratio is strong and meaningful; otherwise, users won’t pay attention to the results.
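A very simple statistical stand-in shows the shape of the automated approach for data volumes. It flags days whose row count is far from the trailing history using a z-score. This is an assumption-laden sketch, not the method used by Soda Data or Monte Carlo; the `volume_anomalies` function, the threshold, and the sample counts are invented for the example.

```python
from statistics import mean, stdev


def volume_anomalies(daily_row_counts, threshold=3.0, min_history=7):
    """Return the indexes of days whose volume deviates sharply from the past."""
    anomalies = []
    for i in range(min_history, len(daily_row_counts)):
        history = daily_row_counts[:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # A perfectly flat history: any change at all is unusual.
            is_anomaly = daily_row_counts[i] != mu
        else:
            is_anomaly = abs(daily_row_counts[i] - mu) / sigma > threshold
        if is_anomaly:
            anomalies.append(i)
    return anomalies


# Day 8 is a broken feed: the row count collapses.
counts = [1000, 1020, 990, 1010, 1005, 995, 1015, 1008, 120, 1002]
print(volume_anomalies(counts))
```

The `threshold` is where the signal-to-noise trade-off lives: set it too low and the alerts become noise that users learn to ignore, set it too high and a broken feed slips through.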
So, in summary, these are the five data trends you should know. These are the data trends that we have observed after meeting thousands of companies and talking to hundreds of prospective buyers. These are the technologies that we expect will define the data world over the next 10 years.
These five trends are not enough.
It’s really early in this decade of data engineering. We are 6 months into a 10-year-long movement. The future depends on you. We need engineers to weave all these different technologies together into a beautiful data tapestry. These are not easy problems, and the landscape underneath you is changing all the time. There are new software tools, legacy applications, and lots of demands from everybody around you to get them exactly what they need when they need it, which is yesterday.
But at Redpoint, we believe this decade is the decade of the data engineer: an entirely new role that specializes in the critically important functions of getting data from the places it is generated to the places it creates insights and unlocks powerful decision-making ability within businesses.
The future depends on you.