5 Major Trends in Data You Should Know


The document discusses 5 data trends that data engineers should know: 1. Rise of data pipelines to repeatedly move data around using code. 2. Use of compute engines to query cloud data without moving it by separating data and compute. 3. Data modeling to define metrics once for the entire organization. 4. Building of internal and external data products to extract insights from large amounts of data. 5. Ensuring data quality by developing tests and monitoring data flows to maintain data integrity.


Five Data Trends You Should Know
Tomasz Tunguz, Managing Director, Redpoint Ventures
@ttunguz & tomtunguz.com



5 Major Trends in Data You Should Know

1. Five Data Trends You Should Know. Tomasz Tunguz, Managing Director, Redpoint Ventures. @ttunguz & tomtunguz.com

5. Metatrend: Rise of Data Engineering as Craft

6. Why has Data Become So Ubiquitous?

7. When a Single Monolithic Pipeline Worked, It Looked Like This: Data Produced (Logs, TX, Actions), Aggregated into EDW (Oracle, SAP), Output (Cognos, Tableau)

8. But Everyone Wanted One: Exec Team, Marketing, Product, Sales

9. And They Each Need Data from the Others: Exec Team, Marketing, Product, Sales

10. This is a Data Mesh: A Network of Data Producers & Consumers

11. Centralize and Move it to a Cloud Data Lake

12. Without the right tooling, you have a Data Mess

14. But You Could Have a Breathtaking Machine, When It All Comes Together?

15. Who Will Come to Save the Day?

16. What is a Data Engineer? Data Engineers: the people who move, shape, and transform data from where it is generated to where it is needed, and do it
1. Consistently
2. Efficiently
3. Scalably
4. Accurately
5. Compliantly

17. aka Software Engineers Deep in Data

18. Insight: Software Engineers Have Experience, Tools, and Patterns Writing Code

19. Ex: the Software Development Lifecycle

20. What is the Data Engineering Equivalent?

21. The Data Engineering Lifecycle

22. Each Step of the DELC Needs New Tools

23. Data Pipelines: Watermains of Data. Code in a modern language to repeatably move data around. Innovators: Airflow, Elementl, Prefect

24. Data Pipelines: Watermains of Data

25. Compute Engines: Access Cloud Data. Query data in the cloud, without moving it. Key insight: separation of data and compute. Innovators: Dremio, Databricks

27. Data Modeling: Universal Metrics Library. Define metrics once for the entire organization. Innovators: Transform Data, Looker

29. Data Products: Stand on the Shoulders of Gigabytes. Build and deploy data products internally and externally. Innovators: BI: Preset. ML: Streamlit, Tecton

32. Data Quality: Harness & Tame Error. Develop tests and monitor data flows to ensure data integrity. Innovators: Monte Carlo, Great Expectations, Soda Data, Data Gravity

35. 5 Data Trends You Should Know
1. Data Pipelines – move data with code
2. Compute Engines – query cloud data
3. Modeling – define metrics once
4. Data Products – squeeze insight from data
5. Data Quality – keep data accurate

36. The Future Depends on You

37. Five Data Trends You Should Know. Tomasz Tunguz, Managing Director, Redpoint Ventures. @ttunguz & tomtunguz.com

Editor’s Notes

Thank you for the warm introduction, Jason. I’m thrilled to be here. My name is Tomasz Tunguz. I’m a managing director at Redpoint Ventures and I write a blog at tomtunguz.com. It’s a data-infused collection of posts about startups.
Let me tell you about Redpoint. Redpoint is a venture firm based in Silicon Valley. We invest anywhere from 1m to 50m in companies, primarily in the US. We’re a group of founders and operators who have founded startups, operated at hypergrowth companies, and helped startups scale to terrific heights.
We work or have worked with 26 unicorns and some iconic companies with more than 25b in market cap, including Stripe, HashiCorp, Twilio, Duo Security and Zendesk. We have deep domain experience in data. We were early investors in Looker, Snowflake and Dremio. We evaluate about 7000 investment opportunities annually, and this presentation is meant to distill some of the trends we see in the market.
I’m passionate about data. I was first exposed to the power of data studying machine learning at college. I studied control systems for satellites and saw how that technology could be used in the stock market. Then I went to Google. Google’s business is entirely predicated on data. I saw first hand the impact and the leverage we could drive from great data if properly managed through the right systems and tools. I have been deep in data ever since. I co-authored a book on data called Winning with Data, which researched the challenges modern organizations face with data and how the best companies in the world mitigate those challenges and transform data into competitive advantage. Like you all, I love data and the power and insight it can give businesses.
Today, I’ll share with you 5 trends we’re seeing in the data world. But you should know there is one megatrend, a huge wave, furthering these trends. That trend is the rise of data engineering as a new craft. The word data engineer is new, and the idea is important. Data engineers will define the next decade. Ten years ago, the people working with data, moving it, shaping it, slicing it, came from many different backgrounds. Some came from finance; others have statistics backgrounds; still others came from customer support (like me) and they all found themselves in data roles. This convergence across disciplines occurred because data has become a critical part of every modern company’s technology stack. Data has become essential. So companies invest in specialized people, processes and systems to maximize the benefit they can squeeze from data.
Data engineering has come about because data is everywhere, and every bit of a business’ data is valuable. The reason data has become so ubiquitous is that it costs much less to store than it did 20 years ago. 20 years ago, we stored data in Oracle databases that were expensive and required new licenses as data scaled. So we filtered it aggressively. Today, we store exabytes of data in files on S3, because we can afford it. For the price of two oat-milk macchiatos at Blue Bottle, I can store half a terabyte of data on Amazon for a month. And we store buckets, reams, mountains of it. Since we have all that data at hand, we decided to use it. 20 years ago, IT bought the systems to extract value from data. They procured them, installed them, and managed these systems. But ten years ago, forward-thinking teams decided to do it for themselves. IT was too slow. A modern marketing team can’t wait 3 to 6 months to get the answers to their questions. They’ll be toast. So the marketing team bought their own system. And then the marketing team created data products. At first, these data products were dashboards. How many new clicks? How many leads? How many customers? How much ad spend? Then marketing operations teams became more sophisticated. They started to run scenarios to test different ideas, and experiment with new techniques. Today, marketing is a panoply of machine learning algorithms stuffed to the gills with first-party data, a quantitative hedge fund for buying online ads. All in 15 years. Those predictive systems create data of their own, which is stored and processed. This is more than a process; it’s a flywheel that goes faster and faster and faster. A massive digital boulder of ones and zeros coming down the hill at top speed. The problem is that this boulder isn’t just in marketing. It’s everywhere within a company.
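To make the storage arithmetic concrete, here is a rough back-of-the-envelope sketch. The per-gigabyte price is an assumption (roughly the order of magnitude of standard cloud object storage list prices, about 0.023 dollars per GB per month), not a figure from the talk, and the names are illustrative.

```python
# Assumed list price for cloud object storage, in dollars per GB per month.
# This is an illustrative assumption, not a number quoted in the talk.
ASSUMED_PRICE_PER_GB_MONTH = 0.023


def monthly_storage_cost(gigabytes: float,
                         price_per_gb: float = ASSUMED_PRICE_PER_GB_MONTH) -> float:
    """Return the approximate monthly cost in dollars of storing `gigabytes`."""
    return gigabytes * price_per_gb


half_terabyte_gb = 500  # half a terabyte, taking 1 TB = 1000 GB
cost = monthly_storage_cost(half_terabyte_gb)
print(f"Storing {half_terabyte_gb} GB for a month costs about ${cost:.2f}")
```

Under that assumed price, half a terabyte comes to roughly 11.50 dollars a month, which is in the same range as a couple of specialty coffees.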
Let me explain. 20 years ago this is how the data world worked at the highest level. Systems produced data: system logs, transactions, customer actions on websites. The data was filtered into an enterprise data warehouse and data cube because of cost. And then pumped into a legacy output system like a Tableau or Cognos. This worked for small data volumes. But it’s expensive, inflexible, closed and slow. Pop quiz: how long does it take to update a report in Cognos? Too long. Your business is dead. But this was state of the art.
And everyone wanted one. Each team manager saw success with data. The authority, the command of the business, the ideas that flowed from the data. It’s intoxicating when you can use data to see around corners, inspire confidence, and lead teams boldly into the future. So each team developed their own data systems. IT couldn’t keep up. And consumerization of IT was born. For every $1 IT spent on technology, department heads spent an additional 47 cents to outfit their teams with the best kit. At the outset, departments built small systems. But then each hired operations teams, doublespeak for data analysts and data engineers to help them understand the data, predict the future, and build data products on top. A thousand digital flowers bloomed. And they grew and grew and grew.
And that garden quickly became overrun with complexity. Leaves and thorns everywhere. The marketing team decided they needed data from other places; not just the central data store administered by IT. The marketing team needs access to the CRM data base to understand customer value. Oh, and customer support data to understand customer lifecycles. Plus, billing data from the finance team. And a bit of product data: those web analytics inform customer conversion. It wasn’t just marketing that was sapping data from other teams. Each department needed data from the other to operate their businesses best. Which created a completely new concept.
This idea has a name. It is called the data mesh. It is a network of data producers and consumers within an organization. Each team is responsible for producing its own data, publishing data via some API or common format. It’s responsible for documenting the data, explaining the lineage, keeping it up to date, so other teams can use it and rely on it to decide. In exchange, other teams do the same. This creates a mesh, and enables the organization to send the data, use APIs, and develop increasingly sophisticated data products at scale.
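One way to picture the mesh is as a registry in which each team publishes a described, owned data product that others can discover. The sketch below is a toy in-memory model; `DataProduct`, `DataMesh`, and the example product names are hypothetical and do not correspond to any particular vendor's API.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DataProduct:
    """Hypothetical descriptor a team publishes alongside its data."""
    name: str
    owner_team: str
    schema: dict                      # column name -> type, i.e. the documentation
    lineage: list = field(default_factory=list)  # names of upstream data products
    last_updated: Optional[date] = None


class DataMesh:
    """Toy registry connecting data producers and consumers."""

    def __init__(self) -> None:
        self._products = {}

    def publish(self, product: DataProduct) -> None:
        self._products[product.name] = product

    def discover(self, name: str) -> DataProduct:
        return self._products[name]

    def upstream_teams(self, name: str) -> set:
        """Teams whose published data the named product depends on directly."""
        product = self._products[name]
        return {self._products[dep].owner_team
                for dep in product.lineage if dep in self._products}


mesh = DataMesh()
mesh.publish(DataProduct("crm_accounts", "sales", {"account_id": "int"}))
mesh.publish(DataProduct("billing_invoices", "finance", {"invoice_id": "int"}))
mesh.publish(DataProduct(
    "customer_value", "marketing", {"account_id": "int", "value": "float"},
    lineage=["crm_accounts", "billing_invoices"],
    last_updated=date(2021, 1, 1),
))
print(mesh.upstream_teams("customer_value"))
```

Here the marketing team's product records that it relies on data owned by sales and finance, which is the producer and consumer relationship the mesh makes explicit.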
And then, importantly, modern companies move this all to a cloud data lake. In the cloud, data is elastic, cheap, maintained by someone else, and accessible by everyone (with the right IAM permissions of course). More importantly, teams store data in these cloud data lakes in standard, open-source formats like Parquet and Arrow. These formats accelerate queries and create a single standard, which makes it easier to work with tools that you have today and tools that have not been invented yet. That’s the vision. That’s where the industry is going. But we are all in different states of getting there. And the reality is more complicated than these beautiful diagrams.
In fact today, many companies don’t have a data mesh, they have a data mess. Each team has its own tools, data storage depots and infrastructure. It’s a big bucket of Legos. Systems that don’t talk to each other. Confusion about three different definitions of revenue. Where is the customer support data table? Oh, that’s the old version. And that column that reads date_final_final is actually the wrong format. We moved it to a new column called dff..f. And to access that table you need to speak COBOL. But we lost the COBOL/NodeJS connector.
Data messes have four consistent problems:
1. Data breadlines: I have a question about the business. Let me go and ask the engineer I met at lunch if she’ll do me the favor of pulling the data, again. Data breadlines are the invisible people waiting around for answers to their data questions, who ask a question and go to the back of the line when they need a refinement.
2. Data obscurity, or rogue databases: when I was at Google, I operated a rogue database. I asked an engineer to run a MapReduce job to help me understand the competition and dump that table on a server underneath my desk. Then I built reports on that table and we used it to prioritize customer acquisition techniques. No one knew it was there. No one validated the data.
3. Data fragmentation: the challenge of finding out where data is. You see the dashboard in front of you. You know the data is stored somewhere in the company, but where? Who owns it?
4. Data brawls: the fights between teams about the definition of payback period.
The vision, as it has always been with data systems, is to put it all together and develop a breathtaking machine that enables a company to grow significantly faster. I can tell you from working with some of the leading companies in the data world, when you do achieve this vision, it’s a transformation. It enables teams to move faster, execute better, and outperform the competition. I saw it at Google. I saw it at Looker and many of Looker’s customers. And we’re seeing it at Dremio too. Companies that can migrate to data meshes suddenly unlock hidden productivity. It’s a big leap and a challenge.
But getting there and building a machine is not easy. So the question is, when you put up the bat signal, who will come to save the day? There’s a simple answer. It’s the data engineer. This is why this role has evolved. Because the complexity has gotten to a point that we need specialized people to manage this infrastructure and empower everyone within a company to use data effectively. We believe that data engineering is the customer success of this decade: a new role, critically important to a company, that will champion a discipline of the future. Although I can’t see you, I’m confident many of you in the audience are exactly that superhero, maybe minus the Batmobile.
What is a data engineer? They are the people who move, shape, and transform data from where it is generated to where it is needed, and do it consistently, efficiently, scalably, accurately and compliantly. Data engineers have many different skills. Some of them are infrastructure specialists. Others have focused on reporting and the tools associated with analytics. Still others develop, host and maintain machine learning infrastructure. It is a broad discipline of very smart people who are going to be key to business success in the next 10 years.
In other words, these people are software engineers who are deep in data.
In researching this market, we had an insight. Software engineers have decades of experience writing software, building tooling and patterns of writing code.
The most recent example of this is the Cloud Native Computing Foundation’s software development lifecycle. This is an ouroboros, a snake eating its tail, an infinite cycle. It is a consistent process for how to manage software releases in the most modern way. Vendors within that ecosystem use this diagram within their pitches to customers and describe exactly which part of the process they address. Managers use this process to talk about tooling at different steps of the engineering process. It has 8 steps:
1. Plan the software you want to build
2. Code it
3. Build it and package it to ship
4. Test it with a testing harness
5. Release the software by pushing it into the production environment
6. Deploy the software across your cloud
7. Operate it
8. Monitor it
And repeat.
If data engineering really is software engineers deep in data, what is the data engineering equivalent of the software development lifecycle? I haven’t been able to find one. But, in talking to hundreds of potential buyers of this kind of software, we have a hypothesis of what it should look like.
This is what we observe in the market for the data engineering lifecycle. It has six steps:
1. Ingest: bringing data from whatever data producer is spewing it into a storage system like Amazon S3.
2. Plan: deciding what it is that you want to do with this data.
3. Query: modern compute engines run over the data to filter and aggregate it in a way that’s useful to a particular product.
4. Data modeling: defining a metric once in a central place so that everyone within the company can benefit from it.
5. Develop product: actually building a product around the data and the insights contained within that data.
6. Monitor: ensuring data is flowing normally and is accurate at all times.
This cycle creates more data, which is then ingested, saved, and pumped back into the rest of the cycle.
In each of the steps of the data engineering lifecycle, new tools are emerging to support the work of the data engineer. These are the five major trends within the data world.
First, data pipelines. These are the watermains of data, moving data from where it’s produced to where it can be leveraged. Data pipelines have been around forever. The main advances in these data pipelines are:
1. Using modern computing languages
2. Creating higher levels of abstraction to enable engineers to reuse code across different data pipelines to improve productivity
3. Monitoring within these data pipelines
4. Visualization of the DAG, a directed acyclic graph of all the steps involved
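The idea of a pipeline as code arranged in a DAG can be sketched with nothing but the Python standard library. This is a minimal illustration, not how Airflow, Elementl, or Prefect are actually used; the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def extract(_inputs):
    # Stand-in for reading from a source system.
    return [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 5}]


def transform(inputs):
    # Keep only the field the downstream step needs.
    return [row["clicks"] for row in inputs["extract"]]


def load(inputs):
    # Stand-in for writing an aggregate to a destination.
    return sum(inputs["transform"])


# Task name -> (callable, names of upstream tasks it depends on).
PIPELINE = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run_pipeline(pipeline):
    """Run every task after its dependencies, passing upstream results along."""
    graph = {name: set(deps) for name, (_, deps) in pipeline.items()}
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        func, deps = pipeline[name]
        results[name] = func({dep: results[dep] for dep in deps})
    return order, results


order, results = run_pipeline(PIPELINE)
print(order, results["load"])
```

Because the dependencies are declared as data, the same structure can be reused across pipelines, drawn as a graph, or wrapped with monitoring around each task.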
Here are screenshots of Prefect’s product, which ingests code and then creates a DAG visualization. You can see the different steps in the data process. And on the right, there is a monitoring dashboard that shows the state of the data pipeline, the errors, and the activity. The idea is to treat data pipelines as real code with true monitoring to ensure data is always accurate. Some of these systems
Compute engines query the data within the cloud without moving it. This enables teams to get access to all the information they need from a single place in a cost-effective, compliant, and fast way. These compute engines are the execution layer that sits on top of all the open-format files. Compute engines accelerate queries. They make them faster not just for a single user but for everybody. They reduce cost because you don’t have to move data around. They eliminate data lock-in because any tool can talk to them, provided it uses an open format like Arrow. We’ve been lucky enough to work with Dremio from the beginning, and it’s the company of ours that saw this trend years ago and developed the infrastructure to enable you to achieve this vision.
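The separation of data and compute can be illustrated in miniature: the data stays in a file in storage, and a separate function streams over it to filter and aggregate without first copying it into a database. In this sketch a CSV file in a temporary directory stands in for an open-format file in a cloud data lake, and all names are hypothetical. Real compute engines add columnar formats, caching, and distributed execution on top of this idea.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def write_sample_file(directory: Path) -> Path:
    """Create a small file that plays the role of data at rest in storage."""
    path = directory / "orders.csv"
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["region", "amount"])
        writer.writerows([["us", "10.0"], ["eu", "4.5"], ["us", "2.5"]])
    return path


def total_by_region(path: Path, min_amount: float = 0.0) -> dict:
    """Stream rows from storage and aggregate them in place."""
    totals = defaultdict(float)
    with path.open(newline="") as handle:
        for row in csv.DictReader(handle):
            amount = float(row["amount"])
            if amount >= min_amount:
                totals[row["region"]] += amount
    return dict(totals)


with tempfile.TemporaryDirectory() as tmp:
    data_path = write_sample_file(Path(tmp))
    totals = total_by_region(data_path)
    filtered = total_by_region(data_path, min_amount=5.0)

print(totals, filtered)
```

The storage layer (the file) knows nothing about the query, so any other tool that understands the format could read the same data.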
The next step in the process is data modeling. The idea is to define revenue once so that the sales team and the marketing team both have the same definition and don’t get into data disputes with each other. Make sure that the entire company is aligned on a single number. I’m sure we’ve all lived through a meeting where we are arguing about a topic, and we’ve each got a different number for revenue or lead count or payback period. Modeling is all about creating an owner of a metric, explaining what that metric is, and describing the lineage, so that everybody is on the same page and using the right number in the right column to make the best decision.
The other important part of data modeling is to ensure you understand what your data is telling you. Variations in data definitions can have a meaningful impact on how you interpret a number. So, companies like Transform develop systems where dimensions and metrics of data are defined once, in a central place. This code is checked into GitHub. Then, whenever you need data, you query the data modeling interface, which ensures that the revenue metric you are asking for is the revenue metric that everybody else is using and the one that has been approved by finance.
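A small sketch of the define-once idea: metric definitions live in one shared, version-controlled place, and every query is built from them rather than hand-written per team. The `METRICS` registry, the `orders` table, and `query_metric` are hypothetical and only illustrate the pattern; they are not Transform's or Looker's interface.

```python
import sqlite3

# Central metric definitions, the kind of file that would be checked into
# version control. Names and expressions are illustrative assumptions.
METRICS = {
    "revenue": "SUM(amount)",
    "order_count": "COUNT(*)",
}

# Dimensions that queries are allowed to group by.
DIMENSIONS = {"region"}


def query_metric(conn, metric, group_by):
    """Build the query from the single shared definition of `metric`."""
    if metric not in METRICS:
        raise KeyError(f"Unknown metric: {metric}")
    if group_by not in DIMENSIONS:
        raise ValueError(f"Unknown dimension: {group_by}")
    sql = (f"SELECT {group_by}, {METRICS[metric]} FROM orders "
           f"GROUP BY {group_by} ORDER BY {group_by}")
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("eu", 4.5), ("us", 10.0), ("us", 2.5)])

revenue_by_region = query_metric(conn, "revenue", "region")
orders_by_region = query_metric(conn, "order_count", "region")
print(revenue_by_region, orders_by_region)
```

Sales and marketing both call `query_metric(conn, "revenue", ...)`, so a change to the definition of revenue happens in exactly one place.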
Data products are the insights, analytics, and software built using data within a company. There are two big buckets of these I’ll talk about today. First, there are next-generation data visualization companies like Preset that enable teams to visualize trends within the data, share these insights with others, and publish them on an ongoing basis to key stakeholders. Preset is a company commercializing open-source software called Superset, which was created at Airbnb. In fact, Max, the founder of the project and the company, spoke to you earlier today. Preset adopts many of the open principles that are consistent with the rest of this ecosystem and applies them to data visualization and data exploration. In addition, there is a parallel world within machine learning tooling. This world is huge, with many key players in it. Streamlit enables machine learning engineers to share their models with non-technical users, either for direct consumption of those models, like a recommendation system within a customer support tool for recommending email responses, or for help training a model, in an autonomous vehicle use case for example.
To give you a sense, this is a screenshot of Preset’s mapping capability in San Francisco. This is entirely open-source software.
This is an example of a Streamlit data product. On the left is the code written in Python. On the right, you see the web UI that is created. In this case, it is an example that allows an end user to tweak and tune a data scientist’s machine learning model. And that user doesn’t have to be a technical user. It could be someone who operates autonomous vehicles helping data scientists tune the object avoidance algorithm.
Last, data quality. Data quality was a wave in the late 90s. But it disappeared for about 20 years, or at least hasn’t been adopted within modern data stacks until now. Software engineering has many different systems to ensure new code operates well. There is a battery of performance tests, functional tests, unit tests, regression tests, concepts of test coverage, monitoring tools, and anomaly detection tools. But we don’t have that in data today. And it manifests itself in the worst way. Has your CEO ever looked at a report you showed him and said the numbers look way off? Has a customer ever called out incorrect data in your product’s dashboards? Data quality is meant to solve that issue and restore consistent credibility with the people who use data.
There are two different approaches to data quality. The first is to write explicit tests. This is an expectation from Great Expectations. It says the room temperature column should remain between 60 and 75 degrees for 95% of instances. This type of data integrity testing is like functional testing in software. If engineers know what to expect, this is an effective tool. It does require writing a huge battery of tests and tracking a test coverage metric, similar to software.
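The 95% tolerance in that expectation can be illustrated with a small plain-Python sketch. Great Expectations expresses this kind of tolerance with a `mostly` argument. The function below only mimics that idea and is not the library’s implementation. The name `values_between` and the sample readings are made up for illustration.

```python
def values_between(values, min_value, max_value, mostly=1.0):
    """Check that at least `mostly` of the values fall in [min_value, max_value].

    Returns (passed, fraction_in_range). An empty column passes trivially.
    """
    if not values:
        return True, 1.0
    in_range = sum(1 for v in values if min_value <= v <= max_value)
    fraction = in_range / len(values)
    return fraction >= mostly, fraction


# Made-up readings: 97 normal values and 3 out-of-range spikes.
room_temp = [68.0] * 97 + [80.0] * 3
passed, fraction = values_between(room_temp, 60, 75, mostly=0.95)

# A noisier column, with 10% of readings out of range, fails the same test.
noisy_temp = [68.0] * 90 + [80.0] * 10
noisy_passed, noisy_fraction = values_between(noisy_temp, 60, 75, mostly=0.95)
```

The tolerance matters in practice: a strict 100% rule would fail on a single sensor glitch, while a 95% rule still catches a systematic shift.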
There’s another approach using machine learning. Companies like Soda Data and Monte Carlo use ML to understand data patterns and then discover anomalies. These anomalies might be differences in data volumes, for example when a data feed is broken. Or there might be a change in the distribution of the data. Instead of a Gaussian distribution, the data now follows a Zipf distribution, which has implications for analysis downstream. The machine learning approach comes from anomaly detection in security systems. The benefit is that the system is autonomous. The challenge is ensuring the signal-to-noise ratio is strong and meaningful. Otherwise, users won’t pay attention to the results.
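As a rough illustration of the data volume case, the sketch below flags a day whose row count is far from recent history. It uses a simple z-score rule as a stand-in and is not a machine learning model, nor how Soda Data or Monte Carlo work internally. The function name, the threshold of 3 standard deviations, and the row counts are assumptions for illustration.

```python
from statistics import mean, stdev


def is_volume_anomaly(history, today, threshold=3.0):
    """Flag `today` if it is more than `threshold` standard deviations from the
    mean of `history` (a list of at least two past daily row counts)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is suspicious.
        return today != mu
    return abs(today - mu) / sigma > threshold


# Made-up daily row counts for a feed that normally lands about 1,000 rows.
daily_rows = [1000, 1020, 980, 1010, 990, 1005, 995]

broken_feed = is_volume_anomaly(daily_rows, 120)   # feed mostly failed to load
normal_day = is_volume_anomaly(daily_rows, 1008)   # ordinary fluctuation
```

The threshold is the signal-to-noise knob: set it too low and users are flooded with alerts, set it too high and a half-broken feed slips through.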
So, in summary, these are the five data trends you should know. These are the data trends that we have observed after meeting thousands of companies and talking to hundreds of prospective buyers. These are the technologies that we expect will define the data world over the next 10 years. These five trends are not enough.
It’s really early in this decade of data engineering. We are 6 months into a 10-year-long movement. The future depends on you. We need engineers to weave all these different technologies together into a beautiful data tapestry. These are not easy problems, and the landscape underneath you is changing all the time. There are new software tools, legacy applications, and lots of demands from everybody around you to get them exactly what they need when they need it, which is yesterday. But at Redpoint, we believe this decade is the decade of the data engineer: an entirely new role that specializes in the critically important functions of getting data from the places it is generated to the places it creates insights and unlocks powerful decision-making ability within businesses. The future depends on you.