ASTRONOMER.IO
From Volume to Value
A Guide to Data Engineering
Created by Astronomer, Inc. 2017
Table of Contents
Introduction
Information Overload
Talent Gap
A New Role: Data Engineering
Data Maturity Goals
Starting to Climb
Next Steps
Connect and Route Your Data with Astronomer
Conclusion (TL;DR)
About Astronomer
Sources
Introduction
In today’s digital age, getting ahead depends on leveraging data better than competitors. Take
Amazon’s acquisition of Whole Foods, which caused competitors’ stock to drop significantly. Why?
Because shareholders understand that when Amazon adds this plethora of storefront data to its
abundance of virtual-buyer data, it will discover exclusive insights to drive business. [1]
And while reaching the peak of success and retaining the lead in the race to the summit look
different based on industry, geography and other factors, some commonalities hold true. At
Astronomer, we’ve mapped out the journey to becoming more mature with data—in other
words, the path to gaining a competitive advantage.
No matter where organizations are on their journey, next steps will require more data sets to
deal with and more preparation to ready that data for analytics. Before moving toward the
summit, it’s important to consider some key questions:
•	 What metrics are most important to measure in my business?
•	 What data sets are needed to measure them?
•	 How can those data sets be accessed?
•	 Who’s responsible for cleaning, reformatting, organizing, transforming and otherwise preparing
the data for analysis?
Answering these questions is certainly challenging, which perhaps explains why only 4% of
companies actively use their data. The remaining 96% includes thousands of companies that
collect data but haven’t quite figured out how to derive maximum value from it. [2] Those who
have, however, will quickly gain a competitive advantage and see their early efforts pay off in
the long run.
In this guide, we’ll discuss three things to get you there:
1. Core challenges to extracting value from data
2. Practical ways to overcome those challenges and get to value
3. Actionable next steps for your organization
Only 4 percent of companies are actively using
their data. Are you?
(Bain and Company)
Information Overload
According to a McKinsey Global Institute (MGI) report, “data have
swept into every industry and business function and are now an
important factor of production, alongside labor and capital.” MGI
estimated that retailers using big data to its fullest potential could
increase operating margins by more than 60 percent, and that
both businesses and consumers would benefit from leveraging the
exponentially increasing data sets. [3]
And that was back in 2011.
In 2016, a Gartner analysis further defined the need for data:
organizations that provide agile, curated internal and external data
sets for a variety of content authors will realize twice the business
benefits of those that don’t. [4]
So why isn’t everybody curating these data sets and enabling individual analysts to not only access
information but also contribute back to models? Because the many data sets available to
companies across legacy systems, cloud-based tools, CRMs, databases, websites and other
data-generating sources create a mass of structured, unstructured and siloed data sets that
don’t “talk” to each other. Consolidating data is a critical first step, but it costs companies countless
hours of cleaning, enriching, and formatting.
Simply put, data is a mess.
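As a rough illustration of the consolidation step described above, the sketch below maps records from two siloed sources onto one shared schema and merges them by email address. Everything in it is an assumption made for illustration: the two sources (a CRM export and a web-analytics export), their field names, the sample rows and the `consolidate` helper are hypothetical and do not describe any real system.

```python
# Hypothetical exports: the same customer appears in two silos
# with different field names and inconsistent formatting.
crm_rows = [{"Email": " Ana@Example.com ", "FullName": "Ana Diaz"}]
web_rows = [{"user_email": "ana@example.com", "visits": "12"}]


def normalize_email(raw):
    """Cleaning step: trim whitespace and lowercase so keys match across sources."""
    return raw.strip().lower()


def consolidate(crm, web):
    """Merge both sources into one record per customer, keyed by email."""
    customers = {}
    for row in crm:
        key = normalize_email(row["Email"])
        customers.setdefault(key, {"email": key})["name"] = row["FullName"]
    for row in web:
        key = normalize_email(row["user_email"])
        # Formatting step: visit counts arrive as strings, store them as integers.
        customers.setdefault(key, {"email": key})["visits"] = int(row["visits"])
    return list(customers.values())


merged = consolidate(crm_rows, web_rows)
```

Even in this toy case, most of the code is cleaning and reformatting rather than analysis, which is the cost the paragraph above points to.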
Do you have data in a ...
•	 legacy system?
•	 cloud-based tool?
•	 CRM?
•	 database?
•	 data lake?
•	 website?
•	 app?
•	 more than one of any of the above?
It’s likely you have a LOT of data. In
various forms. Accumulating quickly.

Talent Gap
Of course, any mess can be cleaned up. But the state of the mess,
commonly described as the “three V’s of data” (volume, velocity
and variety), isn’t the only obstacle. There’s another problem:
the deep technical skills required to build, deploy and maintain
a modern data infrastructure that can handle big data, and fast,
are rare. In fact, the MGI analysis predicted that by 2018, the
United States alone could face a shortage of 140,000 to 190,000
people with deep analytical skills and a shortage of 1.5 million
managers and analysts who understand how to make effective
decisions based on data.
To contend with this, many companies have created a new
role: the data scientist. Data scientists, according to the
Harvard Business Review, are a “hybrid of data hacker, analyst,
communicator and trusted adviser” with skills like programming, multivariable calculus and
linear algebra and an understanding of machine learning. They can find patterns and extract
insights from a giant body of data and write algorithms to run over these data sets. [5] Becoming
mature with data is impossible without these capabilities.
There’s just one problem: data scientists aren’t
spending their time creating algorithms, mining
data for patterns or interpreting insights.
Do you have a data
scientist on staff? Ask them
how much time they spend ...
•	 Building training sets
•	 Cleaning and organizing data
•	 Collecting data sets
•	 Mining data for patterns
•	 Refining algorithms
•	 Articulating analysis
If you don’t have a data scientist on staff,
who does these tasks? And how much of
their time is devoted to each one?

Eighty percent of a data scientist’s time is spent collecting data
sets and cleaning and organizing them. [6] It takes a high level of
skill to do, but it’s not data science.
So having a data science team isn’t enough. Every company
must take a step back and clean, enrich, reformat and otherwise
prepare data for the data scientists and analysts. All these
activities fall into the category of data engineering.
To maximize insights from
data and get to value faster,
forward-thinking organizations
are creating a new role: the data
engineer.
Data engineering
[dat-uh en-juh-neer-ing]:
noun. the act of accessing, processing, enriching, cleaning
and/or otherwise orchestrating data analysis

A New Role: Data Engineering
So what is data engineering, exactly? And why is it so important? Data engineering is the act
of accessing, processing, enriching, cleaning and/or otherwise orchestrating data analysis.
Data engineers build tools, infrastructure,
frameworks, and services. In smaller companies—
where no data infrastructure team has yet been
formalized—the data engineering role may also
cover the workload around setting up and operating
the organization’s data infrastructure.
(Maxime Beauchemin, Airbnb, “The Rise of the Data Engineer”)
Maxime joined Facebook as a business intelligence engineer in 2011 and left as a data engineer
two years later. The need for more complex, code-based ETL and changing data modeling
drove the demand for data engineering. [7] Even though data engineering alone doesn’t
reveal insights, it readies your data to be analyzed reliably. Without it, there’s no possibility for
meaningful analysis or data science.
In simple terms, data engineers and data scientists work together like this:
Data Engineers
•	 Prepare data for analysis
•	 Process raw data
•	 Function behind the scenes
•	 Build infrastructure to consolidate and enrich numerous data sets
•	 Handle large-scale data processing
•	 Monitor and maintain systems
Data Scientists
•	 Probe for insights
•	 Deliver results to business users
•	 Apply machine learning, algorithms and other analytics approaches
•	 Uncover meaning in large amounts of data
•	 Articulate analysis, often visually
•	 Interpret results of analysis
When both data engineering and data science are priorities for an organization,
getting more mature with data is inevitable.
Data Maturity Goals
In considering how to become more mature with data, it can be
helpful to look to practical examples of companies who have done
it well. Airbnb is near the summit of the data maturity mountain. It’s
reached heights most companies can’t yet fathom—heights to the
tune of $3.5 billion in projected earnings in 2020, which exceeds the
bottom lines of 85% of Fortune 500 companies. [8]
For them, data engineering isn’t a black box; it’s cultural. [9] Access
to data and the ability to contribute to business logic have been
democratized.
As the company’s size and reach (and number of employees)
increased, so did its available data sets. Making the right
data available across the organization required strategic data engineering. First, Airbnb
established what they called “Core Data,” a single source of truth for everyone.
To do this, they created Airflow, a workflow management system for programmatically authoring,
scheduling and monitoring dependency-based data pipelines without running tasks unnecessarily.
This technology allows them to schedule all their data to flow to a single data space. [10] They also
built a data portal for employees, a “search and discovery tool” through which they can pull the
numbers they need on their own. It puts the power of real-time data analytics into the hands of
everyone working to make the company successful.
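To make “dependency-based” concrete, here is a minimal sketch of the idea using only Python’s standard library. It is not Airflow’s API and not Airbnb’s implementation; the task names and the `run_order` and `run` helpers are hypothetical. It shows two things the paragraph above describes: every task runs after the tasks it depends on, and tasks that are already complete are not run again.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract_bookings": set(),
    "extract_reviews": set(),
    "clean_bookings": {"extract_bookings"},
    "clean_reviews": {"extract_reviews"},
    "load_core_data": {"clean_bookings", "clean_reviews"},
}


def run_order(pipeline):
    """Return the tasks in an order where each one follows its dependencies."""
    return list(TopologicalSorter(pipeline).static_order())


def run(pipeline, already_done=frozenset()):
    """Walk the pipeline in dependency order, skipping tasks that are already complete."""
    executed = []
    for task in run_order(pipeline):
        if task in already_done:
            continue  # nothing to redo for this task
        executed.append(task)  # a real scheduler would invoke the task here
    return executed


order = run_order(PIPELINE)
rerun = run(PIPELINE, already_done={"extract_reviews", "clean_reviews"})
```

A production scheduler adds retries, timing, monitoring and parallel execution on top of this ordering, but the dependency graph is the core of the design.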
Now everyday decision-makers have access to information on the spot, but at the same
time, a data engineering team maintains quality control by managing data warehousing,
enhancing the performance of core data infrastructure, integrating data flow between
systems and tools and looking for new ways to automate their tasks. [11]
Airbnb is near the
summit of the data
maturity mountain.
With $3.5 billion in projected earnings,
what do they do differently?
Democratize data.
How? A single source of truth that is
searchable for everyone and a
“Data University” to make
sure everyone knows
how to use it.
Of course, even the most reliable data portal is only as good as it
is useful, so the Airbnb data science team went a step further and
tracked the weekly active users (WAUs) logging into the portal,
then created a “Data University” with courses to teach those
employees how to use the portal and mine the data it holds. [12]
This has allowed the company to operate under a philosophy
of data democratization, giving every employee access to up-
to-date data and the power to make decisions based on that
data. And all of that happens without an Airbnb data scientist
in every department because each employee is empowered
at a larger scale to find and use data—they also understand
exactly how to do that thanks to the Data University.
Now, 45% of Airbnb employees are WAUs, and that particular
economy of scale has eliminated an information bottleneck and
freed up the data science team to focus on the most pressing
problems.
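The guide does not say how Airbnb computes its WAU figure. As a minimal sketch, assuming a log of (employee, login date) pairs and a known headcount, the share of weekly active users could be calculated as below. The `weekly_active_share` helper and the sample data are made up for illustration.

```python
from datetime import date, timedelta


def weekly_active_share(logins, headcount, week_end):
    """Share of employees with at least one portal login in the 7 days ending on week_end."""
    week_start = week_end - timedelta(days=6)
    active = {user for user, day in logins if week_start <= day <= week_end}
    return len(active) / headcount


# Made-up login events: (employee id, login date).
logins = [
    ("e1", date(2017, 7, 3)),
    ("e1", date(2017, 7, 5)),   # repeat logins by the same person count once
    ("e2", date(2017, 7, 7)),
    ("e3", date(2017, 6, 20)),  # outside the week, so not counted
]

share = weekly_active_share(logins, headcount=4, week_end=date(2017, 7, 9))
```

Tracking this share week over week is what lets a team see whether training such as the Data University is actually changing how many people use the portal.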
Airbnb is far from the only company to understand the appeal of data democratization. Other
tech giants like Facebook have pioneered the trend, but many others are jumping on board:
companies like Finish Line [13], Chobani [14] and even the government [15].
TL;DR
*Some practical steps Airbnb took to get to
the summit
•	 Hired a data engineer
•	 Consolidated all data in one place
•	 Made data fully accessible
•	 Taught their employees to query
•	 Allowed multiple content authors
•	 Took action based on data
•	 Watched revenue grow
*Though this guide doesn’t get technical,
if you’re wondering how data flows,
Airbnb uses Apache Airflow, a workflow
management system.

Starting to Climb
Implementing a world-class culture of data engineering within your company requires scaling the
data maturity mountain.
If that seems daunting, take heart: remember that 96% of companies are not maximizing their data’s
value. There are many points in between the base camp and the summit, and organizations can
pick up and move to the next campsite anytime. The first step is determining where you stand now:
0.0 Camp Flying Blind
Data initiatives are most likely not a priority for you, which means you’re probably not reading this.
1.0 Camp Frustrated
You collect data, but probably aren’t sure how to extract actionable business intelligence from it.
2.0 Camp In Control
Here, you’re using some tools to aggregate data and likely understand how to access the
information you need for your role. But you’re not totally sure it’s reliable and have no idea what
other teams are doing.
3.0 Camp Activated
With connected data, you’re looking for new and relevant data sets that you can plug in for even
greater insights. You’ve got basic algorithms in place and are starting to explore data science.
But you’re spending more time preparing data for analytics than analyzing it.
[Figure: the data maturity mountain, showing camps 0.0 Flying Blind, 1.0 Frustrated, 2.0 In Control and 3.0 Activated rising with competitive advantage]
4.0 Camp Intelligent
At this stage, you offer data visualization in several forms across your organization and rely on
predictive analytics—and maybe machine learning and artificial intelligence (AI) technology.
You’re probably enabling better data science through intentional, improved data engineering.
5.0 Camp Insane - Summit
Your organization is devoted to data engineering or data science, and insights drive and define
every decision you make for your business. To enable that, there is a single source of truth
that is accessible to everyone. Anyone from marketers to data scientists can contribute back
to business logic.
If you’re not exactly sure which camp
you’re in, take the 60-second self-assessment.
astronomer.io/data-assessment
No matter where you’ve mapped yourself, remember: very few businesses
have reached the summit of “Insane”—and few are still stuck in the
doldrums “flying blind” at the zero spot—so it’s fair to assume that your
business’s data strategy, and that of your biggest competitors, is somewhere
in between these two extremes. And that’s a good thing; it means you can
scale up whenever you like.
[Figure, continued: camps 4.0 Intelligent and 5.0 Insane Mode! at the summit]
Next Steps
As Stephen Covey says, begin with the end in mind. If Airbnb’s culture of data engineering
represents the summit, here’s a checklist of steps to getting there:
	 Read this guide
	 Commit to getting value from your data
	 Consider hiring a data scientist
	 Create a data engineering capability in your organization
This is where Astronomer can help!
	 Consolidate all data in one place
	 Route data to give decision-makers full access
	 Teach them to query (if necessary)
	 Empower business users to contribute to core tables
	 Once you trust and understand the data, probe for insights
	 Take action
	 Grow your revenue!
How does Astronomer fit in?
The rapid, agile, secure data routing and prep required for this to-do list rely on specialized
tools. For Airbnb, that’s Apache Airflow. Astronomer’s data engineering platform incorporates
all the strength of Apache Airflow with all the power of Astronomer to empower teams to construct
the data infrastructure they need for cross-organizational data democratization.
Astronomer’s data engineering platform streamlines and
amplifies your data engineering capabilities.
Connect and Route Your Data with Astronomer
Astronomer is a data engineering platform that connects
data from legacy systems, BI tools, databases and other
sources—and routes it where it can be analyzed.
Astronomer offers complete customizability through its
use of open-source software, including Apache Airflow
(which originated at Airbnb), and offers both a library of
standard data pipelines and full access for developers to
write custom pipelines, defined as code. A business user
can set up a standard pipe, like sending Facebook Ads data
to Redshift, in minutes. Or a data scientist, analyst or data
engineer can author, schedule and monitor their own
dependency-based data pipelines to centralize and route data
from analytics tools, legacy systems, apps and more.
Whatever camp you’re currently in,
Astronomer meets you where you are and
helps you get ahead.
Conclusion (TL;DR)
•	 Digital Darwinism threatens every organization.
•	 For most companies, data is a mess.
•	 There is a shortage of folks with the skills to deal with data.
•	 Companies who get ahead now have a serious advantage.
•	 Getting ahead looks like:
1. making data engineering a priority.
2. consolidating data into a single source of truth.
3. democratizing data for the entire organization.
About Astronomer
Since our beginning in 2015, we have said we are with the machines. We believe the future of
work looks like machines + humans operating in their respective strengths and accomplishing
more, together. By assembling a world-class team of data engineers to program machines
to connect, process and route large amounts of data, we free humans up to do what they do
best: analyze data to discover insights and make essential decisions.
Learn more at astronomer.io or connect with us at humans@astronomer.io.
Sources
1.	 “Big Prize in Amazon-Whole Foods Deal: Data” by Laura Stevens and Heather Haddon, Wall Street Journal, 2017, astrnmr.co/2uTXNdc
2.	 “The Value of Big Data: How analytics differentiates winners” by Rasmus Wegener and Velu Sinha, Bain & Company, 2013, astrnmr.co/2uTRE0y
3.	 “Big data: The Next Frontier for Innovation, Competition and Productivity” by James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh and Angela Hung Byers, McKinsey and Company, 2011, astrnmr.co/2sPDMrK
4.	 “Market Guide for Self-Service Data Preparation” by Rita L. Sallam et al, Gartner, 2016, astrnmr.co/2tzriSo
5.	 “Data Scientist: The Sexiest Job of the 21st Century” by Thomas H. Davenport and D.J. Patil, Harvard Business Review, 2012, astrnmr.co/2syVbAW
6.	 “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says” by Gil Press, Forbes, 2016, astrnmr.co/2uzVgWx
7.	 “The Rise of the Data Engineer” by Maxime Beauchemin, 2017, astrnmr.co/2uTRiqV
8.	 “Airbnb’s Profits to Top $3 Billion by 2020” by Leigh Gallagher, Fortune, 2017, astrnmr.co/2syKtKR
9.	 “Democratizing Data at Airbnb” by Chris Williams, Eli Brumbaugh, Jeff Feng, John Bodley, and Michelle Thomas, Airbnb, 2017, astrnmr.co/2uzEt5V
10.	 “Airflow: A Workflow Management Platform” by Maxime Beauchemin, Airbnb, 2015, astrnmr.co/2uA286c
11.	 “How Airbnb Democratized Data” by Olivia Timson, Innovation Enterprise, 2016, astrnmr.co/2sPjEpI
12.	 “How Airbnb Democratizes Data with Data University” by Jeff Feng, Erin Coffman and Elena Grewal, Airbnb, 2017, https://astrnmr.co/2v2hY8F
13.	 “The Value of Democratizing Data” by Samuel Greengard, Baseline, 2015, astrnmr.co/2vBVVcn
14.	 “How Data Democratization Can Deliver a Healthy Breakfast” by Errol Apostolopoulos, DataInformed, 2016, astrnmr.co/2vBsffB
15.	 “Democratizing Big Data to Bring Government Ahead of the Curve” by Quinton Alsbury, Wired, astrnmr.co/2vB4Uuf

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

From Volume to Value - A Guide to Data Engineering

  • 1. ASTRONOMER.IO. From Volume to Value: A Guide to Data Engineering
  • 2. Created by Astronomer, Inc. 2017. Table of Contents: Introduction; Information Overload; Talent Gap; A New Role: Data Engineering; Data Maturity Goals; Starting to Climb; Next Steps; Connect and Route Your Data with Astronomer; Conclusion (TL;DR); About Astronomer; Sources
  • 3. Introduction. In today’s digital age, getting ahead depends on leveraging data better than competitors. Take Amazon’s acquisition of Whole Foods, which caused competitors’ stock to drop significantly. Why? Because shareholders understand that when Amazon adds this plethora of storefront data to its abundance of virtual-buyer data, it will discover exclusive insights to drive business. [1] And while reaching the peak of success and retaining the lead in the race to the summit look different based on industry, geography and other factors, some commonalities hold true. At Astronomer, we’ve mapped out the journey to becoming more mature with data, in other words, the path to gaining a competitive advantage.
  • 4. No matter where organizations are on their journey, next steps will require more data sets to deal with and more preparation to ready that data for analytics. Before moving toward the summit, it’s important to consider some key questions: • What metrics are most important to measure in my business? • What data sets are needed to measure them? • How can those data sets be accessed? • Who’s responsible for cleaning, reformatting, organizing, transforming and otherwise preparing the data for analysis? Answering these questions is certainly challenging, which perhaps explains why only 4% of companies actively use their data. The remaining 96% includes thousands of companies that collect data but haven’t quite figured out how to derive maximum value from it. [2] Those who have, however, will quickly gain a competitive advantage and see their early efforts pay off in the long run. In this guide, we’ll discuss three things to get you there: 1. Core challenges to extracting value from data 2. Practical ways to overcome those challenges and get to value 3. Actionable next steps for your organization. Only 4 percent of companies are actively using their data. Are you? (Bain and Company)
  • 5. Information Overload. According to a McKinsey Global Institute (MGI) report, “data have swept into every industry and business function and are now an important factor of production, alongside labor and capital.” MGI estimated that retailers using big data to its fullest potential could increase operating margins by more than 60 percent, and that both businesses and consumers would benefit from leveraging the exponentially increasing data sets. [3] And that was back in 2011. In 2016, a Gartner analysis further defined the need for data: organizations that provide agile, curated internal and external data sets for a variety of content authors will realize twice the business benefits of those that don’t. [4] So why isn’t everybody curating these data sets and enabling individual analysts to not only access information but also contribute back to models? Because the many data sets available to companies between legacy systems, cloud-based tools, CRMs, databases, websites and other data-generating sources create a mass of structured, unstructured and siloed data sets that don’t “talk” to each other. Consolidating data is a critical first step, but it costs companies countless hours of cleaning, enriching, and formatting. Simply put, data is a mess. Do you have data in a ... • legacy system? • cloud-based tool? • CRM? • database? • data lake? • website? • app? • more than one of any of the above? It’s likely you have a LOT of data. In various forms. Accumulating quickly.
  • 6. Talent Gap. Of course, any mess can be cleaned up. But the state of the mess, commonly described as the “three V’s of data” (volume, velocity and variety), isn’t the only obstacle. There’s another problem: the deep technical skills required to build, deploy and maintain a modern data infrastructure that can handle big data, and fast, are rare. In fact, the MGI analysis predicted that by 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills and a shortage of 1.5 million managers and analysts who understand how to make effective decisions based on data. To contend with this, many companies have created a new role: the data scientist. Data scientists, according to the Harvard Business Review, are a “hybrid of data hacker, analyst, communicator and trusted adviser” with skills like programming, multivariable calculus and linear algebra and an understanding of machine learning. They can find patterns and extract insights from a giant body of data and write algorithms to run over these data sets. [5] Becoming mature with data is impossible without these capabilities. There’s just one problem: data scientists aren’t spending their time creating algorithms, mining data for patterns or interpreting insights. Do you have a data scientist on staff? Ask them how much time they spend ... • Building training sets • Cleaning and organizing data • Collecting data sets • Mining data for patterns • Refining algorithms • Articulating analysis. If you don’t have a data scientist on staff, who does these tasks? And how much of their time is devoted to each one?
  • 7. Eighty percent of a data scientist’s time is spent collecting data sets and cleaning and organizing them. [6] That work takes a high level of skill, but it’s not data science. So having a data science team isn’t enough. Every company must take a step back and clean, enrich, reformat and otherwise prepare data for the data scientists and analysts. All these activities fall into the category of data engineering. To maximize insights from data and get to value faster, forward-thinking organizations are creating a new role: the data engineer. Data engineering [dat-uh en-juh-neer-ing]: noun. The act of accessing, processing, enriching, cleaning and/or otherwise orchestrating data analysis.
  • 8. A New Role: Data Engineering. So what is data engineering, exactly? And why is it so important? Data engineering is the act of accessing, processing, enriching, cleaning and/or otherwise orchestrating data analysis. “Data engineers build tools, infrastructure, frameworks, and services. In smaller companies, where no data infrastructure team has yet been formalized, the data engineering role may also cover the workload around setting up and operating the organization’s data infrastructure.” (Maxime Beauchemin, Airbnb, “The Rise of the Data Engineer”) Maxime joined Facebook as a business intelligence engineer in 2011 and left as a data engineer two years later. The need for more complex, code-based ETL and changing data modeling drove the demand for data engineering. [7] Even though data engineering alone doesn’t reveal insights, it readies your data to be analyzed reliably. Without it, there’s no possibility for meaningful analysis or data science.
  • 9. In simple terms, data engineers and data scientists work together like this: Data Engineers: • Prepare data for analysis • Process raw data • Function behind the scenes • Build infrastructure to consolidate and enrich numerous data sets • Handle large-scale data processing • Monitor and maintain systems. Data Scientists: • Probe for insights • Deliver results to business users • Apply machine learning, algorithms and other analytics approaches • Uncover meaning in large amounts of data • Articulate analysis, often visually • Interpret results of analysis. When both data engineering and data science are priorities for an organization, getting more mature with data is inevitable.
  • 10. Data Maturity Goals. In considering how to become more mature with data, it can be helpful to look to practical examples of companies that have done it well. Airbnb is near the summit of the data maturity mountain. It’s reached heights most companies can’t yet fathom: heights to the tune of $3.5 billion in projected earnings in 2020, which exceeds the bottom lines of 85% of Fortune 500 companies. [8] For them, data engineering isn’t a black box; it’s cultural. [9] Access to data and the ability to contribute to business logic have been democratized. As the company’s size and reach (and number of employees) increased, so did its available data sets. Making the right data available across the organization required strategic data engineering. First, Airbnb established what they called “Core Data,” a single source of truth for everyone. To do this, they created Airflow, a workflow management system that programmatically authors, schedules and monitors dependency-based data pipelines, without running unnecessarily. This technology allows them to schedule all their data to flow to a single data-space. [10] They also built a data portal for employees, a “search and discovery tool” through which they can pull the numbers they need on their own. It puts the power of real-time data analytics into the hands of everyone working to make the company successful. Now everyday decision-makers have access to information on the spot, but at the same time, a data engineering team maintains quality control by managing data warehousing, enhancing the performance of core data infrastructure, integrating data flow between systems and tools and looking for new ways to automate their tasks. [11] Airbnb is near the summit of the data maturity mountain. With $3.5 billion in projected earnings, what do they do differently? Democratize data. How? A single source of truth that is searchable for everyone and a “Data University” to make sure everyone knows how to use it.
  • 11. Of course, even the most reliable data portal is only as good as it is useful, so the Airbnb data science team went a step further and tracked the weekly active users (WAUs) logging into the portal, then created a “Data University” with courses to teach those employees how to use the portal and mine the data it holds. [12] This has allowed the company to operate under a philosophy of data democratization, giving every employee access to up-to-date data and the power to make decisions based on that data. And all of that happens without an Airbnb data scientist in every department because each employee is empowered at a larger scale to find and use data, and they understand exactly how to do that thanks to the Data University. Now, 45% of Airbnb employees are WAUs, and that particular economy of scale has eliminated an information bottleneck and freed up the data science team to focus on the most pressing problems. Airbnb is far from the only company to understand the appeal of data democratization. Other tech giants like Facebook have pioneered the trend, but many others are jumping on board: companies like Finish Line, [13] Chobani [14] and even the government. [15] TL;DR: *Some practical steps Airbnb took to get to the summit: • Hired a data engineer • Consolidated all data in one place • Made data fully accessible • Taught their employees to query • Allowed multiple content authors • Took action based on data • Watched revenue grow. *Though this guide doesn’t get technical, if you’re wondering how data flows, Airbnb uses Apache Airflow, a workflow management system.
  • 12. Starting to Climb. Implementing a world-class culture of data engineering within your company requires scaling the data maturity mountain. If that seems daunting, take heart: remember that 96% of companies are not maximizing their data’s value. There are many points in between the base camp and the summit, and organizations can pick up and move to the next campsite anytime. The first step is determining where you stand now: 0.0 Camp Flying Blind: Data initiatives are most likely not a priority for you, which means you’re probably not reading this. 1.0 Camp Frustrated: You collect data, but probably aren’t sure how to extract actionable business intelligence from it. 2.0 Camp In Control: Here, you’re using some tools to aggregate data and likely understand how to access the information you need for your role. But you’re not totally sure it’s reliable and have no idea what other teams are doing. 3.0 Camp Activated: With connected data, you’re looking for new and relevant data sets that you can plug in for even greater insights. You’ve got basic algorithms in place and are starting to explore data science. But you’re spending more time preparing data for analytics than analyzing it. [Figure: camps 0.0 Flying Blind, 1.0 Frustrated, 2.0 In Control and 3.0 Activated, plotted against competitive advantage]
  • 13. 4.0 Camp Intelligent: At this stage, you offer data visualization in several forms across your organization and rely on predictive analytics, and maybe machine learning and artificial intelligence (AI) technology. You’re probably enabling better data science through intentional, improved data engineering. 5.0 Camp Insane (Summit): Your organization is devoted to data engineering or data science, and insights drive and define every decision you make for your business. To enable that, there is a single source of truth that is accessible to everyone. Anyone from marketers to data scientists can contribute back to business logic. If you’re not exactly sure which camp you’re in, take the 60-second self-assessment at astronomer.io/data-assessment. No matter where you’ve mapped yourself, remember: very few businesses have reached the summit of “Insane,” and few are still stuck in the doldrums “flying blind” at the zero spot, so it’s fair to assume that your business’s data strategy, and that of your biggest competitors, is somewhere in between these two extremes. And that’s a good thing; it means you can scale up whenever you like. [Figure: camps 4.0 Intelligent and 5.0 Insane Mode!]
  • 14. Next Steps. As Stephen Covey says, begin with the end in mind. If Airbnb’s culture of data engineering represents the summit, here’s a checklist of steps to getting there: • Read this guide • Commit to getting value from your data • Consider hiring a data scientist • Create a data engineering capability in your organization (this is where Astronomer can help!) • Consolidate all data in one place • Route data to give decision-makers full access • Teach them to query (if necessary) • Empower business users to contribute to core tables • Once you trust and understand the data, probe for insights • Take action • Grow your revenue! How does Astronomer fit in? The rapid, agile, secure data routing and prep required for this to-do list rely on specialized tools. For Airbnb, that’s Apache Airflow. Astronomer’s data engineering platform incorporates all the strength of Apache Airflow with all the power of Astronomer to empower teams to construct the data infrastructure they need for cross-organizational data democratization. Astronomer’s data engineering platform streamlines and amplifies your data engineering capabilities.
  • 15. Connect and Route Your Data with Astronomer. Astronomer is a data engineering platform that connects data from legacy systems, BI tools, databases and other sources, and routes it where it can be analyzed. Astronomer offers complete customizability through its use of open-source software, including Airbnb’s Apache Airflow, and offers both a library of standard data pipelines and full access for developers to write custom pipelines, defined as code. A business user can set up a standard pipe, like sending Facebook Ads to Redshift, in minutes. Or a data scientist, analyst or data engineer can author, schedule and monitor their own dependency-based data pipelines to centralize and route data from analytics tools, legacy systems, apps and more. Whatever camp you’re currently in, Astronomer meets you where you are and helps you get ahead.
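To make “dependency-based pipelines, defined as code” a little more concrete, here is a minimal sketch in plain Python. It does not use Apache Airflow or Astronomer’s platform; it only uses the standard library’s `graphlib` module to run three tasks in dependency order. The task names (`extract_ads`, `clean`, `load`), the sample rows and the in-memory `warehouse` list are all hypothetical stand-ins for a real source such as an ads API and a real destination such as a data warehouse.

```python
from graphlib import TopologicalSorter


def extract_ads():
    # Hypothetical source data; a real pipeline would call an API or database.
    return [
        {"campaign": "spring", "spend": "12.50"},
        {"campaign": "fall", "spend": "7.25"},
    ]


def clean(rows):
    # Convert the spend column from text to a number.
    return [{"campaign": r["campaign"], "spend": float(r["spend"])} for r in rows]


def load(rows, warehouse):
    # Append cleaned rows to the (in-memory, hypothetical) warehouse table.
    warehouse.extend(rows)
    return len(rows)


def run_pipeline():
    warehouse = []
    results = {}
    # Each task reads the output of the task it depends on.
    tasks = {
        "extract": lambda: extract_ads(),
        "clean": lambda: clean(results["extract"]),
        "load": lambda: load(results["clean"], warehouse),
    }
    # Map each task to the set of tasks that must finish before it.
    dependencies = {"clean": {"extract"}, "load": {"clean"}}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        results[name] = tasks[name]()
    return order, warehouse


if __name__ == "__main__":
    order, warehouse = run_pipeline()
    print(order)
    print(warehouse)
```

The point of the sketch is the shape, not the tooling: the pipeline is ordinary code, the dependencies are declared explicitly, and the execution order is derived from them. A workflow manager such as Airflow adds scheduling, retries and monitoring on top of that idea.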
  • 16. Conclusion (TL;DR) • Digital Darwinism threatens every organization. • For most companies, data is a mess. • There is a shortage of people with the skills to deal with data. • Companies that get ahead now have a serious advantage. • Getting ahead looks like: 1. making data engineering a priority; 2. consolidating data into a single source of truth; 3. democratizing data for the entire organization.
  • 17. About Astronomer. Since our beginning in 2015, we have said we are with the machines. We believe the future of work looks like machines + humans operating in their respective strengths and accomplishing more, together. By assembling a world-class team of data engineers to program machines to connect, process and route large amounts of data, we free humans up to do what they do best: analyze data to discover insights and make essential decisions. Learn more at astronomer.io or connect with us at humans@astronomer.io.
  • 18. Sources 1. “Big Prize in Amazon-Whole Foods Deal: Data” by Laura Stevens and Heather Haddon, Wall Street Journal, 2017, astrnmr.co/2uTXNdc 2. “The Value of Big Data: How analytics differentiates winners” by Rasmus Wegener and Velu Sinha, Bain & Company, 2013, astrnmr.co/2uTRE0y 3. “Big data: The Next Frontier for Innovation, Competition and Productivity” by James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh and Angela Hung Byers, McKinsey and Company, 2011, astrnmr.co/2sPDMrK 4. “Market Guide for Self-Service Data Preparation” by Rita L. Sallam et al., Gartner, 2016, astrnmr.co/2tzriSo 5. “Data Scientist: The Sexiest Job of the 21st Century” by Thomas H. Davenport and D.J. Patil, Harvard Business Review, 2012, astrnmr.co/2syVbAW 6. “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says” by Gil Press, Forbes, 2016, astrnmr.co/2uzVgWx 7. “The Rise of the Data Engineer” by Maxime Beauchemin, 2017, astrnmr.co/2uTRiqV 8. “Airbnb’s Profits to Top $3 Billion by 2020” by Leigh Gallagher, Fortune, 2017, astrnmr.co/2syKtKR 9. “Democratizing Data at Airbnb” by Chris Williams, Eli Brumbaugh, Jeff Feng, John Bodley, and Michelle Thomas, Airbnb, 2017, astrnmr.co/2uzEt5V 10. “Airflow: A Workflow Management Platform” by Maxime Beauchemin, Airbnb, 2015, astrnmr.co/2uA286c 11. “How Airbnb Democratized Data” by Olivia Timson, Innovation Enterprise, 2016, astrnmr.co/2sPjEpI 12. “How Airbnb Democratizes Data with Data University” by Jeff Feng, Erin Coffman and Elena Grewal, Airbnb, 2017, astrnmr.co/2v2hY8F 13. “The Value of Democratizing Data” by Samuel Greengard, Baseline, 2015, astrnmr.co/2vBVVcn 14. “How Data Democratization Can Deliver a Healthy Breakfast” by Errol Apostolopoulos, DataInformed, 2016, astrnmr.co/2vBsffB 15. “Democratizing Big Data to Bring Government Ahead of the Curve” by Quinton Alsbury, Wired, astrnmr.co/2vB4Uuf