Data Pipeline Architect
Data Pipelines
For small, messy and tedious data.
Vladislav Supalov, 27th October 2016
How to tell if this talk is for you?
● Big Data
○ Pretty fascinating
○ “Good problem to have”
● Most companies
○ Not quite there
○ Should not start at this level
● This talk is for you if you work close to the data at a
○ Startup
○ Growing company
○ Established company which is about to start an initiative
● Or if you are working with a new CDO, CAO or Head of BI
I want to help you achieve better results!
● What will help you to deal with …?
○ small data (not much data is needed for it to be valuable)
○ messy data (multiple data sources, no overview)
○ tedious-to-handle data (multiple data sources, lots of manual work)
● “Use <tech X> in <way Y> and you will be fine”. Nope.
○ Just dealing with data is not a magic bullet
○ This will not guarantee good results for your company
○ You might get lucky of course. That’s not a safe bet.
● How can we improve your chances? Reduce risk.
○ Focus on what matters
By jumping straight to tech, we would dive too deep, too early.
● What people tend to think about first:
○ Dashboards
○ Tools
○ Technical solutions, best practices & tricks
● That’s tactics
● We should not jump into implementation details right away.
● Let’s not.
The Craft of Designing & Building Data Pipelines
Should start with understanding the business.
Hi, I’m Vladislav!
● Data background
○ Machine learning, computer vision, data mining
● Fascination with DevOps
○ Efficient, reliable infrastructure setups
○ Monitoring, automation, processes
● Currently: Co-founding a startup - Pivii Technologies
○ Startup, accelerated by Axel Springer Plug and Play
○ Artificial intelligence for content marketing
○ AI, ML, CV, data!
○ pivii.co
● Previously: Building a data engineering consulting business
○ datapipelinearchitect.com
vsupalov
Preferred consulting situation:
● Mobile application marketing agency
○ Not necessarily huge data
○ Very valuable and worthwhile (from a certain point)
● “We built prototype analytics tools in-house and they are mostly functional”
○ “We have seen the value!”
○ But they are painful to work with and broken
○ “Time and money are still being wasted.”
● Tools were created out of an actual need
○ Organically, little planning
○ “How can we do better?”
○ “Where do we go from here?”
Common Success Pattern: Business Value was Created.
Clients have already achieved visible and measurable impact for their company.
Or they have gotten VERY close to doing so. They are thinking about ROI.
Business first. Tech follows.
● Key to successful data projects
○ Especially with limited resources
○ And small data
● Technical decisions should be informed by business needs and goals
● Handling data is a very small part of the whole
○ Straightforward once business needs are clear
● It starts with the mindset
○ Don't consider data plumbing in isolation
Key: being conscious and deliberate about the intention of
creating business value.
Let’s take a brief detour.
Consider sword fighting.
● A great samurai sword master
● 1584 - 1645
● Miyamoto Musashi
○ Martial artist
○ Tactician
○ Strategist
○ Artist
○ Sculptor
○ Calligrapher
○ Writer
○ Philosopher
○ ...
Images: Miyamoto Musashi, self-portrait, http://sv-musashi1.com/about_Musashi.htm,
Musashi Miyamoto with two Bokken, http://www.akinokai.org/images/Images.htm?Musashi.jpg
“The primary thing when you take a sword in your hands is
your intention to cut the enemy, whatever the means.”
- Miyamoto Musashi, The Book of Five Rings
“Whenever you parry, hit, spring, strike or touch the
enemy’s cutting sword, you must cut the enemy
in the same movement.”
- Miyamoto Musashi, The Book of Five Rings
“It is essential to attain this.
If you think only of hitting, springing, striking or touching
the enemy, you will not be able actually to cut him.”
- Miyamoto Musashi, The Book of Five Rings
“More than anything, you must be thinking
of carrying your movement through to cutting him.
You must thoroughly research this.”
- Miyamoto Musashi, The Book of Five Rings
The goal of sword fighting is to cut the opponent.
● Stating this makes it seem very obvious.
○ Why the effort and emphasis?
● It’s not obvious, even for aspiring practitioners.
○ Results suffer.
● Mindset is essential for mastery
● The core advice (to my understanding):
○ Attain, cultivate and apply a goal-oriented mindset
○ Aim every step you take towards the goal
Back to the world of data-handling businesses!
● When working with company data
○ Before starting out on a project
○ Understand what you want and can achieve
○ Aim to create a positive impact on the business
○ Make it a constant, conscious goal
● The main tasks for doing so are:
○ Understand the business
○ Understand the people
■ It’s about communication
○ Understand current processes
○ Be prepared to learn and revise
Use this process when approaching a new project:
● Qualify client/project
○ Does it make sense to get involved?
○ Is it evident that we can create value?
● Conduct conversations/interviews
○ Find out more about the context
■ company, status, goals, limitations...
○ Learn from first-hand experience
● Summarize information, learnings and plans in writing
○ Roadmap document
○ Depicting the situation and ways forward
Is there potential for a good fit?
Do budget, topic and goals seem in order?
Qualifying considerations. Learning about the client and project.
● What are you working on?
● What part of the project would you like help with?
● What needs to happen to make this a success for you?
● Why was this project started? What are the business goals?
● Is there an event that triggered it?
● Why especially now?
● What’s the budget? (ballpark estimate)
● When are you looking to get started?
Still good? Let’s start a business relationship.
Initial research and planning. Roadmapping consulting package.
Four people to talk to:
● Project owner
○ We want this person to be successful
● Business owner or C-level perspective
○ Knows what’s best for the business
○ “What could the CEO ask you in the hallway?”
● Data wrangler - tales from the trenches
○ Insights into day-to-day business and data details
● Engineering side
○ Current tech stack
○ Information on constraints and preferences
○ Last touches
● Conversation focus, questions and duration vary from person to person.
Interviews completed, situation understood and put into writing.
● After a bit of focused communication, we have a great foundation!
○ Project motivation
○ Business goals
○ Who should benefit
○ How to make it happen
● Different perspectives on the project and business.
● Time for tech!
○ Context clear (goals, constraints)
● Best case:
○ Very few choices left to make
Here’s what I would have told myself when starting out:
● Learn about the company
○ Easier with fresh eyes
● Understand the business
○ Multiple perspectives
● Keep the goal in mind
○ Helps you learn the right things
○ Cultivate a business mindset (help earn more/lose less)
○ Aim for results
■ I will not stop saying this anytime soon :)
● Have a process laid out
Finally: Tactical Advice Which Fits the Remaining Time.
That’s the right proportion :)
Don’t roll your own home-baked scripts.
● "Quick and easy" isn't
● Uniqueness is bad, boring is good
○ Learning curve for others
○ Original author leaving
○ Maintenance time, tricky bugs, code duplication
○ Unexpected failure modes
● Extensibility?
● Growth?
● Metadata?
You should know about workflow engines.
● Workflow = “[..] orchestrated and repeatable pattern of business activity [..]” [1]
● Data flow = “bunch of data processing tasks with inter-dependencies” [2]
● Pipelines of batch jobs
○ complex, long-running
● Dependency management
● Reusability of intermediate steps
● Logging and alerting
● Failure handling
● Monitoring
● Lots of effort went into them (Broken data? Crashes? Partial failures?)
[1] https://en.wikipedia.org/wiki/Workflow
[2] Elias Freider, 2013, “Luigi - Batch Data Processing in Python”
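To make the list above more concrete, here is a minimal sketch of what a workflow engine does at its core: it resolves dependencies, skips steps whose output already exists, and avoids leaving half-written results behind. This is a toy written for illustration only, not the API of Luigi or any real engine. The names `Task`, `build`, `Extract` and `Aggregate`, as well as the file names and sample data, are assumptions made up for this example.

```python
import tempfile
from pathlib import Path


class Task:
    """Hypothetical base class: one unit of work that produces one output file."""

    def __init__(self, workdir: Path):
        self.workdir = workdir

    def requires(self):
        # Tasks that must be complete before this one can run.
        return []

    def output(self) -> Path:
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self) -> bool:
        # A task counts as done when its output file exists.
        return self.output().exists()


def build(task, log):
    """Run `task` after its dependencies, skipping anything already complete."""
    if task.complete():
        log.append(f"skip {type(task).__name__}")
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(f"ran {type(task).__name__}")


class Extract(Task):
    def output(self):
        return self.workdir / "raw.csv"

    def run(self):
        # Stand-in for pulling data from a real source.
        self.output().write_text("user,amount\na,10\nb,5\na,7\n")


class Aggregate(Task):
    def requires(self):
        return [Extract(self.workdir)]

    def output(self):
        return self.workdir / "totals.csv"

    def run(self):
        totals = {}
        lines = self.requires()[0].output().read_text().splitlines()[1:]
        for line in lines:
            user, amount = line.split(",")
            totals[user] = totals.get(user, 0) + int(amount)
        # Write to a temporary name, then rename, so a crash mid-write
        # never leaves a partial file that looks like a finished output.
        tmp = self.output().with_suffix(".tmp")
        tmp.write_text("".join(f"{u},{t}\n" for u, t in sorted(totals.items())))
        tmp.replace(self.output())


def demo():
    with tempfile.TemporaryDirectory() as d:
        workdir = Path(d)
        first, second = [], []
        build(Aggregate(workdir), first)   # runs both steps
        build(Aggregate(workdir), second)  # nothing left to do
        totals = (workdir / "totals.csv").read_text()
    return first, second, totals
```

Calling `demo()` builds the pipeline twice. The second build does no work because the output is already there, which is the "reusability of intermediate steps" idea in miniature. Real engines add scheduling, logging, alerting and monitoring on top, which is where most of the effort mentioned above went.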
If in doubt, try Luigi.
● Spotify
○ Lots of data!
○ 10k+ Hadoop jobs every day [1]
● Battle hardened
○ Open-sourced in 2012
○ Has been used in production by large companies for a while
● Python
● Modular & extensible
● Dependency graph
● Not just for data tasks
[1] Erik Bernhardsson, 2013, “Building Data Pipelines with Python and Luigi”
Usually worthwhile pipeline properties:
● Keep it small and lean
● Make learning and iterating easy
○ Changes should be cheap to accommodate (in both time and money)
● Build something to start learning
● Get data into one place
● Don’t reinvent the wheel
○ The tools are out there
○ ETL and workflow engines
● Create quick positive results, be efficient (lazy)
○ Many small improvements everywhere
○ Instead of solving everything for one group
○ More bang-for-the-buck
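As one possible illustration of "get data into one place" while keeping things small and lean, the sketch below loads two differently shaped CSV exports into a single SQLite database and answers one business question across them. Everything specific here is an assumption for the example: the two sources, their column names, the sample rows and the cost-per-install question. Only the Python standard library is used.

```python
import csv
import io
import sqlite3

# Hypothetical exports from two tools. Note the different column names
# and delimiters, which is typical for messy small data.
ADS_CSV = "Date,Campaign,Spend\n2016-10-01,launch,120.50\n2016-10-02,launch,80.00\n"
INSTALLS_CSV = "day;campaign_name;installs\n2016-10-01;launch;40\n2016-10-02;launch;25\n"


def load_sources(conn):
    """Normalize both sources into tables with consistent column names."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ad_spend (day TEXT, campaign TEXT, spend REAL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS installs (day TEXT, campaign TEXT, installs INTEGER)"
    )
    for row in csv.DictReader(io.StringIO(ADS_CSV)):
        conn.execute(
            "INSERT INTO ad_spend VALUES (?, ?, ?)",
            (row["Date"], row["Campaign"], float(row["Spend"])),
        )
    for row in csv.DictReader(io.StringIO(INSTALLS_CSV), delimiter=";"):
        conn.execute(
            "INSERT INTO installs VALUES (?, ?, ?)",
            (row["day"], row["campaign_name"], int(row["installs"])),
        )
    conn.commit()


def cost_per_install(conn):
    """Answer one question that needs both sources: spend divided by installs."""
    query = """
        SELECT s.campaign, SUM(s.spend) / SUM(i.installs)
        FROM ad_spend AS s
        JOIN installs AS i ON i.day = s.day AND i.campaign = s.campaign
        GROUP BY s.campaign
    """
    return conn.execute(query).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    try:
        load_sources(conn)
        return cost_per_install(conn)
    finally:
        conn.close()
```

Once the sources share one store, a question that previously meant manual copy and paste between spreadsheets becomes a single query. Which question to answer first is still a business decision, not a technical one.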
In conclusion:
● Don’t dive into tactics right away
● Aim to create business value
○ Make it a conscious goal
● Understand the business, people and processes
○ This will take some time. It’s a good investment.
○ Have a process yourself
○ Tech choices will follow
● Try to make it easy to learn and iterate
● Get data in one place
● Don’t go with home-baked scripts
● Consider workflow engines
○ Luigi in particular
Thanks! Want to learn more?
“What questions to ask? Am I missing something?”
For your future interviews and planning:
I want to share my seed-question lists with you!
Just drop me your email address at:
http://datapipelinearchitect.com/datanatives/

Data pipelines for small, messy and tedious data

  • 1. Data Pipeline ArchitectData Pipeline Architect Data Pipelines For small, messy and tedious data. Vladislav Supalov, 27th October 2016
  • 2. Data Pipeline ArchitectData Pipeline Architect How to tell if this talk is for you? 2 2 ● Big Data ○ Pretty fascinating ○ “Good problem to have” ● Most companies ○ Not quite there ○ Should not start at this level ● This is for you, if you are close to the data at a ○ Startup ○ Growing company ○ Established company which is about to start an initiative ● Working with a new CDO, CAO, Head of BI
  • 3. Data Pipeline ArchitectData Pipeline Architect I want to help you achieve better results! 3 3 ● What will help you to deal with …? ○ small data (not much is needed to be valuable) ○ messy data (multiple data sources, no overview) ○ tedious-to-handle data (multiple data sources, lots of manual work) ● “Use <tech X> in <way Y> and you will be fine”. Nope. ○ Just dealing with data is not a magic bullet ○ This will not guarantee good results for your company ○ You might get lucky of course. That’s not a safe bet. ● How can we improve your chances? Reduce risk. ○ Focus on what matters
  • 4. Data Pipeline ArchitectData Pipeline Architect Jumping to tech we would dive too deep, too early. 4 4 ● What people tend to think about first: ○ Dashboards ○ Tools ○ Technical solutions, best practices & tricks ● That’s tactics ● We should not jump into implementation details right away. ● Let’s not.
  • 5. Data Pipeline ArchitectData Pipeline Architect The Craft of Designing & Building Data Pipelines Should start with understanding the business.
  • 6. Data Pipeline ArchitectData Pipeline Architect Hi, I’m Vladislav! 6 6 ● Data background ○ Machine learning, computer vision, data mining ● Fascination with DevOps ○ Efficient, reliable infrastructure setups ○ Monitoring, automation, processes ● Currently: Co-founding a startup - Pivii Technologies ○ Startup, accelerated by Axel Springer Plug and Play ○ Artificial intelligence for content marketing ○ AI, ML, CV, data! ○ pivii.co ● Previously: Building a data engineering consulting business ○ datapipelinearchitect.com vsupalov
  • 7. Data Pipeline ArchitectData Pipeline Architect Preferred consulting situation: 7 7 ● Mobile application marketing agency ○ Not necessarily huge data ○ Very valuable and worthwhile (from a certain point) ● “We built prototype analytics tools in-house and they are mostly functional” ○ “We have seen the value!” ○ But are painful to work with & broken ○ “Time and money is still being wasted.” ● Tools were created out of an actual need ○ Organically, little planning ○ “How can we do better?” ○ “Where do we go from here?”
  • 8. Data Pipeline ArchitectData Pipeline Architect Common Success Pattern: Business Value was Created. Already achieved visible and measurable impact for the company. Or have gotten VERY close to do so. Are thinking about ROI.
  • 9. Data Pipeline ArchitectData Pipeline Architect Business first. Tech follows. 9 9 ● Key to successful data projects ○ Especially with limited resources ○ And small data ● Technical decisions should be informed by business needs and goals ● Handling data is a very small part of the whole ○ Straightforward once business needs are clear ● It starts with the mindset ○ Don't consider data plumbing in isolation
  • 10. Data Pipeline ArchitectData Pipeline Architect Key: being conscious and deliberate about the intention of creating business value. Let’s take a brief detour.
  • 11. Data Pipeline ArchitectData Pipeline Architect Consider sword fighting. 11 11 ● A great samurai sword master ● 1584 - 1645 ● Miyamoto Musashi ○ Martial artist ○ Tactician ○ Strategist ○ Artist ○ Sculptor ○ Calligrapher ○ Writer ○ Philosopher ○ ... Images: Miyamoto Musashi, self-portrait, http://sv-musashi1.com/about_Musashi.htm, Musashi Miyamoto with two Bokken, http://www.akinokai.org/images/Images.htm?Musashi.jpg
  • 12. “The primary thing when you take a sword in your hands is your intention to cut the enemy, whatever the means.” - Miyamoto Musashi, The Book of Five Rings
  • 13. “Whenever you parry, hit, spring, strike or touch the enemy’s cutting sword, you must cut the enemy in the same movement.” - Miyamoto Musashi, The Book of Five Rings
  • 14. “It is essential to attain this. If you think only of hitting, springing, striking or touching the enemy, you will not be able actually to cut him.” - Miyamoto Musashi, The Book of Five Rings
  • 15. “More than anything, you must be thinking of carrying your movement through to cutting him. You must thoroughly research this.” - Miyamoto Musashi, The Book of Five Rings
  • 16. The goal of sword fighting is to cut the opponent.
    ● Stating this makes it seem very obvious.
      ○ Why the effort and emphasis?
    ● It’s not obvious, even for aspiring practitioners.
      ○ Results suffer.
    ● Mindset is essential for mastery
    ● The core advice (to my understanding):
      ○ Attain, cultivate and apply a goal-oriented mindset
      ○ Aim every step you take towards the goal
  • 17. Back to the world of data-handling businesses!
    ● When working with company data
      ○ Before starting out on a project
      ○ Understand what you want and can achieve
      ○ Aim to create a positive impact on the business
      ○ Make it a constant, conscious goal
    ● The main tasks to do so are:
      ○ Understand the business
      ○ Understand the people
        ■ It’s about communication
      ○ Understand current processes
      ○ Be prepared to learn and revise
  • 18. Use this process when approaching a new project:
    ● Qualify client/project
      ○ Does it make sense to get involved?
      ○ Is it evident that we can create value?
    ● Conduct conversations/interviews
      ○ Find out more about the context
        ■ company, status, goals, limitations...
      ○ Learn from first-hand experience
    ● Summarize information, learnings and plans in writing
      ○ Roadmap document
      ○ Depicting the situation and ways forward
  • 19. Is there potential for a good fit? Do budget, topic and goals seem in order?
  • 20. Qualifying considerations. Learning about the client and project.
    ● What are you working on?
    ● What part of the project would you like help with?
    ● What needs to happen to make this a success for you?
    ● Why was this project started? What are the business goals?
    ● Is there an event that triggered it?
    ● Why especially now?
    ● What’s the budget? (ballpark estimate)
    ● When are you looking to get started?
  • 21. Still good? Let’s start a business relationship. Initial research and planning. Roadmapping consulting package.
  • 22. Four people to talk to:
    ● Project owner
      ○ We want this person to be successful
    ● Business owner or C-level perspective
      ○ Knows what’s best for the business
      ○ “What could the CEO ask you in the hallway?”
    ● Data wrangler - tales from the trenches
      ○ Insights into day-to-day business and data details
    ● Engineering side
      ○ Current tech stack
      ○ Information on constraints and preferences
      ○ Last touches
    ● Conversation focus, questions and duration vary from person to person.
  • 23. Interviews completed, situation understood and put into writing.
    ● After a bit of focused communication, we have a great foundation!
      ○ Project motivation
      ○ Business goals
      ○ Who should benefit
      ○ How to make it happen
    ● Different perspectives on the project and business.
    ● Time for tech!
      ○ Context clear (goals, constraints)
    ● Best case:
      ○ Very few choices left to make
  • 24. Here’s what I would have told myself when starting out:
    ● Learn about the company
      ○ Easier with fresh eyes
    ● Understand the business
      ○ Multiple perspectives
    ● Keep the goal in mind
      ○ Helps you learn the right things
      ○ Cultivate a business mindset (help earn more/lose less)
      ○ Aim for results
        ■ I will not stop saying this anytime soon :)
    ● Have a process laid out
  • 25. Finally: Tactical Advice Which Fits the Remaining Time. That’s the right proportion :)
  • 26. Don’t roll your own home-baked scripts.
    ● “Quick and easy” isn't
    ● Uniqueness is bad, boring is good
      ○ Learning curve for others
      ○ Original author leaving
      ○ Maintenance time, tricky bugs, code duplication
      ○ Unexpected failure modes
    ● Extensibility?
    ● Growth?
    ● Metadata?
  • 27. You should know about workflow engines.
    ● Workflow = “[..] orchestrated and repeatable pattern of business activity [..]” [1]
    ● Data flow = “bunch of data processing tasks with inter-dependencies” [2]
    ● Pipelines of batch jobs
      ○ complex, long-running
    ● Dependency management
    ● Reusability of intermediate steps
    ● Logging and alerting
    ● Failure handling
    ● Monitoring
    ● Lots of effort went into them (Broken data? Crashes? Partial failures?)
    [1] https://en.wikipedia.org/wiki/Workflow
    [2] Elias Freider, 2013, “Luigi - Batch Data Processing in Python”
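To make the listed features concrete, here is a minimal sketch of what a workflow engine does at its core: run tasks in dependency order, log what happens, and skip downstream work when an upstream task fails. It uses only the Python standard library (`graphlib`, Python 3.9+). The function `run_pipeline`, the task names and the status strings are all hypothetical and invented for illustration. A real engine adds scheduling, retries, alerting and monitoring on top of this.

```python
import logging
from graphlib import TopologicalSorter

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("pipeline")


def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order (illustrative helper, not a library API).

    tasks:        maps a task name to a zero-argument callable.
    dependencies: maps a task name to the set of task names it needs first.
    Returns a dict mapping each task name to "done", "failed" or "skipped".
    """
    unknown = {dep for deps in dependencies.values() for dep in deps} - set(tasks)
    if unknown:
        raise ValueError(f"dependencies on unknown tasks: {sorted(unknown)}")

    graph = {name: set(dependencies.get(name, ())) for name in tasks}
    status = {}
    # static_order() yields every task after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        blocked_by = sorted(dep for dep in graph[name] if status[dep] != "done")
        if blocked_by:
            log.warning("skipping %s, upstream not done: %s", name, blocked_by)
            status[name] = "skipped"
            continue
        try:
            tasks[name]()
        except Exception:
            log.exception("task %s failed", name)
            status[name] = "failed"
        else:
            log.info("task %s done", name)
            status[name] = "done"
    return status


def extract():
    pass  # stand-in for pulling raw data


def transform():
    raise RuntimeError("broken input data")  # simulated partial failure


def load():
    pass  # stand-in for writing results


result = run_pipeline(
    {"extract": extract, "transform": transform, "load": load},
    {"transform": {"extract"}, "load": {"transform"}},
)
```

In this run `extract` succeeds, `transform` fails and `load` is skipped rather than being run on broken input. Home-baked scripts tend to get exactly this kind of partial-failure handling wrong.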
  • 28. If in doubt, try Luigi.
    ● Spotify
      ○ Lots of data!
      ○ 10k+ Hadoop jobs every day [1]
    ● Battle-hardened
      ○ Open-sourced in 2012
      ○ Has been used in production by large companies for a while
    ● Python
    ● Modular & extensible
    ● Dependency graph
    ● Not just for data tasks
    [1] Erik Bernhardsson, 2013, “Building Data Pipelines with Python and Luigi”
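Luigi tasks are Python classes that declare their upstream tasks in `requires()`, their result in `output()` and their work in `run()`. The sketch below imitates that pattern with the standard library only, so it runs without Luigi installed. The `Task` base class, the `build` function and both example tasks are hypothetical stand-ins, not Luigi's own API. It shows how an output-based dependency graph makes intermediate steps reusable: a task whose output already exists is not run again.

```python
import tempfile
from pathlib import Path


class Task:
    """Stand-in base class imitating the requires/output/run pattern."""

    def __init__(self, workdir):
        self.workdir = Path(workdir)

    def requires(self):
        return []  # upstream tasks; none by default

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        # A task counts as done when its output file exists.
        return self.output().exists()


def build(task, executed=None):
    """Run upstream tasks first, then the task, skipping completed ones."""
    executed = [] if executed is None else executed
    if task.complete():
        return executed
    for upstream in task.requires():
        build(upstream, executed)
    task.run()
    executed.append(type(task).__name__)
    return executed


class ExtractInstalls(Task):
    def output(self):
        return self.workdir / "installs.csv"

    def run(self):
        # Stand-in for fetching an export from an external data source.
        self.output().write_text("campaign,installs\nspring,120\nsummer,80\n")


class TotalInstalls(Task):
    def requires(self):
        return [ExtractInstalls(self.workdir)]

    def output(self):
        return self.workdir / "total.txt"

    def run(self):
        rows = self.requires()[0].output().read_text().splitlines()[1:]
        total = sum(int(row.split(",")[1]) for row in rows)
        self.output().write_text(str(total))


workdir = tempfile.mkdtemp()
first_run = build(TotalInstalls(workdir))   # runs extract, then total
second_run = build(TotalInstalls(workdir))  # outputs exist, nothing to do
```

With Luigi itself, the base class would be `luigi.Task`, and the engine rather than a hand-written `build` function would resolve the graph.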
  • 29. Usually worthwhile pipeline properties:
    ● Keep it small and lean
    ● Make learning and iterating easy
      ○ Changes should be cheap to accommodate (in both time and money)
    ● Build something to start learning
    ● Get data into one place
    ● Don’t reinvent the wheel
      ○ The tools are out there
      ○ ETL and workflow engines
    ● Create quick positive results, be efficient (lazy)
      ○ Many small improvements everywhere
      ○ Instead of solving everything for one group
      ○ More bang for the buck
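As one possible illustration of "get data into one place" for small data, the sketch below appends CSV exports from two sources to a single SQLite table, so that one query can answer a question across both. The choice of SQLite, the table layout, the column names and the sample figures are all assumptions made for this example. The talk does not prescribe a storage technology.

```python
import csv
import io
import sqlite3


def load_csv(conn, source, csv_text):
    """Append one CSV export to the shared table, tagged with its source."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(source, row["day"], float(row["spend"])) for row in reader]
    conn.executemany(
        "INSERT INTO spend (source, day, amount) VALUES (?, ?, ?)", rows
    )
    return len(rows)


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spend (source TEXT, day TEXT, amount REAL)")

# Made-up exports standing in for two separate data sources.
network_a = "day,spend\n2016-10-01,100.0\n2016-10-02,150.5\n"
network_b = "day,spend\n2016-10-01,40.0\n"

load_csv(conn, "network_a", network_a)
load_csv(conn, "network_b", network_b)

# One query now covers both sources.
totals = dict(
    conn.execute("SELECT day, SUM(amount) FROM spend GROUP BY day ORDER BY day")
)
```

Starting this small keeps changes cheap: adding a third source means one more `load_csv` call rather than another spreadsheet to reconcile by hand.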
  • 30. In conclusion:
    ● Don’t dive into tactics right away
    ● Aim to create business value
      ○ Make it a conscious goal
    ● Understand the business, people and processes
      ○ This will take some time. It’s a good investment.
      ○ Have a process yourself
      ○ Tech choices will follow
    ● Try to make it easy to learn and iterate
    ● Get data in one place
    ● Don’t go with home-baked scripts
    ● Consider workflow engines
      ○ Luigi in particular
  • 31. Thanks! Want to learn more? “What questions to ask? Am I missing something?” For your future interviews and planning: I want to share my seed-question lists with you! Just drop me your email address at: http://datapipelinearchitect.com/datanatives/