
Apache Airflow & Apache Spark data pipelines in the cloud

  • 1. Welcome! 4th Data Driven Rijnmond
  • 2. Program ‣ Apache Airflow & Apache Spark data pipelines in the cloud ‣ Collecting data in the food domain with apps ‣ Large-scale outlet matching and enrichment in the food service domain
  • 3. Datlinq
  • 4. Airflow & Spark in the cloud
  • 5. Data is garbage
  • 6. Data, information, knowledge, insight
  • 7. Better combined
  • 8. Cleaning data is hard
  • 9. Continuous inflow
  • 10. Apache Spark
  • 11. Decentralise & atomicise
  • 12. Apache Airflow
  • 13. Google Cloud Platform
  • 14. Demo
  • 15. Questions? Slides & links will be posted online

Editor’s notes

  1. Welcome to the 4th edition of Data Driven Rijnmond! Glad you all survived the storm and were still hyped enough to come by and listen to a talk about Airflow ;-) We are very proud to welcome you all to our new office building, and we hope it will host many more meetups in the future. For this occasion the meetup is somewhat Datlinq themed, but we will stay away from sales pitches. Tonight we would like to share some of the tools and ideas Datlinq uses, and why.
  2. As always, we’ll have both an engineering talk and a data science talk. As one of the data engineers at Datlinq, I’ll start you off with the engineering talk. After a small break, Andrew Ho, our product manager for apps, will give a short, energising talk about collecting data in the food domain with apps. Finally, our data scientist Martijn Spitters will finish the evening with a talk about outlet matching and enrichment.
  3. For these talks to make sense you probably need to know a little about what we do here at Datlinq. As promised, no sales talk, but it gives the context necessary to follow the overarching story of my colleagues’ talks and mine. Datlinq is a company that operates in the food service domain. In short: we help food service professionals by informing them with data about opportunities in this domain and by supporting them with tools. The data we use and supply is comprehensive location data for most of Europe: restaurants, coffee bars, stores, bakeries and other places that are potential outlets for food service brands. We work for brands like … and we use our data (combined with theirs) to match their brands with locations. We gather, process and enrich this location data from a wide range of online sources. So without further ado, let’s jump into this data gathering and enriching process.
  4. I want to take you on a journey of building a Spark pipeline in Google Cloud, orchestrated via Airflow. In the first half of this talk I will present slides about how we came to use Spark and Airflow in Google Cloud; in the second half I’ll try to give a live demo of what I just described. It’s OK if at this point you have no idea what these tools and systems are. I hope to explain, bottom up, what our challenges are, how we decided to solve them, and how these tools fit into the solution. Our journey starts with data.
  5. Everybody is in love with data; big data is the new oil, they say. But I’m inclined to believe that these people know as much about working with big data as I do about oil. Data in itself is completely and utterly useless. Data is garbage. One of the problems with data is that it’s stale the moment you get it. Your source says it’s new, but who says they know? There is no chain of custody, nor any indication that the data you receive is accurate, up to date or even usable. Different sources may even copy from each other, perpetuating the problem. So you store data from different sources somewhere: in some files, a database, or maybe even Hadoop. Maybe you’ll use it at some point, maybe you won’t. But with the price of storage plummeting continuously, you never throw it away. That would be wasteful… It’s not hard to get data nowadays. We use many open data sources and APIs to ingest bulk data. Think, for example, of … data, which we’ll use in the demo. The moment you get your JSON response with a like count and some detailed information, it’s dead data, with a half-life that determines its usability in the future. This data also contains information typed in by the owner, which can contain errors (wrong zip codes or misspelled street names), lies (best pizza in town) or inaccuracies (out-of-date menus and pricing). There may also be confusion caused by duplicated data, such as event locations that duplicate their location on … for each event. So the data we get from sources is in itself quite worthless.
  6. Then why work with data at all? We believe that somewhere in these mountains of data garbage some useful nuggets are hidden, which we can recycle out of the dump and turn into information. This information can be used to generate knowledge, which in turn can be used to create insights. Doing this requires huge amounts of preprocessing, cleaning and transforming of the data. In the demo I’ll show you how you can build these ETL (Extract, Transform, Load) jobs, and how a … JSON data source can be turned into a Datlinq Location: basic location information (address, geocode, phone, email, website, etc.) extended with informational tags, scores for the likelihood of existence, and classifications of certain properties. This is the first step in creating information out of this data. But, as mentioned, processed data that is inaccurate is just nicely structured data that is inaccurate. Now it’s time to improve this accuracy.
  7. The trick is that data is better combined. If we can combine data from different sources that describe the same entity, we reduce the risk of one of those sources being stale or incorrect. The more combinations we can make, the more trustworthy our data becomes, and the readier it is to be processed into information: Datlinq Locations. We call these combinations ‘crosswalks’, and one of our goals is to give every location as many crosswalks as possible. They let us gather more detailed information (some sources provide reviews, others menus, etc.), and they also serve as a verification tool for whether the location is still in business (we check these crosswalks periodically). In the demo we’ll use a different source of data that overlaps somewhat with the … data, ETL’ing it into a similar structure before combining the two.
  8. Even though the solution of combining this data seems obvious, the meticulous part is processing and cleaning it so that it is ready for combining, because with each transformation you are ‘irreversibly’ changing the data down the line. What to keep, what to change, what to merge and what to split are the hard questions. Fortunately, this is something we have been doing at Datlinq for a long time. We have a lot of experience with gathering, cleaning and matching data.
  9. So far I have not mentioned any of the tools advertised in this talk. If we have all this experience, all this data and all these great clients, why do we need any of these tools at all? The problem is that in the last few years the floodgates have opened, and data keeps pouring in from all kinds of sources into our data lake, a.k.a. data garbage dump. Our challenge was to change our semi-automatic cleaning and combining process into a fully automated one, based on machine learning, that can handle the volume and variety of data flowing through our system. It’s no longer feasible to check all this data by hand or with small scripts that run sporadically.
  10. No, we need a tool that can effortlessly process and store high volumes of data in a scalable way. The best tool on the market these days seems to be Apache Spark. Spark offers a way to distribute (map) your workload across many machines in a fault-tolerant way and combine the results back into a single data source. Spark is the engine that runs all our processes. You could build one huge monolithic Spark job that contains your entire data pipeline. Even though that seems easy, and will probably be the fastest solution, it’s a horrible idea, because a failure at the end means failure of the entire pipeline. It’s also hard to split out certain jobs so that they can run on different clusters. You are just replacing your single-threaded opaque pipeline with a distributed, scalable opaque pipeline. So the best approach is in fact to build smaller jobs.
  11. The best practice for building these Spark jobs is to build many small ones that work together, so that the output of one can be the input of another. This way you can build single-responsibility jobs that do one specific thing without worrying about the entire pipeline. You just have to define different types of Spark jobs: ETL jobs that turn raw data into a semi-structured, clean dataset; matching jobs that merge these datasets into combined datasets; enrichment jobs that turn these combined datasets into enriched data by adding new features; and ML jobs, such as classification jobs that use these features to make predictions. In the demo I’ll try to convey how you could approach this problem, but bear in mind that there are better ways; I just kept it simple for the demo. The downside of all these separate jobs is that the loose components have to be orchestrated into a single functional pipeline. In a flock of birds this happens through naturally emerging (flocking) behaviour. Unfortunately, big data tools are not ready for that (yet).
  12. In the olden days we would create huge lists of cron jobs that trigger certain jobs at certain time intervals, but this has many issues. Cron jobs run regardless of what happened in the past. They are hard to schedule, since you have to estimate how long each job takes in order to schedule the next. If one fails, the rest will keep running and probably fail too. Logging is hard, and scaling across multiple machines seems a guarantee for headaches. No, what we actually need is a tool that allows for easy composition and scheduling of complex workflows with dependencies, and that also monitors these workflows, retries a number of times in case of failure, and reports the status of each job. One such tool is Apache Airflow. Written in Python and maturing pretty fast, it meets all our requirements and has a growing plugin library that lets it work with Amazon, Azure and Google Cloud. In the demo I’ll show you how we use Spark job plugins for Airflow to trigger our individual jobs and have them depend on each other.
  13. Now that we have our jobs in a row, we need a place to run them (besides our laptop). The best solution is, in my opinion, the cloud. And yes, the cloud is just somebody else’s computer, but it offers us precisely what we need and would not get if we hosted these machines ourselves: flexibility and on-demand scalability. For your information: it is important that our data is up to date, but it’s nowhere near real time (yet), so our pipeline runs once a day for a couple of hours. In these few hours it uses massive machines to run all our jobs, but at the end (thanks to Airflow) it kills everything except the original data lake, the resulting database and the Kubernetes cluster running our (scalable) API. We only pay for used CPU cycles. Don’t get me wrong: it’s only cheaper compared to owning similar resources that would be idle 70% of the time. But in return you get many services, controllable via the CLI, out of the box. We’ll see some of them during our demo, which runs on Google Cloud (and my laptop).
  14. I’d have loved to show you the entire current pipeline as is, but that posed a couple of difficulties, mainly that the current flow would take more than the allotted time to explain, let alone comprehend. So I built a very, very simple pipeline using Spark, Airflow and Google Cloud. The idea is that you get inspired to build your own pipelines. I’ve tested it once, so what could go wrong?
  15. Questions? https://github.com/TomLous/meetup-spark-airflow-demo
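
The ETL step from note 6 (raw JSON in, a cleaned location record out) can be sketched in a few lines of plain Python. This is only an illustration, not the demo's code: the field names (`location`, `zip`, `categories`) and the helper functions are assumptions made up for the example, and a real job would run this logic inside Spark over many records.

```python
import json
import re


def normalise_postcode(value):
    """Strip whitespace and uppercase; return None when nothing usable is left."""
    cleaned = re.sub(r"\s+", "", value or "").upper()
    return cleaned or None


def to_location(raw_json):
    """Extract one raw JSON record and transform it into a flat location dict.

    All field names here are hypothetical; they stand in for whatever
    the real source returns.
    """
    raw = json.loads(raw_json)
    address = raw.get("location", {})
    return {
        "name": (raw.get("name") or "").strip(),
        "street": (address.get("street") or "").strip(),
        "postcode": normalise_postcode(address.get("zip")),
        # Keep only digits and a leading plus sign in phone numbers.
        "phone": re.sub(r"[^\d+]", "", raw.get("phone") or "") or None,
        "website": raw.get("website"),
        # De-duplicate and lowercase the owner-typed categories.
        "tags": sorted({t.strip().lower() for t in raw.get("categories", [])}),
    }


raw_record = json.dumps({
    "name": "  Pizzeria Napoli ",
    "location": {"street": "Coolsingel 1", "zip": "3011 ad"},
    "phone": "+31 (0)10 123 4567",
    "categories": ["Pizza", "italian ", "pizza"],
})
location = to_location(raw_record)
```

The point of the sketch is that each transformation is small and explicit, so it is easy to see what was kept, changed or dropped from the owner-typed input.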
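
The ‘crosswalks’ idea from note 7 can be illustrated with a deliberately naive matcher: records from different sources are grouped when they share a normalised postcode and name. This is a sketch under assumptions, not how Datlinq matches outlets (the notes mention a machine-learning approach); the source names, record fields and the exact-key rule are all hypothetical.

```python
import re
from collections import defaultdict


def match_key(record):
    """Crude matching key: postcode plus the name reduced to letters and digits."""
    name = re.sub(r"[^a-z0-9]", "", record["name"].lower())
    postcode = re.sub(r"\s+", "", record["postcode"]).upper()
    return (postcode, name)


def build_crosswalks(sources):
    """Group records from several sources that share a match key.

    `sources` maps a source name to a list of records. The result maps each
    match key to {source_name: source_id} for every source that knows the entity.
    """
    crosswalks = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            crosswalks[match_key(record)][source_name] = record["id"]
    return dict(crosswalks)


sources = {
    "source_a": [
        {"id": "a1", "name": "Pizzeria Napoli", "postcode": "3011 AD"},
        {"id": "a2", "name": "Cafe Central", "postcode": "3012AB"},
    ],
    "source_b": [
        {"id": "b7", "name": "pizzeria napoli!", "postcode": "3011ad"},
    ],
}
crosswalks = build_crosswalks(sources)

# Locations seen by more than one source are the better-verified ones.
confirmed = {key: ids for key, ids in crosswalks.items() if len(ids) > 1}
```

Here only the pizzeria ends up in `confirmed`, because two independent sources agree on it; the cafe is known to a single source and so carries more risk of being stale.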
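
Notes 11 and 12 describe what an orchestrator adds over cron: small jobs run in dependency order, failures are retried, and downstream jobs do not run when something upstream failed. The toy runner below shows just that behaviour using Python's standard `graphlib`. It is not Airflow's API and not the demo's pipeline; the job names, the `run_pipeline` function and the status strings are assumptions chosen for illustration.

```python
from graphlib import TopologicalSorter


def run_pipeline(jobs, dependencies, max_retries=2):
    """Run jobs in dependency order, retrying failures and skipping downstream jobs.

    `jobs` maps a job name to a zero-argument callable; `dependencies` maps a
    job name to the set of jobs that must succeed before it may run.
    """
    status = {}
    for name in TopologicalSorter(dependencies).static_order():
        # Unlike cron, do not start a job whose inputs were never produced.
        if any(status.get(dep) != "success" for dep in dependencies.get(name, ())):
            status[name] = "upstream_failed"
            continue
        for _attempt in range(1 + max_retries):
            try:
                jobs[name]()
                status[name] = "success"
                break
            except Exception:
                status[name] = "failed"
    return status


attempts = {"etl_b": 0}


def flaky_etl_b():
    """Fails on the first attempt only, like a temporary source outage."""
    attempts["etl_b"] += 1
    if attempts["etl_b"] == 1:
        raise RuntimeError("temporary source outage")


def broken_enrich():
    """Always fails, so every retry is used up."""
    raise ValueError("bad input")


jobs = {
    "etl_a": lambda: None,
    "etl_b": flaky_etl_b,
    "match": lambda: None,
    "enrich": broken_enrich,
    "classify": lambda: None,
}
dependencies = {
    "etl_a": set(),
    "etl_b": set(),
    "match": {"etl_a", "etl_b"},
    "enrich": {"match"},
    "classify": {"enrich"},
}
status = run_pipeline(jobs, dependencies)
```

In this run the flaky ETL job succeeds on its retry, the enrichment job exhausts its retries, and the classification job is never started. A real orchestrator such as Airflow adds scheduling, logging, monitoring and notifications on top of this basic ordering and retry logic.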