Data pipelines from zero
Lars Albertsson
Data architect @ Schibsted
www.mapflat.com
Who’s talking?
Swedish Institute of Computer Science (test tools)
Sun Microsystems (very large machines)
Google (Hangouts, productivity)
Recorded Future (NLP startup)
Cinnober Financial Tech. (trading systems)
Spotify (data processing & modelling)
Schibsted (data processing & modelling)
Presentation goals
Overview of data pipelines for analytics / data products
Target audience: Big data starters
Overview of necessary components
Base recipe
In vicinity of state-of-practice
Baseline for comparing design proposals
Subjective best practices
Technology suggestions, (alternatives)
Data product anatomy
[Diagram: Service and DB sources -> Ingress -> Cluster storage (Unified log, ETL pipeline of Jobs and Datasets) -> Egress -> DB, Service, Export, Business intelligence]
Event collection
Unified log: immutable events, append-only, source of truth
[Diagram: Service -> unreliable hops -> reliable, write-available log: Kafka (Kinesis, Google Pub/Sub) -> Secor, Camus -> Cluster storage: HDFS (NFS, S3, Google CS, C*)]
Immediate handoff to append-only replicated log.
Don’t manipulate, shuffle, sort, demux. Add timestamps.
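The handoff rule above can be sketched as follows. This is a minimal illustration, assuming a local append-only file stands in for the replicated log (Kafka in the slide); `collect_event`, `read_log`, and the `received_at` field are hypothetical names, not part of any tool mentioned here.

```python
import json
import time
from pathlib import Path


def collect_event(log_path: Path, raw_event: dict, now=time.time) -> dict:
    """Wrap the event untouched, add a collection timestamp, append one line."""
    envelope = {"received_at": now(), "event": raw_event}
    # Append mode only: existing lines are never rewritten or reordered.
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(envelope, sort_keys=True) + "\n")
    return envelope


def read_log(log_path: Path) -> list:
    """Read events back in arrival order."""
    with log_path.open(encoding="utf-8") as log:
        return [json.loads(line) for line in log]
```

The collector does no filtering, sorting, or demultiplexing; any such work is left to downstream jobs that read the log.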
Database state collection
Do: Read snapshots, event conversion tools
(Aegisthus, Bottled Water)
Careful: Dump replicated slave
Don’t: Use API, dump live master
[Diagram: Service, DB, DB backup; Cluster storage: HDFS (NFS, S3, Google CS, C*)]
Datasets
hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/_SUCCESS
part-00000.json
part-00001.json
Hadoop + Hive name conventions
Instance = class + parameters, same schema
Immutable
Path anatomy:
red: privacy level
pageviews: dataset class
v1: schema version
country=se/year=2015/month=11/day=4: instance parameters, Hive convention
_SUCCESS: seal
part-*.json: partitions
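The naming convention can be sketched in code. This is an illustration only: it assumes the first path component encodes the privacy level, omits the `hdfs://` scheme, and checks the seal against a plain set of paths rather than a real file system. The function names are hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath


def dataset_instance_path(privacy_level: str, dataset_class: str, schema_version: int,
                          country: str, day: date) -> PurePosixPath:
    """Build the directory of one dataset instance using Hive-style key=value parts."""
    return PurePosixPath(
        privacy_level,
        dataset_class,
        f"v{schema_version}",
        f"country={country}",
        f"year={day.year}",
        f"month={day.month}",
        f"day={day.day}",
    )


def is_sealed(listing: set, instance: PurePosixPath) -> bool:
    """An instance counts as complete only once its _SUCCESS marker exists."""
    return str(instance / "_SUCCESS") in listing
```

Readers that check the seal before consuming an instance never pick up a half-written one.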
Pipelines
Dataset “build system”
Input will be missing
Jobs will fail
Jobs will have bugs
Dataset =
function([inputs], code)
Deterministic, idempotent
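A minimal sketch of a deterministic, idempotent job, assuming local files stand in for cluster storage and line-delimited JSON for the dataset format. `build_dataset` and its parameters are hypothetical; the write-then-rename step is one common way to get all-or-nothing output, not something the slide prescribes.

```python
import json
import os
import tempfile
from pathlib import Path


def build_dataset(inputs: list, output_dir: Path, transform) -> bool:
    """Build output_dir from inputs; return False when it was already sealed."""
    if (output_dir / "_SUCCESS").exists():
        return False  # Idempotent: a sealed instance is never rebuilt in place.
    records = []
    for path in sorted(inputs):  # Fixed input order keeps the output deterministic.
        with open(path, encoding="utf-8") as f:
            records.extend(json.loads(line) for line in f)
    output_dir.parent.mkdir(parents=True, exist_ok=True)
    # Write into a scratch directory, then rename, so readers never see partial output.
    scratch = Path(tempfile.mkdtemp(dir=output_dir.parent))
    with open(scratch / "part-00000.json", "w", encoding="utf-8") as out:
        for record in transform(records):
            out.write(json.dumps(record, sort_keys=True) + "\n")
    (scratch / "_SUCCESS").touch()
    os.rename(scratch, output_dir)
    return True
```

Because the output depends only on the inputs and the code, a failed or buggy run can be fixed by deleting the instance and running the same job again.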
[Diagram: Cluster storage: Unified log -> pristine, immutable datasets -> intermediate datasets -> derived, regenerable datasets]
Workflow manager
Luigi, (Airflow, Oozie)
Dataset “build tool”
Build when input is available
Backfill for previous failures
Rebuild for bugs
=> Eventual correctness
DSL describes DAG
Includes egress
Data retention, privacy audit
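The "build tool" behaviour can be sketched with the standard library alone. This is a stand-in written for illustration and is not the API of Luigi, Airflow, or Oozie; `Task`, `build`, and the `existing` set that models stored outputs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Task:
    """One node in the DAG: produces `output` once every required task has output."""
    output: str
    run: Callable[[], None]
    requires: List["Task"] = field(default_factory=list)


def build(task: Task, existing: Set[str], built: List[str]) -> bool:
    """Build missing outputs depth-first; return True when task.output exists afterwards."""
    if task.output in existing:
        return True  # Already built: nothing to do.
    # Evaluate every dependency, so independent branches still make progress.
    ready = [build(dep, existing, built) for dep in task.requires]
    if not all(ready):
        return False  # Input missing: retry on the next scheduling round.
    try:
        task.run()
    except Exception:
        return False  # Job failed: output stays absent and is retried later.
    existing.add(task.output)
    built.append(task.output)
    return True
```

Calling `build` repeatedly on the same target is what gives eventual correctness: each round fills in whatever has become buildable, and a rebuild after a bug fix amounts to removing the affected outputs from storage and calling it again.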
Batch processing MVP
Start simple, lean, end-to-end, without Hadoop/Spark
Serial jobs on pool of machines + work queue
Downsample to fit one machine if necessary
(Local Spark, Scalding, Crunch, Akka reactive streams)
Get end-to-end workflows in production for trial
Integration test end-to-end semantics
Ensure developer productivity - code/test cycle
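One way to downsample so that a job fits on one machine is to hash a record key and keep a fixed share of the keys. The slide does not say how to downsample; key-based hashing is an assumption made here, and `keep_in_sample` and `downsample` are hypothetical names.

```python
import hashlib


def keep_in_sample(key: str, percent: int) -> bool:
    """Deterministically keep roughly `percent` % of keys, by hashing the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def downsample(records, key_field: str, percent: int):
    """Keep every record whose key falls in the sample, so per-key history stays whole."""
    return [r for r in records if keep_in_sample(str(r[key_field]), percent)]
```

Sampling by key rather than by record means the same users appear in every dataset, so joins between downsampled datasets still match.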
Processing at scale
Parallelise jobs only when forced to do so
Spark, (Hadoop + Scalding / Crunch)
Avoid: Vanilla MapReduce, non-JVM
Most jobs fit on a single machine
Big complexity + performance win
Schemas
Storage formats: JSON, Avro, Parquet, Protobuf, Thrift
There is always a schema, implicit or explicit
Schema on read
Dynamic typing, quick schema changes
Schema on write
Static typing possible
Use schema on read for analytics.
Incompatible change? New dataset class.
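Schema on read can be sketched as a reader that applies its own schema to loosely typed records. The `PAGEVIEW_V1` schema and its field names are invented for illustration, and coercing field values with the Python type is only one possible reading policy.

```python
import json

# Hypothetical reader schema for a "pageviews" dataset, schema version 1.
PAGEVIEW_V1 = {"user": str, "url": str, "duration_ms": int}


def read_with_schema(line: str, schema: dict) -> dict:
    """Parse one JSON record and apply the reader's schema at read time."""
    raw = json.loads(line)
    record = {}
    for name, typ in schema.items():
        if name not in raw:
            raise ValueError(f"missing field: {name}")
        record[name] = typ(raw[name])  # Coerce to the type the reader expects.
    return record  # Fields unknown to the reader are ignored, so writers can add fields.
```

Added fields pass through unnoticed, which is the quick-change benefit. A removed or retyped field breaks this reader, which is the case where a new dataset class (for example a `v2` path) keeps old and new readers apart.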
Egress datasets
Serving
Cassandra, denormalised
Export & Analytics
SQL
Workbenches (Zeppelin)
(Elasticsearch, proprietary OLAP)
Parting words
Keep things simple. Batch, few components & little state.
Don’t drop incoming data.
Focus on developer code, test, debug cycle - end to end.
Expect, tolerate human error.
Harmony with technical ecosystems - follow tech leaders.
Scalability only when necessary.
Plan early: Privacy, retention, audit, schema evolution.
Bonus slides
Cloud or not?
+ Operations
+ Security
+ Responsive scaling
- Development workflows
- Privacy
- Vendor lock-in
Data pipelines example
[Diagram: raw datasets (Users, Page views, Sales) -> derived datasets (Views with demographics, Sales with demographics, Sales reports, Conversion analytics)]
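One edge of such a pipeline, deriving "views with demographics" from the raw page views and users, might look like the sketch below. The field names (`user_id`, `age`, `country`) and the choice to drop views from unknown users are assumptions made for illustration.

```python
def views_with_demographics(page_views, users):
    """Join each page view with the viewing user's demographic fields."""
    by_id = {u["user_id"]: u for u in users}
    joined = []
    for view in page_views:
        user = by_id.get(view["user_id"])
        if user is None:
            continue  # Sketch choice: drop views from unknown users.
        joined.append({**view, "age": user["age"], "country": user["country"]})
    return joined
```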
Data pipelines team organisation
Form teams that are driven by business cases & need
Forward-oriented -> filters implicitly applied
Beware of: duplication, tech chaos/autonomy, privacy loss
Conway’s law
“Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.”
Better organise to match desired design, then.

Data ops in practice
 

Recently uploaded

Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 

Recently uploaded (20)

Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 

Data pipelines from zero

  • 1. Data pipelines from zero Lars Albertsson Data architect @ Schibsted www.mapflat.com
  • 2. Who’s talking? Swedish Institute of Computer Science (test tools) Sun Microsystems (very large machines) Google (Hangouts, productivity) Recorded Future (NLP startup) Cinnober Financial Tech. (trading systems) Spotify (data processing & modelling) Schibsted (data processing & modelling)
  • 3. Presentation goals Overview of data pipelines for analytics / data products Target audience: Big data starters Overview of necessary components Base recipe In vicinity of state-of-practice Baseline for comparing design proposals Subjective best practices Technology suggestions, (alternatives)
  • 4. Data product anatomy Cluster storage Ingress Unified log ETL Egress DB DB DB Service Dataset Job Pipeline Service Export Business intelligence
  • 5. Event collection Cluster storage HDFS (NFS, S3, Google CS, C*) Unified log Immutable events Append-only Source of truth Service Unreliable Unreliable Reliable, write available Kafka (Kinesis, Google Pub/Sub) Secor, Camus Immediate handoff to append-only replicated log. Don’t manipulate, shuffle, sort, demux. Add timestamps.
  • 6. Database state collection Do: Read snapshots, event conversion tools (Aegisthus, Bottled Water) Careful: Dump replicated slave Don’t: Use API, dump live master Cluster storage HDFS (NFS, S3, Google CS, C*) Service DB DB backup Service
  • 7. Datasets hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/_SUCCESS part-00000.json part-00001.json Hadoop + Hive name conventions Instance = class + parameters, same schema Immutable Dataset class Instance parameters, Hive convention Seal Partitions Privacy level Schema version
  • 8. Pipelines Dataset “build system” Input will be missing Jobs will fail Jobs will have bugs Dataset = function([inputs], code) Deterministic, idempotent Cluster storage Unified log Pristine, immutable datasets Intermediate Derived, regenerable
  • 9. Workflow manager Luigi, (Airflow, Oozie) Dataset “build tool” Build when input is available Backfill for previous failures Rebuild for bugs => Eventual correctness DSL describes DAG Includes egress Data retention, privacy audit DB
  • 10. Batch processing MVP Start simple, lean, end-to-end, without Hadoop/Spark Serial jobs on pool of machines + work queue Downsample to fit one machine if necessary (Local Spark, Scalding, Crunch, Akka reactive streams) Get end-to-end workflows in production for trial Integration test end-to-end semantics Ensure developer productivity - code/test cycle
  • 11. Processing at scale Parallelise jobs only when forced to do so Spark, (Hadoop + Scalding / Crunch) Avoid: Vanilla MapReduce, non-JVM Most jobs fit in single machine Big complexity + performance win
  • 12. Schemas Storage formats: JSON, Avro, Parquet, Protobuf, Thrift There is always a schema, implicit or explicit Schema on read Dynamic typing, quick schema changes Schema on write Static typing possible Use schema on read for analytics. Incompatible change? New dataset class.
  • 13. Egress datasets Serving Cassandra, denormalised Export & Analytics SQL Workbenches (Zeppelin) (Elasticsearch, proprietary OLAP)
  • 14. Parting words Keep things simple. Batch, few components & little state. Don’t drop incoming data. Focus on developer code, test, debug cycle - end to end. Expect, tolerate human error. Harmony with technical ecosystems - follow tech leaders. Scalability only when necessary. Plan early: Privacy, retention, audit, schema evolution.
  • 16. Cloud or not? + Operations + Security + Responsive scaling - Development workflows - Privacy - Vendor lock-in
  • 17. Data pipelines example Raw: Users, Page views, Sales Derived: Views with demographics, Sales with demographics, Sales reports, Conversion analytics
  • 18. Data pipelines team organisation Form teams that are driven by business cases & need Forward-oriented -> filters implicitly applied Beware of: duplication, tech chaos/autonomy, privacy loss
  • 19. Conway’s law “Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.” Better organise to match desired design, then.
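
The dataset and pipeline ideas on slides 7 to 9 (Hive-style instance paths, a _SUCCESS seal, Dataset = function([inputs], code), build when input is available) can be sketched in a few lines of Python. This is a minimal illustration, not the talk's implementation: it uses the local filesystem instead of HDFS, assumes part files hold one JSON record per line, and the names instance_path, is_sealed and build are hypothetical helpers invented for this example. A workflow manager such as Luigi would supply the scheduling and retry logic that is left out here.

```python
import json
import tempfile
from pathlib import Path


def instance_path(root, dataset, version, **params):
    """Dataset instance path: class, schema version, then Hive-style key=value parameters."""
    parts = [f"{key}={value}" for key, value in params.items()]
    return Path(root, dataset, f"v{version}", *parts)


def is_sealed(path):
    """An instance counts as complete only once its _SUCCESS marker exists."""
    return (path / "_SUCCESS").exists()


def build(output, inputs, transform):
    """Build one dataset instance; return True if it is sealed afterwards."""
    if is_sealed(output):
        return True   # idempotent: a sealed instance is never rewritten
    if not all(is_sealed(p) for p in inputs):
        return False  # input is missing: the caller retries later
    records = []
    for p in inputs:
        # Assumption: each part file holds one JSON record per line.
        for part in sorted(p.glob("part-*.json")):
            with part.open() as f:
                records.extend(json.loads(line) for line in f)
    output.mkdir(parents=True, exist_ok=True)
    with (output / "part-00000.json").open("w") as f:
        for record in transform(records):
            f.write(json.dumps(record, sort_keys=True) + "\n")
    (output / "_SUCCESS").touch()  # seal last, so readers never see partial output
    return True


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    day = dict(country="se", year=2015, month=11, day=4)
    views = instance_path(root, "pageviews", 1, **day)
    counts = instance_path(root, "viewcount", 1, **day)
    print(build(counts, [views], lambda rs: [{"views": len(rs)}]))
```

Run as a script, the call reports that nothing was built, because the pageviews instance for that day has not been sealed yet. Calling build again after the input arrives produces the output once, and later calls leave the sealed instance untouched, which is what makes backfills and retries safe to repeat.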