Data Vault Automation at
de Bijenkorf
PRESENTED BY
ROB WINTERS
ANDREI SCORUS
Presentation agenda
◦ Project objectives
◦ Architectural overview
◦ The data warehouse data model
◦ Automation in the data warehouse
◦ Successes and failures
◦ Conclusions
About the presenters
Rob Winters
Head of Data Technology, the Bijenkorf
Project role:
◦ Project Lead
◦ Systems architect and administrator
◦ Data modeler
◦ Developer (ETL, predictive models, reports)
◦ Stakeholder manager
◦ Joined project September 2014
Andrei Scorus
BI Consultant, Incentro
Project role:
◦ Main ETL Developer
◦ Modeling support
◦ Source system expert
◦ Joined project November 2014
Project objectives
◦ Information requirements
  ◦ Have one place as the source for all reports
  ◦ Security and privacy
  ◦ Information management
  ◦ Integrate with production
◦ Non-functional requirements
  ◦ System quality
  ◦ Extensibility
  ◦ Scalability
  ◦ Maintainability
  ◦ Security
  ◦ Flexibility
  ◦ Low cost
Technical Requirements
• One environment to quickly generate customer insights
• Then feed those insights back to production
• Then measure the impact of those changes in near real time
Source system landscape
Source Type | # of Sources | Examples | Load Frequency | Data Structure
Oracle DB | 2 | Virgo ERP | 2x/hour | Partial 3NF
MySQL | 3 | Product DB, Web Orders, DWH | 10x/hour | 3NF (Web Orders), improperly normalized
Event bus | 1 | Web/email events | 1x/minute | Tab-delimited with JSON fields
Webhook | 1 | Transactional emails | 1x/minute | JSON
REST APIs | 5+ | GA, DotMailer | 1x/hour to 1x/day | JSON
SOAP APIs | 5+ | AdWords, Pricing | 1x/day | XML
Architectural overview
Tools
AWS
◦ S3
◦ Kinesis
◦ Elasticache
◦ Elastic Beanstalk
◦ EC2
◦ DynamoDB
Open Source
◦ Snowplow Event Tracker
◦ Rundeck Scheduler
◦ Jenkins Continuous Integration
◦ Pentaho PDI
Other
◦ HP Vertica
◦ Tableau
◦ Github
◦ RStudio Server
DWH internal architecture
• Traditional three-tier DWH
• ODS generated automatically from staging
• Ops mart reflects data in its original source form, helping offload queries from source systems
• Business marts materialized exclusively from the vault
Bijenkorf Data Vault overview
Data volumes
• ~1 TB base volume
• 10-12 GB daily
• ~250 source tables
Aligned to Data Vault 2.0
• Hash keys
• Hashes used for CDC
• Parallel loading
• Maximum utilization of available resources
• Data loaded unchanged into the vault
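The DV 2.0 conventions above (hash keys, hashes used for CDC) can be sketched as follows. This is an illustrative Python sketch, not the Bijenkorf implementation; the MD5 digest, the `||` delimiter, and the trim/uppercase normalization are assumptions.

```python
import hashlib

def hash_key(*business_keys: str) -> str:
    """Build a deterministic hub hash key from one or more business keys
    (trimmed and upper-cased before hashing, a common DV 2.0 convention)."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def hash_diff(attributes: dict) -> str:
    """Hash all descriptive attributes of a satellite row; a changed hash
    signals a new satellite version, so CDC needs no column-by-column diff."""
    payload = "||".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# A row only produces a new satellite record when its hash diff changes.
old = hash_diff({"name": "J. Jansen", "city": "Amsterdam"})
new = hash_diff({"name": "J. Jansen", "city": "Rotterdam"})
assert hash_key(" cust-42 ") == hash_key("CUST-42")  # normalization is stable
assert old != new                                    # change detected by hash
```

Because both sides of a comparison are fixed-length hashes, change detection and parallel loading stay cheap regardless of how wide the satellite is.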
Some statistics
18 hubs
• 34 loading scripts
27 links
• 43 loading scripts
39 satellites
• 43 loading scripts
13 reference tables
• 1 script per table
Model contains
• Sales transactions
• Customer and corporate
locations
• Customers
• Products
• Payment methods
• E-mail
• Phone
• Product grouping
• Campaigns
• deBijenkorf card
• Social media
Excluded from the vault
◦ Event streams
◦ Server logs
◦ Unstructured data
Deep dive: Transactions in DV
Deep dive: Customers in DV
•Same as link on customer
Challenges encountered during data modeling
Source issues
◦ Issue: source systems and original data were unavailable for most information; data was often transformed 2-4 times before access was available; business keys (e.g. SKU) were typically replaced with sequences
◦ Resolution: business keys rebuilt in staging prior to vault loading

Modeling returns
◦ Issue: retail returns can appear in the ERP in 1-3 ways across multiple tables with inconsistent keys; online returns appear as a state change on the original transaction and may or may not appear in the ERP
◦ Resolution: the original model showed sale state on the line-item satellite; the revised model recorded “negative sale” transactions and used a new link to connect to the original sale when possible

Fragmented knowledge
◦ Issue: information about the systems was held by multiple people, and documentation was out of date
◦ Resolution: talking to as many people as possible and testing hypotheses on the data
Targeted benefits of DWH automation
◦ Speed of development: integrating new sources, or new data from existing sources, takes 1-2 steps; adding a new vault dependency takes one step
◦ Simplicity: five jobs handle all ETL processes across the DWH
◦ Traceability: every record and source file is traced in the database, and every row in the ODS is automatically identified by source file
◦ Code simplification: most common key definitions replaced with dynamic variable replacement
◦ File management: every source file is automatically archived to Amazon S3 in appropriate locations sorted by source, table, and date; entire source systems, periods, etc. can be replayed in minutes
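The file-management benefit falls out of the archive key layout: sorting by source, table, and date means a replay is just a prefix listing. A minimal sketch; the `archive/` prefix and exact key format are assumptions, not the actual bucket structure.

```python
from datetime import date

def archive_key(source: str, table: str, file_name: str, load_date: date) -> str:
    """Compose an S3 object key sorted by source, table, and date, so an
    entire source system or period can be replayed by listing one prefix."""
    return f"archive/{source}/{table}/{load_date:%Y/%m/%d}/{file_name}"

key = archive_key("OMS", "customer", "OMS_CUSTOMER_20150401.tsv", date(2015, 4, 1))
# e.g. 'archive/OMS/customer/2015/04/01/OMS_CUSTOMER_20150401.tsv'
```

Replaying all of OMS for April 2015 then amounts to listing everything under `archive/OMS/` with a `2015/04` date segment and re-feeding those files to the staging loader.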
Source loading automation
◦ Loader design focused on process abstraction, traceability, and minimizing “moving parts”
◦ The final process consists of two base jobs working in tandem: one generates incremental extracts from source systems, the other loads flat files from all sources into staging tables
◦ Replication was desired but rejected due to limited access to source systems
Workflow of source integration:
1. Source tables duplicated in staging, with loadTs and sourceFile columns added
2. Metadata for the source file added
3. Loader automatically generates the ODS and begins tracking source files for duplication and data quality
4. Query generator automatically executes a full duplication on the first execution and incrementals afterward
CREATE TABLE stg_oms.customer
(
customerId int
, customerName varchar(500)
, customerAddress varchar(5000)
, loadTs timestamp NOT NULL
, sourceFile varchar(255) NOT NULL
)
ORDER BY customerId
PARTITION BY date(loadTs)
;
INSERT INTO meta.source_to_stg_mapping
(targetSchema, targetTable, sourceSystem, fileNamePattern, delimiter, nullField)
VALUES
('stg_oms','customer','OMS','OMS_CUSTOMER','TAB','NULL')
;
Example: add an additional table from an existing source
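The query generator's full-then-incremental behavior can be sketched as below. `sourceSchema`, `sourceTable`, and `updatedTs` are hypothetical names for illustration, not the actual columns of `meta.source_to_stg_mapping`.

```python
from typing import Optional

def extract_query(meta: dict, last_load_ts: Optional[str]) -> str:
    """Generate the extract for one metadata row: a full duplication on the
    first execution, an incremental extract on every run after that."""
    base = f"SELECT * FROM {meta['sourceSchema']}.{meta['sourceTable']}"
    if last_load_ts is None:
        return base  # first run: full copy of the source table
    # later runs: only rows changed since the last successful load
    return f"{base} WHERE updatedTs > '{last_load_ts}'"

meta = {"sourceSchema": "oms", "sourceTable": "customer"}
full = extract_query(meta, None)
delta = extract_query(meta, "2015-04-01 00:00:00")
```

Because the generator only reads metadata rows, adding a table really is just the two statements above: create the staging table, insert the mapping row.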
Vault loading automation
◦ Loader is fully metadata driven, with a focus on horizontal scalability and management simplicity
◦ To support speed of development and performance, variable-driven SQL templates are used throughout

1. All staging tables checked for changes
• New sources automatically added
• Last-change epoch based on load stamps, advanced each time all dependencies execute successfully
2. List of dependent vault loads identified
• Dependencies declared at time of job creation
• Load prioritization possible but not utilized
3. Loads planned in hub, link, sat order
• Jobs parallelized across tables but serialized per job
• Dynamic job queueing ensures appropriate execution order
4. Loads executed
• Variables automatically identified and replaced
• Each load records performance statistics and error messages
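The hub, link, sat ordering can be sketched as a tiered plan: everything in one tier may run in parallel, and a tier only starts once the previous one finishes. A minimal illustration assuming a simple (name, kind) job list rather than the actual metadata tables.

```python
from collections import defaultdict

# Loads are planned in hub -> link -> satellite order; within one tier,
# jobs can run in parallel across tables.
TIER = {"hub": 0, "link": 1, "sat": 2}

def plan_loads(jobs):
    """Group vault load jobs (name, kind) into execution waves:
    all hubs first, then links, then satellites."""
    waves = defaultdict(list)
    for name, kind in jobs:
        waves[TIER[kind]].append(name)
    return [sorted(waves[t]) for t in sorted(waves)]

jobs = [("sat_customer", "sat"), ("hub_customer", "hub"),
        ("lnk_order_customer", "link"), ("hub_order", "hub")]
plan = plan_loads(jobs)
# hubs first, then the link, then the satellite
```

This ordering is what makes hash keys pay off: links and satellites can be loaded as soon as their tier opens, without waiting on sequence lookups against the hubs.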
Design goals for mart loading automation
Requirement | Solution | Benefit
Simple, standardized models | Metadata-driven Pentaho PDI | Easy development using parameters and variables
Easily extensible | Plugin framework | Rapid integration of new functionality
Rapid new job development | Recycle standardized jobs and transformations | Limited moving parts, easy modification
Low administration overhead | Leverage built-in logging and tracking | Mart loading reporting easily integrated with other ETL reports
Information mart automation flow
1. Retrieve commands
• Each dimension and fact is processed independently
2. Get dependencies
• Based on the defined transformation, get all related vault tables: links, satellites, or hubs
3. Retrieve changed data
• From the related tables, build a list of unique keys that have changed since the last update of the fact or dimension
• Store the data in the database until further processing
4. Execute transformations
• Multiple Pentaho transformations can be processed per command using the data captured in previous steps
5. Maintenance
• Logging happens throughout the whole process
• Cleanup after all commands have been processed
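The changed-key collection in step 3 can be sketched as follows; an illustrative sketch assuming each vault table is reduced to (hash key, load timestamp) pairs, not the real table layout.

```python
def changed_keys(vault_tables, last_update):
    """From all vault tables a fact or dimension depends on, collect the
    distinct keys whose load timestamp is newer than the mart's last update."""
    keys = set()
    for rows in vault_tables:  # each table: [(hash_key, load_ts), ...]
        keys.update(hk for hk, load_ts in rows if load_ts > last_update)
    return keys

sat_rows = [("a1", 10), ("b2", 25)]   # satellite the dimension depends on
lnk_rows = [("a1", 5), ("c3", 30)]    # link the dimension depends on
touched = changed_keys([sat_rows, lnk_rows], last_update=20)
# only keys changed after the last mart update are reprocessed
```

Restricting the Pentaho transformations to this key set is what keeps incremental mart refreshes cheap relative to rebuilding each dimension in full.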
Primary uses of Bijenkorf DWH
Customer analysis
• Provided the first unified data model of customer activity
• 80% reduction in unique customer keys
• Allowed segmentation of customers based on a combination of in-store and online activity

Personalization
• DV drives the recommendation engine and customer recommendations (updated nightly)
• Data pipeline supports near-real-time updating of customer recommendations based on web activity

Business intelligence
• DV-based marts replace joins across dozens of tables from multiple sources with single facts/dimensions
• IT-driven reporting being replaced with self-service BI
Biggest drivers of success
AWS infrastructure
• Cost: entire infrastructure for less than one server in the data center
• Toolset: most services available off the shelf, minimizing administration
• Freedom: no dependency on IT for development support
• Scalability: systems automatically scaled to match DWH demands

Automation
• Speed: enormous time savings after the initial investment
• Simplicity: able to run and monitor 40k+ queries per day with minimal effort
• Auditability: enforced tracking and archiving without developer involvement

PDI framework
• Ease of use: adding new commands takes at most 45 minutes
• Agile: building the framework took 1 day
• Low profile: average memory usage of 250 MB
Biggest mistakes along the way
Reliance on documentation and requirements over expert users
• Initial integration design was based on provided documentation and models, which were rarely accurate
• Current users of the sources should have been engaged earlier to explain undocumented caveats

Late utilization of templates and variables
• Variables were adopted late in development, slowing progress significantly and creating consistency issues
• Good initial design of templates will significantly reduce development time in the mid/long run

Aggressive overextension of resources
• We attempted to design and populate the entire data vault before focusing on customer deliverables like reports (in addition to other projects)
• We have shifted focus to continuous release of new information rather than waiting for completeness
Primary takeaways
◦ Sources are like cars: the older they are, the more idiosyncrasies they have. Be cautious with design automation!
◦ Automation can enormously simplify and accelerate data warehousing. Don't be afraid to roll your own.
◦ Balance stateful versus stateless, and monolithic versus fragmented, architecture design.
◦ A cloud-based architecture built on column-store DBs is extremely scalable, cheap, and highly performant.
◦ A successful vault can create a new problem: getting IT to think about business processes rather than system keys!
Rob Winters
WintersRD@gmail.com
Andrei Scorus
andrei.scorus@incentro.com

Weitere ähnliche Inhalte

Was ist angesagt?

How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
Amr Awadallah
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
DataWorks Summit
 

Was ist angesagt? (20)

How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Big Data 2.0: ETL & Analytics: Implementing a next generation platform
Big Data 2.0: ETL & Analytics: Implementing a next generation platformBig Data 2.0: ETL & Analytics: Implementing a next generation platform
Big Data 2.0: ETL & Analytics: Implementing a next generation platform
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
 
Agile Methods and Data Warehousing (2016 update)
Agile Methods and Data Warehousing (2016 update)Agile Methods and Data Warehousing (2016 update)
Agile Methods and Data Warehousing (2016 update)
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
O'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data LakeO'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data Lake
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Data warehouse con azure synapse analytics
Data warehouse con azure synapse analyticsData warehouse con azure synapse analytics
Data warehouse con azure synapse analytics
 
Data modeling trends for analytics
Data modeling trends for analyticsData modeling trends for analytics
Data modeling trends for analytics
 
Machine Learning for z/OS
Machine Learning for z/OSMachine Learning for z/OS
Machine Learning for z/OS
 
Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
 
Should I move my database to the cloud?
Should I move my database to the cloud?Should I move my database to the cloud?
Should I move my database to the cloud?
 
2022 02 Integration Bootcamp
2022 02 Integration Bootcamp2022 02 Integration Bootcamp
2022 02 Integration Bootcamp
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Disaster Recovery Site Implementation with MySQL
Disaster Recovery Site Implementation with MySQLDisaster Recovery Site Implementation with MySQL
Disaster Recovery Site Implementation with MySQL
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing Architectures
 
Modern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform SystemModern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform System
 

Andere mochten auch

Ibm integration bus
Ibm integration busIbm integration bus
Ibm integration bus
FuturePoint Technologies
 
Tableau @ Spil Games
Tableau @ Spil GamesTableau @ Spil Games
Tableau @ Spil Games
Rob Winters
 
Data warehouse-dimensional-modeling-and-design
Data warehouse-dimensional-modeling-and-designData warehouse-dimensional-modeling-and-design
Data warehouse-dimensional-modeling-and-design
Sarita Kataria
 

Andere mochten auch (20)

Building a Personalized Offer Using Machine Learning
Building a Personalized Offer Using Machine LearningBuilding a Personalized Offer Using Machine Learning
Building a Personalized Offer Using Machine Learning
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
 
Architecting for Real-Time Big Data Analytics
Architecting for Real-Time Big Data AnalyticsArchitecting for Real-Time Big Data Analytics
Architecting for Real-Time Big Data Analytics
 
Guru4Pro Data Vault Best Practices
Guru4Pro Data Vault Best PracticesGuru4Pro Data Vault Best Practices
Guru4Pro Data Vault Best Practices
 
Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
Ibm integration bus
Ibm integration busIbm integration bus
Ibm integration bus
 
Big Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesBig Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil Games
 
Top bi travelbird
Top bi travelbirdTop bi travelbird
Top bi travelbird
 
HP Discover: Real Time Insights from Big Data
HP Discover: Real Time Insights from Big DataHP Discover: Real Time Insights from Big Data
HP Discover: Real Time Insights from Big Data
 
Getting Started with Big Data Analytics
Getting Started with Big Data AnalyticsGetting Started with Big Data Analytics
Getting Started with Big Data Analytics
 
Billions of Rows, Millions of Insights, Right Now
Billions of Rows, Millions of Insights, Right NowBillions of Rows, Millions of Insights, Right Now
Billions of Rows, Millions of Insights, Right Now
 
Big Data Expo 2015 - Infotopics Zien, Begrijpen, Doen!
Big Data Expo 2015 - Infotopics Zien, Begrijpen, Doen!Big Data Expo 2015 - Infotopics Zien, Begrijpen, Doen!
Big Data Expo 2015 - Infotopics Zien, Begrijpen, Doen!
 
Semantic Technology for the Data Warehousing Practitioner
Semantic Technology for the Data Warehousing PractitionerSemantic Technology for the Data Warehousing Practitioner
Semantic Technology for the Data Warehousing Practitioner
 
Agile Data Mining with Data Vault 2.0 (english)
Agile Data Mining with Data Vault 2.0 (english)Agile Data Mining with Data Vault 2.0 (english)
Agile Data Mining with Data Vault 2.0 (english)
 
Techzone 2014 presentation rundeck
Techzone 2014 presentation rundeckTechzone 2014 presentation rundeck
Techzone 2014 presentation rundeck
 
Data Center In Healthcare Presentation 02 12
Data Center In Healthcare Presentation 02 12Data Center In Healthcare Presentation 02 12
Data Center In Healthcare Presentation 02 12
 
Tableau @ Spil Games
Tableau @ Spil GamesTableau @ Spil Games
Tableau @ Spil Games
 
Data warehouse-dimensional-modeling-and-design
Data warehouse-dimensional-modeling-and-designData warehouse-dimensional-modeling-and-design
Data warehouse-dimensional-modeling-and-design
 

Ähnlich wie Data Vault Automation at the Bijenkorf

Data Warehouse Optimization
Data Warehouse OptimizationData Warehouse Optimization
Data Warehouse Optimization
Cloudera, Inc.
 

Ähnlich wie Data Vault Automation at the Bijenkorf (20)

Remote DBA Experts SQL Server 2008 New Features
Remote DBA Experts SQL Server 2008 New FeaturesRemote DBA Experts SQL Server 2008 New Features
Remote DBA Experts SQL Server 2008 New Features
 
How a Data Mesh is Driving our Platform | Trey Hicks, Gloo
How a Data Mesh is Driving our Platform | Trey Hicks, GlooHow a Data Mesh is Driving our Platform | Trey Hicks, Gloo
How a Data Mesh is Driving our Platform | Trey Hicks, Gloo
 
Datawarehouse org
Datawarehouse orgDatawarehouse org
Datawarehouse org
 
Data Warehouse Optimization
Data Warehouse OptimizationData Warehouse Optimization
Data Warehouse Optimization
 
Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...
 
Harness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data LakeHarness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data Lake
 
Various Applications of Data Warehouse.ppt
Various Applications of Data Warehouse.pptVarious Applications of Data Warehouse.ppt
Various Applications of Data Warehouse.ppt
 
Bringing DevOps to the Database
Bringing DevOps to the DatabaseBringing DevOps to the Database
Bringing DevOps to the Database
 
Evolutionary database design
Evolutionary database designEvolutionary database design
Evolutionary database design
 
Introduction to Conductor
Introduction to ConductorIntroduction to Conductor
Introduction to Conductor
 
Delivering Changes for Applications and Databases
Delivering Changes for Applications and DatabasesDelivering Changes for Applications and Databases
Delivering Changes for Applications and Databases
 
Data Stream Processing for Beginners with Kafka and CDC
Data Stream Processing for Beginners with Kafka and CDCData Stream Processing for Beginners with Kafka and CDC
Data Stream Processing for Beginners with Kafka and CDC
 
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
 
Fishbowl's Packaged Tools for WebCenter Automation
Fishbowl's Packaged Tools for WebCenter AutomationFishbowl's Packaged Tools for WebCenter Automation
Fishbowl's Packaged Tools for WebCenter Automation
 
Datastage Introduction To Data Warehousing
Datastage Introduction To Data WarehousingDatastage Introduction To Data Warehousing
Datastage Introduction To Data Warehousing
 
Got documents - The Raven Bouns Edition
Got documents - The Raven Bouns EditionGot documents - The Raven Bouns Edition
Got documents - The Raven Bouns Edition
 
DW (1).ppt
DW (1).pptDW (1).ppt
DW (1).ppt
 
Ibm_IoT_Architecture_and_Capabilities
Ibm_IoT_Architecture_and_CapabilitiesIbm_IoT_Architecture_and_Capabilities
Ibm_IoT_Architecture_and_Capabilities
 
DevOps+Data: Working with Source Control
DevOps+Data: Working with Source ControlDevOps+Data: Working with Source Control
DevOps+Data: Working with Source Control
 
AppSphere 15 - Is the database affecting your critical business transactions?
AppSphere 15 - Is the database affecting your critical business transactions?AppSphere 15 - Is the database affecting your critical business transactions?
AppSphere 15 - Is the database affecting your critical business transactions?
 

Kürzlich hochgeladen

Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
vexqp
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
cnajjemba
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
wsppdmt
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
vexqp
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
q6pzkpark
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
vexqp
 

Kürzlich hochgeladen (20)

Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdf
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 

Data Vault Automation at the Bijenkorf

  • 1. Data Vault Automation at de Bijenkorf PRESENTED BY ROB WINTERS ANDREI SCORUS
  • 2. Presentation agenda ◦ Project objectives ◦ Architectural overview ◦ The data warehouse data model ◦ Automation in the data warehouse ◦ Successes and failures ◦ Conclusions
  • 3. About the presenters Rob Winters Head of Data Technology, the Bijenkorf Project role: ◦ Project Lead ◦ Systems architect and administrator ◦ Data modeler ◦ Developer (ETL, predictive models, reports) ◦ Stakeholder manager ◦ Joined project September 2014 Andrei Scorus BI Consultant, Incentro Project role: ◦ Main ETL Developer ◦ ETL Developer ◦ Modeling support ◦ Source system expert ◦ Joined project November 2014
  • 4. Project objectives ◦ Information requirements ◦ Have one place as the source for all reports ◦ Security and privacy ◦ Information management ◦ Integrate with production ◦ Non-functional requirements ◦ System quality ◦ Extensibility ◦ Scalability ◦ Maintainability ◦ Security ◦ Flexibility ◦ Low Cost Technical Requirements • One environment to quickly generate customer insights • Then feed those insights back to production • Then measure the impact of those changes in near real time
  • 5. Source system landscape Source Type Number of Sources Examples Load Frequency Data Structure Oracle DB 2 Virgo ERP 2x/hour Partial 3NF MySQL 3 Product DB, Web Orders, DWH 10x/hour 3NF (Web Orders), Improperly normalized Event bus 1 Web/email events 1x/minute Tab delimited with JSON fields Webhook 1 Transactional Emails 1x/minute JSON REST APIs 5+ GA, DotMailer 1x/hour-1x/day JSON SOAP APIs 5+ AdWords, Pricing 1x/day XML
  • 6. Architectural overview Tools AWS ◦ S3 ◦ Kinesis ◦ Elasticache ◦ Elastic Beanstalk ◦ EC2 ◦ DynamoDB Open Source ◦ Snowplow Event Tracker ◦ Rundeck Scheduler ◦ Jenkins Continuous Integration ◦ Pentaho PDI Other ◦ HP Vertica ◦ Tableau ◦ Github ◦ RStudio Server
  • 7. DWH internal architecture • Traditional three tier DWH • ODS generated automatically from staging • Ops mart reflects data in original source form • Helps offload queries from source systems • Business marts materialized exclusively from vault
  • 8. Bijenkorf Data Vault overview Data volumes • ~1 TB base volume • 10-12 GB daily • ~250 source tables Aligned to Data Vault 2.0 • Hash keys • Hashes used for CDC • Parallel loading • Maximum utilization of available resources • Data unchanged in to the vault Some statistics 18 hubs • 34 loading scripts 27 links • 43 loading scripts 39 satellites • 43 loading scripts 13 reference tables • 1 script per table Model contains • Sales transactions • Customer and corporate locations • Customers • Products • Payment methods • E-mail • Phone • Product grouping • Campaigns • deBijenkorf card • Social media Excluded from the vault ◦ Event streams ◦ Server logs ◦ Unstructured data
  • 9. Deep dive: Transactions in DV (model diagram)
  • 10. Deep dive: Customers in DV (model diagram; same as link on customer)
  • 11. Challenges encountered during data modeling
    Challenge: Source issues
    ◦ Issue: Source systems and original data were unavailable for most information; data was often transformed 2-4 times before access was available; business keys (e.g. SKU) were typically replaced with sequences
    ◦ Resolution: Business keys rebuilt in staging prior to vault loading
    Challenge: Modeling returns
    ◦ Issue: Retail returns can appear in the ERP in 1-3 ways across multiple tables with inconsistent keys; online returns appear as a state change on the original transaction and may or may not appear in the ERP
    ◦ Resolution: The original model showed sale state on the line-item satellite; the revised model recorded "negative sale" transactions and used a new link to connect to the original sale when possible
    Challenge: Fragmented knowledge
    ◦ Issue: Information about the systems was held by multiple people; documentation was out of date
    ◦ Resolution: Talking to as many people as possible and testing hypotheses on the data
  • 12. Targeted benefits of DWH automation
    Objective: Speed of development
    ◦ Integration of new sources or data from existing sources takes 1-2 steps
    ◦ Adding a new vault dependency takes one step
    Objective: Simplicity
    ◦ Five jobs handle all ETL processes across the DWH
    Objective: Traceability
    ◦ Every record/source file is traced in the database, and every row is automatically identified by source file in the ODS
    Objective: Code simplification
    ◦ Replaced most common key definitions with dynamic variable replacement
    Objective: File management
    ◦ Every source file is automatically archived to Amazon S3 in appropriate locations, sorted by source, table, and date
    ◦ Entire source systems, periods, etc. can be replayed in minutes
  • 13. Source loading automation
    ◦ Design of the loader focused on process abstraction, traceability, and minimization of "moving parts"
    ◦ Final process consisted of two base jobs working in tandem: one generating incremental extracts from source systems, one loading flat files from all sources to staging tables
    ◦ Replication was desired but rejected due to limited access to source systems
    Workflow of source integration
    1. Source tables are duplicated in staging with the addition of loadTs and sourceFile columns
    2. Metadata for the source file is added
    3. The loader automatically generates the ODS and begins tracking source files for duplication and data quality
    4. The query generator automatically executes a full duplication on first execution and incrementals afterward
    Example: add an additional table from an existing source

    CREATE TABLE stg_oms.customer (
      customerId int
    , customerName varchar(500)
    , customerAddress varchar(5000)
    , loadTs timestamp NOT NULL
    , sourceFile varchar(255) NOT NULL
    )
    ORDER BY customerId
    PARTITION BY date(loadTs)
    ;

    INSERT INTO meta.source_to_stg_mapping
      (targetSchema, targetTable, sourceSystem, fileNamePattern, delimiter, nullField)
    VALUES ('stg_oms','customer','OMS','OMS_CUSTOMER','TAB','NULL')
    ;
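    A minimal sketch of such a metadata-driven extract query generator: on the first run it emits a full extract, afterwards an incremental one. Purely illustrative; the schema/table names echo the example above, while the column list and the `updatedAt` change-tracking column are hypothetical.

    ```python
    def build_extract_query(meta, last_load_ts=None):
        """Build a full or incremental extract query from table metadata.

        With no previous load timestamp, a full extract is emitted; afterwards
        only rows changed since the last successful load are selected.
        """
        cols = ", ".join(meta["columns"])
        query = f"SELECT {cols} FROM {meta['schema']}.{meta['table']}"
        if last_load_ts is not None:
            query += f" WHERE {meta['change_column']} > '{last_load_ts}'"
        return query

    meta = {
        "schema": "oms",
        "table": "customer",
        "columns": ["customerId", "customerName", "customerAddress"],
        "change_column": "updatedAt",  # hypothetical change-tracking column
    }

    print(build_extract_query(meta))                         # full load
    print(build_extract_query(meta, "2015-06-01 00:00:00"))  # incremental
    ```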
  • 14. Vault loading automation
    ◦ The loader is fully metadata driven, with a focus on horizontal scalability and management simplicity
    ◦ To support speed of development and performance, variable-driven SQL templates are used throughout
    1. All staging tables checked for changes
    ◦ New sources automatically added
    ◦ Last-change epoch based on load stamps, advanced each time all dependencies execute successfully
    2. List of dependent vault loads identified
    ◦ Dependencies declared at time of job creation
    ◦ Load prioritization possible but not utilized
    3. Loads planned in hub, link, sat order
    ◦ Jobs parallelized across tables but serialized per job
    ◦ Dynamic job queueing ensures appropriate execution order
    4. Loads executed
    ◦ Variables automatically identified and replaced
    ◦ Each load records performance statistics and error messages
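    The variable-driven SQL templates might look like the following sketch: one shared hub-load statement with placeholders filled from metadata per load. The template text, placeholder names, and vault column names are assumptions for illustration, not the actual templates used.

    ```python
    import string

    # Generic hub-load template; ${...} placeholders are filled from load metadata.
    # Only business keys not yet present in the hub are inserted.
    HUB_LOAD_TEMPLATE = string.Template("""
    INSERT INTO vault.$hub (hashKey, businessKey, loadTs, recordSource)
    SELECT DISTINCT s.$hash_col, s.$bk_col, s.loadTs, '$source'
    FROM $staging_table s
    LEFT JOIN vault.$hub h ON h.hashKey = s.$hash_col
    WHERE h.hashKey IS NULL
    """)

    def render_hub_load(meta):
        """Substitute load-specific variables into the shared template."""
        return HUB_LOAD_TEMPLATE.substitute(meta).strip()

    sql = render_hub_load({
        "hub": "hub_customer",
        "hash_col": "customerHashKey",
        "bk_col": "customerId",
        "source": "OMS",
        "staging_table": "stg_oms.customer",
    })
    print(sql)
    ```

    One template per table type (hub, link, satellite) plus per-load metadata is what keeps the script count low relative to the ~250 source tables.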
  • 15. Design goals for mart loading automation
    Requirement: Simple, standardized models
    ◦ Solution: Metadata-driven Pentaho PDI
    ◦ Benefit: Easy development using parameters and variables
    Requirement: Easily extensible
    ◦ Solution: Plugin framework
    ◦ Benefit: Rapid integration of new functionality
    Requirement: Rapid new job development
    ◦ Solution: Recycle standardized jobs and transformations
    ◦ Benefit: Limited moving parts, easy modification
    Requirement: Low administration overhead
    ◦ Solution: Leverage built-in logging and tracking
    ◦ Benefit: Easily integrated mart loading reporting with other ETL reports
  • 16. Information mart automation flow
    1. Retrieve commands
    ◦ Each dimension and fact is processed independently
    2. Get dependencies
    ◦ Based on the defined transformation, get all related vault tables: links, satellites, or hubs
    3. Retrieve changed data
    ◦ From the related tables, build a list of unique keys that have changed since the last update of the fact or dimension
    ◦ Store the data in the database until further processing
    4. Execute transformations
    ◦ Multiple Pentaho transformations can be processed per command using the data captured in previous steps
    5. Maintenance
    ◦ Logging happens throughout the whole process
    ◦ Cleanup after all commands have been processed
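    The flow above can be sketched as a simple command loop. This is purely illustrative (the real implementation is metadata-driven Pentaho PDI); the callable names and command structure are assumptions.

    ```python
    def process_marts(commands, get_dependencies, get_changed_keys,
                      run_transformation, log):
        """Process each fact/dimension command independently, mirroring the flow:
        retrieve commands -> resolve vault dependencies -> capture changed keys ->
        execute transformations -> log throughout."""
        for command in commands:
            deps = get_dependencies(command)      # related hubs/links/satellites
            changed = get_changed_keys(deps)      # keys changed since last load
            if not changed:
                log(command, "no changes, skipped")
                continue
            for transformation in command["transformations"]:
                run_transformation(transformation, changed)
            log(command, f"processed {len(changed)} changed keys")

    # Usage with stub callables standing in for the database and PDI:
    logs = []
    process_marts(
        commands=[{"name": "dim_customer", "transformations": ["load_dim_customer"]}],
        get_dependencies=lambda c: ["hub_customer", "sat_customer_details"],
        get_changed_keys=lambda deps: {"CUST-1", "CUST-2"},
        run_transformation=lambda t, keys: None,
        log=lambda c, msg: logs.append((c["name"], msg)),
    )
    print(logs)  # [('dim_customer', 'processed 2 changed keys')]
    ```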
  • 17. Primary uses of Bijenkorf DWH
    Customer analysis
    ◦ Provided first unified data model of customer activity
    ◦ 80% reduction in unique customer keys
    ◦ Allowed for segmentation of customers based on a combination of in-store and online activity
    Personalization
    ◦ DV drives recommendation engine and customer recommendations (updated nightly)
    ◦ Data pipeline supports near-real-time updating of customer recommendations based on web activity
    Business intelligence
    ◦ DV-based marts replace joining dozens of tables across multiple sources with single facts/dimensions
    ◦ IT-driven reporting being replaced with self-service BI
  • 18. Biggest drivers of success
    AWS infrastructure
    ◦ Cost: entire infrastructure for less than one server in the data center
    ◦ Toolset: most services available off the shelf, minimizing administration
    ◦ Freedom: no dependency on IT for development support
    ◦ Scalability: systems automatically scaled to match DWH demands
    Automation
    ◦ Speed: enormous time savings after the initial investment
    ◦ Simplicity: able to run and monitor 40k+ queries per day with minimal effort
    ◦ Auditability: enforced tracking and archiving without developer involvement
    PDI framework
    ◦ Ease of use: adding new commands takes at most 45 minutes
    ◦ Agile: building the framework took 1 day
    ◦ Low profile: average memory usage of 250 MB
  • 19. Biggest mistakes along the way
    Reliance on documentation and requirements over expert users
    ◦ Initial integration design was based on provided documentation/models, which were rarely accurate
    ◦ Current users of sources should have been engaged earlier to explain undocumented caveats
    Late utilization of templates and variables
    ◦ Variables were utilized late in development, slowing progress significantly and creating consistency issues
    ◦ Good initial design of templates will significantly reduce development time in the mid/long run
    Aggressive overextension of resources
    ◦ We attempted to design and populate the entire data vault prior to focusing on customer deliverables like reports (in addition to other projects)
    ◦ We have shifted focus to continuous release of new information rather than waiting for completeness
  • 20. Primary takeaways
    ◦ Sources are like cars: the older they are, the more idiosyncrasies. Be cautious with design automation!
    ◦ Automation can enormously simplify and accelerate data warehousing. Don't be afraid to roll your own
    ◦ Balance stateful versus stateless and monolithic versus fragmented architecture design
    ◦ Cloud-based architecture built on column-store DBs is extremely scalable, cheap, and highly performant
    ◦ A successful vault can create a new problem: getting IT to think about business processes rather than system keys!

Editor's notes

  1. One of the focus points will be the return satellite; maybe the whole relationship to the return location and customer should have been modeled as a link? The return satellite is an active satellite.