The document proposes an Agile Big Data model to address perceived issues with traditional Hadoop implementations. It discusses the motivation for change and outlines an Agile model with self-organized roles including data stewards, data scientists, project teams, and an architecture board. Key aspects of the proposed model include independent and self-managed project teams, a domain-driven data model, and emphasis on data quality and governance through the involvement of data stewards across domains.
16. #1 The Vision of the Big Data Lake
[Diagram: traditional sources (RDBMS, OLTP, OLAP, ...) and new sources (logs, mails, sensors, social media, ...) feed traditional systems (RDBMS, EDW, MPP, ...) and an Enterprise Hadoop Platform; on top sit business intelligence, business applications, and custom applications, supported by operations (manage & monitor) and dev tools (build & test).]
17. #1 Vision & Reality
A Hadoop project feels just like yet another data warehouse project, except for the knowledge.
18. #1 Real-world architecture - Insurance
[Diagram: traditional sources (RDBMS, OLTP, OLAP, ...) feed a DWH that serves business intelligence; new sources (logs, sensors, social media, ...) feed an Enterprise Hadoop Platform, with a SAS LASR Server and Apache Zeppelin on top.]
20. #2 Lambda in Action - (e)Commerce
[Diagram: data channels feed data ingestion, which splits into a speed layer (data processing, data storage, visualization; latencies of ms - s) and a batch layer (data storage, data analysis, visualization; latencies of min - h).]
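To make the two layers concrete, here is a minimal PySpark sketch; the HDFS paths, the Kafka topic "orders", and the column names are assumptions for illustration, and the speed layer additionally needs the spark-sql-kafka connector on the classpath.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

# Batch layer (min - h): periodically recompute a complete view
# from the master dataset.
batch_view = (spark.read.parquet("hdfs:///data/orders/raw")  # hypothetical path
              .groupBy("product_id")
              .agg(F.sum("amount").alias("revenue")))
batch_view.write.mode("overwrite").parquet("hdfs:///data/orders/batch_view")

# Speed layer (ms - s): keep a running count over the live event stream.
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")       # hypothetical topic
          .load())
speed_view = events.groupBy().count()          # payload parsing elided
(speed_view.writeStream.outputMode("complete")
 .format("memory").queryName("speed_view").start())
```

A serving layer (not shown in the sketch) would merge the batch view with the speed view at query time.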
21. #2 Cassandra & Hadoop - AdServing
[Diagram: data ingestion feeds data processing; raw data and user-journey data are processed into aggregated data that is retained for < 120 days and served both to the web frontend and to data science.]
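The "< 120 days" retention maps naturally onto a Cassandra table TTL. A minimal sketch with the Python cassandra-driver follows; the keyspace, table, and columns are assumptions for illustration, not from the talk.

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS adserving
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
# default_time_to_live = 120 days in seconds, so aggregated rows
# expire automatically without a cleanup job.
session.execute("""
    CREATE TABLE IF NOT EXISTS adserving.aggregated_data (
        campaign_id text,
        day         date,
        impressions bigint,
        clicks      bigint,
        PRIMARY KEY (campaign_id, day)
    ) WITH default_time_to_live = 10368000
""")
```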
23. #3 Fraud detection - Financial services
[Diagram: data ingestion feeds a speed layer doing stream processing (ms - s) and a batch layer (min - h).]
Model workflow: Data Import → Data Preparation → Model Generation → Model Validation, with manual or automatic iterations over feature & parameter selection to tune the model; the validated model is then used, and it is refreshed from the latest input data.
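As an illustration of this loop, a minimal sketch in scikit-learn (the talk names no library); the input file, the label column, and the parameter grid below are assumptions, and data preparation is reduced to assuming numeric feature columns.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

def refresh_model(csv_path="transactions.csv"):  # hypothetical input file
    # Data import & preparation (assumes numeric feature columns)
    df = pd.read_csv(csv_path)
    X, y = df.drop(columns=["is_fraud"]), df["is_fraud"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y)

    # Model generation: automatic iterations over the parameter grid
    search = GridSearchCV(
        RandomForestClassifier(),
        param_grid={"n_estimators": [100, 300], "max_depth": [5, 10]},
        scoring="roc_auc", cv=3)
    search.fit(X_train, y_train)

    # Model validation on held-out data
    auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
    print(f"validation AUC: {auc:.3f}")
    return search.best_estimator_

# "Refresh model from latest input data": simply re-run on a schedule
model = refresh_model()
```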
24. Agenda
1. Background & Motivation
2. My experience
3. Agile Big Data Model
25. Trade-offs for a Hadoop Platform
[Diagram: cost efficiency, flexibility, and speed of provisioning as competing forces.]
26. Trade-offs for a Hadoop Platform
[Diagram repeated: cost efficiency, flexibility, speed of provisioning.]
The successful companies will be those that build maximum flexibility and speed of provisioning into their platform without creating yet another silo, all while keeping costs under control.
27. Support for different speeds
A modern Hadoop platform needs to cope with different speed levels to enable different use cases.
[Chart: speed of data processing (h down to ms) against size of data (KB up to TB), positioning batch, interactive, streaming, and realtime workloads; data channels feed data ingestion into a batch layer and a speed layer.]
28. The Microservices of Hadoop
You have to think data-centric, in pipelines!
[Diagram: producers (MS SQL, MySQL, Oracle, JMS, events, CSV, ...) deliver batch and streaming data into data ingestion; storage and analysis run on HDFS (redundant, reliable storage) and YARN (data operating system), with SQL via Hive, in-memory processing via Spark, search via Solr, Spark R, and others; visualization and consumers include Ambari Views & Zeppelin and MS SQL 2016 + R. Data pipelines A, B, and C each cut through the platform end to end.]
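As a sketch of what one such pipeline ("Data Pipeline A", say) could look like in code, assuming PySpark with Hive support; the paths, database, and table names are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("pipeline-a")
         .enableHiveSupport().getOrCreate())

# Ingestion: land a CSV export from an RDBMS producer on HDFS
raw = spark.read.csv("hdfs:///landing/customers.csv",
                     header=True, inferSchema=True)

# Storage: register the data as a Hive table on HDFS so every
# SQL-capable engine on the platform can reach it
spark.sql("CREATE DATABASE IF NOT EXISTS pipeline_a")
raw.write.mode("overwrite").saveAsTable("pipeline_a.customers")

# Analysis: interactive SQL on YARN; a Zeppelin notebook could run
# the same query for the visualization step
spark.sql("""
    SELECT country, COUNT(*) AS customers
    FROM pipeline_a.customers
    GROUP BY country
""").show()
```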
29. Core principles and values
• The core beliefs are the agile principles
• The foundation is a data-centric role model oriented toward the domains of the Big Data platform
• Independent project teams deliver data pipelines, from beginning to end
• The project teams collaborate with specialized Big Data roles
• The data model is built on the principles of domain-driven design (DDD)
• Data governance is built on self-organization
30. Role model
[Diagram: on the Big Data platform, raw data ("hidden treasures") becomes operational and analytical data and finally answers to questions. Data engineers load and transform data; data analysts process data; data scientists analyze and correlate data; admins maintain, enhance, and scale the platform; data stewards are responsible for the data quality in one domain.]
31. Data model
[Diagram: Domain A contains project data and Data X, Domain B contains project data and Data Y; Data Steward A is responsible for Domain A, Data Steward B for Domain B; Project A uses data from the domains.]
• The data model is based on the principles of domain-driven design (DDD)
• The data is divided into domains; the smallest domain is user data
• Users and project teams are directly responsible for their own data
  • They can use other existing data
• Data is bundled into comprehensive domains, e.g.
  • Business domains
  • National subsidiaries
• Domains can be hierarchical
• Responsibility for a domain lies with exactly one data steward
• Always attach metadata to the user data
  • Use an automated tool where possible; otherwise do it informally
• Don't strive for a unified data model!
  • Redundancy is not forced, but accepted as a real-world necessity
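One way to make this tangible is a path convention where the domain, the responsible steward, and the metadata travel with the data. A minimal Python sketch follows; the layout and the metadata fields are assumptions, not part of the model as presented.

```python
import json
from pathlib import Path

def register_dataset(domain: str, project: str, name: str,
                     steward: str, root: str = "data") -> Path:
    """Place project data under its domain and write a metadata
    sidecar, so responsibility is visible next to the data itself."""
    path = Path(root) / domain / project / name
    path.mkdir(parents=True, exist_ok=True)
    meta = {"domain": domain, "project": project,
            "dataset": name, "data_steward": steward}
    (path / "_metadata.json").write_text(json.dumps(meta, indent=2))
    return path

# Domain A holds Project A's Data X; Data Steward A is responsible
register_dataset("domain_a", "project_a", "data_x", steward="steward_a")
```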
32. Collaboration model
[Diagram: projects A, B, ..., X use the Big Data platform. The architecture board provides authoritative guidelines and consults the project teams. Data stewards are responsible for the data domains and consult; data scientists consult, work with or are part of the project teams, and run their own projects; business departments likewise work with or are part of the project teams and have their own projects; IT operations consults; the project teams carry their own responsibility.]
33. Role description: IT Operations
• Operates and monitors the Big Data platform, based on an agreed-upon service level agreement (SLA)
• Keeps the platform up to date in short cycles
• Adds additional components and technologies
• Scales the platform
• Has a DevOps mindset
34. Role description: Project Team
• A project team works on a data pipeline, from beginning to end
  • Data pipelines can have different depths
• A project team is independent of other project teams
  • Project teams can collaborate
• A project team needs to have all roles required to fulfill its project goal
• A project team has full responsibility for its own data
[Example: Project A consists of a product owner, data engineers, a data analyst, and a data scientist.]
35. Role description: Architecture Board
• Designs technological guidelines
• Consults on deviations from those guidelines
• Meets on a regular basis, with full transparency
• Can consult project teams on their daily business
• Consists of architects, data stewards, and key members of the project teams
36. Role description: Data Steward
• Supervises the creation and usage of data and its quality
  • Steering role in a self-organized data governance
• Is responsible for the user data and metadata of (at least) one domain
• Operative role; works closely with all other roles
• Data stewards form an independent, self-organized team
• Data stewards are part of the architecture board
37. Role description: Data Scientist
• Data scientists form an independent team of data specialists
• Work as part of project teams but also have their own tasks, e.g.
  • Scientific assessment of data quality
  • Generating project and product ideas
• Consult and work closely with data stewards and business departments
• Still unicorns on the job market