The document proposes an Agile Big Data model to address perceived issues with traditional Hadoop implementations. It discusses the motivation for change and outlines an Agile model with self-organized roles including data stewards, data scientists, project teams, and an architecture board. Key aspects of the proposed model include independent and self-managed project teams, a domain-driven data model, and emphasis on data quality and governance through the involvement of data stewards across domains.
16. #1 The Vision of the Big Data Lake
[Diagram: traditional sources (RDBMS, OLTP, OLAP, ...) and new sources (logs, mails, sensors, social media, ...) feed traditional systems (RDBMS, EDW, MPP, ...) and an Enterprise Hadoop Platform; on top sit business intelligence, business applications, and custom applications, supported by operations (manage & monitor) and dev tools (build & test).]
17. #1 Vision & Reality
A Hadoop project feels just like yet another data warehouse project, except for the knowledge.
18. #1 Real-world architecture - Insurance
[Diagram: traditional sources (RDBMS, OLTP, OLAP, ...) feed a DWH that serves business intelligence; new sources (logs, sensors, social media, ...) feed an Enterprise Hadoop Platform, with a SAS LASR Server and Apache Zeppelin on top.]
20. #2 Lambda in Action - (e)Commerce
[Diagram: data channels feed data ingestion, which splits into a speed layer (data processing, data storage, visualization; latencies of ms - s) and a batch layer (data storage, data analysis, visualization; latencies of min - h).]
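To make the two layers concrete, here is a minimal PySpark sketch; the HDFS paths, the Kafka topic "orders", and the column names are assumptions for illustration, and the speed layer additionally needs the spark-sql-kafka connector on the classpath.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

# Batch layer (min - h): periodically recompute a complete view
# from the master dataset.
batch_view = (spark.read.parquet("hdfs:///data/orders/raw")  # hypothetical path
              .groupBy("product_id")
              .agg(F.sum("amount").alias("revenue")))
batch_view.write.mode("overwrite").parquet("hdfs:///data/orders/batch_view")

# Speed layer (ms - s): keep a running count over the live event stream.
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")       # hypothetical topic
          .load())
speed_view = events.groupBy().count()          # payload parsing elided
(speed_view.writeStream.outputMode("complete")
 .format("memory").queryName("speed_view").start())
```

A serving layer (not shown in the sketch) would merge the batch view with the speed view at query time.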
21. #2 Cassandra & Hadoop - AdServing
[Diagram: data ingestion feeds data processing; raw data and user-journey data are processed into aggregated data that is retained for < 120 days and served both to the web frontend and to data science.]
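The "< 120 days" retention maps naturally onto a Cassandra table TTL. A minimal sketch with the Python cassandra-driver follows; the keyspace, table, and columns are assumptions for illustration, not from the talk.

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS adserving
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
# default_time_to_live = 120 days in seconds, so aggregated rows
# expire automatically without a cleanup job.
session.execute("""
    CREATE TABLE IF NOT EXISTS adserving.aggregated_data (
        campaign_id text,
        day         date,
        impressions bigint,
        clicks      bigint,
        PRIMARY KEY (campaign_id, day)
    ) WITH default_time_to_live = 10368000
""")
```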
23. #3 Fraud detection - Financial services
[Diagram: data ingestion feeds a speed layer doing stream processing (ms - s) and a batch layer (min - h).]
Model workflow: Data Import → Data Preparation → Model Generation → Model Validation, with manual or automatic iterations over feature & parameter selection to tune the model; the validated model is then used, and it is refreshed from the latest input data.
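As an illustration of this loop, a minimal sketch in scikit-learn (the talk names no library); the input file, the label column, and the parameter grid below are assumptions, and data preparation is reduced to assuming numeric feature columns.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

def refresh_model(csv_path="transactions.csv"):  # hypothetical input file
    # Data import & preparation (assumes numeric feature columns)
    df = pd.read_csv(csv_path)
    X, y = df.drop(columns=["is_fraud"]), df["is_fraud"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y)

    # Model generation: automatic iterations over the parameter grid
    search = GridSearchCV(
        RandomForestClassifier(),
        param_grid={"n_estimators": [100, 300], "max_depth": [5, 10]},
        scoring="roc_auc", cv=3)
    search.fit(X_train, y_train)

    # Model validation on held-out data
    auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
    print(f"validation AUC: {auc:.3f}")
    return search.best_estimator_

# "Refresh model from latest input data": simply re-run on a schedule
model = refresh_model()
```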
24. Agenda
1. Background & Motivation
2. My experience
3. Agile Big Data Model
25. Trade-offs for a Hadoop Platform
[Diagram: cost efficiency, flexibility, and speed of provisioning as competing forces.]
26. Trade-offs for a Hadoop Platform
[Diagram repeated: cost efficiency, flexibility, speed of provisioning.]
The successful companies will be those that build maximum flexibility and speed of provisioning into their platform without creating yet another silo, all while keeping costs under control.
27. Support for different speeds
A modern Hadoop platform needs to cope with different speed levels to enable different use cases.
[Chart: speed of data processing (h down to ms) against size of data (KB up to TB), positioning batch, interactive, streaming, and realtime workloads; data channels feed data ingestion into a batch layer and a speed layer.]
28. The Microservices of Hadoop
You have to think data-centric, in pipelines!
[Diagram: producers (MS SQL, MySQL, Oracle, JMS, events, CSV, ...) deliver batch and streaming data into data ingestion; storage and analysis run on HDFS (redundant, reliable storage) and YARN (data operating system), with SQL via Hive, in-memory processing via Spark, search via Solr, Spark R, and others; visualization and consumers include Ambari Views & Zeppelin and MS SQL 2016 + R. Data pipelines A, B, and C each cut through the platform end to end.]
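As a sketch of what one such pipeline ("Data Pipeline A", say) could look like in code, assuming PySpark with Hive support; the paths, database, and table names are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("pipeline-a")
         .enableHiveSupport().getOrCreate())

# Ingestion: land a CSV export from an RDBMS producer on HDFS
raw = spark.read.csv("hdfs:///landing/customers.csv",
                     header=True, inferSchema=True)

# Storage: register the data as a Hive table on HDFS so every
# SQL-capable engine on the platform can reach it
spark.sql("CREATE DATABASE IF NOT EXISTS pipeline_a")
raw.write.mode("overwrite").saveAsTable("pipeline_a.customers")

# Analysis: interactive SQL on YARN; a Zeppelin notebook could run
# the same query for the visualization step
spark.sql("""
    SELECT country, COUNT(*) AS customers
    FROM pipeline_a.customers
    GROUP BY country
""").show()
```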
29. Core principles and values
• The core beliefs are the agile principles
• The foundation is a data-centric role model oriented toward the domains of the Big Data platform
• Independent project teams deliver data pipelines, from beginning to end
• The project teams collaborate with specialized Big Data roles
• The data model is built on the principles of domain-driven design (DDD)
• Data governance is built on self-organization
30. Role model
[Diagram: on the Big Data platform, raw data ("hidden treasures") becomes operational and analytical data and finally answers to questions. Data engineers load and transform data; data analysts process data; data scientists analyze and correlate data; admins maintain, enhance, and scale the platform; data stewards are responsible for the data quality in one domain.]
31. Data model
[Diagram: Domain A contains project data and Data X, Domain B contains project data and Data Y; Data Steward A is responsible for Domain A, Data Steward B for Domain B; Project A uses data from the domains.]
• The data model is based on the principles of domain-driven design (DDD)
• The data is divided into domains; the smallest domain is user data
• Users and project teams are directly responsible for their own data
  • They can use other existing data
• Data is bundled into comprehensive domains, e.g.
  • Business domains
  • National subsidiaries
• Domains can be hierarchical
• Responsibility for a domain lies with exactly one data steward
• Always attach metadata to the user data
  • Use an automated tool where possible; otherwise do it informally
• Don't strive for a unified data model!
  • Redundancy is not forced, but accepted as a real-world necessity
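One way to make this tangible is a path convention where the domain, the responsible steward, and the metadata travel with the data. A minimal Python sketch follows; the layout and the metadata fields are assumptions, not part of the model as presented.

```python
import json
from pathlib import Path

def register_dataset(domain: str, project: str, name: str,
                     steward: str, root: str = "data") -> Path:
    """Place project data under its domain and write a metadata
    sidecar, so responsibility is visible next to the data itself."""
    path = Path(root) / domain / project / name
    path.mkdir(parents=True, exist_ok=True)
    meta = {"domain": domain, "project": project,
            "dataset": name, "data_steward": steward}
    (path / "_metadata.json").write_text(json.dumps(meta, indent=2))
    return path

# Domain A holds Project A's Data X; Data Steward A is responsible
register_dataset("domain_a", "project_a", "data_x", steward="steward_a")
```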
32. Collaboration model
[Diagram: projects A, B, ..., X use the Big Data platform. The architecture board provides authoritative guidelines and consults the project teams. Data stewards are responsible for the data domains and consult; data scientists consult, work with or are part of the project teams, and run their own projects; business departments likewise work with or are part of the project teams and have their own projects; IT operations consults; the project teams carry their own responsibility.]
33. Role description: IT Operations
• Operates and monitors the Big Data platform, based on an agreed-upon service level agreement (SLA)
• Keeps the platform up to date in short cycles
• Adds additional components and technologies
• Scales the platform
• Has a DevOps mindset
34. Role description: Project Team
• A project team works on a data pipeline, from beginning to end
  • Data pipelines can have different depths
• A project team is independent of other project teams
  • Project teams can collaborate
• A project team needs to have all roles required to fulfill its project goal
• A project team has full responsibility for its own data
[Example: Project A consists of a product owner, data engineers, a data analyst, and a data scientist.]
35. Role description: Architecture Board
• Designs technological guidelines
• Consults on deviations from those guidelines
• Meets on a regular basis, with full transparency
• Can consult project teams on their daily business
• Consists of architects, data stewards, and key members of the project teams
36. Role description: Data Steward
• Supervises the creation and usage of data and its quality
  • Steering role in a self-organized data governance
• Is responsible for the user data and metadata of (at least) one domain
• Operative role; works closely with all other roles
• Data stewards form an independent, self-organized team
• Data stewards are part of the architecture board
37. Role description: Data Scientist
• Data scientists form an independent team of data specialists
• Work as part of project teams but also have their own tasks, e.g.
  • Scientific assessment of data quality
  • Generating project and product ideas
• Consult and work closely with data stewards and business departments
• Still unicorns on the job market