Manage your data infrastructure like this:
(i) don’t drown the infra teams in domain-specific data details
(ii) build robust low-latency lookup facilities to feed online services
(iii) always take stress out of the equation
At Klarna Bank we make online decisions on risk, fraud, and ID. Over a hundred data sources are processed by over a hundred analysts and over a hundred batch jobs. Three data infrastructure engineering teams operate and develop this data lake: the core team, the apps team, and the performance team. The total head count is less than a dozen.
To keep afloat, we’ve distilled the following practices: (i) the immutability and recomputation properties of the Lambda/Kappa architectures, (ii) continuously delivered and automated infrastructure, (iii) tooling that empowers producers and consumers of data to be accountable and self-sufficient, and (iv) proactive work to improve the efficiency of data users.
We’ll talk about some of these practices and the tools we have built during several years of running banking applications on Hortonworks Hadoop. Ecosystem components we’ll touch on include Kafka, Avro, Hive, Oozie, ELK, Ranger, and Ansible. Tools we have developed include HiveRunner, tooling for data import, and continuous delivery of data pipelines.
Speakers
Erik Zeitler, Senior Data Engineer, PhD, Klarna Bank
Per Ullberg, Lead Software Engineer, Klarna Bank
Who are we?
Online payment provider
Over a decade’s worth of experience in payments
Erik
Interaction is valued - questions are welcome - we’ll see how far we get
Pelle
Linear projection - little domain logic: append only; queryability; downstream performance; validate/invalidate, route, and normalize data.
Single source aggregation - more domain knowledge: velocity, sessions, etc.; may be incremental, since a single source makes it easy to understand what has changed.
General data processing - massive domain knowledge: F(all data) across multiple sources; big performance gains.
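To make the first tier concrete, here is a minimal sketch of a linear projection stage as a Kafka Streams topology: stateless validate, route, and normalize, with append-only output. The topic names and the validation and normalization rules are placeholders, not our production logic.

```java
// Sketch of the first tier as a Kafka Streams topology: a stateless
// validate -> route -> normalize pass with append-only output topics.
// Topic names and the validation/normalization rules are placeholders.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class LinearProjection {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "linear-projection");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("raw-events");

        // Validate: a cheap structural check splits the stream in two.
        KStream<String, String>[] branches = raw.branch(
                (k, v) -> v != null && v.startsWith("{"), // looks like a record
                (k, v) -> true);                          // everything else

        // Normalize (trivially, here) and append to the clean topic.
        branches[0].mapValues(String::trim).to("valid-events");
        // Route rejects to a quarantine topic so producers can inspect them.
        branches[1].to("invalid-events");

        // No joins, no state: the stage stays linear and cheap to rerun.
        new KafkaStreams(builder.build(), props).start();
    }
}
```

Because the stage holds no state, it can be replayed from the start of the log at any time - which is what makes the immutability and recomputation properties of Lambda/Kappa cheap to exploit.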
Pelle
Transaction logs on Kafka
Ingestion from the cloud via mirrored Kafka topics.
Beware of your data model! If all rows change every day, you gain nothing from using the change-capture log.
Write in the cloud - read on prem
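As an illustration of the read side, the sketch below materializes current state from a mirrored change-capture topic. The broker address, topic name, and the full-row-with-tombstones convention are assumptions for this sketch, not a description of our actual setup.

```java
// Sketch of materializing current state on prem from a mirrored
// change-capture topic. Broker, group, and topic names are invented;
// values are assumed to be full rows, with null meaning delete
// (the usual log-compaction convention).
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class CdcMaterializer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "onprem-broker:9092"); // mirrored from the cloud
        props.put("group.id", "cdc-materializer");
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        Map<String, String> state = new HashMap<>(); // on-prem view, keyed by primary key

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("cloud.transactions.cdc"));
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    if (rec.value() == null) {
                        state.remove(rec.key());           // tombstone: row deleted upstream
                    } else {
                        state.put(rec.key(), rec.value()); // upsert the new row version
                    }
                }
                // The caveat above in code terms: if every key changes every day,
                // a day of this log is as large as a full snapshot, and the
                // change-capture log buys you nothing.
            }
        }
    }
}
```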
Erik
Domain-agnostic infra teams
Operate infra on a skeleton crew
Storage
Processing
Frameworks
Self-service infra
Redirect questions
Producer requirements
Infra is no middleman
Producer awareness
“API discovery”
This can be a conflict!
Provide self-service
Build more tools
Document
Slack support channel
Dedup
Guaranteed delivery
Binary to queryable
Closed partitions
Dataset discovery
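As one example of this self-service tooling, a producer can test pipeline logic with HiveRunner, our Hive test framework mentioned in the abstract. The sketch below checks a dedup step; the database, table, and query are invented for illustration.

```java
// A sketch of producer-side testing with HiveRunner (our Hive test
// framework). The database, table, and the dedup query under test are
// made up for illustration.
import com.klarna.hiverunner.HiveShell;
import com.klarna.hiverunner.StandaloneHiveRunner;
import com.klarna.hiverunner.annotations.HiveSQL;
import org.junit.Assert;
import org.junit.Test;
import org.junit.runner.RunWith;

import java.util.List;

@RunWith(StandaloneHiveRunner.class)
public class DedupTest {

    @HiveSQL(files = {})
    private HiveShell shell;

    @Test
    public void duplicateDeliveriesCollapseToOneRow() {
        shell.execute("CREATE DATABASE source_db");
        shell.execute("CREATE TABLE source_db.events (id STRING, payload STRING)");
        // Two rows with the same id simulate an at-least-once redelivery.
        shell.execute("INSERT INTO source_db.events VALUES ('a', 'x'), ('a', 'x'), ('b', 'y')");

        // Dedup with a window function keyed on the event id.
        List<String> rows = shell.executeQuery(
                "SELECT id FROM ("
                + " SELECT id, ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rn"
                + " FROM source_db.events) t WHERE rn = 1");

        Assert.assertEquals(2, rows.size());
    }
}
```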
Erik
Proactive professional services!
I know this sounds like lame consultant speak.
But we really needed this: we have to solve the kind of performance problems you saw on the previous slide before they turn into incidents.
To this end, we needed tooling. We use ELK to store, search, and display performance data.
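As a sketch (not our actual pipeline), shipping one document per executed query to Elasticsearch via its low-level REST client is enough to search and graph runtimes in Kibana. The host, index, and document fields below are invented.

```java
// Sketch only: ship a per-query performance sample to Elasticsearch with
// the low-level REST client, to be searched and graphed in Kibana.
// Host, index name, and document fields are invented for illustration.
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class QueryMetricsShipper {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("elasticsearch", 9200, "http")).build()) {
            Request req = new Request("POST", "/query-metrics/_doc");
            req.setJsonEntity("{\"user\":\"analyst42\",\"query_id\":\"q-123\","
                    + "\"runtime_ms\":95000,\"bytes_read\":123456789}");
            client.performRequest(req); // one document per executed query
        }
    }
}
```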
Optimization is worthless if the output is incorrect.
We needed a validation tool that works at scale: compare two Hive databases, checking the output of the original queries against the output of the optimized queries.
So we built one: Difftång.
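Difftång’s internals are beyond this abstract, but the core idea can be sketched: the symmetric difference between the original and the optimized output tables should be empty. Below is a JDBC sketch against HiveServer2, with hypothetical connection details, tables, and columns throughout.

```java
// Not Difftång itself - just the core idea: find rows that occur a
// different number of times in the original and optimized result tables.
// All connection details, table names, and columns are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OutputDiff {
    public static void main(String[] args) throws Exception {
        String diff =
                "SELECT id, amount FROM ("
                + " SELECT id, amount, 'orig' AS src FROM original_db.result"
                + " UNION ALL"
                + " SELECT id, amount, 'opt' AS src FROM optimized_db.result) u"
                + " GROUP BY id, amount"
                + " HAVING SUM(CASE WHEN src = 'orig' THEN 1 ELSE 0 END)"
                + "     <> SUM(CASE WHEN src = 'opt' THEN 1 ELSE 0 END)";

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(diff)) {
            if (rs.next()) {
                System.out.println("Mismatch, e.g. id=" + rs.getString("id"));
            } else {
                System.out.println("Outputs agree."); // symmetric difference is empty
            }
        }
    }
}
```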