This document describes BBVA's implementation of a Big Data Lake using Apache Spark for log collection, storage, and analytics. It discusses:
1) Using Syslog-ng for log collection from over 2,000 applications and devices, distributing logs to Kafka.
2) Storing normalized logs in HDFS and performing analytics using Spark, with outputs to analytics, compliance, and indexing systems.
3) Choosing Spark because it allows interactive, batch, and stream processing with one system using RDDs, SQL, streaming, and machine learning.
A Big Data Lake Based on Spark for BBVA (Oscar Mendez, STRATIO)
1. A Big Data Lake based on Spark for BBVA (June 2015)
2. STARTING POSITION
• Absence of software capable of processing the data
• Isolated data silos
• Multiple structured & unstructured data sources
• Multiple log management software
• Applications just writing to disk (no network logging)
3. DRIVERS
Countless applications & benefits: fraud, security, data analysis, monitoring, SIEM, audit, e-commerce, user tracking, development & debugging, regulatory compliance.
4. HIGH-LEVEL SOLUTION
• Multiple-source ingestion to a common bus
• Normalization and transformation to a unified log (hard work!)
• Multiple data sinks depending on the clients and/or use cases:
  - Analytics
  - Regulatory compliance
  - Indexing engine
  - …
(Diagram: raw log → Big Data Lake → normalized log)
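The "hard work" of normalization can be pictured as mapping each source-specific record onto one unified schema. Below is a minimal sketch in Python; the field names (`timestamp`, `host`, `app`, `message`) and source labels are illustrative assumptions, not BBVA's actual schema.

```python
# Hypothetical unified-log normalization sketch. The unified field
# names and per-source mappings are illustrative, not the real schema.
UNIFIED_FIELDS = ("timestamp", "host", "app", "message")

def normalize(raw: dict, source: str) -> dict:
    """Map a source-specific raw record onto the unified log schema."""
    if source == "apache_access":
        return {
            "timestamp": raw["time"],
            "host": raw["remote_host"],
            "app": "apache",
            "message": raw["request"],
        }
    if source == "app_json":
        return {
            "timestamp": raw["ts"],
            "host": raw.get("hostname", "unknown"),  # tolerate missing fields
            "app": raw["service"],
            "message": raw["msg"],
        }
    raise ValueError(f"unknown source: {source}")

rec = normalize({"ts": "2015-06-01T10:00:00Z", "service": "payments",
                 "msg": "tx ok"}, "app_json")
print(rec["app"])  # → payments
```

Every downstream sink (analytics, compliance, indexing) then consumes the same record shape, which is what makes a single lake feasible.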
6. SOFTWARE PIECES
1. Logs sent from syslog-ng. Devices that don't support installation of syslog-ng send logs via syslog to a syslog-ng relay.
2. Logs sent from syslog-ng to Kafka, used as a distribution hub with a topic per consumer/client.
3. New applications to write directly to Kafka.
4. Multiple destinations: SPARKTA, ELK.
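The "topic per consumer/client" routing in step 2 can be sketched as a small parse-and-route step: parse an incoming syslog line, then derive the Kafka topic from the application name. This is an illustrative sketch; the regex covers only a simplified RFC 3164 layout, and the `logs.<app>` topic convention is an assumption, not the deck's actual naming scheme.

```python
import re

# Simplified RFC 3164-style syslog parser plus per-app topic routing.
# The "logs.<app>" topic naming is a hypothetical convention.
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>"                         # <PRI>
    r"(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "  # e.g. "Oct 11 22:14:15"
    r"(?P<host>\S+) "                              # hostname
    r"(?P<app>[\w./-]+?)(?:\[\d+\])?: "            # app name, optional [pid]
    r"(?P<msg>.*)$"                                # free-text message
)

def route(line: str) -> tuple[str, dict]:
    """Parse one syslog line and pick its Kafka topic (one per client/app)."""
    m = SYSLOG_RE.match(line)
    if not m:
        # Unparseable lines still land somewhere, as raw payload.
        return "logs.unparsed", {"raw": line}
    rec = m.groupdict()
    return f"logs.{rec['app']}", rec

topic, rec = route("<34>Oct 11 22:14:15 host1 sshd[123]: Failed password")
print(topic)  # → logs.sshd
```

A producer would then publish `rec` to `topic`, so each consumer subscribes only to the applications it cares about.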
7. WHY SPARK
(Diagram: RDD-based engine: batch, interactive [SQL], streaming, machine learning)
1. ONE STACK TO RULE THEM ALL
• Learn just one system
• Develop within one framework
• Deploy/manage just one system
Interactive, batch processing, and stream processing, all in SPARK.
(Databricks co-founder & CTO Matei Zaharia, source)
8. LOG COLLECTION
• Syslog-ng is log collection software capable of processing logs in near real-time and delivering them to a wide variety of destinations.
• Syslog-ng provides reliable log management for environments ranging from a few to thousands of hosts, with an extreme message collection rate.
• Supported on more than 50 server platforms (including legacy ones!).
• Syslog-ng can natively collect and process log messages from a wide variety of enterprise software and custom applications.
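A relay of the kind described in slide 6 boils down to a short syslog-ng configuration: listen for syslog from devices, forward everything to a central hub. The sketch below is illustrative; the source/destination names, hostname, and ports are assumptions, not BBVA's actual configuration.

```
# Hypothetical syslog-ng relay sketch (names, host, and ports illustrative).
source s_devices {
  network(transport("udp") port(514));   # plain BSD syslog from devices
  network(transport("tcp") port(601));   # syslog over TCP
};
destination d_hub {
  network("loghub.example.internal" transport("tcp") port(601));
};
log { source(s_devices); destination(d_hub); };
```

From the hub, logs continue on to Kafka for distribution, as the next slide describes.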
9. LOG DISTRIBUTION
• Kafka is a distributed, partitioned, replicated commit log service, originally developed by LinkedIn.
• It is designed for high performance, strong durability guarantees, and easy scalability.
• Kafka's huge throughput, built-in partitioning, replication, and fault tolerance make it a good solution for large-scale message processing applications.
• It is normally used to consume raw data from topics, which is then aggregated, enriched, and transformed into new Kafka topics for further processing.
(Diagram: producers → Kafka cluster → consumers)
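The "partitioned, replicated commit log" model is easy to picture as an append-only list per partition, with each consumer tracking its own read offset. The toy below illustrates only the concepts; real Kafka adds replication, retention, consumer groups, and a wire protocol.

```python
# Toy sketch of Kafka's commit-log model: append-only partitions,
# key-based partition choice, consumer-owned offsets. Concept only.
class MiniLog:
    def __init__(self, partitions: int = 2):
        self.partitions = [[] for _ in range(partitions)]

    def produce(self, key: str, value: str) -> int:
        """Append to the partition chosen by hashing the key; return its offset."""
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return len(self.partitions[p]) - 1

    def consume(self, partition: int, offset: int) -> list:
        """Read everything at or after `offset`; the consumer owns its offset."""
        return self.partitions[partition][offset:]

log = MiniLog(partitions=1)
log.produce("host1", "login ok")
log.produce("host1", "login failed")
print(log.consume(0, 1))  # → ['login failed']
```

Because the broker only appends and consumers only advance an offset, many independent clients (analytics, compliance, indexing) can read the same stream at their own pace.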
10. LOG STORAGE
• HDFS is a distributed file system that provides high-performance access to data stored in a cluster.
• It is the 'de facto' clustered-storage solution in the Hadoop ecosystem, supported by the vast majority of Big Data software. HDFS is a key technology when you need to process data at scale, especially static data.
• It is designed for high availability, high performance, and easy scalability.
• Parquet is an efficient columnar storage format, built to support very efficient compression and encoding schemes.
• Apache Avro is a data serialization system with rich data structures and a compact, fast, binary data format.
Developer(s): Apache Software Foundation · Stable release: 2.7.0 / April 2015 · Operating system: Cross-platform · Type: Distributed filesystem · License: Apache License 2.0 · Website: hadoop.apache.org
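Why do columnar formats like Parquet compress so well? Values of a single column are stored together, so low-cardinality fields (status codes, hostnames) form long runs that trivial encodings collapse. The run-length-encoding toy below illustrates the idea; it is not Parquet's actual encoding.

```python
# Toy run-length encoder over one column, to show why columnar layouts
# (Parquet) compress well. Not Parquet's real encoding scheme.
def rle_encode(column):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# A typical log-status column: mostly identical neighbouring values.
status_column = ["OK"] * 5 + ["ERROR"] + ["OK"] * 3
print(rle_encode(status_column))  # → [['OK', 5], ['ERROR', 1], ['OK', 3]]
```

Nine values become three pairs; in a row-oriented layout the "OK"s would be interleaved with other fields and never form runs.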
11. FIGURES
In approx. 200 servers: 11 TB/day streamed from >2,000 applications/devices, and approx. 2 PB of stored data.
Not yet fully deployed, so these figures are objectives/estimations.
Considerations: replication, compression, bottlenecks, failures.
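The slide lists replication and compression as sizing considerations without showing the arithmetic. One plausible back-of-the-envelope that lands near the stated ~2 PB is sketched below; the retention window, replication factor, and compression ratio are assumptions for illustration, not numbers from the deck.

```python
# Back-of-the-envelope storage estimate. Only daily_tb comes from the
# deck; retention, replication, and compression are assumed values.
daily_tb = 11          # stated: 11 TB/day streamed
retention_days = 180   # assumed retention window
replication = 3        # typical HDFS replication factor
compression = 3        # assumed average compression ratio

stored_tb = daily_tb * retention_days * replication / compression
print(stored_tb)  # → 1980.0, i.e. roughly 2 PB
```

The point is the shape of the formula: replication multiplies the footprint, compression divides it, and retention dominates everything else.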
13. Towards a generic real-time aggregation platform
At Stratio, we have implemented several real-time analytics projects based
on Apache Spark, Kafka, Flume, Cassandra, or MongoDB.
These technologies were always a perfect fit, but we soon found ourselves
writing the same pieces of integration code over and over again.
14. Towards a generic real-time aggregation platform
Some initiatives have tried to solve this problem, but until now most of them
have been complex or obsolete, while others were not open source.
For this reason, Stratio created SPARKTA: an open source and full-featured
platform for real-time analytics, based on Apache Spark.
15. Distributed, high-volume & pluggable analytics framework
"Since Aryabhatta invented zero, mathematicians such as John von Neumann have been in pursuit of efficient counting, and architects have constantly built systems that compute counts quicker. In this age of social media, where hundreds of thousands of events take place every second, we designed an aggregation engine to deliver real-time service." (nice intro from Countandra)
Our goals:
• No need for coding, only declarative aggregation workflows
• Data continuously streamed in & processed in near real-time
• Ready to use out of the box
• Plug & play: flexible workflows (inputs, outputs, parsers, etc.)
• High performance
• Scalable and fault tolerant
16. A first look
The aggregation policy definition is sent to the engine. The driver/supervisor allows multiple applications to be defined, each of which is bound to a context executing the aggregation workflow.
(Diagram: AGGREGATION POLICY → DRIVER-SUPERVISOR → AGGREGATION WORKFLOW, with QUERY SERVICES and others as consumers)
17. Deploy any number of real-time aggregation policies
From the driver/supervisor you can start several workflows at any time, and also stop or monitor them.
18. Key Technologies
• INPUTS: any Spark Streaming receiver :) RabbitMQ, ZeroMQ, Twitter, Flume, Kafka, ...
• PROCESSING: Spark + Apache Kite SDK. Use the Spark DataFrames API or RDDs to integrate any datasource.
• OUTPUTS: ...
19. Define your real-time needs
AGGREGATION POLICY. Remember: no need to code anything. Define your workflow in a JSON document, including:
• INPUT: where is the data coming from?
• OUTPUT(s): where should aggregated data be stored?
• DIMENSION(s): which fields will you need for your real-time needs?
• ROLLUP(s): how do you want to aggregate the dimensions?
• TRANSFORMATION(s): which functions should be applied before aggregation?
• SAVE RAW DATA: do you want to save raw events?
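A policy document following the headings above might look like the JSON sketch below. The top-level keys mirror the slide's own list; the nested option names (`type`, `topic`, `granularity`, and so on) are illustrative assumptions, not Sparkta's exact schema.

```json
{
  "input":  { "type": "kafka", "topic": "logs.normalized" },
  "outputs": [
    { "type": "mongodb", "collection": "aggregates" }
  ],
  "dimensions": ["app", "host"],
  "rollups": [
    { "type": "time-based", "granularity": "minutely" }
  ],
  "transformations": [
    { "type": "dateParser", "field": "timestamp" }
  ],
  "saveRawData": true
}
```

Sending such a document to the driver/supervisor is the whole deployment step; no code accompanies it.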
20. Key Technologies
ROLLUPS
• Pass-through
• Time-based: secondly, minutely, hourly, daily, monthly, yearly...
• Hierarchical
• GeoRange: areas with different sizes (rectangles)
OPERATORS
• Max, min, count, sum
• Average, median
• Stdev, variance, count distinct
• Last value
• Full-text search
Kite SDK
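To make the rollup/operator pairing concrete, here is a pure-Python sketch of a time-based ("minutely") rollup applying `count` and `sum` operators over one dimension. The event field names are illustrative, and real Sparkta does this incrementally over a stream rather than over a list.

```python
from collections import defaultdict

# Minutely rollup sketch: bucket events by (minute, app) and apply
# count and sum operators. Field names are illustrative.
def minutely_rollup(events):
    """Aggregate events into (minute, app) cells with count and sum(bytes)."""
    cells = defaultdict(lambda: {"count": 0, "sum_bytes": 0})
    for e in events:
        minute = e["ts"][:16]            # "2015-06-01T10:00" time bucket
        cell = cells[(minute, e["app"])]
        cell["count"] += 1
        cell["sum_bytes"] += e["bytes"]
    return dict(cells)

events = [
    {"ts": "2015-06-01T10:00:12Z", "app": "web", "bytes": 200},
    {"ts": "2015-06-01T10:00:47Z", "app": "web", "bytes": 300},
    {"ts": "2015-06-01T10:01:05Z", "app": "web", "bytes": 100},
]
print(minutely_rollup(events)[("2015-06-01T10:00", "web")])
# → {'count': 2, 'sum_bytes': 500}
```

Swapping the bucket function (`ts[:13]` for hourly, a geohash for GeoRange) changes the rollup; swapping the cell update changes the operator.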
21. SDK
Sparkta has been conceived as an SDK. You can extend several points of the platform to fulfill your needs, such as adding new inputs, outputs, operators, and dimension types (INPUT, OUTPUT(s), DIMENSION(s), OPERATORS, TRANSFORMATION(s)).
You can also add new functions to Apache Kite in order to extend the data cleaning, enrichment, and normalization capabilities.
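The "extension points" idea usually boils down to a plugin registry: the platform looks operators up by name, and user code registers new ones. The registry and decorator below are a hypothetical sketch of that pattern, not Sparkta's actual interface.

```python
# Hypothetical plugin-style operator registry, sketching the kind of
# extension point an SDK exposes. Not Sparkta's real API.
OPERATORS = {}

def operator(name):
    """Decorator registering a new aggregation operator under `name`."""
    def register(fn):
        OPERATORS[name] = fn
        return fn
    return register

# Built-in operators ship pre-registered...
@operator("count")
def count_op(values):
    return len(values)

@operator("sum")
def sum_op(values):
    return sum(values)

# ...and a user-defined extension plugs in exactly the same way.
@operator("max")
def max_op(values):
    return max(values)

print(OPERATORS["max"]([3, 1, 2]))  # → 3
```

A policy's `"operators"` entries would then resolve against this registry at workflow start-up, so new operators need no changes to the engine itself.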