This document describes BBVA's implementation of a Big Data Lake using Apache Spark for log collection, storage, and analytics. It discusses:
1) Using Syslog-ng for log collection from over 2,000 applications and devices, distributing logs to Kafka.
2) Storing normalized logs in HDFS and performing analytics using Spark, with outputs to analytics, compliance, and indexing systems.
3) Choosing Spark because it allows interactive, batch, and stream processing with one system using RDDs, SQL, streaming, and machine learning.
A Big Data Lake Based on Spark for BBVA (Oscar Mendez, STRATIO)
1. A Big Data Lake based on Spark for BBVA (June 2015)
2. STARTING POSITION
• Absence of software capable of processing the data
• Isolated data silos
• Multiple structured & unstructured data sources
• Multiple log management software
• Applications just writing to disk (no network logging)
3. DRIVERS
Countless applications & benefits: fraud, security, data analysis, monitoring, SIEM, audit, e-commerce, user tracking, development & debugging, regulatory compliance.
4. HIGH-LEVEL SOLUTION
• Multiple-source ingestion to a common bus
• Normalization and transformation to a unified log (hard work!)
• Multiple data sinks depending on the clients and/or use cases:
  - Analytics
  - Regulatory compliance
  - Indexing engine
  - …
(Diagram: raw log → Big Data Lake → normalized log)
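The "hard work" of normalization can be pictured as mapping each source-specific record onto one unified schema. Below is a minimal sketch in Python; the field names (`timestamp`, `host`, `app`, `message`) and source labels are illustrative assumptions, not BBVA's actual schema.

```python
# Hypothetical unified-log normalization sketch. The unified field
# names and per-source mappings are illustrative, not the real schema.
UNIFIED_FIELDS = ("timestamp", "host", "app", "message")

def normalize(raw: dict, source: str) -> dict:
    """Map a source-specific raw record onto the unified log schema."""
    if source == "apache_access":
        return {
            "timestamp": raw["time"],
            "host": raw["remote_host"],
            "app": "apache",
            "message": raw["request"],
        }
    if source == "app_json":
        return {
            "timestamp": raw["ts"],
            "host": raw.get("hostname", "unknown"),  # tolerate missing fields
            "app": raw["service"],
            "message": raw["msg"],
        }
    raise ValueError(f"unknown source: {source}")

rec = normalize({"ts": "2015-06-01T10:00:00Z", "service": "payments",
                 "msg": "tx ok"}, "app_json")
print(rec["app"])  # → payments
```

Every downstream sink (analytics, compliance, indexing) then consumes the same record shape, which is what makes a single lake feasible.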
6. SOFTWARE PIECES
1. Logs sent from syslog-ng. Devices that don't support installation of syslog-ng send logs via syslog to a syslog-ng relay.
2. Logs sent from syslog-ng to Kafka, used as a distribution hub with a topic per consumer/client.
3. New applications to write directly to Kafka.
4. Multiple destinations: SPARKTA, ELK.
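The "topic per consumer/client" routing in step 2 can be sketched as a small parse-and-route step: parse an incoming syslog line, then derive the Kafka topic from the application name. This is an illustrative sketch; the regex covers only a simplified RFC 3164 layout, and the `logs.<app>` topic convention is an assumption, not the deck's actual naming scheme.

```python
import re

# Simplified RFC 3164-style syslog parser plus per-app topic routing.
# The "logs.<app>" topic naming is a hypothetical convention.
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>"                         # <PRI>
    r"(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "  # e.g. "Oct 11 22:14:15"
    r"(?P<host>\S+) "                              # hostname
    r"(?P<app>[\w./-]+?)(?:\[\d+\])?: "            # app name, optional [pid]
    r"(?P<msg>.*)$"                                # free-text message
)

def route(line: str) -> tuple[str, dict]:
    """Parse one syslog line and pick its Kafka topic (one per client/app)."""
    m = SYSLOG_RE.match(line)
    if not m:
        # Unparseable lines still land somewhere, as raw payload.
        return "logs.unparsed", {"raw": line}
    rec = m.groupdict()
    return f"logs.{rec['app']}", rec

topic, rec = route("<34>Oct 11 22:14:15 host1 sshd[123]: Failed password")
print(topic)  # → logs.sshd
```

A producer would then publish `rec` to `topic`, so each consumer subscribes only to the applications it cares about.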
7. WHY SPARK
(Diagram: RDD-based engine: batch, interactive [SQL], streaming, machine learning)
1. ONE STACK TO RULE THEM ALL
• Learn just one system
• Develop within one framework
• Deploy/manage just one system
Interactive, batch processing, and stream processing, all in SPARK.
(Databricks co-founder & CTO Matei Zaharia, source)
8. LOG COLLECTION
• Syslog-ng is log collection software capable of processing logs in near real-time and delivering them to a wide variety of destinations.
• Syslog-ng provides reliable log management for environments ranging from a few to thousands of hosts, with an extreme message collection rate.
• Supported on more than 50 server platforms (including legacy ones!).
• Syslog-ng can natively collect and process log messages from a wide variety of enterprise software and custom applications.
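A relay of the kind described in slide 6 boils down to a short syslog-ng configuration: listen for syslog from devices, forward everything to a central hub. The sketch below is illustrative; the source/destination names, hostname, and ports are assumptions, not BBVA's actual configuration.

```
# Hypothetical syslog-ng relay sketch (names, host, and ports illustrative).
source s_devices {
  network(transport("udp") port(514));   # plain BSD syslog from devices
  network(transport("tcp") port(601));   # syslog over TCP
};
destination d_hub {
  network("loghub.example.internal" transport("tcp") port(601));
};
log { source(s_devices); destination(d_hub); };
```

From the hub, logs continue on to Kafka for distribution, as the next slide describes.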
9. LOG DISTRIBUTION
• Kafka is a distributed, partitioned, replicated commit log service, originally developed by LinkedIn.
• It is designed for high performance, strong durability guarantees, and easy scalability.
• Kafka's huge throughput, built-in partitioning, replication, and fault tolerance make it a good solution for large-scale message processing applications.
• It is normally used to consume raw data from topics, which is then aggregated, enriched, and transformed into new Kafka topics for further processing.
(Diagram: producers → Kafka cluster → consumers)
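The "partitioned, replicated commit log" model is easy to picture as an append-only list per partition, with each consumer tracking its own read offset. The toy below illustrates only the concepts; real Kafka adds replication, retention, consumer groups, and a wire protocol.

```python
# Toy sketch of Kafka's commit-log model: append-only partitions,
# key-based partition choice, consumer-owned offsets. Concept only.
class MiniLog:
    def __init__(self, partitions: int = 2):
        self.partitions = [[] for _ in range(partitions)]

    def produce(self, key: str, value: str) -> int:
        """Append to the partition chosen by hashing the key; return its offset."""
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return len(self.partitions[p]) - 1

    def consume(self, partition: int, offset: int) -> list:
        """Read everything at or after `offset`; the consumer owns its offset."""
        return self.partitions[partition][offset:]

log = MiniLog(partitions=1)
log.produce("host1", "login ok")
log.produce("host1", "login failed")
print(log.consume(0, 1))  # → ['login failed']
```

Because the broker only appends and consumers only advance an offset, many independent clients (analytics, compliance, indexing) can read the same stream at their own pace.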
10. LOG STORAGE
• HDFS is a distributed file system that provides high-performance access to data stored in a cluster.
• It is the 'de facto' clustered-storage solution in the Hadoop ecosystem, supported by the vast majority of Big Data software. HDFS is a key technology when you need to process data at scale, especially static data.
• It is designed for high availability, high performance, and easy scalability.
• Parquet is an efficient columnar storage format, built to support very efficient compression and encoding schemes.
• Apache Avro is a data serialization system with rich data structures and a compact, fast, binary data format.
Developer(s): Apache Software Foundation · Stable release: 2.7.0 / April 2015 · Operating system: Cross-platform · Type: Distributed filesystem · License: Apache License 2.0 · Website: hadoop.apache.org
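Why do columnar formats like Parquet compress so well? Values of a single column are stored together, so low-cardinality fields (status codes, hostnames) form long runs that trivial encodings collapse. The run-length-encoding toy below illustrates the idea; it is not Parquet's actual encoding.

```python
# Toy run-length encoder over one column, to show why columnar layouts
# (Parquet) compress well. Not Parquet's real encoding scheme.
def rle_encode(column):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# A typical log-status column: mostly identical neighbouring values.
status_column = ["OK"] * 5 + ["ERROR"] + ["OK"] * 3
print(rle_encode(status_column))  # → [['OK', 5], ['ERROR', 1], ['OK', 3]]
```

Nine values become three pairs; in a row-oriented layout the "OK"s would be interleaved with other fields and never form runs.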
11. FIGURES
In approx. 200 servers: 11 TB/day streamed from >2,000 applications/devices, and approx. 2 PB of stored data.
Not yet fully deployed, so these figures are objectives/estimations.
Considerations: replication, compression, bottlenecks, failures.
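The slide lists replication and compression as sizing considerations without showing the arithmetic. One plausible back-of-the-envelope that lands near the stated ~2 PB is sketched below; the retention window, replication factor, and compression ratio are assumptions for illustration, not numbers from the deck.

```python
# Back-of-the-envelope storage estimate. Only daily_tb comes from the
# deck; retention, replication, and compression are assumed values.
daily_tb = 11          # stated: 11 TB/day streamed
retention_days = 180   # assumed retention window
replication = 3        # typical HDFS replication factor
compression = 3        # assumed average compression ratio

stored_tb = daily_tb * retention_days * replication / compression
print(stored_tb)  # → 1980.0, i.e. roughly 2 PB
```

The point is the shape of the formula: replication multiplies the footprint, compression divides it, and retention dominates everything else.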
13. Towards a generic real-time aggregation platform
At Stratio, we have implemented several real-time analytics projects based
on Apache Spark, Kafka, Flume, Cassandra, or MongoDB.
These technologies were always a perfect fit, but we soon found ourselves
writing the same pieces of integration code over and over again.
14. Towards a generic real-time aggregation platform
Some initiatives have tried to solve this problem, but until now most of them
have been complex or obsolete, while others were not open source.
For this reason, Stratio created SPARKTA: an open source and full-featured
platform for real-time analytics, based on Apache Spark.
15. Distributed, high-volume & pluggable analytics framework
"Since Aryabhatta invented zero, mathematicians such as John von Neumann have been in pursuit of efficient counting, and architects have constantly built systems that compute counts quicker. In this age of social media, where hundreds of thousands of events take place every second, we designed an aggregation engine to deliver real-time service." (nice intro from Countandra)
Our goals:
• No need for coding, only declarative aggregation workflows
• Data continuously streamed in & processed in near real-time
• Ready to use out of the box
• Plug & play: flexible workflows (inputs, outputs, parsers, etc.)
• High performance
• Scalable and fault tolerant
16. A first look
The aggregation policy definition is sent to the engine. The driver/supervisor allows multiple applications to be defined, each of which is bound to a context executing the aggregation workflow.
(Diagram: AGGREGATION POLICY → DRIVER-SUPERVISOR → AGGREGATION WORKFLOW, with QUERY SERVICES and others as consumers)
17. Deploy any number of real-time aggregation policies
From the driver/supervisor you can start several workflows at any time, and also stop or monitor them.
18. Key Technologies
• INPUTS: any Spark Streaming receiver :) RabbitMQ, ZeroMQ, Twitter, Flume, Kafka, ...
• PROCESSING: Spark + Apache Kite SDK. Use the Spark DataFrames API or RDDs to integrate any datasource.
• OUTPUTS: ...
19. Define your real-time needs
AGGREGATION POLICY. Remember: no need to code anything. Define your workflow in a JSON document, including:
• INPUT: where is the data coming from?
• OUTPUT(s): where should aggregated data be stored?
• DIMENSION(s): which fields will you need for your real-time needs?
• ROLLUP(s): how do you want to aggregate the dimensions?
• TRANSFORMATION(s): which functions should be applied before aggregation?
• SAVE RAW DATA: do you want to save raw events?
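A policy document following the headings above might look like the JSON sketch below. The top-level keys mirror the slide's own list; the nested option names (`type`, `topic`, `granularity`, and so on) are illustrative assumptions, not Sparkta's exact schema.

```json
{
  "input":  { "type": "kafka", "topic": "logs.normalized" },
  "outputs": [
    { "type": "mongodb", "collection": "aggregates" }
  ],
  "dimensions": ["app", "host"],
  "rollups": [
    { "type": "time-based", "granularity": "minutely" }
  ],
  "transformations": [
    { "type": "dateParser", "field": "timestamp" }
  ],
  "saveRawData": true
}
```

Sending such a document to the driver/supervisor is the whole deployment step; no code accompanies it.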
20. Key Technologies
ROLLUPS
• Pass-through
• Time-based: secondly, minutely, hourly, daily, monthly, yearly...
• Hierarchical
• GeoRange: areas with different sizes (rectangles)
OPERATORS
• Max, min, count, sum
• Average, median
• Stdev, variance, count distinct
• Last value
• Full-text search
Kite SDK
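To make the rollup/operator pairing concrete, here is a pure-Python sketch of a time-based ("minutely") rollup applying `count` and `sum` operators over one dimension. The event field names are illustrative, and real Sparkta does this incrementally over a stream rather than over a list.

```python
from collections import defaultdict

# Minutely rollup sketch: bucket events by (minute, app) and apply
# count and sum operators. Field names are illustrative.
def minutely_rollup(events):
    """Aggregate events into (minute, app) cells with count and sum(bytes)."""
    cells = defaultdict(lambda: {"count": 0, "sum_bytes": 0})
    for e in events:
        minute = e["ts"][:16]            # "2015-06-01T10:00" time bucket
        cell = cells[(minute, e["app"])]
        cell["count"] += 1
        cell["sum_bytes"] += e["bytes"]
    return dict(cells)

events = [
    {"ts": "2015-06-01T10:00:12Z", "app": "web", "bytes": 200},
    {"ts": "2015-06-01T10:00:47Z", "app": "web", "bytes": 300},
    {"ts": "2015-06-01T10:01:05Z", "app": "web", "bytes": 100},
]
print(minutely_rollup(events)[("2015-06-01T10:00", "web")])
# → {'count': 2, 'sum_bytes': 500}
```

Swapping the bucket function (`ts[:13]` for hourly, a geohash for GeoRange) changes the rollup; swapping the cell update changes the operator.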
21. SDK
Sparkta has been conceived as an SDK. You can extend several points of the platform to fulfill your needs, such as adding new inputs, outputs, operators, and dimension types (INPUT, OUTPUT(s), DIMENSION(s), OPERATORS, TRANSFORMATION(s)).
You can also add new functions to Apache Kite in order to extend the data cleaning, enrichment, and normalization capabilities.
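The "extension points" idea usually boils down to a plugin registry: the platform looks operators up by name, and user code registers new ones. The registry and decorator below are a hypothetical sketch of that pattern, not Sparkta's actual interface.

```python
# Hypothetical plugin-style operator registry, sketching the kind of
# extension point an SDK exposes. Not Sparkta's real API.
OPERATORS = {}

def operator(name):
    """Decorator registering a new aggregation operator under `name`."""
    def register(fn):
        OPERATORS[name] = fn
        return fn
    return register

# Built-in operators ship pre-registered...
@operator("count")
def count_op(values):
    return len(values)

@operator("sum")
def sum_op(values):
    return sum(values)

# ...and a user-defined extension plugs in exactly the same way.
@operator("max")
def max_op(values):
    return max(values)

print(OPERATORS["max"]([3, 1, 2]))  # → 3
```

A policy's `"operators"` entries would then resolve against this registry at workflow start-up, so new operators need no changes to the engine itself.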