Introduction
New Architectures for Big Data
Whether MVC for user interfaces or Spine-and-Leaf for data centers, new
architecture patterns in our industry act as historical markers of the
effectiveness and acceptance of new technologies: practical techniques push the
bounds until a shift occurs. Distributed storage and streaming capabilities such
as Kafka and, of course, Hadoop are shifting Big Data architectures from a
layer-cake, North/South-oriented approach to one which can be thought of as an
East/West architectural concept. One recently popular pattern is the Lambda
Architecture; this article presents an SAP HANA-based rendering of it.
Lambda Architecture
Before discussing how SAP products can be used in a Lambda fashion, a brief
overview of Lambda is in order; attribution goes to Nathan Marz for his original
work. Typically consisting of three components (a batch layer, a speed layer,
and a serving layer), Lambda rethinks how data flows through an analytics
system. Data flows left to right in a multi-speed architecture: all data is
stored in the batch layer and the speed layer simultaneously, while the serving
layer combines and analyzes that data as required. The advantages are fault
tolerance and scalability, while at the same time offering multiple latencies to
support a variety of performance requirements.
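The data flow just described can be sketched in a few lines of Python. This is a toy illustration of the Lambda data path, not any product's API; the names `batch_log` and `speed_view` are purely illustrative:

```python
from collections import defaultdict

class LambdaPipeline:
    """Toy model of the Lambda data flow: every incoming event is written
    to the immutable batch log and, at the same time, folded into a
    low-latency speed view."""

    def __init__(self):
        self.batch_log = []                 # immutable master dataset
        self.speed_view = defaultdict(int)  # incremental real-time view

    def ingest(self, event):
        # All data lands in the batch layer ...
        self.batch_log.append(event)
        # ... and is simultaneously processed by the speed layer.
        self.speed_view[event["key"]] += event["value"]

pipeline = LambdaPipeline()
for e in [{"key": "clicks", "value": 1}, {"key": "clicks", "value": 2}]:
    pipeline.ingest(e)

print(len(pipeline.batch_log))        # 2: full history retained
print(pipeline.speed_view["clicks"])  # 3: up-to-the-moment aggregate
```

The essential point is that ingestion is dual-write: the speed layer answers now, while the batch log preserves everything for later.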
With Lambda, some business questions are answered by the speed layer, others by
the batch layer, and still others by combining both. While algorithms in the
speed layer decide which data to keep and answer immediate questions, such as
which ad to display or how much discount to offer, an immutable copy of the
data is stored in the batch layer. In addition to acting as a replay source,
data in the batch layer is suited to use cases such as long-tail analytics,
machine learning, or propensity-to-act types of endeavors. Fault tolerance,
resilience, ultimate scale, and back-testing all derive from having an immutable
copy of the original data available for replay through the downstream
components. The serving layer may persist a copy of the data for additional
downstream uses such as dashboards or feeds to other systems. Shown here are
several open source options for the components which make up a Lambda
Architecture.
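A minimal sketch of how a serving layer might combine the two views to answer one question. The merge-by-addition logic assumes an additive metric such as a count; the view contents here are invented for illustration:

```python
# Batch view: precomputed from the immutable master dataset (e.g. nightly).
batch_view = {"clicks": 1_000, "orders": 40}
# Speed view: incremental aggregates over data that arrived since the
# last batch run.
speed_view = {"clicks": 37, "orders": 2}

def serving_query(metric):
    """Serving layer: merge the complete-but-stale batch view with the
    fresh-but-partial speed view to answer a query."""
    return batch_view.get(metric, 0) + speed_view.get(metric, 0)

print(serving_query("clicks"))  # 1037
print(serving_query("orders"))  # 42
```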
Traditional Lambda Architecture
Lambda with SAP HANA Platform
At SAP we are rapidly delivering value with our customers by employing these types of architecture patterns.
In 2012 we started moving the HANA product toward the SAP real-time data architecture, where one already
finds the notions of Lambda. Smart Data Streaming is the obvious choice for the real-time layer. Born on Wall
Street, this component can ingest millions of rows per second into HANA. Complete with a rich set of APIs,
stream and window compute constructs, and scale-out capability, it can send the right data into HANA and all
the data into HDFS as necessary.
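The stream and window compute constructs mentioned above are expressed in Smart Data Streaming's own tooling; as a product-neutral sketch, one of the simplest such constructs, a tumbling-window count over timestamped events, can be illustrated in plain Python:

```python
def tumbling_window_counts(events, window_seconds):
    """Group timestamped events into fixed-size (tumbling) windows and
    count the rows in each, the kind of computation a streaming engine
    evaluates continuously as data arrives."""
    counts = {}
    for ts, _payload in events:
        # Align each event to the start of its window.
        window_start = ts - (ts % window_seconds)
        counts[window_start] = counts.get(window_start, 0) + 1
    return counts

# Hypothetical (timestamp, payload) events.
events = [(0, "a"), (3, "b"), (5, "c"), (11, "d")]
print(tumbling_window_counts(events, 5))  # {0: 2, 5: 1, 10: 1}
```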
As the leading in-memory database platform, HANA meets the needs of the serving component of Lambda
rather well. Data can be stored once, then transformed, aggregated, and calculated dynamically. Easy-to-
understand in-memory compute engines such as graph, spatial, OLAP, and predictive analytics are available
to manipulate the data where it is stored. Because the immutable data is kept in an append-only columnar
SQL database and serving is done on the fly, our customers experience a valuable and unique combination of
expressive, declarative programming capability on data managed in a SQL framework. This allows data
access from everyday BI tools via standard ODBC or JDBC SQL as well as ODBO MDX. Based on Node.js,
the HANA application server XS Advanced provides, among other things, NoSQL access to the data via
RESTful OData.
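HANA's serving engines are specific to the platform, but the underlying idea of store-once, aggregate-on-the-fly over an append-only table can be sketched with any SQL database; here Python's built-in sqlite3 stands in purely for illustration, and the table and values are invented:

```python
import sqlite3

# Append-only fact table: rows are only ever inserted, never updated.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EMEA", 100.0), ("EMEA", 250.0), ("APJ", 75.0)])

# Serving is done on the fly: aggregates are computed at query time
# rather than maintained as precomputed materializations.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('APJ', 75.0), ('EMEA', 350.0)]
```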
Oftentimes requirements change such that new algorithms or new derivations of the immutable data are
required. Because of the dynamic nature of materialization employed by the HANA core architecture, these
changes can be made almost instantly. This flexibility changes the logistics of replay, allowing for rapid
experimentation and rapid moves into a production environment. In cases where a schema needs to be
extended, HANA offers schema flexibility whereby columns may be added to an existing table structure,
leaving any existing code intact. The degree of flexibility HANA offers once the immutable data has been
stored is a fundamental differentiator, reducing time to value and increasing data agility.
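The schema-extension idea, adding a column while leaving existing code untouched, is not HANA-specific and can be demonstrated with any SQL database; sqlite3 is again used only as a stand-in, with an invented table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
conn.execute("INSERT INTO events VALUES (1, 'hello')")

def existing_report():
    # "Existing code": written before the schema change, it names only
    # the original columns and never needs to be touched.
    return conn.execute("SELECT id, payload FROM events").fetchall()

before = existing_report()
# Extend the schema in place, without rewriting the table or the query.
conn.execute("ALTER TABLE events ADD COLUMN source TEXT")
conn.execute("INSERT INTO events (id, payload, source) "
             "VALUES (2, 'world', 'sensor')")
after = existing_report()

print(before)  # [(1, 'hello')]
print(after)   # [(1, 'hello'), (2, 'world')]
```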
While the HANA platform works extremely well for real-time applications, the Lambda picture is completed by
incorporating Vora. From an architecture standpoint, there are several aspects to add to the discussion. The
first is scalability.
One of the beauties of Lambda is that each component scales dynamically and independently as latency
requirements shift to meet changing business requirements. This independent scaling allows appropriate
resource allocation in a fit-for-purpose model. Vora's use of Hadoop moves the scale of the overall system up
to the multi-petabyte range. More importantly, addressing long-tail analytics and other big data compute
problems now becomes possible.
This leads to the next concept: throughput and algorithmic capability. Each of the real-time, batch, and
serving layers is best suited to a particular algorithm at a given throughput. All of them can run 'out of band'
code; the question is at what level of concurrency and complexity. Rather than have the Java JVM as the
foundation of each, SAP provides the appropriate engines with well-documented and supported extension
capabilities. One may argue that open source provides the ultimate extension capability, but only insofar as a
consuming organization is willing to merge code. Here, SAP provides high-performance, fit-for-purpose
engines to achieve the desired results more efficiently.
This leaves us with the topic of replay, an oft-forgotten concept, especially in an operational batch context.
Usually left to operations, re-running loads into a system to catch up with the real world, or to modify results
based on new algorithms, often causes over-provisioning of hardware in today's ETL and serving layers. By
accounting for this capability up front, transforms can happen at the right time, reducing the time and energy
required to keep things in phase. HANA is a fantastic example of this given its powerful late-materialization
capability. Eventually, elastic capabilities will further reduce replay workload stress through optimal allocation
of compute resources at finer granularity and lower latency. By employing Lambda and other emerging big
data architecture patterns, our customers remain well positioned as the state of the art continues to advance.
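Replay itself can be made concrete with a small sketch: rebuild a serving view by running the full immutable history through a brand-new algorithm, with no new data capture required. All names and values here are illustrative, not tied to any product:

```python
# Immutable batch log captured by the ingestion path.
batch_log = [
    {"user": "a", "amount": 120.0},
    {"user": "b", "amount": 40.0},
    {"user": "a", "amount": 15.0},
]

def rebuild_view(log, transform):
    """Replay: rebuild a serving view from scratch by running the entire
    immutable history through a (possibly new) per-event algorithm."""
    view = {}
    for event in log:
        key, value = transform(event)
        view[key] = view.get(key, 0) + value
    return view

# Original algorithm: total spend per user.
v1 = rebuild_view(batch_log, lambda e: (e["user"], e["amount"]))
# Requirements changed: transaction count per user. No new data capture
# is needed; simply replay the same log through the new transform.
v2 = rebuild_view(batch_log, lambda e: (e["user"], 1))

print(v1)  # {'a': 135.0, 'b': 40.0}
print(v2)  # {'a': 2, 'b': 1}
```

Because the log is immutable, both views remain reproducible and back-testable at any time.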