DataStax and Azure IoT Reference Architecture
1. Reference Architecture Overview
Connected sensors, devices, and intelligent operations are transforming businesses and enabling new growth
opportunities with Microsoft Azure Internet of Things (IoT) services.
This document outlines how to use DataStax Enterprise (DSE) in the Azure IoT reference architecture.
DataStax Enterprise is a geographically distributed and horizontally scalable transactional database based on Apache
Cassandra. It includes integrated Spark analytics for stream processing and machine learning and a graph database for
relationship modeling. DataStax Enterprise is ideal for storing operational data with always-on availability requirements.
Figure 1: IoT solution architecture
Figure 1 shows the conceptual architecture for Azure IoT. This is detailed in the document Microsoft Azure IoT Reference
Architecture¹. DataStax Enterprise can be used to implement a number of components in the architecture, enhancing
performance, functionality and reliability.
2. Implementation
Figure 2, below, shows how DataStax Enterprise can be used as part of the end-to-end Azure IoT reference architecture.
¹ https://azure.microsoft.com/en-us/updates/microsoft-azure-iot-reference-architecture-available/
[Figure 1 diagram. Layers: Device Connectivity; Data Processing, Analytics and Management; Presentation & Business
Connectivity. Components: Low power devices, Existing IoT devices, IP capable devices, Personal mobile devices, IoT Client,
Gateway, Cloud Gateway, Provisioning API, Identity and Registry Stores, Device State Store, Stream Process, Analytics &
Machine Learning, Storage, App Backend, Solution UX, Business Integration Connectors and Gateway(s), Business systems.
Legend: Data Path, Optional solution component, IoT solution component.]
Figure 2: DataStax usage in the Azure IoT Reference Architecture
DataStax Enterprise can be used to implement the device registry and the device state stores. Additionally, the analytics
components of DataStax Enterprise can be leveraged to implement the stream processing and analytics portions of the
Azure IoT reference architecture.
Using DataStax Enterprise for these components offers several key advantages:
• Linear Scalability – Able to scale to handle millions of transactions per second
• Resilience to Failure – DataStax Enterprise provides node fault tolerance, rack fault tolerance and data center
  level disaster tolerance
• Integrated Analytics – Support for graph databases, full text search and machine learning
These three advantages map directly onto the requirements of IoT stack components. More detail on how they relate to
those components is given later in this paper.
DataStax is working with Mesosphere to deploy DataStax Enterprise in the Mesos Universe, allowing for a push-button,
containerized deployment in both Mesos and Azure Container Service (ACS). This allows for simplified provisioning and
orchestration if implementing an open source version of the Azure IoT reference architecture. More information is
available in the press release².
Alternatively, if the Azure IoT reference architecture is implemented using primarily Azure services rather than open source
components, DataStax Enterprise can be deployed on Azure VMs using Azure Resource Manager (ARM). For field
gateways, DataStax can be deployed directly on the hardware. This means that a single database infrastructure can be
used across hybrid cloud IoT deployments. This results in simplified operations by avoiding the need to maintain multiple
types of infrastructure.
² http://www.marketwired.com/press-release/mesosphere-brings-datastax-enterprise-to-the-dc-os-universe-app-store-2130849.htm
[Figure 2 diagram. Same layers as Figure 1: Device Connectivity; Data Processing, Analytics and Management; Presentation &
Business Connectivity. Components: Low power devices, Existing IoT devices, IP capable devices, Personal mobile devices,
IoT Client, Gateway, IoT Hub, Provisioning API, Device Registry Stores, Device State Store, Real-time Analytics (Spark /
Spark R / Spark ML scoring), Batch Analytics (Spark ML training), Hadoop, App Backend, Solution UX, Business Integration
Connectors and Gateway(s), Business systems. Legend: Data Path, Optional solution component, IoT solution component,
IoT solution component using DataStax.]
The following sections describe the advantages of using DataStax Enterprise for the different components of the Azure IoT
reference architecture.
3. Device Registry Store
The device registry contains device-related metadata attributes and reference data for provisioned devices. The device
registry serves as an index for device discoverability and is used by the solution backend components and UI. Typically, the
device registry contains only slowly changing data. Examples of device registry store data include:
• The building and room number a smoke alarm is installed in
• The installation date for a mixing valve
• The upstream and downstream components connected to a generator
Uptime is extremely important for the device registry. If it is not available, operations that depend on device metadata will
fail. DataStax Enterprise is very resilient to failure, making it an excellent choice for the device registry store. DataStax
Enterprise clusters are made up of, in descending order, data centers, racks and nodes. DSE is resilient to failure at the
data center, rack and node level.
Using DataStax Enterprise for the device registry store has advantages beyond the resilience inherent in the database. DSE
is a multi-model database, including support for tabular data, full text search, analytics and graph databases. The graph
model is particularly useful for the device registry. It can be used to model the relations between devices and other
domain-specific entities. For example, as shown in Figure 3, it can be used to represent the relations between sensors,
machines, factories and products in a manufacturing scenario.
Figure 3: Example of DataStax Enterprise graph model for Azure IoT
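To make the graph idea concrete, the following is a minimal sketch in plain Python of a device registry modeled as labeled, directed edges. It does not use the DSE Graph API. The class name, the vertex names and the edge directions (sensor monitors machine, machine is part of factory, machine assembles product) are assumptions chosen for illustration only.

```python
from collections import defaultdict


class DeviceGraph:
    """Tiny in-memory graph with labeled, directed edges between vertices."""

    def __init__(self):
        # vertex -> list of (edge_label, target_vertex)
        self._out = defaultdict(list)

    def add_edge(self, source, label, target):
        self._out[source].append((label, target))

    def neighbors(self, vertex, label):
        """Vertices reachable from `vertex` over one edge with the given label."""
        return [target for (edge_label, target) in self._out[vertex] if edge_label == label]

    def traverse(self, start, *labels):
        """Follow a fixed sequence of edge labels, one hop per label."""
        frontier = [start]
        for label in labels:
            frontier = [t for v in frontier for t in self.neighbors(v, label)]
        return frontier


# Hypothetical registry contents for a manufacturing scenario.
registry = DeviceGraph()
registry.add_edge("sensor-1", "monitors", "machine-A")
registry.add_edge("sensor-2", "monitors", "machine-A")
registry.add_edge("machine-A", "part_of", "factory-X")
registry.add_edge("machine-A", "assembles", "product-P")

# Which products could be affected if sensor-1 reports a problem?
affected = registry.traverse("sensor-1", "monitors", "assembles")
```

A two-hop question such as "which products depend on the machine this sensor monitors" becomes a short traversal, which is the kind of query a graph model is suited for.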
Many databases fall into one of two categories with respect to transactional behavior:
1. Strongly Consistent
2. Eventually Consistent
[Figure 3 diagram. Vertices: Sensor, Machine, Factory, Product, Operator, Failure. Edge labels: Monitors, Reports, Affects,
Part of, Assembles, Producers, Works for.]
The CAP theorem³ details the tradeoffs between these categories. Essentially, strong consistency comes with
disadvantages in scale and reliability. Eventual consistency does better in that regard, but at the cost of potential
inconsistency.
DataStax Enterprise is unique in that it offers tunable consistency. This means that the consistency level can be balanced
against performance and availability characteristics depending on the application. For the device registry store, it may be
advisable to tune toward read performance and consistency. This is because the device registry is read far more often than
it is written, and inconsistency in its data could impact the application back end and analytics processes.
One example of how tunable consistency can benefit performance in Azure IoT would be to take advantage of the low write
frequency of the device registry. In this scenario, writes are infrequent but should be reflected across the entire cluster.
We could tune write consistency to ALL. This would require every replica to acknowledge a write before an update to the
devices registered in the database is reported as successful. Reads in this scenario are much more frequent, so we may want
to bias the consistency tuning to make them occur as quickly as possible. For this reason, we could tune to ONE for reads
in the device registry store. This would mean that a read returns as soon as a single replica responds with the data.
Quorum level consistency options could be used as well. More information on tuning consistency is available in this
article⁴.
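The trade-off between the ALL/ONE and quorum settings discussed above can be checked with a little arithmetic. The sketch below is plain Python, not driver code. It applies the commonly cited rule that a read is guaranteed to overlap the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor. The function names are hypothetical, and a single data center with a simple replication factor is assumed.

```python
def required_replicas(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def reads_see_latest_write(write_level, read_level, replication_factor):
    """True when read and write replica sets must overlap (R + W > RF)."""
    written = required_replicas(write_level, replication_factor)
    read = required_replicas(read_level, replication_factor)
    return read + written > replication_factor


# Device registry tuning from the text: write at ALL, read at ONE.
registry_ok = reads_see_latest_write("ALL", "ONE", replication_factor=3)

# Quorum on both sides also overlaps, while ONE/ONE does not.
quorum_ok = reads_see_latest_write("QUORUM", "QUORUM", replication_factor=3)
one_one_ok = reads_see_latest_write("ONE", "ONE", replication_factor=3)
```

With a replication factor of 3, both ALL/ONE and QUORUM/QUORUM give overlapping reads. They differ in availability: ALL needs every replica up for a write to succeed, whereas QUORUM tolerates one replica being down.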
4. Device State Store
The device state store contains operational data from the devices. Device operational data is high volume and high
velocity data, typically many orders of magnitude more than what is stored in the Device Registry Store. This is because a
single device will produce many readings. Given these data volumes, it’s extremely important to use a highly scalable
database for the device state store.
DataStax Enterprise scales linearly to handle the load demands of millions of devices. Figure 4, below, shows how DSE
performance scales linearly as nodes are added to a cluster. DSE can scale from extremely small to large clusters that can
handle millions of transactions per second. This makes DSE an ideal database for the device state store.
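Under the linear scaling assumption, a first-pass cluster size follows from simple division. The sketch below is a rough planning aid only. The per-node throughput figure is hypothetical and would need to be measured for a real workload, and the floor at the replication factor reflects the assumption that a cluster needs at least that many nodes to hold every replica on a distinct node.

```python
import math


def nodes_required(target_ops_per_sec, ops_per_node, replication_factor=3):
    """Smallest node count meeting the target if throughput grows linearly with nodes."""
    if ops_per_node <= 0:
        raise ValueError("ops_per_node must be positive")
    nodes = math.ceil(target_ops_per_sec / ops_per_node)
    # Never go below the replication factor, so each replica has its own node.
    return max(nodes, replication_factor)


# One million devices, each reporting once every 10 seconds: 100,000 writes/sec.
devices = 1_000_000
writes_per_sec = devices // 10

# Hypothetical measured throughput of a single node.
cluster_size = nodes_required(writes_per_sec, ops_per_node=10_000)
```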
³ https://en.wikipedia.org/wiki/CAP_theorem
⁴ https://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_config_consistency_c.html
Figure 4: Near linear scalability in DataStax Enterprise⁵
In the case of the device state store, the need for transferring or replicating each data category should be analyzed. Raw
telemetry data might not need to be available on a secondary site. Aggregated data will represent a reduced data volume
which might be easier to replicate if needed.
The device state store has much more dynamic data than the device registry store. Uptime and the ability to handle large
data volumes remain important and the deployment architecture that works well in that scenario remains viable here.
In the device state store, it may be advisable to tune consistency differently than in the device registry. Here the aim is to
optimize for writes rather than reads. In these cases, reads and writes can both occur with quorum level consistency.
The device state store and device registry store can be implemented as distinct databases or as a single database. While
there may be minor latency advantages to implementing distinct databases, for most cases we would recommend a single
database. This simplifies administration and reduces hardware cost.
[Figure 4 chart: Operations/sec (0 to 350,000) versus Nodes (1, 2, 4, 8, 16, 32) for Cassandra, Couchbase, HBase and
MongoDB.]
⁵ http://www.datastax.com/apache-cassandra-leads-nosql-benchmark
5. Real-Time Analytics
After ingress through Azure IoT Hub as the cloud gateway, the flow of data through the system is facilitated by data
pumps and analytics tasks. Data pumps typically move or route data without any transformation, while analytics
tasks perform complex event processing. Since the IoT Hub provides brokered communication and supports multiple
consumers, the same data can be consumed by different stream processors for different purposes, which will result in
multiple data streams flowing concurrently. For example, a stream processor may listen only for special types of events,
while another one could perform complex event processing in parallel. Those processors can determine the path of the data
and route it without any reshaping, or they can perform complex event processing tasks such as data aggregation, data
enrichment through correlation with reference data, and analytics tasks such as detection of threshold limits or anomalies
and generation of alerts.
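The three kinds of stream processing task named above (routing without reshaping, enrichment with reference data, and threshold detection) can be sketched as chained Python generators. This is an illustration of the pattern only. It does not use Spark Streaming or the IoT Hub SDK, and the event fields, device identifiers and threshold value are hypothetical.

```python
def route(events, event_type):
    """Data pump: pass matching events through without reshaping them."""
    for event in events:
        if event["type"] == event_type:
            yield event


def enrich(events, device_registry):
    """Enrichment: merge each event with reference data for its device."""
    for event in events:
        yield {**event, **device_registry.get(event["device_id"], {})}


def threshold_alerts(events, limit):
    """Analytics task: emit an alert for every reading above the limit."""
    for event in events:
        if event["value"] > limit:
            yield {"alert": "threshold_exceeded", **event}


# Hypothetical reference data and incoming telemetry.
device_registry = {"alarm-7": {"building": "B2", "room": "214"}}
events = [
    {"device_id": "alarm-7", "type": "particulate", "value": 12.0},
    {"device_id": "alarm-7", "type": "heartbeat", "value": 0.0},
    {"device_id": "alarm-7", "type": "particulate", "value": 48.5},
]

pipeline = threshold_alerts(enrich(route(events, "particulate"), device_registry), limit=40.0)
alerts = list(pipeline)
```

Because each stage consumes and yields events one at a time, stages can be combined in different orders for different consumers of the same stream.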
As part of its multi-model capabilities, DataStax Enterprise embeds Apache Spark. Spark consists of a number of
components that are relevant to the Real-time Analytics aspect of the Azure IoT reference architecture, including Spark
Streaming and Spark MLlib.
For hot path analytics, data flows directly from the Azure IoT Hub into Spark Streaming, which is integrated into the
DataStax Enterprise runtime. Spark allows you to use pre-trained MLlib models directly in the Spark Streaming analytics pipeline.
Thus, incoming data can be scored against Spark MLlib models. With this architecture, real-time predictions can be made
against pre-trained machine learning models. Embedding machine learning infrastructure into real-time data feeds using
Spark Streaming allows the system to react quickly to new input, intelligently predicting with greater responsiveness and
accuracy than a traditional business intelligence based approach.
This integration simplifies usage of machine learning models directly in the streaming pipeline. It also improves
performance and latency as the data is processed close to the database.
Some example applications of this infrastructure include:
• Real-time correlation analysis from multiple fire alarm sensors to determine if particulate buildup is due to a fire or
  normal wear and tear, allowing for maintenance optimization.
• Prediction of drill head failure in oil drilling through real-time modeling of heat and fatigue measurements.
• Integration with workforce management data in real time to automatically route maintenance personnel while
  optimizing for job urgency and trip distance.
Hot path analytics with machine learning is where IoT users can extract the greatest amount of business value from their
IoT investment. DataStax Enterprise provides a concrete implementation of this type of IoT analytics infrastructure that
combines the resilience of Cassandra with multi-model analytics for a comprehensive operational analytics solution.
6. Batch Analytics
Batch analytics in Azure IoT can be provided by Azure HDInsight (HDI). HDI is ideal for cases where large amounts of data
must be analyzed with batch queries or even ad-hoc queries. Training machine learning models in batch (as opposed to
incremental real-time training) or building monthly roll-up reports are both use cases HDI is well suited for.
Figure 5: Machine Learning Lifecycle with Spark
In some cases, where the batch queries are well defined or the amount of data stored is smaller, it may make sense to use
DataStax Enterprise for both the hot path and batch analytics. This results in a lower total cost of ownership (TCO) as only
one database needs to be maintained as opposed to both DSE and HDI. However, this is balanced against some tradeoffs in
suitability for large scale batch queries.
Note that Spark MLlib models trained in HDI Spark as well as Spark in DataStax Enterprise can be exported and used in the
hot path for real-time analytics based on machine learning. This allows the buildout of a full machine learning lifecycle
within the IoT architecture as shown in Figure 5 above.
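The train, export and score steps of this lifecycle can be illustrated with a deliberately small stand-in. The sketch below fits a one-feature linear model with ordinary least squares, serializes it to JSON, and scores a new reading from the reloaded copy. It is plain Python rather than Spark MLlib, and the function names and sample data are hypothetical.

```python
import json


def train(samples):
    """Batch step: fit y = slope * x + intercept by ordinary least squares."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def export_model(model):
    """Export step: serialize the trained parameters for the hot path."""
    return json.dumps(model)


def load_model(payload):
    """Hot path start-up: restore the parameters produced by the batch job."""
    return json.loads(payload)


def score(model, x):
    """Real-time step: apply the pre-trained model to one incoming reading."""
    return model["slope"] * x + model["intercept"]


# Hypothetical historical readings: (operating hours, temperature).
history = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

payload = export_model(train(history))  # produced by the batch environment
hot_path_model = load_model(payload)    # consumed by the streaming environment
prediction = score(hot_path_model, 4.0)
```

The point of the separation is that training can run wherever the historical data lives, while scoring only needs the exported parameters.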
7. Field Gateway
A field gateway is a specialized device that acts as a communication enabler and as a local device control system and
device data processing hub. A field gateway can perform local processing and control functions for the devices. It can
also filter and aggregate the device telemetry, which reduces the amount of data transferred to the cloud back
end. Gateways may assist in device provisioning, data filtering, batching and aggregation, buffering of data, protocol
translation, and event processing.
[Figure 5 diagram. Stages: Define Model Features (SparkSQL, BI, etc.); Train Model (Batch Spark MLlib); Export Model to
Real-Time Analytics (Spark MLlib); Real-Time Model Scoring (Spark Streaming and MLlib); Evaluate Model Performance
(SparkSQL, BI, etc.).]
Figure 6: Data Flow in a stateful Field Gateway
Field gateways that embed DataStax Enterprise are stateful. If connectivity is intermittent, these gateways can operate as
store-and-forward databases, syncing to the cloud gateways when connectivity is available. This provides a mechanism for
storing data locally, even on gateways with constrained hardware, because DataStax Enterprise can run on devices
with minimal hardware resources. It also paves the way for field gateways with sufficiently powerful hardware to embed
advanced analytics for edge processing.
Field Gateways that embed DataStax Enterprise can persist state in a standalone instance of the database or in an edge
database using DataStax Enterprise Advanced Replication. In the Advanced Replication case, with additional integration
work, messages can be passed through the IoT Hub to the central DataStax Enterprise datacenters.
This synchronizes the database automatically, saving a user the tedious exercise of implementing that logic.
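The store-and-forward behavior described in this section can be illustrated with a small in-memory sketch. It shows the logic that a stateful gateway provides, not how DataStax Enterprise or Advanced Replication implements it. The class names, the `FakeLink` stand-in for the cloud connection and the buffer capacity are all hypothetical.

```python
from collections import deque


class StoreAndForwardBuffer:
    """Keep readings locally and forward them in order once the link is up."""

    def __init__(self, send, capacity=1000):
        self._send = send  # callable returning True when delivery succeeded
        # Bounded buffer: when full, the oldest reading is dropped on append.
        self._pending = deque(maxlen=capacity)

    def record(self, reading):
        self._pending.append(reading)

    def flush(self):
        """Forward buffered readings oldest first; stop at the first failure."""
        delivered = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break
            self._pending.popleft()
            delivered += 1
        return delivered

    def __len__(self):
        return len(self._pending)


class FakeLink:
    """Hypothetical stand-in for the connection to the cloud gateway."""

    def __init__(self):
        self.online = False
        self.received = []

    def send(self, reading):
        if not self.online:
            return False
        self.received.append(reading)
        return True


link = FakeLink()
buffer = StoreAndForwardBuffer(link.send)
buffer.record({"t": 1, "temp": 21.5})
buffer.record({"t": 2, "temp": 21.7})

sent_offline = buffer.flush()  # link is down, nothing leaves the gateway
link.online = True
sent_online = buffer.flush()   # link is back, buffered readings are forwarded
```

A reading is removed from the buffer only after the send succeeds, so a failure in the middle of a flush leaves the remaining readings in place for the next attempt.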
An example topology is shown in Figure 7.
Figure 7: Advanced Replication for Field Gateways with a DataStax Enterprise cluster
[Field gateway diagram labels: Low power IoT devices, IoT devices, Field Gateway, Protocol Adapter, Data Buffering, IoT Hub
Client, IoT Hub.]
7.1. Edge Processing
In many scenarios, especially those where devices communicate with their cloud backend systems via metered networks, it
is not desirable to send raw sensor readings or status information across the communication link to the cloud because of
the associated cost and load.
Some IoT solutions specifically require evaluating signal data streams, such as video and audio with particular signal
shapes and spectrums, by applying digital signal processing algorithms or pattern matching and discovery. These kinds of
signals must therefore be treated in a first-class fashion.
A sufficiently powerful field gateway can perform local processing, aggregation or encoding before data is transferred
over the network.
In some cases, resilience and increased processing power for edge processing are desired in the field gateway. In such a
case, it may be desirable to deploy a cluster of three or more nodes at the edge, as shown in Figure 8.
Figure 8: DataStax Edge Cluster and Central Cluster
By embedding DataStax Enterprise in a field gateway or even an edge device, hot path analytics can be provided to
devices with lower latency. Additionally, in scenarios with intermittent connectivity, analytics will continue to be performed
even when the network is down.
DataStax Advanced Replication allows gateways (or even devices) to store a subset of the entire database locally and
replicate particular information in a unidirectional way. There are two obvious use cases for unidirectional replication
here:
1. Telemetry Data may be aggregated on the gateway and then encoded. The raw bit stream would remain local to
the device and not be replicated anywhere. However, the encoded data would be passed unidirectionally to the
cloud based Device State Store on the backend.
2. Registry Data should be stored in the cloud based Device Registry Store. Some devices may want to maintain a
local copy of registered devices in their immediate area, for instance in the same building.
Storing data locally at the Field Gateway both reduces latency in the system and makes the system more resilient to
failure.
[Figure 8 diagram. Clusters: Edge Cluster, Central Cluster. Data flows: Telemetry Data, Operational Metadata, Command &
Control.]
For edge devices with limited hardware footprints, simple analytics such as moving averages and other aggregations are
possible. For devices with more performant hardware, it is even possible to embed Spark MLlib scoring at the edge of the
IoT network and push trained models from the cloud back end in Azure down to the edge.
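A moving average is the kind of simple aggregation that fits on constrained edge hardware, since it needs only a fixed-size window of recent values. The following is a minimal sketch in plain Python. The class name and window size are hypothetical.

```python
from collections import deque


class MovingAverage:
    """Average of the most recent `window` readings, updated in constant time."""

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be positive")
        self._values = deque(maxlen=window)
        self._total = 0.0

    def add(self, value):
        # When the window is full, the oldest value is about to be evicted,
        # so remove its contribution from the running total first.
        if len(self._values) == self._values.maxlen:
            self._total -= self._values[0]
        self._values.append(value)
        self._total += value
        return self._total / len(self._values)


smoother = MovingAverage(window=3)
averages = [smoother.add(v) for v in [10.0, 20.0, 30.0, 40.0]]
```

Only the smoothed values would need to be sent upstream, which is how local aggregation reduces the volume of data crossing a metered link.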
8. Geographical Replication
DataStax Enterprise, built on Apache Cassandra, is unique among databases. DSE provides the ability to deploy
geographically distributed databases. DSE clusters are made up of arbitrary numbers of datacenters with any number of
nodes in each datacenter. Each DSE data center (deployed in an Azure region) automatically synchronizes information
across the geo-distributed cluster. This provides two key benefits:
• Disaster Avoidance – All DSE data centers run in an active/active/active configuration. In the event an Azure
  service or data center fails, clients automatically fail over to a live data center.
• Data Locality – Data is available locally, wherever an application back end is deployed. This reduces access latency
  and ensures the application can scale horizontally to handle load from millions of devices.
Figure 9: DataStax Geo-Replication Architecture
For an implementation of the Azure IoT reference architecture, geographical redundancy can be leveraged in a number of
ways:
• Disaster Avoidance – IoT infrastructure can be deployed in an active/active state by leveraging DSE. This allows
  the application to continue operating even in the event of a regional failure. This capability is particularly powerful
  as DSE is deployed in an active/active scenario with any number of simultaneously active datacenters. The result is
  that there is no downtime during failover; instead, the application can immediately connect to a DSE node in an
  available region. This is a great way to take advantage of the large number of Azure regions. Even in the case
  where full active deployment is not required, geographical resilience can be leveraged to protect against potential
  data loss from local failures.
• Improved Performance – By deploying IoT solutions in multiple locations and data centers closer to IoT devices,
  latency to analyze and act on device data and sensor readings can be reduced. This gives an improved experience
  for users of the IoT system.
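Both benefits come down to one selection rule on the client side: prefer the closest data center that is currently live. The sketch below shows that rule in plain Python. It is not the mechanism used by the DataStax drivers, and the region names, latencies and health flags are hypothetical.

```python
def choose_data_center(latency_ms, is_live):
    """Return the lowest-latency data center that is currently live."""
    live = {dc: ms for dc, ms in latency_ms.items() if is_live.get(dc, False)}
    if not live:
        raise RuntimeError("no live data center available")
    return min(live, key=live.get)


# Hypothetical measured latencies from one application back end.
latency_ms = {"westeurope": 12, "northeurope": 25, "eastus": 95}
status = {"westeurope": True, "northeurope": True, "eastus": True}

primary = choose_data_center(latency_ms, status)

# Simulate a regional failure: the next-closest live region takes over.
status["westeurope"] = False
fallback = choose_data_center(latency_ms, status)
```

Because every data center is active, the fallback region can serve requests immediately rather than first being promoted from a standby role.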
9. Conclusion
Azure IoT is leading the industry with a componentized IoT reference architecture and pluggable services and
infrastructure that can be customized to meet any IoT need. DataStax Enterprise can be used to implement components of
the reference architecture in a geographically scalable and resilient way. Beyond that, DataStax Enterprise provides
analytics and graph database features that make its use as part of the Azure IoT reference architecture even more
compelling.
For more information, please contact konstantin.dotchkoff@microsoft.com, claudioc@microsoft.com or
ben.lackey@datastax.com.