FIWARE Tech Summit - FIWARE Cygnus and STH-Comet
1. FIWARE Big Data ecosystem : Cygnus and STH-Comet
Joaquin Salvachua
Andres Muñoz
Universidad Politécnica de Madrid
Joaquin.salvachua@upm.es, @jsalvachua, @FIWARE
www.slideshare.net/jsalvachua
7. Cygnus
• Persistence (collecting, aggregating and moving data) for later batch processing.
• Could be integrated into a lambda architecture
• Quite flexible and configurable: based on streaming data flows with a publish/subscribe-like communication model.
8. Cygnus
• What is it for?
– Cygnus is a connector in charge of persisting Orion context data in certain
configured third-party storages, creating a historical view of such data. In other
words, Orion only stores the last value of an entity's attribute; if an older
value is required, it has to be persisted in another storage, value by value,
using Cygnus.
• How does it receive context data from Orion Context Broker?
– Cygnus uses the subscription/notification feature of Orion. A subscription is made
in Orion on behalf of Cygnus, detailing which entities we want to be notified when
an update occurs on any of those entities' attributes.
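The subscription made on behalf of Cygnus can be sketched as an NGSIv1 `subscribeContext` payload; the entity, attribute and the Cygnus notification endpoint below are illustrative assumptions, not values from this deck:

```python
import json

def build_cygnus_subscription(entity_id, entity_type, attributes,
                              cygnus_url="http://localhost:5050/notify"):
    # NGSIv1-style subscription: Orion will POST a notification to the
    # Cygnus HTTP source each time one of the listed attributes changes.
    return {
        "entities": [{"type": entity_type, "isPattern": "false", "id": entity_id}],
        "attributes": attributes,
        "reference": cygnus_url,           # where notifications are delivered
        "duration": "P1M",                 # subscription lifetime
        "notifyConditions": [{"type": "ONCHANGE", "condValues": attributes}],
    }

payload = build_cygnus_subscription("Room1", "Room", ["temperature"])
print(json.dumps(payload, indent=2))
```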
10. Cygnus
• Cygnus is a connector in charge of persisting certain
sources of data in certain configured third-party
storages, creating a historical view of such data.
• Internally, Cygnus is based on Apache Flume and is structured as a set of
data collection and persistence agents.
– An agent is basically composed of a listener or source in charge of receiving the
data, a channel where the source puts the data once it has been transformed
into a Flume event, and a sink, which takes Flume events from the channel in
order to persist the data within its body into a third-party storage.
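The source–channel–sink anatomy above maps directly onto a Flume properties file; a minimal sketch, where the agent/channel names, the port and the sink class path are illustrative (the class follows the old es.tid.fiware.fiwareconnectors package used later in this deck and may differ per Cygnus release):

```properties
# Minimal Cygnus/Flume agent: one HTTP source, one channel, one HDFS sink
cygnusagent.sources = http-source
cygnusagent.channels = hdfs-channel
cygnusagent.sinks = hdfs-sink

# Source: receives Orion notifications and turns them into Flume events
cygnusagent.sources.http-source.type = org.apache.flume.source.http.HTTPSource
cygnusagent.sources.http-source.port = 5050
cygnusagent.sources.http-source.channels = hdfs-channel

# Channel: passive store where events wait until a sink consumes them
cygnusagent.channels.hdfs-channel.type = memory
cygnusagent.channels.hdfs-channel.capacity = 1000

# Sink: takes events from the channel and persists them in HDFS
cygnusagent.sinks.hdfs-sink.channel = hdfs-channel
cygnusagent.sinks.hdfs-sink.type = es.tid.fiware.fiwareconnectors.cygnus.sinks.OrionHDFSSink
```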
12. Data Sinks
• NGSI-like context data in:
– HDFS, the Hadoop distributed file system.
– MySQL, the well-known relational database manager.
– CKAN, an Open Data platform.
– MongoDB, the NoSQL document-oriented database.
– STH Comet, a Short-Term Historic database built on top of MongoDB.
– Kafka, the publish-subscribe messaging broker.
– DynamoDB, a cloud-based NoSQL database by Amazon Web Services.
– PostgreSQL, the well-known relational database manager.
– Carto, the database specialized in geolocated data.
• Twitter data in:
– HDFS, the Hadoop distributed file system.
13. Cygnus events
• A Source consumes Events having a specific format, and those Events are
delivered to the Source by an external source like a web server. For example,
an AvroSource can be used to receive Avro Events from clients or from other
Flume agents in the flow. When a Source receives an Event, it stores it into
one or more Channels. The Channel is a passive store that holds the Event
until that Event is consumed by a Sink. One type of Channel available in
Flume is the FileChannel which uses the local filesystem as its backing store.
A Sink is responsible for removing an Event from the Channel and putting it
into an external repository like HDFS (in the case of an HDFSEventSink) or
forwarding it to the Source at the next hop of the flow. The Source and Sink
within the given agent run asynchronously with the Events staged in the
Channel.
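The staging pattern described above can be illustrated with a toy sketch: a source thread puts events into a passive channel (a queue) while a sink thread independently drains them into a stand-in storage. This is purely illustrative, not Cygnus code:

```python
import queue
import threading

channel = queue.Queue()   # the passive store holding staged events
storage = []              # stand-in for a third-party storage such as HDFS

def source(events):
    for e in events:
        channel.put(e)    # the source stores each event into the channel
    channel.put(None)     # sentinel: no more events

def sink():
    while True:
        event = channel.get()   # the sink removes events from the channel
        if event is None:
            break
        storage.append(event)   # ...and persists them externally

t_src = threading.Thread(target=source, args=([{"temperature": "26.5"}],))
t_snk = threading.Thread(target=sink)
t_src.start(); t_snk.start()
t_src.join(); t_snk.join()
print(storage)
```

The source and sink never call each other directly; the channel decouples them, which is what lets them run asynchronously.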
16. Multiple Agents
• One instance for each
Agent.
• This adds more capacity to the system.
17. Connecting Orion Context Broker and
Cygnus
• Cygnus takes advantage of the subscription-notification mechanism
of Orion Context Broker. Specifically, Cygnus needs to be notified each
time certain entities' attributes change, and in order to do that, Cygnus
must subscribe to those attribute changes.
25. HDFS details regarding Cygnus persistence
• By default, for each entity Cygnus stores the data at:
– /user/<your_user>/<service>/<service-path>/<entity-id>-<entity-type>/<entity-id>-<entity-type>.txt
• Within each HDFS file, the data format may be json-row or json-column:
– json-row
{
  "recvTimeTs": "13453464536",
  "recvTime": "2014-02-27T14:46:21",
  "entityId": "Room1",
  "entityType": "Room",
  "attrName": "temperature",
  "attrType": "centigrade",
  "attrValue": "26.5",
  "attrMd": [
    …
  ]
}
– json-column
{
  "recvTime": "2014-02-27T14:46:21",
  "temperature": "26.5",
  "temperature_md": [
    …
  ],
  "pressure": "90",
  "pressure_md": [
    …
  ]
}
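Both formats carry the same information, one record per attribute versus one document per notification. A small sketch (a hypothetical helper, not part of Cygnus) that folds json-row records into one json-column document:

```python
def rows_to_column(rows):
    # Fold json-row records from the same notification into a single
    # json-column document keyed by attribute name.
    doc = {"recvTime": rows[0]["recvTime"]}
    for r in rows:
        doc[r["attrName"]] = r["attrValue"]
        doc[r["attrName"] + "_md"] = r.get("attrMd", [])
    return doc

rows = [
    {"recvTime": "2014-02-27T14:46:21", "attrName": "temperature",
     "attrValue": "26.5", "attrMd": []},
    {"recvTime": "2014-02-27T14:46:21", "attrName": "pressure",
     "attrValue": "90", "attrMd": []},
]
print(rows_to_column(rows))
```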
26. High Availability
• Simple configuration:
– implementing HA for Flume/Cygnus is as easy as running two
instances of the software and putting a load balancer in
between them and the data source (or sources).
• Use File Channels instead of the default Memory Channels, for extra
persistence.
• Advanced configuration:
– Flume with Zookeeper
• https://github.com/telefonicaid/fiware-cygnus/blob/master/doc/cygnus-ngsi/installation_and_administration_guide/reliability.md
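Switching to file channels is a per-channel setting; a minimal sketch, where the channel name and directories are illustrative:

```properties
# File channel: staged events survive an agent restart,
# unlike the default in-memory channel
cygnusagent.channels.hdfs-channel.type = file
cygnusagent.channels.hdfs-channel.checkpointDir = /var/lib/cygnus/file-channel/checkpoint
cygnusagent.channels.hdfs-channel.dataDirs = /var/lib/cygnus/file-channel/data
```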
34. Data schemas and pre-aggregation
• Although the STH stores the evolution of raw data (i.e., attribute
values) over time, its real power comes from the storage of aggregated
data.
• The STH should be able to respond to queries such as:
– Give me the maximum temperature of this room during the last month
(range) aggregated by day (resolution)
– Give me the mean temperature of this room today (range) aggregated by
hour or even minute (resolution)
– Give me the standard deviation of the temperature of this room this last
year (range) aggregated by day (resolution)
– Give me the number of times the air conditioner of this room was switched
on or off last Monday (range) aggregated by hour
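The range/resolution idea behind these queries can be sketched in a few lines: raw (timestamp, value) samples rolled up at "day" resolution into max, mean and standard deviation. The sample values are made up for illustration:

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev

# Raw attribute samples: (receive time, temperature)
samples = [
    ("2014-02-27T10:00:00", 24.0),
    ("2014-02-27T16:00:00", 28.0),
    ("2014-02-28T12:00:00", 26.0),
]

# Group samples by day (the resolution)
by_day = defaultdict(list)
for ts, value in samples:
    day = datetime.fromisoformat(ts).date().isoformat()
    by_day[day].append(value)

# Pre-compute the aggregates once, so queries over a range of days
# never have to touch the raw samples again
aggregates = {
    day: {"max": max(vals), "mean": mean(vals), "stddev": pstdev(vals)}
    for day, vals in by_day.items()
}
print(aggregates["2014-02-27"])   # {'max': 28.0, 'mean': 26.0, 'stddev': 2.0}
```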
50. Extra documentation
• The per agent Quick Start Guide found at readthedocs.org provides a good
documentation summary (cygnus-ngsi, cygnus-twitter).
• Nevertheless, both the Installation and Administration Guide and the User and
Programmer Guide for each agent, also found at readthedocs.org, cover more advanced
topics.
• The per agent Flume Extensions Catalogue completes the available documentation for
Cygnus (cygnus-ngsi, cygnus-twitter).
• Other interesting links are:
• Our Apiary Documentation if you want to know how to use our API methods for
Cygnus.
• cygnus-ngsi integration examples.
• cygnus-ngsi introductory course in FIWARE Academy.
51. Round Robin channel selection
• It is possible to configure more than one channel-sink pair for each
storage, in order to increase performance
• A custom ChannelSelector is needed
• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/operation/performance_tuning_tips.md
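Wiring two channel-sink pairs to the same storage looks like the sketch below; the names are illustrative and the selector class path is left as a placeholder, since the exact custom ChannelSelector class is given in the performance tuning document linked above:

```properties
# Two channel-sink pairs for the same HDFS storage, fed round-robin
cygnusagent.sources.http-source.channels = hdfs-channel-1 hdfs-channel-2
cygnusagent.sources.http-source.selector.type = <custom_round_robin_selector_class>
cygnusagent.sinks.hdfs-sink-1.channel = hdfs-channel-1
cygnusagent.sinks.hdfs-sink-2.channel = hdfs-channel-2
```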
53. Pattern-based Context Data Grouping
• The default destination (HDFS file, MySQL table or CKAN resource) is obtained as a
concatenation:
– destination=<entity_id>-<entityType>
• It is possible to group different context data thanks to this regex-based feature
implemented as a Flume interceptor:
cygnusagent.sources.http-source.interceptors = ts de
cygnusagent.sources.http-source.interceptors.ts.type = timestamp
cygnusagent.sources.http-source.interceptors.de.type = es.tid.fiware.fiwareconnectors.cygnus.interceptors.DestinationExtractor$Builder
cygnusagent.sources.http-source.interceptors.de.matching_table = /usr/cygnus/conf/matching_table.conf
54. Matching table for pattern-based grouping
• CSV file (‘|’ field separator) containing rules
– <id>|<comma-separated_fields>|<regex>|<destination>|<destination_dataset>
• For instance:
1|entityId,entityType|Room\.(\d*)Room|numeric_rooms|rooms
2|entityId,entityType|Room\.(\D*)Room|character_rooms|rooms
3|entityType,entityId|RoomRoom\.(\D*)|character_rooms|rooms
4|entityType|Room|other_rooms|rooms
• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/design/interceptors.md#destinationextractor-interceptor
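The rule semantics can be sketched as follows: concatenate the rule's fields from the notified context element, match the result against the rule's regex, and use the rule's destination on the first hit. This is a hypothetical helper for illustration, not Cygnus code:

```python
import re

def apply_rules(rules, context):
    # Each rule: (fields to concatenate, regex, destination, dataset)
    for fields, regex, destination, dataset in rules:
        concatenated = "".join(context[f] for f in fields)
        if re.search(regex, concatenated):
            return destination, dataset
    # No rule matched: fall back to the default <entity_id>-<entityType>
    return context["entityId"] + "-" + context["entityType"], None

rules = [
    (["entityId", "entityType"], r"Room\.(\d*)Room", "numeric_rooms", "rooms"),
    (["entityId", "entityType"], r"Room\.(\D*)Room", "character_rooms", "rooms"),
]

# "Room.12" + "Room" -> "Room.12Room" matches the numeric rule
print(apply_rules(rules, {"entityId": "Room.12", "entityType": "Room"}))
```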
55. Kerberos authentication
• HDFS may be secured with Kerberos for authentication purposes
• Cygnus is able to persist on kerberized HDFS if the configured HDFS user has a
registered Kerberos principal and this configuration is added:
cygnusagent.sinks.hdfs-sink.krb5_auth = true
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_user = krb5_username
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_password = xxxxxxxxxxxx
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_login_file = /usr/cygnus/conf/krb5_login.conf
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_conf_file = /usr/cygnus/conf/krb5.conf
• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/operation/hdfs_kerberos_authentication.md