FIWARE Tech Summit - FIWARE Cygnus and STH-Comet
1. FIWARE Big Data ecosystem : Cygnus and STH-Comet
Joaquin Salvachua
Andres Muñoz
Universidad Politécnica de Madrid
Joaquin.salvachua@upm.es, @jsalvachua, @FIWARE
www.slideshare.net/jsalvachua
7. Cygnus
• Persistence (collecting, aggregating and moving data) for later batch processing.
• Could be integrated into a lambda architecture
• Quite flexible and configurable: based on streaming data flows with a publish/subscribe-like communication model.
8. Cygnus
• What is it for?
– Cygnus is a connector in charge of persisting Orion context data in certain
configured third-party storages, creating a historical view of such data. In other
words, Orion only stores the last value of an entity's attribute; if an older
value is required, it has to be persisted in another storage, value by value,
using Cygnus.
• How does it receive context data from Orion Context Broker?
– Cygnus uses the subscription/notification feature of Orion. A subscription is made
in Orion on behalf of Cygnus, detailing which entities we want to be notified when
an update occurs on any of those entities' attributes.
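The subscription made on behalf of Cygnus can be sketched as an NGSIv1 `subscribeContext` payload; the entity, attribute and the Cygnus notification endpoint below are illustrative assumptions, not values from this deck:

```python
import json

def build_cygnus_subscription(entity_id, entity_type, attributes,
                              cygnus_url="http://localhost:5050/notify"):
    # NGSIv1-style subscription: Orion will POST a notification to the
    # Cygnus HTTP source each time one of the listed attributes changes.
    return {
        "entities": [{"type": entity_type, "isPattern": "false", "id": entity_id}],
        "attributes": attributes,
        "reference": cygnus_url,           # where notifications are delivered
        "duration": "P1M",                 # subscription lifetime
        "notifyConditions": [{"type": "ONCHANGE", "condValues": attributes}],
    }

payload = build_cygnus_subscription("Room1", "Room", ["temperature"])
print(json.dumps(payload, indent=2))
```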
10. Cygnus
• Cygnus is a connector in charge of persisting certain
sources of data in certain configured third-party
storages, creating a historical view of such data.
• Internally, Cygnus is based on Apache Flume and is structured as a set of
data collection and persistence agents.
– An agent is basically composed of a listener or source in charge of receiving the
data, a channel where the source puts the data once it has been transformed
into a Flume event, and a sink, which takes Flume events from the channel in
order to persist the data within its body into a third-party storage.
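The source–channel–sink anatomy above maps directly onto a Flume properties file; a minimal sketch, where the agent/channel names, the port and the sink class path are illustrative (the class follows the old es.tid.fiware.fiwareconnectors package used later in this deck and may differ per Cygnus release):

```properties
# Minimal Cygnus/Flume agent: one HTTP source, one channel, one HDFS sink
cygnusagent.sources = http-source
cygnusagent.channels = hdfs-channel
cygnusagent.sinks = hdfs-sink

# Source: receives Orion notifications and turns them into Flume events
cygnusagent.sources.http-source.type = org.apache.flume.source.http.HTTPSource
cygnusagent.sources.http-source.port = 5050
cygnusagent.sources.http-source.channels = hdfs-channel

# Channel: passive store where events wait until a sink consumes them
cygnusagent.channels.hdfs-channel.type = memory
cygnusagent.channels.hdfs-channel.capacity = 1000

# Sink: takes events from the channel and persists them in HDFS
cygnusagent.sinks.hdfs-sink.channel = hdfs-channel
cygnusagent.sinks.hdfs-sink.type = es.tid.fiware.fiwareconnectors.cygnus.sinks.OrionHDFSSink
```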
12. Data Sinks
• NGSI-like context data in:
– HDFS, the Hadoop distributed file system.
– MySQL, the well-known relational database manager.
– CKAN, an Open Data platform.
– MongoDB, the NoSQL document-oriented database.
– STH Comet, a Short-Term Historic database built on top of MongoDB.
– Kafka, the publish-subscribe messaging broker.
– DynamoDB, a cloud-based NoSQL database by Amazon Web Services.
– PostgreSQL, the well-known relational database manager.
– Carto, the database specialized in geolocated data.
• Twitter data in:
– HDFS, the Hadoop distributed file system.
13. Cygnus events
• A Source consumes Events having a specific format, and those Events are
delivered to the Source by an external source like a web server. For example,
an AvroSource can be used to receive Avro Events from clients or from other
Flume agents in the flow. When a Source receives an Event, it stores it into
one or more Channels. The Channel is a passive store that holds the Event
until that Event is consumed by a Sink. One type of Channel available in
Flume is the FileChannel which uses the local filesystem as its backing store.
A Sink is responsible for removing an Event from the Channel and putting it
into an external repository like HDFS (in the case of an HDFSEventSink) or
forwarding it to the Source at the next hop of the flow. The Source and Sink
within the given agent run asynchronously with the Events staged in the
Channel.
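The staging pattern described above can be illustrated with a toy sketch: a source thread puts events into a passive channel (a queue) while a sink thread independently drains them into a stand-in storage. This is purely illustrative, not Cygnus code:

```python
import queue
import threading

channel = queue.Queue()   # the passive store holding staged events
storage = []              # stand-in for a third-party storage such as HDFS

def source(events):
    for e in events:
        channel.put(e)    # the source stores each event into the channel
    channel.put(None)     # sentinel: no more events

def sink():
    while True:
        event = channel.get()   # the sink removes events from the channel
        if event is None:
            break
        storage.append(event)   # ...and persists them externally

t_src = threading.Thread(target=source, args=([{"temperature": "26.5"}],))
t_snk = threading.Thread(target=sink)
t_src.start(); t_snk.start()
t_src.join(); t_snk.join()
print(storage)
```

The source and sink never call each other directly; the channel decouples them, which is what lets them run asynchronously.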
16. Multiple Agents
• One instance for each
Agent.
• This adds more capacity to the system.
17. Connecting Orion Context Broker and
Cygnus
• Cygnus takes advantage of the subscription-notification mechanism
of Orion Context Broker. Specifically, Cygnus needs to be notified each
time certain entities' attributes change, and in order to do that, Cygnus
must subscribe to those attribute changes.
25. HDFS details regarding Cygnus persistence
• By default, for each entity Cygnus stores the data at:
– /user/<your_user>/<service>/<service-path>/<entity-id>-<entity-type>/<entity-id>-<entity-type>.txt
• Within each HDFS file, the data format may be json-row or json-column:
– json-row
{
  "recvTimeTs": "13453464536",
  "recvTime": "2014-02-27T14:46:21",
  "entityId": "Room1",
  "entityType": "Room",
  "attrName": "temperature",
  "attrType": "centigrade",
  "attrValue": "26.5",
  "attrMd": [
    …
  ]
}
– json-column
{
  "recvTime": "2014-02-27T14:46:21",
  "temperature": "26.5",
  "temperature_md": [
    …
  ],
  "pressure": "90",
  "pressure_md": [
    …
  ]
}
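Both formats carry the same information, one record per attribute versus one document per notification. A small sketch (a hypothetical helper, not part of Cygnus) that folds json-row records into one json-column document:

```python
def rows_to_column(rows):
    # Fold json-row records from the same notification into a single
    # json-column document keyed by attribute name.
    doc = {"recvTime": rows[0]["recvTime"]}
    for r in rows:
        doc[r["attrName"]] = r["attrValue"]
        doc[r["attrName"] + "_md"] = r.get("attrMd", [])
    return doc

rows = [
    {"recvTime": "2014-02-27T14:46:21", "attrName": "temperature",
     "attrValue": "26.5", "attrMd": []},
    {"recvTime": "2014-02-27T14:46:21", "attrName": "pressure",
     "attrValue": "90", "attrMd": []},
]
print(rows_to_column(rows))
```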
26. High Availability
• Simple configuration:
– implementing HA for Flume/Cygnus is as easy as running two
instances of the software and putting a load balancer in
between them and the data source (or sources).
• Use File Channels instead of the default Memory Channels, for extra
persistence.
• Advanced configuration:
– Flume with Zookeeper
• https://github.com/telefonicaid/fiware-cygnus/blob/master/doc/cygnus-ngsi/installation_and_administration_guide/reliability.md
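Switching to file channels is a per-channel setting; a minimal sketch, where the channel name and directories are illustrative:

```properties
# File channel: staged events survive an agent restart,
# unlike the default in-memory channel
cygnusagent.channels.hdfs-channel.type = file
cygnusagent.channels.hdfs-channel.checkpointDir = /var/lib/cygnus/file-channel/checkpoint
cygnusagent.channels.hdfs-channel.dataDirs = /var/lib/cygnus/file-channel/data
```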
34. Data schemas and pre-aggregation
• Although the STH stores the evolution of raw data (i.e., attribute
values) over time, its real power comes from the storage of aggregated
data.
• The STH should be able to respond to queries such as:
– Give me the maximum temperature of this room during the last month
(range) aggregated by day (resolution)
– Give me the mean temperature of this room today (range) aggregated by
hour or even minute (resolution)
– Give me the standard deviation of the temperature of this room this last
year (range) aggregated by day (resolution)
– Give me the number of times the air conditioner of this room was switched
on or off last Monday (range) aggregated by hour
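The range/resolution idea behind these queries can be sketched in a few lines: raw (timestamp, value) samples rolled up at "day" resolution into max, mean and standard deviation. The sample values are made up for illustration:

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev

# Raw attribute samples: (receive time, temperature)
samples = [
    ("2014-02-27T10:00:00", 24.0),
    ("2014-02-27T16:00:00", 28.0),
    ("2014-02-28T12:00:00", 26.0),
]

# Group samples by day (the resolution)
by_day = defaultdict(list)
for ts, value in samples:
    day = datetime.fromisoformat(ts).date().isoformat()
    by_day[day].append(value)

# Pre-compute the aggregates once, so queries over a range of days
# never have to touch the raw samples again
aggregates = {
    day: {"max": max(vals), "mean": mean(vals), "stddev": pstdev(vals)}
    for day, vals in by_day.items()
}
print(aggregates["2014-02-27"])   # {'max': 28.0, 'mean': 26.0, 'stddev': 2.0}
```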
50. Extra documentation
• The per agent Quick Start Guide found at readthedocs.org provides a good
documentation summary (cygnus-ngsi, cygnus-twitter).
• Nevertheless, both the Installation and Administration Guide and the User and
Programmer Guide for each agent, also found at readthedocs.org, cover more advanced
topics.
• The per agent Flume Extensions Catalogue completes the available documentation for
Cygnus (cygnus-ngsi, cygnus-twitter).
• Other interesting links are:
• Our Apiary Documentation if you want to know how to use our API methods for
Cygnus.
• cygnus-ngsi integration examples.
• cygnus-ngsi introductory course in FIWARE Academy.
51. Round Robin channel selection
• It is possible to configure more than one channel-sink pair for each
storage, in order to increase performance
• A custom ChannelSelector is needed
• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/operation/performance_tuning_tips.md
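Wiring two channel-sink pairs to the same storage looks like the sketch below; the names are illustrative and the selector class path is left as a placeholder, since the exact custom ChannelSelector class is given in the performance tuning document linked above:

```properties
# Two channel-sink pairs for the same HDFS storage, fed round-robin
cygnusagent.sources.http-source.channels = hdfs-channel-1 hdfs-channel-2
cygnusagent.sources.http-source.selector.type = <custom_round_robin_selector_class>
cygnusagent.sinks.hdfs-sink-1.channel = hdfs-channel-1
cygnusagent.sinks.hdfs-sink-2.channel = hdfs-channel-2
```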
53. Pattern-based Context Data Grouping
• The default destination (HDFS file, MySQL table or CKAN resource) is obtained as a
concatenation:
– destination=<entity_id>-<entityType>
• It is possible to group different context data thanks to this regex-based feature
implemented as a Flume interceptor:
cygnusagent.sources.http-source.interceptors = ts de
cygnusagent.sources.http-source.interceptors.ts.type = timestamp
cygnusagent.sources.http-source.interceptors.de.type = es.tid.fiware.fiwareconnectors.cygnus.interceptors.DestinationExtractor$Builder
cygnusagent.sources.http-source.interceptors.de.matching_table = /usr/cygnus/conf/matching_table.conf
54. Matching table for pattern-based grouping
• CSV file (‘|’ field separator) containing rules
– <id>|<comma-separated_fields>|<regex>|<destination>|<destination_dataset>
• For instance:
1|entityId,entityType|Room\.(\d*)Room|numeric_rooms|rooms
2|entityId,entityType|Room\.(\D*)Room|character_rooms|rooms
3|entityType,entityId|RoomRoom\.(\D*)|character_rooms|rooms
4|entityType|Room|other_rooms|rooms
• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/design/interceptors.md#destinationextractor-interceptor
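The rule semantics can be sketched as follows: concatenate the rule's fields from the notified context element, match the result against the rule's regex, and use the rule's destination on the first hit. This is a hypothetical helper for illustration, not Cygnus code:

```python
import re

def apply_rules(rules, context):
    # Each rule: (fields to concatenate, regex, destination, dataset)
    for fields, regex, destination, dataset in rules:
        concatenated = "".join(context[f] for f in fields)
        if re.search(regex, concatenated):
            return destination, dataset
    # No rule matched: fall back to the default <entity_id>-<entityType>
    return context["entityId"] + "-" + context["entityType"], None

rules = [
    (["entityId", "entityType"], r"Room\.(\d*)Room", "numeric_rooms", "rooms"),
    (["entityId", "entityType"], r"Room\.(\D*)Room", "character_rooms", "rooms"),
]

# "Room.12" + "Room" -> "Room.12Room" matches the numeric rule
print(apply_rules(rules, {"entityId": "Room.12", "entityType": "Room"}))
```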
55. Kerberos authentication
• HDFS may be secured with Kerberos for authentication purposes
• Cygnus is able to persist on kerberized HDFS if the configured HDFS user has a
registered Kerberos principal and this configuration is added:
cygnusagent.sinks.hdfs-sink.krb5_auth = true
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_user = krb5_username
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_password = xxxxxxxxxxxx
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_login_file = /usr/cygnus/conf/krb5_login.conf
cygnusagent.sinks.hdfs-sink.krb5_auth.krb5_conf_file = /usr/cygnus/conf/krb5.conf
• https://github.com/telefonicaid/fiware-connectors/blob/master/flume/doc/operation/hdfs_kerberos_authentication.md