Many Big Data and IoT use cases are based on combining data from multiple data sources and making it available on a Big Data platform for analysis. The data sources are often very heterogeneous, ranging from simple files and databases to high-volume event streams from sensors (IoT devices). It's important to retrieve this data in a secure and reliable manner and to integrate it with the Big Data platform so that it is available for analysis in real time (stream processing) as well as in batch (typical Big Data processing). In the past few years, new tools have emerged that are especially suited to handling the process of integrating data from outside, often called Data Ingestion. From an outside perspective they are very similar to traditional Enterprise Service Bus infrastructures, which larger organizations often use to handle message-driven and service-oriented systems. But there are important differences: they are typically easier to scale horizontally, offer a more distributed setup, can handle high volumes of data/messages, provide very detailed monitoring down to the message level, and integrate very well with the Hadoop ecosystem. This session presents and compares Apache Flume, Apache NiFi, StreamSets and the Kafka ecosystem, and shows how they handle data ingestion in a Big Data solution architecture.
Reliable Data Ingestion in Big Data/IoT
1. BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA
HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH
Reliable Data Ingestion in Big Data/IoT
Guido Schmutz
@gschmutz
2. Guido Schmutz
Working for Trivadis for more than 19 years
Oracle ACE Director for Fusion Middleware and SOA
Co-Author of different books
Consultant, Trainer, Software Architect for Java, SOA & Big Data / Fast Data
Member of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 25 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
3. Our company
Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services in Switzerland, Germany, Austria and Denmark. We offer our services in the following strategic business fields:
OPERATION – Trivadis Services takes over the operation of your IT systems.
7. Big Data Definition (4 Vs)
Characteristics of Big Data: its Volume, Velocity and Variety in combination
+ Time to action? – Big Data + Real-Time = Stream Processing
8. Ever increasing volume and velocity – the Internet of Things (IoT) wave
Internet of Things (IoT): enabling communication between devices, people & processes to exchange useful information & knowledge that create value for humans
The term was first proposed by Kevin Ashton in 1999
Source: The Economist
Source: Ericsson, June 2016
9. What is Data Ingestion?
• Acquiring data as it is produced from data source(s)
• Transforming it into a consumable form
• Delivering the transformed data to the consuming system(s)
The challenge: doing this continuously and at scale across a wide variety of sources and consuming systems
Ingress and egress are two other terms referring to data movement into and out of a system
10. Lambda Architecture for Big Data
[Diagram: data sources (Location, Social, Clickstream, Sensor Data, Billing & Ordering, CRM / Profile, Marketing Campaigns, Call Center, Mobile Apps; Weather Data via SQL import) feed Event Hubs in front of a Hadoop cluster. The batch layer uses a Distributed Filesystem, Parallel Processing and NoSQL; the streaming layer uses Stream Analytics with Reference / Models and NoSQL. Results are delivered via SQL, Search, Dashboards, BI Tools, the Enterprise Data Warehouse and Online & Mobile Apps.]
11. [Same Lambda Architecture diagram as before, with the ingestion path split into three phases: Integrate – Sanitize / Normalize – Deliver]
12. Continuous Ingestion – DataFlow Pipelines
[Diagram: IoT sensors reach the Event Hub natively, via an IoT GW (MQTT broker) or via a Dataflow GW; DB sources are integrated through log-based CDC and a CDC GW (Connect); file sources (logs) are tailed by a Dataflow GW; social sources arrive natively or via REST; a Messaging GW bridges existing queues. All paths end in Event Hub topics, from which Stream Processing and the Big Data platform consume.]
13. DataFlow Pipeline
• Flow-based "programming"
• Ingest data from various sources
• Extract – Transform – Load
• High-throughput, straight-through data flows
• Data Lineage
• Batch or stream processing
• Visual coding with a flow editor
• Event Stream Processing (ESP), but not Complex Event Processing (CEP)
Source: Confluent
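The flow-based, straight-through idea can be sketched as a chain of generator stages. This is illustrative Python only, not the API of any of the tools discussed; real dataflow tools provide these stages as configurable processors:

```python
# Minimal sketch of a flow-based, straight-through dataflow pipeline:
# each stage consumes records from the previous one and yields onward.

def source(lines):
    """Ingest: yield raw records from some input."""
    for line in lines:
        yield line

def transform(records):
    """Transform: normalize each record."""
    for r in records:
        yield r.strip().lower()

def sink(records):
    """Load: deliver records to the consuming system (here: a list)."""
    return list(records)

raw = ["  Sensor-1,23.5\n", "  Sensor-2,19.0\n"]
result = sink(transform(source(raw)))
print(result)  # ['sensor-1,23.5', 'sensor-2,19.0']
```

Because the stages are generators, records stream through one at a time instead of being materialized between steps, which is the "straight-through" property the slide refers to.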
14. Continuous Ingestion – Integrating data sources
• SQL Polling
• Change Data Capture (CDC)
• File Stream (File Tailing)
• File Stream (Appender)
• Sensor Stream
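SQL polling, the first of these patterns, can be sketched with an incrementing key column. This is an illustrative Python/SQLite sketch under assumed table and column names; real deployments poll an RDBMS on a schedule, and CDC avoids polling altogether by reading the transaction log:

```python
# Sketch of SQL polling with an incrementing key: each poll fetches only
# the rows added since the previously remembered offset.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders (item) VALUES (?)",
                 [("disk",), ("cpu",)])

def poll(conn, last_seen):
    """Fetch rows with id greater than the last offset seen."""
    rows = conn.execute(
        "SELECT id, item FROM orders WHERE id > ? ORDER BY id",
        (last_seen,)).fetchall()
    new_offset = rows[-1][0] if rows else last_seen
    return rows, new_offset

last_seen = 0
rows, last_seen = poll(conn, last_seen)
print(rows, last_seen)   # [(1, 'disk'), (2, 'cpu')] 2

conn.execute("INSERT INTO orders (item) VALUES ('ram')")
rows, last_seen = poll(conn, last_seen)
print(rows, last_seen)   # [(3, 'ram')] 3
```

Note the limitation this sketch shares with real SQL polling: updates and deletes to already-polled rows are not detected, which is one reason CDC is often preferred.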
15. Ingestion with/without Transformation?
Zero Transformation
• No transformation, plain ingest, no schema validation
• Keep the original format – text, CSV, …
• Allows storing data that may have schema errors
Format Transformation
• Better named "Format Translation"
• Simply changes the format, e.g. from text to Avro
• Performs schema validation
Enrichment Transformation
• Adds new data to the message
• Does not change existing values
• Converts a value from one system to another and adds it to the message
Value Transformation
• Replaces values in the message
• Converts a value from one system to another and changes the value in place
• Destroys the raw data!
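The difference between enrichment and value transformation can be sketched in a few lines of Python; the country-code lookup is a made-up example, not from the talk:

```python
# Sketch contrasting enrichment vs. value transformation on a message.
COUNTRY_NAMES = {"CH": "Switzerland", "DE": "Germany"}

def enrich(msg):
    """Enrichment: add a derived field, keep the raw value intact."""
    out = dict(msg)
    out["country_name"] = COUNTRY_NAMES.get(msg["country"], "unknown")
    return out

def value_transform(msg):
    """Value transformation: replace the raw value in place --
    the original code is lost afterwards."""
    out = dict(msg)
    out["country"] = COUNTRY_NAMES.get(msg["country"], "unknown")
    return out

msg = {"id": 1, "country": "CH"}
print(enrich(msg))           # {'id': 1, 'country': 'CH', 'country_name': 'Switzerland'}
print(value_transform(msg))  # {'id': 1, 'country': 'Switzerland'}
```

The enriched message still carries the raw code "CH", so downstream consumers can reprocess it; the value-transformed message cannot be restored, which is what "destroys the raw data" means.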
17. Why is Data Ingestion Difficult?
Physical and logical infrastructure changes rapidly
• Key challenges: Infrastructure Automation, Edge Deployment, Infrastructure Drift
Data structures and formats evolve and change unexpectedly
• Key challenges: Consumption Readiness, Corruption and Loss, Structure Drift
Data semantics change with evolving applications
• Key challenges: Timely Intervention, System Consistency, Semantic Drift
Source: StreamSets
18. Challenges for Ingesting Sensor Data
Multitude of sensors
Real-Time Streaming
Multiple Firmware versions
Bad Data from damaged sensors
Regulatory Constraints
Data Quality
Source: Cloudera
19. Key Elements of Data Ingestion
• Idempotence
• Batching (Bulk)
• Data Transformation
• Compression
• Availability and Recoverability
• Reliable Data Transfer and Data Validation
• Resource Consumption
• Performance
• Monitoring
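Idempotence, the first element above, can be sketched in a few lines: because transports often guarantee only at-least-once delivery, the ingestion layer deduplicates on a message key so that redeliveries have no effect. A minimal in-memory sketch (the key scheme is an assumption for illustration):

```python
# Sketch of idempotent ingestion: re-delivered messages are detected
# via a message key and applied only once.
seen_keys = set()
store = []

def ingest(msg):
    """Apply a message only if its key has not been seen before."""
    if msg["key"] in seen_keys:
        return False           # duplicate -> no effect
    seen_keys.add(msg["key"])
    store.append(msg["value"])
    return True

ingest({"key": "m1", "value": 42})
ingest({"key": "m1", "value": 42})  # redelivery, ignored
ingest({"key": "m2", "value": 7})
print(store)  # [42, 7]
```

In a real system the seen-key set would live in durable storage (or the sink itself would be keyed, e.g. an upsert), but the property is the same: applying the same message twice leaves the same state as applying it once.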
21. How to implement an Event Hub? Apache Kafka to the rescue
• Distributed publish-subscribe messaging system
• Designed for processing high-volume, real-time activity stream data (logs, metrics, social media, …)
• Stateless (passive) architecture, offset-based consumption
• Provides topics, but does not implement the JMS standard
• Initially developed at LinkedIn, now part of Apache
• Peak load on a single cluster: 2 million messages/sec, 4.7 Gigabits/sec inbound, 15 Gigabits/sec outbound
[Diagram: Producers → Kafka Cluster → Consumers]
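The offset-based consumption model can be illustrated with a small in-memory sketch. This is not the Kafka API; it only shows the idea that the broker keeps an append-only log while each consumer owns and advances its own offset, which is what keeps the broker passive with respect to consumer progress:

```python
# In-memory sketch of offset-based consumption over an append-only log.
log = []                       # one topic partition

def produce(msg):
    log.append(msg)

def consume(offset, max_records=10):
    """Read from a consumer-owned offset; return records + new offset."""
    records = log[offset:offset + max_records]
    return records, offset + len(records)

produce("temp=21.5")
produce("temp=22.0")

offset_a = 0
records, offset_a = consume(offset_a)
print(records, offset_a)       # ['temp=21.5', 'temp=22.0'] 2

produce("temp=23.1")
records, offset_a = consume(offset_a)
print(records, offset_a)       # ['temp=23.1'] 3
```

A second consumer with its own offset could independently re-read the log from 0, which is why Kafka works for both streaming and batch consumers over the same data.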
23. Apache Flume
• Distributed data collection service
• Gets flows of data (like logs) from their source
• Aggregates them to where they have to be processed
• Sources: files, syslog, Avro, …
• Sinks: HDFS files, HBase, …
Source: Flume Documentation
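Flume is configured declaratively as source → channel → sink. A minimal sketch of an agent configuration that tails a log file into HDFS via a memory channel; the agent/component names (a1, r1, c1, k1) and all paths are illustrative placeholders:

```properties
# Hypothetical Flume agent: exec source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

The memory channel trades durability for speed; a file channel would survive an agent crash at the cost of throughput.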
24. Apache Sqoop
• Sqoop exchanges data between an RDBMS and Hadoop
• It can import all tables, a single table, or a portion of a table into HDFS
• Does this very efficiently via a map-only MapReduce job
• Result is a directory in HDFS containing comma-delimited text
• Sqoop can also export data from HDFS back to the database

$ sqoop import --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table customers
25. Oracle GoldenGate
• Provides a low-impact change data capture solution for Oracle and non-Oracle RDBMS
• Non-intrusive
• Low latency
• Open, modular architecture
• Supports heterogeneous systems
• Oracle GoldenGate for Big Data provides Hadoop and Kafka support
26. Apache Kafka Connect
• A tool for scalably and reliably streaming data between Apache Kafka and other data systems
• Is not an ETL framework
• Pre-built connectors available for data sources and data sinks
• JDBC (Source)
• Oracle GoldenGate (Source)
• MQTT (Source)
• HDFS (Sink)
• Elasticsearch (Sink)
• MongoDB (Sink)
• Cassandra (Source & Sink)
Source: Confluent
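Connectors are configured declaratively rather than coded. As a hedged sketch, a JDBC source using Confluent's JDBC connector in incrementing mode might look like the following; the connection details, table, column and topic names are all placeholders:

```json
{
  "name": "jdbc-orders-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/company",
    "connection.user": "ingest",
    "connection.password": "secret",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "table.whitelist": "orders",
    "topic.prefix": "db-",
    "tasks.max": "1"
  }
}
```

In this mode the connector effectively performs the SQL-polling pattern from slide 14 (new rows detected via the incrementing id column), while Kafka Connect itself handles offsets, scaling and fault tolerance.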
27. Apache NiFi & MiNiFi
• Originated at the NSA as "Niagarafiles"
• Open sourced December 2014, Apache TLP July 2015
• Opaque, file-oriented payload
• Distributed system of processors with centralized control
• Based on flow-based programming concepts
• Data Provenance
• Web-based user interface
• Apache MiNiFi focuses on the collection of data at the source of its creation
28. StreamSets Data Collector
• Founded by ex-Cloudera and ex-Informatica employees
• Continuous open source, intent-driven big data ingest
• Visible, record-oriented approach fixes the combinatorial explosion
• Batch or stream processing
• Standalone, Spark cluster, MapReduce cluster
• IDE for pipeline development by 'civilians'
• Relatively new – first public release September 2015
• So far, the vast majority of commits are from StreamSets staff
29. Other Alternatives
• Spring Cloud Data Flow
• Node-RED
• Project Flogo
• Oracle Streaming Analytics
• Spark Streaming
• …
31. Oracle's Service Bus as a consumer of Kafka
[Diagram: Service Bus 12c proxy services with pipelines and routing consume Kafka topics fed by sensors / IoT, web apps, mobile apps and database CDC; business services deliver to cloud apps (Cloud API) via REST and to backend apps via REST and WSDL; Stream Processing also consumes from Kafka.]
32. Oracle's Service Bus as a producer to Kafka
[Diagram: Service Bus 12c proxy services receive REST and SOAP requests from cloud apps (Cloud API), mobile apps, web apps, sensors / IoT, backend apps and SOA / BPM, route them through pipelines, and business services publish to Kafka.]
34. Trivadis @ DOAG 2016
Booth: 3rd floor – next to the escalator
Know-how, T-shirts, contest and Trivadis Power to go
We look forward to your visit – because with Trivadis you always win!