Anzeige
Anzeige

Más contenido relacionado

Presentaciones para ti(20)

Anzeige

Similar a Introduction to Apache Apex(20)

Anzeige

Introduction to Apache Apex

  1. Introduction to Apache Apex Priyanka Gugale (priyag@apache.org) September 30th 2016
  2. Next Gen Stream Data Processing • Data from variety of sources (IoT, Kafka, files, social media etc.) • Unbounded, continuous data streams ᵒ Batch can be processed as stream (but a stream is not a batch) • (In-memory) Processing with temporal boundaries (windows) • Stateful operations: Aggregation, Rules, … -> Analytics • Results stored to variety of sinks or destinations ᵒ Streaming application can also serve data with very low latency 2 Browser Web Server Kafka Input (logs) Decompress, Parse, Filter Dimensions Aggregate Kafka Logs Kafka
  3. Apache Apex 3 • In-memory, distributed stream processing • Application logic broken into components called operators that run in a distributed fashion across your cluster • Natural programming model • Unobtrusive Java API to express (custom) logic • Maintain state and metrics in your member variables • Scalable, high throughput, low latency • Operators can be scaled up or down at runtime according to the load and SLA • Dynamic scaling (elasticity), compute locality • Fault tolerance & correctness • Automatically recover from node outages without having to reprocess from beginning • State is preserved, checkpointing, incremental recovery • End-to-end exactly-once • Operability • System and application metrics, record/visualize data • Dynamic changes
  4. Apex Platform Overview 4
  5. Native Hadoop Integration 5 • YARN is the resource manager • HDFS for storing persistent state
  6. Application Development Model 6 ▪A Stream is a sequence of data tuples ▪A typical Operator takes one or more input streams, performs computations & emits one or more output streams • Each Operator is YOUR custom business logic in java, or built-in operator from our open source library • Operator has many instances that run in parallel and each instance is single-threaded ▪Directed Acyclic Graph (DAG) is made up of operators and streams Directed Acyclic Graph (DAG) Output Stream Tupl e Tupl e er Operator er Operator er Operator er Operator er Operator er Operator
  7. Dag Components 7 • Tuple ● Atomic data that flows over a stream • Operator ● Basic compute unit per tuple • Stream ● Connector abstraction between operators ● Tuples flow over this Operator 1 Operator 2 Stream tuple 3 tuple 1 tuple 2
  8. Dag Example 8 Stream
  9. Operator Library 9 RDBMS • Vertica • MySQL • Oracle • JDBC NoSQL • Cassandra, Hbase • Aerospike, Accumulo • Couchbase/ CouchDB • Redis, MongoDB • Geode Messaging • Kafka • Solace • Flume, ActiveMQ • Kinesis, NiFi File Systems • HDFS/ Hive • NFS • S3 Parsers • XML • JSON • CSV • Avro • Parquet Transformations • Filters • Rules • Expression • Dedup • Enrich Analytics • Dimensional Aggregations (with state management for historical data + query) Protocols • HTTP • FTP • WebSocket • MQTT • SMTP Other • Elastic Search • Script (JavaScript, Python, R) • Solr • Twitter
  10. 10 Platform Features
  11. Windowing in Apex 11 ● Data is flowing w.r.t time ● Computers understands time ● Use time axis as a reference ● Break the stream into finite time slices ⇒ Streaming Windows
  12. Windowing in Apex 12 12 Input Operator Operator 1 Operator 2 Operator 3 Window N+1 Begin Window Data Tuple End Window WNWN+1WN+2 As time progress
  13. Checkpointing 13 ▪ Application window ▪ Sliding window and tumbling window ▪ Checkpoint window ▪ No artificial latency
  14. Scalability 14 NxM Partitions Unifier 0 1 2 3 Logical DAG 0 1 2 1 1 Un ifie r 1 20 Logical Diagram Physical Diagram with operator 1 with 3 partitions 0 U ni fi e r 1 a 1 b 1 c 2 a 2 b U ni fi e r 3 Physical DAG with (1a, 1b, 1c) and (2a, 2b): No bottleneck U ni fi e r U ni fi e r 0 1 a 1 b 1 c 2 a 2 b U ni fi e r 3 Physical DAG with (1a, 1b, 1c) and (2a, 2b): Bottleneck on intermediate Unifier
  15. Fault Tolerance 15 • Operator state is checkpointed to persistent store ᵒ Automatically performed by engine, no additional coding needed ᵒ Asynchronous and distributed ᵒ In case of failure operators are restarted from checkpoint state • Automatic detection and recovery of failed containers ᵒ Heartbeat mechanism ᵒ YARN process status notification • Buffering to enable replay of data from recovered point ᵒ Fast, incremental recovery, spike handling • Application master state checkpointed ᵒ Snapshot of physical (and logical) plan ᵒ Execution layer change log
  16. • In-memory PubSub • Stores results emitted by operator until committed • Handles backpressure / spillover to local disk • Ordering, idempotency Operator 1 Container 1 Buffer Server Node 1 Operator 2 Container 2 Node 2 Buffer Server 16
  17. Industrial IoT applications 17 GE is dedicated to providing advanced IoT analytics solutions to thousands of customers who are using their devices and sensors across different verticals. GE has built a sophisticated analytics platform, Predix, to help its customers develop and execute Industrial IoT applications and gain real-time insights as well as actions. Business Need Apex based Solution Client Outcome • Ingest and analyze high-volume, high speed data from thousands of devices, sensors per customer in real-time without data loss • Predictive analytics to reduce costly maintenance and improve customer service • Unified monitoring of all connected sensors and devices to minimize disruptions • Fast application development cycle • High scalability to meet changing business and application workloads • Ingestion application using DataTorrent Enterprise platform • Powered by Apache Apex • In-memory stream processing • Built-in fault tolerance • Dynamic scalability • Comprehensive library of pre-built operators • Management UI console • Helps GE improve performance and lower cost by enabling real-time Big Data analytics • Helps GE detect possible failures and minimize unplanned downtimes with centralized management & monitoring of devices • Enables faster innovation with short application development cycle • No data loss and 24x7 availability of applications • Helps GE adjust to scalability needs with auto-scaling
  18. Resources for the use cases 18 • Pubmatic • https://www.youtube.com/watch?v=JSXpgfQFcU8 • GE • https://www.youtube.com/watch?v=hmaSkXhHNu0 • http://www.slideshare.net/ApacheApex/ge-iot-predix-time-series-data-ingestion-service-using- apache-apex-hadoop • SilverSpring Networks • https://www.youtube.com/watch?v=8VORISKeSjI • http://www.slideshare.net/ApacheApex/iot-big-data-ingestion-and-processing-in-hadoop-by- silver-spring-networks
  19. Q&A 19
Anzeige