1© Cloudera, Inc. All rights reserved.
IoT with Spark Streaming
Anand Iyer, Senior Product Manager
Spark Streaming
• Incoming data stream is represented as DStreams (Discretized Streams)
• Stream is broken down into micro-batches
• Each micro-batch is an RDD – process using RDD operations
• Micro-batches typically span about 0.5 seconds each
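To make the micro-batch idea concrete, here is a minimal plain-Python sketch. It is not the Spark API: the function name `discretize` and the sample events are assumptions made for illustration, and each returned list merely plays the role of one RDD.

```python
from collections import defaultdict

def discretize(events, batch_interval=0.5):
    """Group (timestamp_seconds, value) events into micro-batches.

    Plain-Python stand-in for a DStream; empty intervals are skipped.
    """
    batches = defaultdict(list)
    for ts, value in events:
        # The batch index is the interval the timestamp falls into.
        batches[int(ts // batch_interval)].append(value)
    return [batches[i] for i in sorted(batches)]

events = [(0.1, "a"), (0.4, "b"), (0.6, "c"), (1.2, "d")]
batches = discretize(events)
# Each micro-batch is then processed with an ordinary per-batch operation.
counts = [len(batch) for batch in batches]
```

With a 0.5 second interval, the four sample events fall into three micro-batches.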
Cloudera customer use case examples – Streaming
• Financial Services: on-line fraud detection
• Retail: on-line recommender systems; inventory management
• Health: incident prediction (sepsis)
• Ad tech: analysis of ad performance in real-time
Concrete end-to-end IoT Use Case
Using Spark Streaming with Kafka, HBase & Solr
Proactive maintenance and accident prevention in Railways
• Sensor information continuously streaming in from railway carriages
• Goal: Early detection of damage to rail carriage wheels or to railway tracks
• Proactively fix issues before they become severe
• Prevent derailments, save money and lives
• Based on real-world use case, modified to fit the talk
Locomotive Wheel Axle Sensors
Each Sensor Reading Contains:
- Unique ID
- Locomotive ID
- Speed
- Temperature
- Pressure
- Acoustic signals
- GPS Co-ordinates
- Timestamp
- etc
Identify Damage to locomotive axle or wheels
Manifests as a sustained increase in sensor readings such as temperature, pressure, acoustic noise, etc.
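One possible rule for a "sustained increase" is sketched below in plain Python. The function name, the baseline, the margin and the run length are all illustrative assumptions, not values from the talk.

```python
def sustained_increase(readings, baseline, margin, min_run):
    """Return True if `min_run` consecutive readings exceed baseline + margin."""
    run = 0
    for value in readings:
        # Extend the run while readings stay elevated, otherwise reset it.
        run = run + 1 if value > baseline + margin else 0
        if run >= min_run:
            return True
    return False

axle_temperatures = [70, 71, 85, 86, 88, 87]
single_blip = [70, 95, 71, 70]
wheel_alert = sustained_increase(axle_temperatures, baseline=70, margin=10, min_run=3)
blip_alert = sustained_increase(single_blip, baseline=70, margin=10, min_run=3)
```

Requiring several consecutive elevated readings keeps a one-off blip from raising a wheel-damage alert.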
Identify Damage on railway tracks
Manifests as a sudden spike in sensor readings for pressure or acoustic noise.
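A sudden spike can be approximated by comparing each reading with the mean of the readings just before it. The sketch below is plain Python; the window size and the factor of 2.0 are assumed values that a real deployment would need to tune.

```python
def spike_indices(readings, window=5, factor=2.0):
    """Indices where a reading exceeds `factor` times the mean of the
    previous `window` readings."""
    spikes = []
    for i in range(window, len(readings)):
        recent = readings[i - window:i]
        mean = sum(recent) / window
        if readings[i] > factor * mean:
            spikes.append(i)
    return spikes

pressure = [10, 11, 10, 12, 11, 40, 11, 10]
spikes = spike_indices(pressure)
```

Only the isolated jump to 40 is flagged; the readings after it are compared against a mean that already includes the spike, so they are not reported.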
Real-Time Detection of Locomotive Wheel Damage
[Diagram label: Kafka]
- Enrich incoming events with relevant metadata
  - Locomotive information from locomotive ID: type, weight, cargo, etc.
  - Sensor information from sensor ID: precise location, type, etc.
  - GPS co-ordinates to location characteristics such as gradient of track
  - Recommend HBase as metadata store
  - Use HBase-spark module to fetch data
- Apply application logic to determine if sensor readings indicate damage
  - Simple rule-based logic
  - Complex predictive machine learning model
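The enrich-then-evaluate flow can be sketched in plain Python. The dictionaries below are hypothetical in-memory stand-ins for the HBase metadata tables, and the field names and the 90-degree threshold are assumptions made for illustration.

```python
# Hypothetical metadata tables standing in for HBase lookups.
LOCOMOTIVES = {"loco-7": {"type": "freight", "weight_tons": 120}}
SENSORS = {"s-42": {"position": "front-left axle", "kind": "temperature"}}

def enrich(event):
    """Return a copy of the event with locomotive and sensor metadata attached."""
    enriched = dict(event)
    enriched["locomotive"] = LOCOMOTIVES.get(event["locomotive_id"], {})
    enriched["sensor"] = SENSORS.get(event["sensor_id"], {})
    return enriched

def indicates_damage(event, max_temperature=90.0):
    """Simple rule: flag temperature readings above a fixed threshold."""
    return event["temperature"] > max_temperature

raw_event = {"locomotive_id": "loco-7", "sensor_id": "s-42", "temperature": 97.5}
enriched_event = enrich(raw_event)
damage_flag = indicates_damage(enriched_event)
```

A predictive model could replace `indicates_damage` without changing the enrichment step.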
Real-Time Detection of Locomotive Wheel Damage
[Architecture diagram; labels: Kafka, Kafka, HDFS]
https://github.com/harishreedharan/spark-streaming-kafka-output
Real-Time Detection of Locomotive Wheel Damage
- When an alert is raised, a technician will need to diagnose the event
- This requires visualizing sensor data as a time series:
  - Over arbitrary windows of time
  - Compared with values from prior trips
- Software for visualization: http://grafana.org/
- The technician can take appropriate action based on the analysis:
  - Send the rail carriage for maintenance
  - Stop the train immediately to prevent an accident
[Diagram label: Visualize Time-Series Sensor Data]
Data Store for Time-Series Data
Ideal solution: Kudu
- Time-series data entails sequential scans for writes and reads, interspersed with random seeks
Until Kudu is GA:
- Use HBase and model tables for time-series data
- OpenTSDB:
  - Built on top of HBase
  - Uses an HBase table schema optimized for time-series data
  - Simple HTTP API
Real-Time Detection of Locomotive Wheel Damage
[Architecture diagram; labels: Kafka, Kafka, HDFS]
Detecting damage to Rail tracks
• Damage manifests as a sharp spike in sensor readings (pressure, acoustic noise)
• Multiple sensors will show the same spike at the same location (GPS co-ordinates)
• Multiple sensors from multiple trains will give similar readings at the same location
How to detect?
• Index each sensor reading in Solr so that readings can be queried by GPS co-ordinates
• When a “spike” is observed and the corresponding alert event is fired, trigger a search
Detecting damage to Rail tracks
• Index each sensor reading with the Morphlines library
  • Embed the call to Morphlines in your Spark Streaming application
  • Values can be kept in the index for a specified period of time, such as a month. Solr can automatically purge old documents from the index.
• When a “spike” is observed and the corresponding alert event is fired, trigger a search (manually or programmatically)
  • Search for sensor readings at the same GPS co-ordinates as the latest spike
  • Filter out irrelevant readings (e.g. readings on the left track, if the spike was observed on the right track)
  • Sort results by time, latest to oldest
  • If the majority of recent readings show a “spike”, this indicates track damage
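The search-filter-sort-vote procedure above can be sketched in plain Python over an in-memory list, which stands in for a Solr query here. The field names (`lat`, `lon`, `track`, `timestamp`, `is_spike`), the matching radius and the number of recent readings considered are all assumptions.

```python
def track_damage_suspected(readings, spike, recent=5, radius=0.0005):
    """Majority vote over the most recent readings near the spike's location."""
    # Keep readings at (roughly) the same co-ordinates and on the same track.
    nearby = [r for r in readings
              if abs(r["lat"] - spike["lat"]) <= radius
              and abs(r["lon"] - spike["lon"]) <= radius
              and r["track"] == spike["track"]]
    # Sort latest to oldest and keep only the most recent readings.
    nearby.sort(key=lambda r: r["timestamp"], reverse=True)
    latest = nearby[:recent]
    if not latest:
        return False
    spikes = sum(1 for r in latest if r["is_spike"])
    return spikes * 2 > len(latest)

readings = [
    {"lat": 30.2672, "lon": -97.7431, "track": "right", "timestamp": 100, "is_spike": True},
    {"lat": 30.2672, "lon": -97.7431, "track": "right", "timestamp": 200, "is_spike": False},
    {"lat": 30.2672, "lon": -97.7431, "track": "right", "timestamp": 300, "is_spike": True},
    {"lat": 30.2672, "lon": -97.7431, "track": "left", "timestamp": 350, "is_spike": False},
    {"lat": 31.0000, "lon": -97.0000, "track": "right", "timestamp": 360, "is_spike": False},
]
latest_spike = {"lat": 30.2672, "lon": -97.7431, "track": "right",
                "timestamp": 400, "is_spike": True}
suspected = track_damage_suspected(readings + [latest_spike], latest_spike)
```

In this sample, three of the four matching right-track readings are spikes, so track damage is suspected.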
Final Architecture
[Architecture diagram; labels: Kafka, Kafka, HDFS]
Noteworthy Streaming Constructs
Sliding Window Operations
Define operations on data within a sliding window.
Window parameters:
- window length
- sliding interval
Example usages:
- Compute counts of items in the latest window of time, such as occurrences of exceptions in a log or trending hashtags in a tweet stream
- Join two streams by matching keys within the same window
Note: Provide adequate memory to hold a window’s worth of data
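How the two window parameters interact can be shown with a plain-Python sketch that counts hashtags over a sliding window. This is not the Spark API: here both parameters are expressed as a number of micro-batches (in Spark Streaming they are durations that must be multiples of the batch interval), and the function name and sample data are assumptions.

```python
from collections import Counter

def windowed_counts(batches, window_length, slide_interval):
    """Count items per window; both parameters are in numbers of micro-batches."""
    results = []
    # A result is emitted every `slide_interval` batches.
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        window = batches[start:end]
        results.append(Counter(item for batch in window for item in batch))
    return results

batches = [["#a"], ["#a", "#b"], ["#b"], ["#c"]]
counts = windowed_counts(batches, window_length=3, slide_interval=2)
```

Because the window (3 batches) is longer than the slide (2 batches), consecutive windows overlap, which is why a full window of data has to be held in memory.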
Maintain and update arbitrary state
updateStateByKey(...)
• Define initial state
• Provide state update function
• Continuously update with new information
Examples:
• Running count of words seen in text stream
• Per-user session state from activity stream
Note: Requires periodic checkpointing to fault-tolerant storage.
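The running word count example can be simulated in plain Python to show the shape of the state update function. The helper `update_state_by_key` below is a hypothetical stand-in written for this sketch, not the Spark method itself, and it keeps state in an ordinary dictionary with no checkpointing.

```python
def update_state_by_key(state, batch, update_func):
    """Apply update_func(new_values, old_state) to every key seen so far."""
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    new_state = {}
    for key in set(state) | set(grouped):
        updated = update_func(grouped.get(key, []), state.get(key))
        if updated is not None:  # Returning None drops the key from the state.
            new_state[key] = updated
    return new_state

def running_count(new_values, old_count):
    """State update function: the initial state is treated as 0."""
    return (old_count or 0) + sum(new_values)

state = {}
for batch in [[("spark", 1), ("kafka", 1)], [("spark", 1)]]:
    state = update_state_by_key(state, batch, running_count)
```

Note that the update function is called for keys with no new values as well, which is how existing state is carried forward from one micro-batch to the next.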
Lessons from Production
Use Kafka Direct Connector whenever possible
• Better efficiency and performance than receiver-based connectors
• Automatic back-pressure: steady performance
[Diagram: Kafka feeding two Spark layouts, one with a Spark Driver, four Executors and two Receivers, the other with a Spark Driver and four Executors only]
The challenge with Checkpoints
• Spark checkpoints are Java-serialized
• Upgradeability can be an issue: upgrading the version of Spark or your application can make checkpointed data unreadable
But long-running applications need updates and upgrades!
Upgrades with Checkpoints
• Most often, all you need to pick up is some previous state: maybe an RDD, some “state” (updateStateByKey), or the last processed Kafka offsets
• The solution: disable Spark checkpoints
• Use foreachRDD to persist state yourself, to HDFS, in a format your application can understand
  • E.g. Avro, Protobuf, Parquet…
• For updateStateByKey, generate the new state, then persist it
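The idea of persisting state in a format the application controls can be sketched in plain Python. As loudly labeled assumptions: JSON and a local temporary directory stand in for Avro/Protobuf/Parquet on HDFS, and the function names, the `version` field and the offsets layout are invented for this illustration.

```python
import json
import os
import tempfile

def save_state(path, state, offsets):
    """Write state and offsets in a versioned, application-defined format."""
    payload = {"version": 1, "state": state, "offsets": offsets}
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(payload, handle)
    # Rename after a complete write so readers never see a partial file.
    os.replace(tmp_path, path)

def load_state(path):
    """Return (state, offsets), or empty values on a first start."""
    if not os.path.exists(path):
        return {}, {}
    with open(path) as handle:
        payload = json.load(handle)
    return payload["state"], payload["offsets"]

state_dir = tempfile.mkdtemp()
state_path = os.path.join(state_dir, "state.json")
save_state(state_path, {"spark": 2}, {"sensors-0": 1500})
restored_state, restored_offsets = load_state(state_path)
```

Because the file format belongs to the application rather than to Spark's serialization, a new version of the application or of Spark can still read it after an upgrade.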
updateStateByKey(…) upcoming improvements
• Time-out: Automatically delete data after a preset number of micro-batches
• Efficient Updates: Only update a subset of the keys
• Callback to persist state during graceful shutdown
Exactly Once Semantics
What is it?
Given a stream of incoming data, any operator is applied exactly once to each item.
Why is it important?
It prevents erroneous processing of the data stream, e.g. double counting in aggregations or raising redundant alerts.
Spark Streaming provides exactly-once semantics for data transformations.
However, output operations provide at-least-once semantics!
Exactly Once Semantics with Spark Streaming & Kafka
• Associate a “key” with each value written to the external store, which can be used for de-duping
• This key needs to be unique for a given micro-batch
• Kafka Direct Connector provides the following with each record, which will be the same for a given micro-batch:
Kafka-Partition + start-offset + end-offset
• Check out org.apache.spark.streaming.kafka.OffsetRange and org.apache.spark.streaming.kafka.HasOffsetRanges
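The de-duping idea can be sketched in plain Python: build a key from the offset range of a micro-batch and let the store ignore a key it has already seen. The `IdempotentStore` class is a hypothetical in-memory stand-in for the external store, and including the topic name in the key is an assumption added here so that keys stay unique across topics.

```python
def batch_key(topic, partition, from_offset, until_offset):
    """Build a de-dup key from the offset range of one micro-batch partition."""
    return f"{topic}-{partition}-{from_offset}-{until_offset}"

class IdempotentStore:
    """In-memory stand-in for an external store that rejects repeated keys."""

    def __init__(self):
        self.rows = {}

    def write(self, key, value):
        """Return True if stored, False if the key was already written."""
        if key in self.rows:
            return False
        self.rows[key] = value
        return True

store = IdempotentStore()
key = batch_key("sensors", 0, 1500, 1600)
first_write = store.write(key, {"alerts": 3})
# A replayed micro-batch covers the same offsets, so it yields the same key.
replayed_write = store.write(key, {"alerts": 3})
```

If a micro-batch is re-executed after a failure, its output carries the same key and the second write is a no-op, which turns at-least-once output into effectively exactly-once results.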
Thank You

IoT Austin CUG talk

  • 1. 1© Cloudera, Inc. All rights reserved. IoT with Spark Streaming Anand Iyer, Senior Product Manager
  • 2. 2© Cloudera, Inc. All rights reserved. Spark Streaming • Incoming data stream is represented as DStreams (Discretized Streams) • Stream is broken down into micro-batches • Each micro-batch is an RDD – process using RDD operations • Micro-batches usually 0.5 sec in size
  • 3. 3© Cloudera, Inc. All rights reserved. Cloudera customer use case examples – Streaming • On-line fraud detection Financial Services • On-line recommender systems • Inventory management Retail • Incident prediction (sepsis) Health • Analysis of ad performance in real-time Ad tech
  • 4. 4© Cloudera, Inc. All rights reserved. Concrete end-to-end IoT Use Case Using Spark Streaming with Kafka, HBase & Solr
  • 5. 5© Cloudera, Inc. All rights reserved. Proactive maintenance and accident prevention in Railways • Sensor information continuously streaming in from railway carriages • Goal: Early detection of damage to rail carriage wheels or to railway tracks • Proactively fix issues before they become severe • Prevent derailments, save money and lives • Based on real-world use case, modified to fit the talk
  • 6. 6© Cloudera, Inc. All rights reserved. Locomotive Wheel Axle Sensors Each Sensor Reading Contains: - Unique ID - Locomotive ID - Speed - Temperature - Pressure - Acoustic signals - GPS Co-ordinates - Timestamp - etc
  • 7. 7© Cloudera, Inc. All rights reserved. Identify Damage to locomotive axle or wheels Manifests as sustained increase in sensor readings like temperature, pressure, acoustic noise, etc.
  • 8. 8© Cloudera, Inc. All rights reserved. Identify Damage on railway tracks Manifests as a sudden spike in sensor readings for pressure or acoustic noise.
  • 9. 9© Cloudera, Inc. All rights reserved. Real-Time Detection of Locomotive Wheel Damage Kafka - Enrich incoming events with relevant meta- data - Locomotive information from locomotive ID: type, weight, cargo,etc - Sensor information from Sensor ID: precise location, type, etc - GPS co-ordinates to location characteristics such as gradient of track. - Recommend HBase as metadata store. - Use HBase-spark module to fetch data. - Apply application logic to determine if sensor readings indicate damage - Simple rule based - Complex predictive machine learning model
  • 10. 10© Cloudera, Inc. All rights reserved. Real-Time Detection of Locomotive Wheel Damage Kafka Kafka https://github.com/harishreedharan/spark-streaming-kafka-output HDFS
  • 11. 11© Cloudera, Inc. All rights reserved. Real-Time Detection of Locomotive Wheel Damage - When an alert is thrown, technician will need to diagnose the event - Requires visualizing sensor data as a time-series: - Over arbitrary windows of time - Compare with values from prior trips - Software for visualization: http://grafana.org/ - Technician can take appropriate action based on analysis: - Send rail carriage for maintenance - Stop train immediately to prevent accident Visualize Time-Series Sensor Data
  • 12. 12© Cloudera, Inc. All rights reserved. Data Store for Time-Series Data Ideal solution: Kudu - Time series data entails sequential scans for writes and reads, interspersed with random seeks Until Kudu is GA: - Use HBase and model tables for time-series data - OpenTSDB: - Built on top of HBase - Uses a HBase table schema optimized for time-series data - Simple HTTP API
  • 13. 13© Cloudera, Inc. All rights reserved. Real-Time Detection of Locomotive Wheel Damage Kafka Kafka HDFS
  • 14. Detecting damage to rail tracks • They manifest as a sharp spike in sensor readings (pressure, acoustic noise) • Multiple sensors will demonstrate the same spike at the same location (GPS co-ordinates) • Multiple sensors from multiple trains will give similar readings at the same location. How to detect? • Index each sensor reading in Solr, such that they can be queried by GPS co-ordinates • When a “spike” is observed, and a corresponding alert event is fired, trigger a search
  • 15. Detecting damage to rail tracks • Index each sensor reading with the Morphlines library • Embed the call to Morphlines in your Spark Streaming application • Values can be kept in the index for a specified period of time, such as a month. Solr can automatically purge old documents from the index. • When a “spike” is observed, and a corresponding alert event is fired, trigger a search (manually or programmatically) • Search for sensor readings at the same GPS co-ordinates as the latest spike. • Filter out irrelevant readings (e.g. readings on the left track, if the spike was observed on the right track) • Sort results by time, latest to oldest • If the majority of recent readings show a “spike”, that is indicative of track damage
  • 16. Final Architecture [Diagram labels: Kafka, Kafka, HDFS]
  • 17. Noteworthy Streaming Constructs
  • 18. Sliding Window Operations: Define operations on data within a sliding window. Window Parameters: - window length - sliding interval. Example usages: - Compute counts of items in the latest window of time, such as occurrences of exceptions in a log or trending hashtags in a tweet stream - Join two streams by matching keys within the same window. Note: Provide adequate memory to hold a window’s worth of data
  • 19. Maintain and update arbitrary state: updateStateByKey(...) • Define initial state • Provide state update function • Continuously update with new information. Examples: • Running count of words seen in text stream • Per-user session state from activity stream. Note: Requires periodic checkpointing to fault-tolerant storage.
  • 20. Lessons from Production
  • 21. Use Kafka Direct Connector whenever possible • Better efficiency and performance than Receiver-based Connectors • Automatic back-pressure: steady performance [Diagram labels: Kafka, Spark Driver, Executors, Receivers]
  • 22. The challenge with Checkpoints • Spark checkpoints are Java serialized • Upgradeability can be an issue: upgrading the version of Spark or your application can make checkpointed data unreadable. But long-running applications need updates and upgrades!
  • 23. Upgrades with Checkpoints • Most often, all you need to pick up is some previous state: maybe an RDD, or some “state” (updateStateByKey), or the last processed Kafka offsets • The solution: Disable Spark Checkpoints • Use foreachRDD to persist state yourself, to HDFS, in a format your application can understand • E.g. Avro, Protobuf, Parquet… • For updateStateByKey, generate the new state, then persist
  • 24. updateStateByKey(…) upcoming improvements • Time-out: Automatically delete data after a preset number of micro-batches • Efficient Updates: Only update a subset of the keys • Callback to persist state during graceful shutdown
  • 25. Exactly Once Semantics. What is it? Given a stream of incoming data, any operator is applied exactly once on each item. Why is it important? It prevents erroneous processing of the data stream, e.g. double counting in aggregations or throwing of redundant alerts. Spark Streaming provides exactly-once semantics for data transformations. However, output operations provide at-least-once semantics!
  • 26. Exactly Once Semantics with Spark Streaming & Kafka • Associate a “key” with each value written to the external store, which can be used for de-duping • This key needs to be unique for a given micro-batch • Kafka Direct Connector provides the following associated with each record, which will be the same for a given micro-batch: Kafka-Partition + start-offset + end-offset • Check out org.apache.spark.streaming.kafka.OffsetRange and org.apache.spark.streaming.kafka.HasOffsetRanges
  • 27. Thank You
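
The "simple rule based" option from slide 9, applied to the two patterns on slides 7 and 8, could look roughly like the sketch below. It is plain Python rather than Spark code, and the function name, the `baseline` value, the `threshold` multiplier and the `sustain` count are all illustrative assumptions, not values from the talk.

```python
def classify_readings(readings, baseline, threshold=1.5, sustain=3):
    """Classify one sensor metric (e.g. temperature) over a trip segment.

    baseline  -- assumed "normal" value for this sensor (hypothetical input,
                 e.g. looked up from the metadata store)
    threshold -- assumed multiple of baseline that counts as "elevated"
    sustain   -- assumed number of consecutive elevated readings that
                 counts as a sustained increase
    """
    longest = run = 0
    for value in readings:
        # Track the longest run of consecutive elevated readings.
        run = run + 1 if value > baseline * threshold else 0
        longest = max(longest, run)
    if longest >= sustain:
        return "sustained"  # pattern from slide 7: possible wheel or axle damage
    if longest >= 1:
        return "spike"      # pattern from slide 8: possible track damage
    return "normal"


if __name__ == "__main__":
    print(classify_readings([10, 10, 30, 10, 10], baseline=10))
    print(classify_readings([10, 20, 25, 30, 35], baseline=10))
```

A real deployment would likely adjust the baseline using the enriched metadata (for instance, tolerating higher temperatures on a steep gradient), or replace the rule with a trained model.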
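
The decision steps on slide 15 (search at the same location, filter by track side, sort latest first, check for a majority of spikes) can be sketched in plain Python as follows. The Solr query itself is not shown; the record fields `gps`, `side`, `timestamp` and `spike`, the function name, and the `recent` cut-off are hypothetical and only stand in for whatever the index would return.

```python
def track_damage_suspected(readings, location, track_side, recent=5):
    """Return True if most recent readings at a location show a spike.

    readings -- list of dicts with assumed fields:
                'gps' (hashable location), 'side' ('left' or 'right'),
                'timestamp' (sortable), 'spike' (bool)
    recent   -- assumed number of latest readings to consider
    """
    # Keep only readings at the same location and on the same side of the track.
    # Exact equality on 'gps' is a simplification; a real search would
    # match within some radius.
    relevant = [r for r in readings
                if r["gps"] == location and r["side"] == track_side]
    # Sort latest to oldest, then look at the most recent few.
    relevant.sort(key=lambda r: r["timestamp"], reverse=True)
    latest = relevant[:recent]
    if not latest:
        return False
    spikes = sum(1 for r in latest if r["spike"])
    return spikes > len(latest) / 2


if __name__ == "__main__":
    sample = [
        {"gps": (12.97, 77.59), "side": "right", "timestamp": 1, "spike": False},
        {"gps": (12.97, 77.59), "side": "right", "timestamp": 2, "spike": True},
        {"gps": (12.97, 77.59), "side": "right", "timestamp": 3, "spike": True},
        {"gps": (12.97, 77.59), "side": "left", "timestamp": 3, "spike": False},
    ]
    print(track_damage_suspected(sample, (12.97, 77.59), "right"))
```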
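
To make the two window parameters from slide 18 concrete, here is a small plain-Python simulation of counting items over a sliding window of micro-batches. It does not use Spark; `windowed_counts` is a hypothetical helper, and both parameters are expressed as a number of micro-batches rather than as durations.

```python
from collections import Counter


def windowed_counts(batches, window_length, slide_interval):
    """Count items over a sliding window of micro-batches.

    batches        -- list of micro-batches, each a list of items
    window_length  -- how many micro-batches each window covers
    slide_interval -- how many micro-batches pass between evaluations
    """
    results = []
    # Evaluate the window every `slide_interval` batches.
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        counts = Counter()
        for batch in batches[start:end]:
            counts.update(batch)
        results.append(dict(counts))
    return results


if __name__ == "__main__":
    stream = [["a"], ["a", "b"], ["b"], ["c"]]
    for window in windowed_counts(stream, window_length=3, slide_interval=2):
        print(window)
```

The memory note on the slide follows from the inner loop: every evaluation touches a full window's worth of batches, so all of them must be retained until they slide out of the window.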
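
The state-update pattern from slide 19 can be illustrated without a cluster. The sketch below simulates, in plain Python, what happens on each micro-batch: new values are grouped by key and an update function receives the new values plus the previous state. The `(new_values, previous)` argument order mirrors the update function that PySpark's `updateStateByKey` expects; `apply_batch` is a hypothetical stand-in for the framework, not a Spark API.

```python
def update_count(new_values, previous):
    """State update function: running count per key.

    previous is None the first time a key is seen (the initial state).
    """
    return (previous or 0) + sum(new_values)


def apply_batch(state, batch, update_func):
    """Apply one micro-batch of (key, value) pairs to the existing state."""
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    new_state = {}
    # The update function runs for every key that has state or new values,
    # even when a key received nothing in this batch.
    for key in set(state) | set(grouped):
        new_state[key] = update_func(grouped.get(key, []), state.get(key))
    return new_state


if __name__ == "__main__":
    state = {}
    state = apply_batch(state, [("a", 1), ("b", 1), ("a", 1)], update_count)
    state = apply_batch(state, [("b", 1)], update_count)
    print(sorted(state.items()))
```

Because every key is revisited on every batch, state that only grows becomes expensive, which is the motivation for the time-out and subset-update improvements listed on slide 24.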
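
The de-duplication idea on slide 26 can be sketched as follows: derive a key from the Kafka partition and offset range of a micro-batch, and make the write to the external store idempotent on that key. This is plain Python with an in-memory dictionary standing in for the external store; `batch_key` and `IdempotentStore` are hypothetical names, and including the topic in the key is an added assumption beyond the partition and offsets named on the slide.

```python
def batch_key(topic, partition, from_offset, until_offset):
    """Build a key that is identical whenever the same offset range is replayed."""
    return f"{topic}-{partition}-{from_offset}-{until_offset}"


class IdempotentStore:
    """In-memory stand-in for an external store that de-dupes on key."""

    def __init__(self):
        self._rows = {}

    def write(self, key, value):
        # A replayed micro-batch produces the same key, so the second
        # write is skipped instead of double counting.
        if key in self._rows:
            return False
        self._rows[key] = value
        return True

    def __len__(self):
        return len(self._rows)


if __name__ == "__main__":
    store = IdempotentStore()
    key = batch_key("sensor-readings", 0, 100, 150)
    print(store.write(key, {"alerts": 2}))  # first attempt
    print(store.write(key, {"alerts": 2}))  # replay after a failure
    print(len(store))
```

With this in place, the at-least-once output behaviour described on slide 25 no longer produces duplicate rows, since a retried batch maps to a key that already exists.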

Editor's Notes

  1. Good afternoon, everyone. Today we are going to talk about one of the most popular extensions of Spark: Spark Streaming. And we will talk about using Spark Streaming to implement a use case in a fast-growing and, simply put, really cool and popular domain: the Internet of Things. We will walk you through a concrete Internet of Things use case. When we talk about the use case, we will focus on end-to-end architectures. After covering the use case, we will do a deeper dive into some interesting Spark Streaming features such as sliding windows, streaming state, and ML algorithms, and share some pro tips or best practices with you.
  2. So first, a very quick primer on Spark Streaming: In Spark Streaming, each incoming data stream is represented by an abstraction called a DStream, which stands for Discretized Stream. A DStream is a continuous stream of data, broken up into chunks called micro-batches. Data in each micro-batch becomes an RDD, and is processed by RDD operations. A batch Spark job is defined as a sequence of transformations and actions on RDDs; similarly, a streaming job is authored as a sequence of transformations and actions on DStreams. DStream micro-batch sizes are often 1 second or even 0.5 seconds.
  3. Spark Streaming has seen tremendous adoption over the past year, and we are seeing customers deploy it for a wide variety of use cases. Here I have a random collection of examples of use cases in diverse industries.
  4. But today we will talk about an Internet of Things use case: proactive maintenance and accident prevention in railways. The Internet of Things is all about sensors continuously sending data back to your data center. In our case, we are talking about sensors fitted to railway locomotives and railway carriages. The goal is to process this sensor data to identify two critical issues: damage to the wheels or axles of trains, and damage to railway tracks. At one end, this will help us prevent derailments. Trains are among the safest modes of transportation, much safer than cars. However, many of these accidents are preventable. Also, the proportion of freight trains is a lot higher than that of trains transporting humans. When a freight train derailment happens, there may not always be a loss of life, and hence it is not covered in the news, but there is a heavy financial loss, all of which can be avoided. That is one end of the spectrum. The other piece is simply identifying defects early, so that they can be fixed proactively, thus extending the lifespan of locomotives and rail carriages as well as tracks. Fixing issues early, nipping them in the bud so to speak, will invariably save costs. This example is based on a real-world use case, but it has been heavily modified and simplified to fit a 15-minute slide deck.
  5. Let's do a deeper dive into the sensors we are talking about. In this diagram of a railway carriage, the red spots on the wheels of the train denote where the sensors will be located. These sensors will send back information on a regular basis, let's say a couple of measurements per second. The frequency of readings is adjustable. Each reading will have: a unique ID that identifies the sensor; an ID that identifies the locomotive; a speed measurement (while diagnosing an issue it is important to know how fast the train was going); a temperature measurement (if something goes wrong, invariably something is bound to get too hot); pressure (if the wheels cannot spin comfortably because something is hindering them, the pressure readings will go up); acoustic signals, basically noise (noise is a good indicator of problems; for example, the sound of clanging metal is a lot different from the smooth turning of wheels or humming of engines); GPS co-ordinates (this is important: we need to know where the train is for many reasons, which we will talk about shortly); and a timestamp (you need to know when the reading was taken).
  6. OK, so given these sensor readings, how do we identify damage to the wheels? It will manifest as a sustained increase in sensor readings like temperature, pressure or acoustic noise. It will be a pronounced, lasting increase, possibly getting progressively worse.
  7. How do we identify damage to the rail track? Damage to the rail track is going to be at a specific location, often on just one side of the track. When a wheel goes over the problem area, there is bound to be a sudden, sharp, pronounced spike in sensor readings, most likely acoustic noise and pressure. The key thing is that it will be a pronounced spike, at a specific location, after which the sensor readings will come back down to normal.
  8. Cool. Now let's talk about the implementation. Data gets from the locomotive sensors to our data center. How that happens is not in the scope of this talk; if you are really curious, you probably need to attend some conference by Cisco or Intel. Once it gets into your data center, write it to a streaming data channel. We recommend Kafka. From Kafka, you can read the events in your Spark Streaming job and process them. We recommend using the receiverless direct connector to read from Kafka. In your Spark Streaming job, you will first need to enrich the data, that is, attach to each event the relevant metadata required to identify damage. For example, use the locomotive ID to get information about the locomotive, such as type (is it freight or passenger), weight, and type of cargo. If it is carrying dangerous chemicals, you probably want to stop the train even for slight damage, versus an empty cargo train. Similarly, join with information about the sensor, such as its location on the train: is it on the right wheel, the left wheel, and so on. From GPS information, figure out where the train is. If it is going up a steep incline, temperature readings may go up, and that is OK. We recommend storing this type of metadata in HBase, which is ideal for random key reads, and HBase comes with the hbase-spark module that makes it easy to call HBase from Spark jobs. Once you have enriched and transformed the data, you can determine if the sensor readings signify damage. That can be rule based, or it can be a trained machine learning model, which again is outside the domain of this talk since we are not a bunch of mechanical engineers.
  9. When potential damage is identified, write an event out to Kafka, say to a topic called “alerts”, and have an application listening to this topic that will send out a pager alert, email alert or other form of alert to technicians. Write raw data to HDFS. It will come in handy: a team of data scientists can do offline analysis. More importantly, the raw data will come in handy when there is a bug in application logic or a faulty sensor and your end results don’t match expectations. So it serves auditing purposes: auditing your application and incoming data.
  10. So we have identified a potential problem. The next step is for a technician to step in and diagnose. Diagnosing the issue will require visualizing the sensor readings as time-series data: look at how they are trending, look at readings for different windows of time, compare with readings from a different time window. All of this entails visualizing time-series data. For visualization of time series, Grafana is a popular and useful application, but there are many other options available, and it is also fairly easy to build one with JavaScript. The technician can manually inspect the data and decide what to do: send the railway carriage for maintenance or, if things seem bad, stop it and get it physically checked out.
  11. We are talking about a lot of sensors producing one or more readings per second. That is a lot of data, and it needs to be stored in a way that lends itself to time-series visualization. Time-series data entails sequential reads, since you look at a continuous window of time. Similarly, writing time-series data is also sequential, since you keep appending newer readings. These sequential scans are interspersed with random reads, for example when you change your window start and stop time or move back and forth between different points in time. The ideal storage for this is Kudu: Kudu delivers the best performance for mixed scan and random seek workloads. Until Kudu is GA, use HBase. HBase performance may not match Kudu performance, but it will certainly work for this use case.
  12. Can we not use sensors on the tracks? Sure. But sensors on locomotives are easy and few. Rail tracks travel through remote areas. Those are the hardest places to put sensors, and those are probably the ones that do need monitoring.
  13. Can we not use sensors on the tracks? Sure. But sensors on locomotives are easy and few. Rail tracks travel through remote areas. Those are the hardest places to put sensors, and those are probably the ones that do need monitoring. Call out Hue!