Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Design Patterns For Real Time Streaming Data Analytics
15 Apr 2015
Sheetal Dolas
Principal Architect, Hortonworks
Who am I?
• Principal Architect @ Hortonworks
• Most of my career has been in the field, solving real-life business problems
• Last 5+ years in Big Data, including Hadoop, Storm, etc.
• Co-developed Cisco OpenSOC (http://opensoc.github.io)
sheetal@hortonworks.com
@sheetal_dolas
Agenda
• Streaming Architectural Patterns - Overview
• Design Patterns
o What
o Why
o Illustrations
• Q&A
Streaming Architectural Patterns
Real Time Streaming Architecture
[Diagram] Pipeline stages and their components:
• Source Systems: Sources (Syslog, Machine Data, External Streams, Other)
• Data Collection: Flume / Custom (Agent A, Agent B, ... Agent N)
• Messaging System: Kafka (Topic A, Topic B, ... Topic N)
• Real Time Processing: Storm (Topology A, Topology B, ... Topology N)
• Storage: Search (Elastic Search / Solr), Low Latency NoSql (HBase), Historic (Hive / HDFS)
• Access: Web Services, REST API, Web Apps, Analytic Tools, R / Python, BI Tools, Alerting Systems
Lambda Architecture
[Diagram] Components:
• New Data: Data Stream
• Batch Layer: All Data, Pre-compute Views
• Speed Layer: Stream Processing, Real Time View
• Serving Layer: Batch View, Batch View
• Data Access: Query
Kappa Architecture
[Diagram] Components:
• Data Source: Data Stream
• Stream Processing System: Job Version n, Job Version n + 1
• Serving DB: Output table n, Output table n + 1
• Data Access: Query
Design Patterns
Design Pattern – What is it?
A general, reusable solution to a commonly occurring
problem within a given context in software design.
Key terms: Reusable Solution, Commonly Occurring Problem, Contextual, Software Design
Design Patterns – Why?
• Streaming use cases have distinct characteristics
o Unpredictable incoming data patterns
o Correlating multiple streams
o Out-of-sequence and late events
Design Patterns – Why?
• High scale and continuous streams pose new challenges
o Peaks and valleys
o Data characteristics that change over time
o Maintaining latency and throughput SLAs
Streaming Patterns
• Architectural Patterns
o Real-time Streaming
o Near-real-time Streaming
o Lambda Architecture
o Kappa Architecture
• Functional Patterns
o Stream Joins
o Top N (Trending)
o Rolling Windows
• Data Management Patterns
o External Lookup
o Responsive Shuffling
o Out-of-Sequence Events
• Data Security Patterns
o Message Encryption
o Authorized Access
o Secure Cluster Authentication
Streaming Patterns – Being Discussed
• Architectural Patterns
o Real-time Streaming
o Near-real-time Streaming
o Lambda Architecture
o Kappa Architecture
• Functional Patterns
o Stream Joins
o Top N (Trending)
o Rolling Windows
• Data Management Patterns
o External Lookup
o Responsive Shuffling
o Out-of-Sequence Events
• Data Security Patterns
o Message Encryption
o Authorized Access
o Secure Cluster Authentication
External Lookup
Dynamic, High Speed Enrichments With External Data Lookup
External Lookup - Description
Referencing frequently changing external system data for
event enrichment, filtering or validation
while minimizing event processing latency and system
bottlenecks and maintaining high throughput.
External Lookup - Challenges
• Increased latency due to frequent external system calls
• Insufficient memory to hold all reference data in memory
• Scalability & performance issues with large reference
data sets
• Reference data needs frequent cache purges and
refreshes
• External systems can become a bottleneck
External Lookup – Potential Options
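[Comparison table; the per-option ratings were graphical and did not survive extraction]
Criteria: Performance, Scalability, Fault Tolerance
Options: Always Fetch; Cache Everything; Partition and Cache on the go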
External Lookup - A Reference Use Case
• Real Time Credit Card Fraud Identification and Alert
o Credit card transaction data comes as stream (typically through
Kafka)
o External system holds information about the card holder’s recent
location
o Each credit card transaction is looked up against user’s current
location
o If the geographic distance between the credit card transaction
location and user’s recent known location is significant, the credit
card transaction is flagged as potential fraud
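The distance check described above could be sketched as follows. This is a minimal illustration, not the deck's implementation: the haversine formula, the function names and the 500 km threshold are all assumptions chosen for the example, since the slides only say the distance must be "significant".

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def is_potential_fraud(txn_location, user_location, threshold_km=500.0):
    """Flag a transaction that is farther than threshold_km (assumed value)
    from the card holder's most recent known location."""
    return haversine_km(*txn_location, *user_location) > threshold_km
```

A real threshold would likely also account for how old the last known location is, which the slides do not specify.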
External Lookup - Topology Overview
[Diagram] Flow: Source Stream, Credit Card Transaction Spout, Partitioner Bolt, Fraud Analyzer Bolt, Alerting System (Fraud Alert Email)
• Partitioner Bolt: partitions data based on the area code of the mobile numbers
• Fraud Analyzer Bolt: looks up the user's current location from the external system (External Reference Data: User Location Information) and finds the geo distance between transaction location and user location
• Fraud Analyzer Bolt: locally caches the user location data; cache validity is time bound
External Lookup - Peek in the Bolts
[Diagram] Storm
• Partitioner Bolt instances 1, 2, ... n: stream is partitioned based on area code
• Fraud Analyzer Bolt instance 1: CA, NV, TX
• Fraud Analyzer Bolt instance 2: NY, CT, MA
• Fraud Analyzer Bolt instance n: FL, NC, OH
• Local cache (time sensitive) in the Fraud Analyzer Bolt instances (use a lightweight caching solution like Guava)
External Lookup - Benefits of the approach
• Only required data is cached (on demand)
• Each bolt caches only its partition of the reference data
• Data is locally cached so trips to external system are
reduced
• Cache is time sensitive
• On the go cache building handles failures elegantly
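A rough sketch of "partition and cache on the go", under stated assumptions: the slides suggest a lightweight library such as Guava, whereas this example uses a small hand-written time-bound cache so that it stays self-contained. `TtlCache`, `partition_for` and the `loader` callback are hypothetical names introduced only for illustration.

```python
import time


class TtlCache:
    """Tiny time-bound cache: an entry expires ttl_seconds after it was loaded."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # called on a miss; stands in for the external system
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be exercised in tests
        self._entries = {}         # key -> (value, expiry_time)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]          # served locally, no external trip
        value = self._loader(key)  # miss or expired: one trip to the external system
        self._entries[key] = (value, now + self._ttl)
        return value


def partition_for(area_code, num_tasks):
    """Map an area code to an analyzer task so the same code always lands on
    the same task, and each task caches only its share of the reference data."""
    return int(area_code) % num_tasks
```

Because the cache is filled on demand, a restarted task simply starts empty and rebuilds its partition as events arrive, which is the failure-handling property the list above refers to.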
External Lookup – Applicability
• Stream processing depends on external data
• External data is too large to be held in the
memory of each task
• External data keeps changing
• External system has scalability limitations
Responsive Shuffling
Responsive Shuffling - Description
Automatically adjust shuffling for better performance and
throughput during peaks and varying data skews in streams
Responsive Shuffling - Challenges
• Incoming data stream is unpredictable and can be
skewed
• Skew can change from time to time
• Managing latency and throughput with skews is difficult
• Since streams are continuously flowing, restarting the
topology with new shuffling logic is not practical
Shuffling – Potential Options
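[Comparison table; the per-option ratings were graphical and did not survive extraction]
Criteria: Latency & Throughput, System Reliability, Uptime
Options: Static Shuffle; Responsive Shuffle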
Responsive Shuffling - A Reference Use Case
• Optimized HBase Inserts
o Event data is stored in HBase after Storm processing
o Group events such that a bolt can insert more events into HBase
with fewer trips to region servers
o Over time, HBase regions can split/merge
o Automatically adjust the event grouping as the HBase region layout
changes over time
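One way to picture a grouping that follows the region layout is a partitioner keyed on region start keys. The sketch below is an assumption-laden illustration rather than the deck's design: `RegionAwarePartitioner` is a hypothetical class, and how the start keys are fetched from HBase (and how often) is deliberately left out.

```python
import bisect


class RegionAwarePartitioner:
    """Routes row keys to writer tasks according to the region that owns the key."""

    def __init__(self, region_start_keys, num_tasks):
        self._num_tasks = num_tasks
        self.refresh(region_start_keys)

    def refresh(self, region_start_keys):
        """Swap in a new region layout, e.g. after a split or merge."""
        self._start_keys = sorted(region_start_keys)

    def region_index(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        return max(bisect.bisect_right(self._start_keys, row_key) - 1, 0)

    def task_for(self, row_key):
        # Events for the same region always go to the same writer task.
        return self.region_index(row_key) % self._num_tasks
```

Calling `refresh` on a timer or on a write error lets the running topology pick up a new layout without a restart, at the cost of briefly routing with stale boundaries.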
Example – HBase writes w/o responsive shuffling
[Diagram] Three tiers with 300 events flowing through each:
• App Bolt instances 1, 2, 3 (100 events each)
• HBase Bolt instances 1, 2, 3 (100 events each)
• Region Server instances 1, 2, 3 (100 events each)
• Result: 9 trips to region servers
Responsive Shuffling - Design
Example – HBase writes with responsive shuffling
[Diagram] Three tiers with 300 events flowing through each:
• App Bolt instances 1, 2, 3 (100 events each)
• RS Aware Partitioner (x3): partitioner automatically adapts to splitting/merging HBase regions
• HBase Bolt instances 1, 2, 3 (100 events each)
• Region Server instances 1, 2, 3 (100 events each)
• Result: 3 trips to region servers
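The 9-versus-3 trip counts can be reproduced with a small simulation. It assumes, as a simplification not stated on the slides, that each HBase bolt makes one batched write per region server it holds events for; `trips`, `static_route` and `aware_route` are illustrative names.

```python
def trips(events, route):
    """Count batched writes: one per distinct (bolt, region server) pair.

    events: list where events[i] is the region server that owns event i.
    route:  function (event_index, region_server) -> bolt that receives the event.
    """
    return len({(route(i, rs), rs) for i, rs in enumerate(events)})


def static_route(event_index, region_server):
    # Spreads events over 3 bolts without looking at region ownership.
    return (event_index // 3) % 3


def aware_route(event_index, region_server):
    # Sends every event to the bolt dedicated to its region server.
    return region_server


# 300 events spread evenly over 3 region servers.
events = [i % 3 for i in range(300)]
static_trips = trips(events, static_route)  # every bolt talks to every server
aware_trips = trips(events, aware_route)    # every bolt talks to one server
```

With these inputs `static_trips` is 9 and `aware_trips` is 3, matching the two diagrams.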
Responsive Shuffling - Benefits
• Topology responds to changes in data patterns and
adapts accordingly
• Maintains a high level of SLA and throughput adherence
• Minimizes the need for maintenance and hence downtime
Responsive Shuffling - Applicability
• Change in shuffle pattern does not impact final outcome
• Data stream has varying skews
• Target/Reference system specifications change over
time
Out-of-Sequence Events
Out-of-Sequence Events - Description
An out-of-sequence event is one that's received late,
sufficiently late that you've already processed events that
should have been processed after the out-of-sequence
event was received.
Out-of-Sequence Events - Challenges
• Hard to determine if all events in a given window have
been received
• Late events need to reference relevant data
• Builds more pressure on processing components
• Increased latency and degraded overall system
performance
Out-of-Sequence Events – Potential Options
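[Comparison table; the per-option ratings were graphical and did not survive extraction]
Criteria: Latency, Result Accuracy, Operational Ease
Options: Drop; Wait; Fan Out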
Out-of-Sequence Events - Processing
[Diagram] Flow:
• Source Spout feeds the Event Filter Bolt
• Event Filter Bolt: monitors the events currently being processed and identifies out-of-sequence events
• In-sequence events go to the Typical Processing Bolt
• Out-of-sequence events go to the Special Handling Bolt; based on the complexity of processing, this can be extended as a different topology
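The routing decision made by the filter step could look roughly like this. It is a sketch under assumptions: events are classified by comparing their timestamp with the highest timestamp seen so far, and `allowed_lateness` is a hypothetical tolerance that the slides do not define.

```python
class EventFilter:
    """Splits a stream into in-sequence and out-of-sequence events by event time."""

    def __init__(self, allowed_lateness=0):
        self._max_seen = None                  # highest event time observed so far
        self._allowed_lateness = allowed_lateness

    def route(self, event_time):
        """Return 'in_sequence' or 'out_of_sequence' for one event timestamp."""
        if (self._max_seen is not None
                and event_time < self._max_seen - self._allowed_lateness):
            return "out_of_sequence"           # goes to the special handling path
        if self._max_seen is None or event_time > self._max_seen:
            self._max_seen = event_time
        return "in_sequence"                   # goes to the typical processing path
```

Keeping the check this cheap matters because every event passes through it, while only the late minority pays for the heavier special handling.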
Out-of-Sequence Events – Benefits
• Separation of concerns
• Maintains the overall throughput and latency
requirements
• Independent scaling of components
Out-of-Sequence Events - Applicability
• When the order of events matters
• Processing out-of-sequence events needs special and
complex logic
• Stream has relatively low volume of out-of-sequence
events
Summary
• A stream application is a continuously running process, as
opposed to a batch process
• Think long term and plan for data patterns that change over time
• Simplicity gives more reliability and predictability
• Use one or more patterns in conjunction to address the
use case
• Patterns are contextual and may not be suitable for every
case
Thank You!
sheetal@hortonworks.com
@sheetal_dolas
Appendix
Data Security in Kafka
Data Security in Kafka - Description
Ability to use Kafka as a secure data transfer mechanism.
Apache Kafka is a widely used messaging platform in
streaming applications. Unfortunately, Kafka does not have
built-in support for Authentication & Authorization (yet).
Data Security in Kafka - Flow
[Diagram] Pipeline stages and their components:
• Source Systems: Sources (Syslog)
• Data Collection: Custom Collector with Encrypting Producer
• Messaging System: Kafka (Encrypted Messages)
• Real Time Processing: Storm (Kafka Spout, Decrypting Bolt, App Bolt)
Data Security in Kafka – Encryption Details
[Diagram] Flow:
• Data Collection (Event Producer): encrypt event(s) w/ AES, encrypt the AES key w/ RSA, and place both in an Event(s) Envelope
• Event(s) Envelope contents: Encrypted AES Key (w/ RSA), Encrypted Event (w/ AES)
• Messaging System (Kafka Topic): carries the Event(s) Envelopes
• Real Time Processing (Storm Decrypting Bolt): decrypt the AES key w/ RSA, then decrypt the event(s) w/ AES
Data Security in Kafka – Encryption Details
• RSA public/private keys are generated ahead of time and
securely shared with the topology
• The AES key is randomly generated and periodically
refreshed
• Only a user holding the appropriate RSA private key can read
the data
• One event or a batch of events can be encrypted
together as per needs
Data Security in Kafka - Applicability
• Multiple applications want to use Kafka as their source to
the stream
• Data is sensitive and cannot be shared between
applications
• Other components in the pipeline are secured
Micro Batching
Micro Batching - Description
Micro-batching is a technique that allows a process or task
to treat a stream as a sequence of small batches or
chunks of data.
For incoming streams, the events can be packaged into
small batches and delivered to a batch system for
processing
Micro Batching - Challenges
• Data delivery reliability
• Unnecessary data duplication
• Increased latency
• Complexity in time-bound batching
Micro Batching – Potential Options
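[Comparison table; the per-option ratings were graphical and did not survive extraction]
Criteria: Simplicity, Reusability, Reliability
Options: Batch Triggering Thread; Controller Stream; Tick Tuples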
Tick Tuples
Tick tuples are system generated tuples that Storm can
send to your bolt if you need to perform some actions at a
fixed interval
Tick Tuple based Micro Batching - Benefits
• Takes advantage of system characteristics by batching
events together
• Adheres to processing latency needs by ensuring that
batches are executed at regular intervals
• Prevents data loss by acknowledging events only after
successful processing
• Simple, elegant and easy to maintain code
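The behaviour listed above can be sketched without Storm itself. This is not the deck's sample code, which did not survive extraction; it is a plain-Python stand-in in which `TICK` plays the role of a tick tuple and `write_batch`, `ack` and `fail` are hypothetical callbacks for the target system and the acknowledgement mechanism.

```python
TICK = object()  # stand-in for a system-generated tick tuple (assumption)


class MicroBatchingBolt:
    """Buffers events and writes them in batches, on a tick or when the batch is full."""

    def __init__(self, write_batch, ack, fail, max_batch_size=100):
        self._write_batch = write_batch  # bulk write to the target system
        self._ack = ack                  # confirm one event as processed
        self._fail = fail                # report one event as failed
        self._max = max_batch_size
        self._pending = []

    def execute(self, tup):
        if tup is not TICK:
            self._pending.append(tup)
            if len(self._pending) < self._max:
                return                   # keep buffering until a tick or a full batch
        self._flush()

    def _flush(self):
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        try:
            self._write_batch(batch)
        except Exception:
            for t in batch:
                self._fail(t)            # nothing is acked, so events can be replayed
        else:
            for t in batch:
                self._ack(t)             # ack only after the write succeeded
```

The tick bounds how long an event can wait in the buffer, while the size limit bounds memory; both knobs would need tuning against the latency target of the use case.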
Micro Batching - Applicability
• Target systems are more efficient with bulk transactions
• Processing a group of events is more efficient than
processing individual events
• End-to-end event latency is not highly sensitive
Micro Batching – Sample Code
Thank You!
sheetal@hortonworks.com
@sheetal_dolas

HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

KĂźrzlich hochgeladen

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vĂĄzquez
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 

KĂźrzlich hochgeladen (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

Design Patterns For Real Time Streaming Data Analytics

  • 1. Design Patterns For Real Time Streaming Data Analytics. 15 Apr 2015. Sheetal Dolas, Principal Architect, Hortonworks. © Hortonworks Inc. 2011 – 2014. All Rights Reserved.
  • 2. Who am I? • Principal Architect @ Hortonworks • Most of my career has been in the field, solving real-life business problems • Last 5+ years in Big Data, including Hadoop, Storm, etc. • Co-developed Cisco OpenSOC (http://opensoc.github.io) • sheetal@hortonworks.com • @sheetal_dolas
  • 3. Agenda • Streaming Architectural Patterns - Overview • Design Patterns o What o Why o Illustrations • Q&A
  • 4. Streaming Architectural Patterns
  • 5. Real Time Streaming Architecture [diagram]. Layers: Source Systems (syslog, machine data, external streams, other); Data Collection (Flume / custom agents A, B ... N); Messaging System (Kafka topics A, B ... N); Real Time Processing (Storm topologies A, B ... N); Storage (search: Elasticsearch / Solr; low-latency NoSQL: HBase; historic: Hive / HDFS); Access (web services / REST API, web apps, analytic tools such as R / Python, BI tools, alerting systems).
  • 6. Lambda Architecture [diagram]. Components: New Data (Data Stream); Batch Layer (All Data, Pre-compute Views); Speed Layer (Stream Processing, Real Time View); Serving Layer (Batch Views); Data Access (Query).
  • 7. Kappa Architecture [diagram]. Components: Data Source (Data Stream); Stream Processing System (Job Version n, Job Version n + 1); Serving DB (Output table n, Output table n + 1); Data Access (Query).
  • 8. Design Patterns
  • 9. Design Pattern – What is it? A general reusable solution to a commonly occurring problem within a given context in software design. Diagram labels: Reusable Solution, Commonly Occurring Problem, Contextual, Software Design.
  • 10. Design Patterns – Why? • Streaming use cases have distinct characteristics o Unpredictable incoming data patterns o Correlating multiple streams o Out-of-sequence and late events
  • 11. Design Patterns – Why? • High scale and continuous streams pose new challenges o Peaks and valleys o Data characteristics that change over time o Maintaining the latency and throughput SLAs
  • 12. Streaming Patterns. Architectural Patterns: Real-time Streaming, Near-real-time Streaming, Lambda Architecture, Kappa Architecture. Functional Patterns: Stream Joins, Top N (Trending), Rolling Windows. Data Management Patterns: External Lookup, Responsive Shuffling, Out-of-Sequence Events. Data Security Patterns: Message Encryption, Authorized Access, Secure Cluster Authentication.
  • 13. Streaming Patterns – Being Discussed. Architectural Patterns: Real-time Streaming, Near-real-time Streaming, Lambda Architecture, Kappa Architecture. Functional Patterns: Stream Joins, Top N (Trending), Rolling Windows. Data Management Patterns: External Lookup, Responsive Shuffling, Out-of-Sequence Events. Data Security Patterns: Message Encryption, Authorized Access, Secure Cluster Authentication.
  • 14. External Lookup: Dynamic, High Speed Enrichments With External Data Lookup
  • 15. External Lookup - Description. Referencing frequently changing external system data for event enrichments, filters or validations while minimizing event processing latency and system bottlenecks and maintaining high throughput.
  • 16. External Lookup - Challenges • Increased latency due to frequent external system calls • Insufficient memory to hold all reference data • Scalability and performance issues with large reference data sets • Reference data needs frequent cache purges and refreshes • External systems can become a bottleneck
  • 17. External Lookup – Potential Options [comparison chart; ratings not recoverable]. Criteria: Performance, Scalability, Fault Tolerance. Options: Always Fetch; Cache Everything; Partition and Cache on the go.
  • 18. External Lookup - A Reference Use Case • Real Time Credit Card Fraud Identification and Alert o Credit card transaction data arrives as a stream (typically through Kafka) o An external system holds information about the card holder’s recent location o Each credit card transaction is looked up against the user’s current location o If the geographic distance between the transaction location and the user’s most recent known location is significant, the transaction is flagged as potential fraud
  • 19. External Lookup - Topology Overview [diagram]. Source stream → Credit Card Transaction Spout → Partitioner Bolt (partitions data based on the area code of the mobile numbers) → Fraud Analyzer Bolt (looks up the user’s current location from the external system and finds the geo distance between the transaction location and the user location; locally caches the user location data, with time-bound cache validity) → Alerting System (fraud alert email). External Reference Data supplies the user location information.
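The distance check described for the Fraud Analyzer Bolt can be sketched as follows. This is only an illustration: the slides do not say how the geo distance is computed, so the haversine formula, the function names, and the 500 km threshold are all assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def is_potential_fraud(txn_location, user_location, threshold_km=500.0):
    """Flag a transaction whose location is far from the user's last known location.

    threshold_km is a hypothetical tuning parameter, not a value from the slides.
    """
    return haversine_km(*txn_location, *user_location) > threshold_km
```

A real deployment would likely also weigh how old the last known location is, since a user can legitimately travel far given enough time.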
  • 20. External Lookup - Peek in the Bolts [diagram]. Partitioner Bolt instances 1 ... n partition the stream by area code. Each Fraud Analyzer Bolt instance keeps a time-sensitive local cache for its own partition (instance 1: CA, NV, TX; instance 2: NY, CT, MA; instance n: FL, NC, OH). Use a lightweight caching solution like Guava.
  • 21. External Lookup - Benefits of the approach • Only required data is cached (on demand) • Each bolt caches only its partition of the reference data • Data is cached locally, so trips to the external system are reduced • Cache is time sensitive • Building the cache on the go handles failures elegantly
  • 22. External Lookup – Applicability • Stream processing depends on external data • External data is too large to be held in the memory of each task • External data keeps changing • External system has scalability limitations
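The "partition and cache on the go" option can be sketched in a few lines. The slides suggest a lightweight cache such as Guava inside a Storm bolt; the version below is a plain-Python stand-in, and `TimeBoundCache`, `partition_for`, and the `loader` callback are hypothetical names introduced for illustration.

```python
import time
import zlib


class TimeBoundCache:
    """Per-task cache that loads entries on demand and expires them after a TTL."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # hypothetical callback that queries the external system
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested without sleeping
        self._entries = {}         # key -> (value, expiry time)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]        # fresh hit: no trip to the external system
        value = self._loader(key)  # miss or stale entry: fetch and re-cache
        self._entries[key] = (value, now + self._ttl)
        return value


def partition_for(area_code, num_tasks):
    """Route an area code to a task index with a hash that is stable across processes."""
    return zlib.crc32(area_code.encode("utf-8")) % num_tasks
```

Because the same area code always lands on the same task, each task's cache only ever holds its own slice of the reference data, and a restarted task simply refills its cache as events arrive. This sketch never evicts expired entries that are not requested again; a production cache would also bound its size.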
  • 23. Responsive Shuffling
  • 24. Responsive Shuffling - Description. Automatically adjust shuffling for better performance and throughput during peaks and varying data skews in streams.
  • 25. Responsive Shuffling - Challenges • Incoming data stream is unpredictable and can be skewed • Skew can change from time to time • Managing latency and throughput with skews is difficult • Since streams flow continuously, restarting the topology with new shuffling logic is not practical
  • 26. Shuffling – Potential Options [comparison chart; ratings not recoverable]. Criteria: Latency & Throughput, System Reliability, Uptime. Options: Static Shuffle; Responsive Shuffle.
  • 27. Responsive Shuffling - A Reference Use Case • Optimized HBase Inserts o Event data is stored in HBase after Storm processing o Group events so that a bolt can insert more events into HBase with fewer trips to region servers o Over time, HBase regions can split or merge o Automatically adjust the event grouping as the HBase region layout changes
  • 28. Example – HBase writes w/o responsive shuffling [diagram]. App Bolt instances 1–3 (100 events each) send 300 events to HBase Bolt instances 1–3 (100 events each), which send 300 events to Region Server instances 1–3 (100 events each, 300 events received). Result: 9 trips to region servers.
  • 29. Responsive Shuffling - Design
  • 30. Example – HBase writes with responsive shuffling [diagram]. App Bolt instances 1–3 (100 events each) send 300 events through RS Aware Partitioners to HBase Bolt instances 1–3 (100 events each), which send 300 events to Region Server instances 1–3 (100 events each, 300 events received). Result: 3 trips to region servers. The partitioner automatically adapts to splitting/merging HBase regions.
  • 31. Responsive Shuffling - Benefits • Topology responds to changes in data patterns and adapts accordingly • Maintains a high level of SLA and throughput adherence • Minimizes the need for maintenance and hence downtime
  • 32. Responsive Shuffling - Applicability • Change in shuffle pattern does not impact the final outcome • Data stream has varying skews • Target/reference system specifications change over time
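A region-server-aware partitioner of the kind shown in the HBase example can be sketched as below. This is an assumption-laden illustration rather than the deck's implementation: `fetch_region_start_keys` is a hypothetical callback standing in for whatever client call lists the current region boundaries, and refreshing every N events is just one possible refresh policy.

```python
import bisect


class RegionAwarePartitioner:
    """Routes row keys so that keys from the same region go to the same task.

    The region layout is re-read periodically, so the routing follows region
    splits and merges without restarting the topology.
    """

    def __init__(self, fetch_region_start_keys, num_tasks, refresh_every=1000):
        self._fetch = fetch_region_start_keys  # hypothetical: returns region start keys
        self._num_tasks = num_tasks
        self._refresh_every = refresh_every    # hypothetical refresh policy
        self._seen = 0
        self._start_keys = sorted(self._fetch())

    def task_for(self, row_key):
        # Periodically reload the region boundaries to pick up splits and merges.
        if self._seen and self._seen % self._refresh_every == 0:
            self._start_keys = sorted(self._fetch())
        self._seen += 1
        # The owning region is the one with the greatest start key <= row_key.
        region = max(bisect.bisect_right(self._start_keys, row_key) - 1, 0)
        return region % self._num_tasks
```

With one task per region, each task's batch targets a single region, which is what brings the example down from 9 trips to 3. As the applicability slide notes, this only works when changing the shuffle pattern does not change the final outcome.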
  • 33. Page34 Š Hortonworks Inc. 2011 – 2014. All Rights Reserved Out-of-Sequence Events
  • 34. Page35 Š Hortonworks Inc. 2011 – 2014. All Rights Reserved Out-of-Sequence Events - Description An out-of-sequence event is one that's received late, sufficiently late that you've already processed events that should have been processed after the out-of-sequence event was received.
  • 35. Out-of-Sequence Events - Challenges • Hard to determine whether all events in a given window have been received • Late events require referencing relevant data • Builds more pressure on processing components • Increased latency and degraded overall system performance
  • 36. Out-of-Sequence Events – Potential Options [Chart comparing the options Drop, Wait and Fan Out on Latency, Result Accuracy and Operational Ease]
  • 37. Out-of-Sequence Events - Processing [Diagram: Source → Spout → Event Filter Bolt, which monitors the events currently being processed and identifies out-of-sequence events. In-sequence events go to the Typical Processing Bolt; out-of-sequence events go to the Special Handling Bolt. Depending on the complexity of processing, special handling can be extended into a separate topology]
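The routing decision made by the Event Filter Bolt can be sketched as follows. The deck does not state how out-of-sequence events are detected, so this sketch assumes each event carries a timestamp and treats an event as out-of-sequence when it is older than the highest timestamp seen so far minus an allowed lateness. The class name `EventFilter` and the parameter `allowed_lateness` are hypothetical.

```python
class EventFilter:
    """Routes events to 'in_sequence' or 'out_of_sequence' by event timestamp."""

    def __init__(self, allowed_lateness=0):
        self.allowed_lateness = allowed_lateness
        self.max_seen = None  # highest event timestamp observed so far

    def route(self, timestamp):
        # Assumed rule: too far behind the newest event means out-of-sequence.
        if self.max_seen is not None and timestamp < self.max_seen - self.allowed_lateness:
            return "out_of_sequence"
        if self.max_seen is None or timestamp > self.max_seen:
            self.max_seen = timestamp
        return "in_sequence"


event_filter = EventFilter(allowed_lateness=5)
routes = [event_filter.route(ts) for ts in (100, 101, 98, 110, 104, 111)]
print(routes)
```

In this run only the event with timestamp 104 is sent to special handling: it arrives after 110 and is more than 5 units behind it, whereas 98 is still within the allowed lateness of 101.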
  • 38. Out-of-Sequence Events – Benefits • Separation of concerns • Maintains the overall throughput and latency requirements • Independent scaling of components
  • 39. Out-of-Sequence Events - Applicability • When the order of events matters • Processing out-of-sequence events needs special and complex logic • Stream has a relatively low volume of out-of-sequence events
  • 40. Summary
  • 41. Summary • A streaming application is a continuously running process, as opposed to a batch process • Think long term and about data patterns changing over time • Simplicity gives more reliability and predictability • Use one or more patterns in conjunction to address the use case • Patterns are contextual and may not be suitable for every case
  • 42. Thank You! sheetal@hortonworks.com @sheetal_dolas
  • 43. Appendix
  • 44. Data Security in Kafka
  • 45. Data Security in Kafka - Description Ability to use Kafka as a secure data transfer mechanism. Apache Kafka is a widely used messaging platform in streaming applications. Unfortunately, Kafka does not have built-in support for authentication and authorization (yet).
  • 46. Data Security in Kafka - Flow [Diagram: Source Systems (Syslog sources) → Data Collection (Custom Collector with Encrypting Producer) → Messaging System (Kafka, encrypted messages) → Real Time Processing (Storm: Kafka Spout → Decrypting Bolt → App Bolt)]
  • 47. Data Security in Kafka – Encryption Details [Diagram: in Data Collection, the Event Producer encrypts the event(s) w/ AES and encrypts the AES key w/ RSA. The resulting Event(s) Envelope holds the encrypted AES key (w/ RSA) and the encrypted event (w/ AES) and passes through the Kafka topic. In Storm, the Decrypting Bolt decrypts the AES key w/ RSA and then decrypts the event(s) w/ AES]
  • 48. Data Security in Kafka – Encryption Details • RSA public/private keys are generated ahead of time and securely shared with the topology • AES key is randomly generated and periodically refreshed • Only a user with the appropriate RSA private key can read the data • One event or a batch of events can be encrypted together as needed
  • 49. Data Security in Kafka - Applicability • Multiple applications want to use Kafka as their source to the stream • Data is sensitive and cannot be shared between applications • Other components in the pipeline are secured
  • 50. Micro Batching
  • 51. Micro Batching - Description Micro-batching is a technique that allows a process or task to treat a stream as a sequence of small batches or chunks of data. For incoming streams, the events can be packaged into small batches and delivered to a batch system for processing.
  • 52. Micro Batching - Challenges • Data delivery reliability • Unnecessary data duplication • Increased latency • Complexity in time-bound batching
  • 53. Micro Batching – Potential Options [Chart comparing the options Batch Triggering Thread, Controller Stream and Tick Tuples on Simplicity, Reusability and Reliability]
  • 54. Tick Tuples Tick tuples are system generated tuples that Storm can send to your bolt if you need to perform some actions at a fixed interval
  • 55. Tick Tuple based Micro Batching - Benefits • Takes advantage of system characteristics by batching events together • Adheres to processing latency needs by ensuring that batches are executed at certain intervals • Prevents data loss by acknowledging events only after successful processing • Simple, elegant and easy to maintain code
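The batching behaviour described in these benefits can be illustrated without Storm itself. The following is a plain-Python simulation under stated assumptions, not the deck's sample code and not the Storm API: `MicroBatchingBolt`, `write_batch`, `ack` and the `is_tick` flag are hypothetical stand-ins for a bolt, its bulk writer, the upstream acknowledgement and a tick tuple. In Storm, the tick interval is configured with the `topology.tick.tuple.freq.secs` setting.

```python
class MicroBatchingBolt:
    """Buffers events and writes them in bulk on a size limit or a tick."""

    def __init__(self, write_batch, ack, max_batch_size=3):
        self.write_batch = write_batch  # bulk write to the target system
        self.ack = ack                  # acknowledges one event upstream
        self.max_batch_size = max_batch_size
        self.pending = []

    def execute(self, event, is_tick=False):
        if is_tick:
            self.flush()  # time-bound: flush whatever has accumulated
            return
        self.pending.append(event)
        if len(self.pending) >= self.max_batch_size:
            self.flush()  # size-bound: flush a full batch

    def flush(self):
        if not self.pending:
            return
        self.write_batch(list(self.pending))
        # Ack only after the bulk write returned; if write_batch raises,
        # no event in the batch is acknowledged.
        for event in self.pending:
            self.ack(event)
        self.pending = []


written, acked = [], []
bolt = MicroBatchingBolt(written.append, acked.append, max_batch_size=3)
for event in ("e1", "e2", "e3", "e4"):
    bolt.execute(event)
bolt.execute(None, is_tick=True)
print(written)  # [['e1', 'e2', 'e3'], ['e4']]
print(acked)    # ['e1', 'e2', 'e3', 'e4']
```

The size limit bounds memory use and the tick bounds latency: the fourth event would otherwise wait indefinitely for a batch that never fills.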
  • 56. Micro Batching - Applicability • Target systems are more efficient with bulk transactions • Processing a group of events is more efficient than processing individual events • End-to-end event latency is not highly sensitive
  • 57. Micro Batching – Sample Code

Editor's Notes

  1. As businesses realize the power of Hadoop and large-scale data analytics, many are demanding large-scale real-time streaming data analytics. Apache Storm and Apache Spark are platforms that can process large amounts of data in real time. However, building applications on these platforms that can scale, reliably process data without any loss, satisfy functional needs and at the same time meet strict latency requirements takes a lot of work to get right. After implementing multiple large real-time data processing applications using these technologies in various business domains, we distilled commonly required solutions into generalized design patterns. These patterns are proven in very large production deployments, where they process millions of events per second, tens of billions of events per day and tens of terabytes of data per day.
  2. All data is dispatched to both the batch layer and the speed layer. The batch layer has two functions: (i) to manage the master dataset and (ii) to pre-compute the batch views. The serving layer indexes the batch views for low-latency, ad-hoc queries. The speed layer compensates for the high latency and deals with recent data only. Any incoming query can be answered by merging batch and real-time views.
  3. Use Kafka or some other system that will let you retain the full log of the data you want to be able to reprocess and that allows for multiple subscribers. For example, if you want to reprocess up to 30 days of data, set your retention in Kafka to 30 days. When you want to do the reprocessing, start a second instance of your stream processing job that starts processing from the beginning of the retained data, but direct this output data to a new output table. When the second job has caught up, switch the application to read from the new table. Stop the old version of the job, and delete the old output table.
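The reprocessing steps in this note can be walked through with an in-memory stand-in. This is only a sketch of the sequence of steps: the list `retained_log` stands in for the data retained in Kafka, the dictionary `tables` for the serving database, and `run_job` and the two lambda transforms for the old and new job versions; all of these names are hypothetical.

```python
retained_log = [3, 5, 8, 13]  # stands in for the events retained in Kafka


def run_job(transform, log, from_offset=0):
    """Process the log from from_offset and return a fresh output table."""
    return {offset: transform(log[offset]) for offset in range(from_offset, len(log))}


tables = {"output_v1": run_job(lambda x: x * 2, retained_log)}
serving = "output_v1"  # table the application currently reads

# New job version: reprocess from the beginning into a new output table.
tables["output_v2"] = run_job(lambda x: x * 2 + 1, retained_log, from_offset=0)

# Once the new job has caught up with the log, switch readers and drop the old table.
if len(tables["output_v2"]) == len(retained_log):
    old, serving = serving, "output_v2"
    del tables[old]

print(serving, tables[serving])
```

Because the old table stays in place until the new job has caught up, readers never see a partially rebuilt view.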
  4. A design pattern is not a finished design that can be transformed directly into source or machine code. It is a description or template for how to solve a problem. Patterns are formalized best practices.
  5. Credit card transaction data arrives as a stream (typically through Kafka). An external system has information about the credit card holder's recent location (collected from GPS on a mobile device and/or from mobile towers). Each credit card transaction is looked up against the user's current location. If the geographic distance between the transaction location and the user's most recent known location is significant (say, 100 miles), the transaction is flagged as potential fraud.
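The distance check in this example can be sketched with the haversine great-circle formula. The notes do not say how distance is computed, so the formula choice, the helper names `distance_miles` and `is_potential_fraud`, and the sample coordinates are assumptions made for illustration; the 100-mile threshold is the one mentioned in the note.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def is_potential_fraud(transaction_location, recent_user_location, threshold_miles=100):
    """Flag a transaction that is farther than the threshold from the user's location."""
    return distance_miles(*transaction_location, *recent_user_location) > threshold_miles


san_francisco = (37.7749, -122.4194)
san_jose = (37.3382, -121.8863)
new_york = (40.7128, -74.0060)
print(is_potential_fraud(san_francisco, san_jose))  # False: well under 100 miles apart
print(is_potential_fraud(new_york, san_jose))       # True: thousands of miles apart
```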
  6. Only required data is cached (on demand), which reduces cache size requirements. Each bolt caches only a partition of the reference data, so there is no duplicate caching, cache size requirements are reduced, and more data can be processed with the same available RAM. Data is cached locally, so trips to the external system are reduced; this lowers latency, increases system throughput and reduces load on the external system. The cache is time sensitive, which provides the ability to refresh it at certain intervals for dynamic reference data. On-the-go cache building handles failures elegantly: the cache is rebuilt automatically as events are reprocessed, and no additional handling is needed. Data patterns are also more predictable, so the cache can be pre-built on component start.
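An on-demand, time-sensitive cache of the kind described in this note could look like the sketch below. It is an illustration under assumptions rather than the deck's code: `ExpiringCache`, `load_from_external_system` and the injected `clock` are hypothetical names, and the fake clock exists only so that expiry can be shown without waiting.

```python
import time


class ExpiringCache:
    """Loads reference data on demand and reloads entries older than ttl_seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self.loader = loader            # fetches one key from the external system
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries = {}               # key -> (value, time loaded)

    def get(self, key):
        now = self.clock()
        cached = self.entries.get(key)
        if cached is not None and now - cached[1] < self.ttl_seconds:
            return cached[0]
        value = self.loader(key)        # cache miss or expired entry
        self.entries[key] = (value, now)
        return value


lookups = []


def load_from_external_system(key):
    lookups.append(key)
    return "location-of-" + key


fake_time = [0.0]
cache = ExpiringCache(load_from_external_system, ttl_seconds=60, clock=lambda: fake_time[0])
cache.get("card-1")
cache.get("card-1")   # served from the local cache
fake_time[0] = 61.0
cache.get("card-1")   # entry expired, reloaded from the external system
print(lookups)        # ['card-1', 'card-1']
```

Because entries are loaded only when first requested, the cache also rebuilds itself after a restart as events are replayed. Routing events for the same key to the same bolt instance, for example with a Storm fields grouping, is what keeps each instance's cache limited to its own partition of the reference data.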
  7. Out-of-sequence events can arrive very late, and processing them requires referencing relevant data. In streaming applications, it is hard to determine whether all events in a given window have been received. Out-of-sequence events can arrive so late that they put more pressure on processing components, which need to wait longer and do additional processing for very old events. The complexity can increase the latency of processed events and degrade overall system performance.
  8. Separation of concerns: separate the processing responsibilities between typical event processing and exceptional event processing. Typical processing components and special handling components can be scaled independently (parallelism, memory needs, latency needs).
  9. When the order of events matters: the input stream may have out-of-sequence events that need to be processed appropriately.