SlideShare ist ein Scribd-Unternehmen logo
1 von 70
LAMBDA ARCHITECTURES IN
PRACTICE
KAFKA · HADOOP · STORM · DRUID
GIAN MERLINO
DRUID COMMITTER · SOFTWARE ENGINEER @ METAMARKETS
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/lambda-arch-case-study
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
PROBLEM STREAMING DATA PIPELINES
CHALLENGES NO DATA LEFT BEHIND
INFRASTRUCTURE THE “RAD”-STACK
PLATFORM DEVELOPMENT AND OPERATIONS
TOOLS
OVERVIEW
THE PROBLEM
THE PROBLEM
THE PROBLEM
THE PROBLEM
‣ Business intelligence for ad-tech
‣ Arbitrary and interactive exploration
‣ Multi-tenancy: thousands of concurrent users
‣ Recency: explore current data, alert on major changes
‣ Efficiency: each event is individually very low-value
‣ Data model: must join impressions with clicks
2013
FINDING A SOLUTION
‣ Load all your data into Hadoop. Query it. Done!
‣ Good job guys, let’s go home
2013
HADOOP BENEFITS AND
DRAWBACKS
‣ HDFS is a scalable, reliable storage technology
‣ MapReduce is great for data-parallel computation at scale
‣ But Hadoop MapReduce is not optimized for low latency
‣ To optimize queries, we need a query layer
‣ To load data quickly, we need streaming ingestion
2013
FINDING A SOLUTION
Query Layer
Hadoop
EventStreams
Insight
Streaming data pipeline
2013
FINDING A SOLUTION
Streaming data pipeline RDBMS
Hadoop
EventStreams
Insight
2013
FINDING A SOLUTION
Streaming data pipeline
NoSQL K/V
Stores
Hadoop
EventStreams
Insight
2013
FINDING A SOLUTION
Streaming data pipeline
Commercial
Databases
Hadoop
EventStreams
Insight
2013
FINDING A SOLUTION
Streaming data pipeline
Hadoop
EventStreams
Insight
DRUID
‣ Druid project started in 2011, open sourced in Oct. 2012
‣ Designed for low latency ingestion and slice-and-dice
aggregation
‣ Growing Community
• ~45 contributors
• Used in production at numerous large and small organizations
‣ Cluster vitals
‣ 10+ trillion events, 200TB of compressed queryable data
‣ Ingesting over 450,000 events/sec on average
‣ 90th/95th/99th percentile queries within 1s/2s/10s
STREAMING DATA PIPELINES
TIME SERIES DATA
‣ Unifying feature: some notion of “event timestamp”
‣ Questions are typically also time-oriented
‣ Monitoring: Plot CPU usage over the past 3 days, in 5-min
buckets
‣ BI: Which accounts brought in the most revenue this week?
‣ Web analytics: How many unique users today? By OS? By
page?
GOALS
‣ Low-latency results
‣ Strong guarantees for historical data
DATA PIPELINE
DATA PIPELINE
‣ Data bus
‣ Decouples data acquisition from
processing
‣ Can buffer as many unprocessed
messages as you have disk
DATA PIPELINE
‣ Stream processor
‣ Join impressions/clicks
‣ Transform data
DATA PIPELINE
DATA PIPELINE
‣ Store hour-partitioned in S3
‣ Can run Hadoop jobs on it
DATA PIPELINE
λ
DEFINITION
‣ Hybrid batch/streaming data pipeline
‣ Batch technologies
• Hadoop MapReduce
• Spark
‣ Streaming technologies
• Storm
• Spark Streaming
• Samza
WHY HYBRID?
‣ This sounds insane
‣ Need to develop for both
systems
‣ Need to operate both systems
‣ Nobody really wants to do this
WHY HYBRID?
‣ We want low-latency results
‣ We also want strong guarantees for historical data
‣ Many popular streaming systems are still immature
‣ The state of things will improve in the future
‣ …but that doesn’t help you right now
FAULTS
Processor Processor Processor
Data
Store
FAULTS
Processor Processor Processor
Data
Store
FAULTS
Processor Processor Processor
FAULTS
Processor Processor Processor
FAULTS
Processor Processor Processor
FAULTS
Processor Processor :(
FAULTS
Processor Processor Processor
FAULTS
Processor Processor Processor
FAULTS
Processor Processor Processor
FAULTS
Processor Processor Processor
COPING WITH FAULTS
‣ “Exactly once” semantics
‣ Transactions
‣ Idempotency
‣ Adding an element to a set
‣ Some kinds of sketches (HyperLogLog)
‣ Doesn’t work well for counters
LATE DATA
‣ Timeline-oriented operations are common in stream processing
‣ Windowed aggregates
‣ top pages per hour
‣ unique users per hour
‣ request counts per minute
‣ Group-by-key of related events
‣ user session analysis
‣ impression/click association
LATE DATA
‣ Similar challenges with both
‣ When can we be sure we have all the data?
‣ Normally, within a minute
‣ Or if a server is slow, a few minutes…
‣ Or if a server is down, a few hours or days…
‣ We don’t want to compromise between data quality and latency
REPROCESSING
‣ Something is broken!
‣ Your data was revised
‣ Or your code had a bug
WHY HYBRID?
Late Data
Reprocessin
g
Transactions
* Streaming data store, not actually a stream processo
*
WHY HYBRID?
Globally ordered
micro-batches
Internal– yes;
external– no
Work in
progress
Late Data
Reprocessin
g
Batch– yes;
Streaming– no
Transactions
* Streaming data store, not actually a stream processo
*
WHY HYBRID?
Globally ordered
micro-batches
Internal– yes;
external– no
Work in
progress
Late Data
Reprocessin
g
Depends on
user code
Windows based
on received
time, not actual
time
Depends on
user code
Batch– yes;
Streaming– no
Batch–
unlimited;
Streaming–
windowPeriod
Transactions
* Streaming data store, not actually a stream processo
*
WHY HYBRID?
Globally ordered
micro-batches
Internal– yes;
external– no
Work in
progress
Transactions Late Data
Reprocessin
g
Rewind with
fresh state
Can use non-
streaming Spark
Rewind with
fresh state
Depends on
user code
Windows based
on received
time, not actual
time
Depends on
user code
Batch– yes;
Streaming– no
Can use batch
ingestion
Batch–
unlimited;
Streaming–
windowPeriod* Streaming data store, not actually a stream processo
*
WHY NOT HYBRID?
‣ Batch-only?
‣ If ingestion latencies are good enough, that’s great!
‣ Streaming-only?
‣ OK if you have transactions and a way to deal with late
data
‣ Current tools do require you to be careful
HYBRID PIPELINE
DEVELOPMENT
DEVELOPMENT
‣ Need code to run on two very different systems
‣ Maintaining two codebases is perilous
‣ Productivity loss
‣ Code drift
‣ Difficulty training new developers
PROGRAMMING MODEL
‣ “Query language,” if you prefer
‣ Write once, run… at least twice
‣ Open-source options
‣ Spark + Spark streaming (if you’re in the Spark ecosystem)
‣ Summingbird (key/value aggregation oriented)
‣ Or develop in-house for your use cases
‣ This investment can make sense if you have more pipeline
developers than infrastructure developers
PROGRAMMING MODEL
‣ We built “Starfire,” a Scala library for stream transformation
‣ Built around operators
‣ map, flatMap, filter
‣ groupBy, join
‣ lookup
‣ sample
‣ union
PROGRAMMING MODEL
‣ Load two data streams
‣ Join streams on shared key
‣ Produce combined records
‣ Export data
PROGRAMMING MODEL
Save
FlatMap
Cogrou
p
Load Load
EXECUTION
‣ User code generates a system-agnostic computation
graph
‣ Drivers optimize and compile the graph for each system
‣ Current drivers:
‣ Local (for unit tests)
‣ Hadoop (using Cascading)
‣ Storm
‣ Samza (beta)
BENEFITS
‣ Hedge your bets around infrastructure
‣ New drivers can run all existing user code
‣ Separate system optimization from code development
‣ Improved grouping engine without any user code changes
‣ We have many more pipeline developers than infrastructure
developers
‣ Developer productivity increased
‣ We use Starfire even for batch-only data pipelines
HYBRID PIPELINE OPERATIONS
OPERATIONS
‣ Need to operate two very different systems
‣ Must maintain “no data left behind”
‣ Cornerstones of operations
‣ Tools
‣ Metrics
‣ Alerts
TOOLS
‣ Hadoop job service
‣ Watch S3 for new data from Kafka
‣ Automatically process data after a few hours
‣ Retry failed jobs, alerts on repeated failures
TOOLS
‣ Storm job service
‣ Submit jobs to Nimbus, the Storm scheduler
‣ Monitor for failed jobs
‣ Alert on deployment failures
TOOLS
‣ Backfill tool
‣ Run backfills for particular intervals
‣ Can be used for data changes or algorithm changes
‣ Ensures alignment on Druid segment granularity
TOOLS
‣ Configuration management tool
‣ Control which version of the code should be running
‣ Manage schemas for each pipeline
‣ Manage quotas for each pipeline
‣ Central audit logs for configuration changes
STREAM METRICS
STREAM METRICS
DO TRY THIS AT HOME
2013
CORNERSTONES
‣ Druid - druid.io - @druidio
‣ Storm - storm.incubator.apache.org - @stormprocessor
‣ Hadoop - hadoop.apache.org
‣ Kafka - kafka.apache.org - @apachekafka
GLUE
storm-kafka Tranquility
Camus / Secor Druid Hadoop indexer
TAKE AWAYS
‣ Lambda architectures may not be with us forever
‣ But they solve a real problem: eventually consistent data
pipelines using popular open-source technologies
‣ Complexity can be managed with tools and practices
THANK YOU
@DRUIDIO
@METAMARK
ETS
Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/lambda-
arch-case-study

Weitere ähnliche Inhalte

Was ist angesagt?

Data Analytics with Druid
Data Analytics with DruidData Analytics with Druid
Data Analytics with DruidYousun Jeong
 
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...StampedeCon
 
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and SupersetInteractive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and SupersetHortonworks
 
A real-time architecture using Hadoop and Storm @ JAX London
A real-time architecture using Hadoop and Storm @ JAX LondonA real-time architecture using Hadoop and Storm @ JAX London
A real-time architecture using Hadoop and Storm @ JAX LondonNathan Bijnens
 
A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne...
A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne...A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne...
A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne...Nathan Bijnens
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaDataWorks Summit
 
Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ IndixRajesh Muppalla
 
Azure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreAzure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreDataStax Academy
 
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Tugdual Grall
 
Virdata: lessons learned from the Internet of Things and M2M Cloud Services @...
Virdata: lessons learned from the Internet of Things and M2M Cloud Services @...Virdata: lessons learned from the Internet of Things and M2M Cloud Services @...
Virdata: lessons learned from the Internet of Things and M2M Cloud Services @...Nathan Bijnens
 
Real-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and DruidReal-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and DruidJan Graßegger
 
Implementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache SparkImplementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache SparkDataWorks Summit
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidTony Ng
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesDataWorks Summit
 
The Last Pickle: Distributed Tracing from Application to Database
The Last Pickle: Distributed Tracing from Application to DatabaseThe Last Pickle: Distributed Tracing from Application to Database
The Last Pickle: Distributed Tracing from Application to DatabaseDataStax Academy
 
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...StampedeCon
 
Lessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at HuluLessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at HuluDataWorks Summit
 
Data Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and FrameworksData Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and FrameworksMatthias Niehoff
 

Was ist angesagt? (20)

Data Analytics with Druid
Data Analytics with DruidData Analytics with Druid
Data Analytics with Druid
 
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...
 
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and SupersetInteractive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
 
A real-time architecture using Hadoop and Storm @ JAX London
A real-time architecture using Hadoop and Storm @ JAX LondonA real-time architecture using Hadoop and Storm @ JAX London
A real-time architecture using Hadoop and Storm @ JAX London
 
A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne...
A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne...A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne...
A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne...
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
 
Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ Indix
 
Azure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreAzure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User Store
 
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
 
Virdata: lessons learned from the Internet of Things and M2M Cloud Services @...
Virdata: lessons learned from the Internet of Things and M2M Cloud Services @...Virdata: lessons learned from the Internet of Things and M2M Cloud Services @...
Virdata: lessons learned from the Internet of Things and M2M Cloud Services @...
 
Real-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and DruidReal-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and Druid
 
Implementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache SparkImplementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache Spark
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
 
Druid
DruidDruid
Druid
 
The Last Pickle: Distributed Tracing from Application to Database
The Last Pickle: Distributed Tracing from Application to DatabaseThe Last Pickle: Distributed Tracing from Application to Database
The Last Pickle: Distributed Tracing from Application to Database
 
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
 
Lessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at HuluLessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at Hulu
 
Data Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and FrameworksData Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and Frameworks
 

Andere mochten auch

Druid at Hadoop Ecosystem
Druid at Hadoop EcosystemDruid at Hadoop Ecosystem
Druid at Hadoop EcosystemSlim Bouguerra
 
Url Shortening Services
Url Shortening ServicesUrl Shortening Services
Url Shortening ServicesAltan Khendup
 
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...Altan Khendup
 
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015NoSQLmatters
 
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, HadoopMonitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, HadoopSenthil Pandurangan
 
July 2014 HUG : Pushing the limits of Realtime Analytics using Druid
July 2014 HUG : Pushing the limits of Realtime Analytics using DruidJuly 2014 HUG : Pushing the limits of Realtime Analytics using Druid
July 2014 HUG : Pushing the limits of Realtime Analytics using DruidYahoo Developer Network
 
Apache Spark Use case for Education Industry
Apache Spark Use case for Education IndustryApache Spark Use case for Education Industry
Apache Spark Use case for Education IndustryVinayak Agrawal
 
Using druid for interactive count distinct queries at scale @ nmc
Using druid  for interactive count distinct queries at scale @ nmcUsing druid  for interactive count distinct queries at scale @ nmc
Using druid for interactive count distinct queries at scale @ nmcIdo Shilon
 
Cancer Outlier Pro file Analysis using Apache Spark
Cancer Outlier Profile Analysis using Apache SparkCancer Outlier Profile Analysis using Apache Spark
Cancer Outlier Pro file Analysis using Apache SparkMahmoud Parsian
 
Helio, a Continues Real-Time Fraud Detection and Monitoring Solution
Helio, a Continues Real-Time Fraud Detection and Monitoring SolutionHelio, a Continues Real-Time Fraud Detection and Monitoring Solution
Helio, a Continues Real-Time Fraud Detection and Monitoring SolutionAmir Sedighi
 
Strata lightening-talk
Strata lightening-talkStrata lightening-talk
Strata lightening-talkDanny Yuan
 
How Totango uses Apache Spark
How Totango uses Apache SparkHow Totango uses Apache Spark
How Totango uses Apache SparkOren Raboy
 
Getting Apache Spark Customers to Production
Getting Apache Spark Customers to ProductionGetting Apache Spark Customers to Production
Getting Apache Spark Customers to ProductionCloudera, Inc.
 
Interactive analytics at scale with druid
Interactive analytics at scale with druidInteractive analytics at scale with druid
Interactive analytics at scale with druidJulien Lavigne du Cadet
 
Apache Kylin Streaming
Apache Kylin Streaming Apache Kylin Streaming
Apache Kylin Streaming hongbin ma
 
Kodu Game Lab e Project Spark
Kodu Game Lab e Project SparkKodu Game Lab e Project Spark
Kodu Game Lab e Project SparkFabrício Catae
 
Programmatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidProgrammatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidCharles Allen
 
Lessons Learned on How to Secure Petabytes of Data
Lessons Learned on How to Secure Petabytes of DataLessons Learned on How to Secure Petabytes of Data
Lessons Learned on How to Secure Petabytes of DataDataWorks Summit
 
Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015Seshu Adunuthula
 

Andere mochten auch (20)

Druid at Hadoop Ecosystem
Druid at Hadoop EcosystemDruid at Hadoop Ecosystem
Druid at Hadoop Ecosystem
 
Scalable Real-time analytics using Druid
Scalable Real-time analytics using DruidScalable Real-time analytics using Druid
Scalable Real-time analytics using Druid
 
Url Shortening Services
Url Shortening ServicesUrl Shortening Services
Url Shortening Services
 
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
 
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
 
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, HadoopMonitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, Hadoop
 
July 2014 HUG : Pushing the limits of Realtime Analytics using Druid
July 2014 HUG : Pushing the limits of Realtime Analytics using DruidJuly 2014 HUG : Pushing the limits of Realtime Analytics using Druid
July 2014 HUG : Pushing the limits of Realtime Analytics using Druid
 
Apache Spark Use case for Education Industry
Apache Spark Use case for Education IndustryApache Spark Use case for Education Industry
Apache Spark Use case for Education Industry
 
Using druid for interactive count distinct queries at scale @ nmc
Using druid  for interactive count distinct queries at scale @ nmcUsing druid  for interactive count distinct queries at scale @ nmc
Using druid for interactive count distinct queries at scale @ nmc
 
Cancer Outlier Pro file Analysis using Apache Spark
Cancer Outlier Profile Analysis using Apache SparkCancer Outlier Profile Analysis using Apache Spark
Cancer Outlier Pro file Analysis using Apache Spark
 
Helio, a Continues Real-Time Fraud Detection and Monitoring Solution
Helio, a Continues Real-Time Fraud Detection and Monitoring SolutionHelio, a Continues Real-Time Fraud Detection and Monitoring Solution
Helio, a Continues Real-Time Fraud Detection and Monitoring Solution
 
Strata lightening-talk
Strata lightening-talkStrata lightening-talk
Strata lightening-talk
 
How Totango uses Apache Spark
How Totango uses Apache SparkHow Totango uses Apache Spark
How Totango uses Apache Spark
 
Getting Apache Spark Customers to Production
Getting Apache Spark Customers to ProductionGetting Apache Spark Customers to Production
Getting Apache Spark Customers to Production
 
Interactive analytics at scale with druid
Interactive analytics at scale with druidInteractive analytics at scale with druid
Interactive analytics at scale with druid
 
Apache Kylin Streaming
Apache Kylin Streaming Apache Kylin Streaming
Apache Kylin Streaming
 
Kodu Game Lab e Project Spark
Kodu Game Lab e Project SparkKodu Game Lab e Project Spark
Kodu Game Lab e Project Spark
 
Programmatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidProgrammatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & Druid
 
Lessons Learned on How to Secure Petabytes of Data
Lessons Learned on How to Secure Petabytes of DataLessons Learned on How to Secure Petabytes of Data
Lessons Learned on How to Secure Petabytes of Data
 
Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015
 

Ähnlich wie Lambda Architectures in Practice

Improving Clinical Data Accuracy: How to Streamline a Data Pipeline Using Nod...
Improving Clinical Data Accuracy: How to Streamline a Data Pipeline Using Nod...Improving Clinical Data Accuracy: How to Streamline a Data Pipeline Using Nod...
Improving Clinical Data Accuracy: How to Streamline a Data Pipeline Using Nod...InfluxData
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)Spark Summit
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogC4Media
 
Overview and Walkthrough of the Application Programming Model with SAP Cloud ...
Overview and Walkthrough of the Application Programming Model with SAP Cloud ...Overview and Walkthrough of the Application Programming Model with SAP Cloud ...
Overview and Walkthrough of the Application Programming Model with SAP Cloud ...SAP Cloud Platform
 
InterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloud
InterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloudInterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloud
InterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloudiMasters
 
GraphQL vs. (the) REST
GraphQL vs. (the) RESTGraphQL vs. (the) REST
GraphQL vs. (the) RESTcoliquio GmbH
 
SaaS - Software as a Service - Charles University - Prague - March 2013
SaaS - Software as a Service - Charles University - Prague - March 2013SaaS - Software as a Service - Charles University - Prague - March 2013
SaaS - Software as a Service - Charles University - Prague - March 2013Jaroslav Gergic
 
Partner Connect APAC - 2022 - April
Partner Connect APAC - 2022 - AprilPartner Connect APAC - 2022 - April
Partner Connect APAC - 2022 - Aprilconfluent
 
Monitor OpenStack Environments from the bottom up and front to back
Monitor OpenStack Environments from the bottom up and front to backMonitor OpenStack Environments from the bottom up and front to back
Monitor OpenStack Environments from the bottom up and front to backIcinga
 
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...ScyllaDB
 
Cloud Big Data Architectures
Cloud Big Data ArchitecturesCloud Big Data Architectures
Cloud Big Data ArchitecturesLynn Langit
 
Elasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log ProcessingElasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log ProcessingCascading
 
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019Henning Jacobs
 
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data PipelinesETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelinesconfluent
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkKostas Tzoumas
 
Data Engineer's Lunch #68: DevOps Fundamentals
Data Engineer's Lunch #68: DevOps FundamentalsData Engineer's Lunch #68: DevOps Fundamentals
Data Engineer's Lunch #68: DevOps FundamentalsAnant Corporation
 
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriThinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriDemi Ben-Ari
 

Ähnlich wie Lambda Architectures in Practice (20)

Improving Clinical Data Accuracy: How to Streamline a Data Pipeline Using Nod...
Improving Clinical Data Accuracy: How to Streamline a Data Pipeline Using Nod...Improving Clinical Data Accuracy: How to Streamline a Data Pipeline Using Nod...
Improving Clinical Data Accuracy: How to Streamline a Data Pipeline Using Nod...
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
Overview and Walkthrough of the Application Programming Model with SAP Cloud ...
Overview and Walkthrough of the Application Programming Model with SAP Cloud ...Overview and Walkthrough of the Application Programming Model with SAP Cloud ...
Overview and Walkthrough of the Application Programming Model with SAP Cloud ...
 
InterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloud
InterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloudInterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloud
InterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloud
 
GraphQL vs. (the) REST
GraphQL vs. (the) RESTGraphQL vs. (the) REST
GraphQL vs. (the) REST
 
SaaS - Software as a Service - Charles University - Prague - March 2013
SaaS - Software as a Service - Charles University - Prague - March 2013SaaS - Software as a Service - Charles University - Prague - March 2013
SaaS - Software as a Service - Charles University - Prague - March 2013
 
Big Data Ready Enterprise
Big Data Ready Enterprise Big Data Ready Enterprise
Big Data Ready Enterprise
 
Geode Meetup Apachecon
Geode Meetup ApacheconGeode Meetup Apachecon
Geode Meetup Apachecon
 
Partner Connect APAC - 2022 - April
Partner Connect APAC - 2022 - AprilPartner Connect APAC - 2022 - April
Partner Connect APAC - 2022 - April
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
Monitor OpenStack Environments from the bottom up and front to back
Monitor OpenStack Environments from the bottom up and front to backMonitor OpenStack Environments from the bottom up and front to back
Monitor OpenStack Environments from the bottom up and front to back
 
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
 
Cloud Big Data Architectures
Cloud Big Data ArchitecturesCloud Big Data Architectures
Cloud Big Data Architectures
 
Elasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log ProcessingElasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log Processing
 
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019
 
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data PipelinesETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
Data Engineer's Lunch #68: DevOps Fundamentals
Data Engineer's Lunch #68: DevOps FundamentalsData Engineer's Lunch #68: DevOps Fundamentals
Data Engineer's Lunch #68: DevOps Fundamentals
 
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriThinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
 

Mehr von C4Media

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoC4Media
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileC4Media
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020C4Media
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsC4Media
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No KeeperC4Media
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like OwnersC4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 

Mehr von C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

Kürzlich hochgeladen

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 

Kürzlich hochgeladen (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 

Lambda Architectures in Practice

  • 1. LAMBDA ARCHITECTURES IN PRACTICE KAFKA · HADOOP · STORM · DRUID GIAN MERLINO DRUID COMMITTER · SOFTWARE ENGINEER @ METAMARKETS
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /lambda-arch-case-study
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  • 4. PROBLEM STREAMING DATA PIPELINES CHALLENGES NO DATA LEFT BEHIND INFRASTRUCTURE THE “RAD”-STACK PLATFORM DEVELOPMENT AND OPERATIONS TOOLS OVERVIEW
  • 8. THE PROBLEM ‣ Business intelligence for ad-tech ‣ Arbitrary and interactive exploration ‣ Multi-tenancy: thousands of concurrent users ‣ Recency: explore current data, alert on major changes ‣ Efficiency: each event is individually very low-value ‣ Data model: must join impressions with clicks
  • 9. 2013 FINDING A SOLUTION ‣ Load all your data into Hadoop. Query it. Done! ‣ Good job guys, let’s go home
  • 10. 2013 HADOOP BENEFITS AND DRAWBACKS ‣ HDFS is a scalable, reliable storage technology ‣ MapReduce is great for data-parallel computation at scale ‣ But Hadoop MapReduce is not optimized for low latency ‣ To optimize queries, we need a query layer ‣ To load data quickly, we need streaming ingestion
  • 11. 2013 FINDING A SOLUTION Query Layer Hadoop EventStreams Insight Streaming data pipeline
  • 12. 2013 FINDING A SOLUTION Streaming data pipeline RDBMS Hadoop EventStreams Insight
  • 13. 2013 FINDING A SOLUTION Streaming data pipeline NoSQL K/V Stores Hadoop EventStreams Insight
  • 14. 2013 FINDING A SOLUTION Streaming data pipeline Commercial Databases Hadoop EventStreams Insight
  • 15. 2013 FINDING A SOLUTION Streaming data pipeline Hadoop EventStreams Insight
  • 16. DRUID ‣ Druid project started in 2011, open sourced in Oct. 2012 ‣ Designed for low latency ingestion and slice-and-dice aggregation ‣ Growing Community • ~45 contributors • Used in production at numerous large and small organizations ‣ Cluster vitals ‣ 10+ trillion events, 200TB of compressed queryable data ‣ Ingesting over 450,000 events/sec on average ‣ 90th/95th/99th percentile queries within 1s/2s/10s
  • 18. TIME SERIES DATA ‣ Unifying feature: some notion of “event timestamp” ‣ Questions are typically also time-oriented ‣ Monitoring: Plot CPU usage over the past 3 days, in 5-min buckets ‣ BI: Which accounts brought in the most revenue this week? ‣ Web analytics: How many unique users today? By OS? By page?
  • 19. GOALS ‣ Low-latency results ‣ Strong guarantees for historical data
  • 21. DATA PIPELINE ‣ Data bus ‣ Decouples data acquisition from processing ‣ Can buffer as many unprocessed messages as you have disk
  • 22. DATA PIPELINE ‣ Stream processor ‣ Join impressions/clicks ‣ Transform data
  • 24. DATA PIPELINE ‣ Store hour-partitioned in S3 ‣ Can run Hadoop jobs on it
  • 26. λ
  • 27. DEFINITION ‣ Hybrid batch/streaming data pipeline ‣ Batch technologies • Hadoop MapReduce • Spark ‣ Streaming technologies • Storm • Spark Streaming • Samza
  • 28. WHY HYBRID? ‣ This sounds insane ‣ Need to develop for both systems ‣ Need to operate both systems ‣ Nobody really wants to do this
  • 29. WHY HYBRID? ‣ We want low-latency results ‣ We also want strong guarantees for historical data ‣ Many popular streaming systems are still immature ‣ The state of things will improve in the future ‣ …but that doesn’t help you right now
  • 40. COPING WITH FAULTS ‣ “Exactly once” semantics ‣ Transactions ‣ Idempotency ‣ Adding an element to a set ‣ Some kinds of sketches (HyperLogLog) ‣ Doesn’t work well for counters
  • 41. LATE DATA ‣ Timeline-oriented operations are common in stream processing ‣ Windowed aggregates ‣ top pages per hour ‣ unique users per hour ‣ request counts per minute ‣ Group-by-key of related events ‣ user session analysis ‣ impression/click association
  • 42. LATE DATA ‣ Similar challenges with both ‣ When can we be sure we have all the data? ‣ Normally, within a minute ‣ Or if a server is slow, a few minutes… ‣ Or if a server is down, a few hours or days… ‣ We don’t want to compromise between data quality and latency
  • 43. REPROCESSING ‣ Something is broken! ‣ Your data was revised ‣ Or your code had a bug
  • 44. WHY HYBRID? Late Data Reprocessin g Transactions * Streaming data store, not actually a stream processo *
  • 45. WHY HYBRID? Globally ordered micro-batches Internal– yes; external– no Work in progress Late Data Reprocessin g Batch– yes; Streaming– no Transactions * Streaming data store, not actually a stream processo *
  • 46. WHY HYBRID? Globally ordered micro-batches Internal– yes; external– no Work in progress Late Data Reprocessin g Depends on user code Windows based on received time, not actual time Depends on user code Batch– yes; Streaming– no Batch– unlimited; Streaming– windowPeriod Transactions * Streaming data store, not actually a stream processo *
  • 47. WHY HYBRID? Globally ordered micro-batches Internal– yes; external– no Work in progress Transactions Late Data Reprocessin g Rewind with fresh state Can use non- streaming Spark Rewind with fresh state Depends on user code Windows based on received time, not actual time Depends on user code Batch– yes; Streaming– no Can use batch ingestion Batch– unlimited; Streaming– windowPeriod* Streaming data store, not actually a stream processo *
  • 48. WHY NOT HYBRID? ‣ Batch-only? ‣ If ingestion latencies are good enough, that’s great! ‣ Streaming-only? ‣ OK if you have transactions and a way to deal with late data ‣ Current tools do require you to be careful
  • 50. DEVELOPMENT ‣ Need code to run on two very different systems ‣ Maintaining two codebases is perilous ‣ Productivity loss ‣ Code drift ‣ Difficulty training new developers
  • 51. PROGRAMMING MODEL ‣ “Query language,” if you prefer ‣ Write once, run… at least twice ‣ Open-source options ‣ Spark + Spark streaming (if you’re in the Spark ecosystem) ‣ Summingbird (key/value aggregation oriented) ‣ Or develop in-house for your use cases ‣ This investment can make sense if you have more pipeline developers than infrastructure developers
  • 52. PROGRAMMING MODEL ‣ We built “Starfire,” a Scala library for stream transformation ‣ Built around operators ‣ map, flatMap, filter ‣ groupBy, join ‣ lookup ‣ sample ‣ union
  • 53. PROGRAMMING MODEL ‣ Load two data streams ‣ Join streams on shared key ‣ Produce combined records ‣ Export data
  • 55. EXECUTION ‣ User code generates a system-agnostic computation graph ‣ Drivers optimize and compile the graph for each system ‣ Current drivers: ‣ Local (for unit tests) ‣ Hadoop (using Cascading) ‣ Storm ‣ Samza (beta)
  • 56. BENEFITS ‣ Hedge your bets around infrastructure ‣ New drivers can run all existing user code ‣ Separate system optimization from code development ‣ Improved grouping engine without any user code changes ‣ We have many more pipeline developers than infrastructure developers ‣ Developer productivity increased ‣ We use Starfire even for batch-only data pipelines
  • 58. OPERATIONS ‣ Need to operate two very different systems ‣ Must maintain “no data left behind” ‣ Cornerstones of operations ‣ Tools ‣ Metrics ‣ Alerts
  • 59. TOOLS ‣ Hadoop job service ‣ Watch S3 for new data from Kafka ‣ Automatically process data after a few hours ‣ Retry failed jobs, alerts on repeated failures
  • 60. TOOLS ‣ Storm job service ‣ Submit jobs to Nimbus, the Storm scheduler ‣ Monitor for failed jobs ‣ Alert on deployment failures
  • 61. TOOLS ‣ Backfill tool ‣ Run backfills for particular intervals ‣ Can be used for data changes or algorithm changes ‣ Ensures alignment on Druid segment granularity
  • 62. TOOLS ‣ Configuration management tool ‣ Control which version of the code should be running ‣ Manage schemas for each pipeline ‣ Manage quotas for each pipeline ‣ Central audit logs for configuration changes
  • 65. DO TRY THIS AT HOME
  • 66. 2013 CORNERSTONES ‣ Druid - druid.io - @druidio ‣ Storm - storm.incubator.apache.org - @stormprocessor ‣ Hadoop - hadoop.apache.org ‣ Kafka - kafka.apache.org - @apachekafka
  • 67. GLUE storm-kafka Tranquility Camus / Secor Druid Hadoop indexer
  • 68. TAKE AWAYS ‣ Lambda architectures may not be with us forever ‣ But they solve a real problem: eventually consistent data pipelines using popular open-source technologies ‣ Complexity can be managed with tools and practices
  • 70. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/lambda- arch-case-study