Building a community fountain around a data stream
Maria Patterson
Astronomer @ University of Washington
Tech Community Organizer @ PyLadies Seattle and PyData Seattle Meetup groups
Twitter: @OpenSciPinay
1 / 32
Maria Patterson @OpenSciPinay
Crown Fountain in detail
About me
2016.08 - now
Scientist @UW
2013 - 2016
Scientist @UChicago
2007 - 2013
MS+PhD @NMSU
2003 - 2007
BA @UChicago
I've worked in astronomy on
Star formation (deep imaging and spectra of baby star regions)
Gas accretion (modeling galaxy gas haloes)
Old stellar streams (4-meter Mosaic image processing)
Properties of star formation and the interstellar medium in galaxy outskirts
I've worked in cloud computing + "Data Commons"
NASA satellite data re-analysis pipelines (Storm + Hadoop + Accumulo)
NOAA's move to the cloud (NOAA Big Data Project with Open Commons Consortium)
API-based data services (Open Science Data Cloud)
A case for data commons: toward data science as a service
Etc blah blah things and stuff
Cleveland native (Go Cavs!)/ Chicago transplant (Go Cubs!)
Distance running (Twitter: @runaverage)
Eating (pizza, old-fashioned donuts (+drinks), beer)
https://mtpatter.github.io/pepsi-hidden-figures/
Personally, I liked the university. They gave us money and facilities. We didn't have to produce anything! You've never
been out of college. You don't know what it's like out there. I worked in the private sector. They expect results.
Why is this interesting?
Astronomical data streams are open to the public, awaiting creative use!
Our tech choices impose no hard requirement on programming language and keep the barrier to entry low.
The ecosystem has wide support and use cases outside of astronomy, enhancing flexibility, usability, and sustainability.
What are we looking at?
Transport: Apache Kafka
Scalable in cluster mode, replicated, allows for stream "rewind", many use cases
Serialization: Apache Avro
Fast parsing with structured messages (typing), strictly enforced schemas, but allows for schema evolution
Filtering: Apache Spark
Direct connection to Kafka, batch-like stream interface (Python or SQL-like)
What can you do?
Deploy Dockerized mock stream testbed with Kafka
https://github.com/mtpatter/alert_stream
Check out sample Avro schema and data for astronomy
https://github.com/mtpatter/sample-avro-alert
See Dockerized Jupyter notebooks for a how-to on filtering with Spark using Kafka and Avro
https://mtpatter.github.io/bilao
[Diagram: pipelines feed alerts into a distribution system backed by an alert database; brokers 1-4 relay the full or a tweaked alert stream to individual consumers (you, me, alice, bob, carol, dan).]
Large alert streams opened as public data:

Numbers          | Zwicky Transient Facility | Large Synoptic Survey Telescope
alerts per visit | 1,000                     | 10,000
visit frequency  | every 45 s                | every 39 s
alerts per night | 1 million                 | 10 million
size of alert    | 20+ kB                    | 50-150+ kB
nightly data     | 20-40+ GB                 | 600 GB - 1.5 TB
...compared to a database of Virtual Observatory VOEvent alerts:

Numbers                           | voeventdb.4pisky.org
total (April 2014 - July 6, 2017) | 1,679,719
alert frequency                   | ~1 per minute
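The LSST column above can be sanity-checked with a little arithmetic; the usable observing hours per night here are an assumed round figure, not from the slides:

```python
# Rough sanity check of the LSST numbers in the table above.
alerts_per_visit = 10_000
visit_every_s = 39
night_hours = 10.8                     # assumed usable observing time per night

visits_per_night = int(night_hours * 3600 / visit_every_s)
alerts_per_night = alerts_per_visit * visits_per_night
print(alerts_per_night)                # 9960000, about 10 million

size_lo, size_hi = 50e3, 150e3         # alert size in bytes (50-150+ kB)
print(alerts_per_night * size_lo / 1e12,
      alerts_per_night * size_hi / 1e12)  # 0.498 1.494, i.e. ~0.5 to ~1.5 TB
```

Both results agree with the "alerts per night" and "nightly data" rows of the table.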
1. Transport system: Apache Kafka
Scalability
Replication
Allows stream "rewind"
2. Data formatting: Apache Avro
Fast parsing with structured messages (typing)
Strictly enforced schemas, but allows for schema evolution
Allows postage stamp cutout files
3. Filtering/ processing: Apache Spark
Direct connection to transport system
Stream interface similar to batch
Allows for Python or simple SQL-like queries
Transport prototyping: Apache Kafka
Distributed log system / messaging queue
Reinvented as a strongly ordered, pub/sub streaming platform
Highly scalable, in production at LinkedIn, Netflix, Microsoft
Great clients + connectors, including Python - good usability
Transport prototyping: Apache Kafka
1. Mock a stream of alerts sent to Kafka
from lsst.alert.stream import AlertProducer
streamProducer = AlertProducer(topic, schema, **kafkaConf)
streamProducer.send(dict_data, avroEncode=True/False)
streamProducer.flush()
https://github.com/mtpatter/alert_stream
Includes Docker Compose file for Kafka + Zookeeper
Configures connection to Kafka broker and sets stream topic
Takes dict (with optional stamps)
Can turn Avro encoding on or off
Uses single repeated "alert" in testing
Sends configurable batch size continuously every 39/45 s
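The repeated-alert testing mode can be sketched in plain Python without a broker; this is a toy stand-in for alert_stream, with made-up alert fields and a deliberately shortened interval:

```python
import json
import time

def mock_stream(batch_size=3, interval=0.01, visits=2):
    """Yield one batch of test alerts per 'visit' at a fixed cadence.
    The real testbed repeats a single alert every 39/45 s; here the
    interval is shortened so the sketch finishes quickly."""
    for visit in range(visits):
        batch = [json.dumps({"alertId": visit * batch_size + i, "candid": i})
                 for i in range(batch_size)]
        yield batch
        time.sleep(interval)

for batch in mock_stream():
    print(len(batch), batch[0])
```

Swapping the `print` for a Kafka producer `send` gives the shape of the real testbed loop.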
Transport prototyping: Apache Kafka
2. Use template consumers to see alerts
from lsst.alert.stream import AlertConsumer

streamReader = AlertConsumer(topic, schema, **kafkaConf)

while True:
    msg = streamReader.poll(avroDecode=True/False)
    if msg is None:
        continue
    else:
        print(msg)
https://github.com/mtpatter/alert_stream
Configures connection to Kafka broker and subscribes to topic
Script for printing alert content
Avro decoding on or off
Stamp collection on or off
Script for monitoring the stream (drops content)
Data formatting: Apache Avro
Data formatting: Apache Avro
Schemas defined with JSON
Dynamic typing - strict adherence
Flexible format - schema evolution
Also used in production and science; recommended by Kafka
Data formatting: Apache Avro
1. Follow example schema + data
Schema
{
"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
Data
{"name": "Ben",
"favorite_color": "red",
"favorite_number": random.randint(0,10)}
Data formatting: Apache Avro
2. Write alert schemas in Avro format
https://github.com/mtpatter/sample-avro-alert
alert_schema = '''
{
"namespace": "ztf",
"type": "record",
"name": "alert",
"doc": "sample avro alert schema v1.0",
"fields": [
{"name": "alertId", "type": "long", "doc": "add descriptions like this"},
{"name": "candid", "type": "long"},
{"name": "candidate", "type": "ztf.alert.candidate"},
{"name": "prv_candidates", "type": [{
"type": "array",
"items": "ztf.alert.prv_candidate"},
"null"]},
{"name": "cutoutScience", "type": ["ztf.alert.cutout", "null"]},
{"name": "cutoutTemplate", "type": ["ztf.alert.cutout", "null"]},
{"name": "cutoutDifference", "type": ["ztf.alert.cutout", "null"]}
]
}
'''
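Because Avro schemas are plain JSON, they can be inspected with the standard library. This trimmed copy of the schema above drops the ztf.alert.* named types, which live in separate schema files in the sample repo:

```python
import json

alert_schema = '''
{
  "namespace": "ztf",
  "type": "record",
  "name": "alert",
  "doc": "sample avro alert schema v1.0",
  "fields": [
    {"name": "alertId", "type": "long", "doc": "add descriptions like this"},
    {"name": "candid", "type": "long"}
  ]
}
'''

parsed = json.loads(alert_schema)
print(parsed["namespace"], parsed["name"])    # ztf alert
print([f["name"] for f in parsed["fields"]])  # ['alertId', 'candid']
```

The same parsed dict is what an Avro library consumes when encoding or decoding records.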
Data formatting: Apache Avro
3. Add alert data
Includes cutout JPEGs (or any file format) as bytes
stamp_schema = '''
{"namespace": "ztf.alert",
"type": "record",
"name": "cutout",
"fields": [
{"name": "filename", "type": "string"},
{"name": "stampdata", "type": "bytes"}]
}
'''
cutout_file = 'stamp-54720.fits'
with open(cutout_file, mode='rb') as f:
    stampdata = f.read()
{"filename": "stamp-54720.fits",
"stampdata": stampdata}
Data formatting: Apache Avro
4. Schema can be different, if the data fits
{'midPointTai': 12314.142412, 'diaSourceId': 111111, 'y': 121.1, 'yVar': 12.1, 'filterName': 'my favorite filter',
'radec': {'dec': 1214141.121, 'ra': 124142.12414}, 'ccdVisitId': 111111,
'x': 112.1, 'snr': 41.1, 'xVar': 1.2, 'base_CircularApertureFlux_3_0_flux': 134134,
'raVar': 5.1, 'ra_dec_Cov': 1.241, 'decVar': 10.1, 'psFlux': 1241.0, 'x_y_Cov': 11.2}
{ ...
"fields": [ {"name": "diaSourceId", "type": "long"},
{"name": "midPointTai", "type": "double"},
{"name": "filterName", "type": "string"},
{"name": "radec", "type": [{"type": "record",
"name": "Radec",
"fields": [
{"name": "ra", "type": "double"},
{"name": "dec", "type": "double"} ]}]},
{"name": "snr", "type": "float"} ]}
{'diaSourceId': 111111, 'snr': 41.099998474121094,
'radec': {'dec': 1214141.121, 'ra': 124142.12414},
'filterName': 'my favorite filter', 'midPointTai': 12314.142412}
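The effect above can be mimicked with a dict comprehension: a reader whose schema lists fewer fields simply never sees the extra columns. This is a toy version; real Avro schema resolution also handles default values and type promotion:

```python
# Fields the (smaller) reader schema knows about.
reader_fields = ["diaSourceId", "midPointTai", "filterName", "radec", "snr"]

# The writer's record carries many more columns.
full_record = {
    'midPointTai': 12314.142412, 'diaSourceId': 111111, 'y': 121.1,
    'yVar': 12.1, 'filterName': 'my favorite filter',
    'radec': {'dec': 1214141.121, 'ra': 124142.12414}, 'ccdVisitId': 111111,
    'x': 112.1, 'snr': 41.1, 'xVar': 1.2, 'raVar': 5.1, 'psFlux': 1241.0,
}

resolved = {k: full_record[k] for k in reader_fields}
print(sorted(resolved))
# ['diaSourceId', 'filterName', 'midPointTai', 'radec', 'snr']
```

This is the "schemas cut columns" half of filtering; queries on field values cut records.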
Possible data flow
[Diagram: the full stream of alerts flows from Kafka through a filterService into many streams of filtered alerts; schemas cut columns and queries cut records, so Broker1 and Broker2 each receive their own filtered stream of alerts.]
Filtering/Processing
What "lookback time" sets "batch" vs "stream"?
Interacting with batch and stream the same way is a huge win
Filtering/Processing: Apache Spark
Filtering the stream https://github.com/mtpatter/bilao
Set up connection
from pyspark.streaming.kafka import KafkaUtils

kafkaStream = KafkaUtils.createDirectStream(
    sparkStreamingContext,  # the active StreamingContext
    ['my-stream'],
    {'bootstrap.servers': 'kafka-server:9092',
     'auto.offset.reset': 'smallest',
     'group.id': 'spark-group'})
alerts = kafkaStream.map(lambda x: x[1])  # keep the value of each (key, value) pair
Example map function definition
def map_alertId(alert):
return alert['alertId']
Apply the map function
alertIds = alerts.map(map_alertId) # apply map
alertIds.pprint() # print to the screen
sparkStreamingContext.start()
Filtering the stream https://github.com/mtpatter/bilao
Set up connection
from pyspark.streaming.kafka import KafkaUtils

kafkaStream = KafkaUtils.createDirectStream(
    sparkStreamingContext,  # the active StreamingContext
    ['my-stream'],
    {'bootstrap.servers': 'kafka-server:9092',
     'auto.offset.reset': 'smallest',
     'group.id': 'spark-group'})
alerts = kafkaStream.map(lambda x: x[1])  # keep the value of each (key, value) pair
Example filter function definition
def filter_Ra(alert):
return alert['diaSource']['ra'] > 350
Apply the filter function
alertsRA = alerts.filter(filter_Ra) # apply filter
alertsRA.pprint() # print to the screen
sparkStreamingContext.start()
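Because the DStream interface mirrors batch map/filter, the same functions can be exercised on a plain Python list standing in for one micro-batch, with no Spark or Kafka required (the alert dicts here are made up):

```python
# One simulated micro-batch of decoded alerts.
micro_batch = [
    {'alertId': 1, 'diaSource': {'ra': 351.2}},
    {'alertId': 2, 'diaSource': {'ra': 120.0}},
    {'alertId': 3, 'diaSource': {'ra': 355.9}},
]

def map_alertId(alert):
    return alert['alertId']

def filter_Ra(alert):
    return alert['diaSource']['ra'] > 350

alert_ids = list(map(map_alertId, micro_batch))  # same shape as alerts.map(...)
high_ra = list(filter(filter_Ra, micro_batch))   # same shape as alerts.filter(...)
print(alert_ids)                                 # [1, 2, 3]
print([a['alertId'] for a in high_ra])           # [1, 3]
```

This is why "interacting with batch and stream the same way is a huge win": filter functions can be developed and tested on lists before pointing them at the live stream.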
Why is this interesting?
Astronomical data streams are open to the public, awaiting creative use!
Our tech choices impose no hard requirement on programming language and keep the barrier to entry low.
The ecosystem has wide support and use cases outside of astronomy, enhancing flexibility, usability, and sustainability.
What are we looking at?
Transport: Apache Kafka
Scalable in cluster mode, replicated, allows for stream "rewind", many use cases
Serialization: Apache Avro
Fast parsing with structured messages (typing), strictly enforced schemas, but allows for schema evolution
Filtering: Apache Spark
Direct connection to Kafka, batch-like stream interface (Python or SQL-like)
What can you do?
Deploy Dockerized mock stream testbed with Kafka
https://github.com/mtpatter/alert_stream
Check out sample Avro schema and data for astronomy
https://github.com/mtpatter/sample-avro-alert
See Dockerized Jupyter notebooks for a how-to on filtering with Spark using Kafka and Avro
https://mtpatter.github.io/bilao

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Cloud-based Data Stream Processing
Cloud-based Data Stream ProcessingCloud-based Data Stream Processing
Cloud-based Data Stream Processing
 
Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink
Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with FlinkSanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink
Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
 
Wikipedia Views As A Proxy For Social Engagement
Wikipedia Views As A Proxy For Social EngagementWikipedia Views As A Proxy For Social Engagement
Wikipedia Views As A Proxy For Social Engagement
 
Spark Streaming, Machine Learning and meetup.com streaming API.
Spark Streaming, Machine Learning and  meetup.com streaming API.Spark Streaming, Machine Learning and  meetup.com streaming API.
Spark Streaming, Machine Learning and meetup.com streaming API.
 
PigSPARQL: A SPARQL Query Processing Baseline for Big Data
PigSPARQL: A SPARQL Query Processing Baseline for Big DataPigSPARQL: A SPARQL Query Processing Baseline for Big Data
PigSPARQL: A SPARQL Query Processing Baseline for Big Data
 
User biglm
User biglmUser biglm
User biglm
 
Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)
 
Complex queries in a distributed multi-model database
Complex queries in a distributed multi-model databaseComplex queries in a distributed multi-model database
Complex queries in a distributed multi-model database
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
 
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
 
Apache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsApache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API Basics
 
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in Search
 
Michael Häusler – Everyday flink
Michael Häusler – Everyday flinkMichael Häusler – Everyday flink
Michael Häusler – Everyday flink
 
Big Data otimizado: Arquiteturas eficientes para construção de Pipelines MapR...
Big Data otimizado: Arquiteturas eficientes para construção de Pipelines MapR...Big Data otimizado: Arquiteturas eficientes para construção de Pipelines MapR...
Big Data otimizado: Arquiteturas eficientes para construção de Pipelines MapR...
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentQuerying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 

Ähnlich wie Maria Patterson - Building a community fountain around your data stream

Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Databricks
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
Timothy Spann
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
Databricks
 

Ähnlich wie Maria Patterson - Building a community fountain around your data stream (20)

All Day DevOps - FLiP Stack for Cloud Data Lakes
All Day DevOps - FLiP Stack for Cloud Data LakesAll Day DevOps - FLiP Stack for Cloud Data Lakes
All Day DevOps - FLiP Stack for Cloud Data Lakes
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
DustinVannoy_DataPipelines_AzureDataConf_Dec22.pdf
DustinVannoy_DataPipelines_AzureDataConf_Dec22.pdfDustinVannoy_DataPipelines_AzureDataConf_Dec22.pdf
DustinVannoy_DataPipelines_AzureDataConf_Dec22.pdf
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 
Scalding Big (Ad)ta
Scalding Big (Ad)taScalding Big (Ad)ta
Scalding Big (Ad)ta
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
 
Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015
 
Meet the squirrel @ #CSHUG
Meet the squirrel @ #CSHUGMeet the squirrel @ #CSHUG
Meet the squirrel @ #CSHUG
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
LarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - Introduction
 

Mehr von PyData

Mehr von PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model

Maria Patterson - Building a community fountain around your data stream

Ecosystem has wide support and use cases outside of astronomy, enhancing flexibility, usability, sustainability.

What are we looking at?
Transport: Apache Kafka
Scalable in cluster mode, replicated, allows for stream "rewind", many use cases
Serialization: Apache Avro
Fast parsing with structured messages (typing), strictly enforced schemas, but allows for schema evolution
Filtering: Apache Spark
Direct connection to Kafka, batch-like stream interface (Python or SQL-like)

What can you do?
Deploy Dockerized mock stream testbed with Kafka
https://github.com/mtpatter/alert_stream
Check out sample Avro schema and data for astronomy
https://github.com/mtpatter/sample-avro-alert
See Dockerized Jupyter notebooks for a how-to on filtering with Spark using Kafka and Avro
https://mtpatter.github.io/bilao
Maria Patterson @OpenSciPinay 11 / 32
[Architecture diagram: pipelines send all alerts to a distribution layer, which fans the stream out to brokers (broker 1, broker 2, broker 3, broker 4) and an alert database; individual consumers (you, me, alice, bob, carol, dan) subscribe to tweaked streams downstream of the brokers.]
Maria Patterson @OpenSciPinay 12 / 32
Large alert streams opened as public data: Numbers

Numbers             Zwicky Transient Facility   Large Synoptic Survey Telescope
alerts per visit    1,000                       10,000
visit frequency     every 45 s                  every 39 s
alerts per night    1 million                   10 million
size of alert       20+ kB                      50-150+ kB
nightly data        20-40+ GB                   600 GB - 1.5 TB

Maria Patterson @OpenSciPinay 13 / 32
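The per-night figures in the table follow from the per-visit numbers. A quick back-of-the-envelope check (the ~10-hour observing night is my assumption, not from the slides):

```python
# Rough sanity check of alerts-per-night from the per-visit numbers above.
night_seconds = 10 * 3600  # assume ~10 hours of observing per night

def alerts_per_night(alerts_per_visit, visit_cadence_s, night_s=night_seconds):
    """Estimate total alerts in one night at a fixed visit cadence."""
    visits = night_s / visit_cadence_s
    return alerts_per_visit * visits

ztf = alerts_per_night(1_000, 45)
lsst = alerts_per_night(10_000, 39)
print(f"ZTF:  ~{ztf:,.0f} alerts/night")   # ZTF:  ~800,000 alerts/night
print(f"LSST: ~{lsst:,.0f} alerts/night")  # LSST: ~9,230,769 alerts/night
```

Both land near the "1 million" and "10 million" round numbers quoted on the slide.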
Large alert streams opened as public data: Numbers
(ZTF / LSST table repeated from the previous slide)

...compared to a database of Virtual Observatory VOEvent alerts:

Numbers            voeventdb.4pisky.org
total              April 2014 - July 6, 2017: 1,679,719
alert frequency    1 per minute

Maria Patterson @OpenSciPinay 14 / 32
1. Transport system: Apache Kafka
   Scalability
   Replication
   Allows stream "rewind"
2. Data formatting: Apache Avro
   Fast parsing with structured messages (typing)
   Strictly enforced schemas, but schema evolution
   Allows postage stamp cutout files
3. Filtering/processing: Apache Spark
   Direct connection to transport system
   Stream interface similar to batch
   Allows for Python or simple SQL-like queries
Maria Patterson @OpenSciPinay 15 / 32
Transport prototyping: Apache Kafka
Distributed log system/messaging queue
Reinvented as strongly ordered, pub/sub streaming platform
Highly scalable, in production at LinkedIn, Netflix, Microsoft
Great clients + connectors, including Python - good usability
Maria Patterson @OpenSciPinay 16 / 32
Transport prototyping: Apache Kafka
1. Mock a stream of alerts sent to Kafka

    from lsst.alert.stream import AlertProducer

    streamProducer = AlertProducer(topic, schema, **kafkaConf)
    streamProducer.send(dict_data, avroEncode=True)  # or avroEncode=False
    streamProducer.flush()

https://github.com/mtpatter/alert_stream
Includes Docker Compose file for Kafka + Zookeeper
Configures connection to Kafka broker and sets stream topic
Takes dict (with optional stamps)
Can turn Avro encoding on or off
Uses single repeated "alert" in testing
Sends configurable batch size continuously every 39/45 s
Maria Patterson @OpenSciPinay 17 / 32
Transport prototyping: Apache Kafka
2. Use template consumers to see alerts

    from lsst.alert.stream import AlertConsumer

    streamReader = AlertConsumer(topic, schema, **kafkaConf)

    while True:
        msg = streamReader.poll(avroDecode=True)  # or avroDecode=False
        if msg is None:
            continue
        else:
            print(msg)

https://github.com/mtpatter/alert_stream
Configures connection to Kafka broker and subscribes to topic
Script for printing alert content
Avro decoding on or off
Stamp collection on or off
Script for monitoring the stream (drops content)
Maria Patterson @OpenSciPinay 18 / 32
Data formatting: Apache Avro
Maria Patterson @OpenSciPinay 19 / 32
Data formatting: Apache Avro
Schemas defined with JSON
Dynamic typing - strict adherence
Flexible format - schema evolution
Also used in production, science, recommended by Kafka
Maria Patterson @OpenSciPinay 20 / 32
Data formatting: Apache Avro
1. Follow example schema + data

Schema:

    {
      "namespace": "example.avro",
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]}
      ]
    }

Data:

    {"name": "Ben",
     "favorite_color": "red",
     "favorite_number": random.randint(0, 10)}

Maria Patterson @OpenSciPinay 21 / 32
Data formatting: Apache Avro
2. Write alert schemas in Avro format
https://github.com/mtpatter/sample-avro-alert

    alert_schema = '''
    {
      "namespace": "ztf",
      "type": "record",
      "name": "alert",
      "doc": "sample avro alert schema v1.0",
      "fields": [
        {"name": "alertId", "type": "long", "doc": "add descriptions like this"},
        {"name": "candid", "type": "long"},
        {"name": "candidate", "type": "ztf.alert.candidate"},
        {"name": "prv_candidates", "type": [{
            "type": "array", "items": "ztf.alert.prv_candidate"}, "null"]},
        {"name": "cutoutScience", "type": ["ztf.alert.cutout", "null"]},
        {"name": "cutoutTemplate", "type": ["ztf.alert.cutout", "null"]},
        {"name": "cutoutDifference", "type": ["ztf.alert.cutout", "null"]}
      ]
    }
    '''

Maria Patterson @OpenSciPinay 22 / 32
Data formatting: Apache Avro
3. Add alert data
Includes cutout jpgs (or any file format) as bytes

    stamp_schema = '''
    {"namespace": "ztf.alert",
     "type": "record",
     "name": "cutout",
     "fields": [
       {"name": "filename", "type": "string"},
       {"name": "stampdata", "type": "bytes"}]
    }
    '''

    cutout_file = 'stamp-54720.fits'
    with open(cutout_file, mode='rb') as f:
        stampdata = f.read()

    {"filename": "stamp-54720.fits", "stampdata": stampdata}

Maria Patterson @OpenSciPinay 23 / 32
Data formatting: Apache Avro
4. Schema can be different, if the data fits

Full alert data:

    {'midPointTai': 12314.142412, 'diaSourceId': 111111, 'y': 121.1,
     'yVar': 12.1, 'filterName': 'my favorite filter',
     'radec': {'dec': 1214141.121, 'ra': 124142.12414},
     'ccdVisitId': 111111, 'x': 112.1, 'snr': 41.1, 'xVar': 1.2,
     'base_CircularApertureFlux_3_0_flux': 134134, 'raVar': 5.1,
     'ra_dec_Cov': 1.241, 'decVar': 10.1, 'psFlux': 1241.0, 'x_y_Cov': 11.2}

A smaller schema:

    { ...
      "fields": [
        {"name": "diaSourceId", "type": "long"},
        {"name": "midPointTai", "type": "double"},
        {"name": "filterName", "type": "string"},
        {"name": "radec", "type": [{"type": "record", "name": "Radec", "fields": [
            {"name": "ra", "type": "double"},
            {"name": "dec", "type": "double"}
        ]}]},
        {"name": "snr", "type": "float"}
      ]}

Data read back with the smaller schema:

    {'diaSourceId': 111111, 'snr': 41.099998474121094,
     'radec': {'dec': 1214141.121, 'ra': 124142.12414},
     'filterName': 'my favorite filter', 'midPointTai': 12314.142412}

Maria Patterson @OpenSciPinay 24 / 32
Possible data flow

[Diagram: Kafka feeds a filterService; schemas cut columns and queries cut records, turning the full stream of alerts into many filtered streams served out through Broker1 and Broker2.]

Maria Patterson @OpenSciPinay 25 / 32
Filtering/Processing
What "lookback time" sets "batch" vs "stream"?
Maria Patterson @OpenSciPinay 26 / 32
Filtering/Processing
What "lookback time" sets "batch" vs "stream"?
Interacting with batch and stream the same way is a huge win
Maria Patterson @OpenSciPinay 27 / 32
Filtering/Processing: Apache Spark
Maria Patterson @OpenSciPinay 28 / 32
Filtering/Processing: Apache Spark
Maria Patterson @OpenSciPinay 29 / 32
Filtering the stream
https://github.com/mtpatter/bilao

Set up connection:

    from pyspark.streaming.kafka import KafkaUtils

    kafkaStream = KafkaUtils.createDirectStream(
        sparkStreamingContext,  # the StreamingContext (omitted on the slide)
        ['my-stream'],
        {'bootstrap.servers': 'kafka-server:9092',
         'auto.offset.reset': 'smallest',
         'group.id': 'spark-group'})

    alerts = kafkaStream.map(lambda x: x[1])

Example map function definition:

    def map_alertId(alert):
        return alert['alertId']

Apply the map function:

    alertIds = alerts.map(map_alertId)  # apply map
    alertIds.pprint()                   # print to the screen
    sparkStreamingContext.start()

Maria Patterson @OpenSciPinay 30 / 32
Filtering the stream
https://github.com/mtpatter/bilao

Set up connection:

    from pyspark.streaming.kafka import KafkaUtils

    kafkaStream = KafkaUtils.createDirectStream(
        sparkStreamingContext,  # the StreamingContext (omitted on the slide)
        ['my-stream'],
        {'bootstrap.servers': 'kafka-server:9092',
         'auto.offset.reset': 'smallest',
         'group.id': 'spark-group'})

    alerts = kafkaStream.map(lambda x: x[1])

Example filter function definition:

    def filter_Ra(alert):
        return alert['diaSource']['ra'] > 350

Apply the filter function:

    alertsRA = alerts.filter(filter_Ra)  # apply filter
    alertsRA.pprint()                    # print to the screen
    sparkStreamingContext.start()

Maria Patterson @OpenSciPinay 31 / 32
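Spark aside, map_alertId and filter_Ra are ordinary Python functions; stripped of the streaming machinery, here is what they do to one micro-batch of alerts (the sample alert dicts are invented for illustration):

```python
# What Spark's map/filter do to one micro-batch, in plain Python.
def map_alertId(alert):
    return alert['alertId']

def filter_Ra(alert):
    return alert['diaSource']['ra'] > 350

# Invented sample micro-batch of already-decoded alerts.
batch = [
    {'alertId': 1, 'diaSource': {'ra': 351.2, 'dec': -1.0}},
    {'alertId': 2, 'diaSource': {'ra': 120.5, 'dec': 10.0}},
    {'alertId': 3, 'diaSource': {'ra': 359.9, 'dec': 5.5}},
]

alert_ids = list(map(map_alertId, batch))
high_ra = list(filter(filter_Ra, batch))

print(alert_ids)                        # [1, 2, 3]
print([a['alertId'] for a in high_ra])  # [1, 3]
```

This is the "batch-like stream interface" win from the summary slides: the same per-record functions work on a stream, a replayed stream, or an archival batch.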
Why is this interesting?
Astronomical data streams are open to the public, awaiting creative use!
We've made tech choices with no hard requirements on programming language, low barrier to entry.
Ecosystem has wide support and use cases outside of astronomy, enhancing flexibility, usability, sustainability.

What are we looking at?
Transport: Apache Kafka
Scalable in cluster mode, replicated, allows for stream "rewind", many use cases
Serialization: Apache Avro
Fast parsing with structured messages (typing), strictly enforced schemas, but allows for schema evolution
Filtering: Apache Spark
Direct connection to Kafka, batch-like stream interface (Python or SQL-like)

What can you do?
Deploy Dockerized mock stream testbed with Kafka
https://github.com/mtpatter/alert_stream
Check out sample Avro schema and data for astronomy
https://github.com/mtpatter/sample-avro-alert
See Dockerized Jupyter notebooks for a how-to on filtering with Spark using Kafka and Avro
https://mtpatter.github.io/bilao
Maria Patterson @OpenSciPinay 32 / 32