Maria Patterson - Building a community fountain around your data stream
1. Building a community fountain around a data stream
Maria Patterson
Astronomer @ University of Washington
Tech Community Organizer @ PyLadies Seattle and PyData Seattle Meetup groups
Twitter: @OpenSciPinay
3. About me
2016.08 - now
Scientist @UW
2013 - 2016
Scientist @UChicago
2007 - 2013
MS+PhD @NMSU
2003 - 2007
BA @UChicago
I've worked in astronomy on
Star formation (deep imaging and spectra of baby star regions)
Gas accretion (modeling galaxy gas haloes)
Old stellar streams (4-meter Mosaic image processing)
Properties of star formation and the interstellar medium in galaxy outskirts
I've worked in cloud computing + "Data Commons"
NASA satellite data re-analysis pipelines (Storm + Hadoop + Accumulo)
NOAA's move to the cloud (NOAA Big Data Project with Open Commons Consortium)
API-based data services (Open Science Data Cloud)
A case for data commons: toward data science as a service
Etc blah blah things and stuff
Cleveland native (Go Cavs!)/ Chicago transplant (Go Cubs!)
Distance running (Twitter: @runaverage)
Eating (pizza, old-fashioned donuts (+drinks), beer)
https://mtpatter.github.io/pepsi-hidden-figures/
Maria Patterson @OpenSciPinay 3 / 32
7. Personally, I liked the university. They gave us money and facilities. We didn't have to produce anything! You've never
been out of college. You don't know what it's like out there. I worked in the private sector. They expect results.
11. Why is this interesting?
Astronomical data streams are open to the public, awaiting creative use!
We've made tech choices with no hard requirement on programming language and a low barrier to entry.
Ecosystem has wide support and use cases outside of astronomy, enhancing flexibility, usability, sustainability.
What are we looking at?
Transport: Apache Kafka
Scalable in cluster mode, replicated, allows for stream "rewind", many use cases
Serialization: Apache Avro
Fast parsing with structured messages (typing), strictly enforced schemas, but allows for schema evolution
Filtering: Apache Spark
Direct connection to Kafka, batch-like stream interface (Python or SQL-like)
What can you do?
Deploy Dockerized mock stream testbed with Kafka
https://github.com/mtpatter/alert_stream
Check out sample Avro schema and data for astronomy
https://github.com/mtpatter/sample-avro-alert
See Dockerized Jupyter notebooks for a how-to on filtering with Spark using Kafka and Avro
https://mtpatter.github.io/bilao
12. [Architecture diagram: pipelines feed alerts into a distribution layer, which fans the full stream out to broker 1 - broker 4 and an alert database; end users (you, me, alice, bob, carol, dan) tweak and consume filtered streams from the brokers]
14. Large alert streams opened as public data:

Numbers            Zwicky Transient Facility    Large Synoptic Survey Telescope
alerts per visit   1,000                        10,000
visit frequency    every 45 s                   every 39 s
alerts per night   1 million                    10 million
size of alert      20+ KB                       50-150+ KB
nightly data       20-40+ GB                    600 GB - 1.5 TB

...compared to a database of Virtual Observatory VOEvent alerts:

Numbers                             voeventdb.4pisky.org
total (April 2014 - July 6, 2017)   1,679,719
alert frequency                     1 per minute
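As a sanity check on the LSST column above, the nightly figures follow from the per-visit numbers. A rough back-of-the-envelope sketch (the ~10-hour observing night is an assumption, not stated on the slide):

```python
# Rough check of the LSST row: visits per night, alerts per night,
# and nightly data volume. Assumption: a ~10-hour observing night.

NIGHT_SECONDS = 10 * 3600      # assumed length of an observing night
VISIT_SECONDS = 39             # LSST visit cadence from the table
ALERTS_PER_VISIT = 10_000      # LSST alerts per visit from the table

visits_per_night = NIGHT_SECONDS // VISIT_SECONDS          # ~923 visits
alerts_per_night = visits_per_night * ALERTS_PER_VISIT     # ~9.2 million

# 50-150+ KB per alert brackets the nightly data volume
low_gb = alerts_per_night * 50e3 / 1e9     # lower bound, in GB
high_gb = alerts_per_night * 150e3 / 1e9   # upper bound, in GB

print(visits_per_night, alerts_per_night, low_gb, high_gb)
```

The result lands around 9 million alerts and roughly 0.5-1.4 TB per night, consistent with the "10 million" and "600 GB - 1.5 TB" rows.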
15. 1. Transport system: Apache Kafka
Scalability
Replication
Allows stream "rewind"
2. Data formatting: Apache Avro
Fast parsing with structured messages (typing)
Strictly enforced schemas, but schema evolution
Allows postage stamp cutout files
3. Filtering/processing: Apache Spark
Direct connection to transport system
Stream interface similar to batch
Allows for Python or simple SQL-like queries
16. Transport prototyping: Apache Kafka
Distributed log system/messaging queue
Reinvented as a strongly ordered, pub/sub streaming platform
Highly scalable, in production at LinkedIn, Netflix, Microsoft
Great clients + connectors, including Python - good usability
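The "rewind" property comes from Kafka's log model: messages are appended to an ordered log, and consumers track their own offsets, so a consumer can replay from any earlier position. A toy, in-memory sketch of the concept (not the Kafka API):

```python
# Toy model of a Kafka-style log: an append-only list plus
# per-consumer offsets. Illustration only, not the real Kafka API.

class ToyLog:
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

    def read_from(self, offset):
        """Replay every message at or after `offset`."""
        return self.messages[offset:]

log = ToyLog()
for alert_id in ("alert-1", "alert-2", "alert-3"):
    log.append(alert_id)

# A consumer that joined late can "rewind" to offset 0 and see the
# full history; another can resume from wherever it left off.
print(log.read_from(0))  # full history
print(log.read_from(2))  # only the newest message
```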
17. Transport prototyping: Apache Kafka
1. Mock a stream of alerts sent to Kafka
from lsst.alert.stream import AlertProducer

streamProducer = AlertProducer(topic, schema, **kafkaConf)
streamProducer.send(dict_data, avroEncode=True)  # or avroEncode=False
streamProducer.flush()
https://github.com/mtpatter/alert_stream
Includes Docker Compose file for Kafka + Zookeeper
Configures connection to Kafka broker and sets stream topic
Takes dict (with optional stamps)
Can turn Avro encoding on or off
Uses single repeated "alert" in testing
Sends configurable batch size continuously every 39/45 s
18. Transport prototyping: Apache Kafka
2. Use template consumers to see alerts
from lsst.alert.stream import AlertConsumer

streamReader = AlertConsumer(topic, schema, **kafkaConf)
while True:
    msg = streamReader.poll(avroDecode=True)  # or avroDecode=False
    if msg is None:
        continue
    else:
        print(msg)
https://github.com/mtpatter/alert_stream
Configures connection to Kafka broker and subscribes to topic
Script for printing alert content
Avro decoding on or off
Stamp collection on or off
Script for monitoring the stream (drops content)
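The loop above is the standard consumer pattern: poll for a message, skip empty polls, process otherwise. A self-contained sketch of the same loop, with a fake in-memory consumer standing in for `AlertConsumer` (the stand-in class is hypothetical, for illustration only):

```python
# Minimal illustration of the poll loop on the slide, with a fake
# consumer standing in for AlertConsumer (hypothetical stand-in).

class FakeConsumer:
    def __init__(self, alerts):
        # Interleave None to mimic polls that return no message.
        self._queue = [None, alerts[0], None, alerts[1]]

    def poll(self):
        return self._queue.pop(0) if self._queue else None

streamReader = FakeConsumer([{"alertId": 1}, {"alertId": 2}])
received = []
for _ in range(4):               # a real consumer would loop forever
    msg = streamReader.poll()
    if msg is None:
        continue                 # empty poll: nothing arrived this cycle
    received.append(msg)

print(received)  # both alerts, empty polls skipped
```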
20. Data formatting: Apache Avro
Schemas defined with JSON
Dynamic typing - strict adherence
Flexible format - schema evolution
Also used in production, science, recommended by Kafka
21. Data formatting: Apache Avro
1. Follow example schema + data
Schema
{
"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
Data
{"name": "Ben",
"favorite_color": "red",
"favorite_number": random.randint(0,10)}
25. Possible data flow
[Diagram: Kafka sends the full stream of alerts to a filterService, which produces many streams of filtered alerts; schema cuts drop columns and queries drop records, yielding separate broker1 and broker2 alert streams for Broker1 and Broker2]
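The filter service applies two kinds of cuts before fanning out to brokers: a schema cut drops columns (fields), a query cut drops records. A minimal sketch over plain dicts (the field names and thresholds are illustrative, not from the real stream):

```python
# Sketch of the two cut types in the data flow above: a schema cut
# keeps a subset of fields, a query cut keeps a subset of records.
# Field names and threshold values are illustrative.

full_stream = [
    {"alertId": 1, "ra": 351.2, "dec": -5.0, "mag": 18.3},
    {"alertId": 2, "ra": 10.7,  "dec": 12.1, "mag": 21.0},
    {"alertId": 3, "ra": 355.9, "dec": 3.4,  "mag": 19.9},
]

def cut_columns(alerts, keep):
    """Schema cut: project each alert down to the kept fields."""
    return [{k: a[k] for k in keep} for a in alerts]

def cut_records(alerts, predicate):
    """Query cut: keep only the alerts matching the predicate."""
    return [a for a in alerts if predicate(a)]

# broker1 gets positions only; broker2 gets only alerts with ra > 350
broker1_stream = cut_columns(full_stream, ["alertId", "ra", "dec"])
broker2_stream = cut_records(full_stream, lambda a: a["ra"] > 350)

print(broker1_stream)
print([a["alertId"] for a in broker2_stream])  # [1, 3]
```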
27. Filtering/Processing
What "lookback time" sets "batch" vs "stream"?
Interacting with batch and stream the same way is a huge win
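Spark Streaming blurs the batch/stream line by treating a stream as a sequence of small batches, so the same function works on both. A generator-based sketch of that micro-batching idea (batch size and the alert fields are illustrative):

```python
# Sketch of why "batch vs stream" is mostly lookback time: chop an
# incoming stream into micro-batches and apply the same function to
# each batch that you would apply to a static dataset.

def micro_batches(stream, size):
    """Group an iterable stream into fixed-size batches."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch              # final partial batch

def count_bright(alerts):        # same logic for batch or stream
    return sum(1 for a in alerts if a["mag"] < 20)

stream = ({"mag": m} for m in [18.0, 21.5, 19.9, 22.0, 17.2])

per_batch = [count_bright(b) for b in micro_batches(stream, 2)]
print(per_batch)       # counts per micro-batch
print(sum(per_batch))  # same total as one big batch would give
```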
30. Filtering the stream https://github.com/mtpatter/bilao
Set up connection
from pyspark.streaming.kafka import KafkaUtils

kafkaStream = KafkaUtils.createDirectStream(
    sparkStreamingContext,  # an existing StreamingContext
    ['my-stream'],
    {'bootstrap.servers': 'kafka-server:9092',
     'auto.offset.reset': 'smallest',
     'group.id': 'spark-group'})
alerts = kafkaStream.map(lambda x: x[1])
Example map function definition
def map_alertId(alert):
return alert['alertId']
Apply the map function
alertIds = alerts.map(map_alertId) # apply map
alertIds.pprint() # print to the screen
sparkStreamingContext.start()
31. Filtering the stream https://github.com/mtpatter/bilao
Set up connection
from pyspark.streaming.kafka import KafkaUtils

kafkaStream = KafkaUtils.createDirectStream(
    sparkStreamingContext,  # an existing StreamingContext
    ['my-stream'],
    {'bootstrap.servers': 'kafka-server:9092',
     'auto.offset.reset': 'smallest',
     'group.id': 'spark-group'})
alerts = kafkaStream.map(lambda x: x[1])
Example filter function definition
def filter_Ra(alert):
    return alert['diaSource']['ra'] > 350
Apply the filter function
alertsRA = alerts.filter(filter_Ra) # apply filter
alertsRA.pprint() # print to the screen
sparkStreamingContext.start()
32. Why is this interesting?
Astronomical data streams are open to the public, awaiting creative use!
We've made tech choices with no hard requirement on programming language and a low barrier to entry.
Ecosystem has wide support and use cases outside of astronomy, enhancing flexibility, usability, sustainability.
What are we looking at?
Transport: Apache Kafka
Scalable in cluster mode, replicated, allows for stream "rewind", many use cases
Serialization: Apache Avro
Fast parsing with structured messages (typing), strictly enforced schemas, but allows for schema evolution
Filtering: Apache Spark
Direct connection to Kafka, batch-like stream interface (Python or SQL-like)
What can you do?
Deploy Dockerized mock stream testbed with Kafka
https://github.com/mtpatter/alert_stream
Check out sample Avro schema and data for astronomy
https://github.com/mtpatter/sample-avro-alert
See Dockerized Jupyter notebooks for a how-to on filtering with Spark using Kafka and Avro
https://mtpatter.github.io/bilao