Maria Patterson - Building a community fountain around your data stream
1. Building a community fountain around a data stream
Maria Patterson
Astronomer @ University of Washington
Tech Community Organizer @ PyLadies Seattle and PyData Seattle Meetup groups
Twitter: @OpenSciPinay
3. About me
2016.08 - now
Scientist @UW
2013 - 2016
Scientist @UChicago
2007 - 2013
MS+PhD @NMSU
2003 - 2007
BA @UChicago
I've worked in astronomy on
Star formation (deep imaging and spectra of baby star regions)
Gas accretion (modeling galaxy gas haloes)
Old stellar streams (4-meter Mosaic image processing)
Properties of star formation and the interstellar medium in galaxy outskirts
I've worked in cloud computing + "Data Commons"
NASA satellite data re-analysis pipelines (Storm + Hadoop + Accumulo)
NOAA's move to the cloud (NOAA Big Data Project with Open Commons Consortium)
API-based data services (Open Science Data Cloud)
A case for data commons: toward data science as a service
Etc blah blah things and stuff
Cleveland native (Go Cavs!)/ Chicago transplant (Go Cubs!)
Distance running (Twitter: @runaverage)
Eating (pizza, old-fashioned donuts (+drinks), beer)
https://mtpatter.github.io/pepsi-hidden-figures/
Maria Patterson @OpenSciPinay 3 / 32
7. Personally, I liked the university. They gave us money and facilities. We didn't have to produce anything! You've never
been out of college. You don't know what it's like out there. I worked in the private sector. They expect results.
11. Why is this interesting?
Astronomical data streams are open to the public, awaiting creative use!
We've made tech choices with no hard requirement on programming language and a low barrier to entry.
Ecosystem has wide support and use cases outside of astronomy, enhancing flexibility, usability, sustainability.
What are we looking at?
Transport: Apache Kafka
Scalable in cluster mode, replicated, allows for stream "rewind", many use cases
Serialization: Apache Avro
Fast parsing with structured messages (typing), strictly enforced schemas, but allows for schema evolution
Filtering: Apache Spark
Direct connection to Kafka, batch-like stream interface (Python or SQL-like)
What can you do?
Deploy Dockerized mock stream testbed with Kafka
https://github.com/mtpatter/alert_stream
Check out sample Avro schema and data for astronomy
https://github.com/mtpatter/sample-avro-alert
See Dockerized Jupyter notebooks for a how-to on filtering with Spark using Kafka and Avro
https://mtpatter.github.io/bilao
12. [Architecture diagram: pipelines feed alerts into a distribution layer, which fans the full stream out to broker 1 - broker 4 and an alert database; end users (you, me, alice, bob, carol, dan) tweak and consume filtered streams from the brokers]
14. Large alert streams opened as public data:

Numbers            Zwicky Transient Facility    Large Synoptic Survey Telescope
alerts per visit   1,000                        10,000
visit frequency    every 45 s                   every 39 s
alerts per night   1 million                    10 million
size of alert      20+ KB                       50-150+ KB
nightly data       20-40+ GB                    600 GB - 1.5 TB

...compared to a database of Virtual Observatory VOEvent alerts:

Numbers                             voeventdb.4pisky.org
total (April 2014 - July 6, 2017)   1,679,719
alert frequency                     1 per minute
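As a sanity check on the LSST column above, the nightly figures follow from the per-visit numbers. A rough back-of-the-envelope sketch (the ~10-hour observing night is an assumption, not stated on the slide):

```python
# Rough check of the LSST row: visits per night, alerts per night,
# and nightly data volume. Assumption: a ~10-hour observing night.

NIGHT_SECONDS = 10 * 3600      # assumed length of an observing night
VISIT_SECONDS = 39             # LSST visit cadence from the table
ALERTS_PER_VISIT = 10_000      # LSST alerts per visit from the table

visits_per_night = NIGHT_SECONDS // VISIT_SECONDS          # ~923 visits
alerts_per_night = visits_per_night * ALERTS_PER_VISIT     # ~9.2 million

# 50-150+ KB per alert brackets the nightly data volume
low_gb = alerts_per_night * 50e3 / 1e9     # lower bound, in GB
high_gb = alerts_per_night * 150e3 / 1e9   # upper bound, in GB

print(visits_per_night, alerts_per_night, low_gb, high_gb)
```

The result lands around 9 million alerts and roughly 0.5-1.4 TB per night, consistent with the "10 million" and "600 GB - 1.5 TB" rows.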
15. 1. Transport system: Apache Kafka
Scalability
Replication
Allows stream "rewind"
2. Data formatting: Apache Avro
Fast parsing with structured messages (typing)
Strictly enforced schemas, but schema evolution
Allows postage stamp cutout files
3. Filtering/processing: Apache Spark
Direct connection to transport system
Stream interface similar to batch
Allows for Python or simple SQL-like queries
16. Transport prototyping: Apache Kafka
Distributed log system/messaging queue
Reinvented as a strongly ordered, pub/sub streaming platform
Highly scalable, in production at LinkedIn, Netflix, Microsoft
Great clients + connectors, including Python - good usability
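The "rewind" property comes from Kafka's log model: messages are appended to an ordered log, and consumers track their own offsets, so a consumer can replay from any earlier position. A toy, in-memory sketch of the concept (not the Kafka API):

```python
# Toy model of a Kafka-style log: an append-only list plus
# per-consumer offsets. Illustration only, not the real Kafka API.

class ToyLog:
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

    def read_from(self, offset):
        """Replay every message at or after `offset`."""
        return self.messages[offset:]

log = ToyLog()
for alert_id in ("alert-1", "alert-2", "alert-3"):
    log.append(alert_id)

# A consumer that joined late can "rewind" to offset 0 and see the
# full history; another can resume from wherever it left off.
print(log.read_from(0))  # full history
print(log.read_from(2))  # only the newest message
```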
17. Transport prototyping: Apache Kafka
1. Mock a stream of alerts sent to Kafka
from lsst.alert.stream import AlertProducer

streamProducer = AlertProducer(topic, schema, **kafkaConf)
streamProducer.send(dict_data, avroEncode=True)  # or avroEncode=False
streamProducer.flush()
https://github.com/mtpatter/alert_stream
Includes Docker Compose file for Kafka + Zookeeper
Configures connection to Kafka broker and sets stream topic
Takes dict (with optional stamps)
Can turn Avro encoding on or off
Uses single repeated "alert" in testing
Sends configurable batch size continuously every 39/45 s
18. Transport prototyping: Apache Kafka
2. Use template consumers to see alerts
from lsst.alert.stream import AlertConsumer

streamReader = AlertConsumer(topic, schema, **kafkaConf)
while True:
    msg = streamReader.poll(avroDecode=True)  # or avroDecode=False
    if msg is None:
        continue
    else:
        print(msg)
https://github.com/mtpatter/alert_stream
Configures connection to Kafka broker and subscribes to topic
Script for printing alert content
Avro decoding on or off
Stamp collection on or off
Script for monitoring the stream (drops content)
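The loop above is the standard consumer pattern: poll for a message, skip empty polls, process otherwise. A self-contained sketch of the same loop, with a fake in-memory consumer standing in for `AlertConsumer` (the stand-in class is hypothetical, for illustration only):

```python
# Minimal illustration of the poll loop on the slide, with a fake
# consumer standing in for AlertConsumer (hypothetical stand-in).

class FakeConsumer:
    def __init__(self, alerts):
        # Interleave None to mimic polls that return no message.
        self._queue = [None, alerts[0], None, alerts[1]]

    def poll(self):
        return self._queue.pop(0) if self._queue else None

streamReader = FakeConsumer([{"alertId": 1}, {"alertId": 2}])
received = []
for _ in range(4):               # a real consumer would loop forever
    msg = streamReader.poll()
    if msg is None:
        continue                 # empty poll: nothing arrived this cycle
    received.append(msg)

print(received)  # both alerts, empty polls skipped
```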
20. Data formatting: Apache Avro
Schemas defined with JSON
Dynamic typing - strict adherence
Flexible format - schema evolution
Also used in production, science, recommended by Kafka
21. Data formatting: Apache Avro
1. Follow example schema + data
Schema
{
"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
Data
{"name": "Ben",
"favorite_color": "red",
"favorite_number": random.randint(0,10)}
25. Possible data flow
[Diagram: Kafka sends the full stream of alerts to a filterService, which produces many streams of filtered alerts; schema cuts drop columns and queries drop records, yielding separate broker1 and broker2 alert streams for Broker1 and Broker2]
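The filter service applies two kinds of cuts before fanning out to brokers: a schema cut drops columns (fields), a query cut drops records. A minimal sketch over plain dicts (the field names and thresholds are illustrative, not from the real stream):

```python
# Sketch of the two cut types in the data flow above: a schema cut
# keeps a subset of fields, a query cut keeps a subset of records.
# Field names and threshold values are illustrative.

full_stream = [
    {"alertId": 1, "ra": 351.2, "dec": -5.0, "mag": 18.3},
    {"alertId": 2, "ra": 10.7,  "dec": 12.1, "mag": 21.0},
    {"alertId": 3, "ra": 355.9, "dec": 3.4,  "mag": 19.9},
]

def cut_columns(alerts, keep):
    """Schema cut: project each alert down to the kept fields."""
    return [{k: a[k] for k in keep} for a in alerts]

def cut_records(alerts, predicate):
    """Query cut: keep only the alerts matching the predicate."""
    return [a for a in alerts if predicate(a)]

# broker1 gets positions only; broker2 gets only alerts with ra > 350
broker1_stream = cut_columns(full_stream, ["alertId", "ra", "dec"])
broker2_stream = cut_records(full_stream, lambda a: a["ra"] > 350)

print(broker1_stream)
print([a["alertId"] for a in broker2_stream])  # [1, 3]
```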
27. Filtering/Processing
What "lookback time" sets "batch" vs "stream"?
Interacting with batch and stream the same way is a huge win
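Spark Streaming blurs the batch/stream line by treating a stream as a sequence of small batches, so the same function works on both. A generator-based sketch of that micro-batching idea (batch size and the alert fields are illustrative):

```python
# Sketch of why "batch vs stream" is mostly lookback time: chop an
# incoming stream into micro-batches and apply the same function to
# each batch that you would apply to a static dataset.

def micro_batches(stream, size):
    """Group an iterable stream into fixed-size batches."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch              # final partial batch

def count_bright(alerts):        # same logic for batch or stream
    return sum(1 for a in alerts if a["mag"] < 20)

stream = ({"mag": m} for m in [18.0, 21.5, 19.9, 22.0, 17.2])

per_batch = [count_bright(b) for b in micro_batches(stream, 2)]
print(per_batch)       # counts per micro-batch
print(sum(per_batch))  # same total as one big batch would give
```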
30. Filtering the stream https://github.com/mtpatter/bilao
Set up connection
from pyspark.streaming.kafka import KafkaUtils

kafkaStream = KafkaUtils.createDirectStream(
    sparkStreamingContext,  # an existing StreamingContext
    ['my-stream'],
    {'bootstrap.servers': 'kafka-server:9092',
     'auto.offset.reset': 'smallest',
     'group.id': 'spark-group'})
alerts = kafkaStream.map(lambda x: x[1])
Example map function definition
def map_alertId(alert):
return alert['alertId']
Apply the map function
alertIds = alerts.map(map_alertId) # apply map
alertIds.pprint() # print to the screen
sparkStreamingContext.start()
31. Filtering the stream https://github.com/mtpatter/bilao
Set up connection
from pyspark.streaming.kafka import KafkaUtils

kafkaStream = KafkaUtils.createDirectStream(
    sparkStreamingContext,  # an existing StreamingContext
    ['my-stream'],
    {'bootstrap.servers': 'kafka-server:9092',
     'auto.offset.reset': 'smallest',
     'group.id': 'spark-group'})
alerts = kafkaStream.map(lambda x: x[1])
Example filter function definition
def filter_Ra(alert):
    return alert['diaSource']['ra'] > 350
Apply the filter function
alertsRA = alerts.filter(filter_Ra) # apply filter
alertsRA.pprint() # print to the screen
sparkStreamingContext.start()
32. Why is this interesting?
Astronomical data streams are open to the public, awaiting creative use!
We've made tech choices with no hard requirement on programming language and a low barrier to entry.
Ecosystem has wide support and use cases outside of astronomy, enhancing flexibility, usability, sustainability.
What are we looking at?
Transport: Apache Kafka
Scalable in cluster mode, replicated, allows for stream "rewind", many use cases
Serialization: Apache Avro
Fast parsing with structured messages (typing), strictly enforced schemas, but allows for schema evolution
Filtering: Apache Spark
Direct connection to Kafka, batch-like stream interface (Python or SQL-like)
What can you do?
Deploy Dockerized mock stream testbed with Kafka
https://github.com/mtpatter/alert_stream
Check out sample Avro schema and data for astronomy
https://github.com/mtpatter/sample-avro-alert
See Dockerized Jupyter notebooks for a how-to on filtering with Spark using Kafka and Avro
https://mtpatter.github.io/bilao