"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
Maintaining SDIs Using Distributed Task Queues
1. Maintaining Spatial Data Infrastructures (SDIs)
using distributed task queues
Paolo Corti and Ben Lewis
Harvard Center for Geographic Analysis
2017 FOSS4G
Boston
2. Background
Harvard Center for Geographic Analysis
• WorldMap http://worldmap.harvard.edu
– Biggest GeoNode instance on the planet
– https://github.com/cga-harvard/cga-worldmap
• HHypermap http://hh.worldmap.harvard.edu
– Map service registry
– https://github.com/cga-harvard/HHypermap
6. The need for an asynchronous processor
In WorldMap and HHypermap there are operations run by users which are
time consuming and cannot be handled in the context of a web request
● Harvest the metadata of a service and its layers
● Synchronize the metadata of a new or updated layer to the search
engine
● Feed a gazetteer when a new layer is uploaded or updated
● Upload spatial datasets to the server
● Create a new layer using a table join
7. HTTP request/response cycle must be fast
● In web applications the HTTP
request/response cycle can be
synchronous as long as there are
very quick interactions between
the client and the server
● Unfortunately, there are cases
when the cycle becomes slower
● In these situations the best
practice for a web application is
to process these tasks
asynchronously using a task queue
8. Task Queues
Asynchronous processing in a web application can be
delegated to a task queue, which is a system for parallel
execution of tasks in a non-blocking fashion
10. Asynchronous processing model
● The asynchronous processing model is composed of services that
produce processing tasks (producers) and services that
consume and process these tasks (consumers)
● A message queue is a broker which facilitates message passing by
providing a protocol or interface which other services can access.
Work can be distributed across threads or machines
● In the context of a web application the producer is the client
application that creates messages based on the user interaction. The
consumer is a daemon process that can consume the messages and
run the needed process
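The producer/consumer model above can be sketched with the Python standard library alone. This is a minimal illustration, not how WorldMap or HHypermap are implemented: an in-process queue.Queue stands in for the message broker, threads stand in for worker daemons, and the task name sync_layer_metadata and the layer names are hypothetical.

```python
import queue
import threading


def consume(tasks, results, lock):
    """Consumer (worker): take messages off the queue until a stop sentinel arrives."""
    while True:
        message = tasks.get()
        try:
            if message is None:  # sentinel: no more work for this worker
                return
            outcome = f"{message['task']}:{message['layer']}"
            with lock:
                results.append(outcome)
        finally:
            tasks.task_done()


def run_workers(layer_names, worker_count=2):
    """Producer side: enqueue one message per layer, then wait for the workers."""
    tasks = queue.Queue()  # in-process stand-in for the broker (assumption)
    results, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=consume, args=(tasks, results, lock))
        for _ in range(worker_count)
    ]
    for worker in workers:
        worker.start()
    for name in layer_names:
        tasks.put({"task": "sync_layer_metadata", "layer": name})
    for _ in workers:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for worker in workers:
        worker.join()
    return sorted(results)


processed = run_workers(["roads", "rivers", "parcels"])
```

In a real deployment the producer would be the web application and the consumers would be separate processes, possibly on other machines, connected through the broker.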
11. Glossary
● Task Queue: a system for parallel execution of tasks in a non-blocking
fashion
● Broker or Message Queue: provides a protocol or interface for exchanging
messages between different services and applications
● Producer: the code that places tasks in the broker to be executed later
● Consumer or Worker: takes tasks from the broker and processes them
● Exchange: takes a message from a producer and routes it to zero or more
queues (message routing)
Tasks must be consumed faster than they are produced. If not, add more workers
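The exchange entry in the glossary can be illustrated with a toy model. The sketch below assumes the simplest routing rule, an exact match on a routing key; real brokers offer more routing modes, and this is not the API of any broker. The class name DirectExchange, the queue names, and the routing keys are all hypothetical.

```python
from collections import defaultdict, deque


class DirectExchange:
    """Toy exchange: delivers a message to every queue bound to its routing key."""

    def __init__(self):
        self.bindings = defaultdict(list)  # routing key -> bound queues

    def bind(self, routing_key, target_queue):
        self.bindings[routing_key].append(target_queue)

    def publish(self, routing_key, message):
        # .get avoids creating an empty binding for unknown keys
        targets = self.bindings.get(routing_key, [])
        for target_queue in targets:
            target_queue.append(message)
        return len(targets)  # number of queues reached, possibly zero


harvest_queue, index_queue, audit_queue = deque(), deque(), deque()
exchange = DirectExchange()
exchange.bind("layer.harvest", harvest_queue)
exchange.bind("layer.index", index_queue)
exchange.bind("layer.index", audit_queue)

delivered_to_index = exchange.publish("layer.index", "layer 42")
delivered_to_nobody = exchange.publish("layer.delete", "layer 42")
```

The second publish reaches no queue at all, which is the "zero or more queues" case: the exchange accepts the message, but nothing is bound to that key.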
12. Use cases for task queues
● in web applications, a process takes too much time
and must be run asynchronously
● heterogeneous applications/services in a given system
architecture need an easy way to communicate reliably
with each other
● periodic operations (vs crontab)
● a way of parallelizing tasks across multiple processors
● monitor processes and analyze failing tasks (and execute
them again)
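The last use case, executing failing tasks again, can be sketched as a queue that re-enqueues a task when it raises. This is a simplified single-threaded illustration under assumed names (run_with_retries, make_flaky, and the task names are hypothetical); task queue frameworks such as Celery ship their own retry mechanisms, which this does not reproduce.

```python
import queue


def run_with_retries(tasks, max_retries=3):
    """Run (name, callable) pairs; re-queue a failing task up to max_retries times."""
    pending = queue.Queue()
    for name, func in tasks:
        pending.put((name, func, 0))  # third field: retries used so far
    succeeded, failed = {}, {}
    while not pending.empty():
        name, func, retries = pending.get()
        try:
            succeeded[name] = func()
        except Exception as exc:
            if retries < max_retries:
                pending.put((name, func, retries + 1))  # try again later
            else:
                failed[name] = repr(exc)  # give up and keep the error for analysis
    return succeeded, failed


def make_flaky(failures_before_success):
    """Build a task that fails a fixed number of times, then returns its call count."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("remote service unavailable")
        return state["calls"]

    return task


def always_fails():
    raise ValueError("bad layer")


succeeded, failed = run_with_retries(
    [("harvest", make_flaky(2)), ("broken", always_fails)], max_retries=3
)
```

Here the flaky task succeeds on its third call, while the task that always fails ends up in the failed mapping after the first attempt plus three retries.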
13. Typical use cases for a task queue in a web application
● Thumbnail generation
● Sending bulk email
● Fetching large amounts of data from APIs
● Performing time-intensive calculations
● Expensive queries
● Search engine index synchronization
● Interaction with another application/service
● Replacing cron jobs (backups, maintenance, etc…)
14. Typical use cases for a task queue in a GIS Portal/SDI
● Upload a shapefile to the server (GeoNode)
● Thumbnail generation for layers and maps (GeoNode)
● OGC services harvesting (Harvard Hypermap)
● Geoprocessing operations
● Geospatial data maintenance
16. Message broker implementations
Most of them are open source!
● RabbitMQ (AMQP, STOMP, JMS)
● Apache ActiveMQ (STOMP, JMS)
● Amazon Simple Queue Service (JMS)
● Apache Kafka
Several standard protocols:
● AMQP, STOMP, JMS, MSMQ (Microsoft .NET)
18. Celery
● asynchronous task queue based on distributed message
passing
● focused on real-time operation, but supports scheduling
as well
● the execution units, called tasks, are executed
concurrently on one or more worker servers
● it supports many message brokers (RabbitMQ, Redis,
MongoDB, CouchDB, ...)
● written in Python but it can operate with other languages
● great integration with Django!
● great monitoring tools (Flower, django-celery-results)
19. RabbitMQ
● RabbitMQ is a message broker: it accepts and
forwards messages
● most widely deployed open source broker (35k+
deployments)
● supports many message protocols
● supported by many operating systems and
languages
● Written in Erlang
21. A real use case: Harvard Hypermap
HHypermap (Harvard Hypermap)
Registry is a platform that manages
the harvesting and orchestration of
OWS, Esri REST, and other types of
map services, and maintains uptime
statistics for services and layers.
Where possible, layers are cached
by MapProxy.
HHypermap provides thousands of
remote layers to WorldMap users