Draft: Unidirectional API (DB polling)

Bengfort requested to merge unidirectional-api-jobs into main

Potential fix for #1 (closed)

external interface

We need a bi-directional communication channel between castellum and scheduler that can be initiated completely on the castellum side.

The classic option on the web is WebSockets. However, that protocol is quite hard to implement. So instead I used the much simpler Server-Sent Events (SSE) for messages from scheduler to castellum and POST requests for the replies from castellum to scheduler. This is a mechanism I have already used successfully in other projects. It is a bit rough around the edges but does the job.
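To make the SSE direction concrete, here is a minimal sketch of the `text/event-stream` framing, using only the standard library. The helper names `format_sse` and `parse_sse`, the `job` event name, and the JSON payload are assumptions for illustration; they are not taken from this merge request.

```python
import json


def format_sse(data, event=None):
    """Serialize one message in the text/event-stream wire format."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Compact JSON contains no raw newlines, so one "data:" line is enough.
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def parse_sse(stream_text):
    """Parse a complete event-stream string into (event, data) tuples."""
    events = []
    event, data = None, []
    for line in stream_text.split("\n"):
        if line == "":
            # Blank line: dispatch the event collected so far, if any.
            if data:
                events.append((event or "message", json.loads("\n".join(data))))
            event, data = None, []
        elif line.startswith(":"):
            continue  # comment line, usable as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
    return events
```

Comment lines (those starting with `:`) are ignored by the receiver, so a sender could emit them periodically to avoid long stretches of silence on the connection.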

internal interface

Our Django setup generally works like this: There is a pool of worker processes. Each time a request comes in, it gets delegated to one of the workers. If all workers are busy, the request has to wait.

  • When a request comes in, one worker is busy with that request.
  • Another worker is constantly busy with the SSE connection. Somehow the request worker has to communicate with the SSE worker to pass a message to castellum.
  • Castellum responds by sending yet another request, so a third worker is busy processing that response. That response worker in turn has to pass the response back to the initial request worker.

Communication between workers is not something you would usually do. I implemented it using the database: all workers access the new Job model, and each one polls until the job reaches the state it is waiting for.
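The hand-off between the three workers can be sketched roughly as follows. This is not the Job model from this merge request: it uses the standard library's `sqlite3` instead of the Django ORM so that it runs on its own, and the table layout, the state names (`pending`, `sent`, `done`), and all function names are assumptions.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS job (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    state TEXT NOT NULL DEFAULT 'pending',
    payload TEXT NOT NULL,
    result TEXT
)
"""


def create_job(conn, payload):
    """Request worker: store a message that should go to castellum."""
    with conn:
        cur = conn.execute("INSERT INTO job (payload) VALUES (?)", (payload,))
    return cur.lastrowid


def claim_next_job(conn):
    """SSE worker: claim the oldest pending job, or return None."""
    row = conn.execute(
        "SELECT id, payload FROM job WHERE state = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    with conn:
        # Compare-and-set: only succeeds if the job is still pending.
        cur = conn.execute(
            "UPDATE job SET state = 'sent' WHERE id = ? AND state = 'pending'",
            (row[0],),
        )
    # rowcount is 0 if another worker claimed the job in the meantime.
    return row if cur.rowcount == 1 else None


def complete_job(conn, job_id, result):
    """Response worker: store castellum's answer for the request worker."""
    with conn:
        cur = conn.execute(
            "UPDATE job SET state = 'done', result = ? WHERE id = ? AND state = 'sent'",
            (result, job_id),
        )
    return cur.rowcount == 1


def wait_for_result(conn, job_id, timeout=5.0, interval=0.1):
    """Request worker: poll until the job is done or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        row = conn.execute(
            "SELECT state, result FROM job WHERE id = ?", (job_id,)
        ).fetchone()
        if row is not None and row[0] == "done":
            return row[1]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not complete in time")
        time.sleep(interval)
```

The conditional `UPDATE ... WHERE state = ...` is what keeps two workers from claiming or completing the same job twice; the polling interval trades latency against database load.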

Since the SSE request is now initiated by castellum we can think about sending all relevant information over that channel instead of doing yet another request to fetch the details.

issues

  • I was careful to avoid race conditions where possible, but this is a minefield. I think the best we can do is to treat this whole connection as unreliable and make the next layer idempotent so actions can just be replayed.
  • I was not yet able to implement a reliable locking mechanism for SSE connections. So ATM it is possible to have more than one instance of castellum listening, in which case each instance would receive only some of the jobs. I am not yet sure about the impact.
  • If a worker is terminated before the job times out, the job might remain in the database. I am not yet sure how to identify and clean up such zombies.
  • SSE can be tricky with proxies, especially when there are long stretches of silence. However, the castellum side could be implemented so it reconnects automatically when the connection drops. If the higher-level protocol is reasonably robust, a dropped connection will just result in an error message so that the user can retry.
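As an illustration of the first point (treating the connection as unreliable and making the next layer idempotent), a replay-safe handler could look like the sketch below. The class name `IdempotentHandler` and the idea of a caller-supplied key are assumptions, and the in-memory dict only stands in for storage that would have to be shared between workers, for example the database.

```python
class IdempotentHandler:
    """Run an action at most once per key and replay the stored result."""

    def __init__(self, action):
        self._action = action
        # Hypothetical stand-in for persistent storage shared by all workers.
        self._results = {}

    def handle(self, key, *args):
        # A replayed message carries the same key, so the action is skipped
        # and the result of the first execution is returned again.
        if key not in self._results:
            self._results[key] = self._action(*args)
        return self._results[key]
```

With something like this in place, a sender that is unsure whether a message arrived can simply send it again under the same key.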
Edited by Bengfort