From 9231dd51f3b32c5cf9310fba18b024a739f25b4b Mon Sep 17 00:00:00 2001
From: Frederick Muriuki Muriithi
Date: Wed, 23 Oct 2024 13:48:05 -0500
Subject: Add documentation on background jobs.

---
 docs/dev/background_jobs.md | 62 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)
 create mode 100644 docs/dev/background_jobs.md

(limited to 'docs/dev')

diff --git a/docs/dev/background_jobs.md b/docs/dev/background_jobs.md
new file mode 100644
index 0000000..1a41636
--- /dev/null
+++ b/docs/dev/background_jobs.md
@@ -0,0 +1,62 @@
+# Background Jobs
+
+We run background jobs for long-running processes, e.g. quality-assurance checks
+across multiple huge files, inserting large amounts of data into databases, etc.
+The system needs to keep track of the progress of these jobs and report their
+state to the user whenever the user requests it.
+
+This document details some thoughts on how to handle these jobs, especially in
+failure conditions.
+
+We currently use Redis[^redis] to keep track of the state of the background
+processes.
+
+Every background job started will have a Redis[^redis] key with the prefix `gn-uploader:jobs`.
+
+## Users
+
+Currently (2024-10-23T13:29UTC-05:00), we do not track the user that started the job. Moving forward, we will track this information.
+
+We could have the keys be something like `gn-uploader:jobs::`.
+
+Another option is to track any particular user's jobs with a key of the form
+`gn-uploader:users::jobs` and, in that case, have the job keys take the
+form `gn-uploader:jobs:`. I (@fredmanglis) favour this option over
+having the user's ID in the job keys directly, since it provides a way to
+interact with **ALL** the jobs without going through each specific user.
+This is a useful ability to have, especially for system administration tasks.
+
+## Multiprocessing Within Jobs
+
+Some jobs, e.g. quality-assurance jobs, can run multiple threads/processes themselves.
+This brings up a problem because uncoordinated parallel writes to a single
+Redis[^redis] key can overwrite one another.
+
+We also do not want to create bottlenecks by writing to the same key from
+multiple threads/processes.
+
+The design I have currently come up with, which might work, is as follows:
+
+- Just before multiple threads/processes are started, a list of new keys is
+  built. Each of these keys will collect the output from a single
+  thread/process.
+- These keys are recorded in the parent's Redis key data.
+- The threads/processes are started and do whatever they need, pushing their
+  outputs to the appropriate keys within Redis.
+
+The new keys for the child threads/processes could build on the theme
+
+
+## Fetching Jobs Status
+
+Different jobs could have different requirements for handling/processing
+their outputs, and those of any children they might spawn. The system will need
+to provide a way to pass in the correct function/code to process the outputs at
+the point where the job status is requested.
+
+This implies that we need to track the type of job in order to be able to select
+the correct code for processing such output.
+
+## Links
+
+- [^redis]: https://redis.io/
-- 
cgit v1.2.3
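
The following is a minimal Python sketch of the design described under "Multiprocessing Within Jobs" and "Fetching Jobs Status": one output key per child, the child keys recorded on the parent, and an output processor chosen by job type when the status is requested. It is an illustration under stated assumptions, not the project's implementation. The key layout (`gn-uploader:jobs:<job-id>` and a `:children:<n>` suffix), the field names `job-type` and `child-keys`, the `qc` processor, and the `InMemoryStore` class are all hypothetical. `InMemoryStore` only mimics the `hset`, `hget`, `rpush` and `lrange` calls of a Redis client so that the sketch runs without a server; a real redis-py client returns bytes unless it is created with `decode_responses=True`.

```python
"""Sketch: per-child output keys for a background job (all names hypothetical)."""
import uuid

JOBS_PREFIX = "gn-uploader:jobs"  # the prefix named in the document


class InMemoryStore:
    """Stand-in for a Redis client so the sketch runs without a server.

    Mimics only hset/hget/rpush/lrange, and returns str rather than bytes.
    """

    def __init__(self):
        self.hashes, self.lists = {}, {}

    def hset(self, name, key, value):
        self.hashes.setdefault(name, {})[key] = value

    def hget(self, name, key):
        return self.hashes.get(name, {}).get(key)

    def rpush(self, name, *values):
        self.lists.setdefault(name, []).extend(values)

    def lrange(self, name, start, end):
        # Simplified: handles only end == -1 or a non-negative, inclusive end.
        items = self.lists.get(name, [])
        return items[start:] if end == -1 else items[start:end + 1]


def job_key(job_id):
    """Assumed layout for a job's own key."""
    return f"{JOBS_PREFIX}:{job_id}"


def start_job(store, job_type, num_children):
    """Build one output key per child and record them on the parent,
    before any child is started."""
    job_id = str(uuid.uuid4())
    parent = job_key(job_id)
    child_keys = [f"{parent}:children:{n}" for n in range(num_children)]
    store.hset(parent, "job-type", job_type)
    store.hset(parent, "child-keys", ",".join(child_keys))
    return job_id, child_keys


def child_work(store, child_key, lines):
    """Each child pushes only to its own key, so no two children share one."""
    for line in lines:
        store.rpush(child_key, line)


# Hypothetical output processors, selected by the recorded job type.
PROCESSORS = {
    "qc": lambda outputs: {"errors": sum(len(out) for out in outputs)},
}


def job_status(store, job_id):
    """Collect every child's output and process it according to the job type."""
    parent = job_key(job_id)
    child_keys = store.hget(parent, "child-keys").split(",")
    outputs = [store.lrange(key, 0, -1) for key in child_keys]
    return PROCESSORS[store.hget(parent, "job-type")](outputs)
```

A caller would run `start_job(store, "qc", 2)`, hand one returned key to each child for `child_work`, and later call `job_status(store, job_id)`. Storing the child keys on the parent means the status code never has to scan the keyspace to find them.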