diff options
author | Frederick Muriuki Muriithi | 2024-10-23 13:48:05 -0500 |
---|---|---|
committer | Frederick Muriuki Muriithi | 2024-10-23 13:48:05 -0500 |
commit | 9231dd51f3b32c5cf9310fba18b024a739f25b4b (patch) | |
tree | b2c01d74b559193561579ca580e2079a382c19c9 | |
parent | b09ec88789c7e54c8ca75c0d68c1adf54870aafc (diff) | |
download | gn-uploader-9231dd51f3b32c5cf9310fba18b024a739f25b4b.tar.gz |
Add documentation on background jobs.
-rw-r--r-- | docs/dev/background_jobs.md | 62 |
1 files changed, 62 insertions, 0 deletions
diff --git a/docs/dev/background_jobs.md b/docs/dev/background_jobs.md new file mode 100644 index 0000000..1a41636 --- /dev/null +++ b/docs/dev/background_jobs.md @@ -0,0 +1,62 @@ +# Background Jobs + +We run background jobs for long-running processes, e.g. quality-assurance checks +across multiple huge files, inserting huge data to databases, etc. The system +needs to keep track of the progress of these jobs and communicate the state to +the user whenever the user requests. + +This details some thoughts on how to handle these jobs, especially in failure +conditions. + +We currently use Redis[^redis] to keep track of the state of the background +processes. + +Every background job started will have a Redis[^redis] key with the prefix `gn-uploader:jobs` + +## Users + +Currently (2024-10-23T13:29UTC-05:00), we do not track the user that started the job. Moving forward, we will track this information. + +We could have the keys be something like, `gn-uploader:jobs:<user-id>:<job-id>`. + +Another option is track any particular users jobs with a key of the form +`gn-uploader:users:<user-id>:jobs` and in that case, have the job keys take the +form `gn-uploader:jobs:<job-id>`. I (@fredmanglis) favour this option over +having the user's ID in the jobs keys directly, since it provides a way to +interact with **ALL** the jobs without indirecting through each specific user. +This is a useful ability to have, especially for system administrative tasks. + +## Multiprocessing Within Jobs + +Some jobs, e.g. quality-assurance jobs, can run multiple threads/processes +themselves. This brings up a problem because Redis[^redis] does not allow +parallel access to a key, especially for writing. + +We also do not want to create bottlenecks by writing to the same key from +multiple threads/processes. + +The design I have currently come up with, that might work is as follows: + +- At any point just before where multiple threads/processes are started, a list + of new keys, each of which will collect the output from a single thread, will + be built. +- These keys are recorded in the parent's redis key data +- The threads/processes are started and do whatever they need, pushing their + outputs to the appropriate keys within redis. + +The new keys for the children threads/processe could build on the theme + + +## Fetching Jobs Status + +Different jobs could have different ways of requirements for handling/processing +their outputs, and those of any children they might spawn. The system will need +to provide a way to pass in the correct function/code to process the outputs at +the point where the job status is requested. + +This implies that we need to track the type of job in order to be able to select +the correct code for processing such output. + +## Links + +- [^redis]: https://redis.io/ |