From 9231dd51f3b32c5cf9310fba18b024a739f25b4b Mon Sep 17 00:00:00 2001
From: Frederick Muriuki Muriithi
Date: Wed, 23 Oct 2024 13:48:05 -0500
Subject: Add documentation on background jobs.

---
 docs/dev/background_jobs.md | 62 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)
 create mode 100644 docs/dev/background_jobs.md

(limited to 'docs/dev')
diff --git a/docs/dev/background_jobs.md b/docs/dev/background_jobs.md
new file mode 100644
index 0000000..1a41636
--- /dev/null
+++ b/docs/dev/background_jobs.md
@@ -0,0 +1,62 @@
+# Background Jobs
+
+We run background jobs for long-running processes, e.g. quality-assurance checks
+across multiple huge files, inserting large amounts of data into databases, etc.
+The system needs to keep track of the progress of these jobs and communicate
+their state to the user whenever the user requests it.
+
+This document details some thoughts on how to handle these jobs, especially
+under failure conditions.
+
+We currently use Redis[^redis] to keep track of the state of the background
+processes.
+
+Every background job started will have a Redis[^redis] key with the prefix `gn-uploader:jobs`.
+
+## Users
+
+Currently (2024-10-23T13:29UTC-05:00), we do not track the user that started the job. Moving forward, we will track this information.
+
+We could have the keys take a form like `gn-uploader:jobs:<user-id>:<job-id>`.
+
+Another option is to track any particular user's jobs with a key of the form
+`gn-uploader:users:<user-id>:jobs` and, in that case, have the job keys take the
+form `gn-uploader:jobs:<job-id>`. I (@fredmanglis) favour this option over
+having the user's ID in the job keys directly, since it provides a way to
+interact with **ALL** the jobs without going through each specific user.
+This is a useful ability to have, especially for system administration tasks.
+
+## Multiprocessing Within Jobs
+
+Some jobs, e.g. quality-assurance jobs, can run multiple threads/processes
+themselves. This brings up a problem because concurrent read-modify-write
+updates to a single Redis[^redis] key can overwrite each other.
+
+We also do not want to create bottlenecks by writing to the same key from
+multiple threads/processes.
+
+The design I have come up with so far, which might work, is as follows:
+
+- Just before multiple threads/processes are started, a list of new keys, each
+  of which will collect the output from a single thread/process, will be
+  built.
+- These keys are recorded in the parent's Redis key data.
+- The threads/processes are started and do whatever they need, pushing their
+  outputs to the appropriate keys within Redis.
+  
+The new keys for the child threads/processes could build on the theme
+  
+  
+## Fetching Jobs Status
+
+Different jobs could have different requirements for handling/processing
+their outputs, and those of any children they might spawn. The system will need
+to provide a way to pass in the correct function/code to process the outputs at
+the point where the job status is requested.
+
+This implies that we need to track the type of job in order to be able to select
+the correct code for processing such output.
+
+## Links
+
+- [^redis]: https://redis.io/
-- 
cgit 1.4.1