author: Frederick Muriuki Muriithi, 2024-10-23 13:48:05 -0500
committer: Frederick Muriuki Muriithi, 2024-10-23 13:48:05 -0500
commit: 9231dd51f3b32c5cf9310fba18b024a739f25b4b
tree: b2c01d74b559193561579ca580e2079a382c19c9
parent: b09ec88789c7e54c8ca75c0d68c1adf54870aafc
Add documentation on background jobs.
-rw-r--r-- docs/dev/background_jobs.md | 62
1 files changed, 62 insertions, 0 deletions
diff --git a/docs/dev/background_jobs.md b/docs/dev/background_jobs.md
new file mode 100644
index 0000000..1a41636
--- /dev/null
+++ b/docs/dev/background_jobs.md
@@ -0,0 +1,62 @@
+# Background Jobs
+
+We run background jobs for long-running processes, e.g. quality-assurance checks
+across multiple huge files, inserting large amounts of data into databases, etc.
+The system needs to keep track of the progress of these jobs and communicate the
+state to the user whenever the user requests it.
+
+This document details some thoughts on how to handle these jobs, especially in
+failure conditions.
+
+We currently use Redis[^redis] to keep track of the state of the background
+processes.
+
+Every background job started will have a Redis[^redis] key with the prefix `gn-uploader:jobs`.
+
+## Users
+
+Currently (2024-10-23T13:29UTC-05:00), we do not track the user that started the job. Moving forward, we will track this information.
+
+We could have the keys be something like, `gn-uploader:jobs:<user-id>:<job-id>`.
+
+Another option is to track any particular user's jobs with a key of the form
+`gn-uploader:users:<user-id>:jobs` and, in that case, have the job keys take the
+form `gn-uploader:jobs:<job-id>`. I (@fredmanglis) favour this option over
+having the user's ID in the job keys directly, since it provides a way to
+interact with **ALL** the jobs without going through each specific user.
+This is a useful ability to have, especially for system administration tasks.
+
+## Multiprocessing Within Jobs
+
+Some jobs, e.g. quality-assurance jobs, can run multiple threads/processes
+themselves. This brings up a problem because Redis[^redis] executes commands one
+at a time, so writes to a single key from several workers are serialised and
+unsynchronised read-modify-write sequences can lose updates.
+
+We also do not want to create bottlenecks by writing to the same key from multiple threads/processes.
+
+The design I have currently come up with, which might work, is as follows:
+
+- At any point just before multiple threads/processes are started, a list of
+ new keys, each of which will collect the output from a single thread/process,
+ will be built.
+- These keys are recorded in the parent job's Redis key data.
+- The threads/processes are started and do whatever they need, pushing their
+ outputs to the appropriate keys within Redis.
+
+The new keys for the child threads/processes could build on the theme
+
+
+## Fetching Jobs Status
+
+Different jobs could have different requirements for handling/processing
+their outputs, and those of any children they might spawn. The system will need
+to provide a way to pass in the correct function/code to process the outputs at
+the point where the job status is requested.
+
+This implies that we need to track the type of job in order to be able to select
+the correct code for processing such output.
+
+## Links
+
+- [^redis]: https://redis.io/