# Background Jobs
We run background jobs for long-running processes, e.g. quality-assurance checks
across multiple huge files, inserting large datasets into databases, etc. The
system needs to keep track of the progress of these jobs and communicate their
state to the user whenever the user requests it.
This document details some thoughts on how to handle these jobs, especially
under failure conditions.
We currently use Redis[^redis] to keep track of the state of the background
processes.
Every background job started will have a Redis[^redis] key with the prefix
`gn-uploader:jobs`.
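As a rough illustration, registering a new job under that prefix might look
something like the sketch below. This assumes the `redis` Python client; the
hash field names (`job-id`, `job-type`, `status`) are illustrative assumptions,
not an actual schema.

```python
import uuid
from redis import Redis

JOBS_PREFIX = "gn-uploader:jobs"  # prefix described above

def create_job(rconn: Redis, job_type: str) -> str:
    """Register a new background job and return its Redis key.

    The field names used here are placeholders for illustration only.
    """
    job_id = str(uuid.uuid4())
    job_key = f"{JOBS_PREFIX}:{job_id}"
    rconn.hset(job_key, mapping={
        "job-id": job_id,
        "job-type": job_type,
        "status": "queued",
    })
    return job_key
```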
## Users
Currently (2024-10-23T13:29UTC-05:00), we do not track the user that started the job. Moving forward, we will track this information.
We could have the keys be something like `gn-uploader:jobs:<user-id>:<job-id>`.
Another option is to track any particular user's jobs with a key of the form
`gn-uploader:users:<user-id>:jobs` and, in that case, have the job keys take the
form `gn-uploader:jobs:<job-id>`. I (@fredmanglis) favour this option over
having the user's ID in the job keys directly, since it provides a way to
interact with **ALL** the jobs without indirecting through each specific user.
This is a useful ability to have, especially for system-administration tasks.
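A minimal sketch of this second option, again assuming the `redis` Python
client (the function names and the use of a Redis set here are my assumptions,
not settled design):

```python
from redis import Redis

def register_job_for_user(rconn: Redis, user_id: str, job_id: str) -> None:
    """Link a job to the user that started it.

    The job's own data lives under `gn-uploader:jobs:<job-id>`, while
    `gn-uploader:users:<user-id>:jobs` holds the set of that user's job IDs.
    """
    rconn.sadd(f"gn-uploader:users:{user_id}:jobs", job_id)

def jobs_for_user(rconn: Redis, user_id: str) -> set:
    """Fetch the IDs of all jobs started by a particular user."""
    return {member.decode("utf8")
            for member in rconn.smembers(f"gn-uploader:users:{user_id}:jobs")}
```

Listing **ALL** jobs regardless of user then remains a simple iteration over
keys matching `gn-uploader:jobs:*`.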
## Multiprocessing Within Jobs
Some jobs, e.g. quality-assurance jobs, can themselves run multiple
threads/processes. This presents a problem because Redis[^redis] does not allow
parallel access to a key, especially for writes.
We also do not want to create bottlenecks by writing to the same key from
multiple threads/processes.
The design I have currently come up with, which might work, is as follows:
- Just before the point where multiple threads/processes are started, build a
  list of new keys, each of which will collect the output from a single
  thread/process.
- Record these keys in the parent job's Redis[^redis] key data.
- Start the threads/processes and let them do whatever they need, pushing their
  outputs to their respective keys within Redis[^redis].
The new keys for the child threads/processes could build on the same naming
theme as the parent job's key; one possible scheme is sketched below.
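A minimal sketch of that design, assuming the `redis` Python client and the
standard-library `multiprocessing` module. The child-key format
(`gn-uploader:jobs:<job-id>:<index>`), the `child-keys` field name, and
`do_the_work` are all illustrative assumptions:

```python
import multiprocessing
from redis import Redis

def do_the_work(chunk) -> list:
    """Stand-in for the real per-chunk processing; returns output lines."""
    return [f"processed: {item}" for item in chunk]

def child_worker(redis_uri: str, output_key: str, chunk) -> None:
    """Run one unit of work, pushing its output to this child's own key."""
    # Each child opens its own connection rather than sharing the parent's.
    rconn = Redis.from_url(redis_uri)
    for line in do_the_work(chunk):
        rconn.rpush(output_key, line)

def run_job_with_children(redis_uri: str, job_id: str, chunks: list) -> None:
    """Give every child its own output key, recorded on the parent job,
    so that no two processes ever write to the same key."""
    rconn = Redis.from_url(redis_uri)
    parent_key = f"gn-uploader:jobs:{job_id}"
    # Build the children's keys before any child is started ...
    child_keys = [f"{parent_key}:{idx}" for idx, _ in enumerate(chunks)]
    # ... and record them in the parent's data.
    rconn.hset(parent_key, "child-keys", ",".join(child_keys))

    processes = [
        multiprocessing.Process(target=child_worker, args=(redis_uri, key, chunk))
        for key, chunk in zip(child_keys, chunks)
    ]
    for proc in processes:
        proc.start()
    for proc in processes:
        proc.join()
```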
## Fetching Job Status
Different jobs could have different requirements for handling/processing their
outputs, and those of any children they might spawn. The system will need to
provide a way to pass in the correct function/code to process the outputs at
the point where the job status is requested.
This implies that we need to track the type of each job in order to be able to
select the correct code for processing its output.
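One way to do that selection is sketched below, assuming the `redis` Python
client. The job types (`qc`, `insert`), the registry, and the processor
functions are all hypothetical placeholders:

```python
from redis import Redis

def process_qc_status(rconn: Redis, job_data: dict) -> dict:
    """Placeholder: interpret the output of a quality-assurance job."""
    return {"job-type": "qc", "status": job_data.get(b"status", b"").decode("utf8")}

def process_insert_status(rconn: Redis, job_data: dict) -> dict:
    """Placeholder: interpret the output of a data-insertion job."""
    return {"job-type": "insert", "status": job_data.get(b"status", b"").decode("utf8")}

# Hypothetical registry mapping a job's type to the code that knows how to
# process that job's outputs (and those of any children it spawned).
STATUS_PROCESSORS = {
    "qc": process_qc_status,
    "insert": process_insert_status,
}

def job_status(rconn: Redis, job_id: str) -> dict:
    """Fetch a job's raw data and dispatch on its type to build the status."""
    # Note: redis-py returns bytes keys/values unless `decode_responses=True`.
    job_data = rconn.hgetall(f"gn-uploader:jobs:{job_id}")
    job_type = job_data.get(b"job-type", b"").decode("utf8")
    try:
        return STATUS_PROCESSORS[job_type](rconn, job_data)
    except KeyError as exc:
        raise ValueError(f"No status processor for job type {job_type!r}") from exc
```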
## Links
[^redis]: https://redis.io/