author | Pjotr Prins | 2021-11-25 08:29:50 +0100 |
---|---|---|
committer | Pjotr Prins | 2021-11-25 08:29:50 +0100 |
commit | c97aa198bf95cb59ece70f3d8460d4537800e92c | |
tree | 967f1648ec75eab0438b4edaf4c75c95eb342fd4 | |
parent | ba56073820818a8edc198c823b016248fde2c244 | |
download | gn-gemtext-c97aa198bf95cb59ece70f3d8460d4537800e92c.tar.gz |
Using locking code in sheepdog too
-rw-r--r-- | issues/gemma/gemma-wrapper-has-incomplete-files.gmi | 8 |
1 files changed, 6 insertions, 2 deletions
diff --git a/issues/gemma/gemma-wrapper-has-incomplete-files.gmi b/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
index 6441376..6cb92f6 100644
--- a/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
+++ b/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
@@ -19,9 +19,9 @@ The 'obvious' fix would be to create an error handler in GEMMA itself that would

 => https://www.cplusplus.com/reference/exception/set_terminate/

-The problem is that it is NOT a catch all. If there is a hardware fault - a hanging CPU core, for example, which we see - or a problem in a library, such as openblas, there is no guarantee that the terminate handler will be called. Another complication is that a terminate handler needs to be aware of the files being output - i.e., we need to carry the state down somehow. I think we can probably address these issues as much is handled in the GEMMA PARAM class, but it is not worth the effort.
+The problem is that it is NOT a catch all. If there is a hardware fault or a problem in a library, such as openblas, there is no guarantee that the terminate handler will be called. Another complication is that a terminate handler needs to be aware of the files being output - i.e., we need to carry the state down somehow. I think we can probably address these issues as much is handled in the GEMMA PARAM class, but it is not worth the effort (I'll take care of it in a GEMMA rewrite).

-It turns out that GNU parallel can keep track of jobs in a job log - and even rerun the ones missing. The last we don't need because we are using a cache. But we can use the log file to remove any incomplete output files! To me this is the obvious solution because 'parallel' is monitoring outside the GEMMA process and is a hardened piece of software. On failure it simply designates runs that way and we can clean up any (partly) produced files followed by a safe rerun. The lock routine below ascertains no processes are creating the same output at the same time. 
+It turns out that GNU parallel can keep track of jobs in a job log - and even rerun the ones missing using the `--joblog` and `--resume` switches. The last we don't need because we are using a cache. But we can use the log file to remove any incomplete output files! To me this is the obvious solution because 'parallel' is monitoring outside the GEMMA process and is a hardened piece of software. On failure it simply designates runs that way and we can clean up any (partly) produced files followed by a safe rerun. The lock routine below ascertains no processes are creating the same output at the same time.

 ## Dealing with locks

@@ -30,3 +30,7 @@ There is another parallel issue (pun intended) where gemma-wrapper is invoked tw
 One solution is to write a lock file using the inputs as a hash. The lock file can contain a PID and we can check if that is still alive. I should do the same for sheepdog locks(!)

 => https://github.com/genetics-statistics/gemma-wrapper/commit/e7e516ec5a6ffc5b398302fa204685a40e76e171 Added locking support
+
+I added the same code to replace sheepdog locks. Sheepdog is running every minute on our machines so it is a great test case.
+
+=> https://github.com/pjotrp/deploy/commit/4790b81ee897c8244280169edb3cac751eb0a9b3 sheepdog locking
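
The lock-file idea in the changed text (a lock file named after a hash of the inputs, holding a PID that can be checked for liveness) could look roughly like the sketch below. This is an illustration in Python, not the code from the linked gemma-wrapper or sheepdog commits. The function names `lock_path`, `pid_alive`, `acquire` and `release` are hypothetical, SHA-1 is an arbitrary choice of hash, and the liveness check assumes a POSIX system. Clearing a stale lock is not fully race-free here: two processes that notice the same dead PID at the same moment could interfere with each other.

```python
import hashlib
import os


def lock_path(inputs, lock_dir):
    """Derive a lock file name from a hash of the input arguments (hypothetical helper)."""
    digest = hashlib.sha1("\0".join(inputs).encode("utf-8")).hexdigest()
    return os.path.join(lock_dir, digest + ".lock")


def pid_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def acquire(path):
    """Try to take the lock. Return True on success, False if it appears to be held."""
    for _ in range(2):  # allow one retry after clearing a stale lock
        try:
            # O_EXCL makes creation atomic: only one process can create the file
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            try:
                with open(path) as f:
                    holder = int(f.read().strip())
            except FileNotFoundError:
                continue  # the lock vanished in between, try again
            except ValueError:
                return False  # no PID written yet, so assume the lock is held
            if pid_alive(holder):
                return False
            try:
                os.unlink(path)  # stale lock left behind by a dead process
            except FileNotFoundError:
                pass
            continue
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        return True
    return False


def release(path):
    """Remove the lock file; a missing file is not an error."""
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass
```

A caller would compute `lock_path(...)` from the same inputs that determine the output files, skip the run when `acquire` returns False, and call `release` once the outputs are complete.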