author: Frederick Muriuki Muriithi, 2021-12-22 11:49:30 +0300
committer: Frederick Muriuki Muriithi, 2021-12-22 11:49:30 +0300
commit: 2f56ee37183938270197d9bd968648e65584513c (patch)
tree: 195c6ca08188a61d1669403c3109de8a96427a7a /issues/gemma/gemma-wrapper-has-incomplete-files.gmi
parent: 512bc12aaac7189253a62b2be105472a34821263 (diff)
parent: be16a6a7f1a7e2dfa074e858c26ff6a9b6aa86de (diff)
Merge branch 'main' of github.com:genenetwork/gn-gemtext-threads
Diffstat (limited to 'issues/gemma/gemma-wrapper-has-incomplete-files.gmi')
-rw-r--r-- issues/gemma/gemma-wrapper-has-incomplete-files.gmi | 42
1 files changed, 42 insertions, 0 deletions
diff --git a/issues/gemma/gemma-wrapper-has-incomplete-files.gmi b/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
new file mode 100644
index 0000000..d530fb4
--- /dev/null
+++ b/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
@@ -0,0 +1,42 @@
+# gemma-wrapper has incomplete files
+
+gemma-wrapper caches its output files, but a cached file can be left incomplete and is then never updated again. The problem appears when GNU parallel is invoked and one of its jobs hits an error. The task here is to make gemma-wrapper transactional.
+
+## Tags
+
+* assigned: pjotrp, zachs
+
+## Tasks
+
+* [X] parse parallel job log for failed tasks and remove the output files.
+* [X] create a (global) lock file for gemma-wrapper
+
+## Info
+
+GNU parallel can fail, but by default it does not report how the individual processes fared. We need to check whether it can report which thread (number) failed. If not, we have the option of checking the GEMMA status file and/or checking whether the output file is complete (by counting the number of lines).
+
+The 'obvious' fix would be to create an error handler in GEMMA itself that would clean up output files on error exit. E.g. using
+
+=> https://www.cplusplus.com/reference/exception/set_terminate/
+
+The problem is that it is NOT a catch-all. If there is a hardware fault or a problem in a library, such as openblas, there is no guarantee that the terminate handler will be called. Another complication is that a terminate handler needs to be aware of the files being output, i.e., we need to carry that state down somehow. I think we can probably address these issues, since much of it is handled in the GEMMA PARAM class, but it is not worth the effort (I'll take care of it in a GEMMA rewrite).
+
+It turns out that GNU parallel can keep track of jobs in a job log, and can even rerun the missing ones, using the `--joblog` and `--resume` switches. We don't need the latter because we are using a cache. But we can use the log file to remove any incomplete output files! To me this is the obvious solution because 'parallel' monitors from outside the GEMMA process and is a hardened piece of software. On failure it simply marks the affected runs as failed in the log, and we can clean up any (partly) produced files, followed by a safe rerun. The lock routine described below ensures that no two processes create the same output at the same time.
+
+## Delete files on failure
+
+Implemented in
+
+=> https://github.com/genetics-statistics/gemma-wrapper/commit/624ed0d805f29ab682cffbe46bc104dffd0d713c
+
+## Dealing with locks
+
+There is another parallel issue (pun intended) where gemma-wrapper is invoked twice for the same job. This is quite possible when people get impatient waiting for a first job to finish.
+
+One solution is to write a lock file whose name is a hash of the inputs. The lock file can contain a PID, and we can check whether that process is still alive. I should do the same for sheepdog locks(!)
+
+=> https://github.com/genetics-statistics/gemma-wrapper/commit/e7e516ec5a6ffc5b398302fa204685a40e76e171 Added locking support
+
+I added the same code to replace sheepdog locks. Sheepdog is running every minute on our machines so it is a great test case.
+
+=> https://github.com/pjotrp/deploy/commit/4790b81ee897c8244280169edb3cac751eb0a9b3 sheepdog locking
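
The two mechanisms described in the issue text (removing the outputs of jobs that the GNU parallel job log marks as failed, and a PID-carrying lock file named after a hash of the inputs) can be sketched as follows. This is a minimal Python illustration, not the code from the commits linked above, and gemma-wrapper itself is not written in Python. The job log is assumed to have the tab-separated layout written by `--joblog`, with `Exitval`, `Signal` and `Command` columns and `Command` last. All function names are hypothetical, as is the convention that the output path is the last token of each command. The liveness check relies on `os.kill(pid, 0)` and is meant for POSIX systems only.

```python
import hashlib
import os
import shlex


def failed_commands(joblog_path):
    """Return the commands that the job log records as failed."""
    failed = []
    with open(joblog_path) as fh:
        header = fh.readline().rstrip("\n").split("\t")
        for line in fh:
            # Command is the last column and may itself contain tabs.
            fields = line.rstrip("\n").split("\t", len(header) - 1)
            row = dict(zip(header, fields))
            if int(row["Exitval"]) != 0 or int(row["Signal"]) != 0:
                failed.append(row["Command"])
    return failed


def last_token(command):
    """Hypothetical convention: the output path is the last argument."""
    return shlex.split(command)[-1]


def remove_failed_outputs(joblog_path, output_for=last_token):
    """Delete the (possibly partial) output file of every failed job."""
    removed = []
    for command in failed_commands(joblog_path):
        path = output_for(command)
        if os.path.exists(path):
            os.remove(path)
            removed.append(path)
    return removed


def lock_path(inputs, lock_dir):
    """Name the lock file after a hash of the job's inputs."""
    digest = hashlib.sha1("\0".join(inputs).encode()).hexdigest()
    return os.path.join(lock_dir, digest + ".lock")


def pid_alive(pid):
    """Signal 0 checks for existence without affecting the process."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def acquire_lock(path):
    """Return True if we now hold the lock, False if a live process does."""
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            try:
                with open(path) as fh:
                    owner = int(fh.read().strip())
            except (FileNotFoundError, ValueError):
                owner = None  # simplification: unreadable lock counts as stale
            if owner is not None and pid_alive(owner):
                return False
            try:
                os.remove(path)  # stale lock: remove it and try again
            except FileNotFoundError:
                pass
            continue
        with os.fdopen(fd, "w") as fh:
            fh.write(str(os.getpid()))
        return True
```

A caller would take the lock first, run the jobs, call `remove_failed_outputs` on the job log, and delete the lock file when done. Treating an empty or unreadable lock file as stale leaves a small race with a process that has created the file but not yet written its PID; a more careful version would write the PID to a temporary file and link it into place.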