From 3787440fb80f3d0e0f6834aaf9fda07d6dd9b2e1 Mon Sep 17 00:00:00 2001
From: Pjotr Prins
Date: Sun, 14 Nov 2021 16:15:19 -0600
Subject: gemma-wrapper transactions

---
 .../gemma/gemma-wrapper-has-incomplete-files.gmi   | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
 create mode 100644 issues/gemma/gemma-wrapper-has-incomplete-files.gmi

(limited to 'issues/gemma/gemma-wrapper-has-incomplete-files.gmi')

diff --git a/issues/gemma/gemma-wrapper-has-incomplete-files.gmi b/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
new file mode 100644
index 0000000..4bea71d
--- /dev/null
+++ b/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
@@ -0,0 +1,22 @@
+# gemma-wrapper has incomplete files
+
+Gemma wrapper caches files - but it can happen a cached file is incomplete and never updated again. The problem appears when GNU parallel is invoked and hits an error. The task here is to make gemma-wrapper transactional.
+
+## Tags
+
+* assigned: pjotrp, zachs
+
+## Tasks
+
+* [ ] parse parallel job log for failed tasks and remove the output files.
+* [ ] create a (global) lock file for gemma-wrapper
+
+## Info
+
+GNU parallel can fail, but does not tell how individual processes did. Need to check if it can return a thread (number). If not we have the option of checking the GEMMA status file and/or see if the output file is complete (by counting number of lines).
+
+Turns out GNU parallel can keep track of jobs in a job log - and even rerun the ones missing. The last we don't need because we are using a cache. But we can use the log file to remove any incomplete output files!
+
+There is another parallel issue (pun intended) where gemma-wrapper is invoked twice for the same job. This is quite possible when people get impatient waiting for a first job to finish.
+
+One solution is to write a lock file using the inputs as a hash. The lock file can contain a PID and we can check if that is still alive. I should do the same for sheepdog locks(!)
-- 
cgit v1.2.3


From 1d1d0f4f448a2b0e15f10a3a7030d93f955cb421 Mon Sep 17 00:00:00 2001
From: Pjotr Prins
Date: Sat, 20 Nov 2021 12:45:16 +0100
Subject: gemma-wrapper: add lock support

---
 issues/gemma/gemma-wrapper-has-incomplete-files.gmi | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/issues/gemma/gemma-wrapper-has-incomplete-files.gmi b/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
index 4bea71d..3c9a7ad 100644
--- a/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
+++ b/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
@@ -17,6 +17,10 @@ GNU parallel can fail, but does not tell how individual processes did. Need to c
 
 Turns out GNU parallel can keep track of jobs in a job log - and even rerun the ones missing. The last we don't need because we are using a cache. But we can use the log file to remove any incomplete output files!
 
+## Dealing with locks
+
 There is another parallel issue (pun intended) where gemma-wrapper is invoked twice for the same job. This is quite possible when people get impatient waiting for a first job to finish.
 
 One solution is to write a lock file using the inputs as a hash. The lock file can contain a PID and we can check if that is still alive. I should do the same for sheepdog locks(!)
+
+=> https://github.com/genetics-statistics/gemma-wrapper/commit/e7e516ec5a6ffc5b398302fa204685a40e76e171 Added locking support
-- 
cgit v1.2.3
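
The lock scheme described in this patch (a lock file named after a hash of the inputs, holding the PID of its owner) could be sketched as below. This is a Python illustration only, not the code from the linked gemma-wrapper commit. The helper names `lock_path`, `pid_alive`, `acquire` and `release`, the use of SHA-1 and the `.lock` suffix are all assumptions, and the liveness check through `os.kill(pid, 0)` assumes a POSIX system.

```python
import hashlib
import os


def lock_path(inputs, lock_dir):
    """Name the lock after a hash of the job inputs (SHA-1 is an assumption)."""
    digest = hashlib.sha1("\0".join(inputs).encode("utf-8")).hexdigest()
    return os.path.join(lock_dir, digest + ".lock")


def pid_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def acquire(path):
    """Try to take the lock; return False if a live process holds it."""
    for _ in range(2):  # allow one retry after a stale or vanished lock
        try:
            # O_EXCL makes creation fail if the file already exists
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            try:
                with open(path) as f:
                    owner = int(f.read().strip())
            except FileNotFoundError:
                continue  # released in the meantime, try again
            except ValueError:
                owner = 0  # empty or garbled lock file, treated as stale
            if pid_alive(owner):
                return False
            try:
                os.remove(path)  # stale lock: its owner is gone
            except FileNotFoundError:
                pass
            continue
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        return True
    return False


def release(path):
    """Remove the lock file if it is still there."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
```

A caller would run `acquire(lock_path(inputs, lock_dir))` before starting GEMMA and `release` afterwards. A lock left behind by a crashed run is taken over because its PID no longer exists. The sketch leaves a small race between detecting a stale lock and removing it, which a production version would need to close.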


From 722fc762c7749470b1f7a22cfd74e7acda8146a1 Mon Sep 17 00:00:00 2001
From: Pjotr Prins
Date: Sat, 20 Nov 2021 12:56:05 +0100
Subject: gemma-wrapper: add lock support

---
 issues/gemma/gemma-wrapper-has-incomplete-files.gmi | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/issues/gemma/gemma-wrapper-has-incomplete-files.gmi b/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
index 3c9a7ad..7b8093a 100644
--- a/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
+++ b/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
@@ -9,7 +9,7 @@ Gemma wrapper caches files - but it can happen a cached file is incomplete and n
 ## Tasks
 
 * [ ] parse parallel job log for failed tasks and remove the output files.
-* [ ] create a (global) lock file for gemma-wrapper
+* [X] create a (global) lock file for gemma-wrapper
 
 ## Info
 
-- 
cgit v1.2.3


From 2f09b7c24096ae7de0fef7df12de4f5f36d0514d Mon Sep 17 00:00:00 2001
From: Pjotr Prins
Date: Sun, 21 Nov 2021 10:29:37 +0100
Subject: gemma musings

---
 issues/gemma/gemma-wrapper-has-incomplete-files.gmi | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/issues/gemma/gemma-wrapper-has-incomplete-files.gmi b/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
index 7b8093a..6441376 100644
--- a/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
+++ b/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
@@ -15,7 +15,13 @@ Gemma wrapper caches files - but it can happen a cached file is incomplete and n
 
 GNU parallel can fail, but does not tell how individual processes did. Need to check if it can return a thread (number). If not we have the option of checking the GEMMA status file and/or see if the output file is complete (by counting number of lines).
 
-Turns out GNU parallel can keep track of jobs in a job log - and even rerun the ones missing. The last we don't need because we are using a cache. But we can use the log file to remove any incomplete output files!
+The 'obvious' fix would be to create an error handler in GEMMA itself that would clean up output files on error exit. E.g. using
+
+=> https://www.cplusplus.com/reference/exception/set_terminate/
+
+The problem is that it is NOT a catch all. If there is a hardware fault - a hanging CPU core, for example, which we see - or a problem in a library, such as openblas, there is no guarantee that the terminate handler will be called. Another complication is that a terminate handler needs to be aware of the files being output - i.e., we need to carry the state down somehow. I think we can probably address these issues as much is handled in the GEMMA PARAM class, but it is not worth the effort.
+
+It turns out that GNU parallel can keep track of jobs in a job log - and even rerun the ones missing. The last we don't need because we are using a cache. But we can use the log file to remove any incomplete output files! To me this is the obvious solution because 'parallel' is monitoring outside the GEMMA process and is a hardened piece of software. On failure it simply designates runs that way and we can clean up any (partly) produced files followed by a safe rerun. The lock routine below ascertains no processes are creating the same output at the same time.
 
 ## Dealing with locks
 
-- 
cgit v1.2.3


From c97aa198bf95cb59ece70f3d8460d4537800e92c Mon Sep 17 00:00:00 2001
From: Pjotr Prins
Date: Thu, 25 Nov 2021 08:29:50 +0100
Subject: Using locking code in sheepdog too

---
 issues/gemma/gemma-wrapper-has-incomplete-files.gmi | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/issues/gemma/gemma-wrapper-has-incomplete-files.gmi b/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
index 6441376..6cb92f6 100644
--- a/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
+++ b/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
@@ -19,9 +19,9 @@ The 'obvious' fix would be to create an error handler in GEMMA itself that would
 
 => https://www.cplusplus.com/reference/exception/set_terminate/
 
-The problem is that it is NOT a catch all. If there is a hardware fault - a hanging CPU core, for example, which we see - or a problem in a library, such as openblas, there is no guarantee that the terminate handler will be called. Another complication is that a terminate handler needs to be aware of the files being output - i.e., we need to carry the state down somehow. I think we can probably address these issues as much is handled in the GEMMA PARAM class, but it is not worth the effort.
+The problem is that it is NOT a catch all. If there is a hardware fault or a problem in a library, such as OpenBLAS, there is no guarantee that the terminate handler will be called. Another complication is that a terminate handler needs to be aware of the files being output - i.e., we need to carry the state down somehow. I think we can probably address these issues, since much of this is handled in the GEMMA PARAM class, but it is not worth the effort (I'll take care of it in a GEMMA rewrite).
 
-It turns out that GNU parallel can keep track of jobs in a job log - and even rerun the ones missing. The last we don't need because we are using a cache. But we can use the log file to remove any incomplete output files! To me this is the obvious solution because 'parallel' is monitoring outside the GEMMA process and is a hardened piece of software. On failure it simply designates runs that way and we can clean up any (partly) produced files followed by a safe rerun. The lock routine below ascertains no processes are creating the same output at the same time.
+It turns out that GNU parallel can keep track of jobs in a job log - and even rerun the ones missing using the `--joblog` and `--resume` switches. The last we don't need because we are using a cache. But we can use the log file to remove any incomplete output files! To me this is the obvious solution because 'parallel' is monitoring outside the GEMMA process and is a hardened piece of software. On failure it simply designates runs that way and we can clean up any (partly) produced files followed by a safe rerun. The lock routine below ascertains no processes are creating the same output at the same time.
 
 ## Dealing with locks
 
@@ -30,3 +30,7 @@ There is another parallel issue (pun intended) where gemma-wrapper is invoked tw
 One solution is to write a lock file using the inputs as a hash. The lock file can contain a PID and we can check if that is still alive. I should do the same for sheepdog locks(!)
 
 => https://github.com/genetics-statistics/gemma-wrapper/commit/e7e516ec5a6ffc5b398302fa204685a40e76e171 Added locking support
+
+I added the same code to replace the sheepdog locks. Sheepdog runs every minute on our machines, so it is a great test case.
+
+=> https://github.com/pjotrp/deploy/commit/4790b81ee897c8244280169edb3cac751eb0a9b3 sheepdog locking
-- 
cgit v1.2.3


From d326fd0036a43b4e2b1edfe8137ff5c36570d221 Mon Sep 17 00:00:00 2001
From: Pjotr Prins
Date: Thu, 25 Nov 2021 09:25:55 +0100
Subject: Fixed incomplete output for parallel gemma

---
 issues/gemma/gemma-wrapper-has-incomplete-files.gmi | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/issues/gemma/gemma-wrapper-has-incomplete-files.gmi b/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
index 6cb92f6..d530fb4 100644
--- a/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
+++ b/issues/gemma/gemma-wrapper-has-incomplete-files.gmi
@@ -8,7 +8,7 @@ Gemma wrapper caches files - but it can happen a cached file is incomplete and n
 
 ## Tasks
 
-* [ ] parse parallel job log for failed tasks and remove the output files.
+* [X] parse parallel job log for failed tasks and remove the output files.
 * [X] create a (global) lock file for gemma-wrapper
 
 ## Info
@@ -23,6 +23,12 @@ The problem is that it is NOT a catch all. If there is a hardware fault or a pro
 
 It turns out that GNU parallel can keep track of jobs in a job log - and even rerun the ones missing using the `--joblog` and `--resume` switches. The last we don't need because we are using a cache. But we can use the log file to remove any incomplete output files! To me this is the obvious solution because 'parallel' is monitoring outside the GEMMA process and is a hardened piece of software. On failure it simply designates runs that way and we can clean up any (partly) produced files followed by a safe rerun. The lock routine below ascertains no processes are creating the same output at the same time.
 
+## Delete files on failure
+
+Implemented in
+
+=> https://github.com/genetics-statistics/gemma-wrapper/commit/624ed0d805f29ab682cffbe46bc104dffd0d713c
+
 ## Dealing with locks
 
 There is another parallel issue (pun intended) where gemma-wrapper is invoked twice for the same job. This is quite possible when people get impatient waiting for a first job to finish.
-- 
cgit v1.2.3
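
The cleanup step this series ends with (read the GNU parallel job log, find the failed jobs, delete their output files) might look roughly like the sketch below. It is a Python illustration, not the code from the linked gemma-wrapper commit. It relies on the tab-separated `--joblog` layout with its `Exitval`, `Signal` and `Command` columns. The names `failed_commands` and `remove_incomplete`, and the `output_files_for` callback that maps a command line to the files it writes, are hypothetical.

```python
import csv
import io
import os


def failed_commands(joblog_text):
    """Return the commands that ended with a non-zero exit value or a signal."""
    reader = csv.DictReader(io.StringIO(joblog_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return [row["Command"] for row in reader
            if row["Exitval"] != "0" or row["Signal"] != "0"]


def remove_incomplete(joblog_text, output_files_for):
    """Delete the output files of every failed job and return their paths."""
    removed = []
    for command in failed_commands(joblog_text):
        # output_files_for is supplied by the caller (hypothetical helper)
        for path in output_files_for(command):
            if os.path.exists(path):
                os.remove(path)
                removed.append(path)
    return removed
```

Because the results are cached, a later run only has to recompute the jobs whose files were removed here, so `--resume` is not needed for this approach.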