2 files changed, 91 insertions, 6 deletions
diff --git a/topics/data/R-qtl2-format-notes.gmi b/topics/data/R-qtl2-format-notes.gmi
new file mode 100644
index 0000000..3397b5e
--- /dev/null
+++ b/topics/data/R-qtl2-format-notes.gmi
@@ -0,0 +1,72 @@
+# R/qtl2 and GEMMA Format Notes
+
+This document is mainly meant to help other non-biologists find their way around the format(s) of the R/qtl2 files. It deals mostly with the meaning/significance of the various fields.
+
+From the R/qtl2 format documentation:
+
+> The comma-delimited (CSV) files are each in the form of a simple matrix, with the first column being a set of IDs and the first row being a set of variable names.
+
+and
+
+> All of these CSV files may be transposed relative to the form described below.
+
+We are going to consider the "non-transposed" form here, for ease of documentation: simply flip the meanings as appropriate for the transposed files.
+
+To convert between formats we should probably use Python, as that is what we can use as an 'esperanto'.
+
+## Control files
+
+Both GN and R/qtl2 have control files. For GN it basically describes the individuals (genometypes) and looks like:
+
+```js
+{
+        "mat": "C57BL/6J",
+        "pat": "DBA/2J",
+        "f1s": ["B6D2F1", "D2B6F1"],
+        "genofile" : [{
+                "title" : "WGS-based (Mar2022)",
+                "location" : "BXD.8.geno",
+                "sample_list" : ["BXD1", "BXD2", "BXD5", "BXD6", "BXD8", "BXD9", "BXD11", "BXD12", "BXD13", "BXD14", "BXD15", "BXD16", "BXD18", "BXD19", "BXD20", "BXD21", "BXD22", "BXD23", "BXD24", "BXD24a", "BXD25", "BXD27", "BXD28", "BXD29", "BXD30", "BXD31", "BXD32", "BXD33", "BXD34", "BXD35", "BXD36", "BXD37", "BXD38", "BXD39", "BXD40", "BXD41", "BXD42", "BXD43", "BXD44",
+ ...]}]}
+```
+
+In gn-guile this gets parsed in gn/data/genotype.scm to fetch the individuals that match the genotype and phenotype layouts.
+
+## pheno files and phenotypes
+
+The standard GEMMA input files are not very good for troubleshooting. R/qtl2 at least has the individual or genometype ID on every line:
+
+```
+id,bolting_days,seed_weight,seed_area,ttl_seedspfruit,branches,height,pc_seeds_aborted,fruit_length
+MAGIC.1,15.33,17.15,0.64,45.11,10.5,NA,0,14.95
+MAGIC.2,22,22.71,0.75,49.11,4.33,42.33,1.09,13.27
+MAGIC.3,23,21.03,0.68,57,4.67,50,0,13.9
+```
+
+This is a good standard, and the IDs can be matched against the control files.
+
+## geno files
+
+> The genotype data file is a matrix of individuals × markers. The first column is the individual IDs; the first row is the marker names.
+
+For GeneNetwork, this means that the first column contains the Sample names (previously "strain names"). The first row would be a list of markers.
+
+## gmap and pmap files
+
+The first column of the gmap/pmap file contains the genetic marker names. There are no individuals/samples (or strains) here.
+
+## phenocovar files
+
+These seem to contain extra metadata for the phenotypes.
+
+The first column is the list of phenotype identifiers, whereas the first row is a list of metadata headers (phenotype covariates).
+
+As an example,
+=> https://github.com/rqtl/qtl2data/blob/main/BXD/bxd_phenocovar.csv The phenocovar file for BXD mice
+
+We see here that this contains the phenotype identifier (id) and a description for each phenotype.
+
+# References
+
+=> https://kbroman.org/qtl2/assets/vignettes/input_files.html
+=> https://github.com/rqtl/qtl2data
diff --git a/topics/data/precompute/steps.gmi b/topics/data/precompute/steps.gmi
index 75e3bfd..d22778a 100644
--- a/topics/data/precompute/steps.gmi
+++ b/topics/data/precompute/steps.gmi
@@ -13,8 +13,18 @@ We will track precompute steps here. We will have:
 Trait archives will have steps for
 
 * [X] step p1: list-traits-to-compute
-* [+] step p2: gemma-lmm9-loco-output: Compute standard GEMMA lmm9 LOCO vector with gemma-wrapper
-* [ ] step p3: gemma-to-lmdb: create a clean vector
+* [X] step p2: gemma-lmm9-loco-output: Compute standard GEMMA lmm9 LOCO vector with gemma-wrapper
+* [X] step p3: gemma-to-lmdb: create a clean vector
+
+Start precompute
+
+* [ ] Fetch traits on tux04
+* [ ] Set up runner on tux04 and others
+* [ ] Run on Octopus
+
+Work on published data
+
+* [ ] Fetch traits
 
 The DB itself can be updated from these
 
@@ -22,8 +32,11 @@ The DB itself can be updated from these
 
 Later
 
+* [ ] Rqtl2: Compute Rqtl2 vector
 * [ ] bulklmm: Compute bulklmm vector
 
+Interestingly, this work coincides with Arun's work on CWL. Rather than trying to write a workflow in bash, we'll use ccwl and accompanying tools to scale up the effort.
+
 # Tags
 
 * assigned: pjotrp
@@ -36,10 +49,10 @@ Later
 
 * [ ] Check Artyoms LMDB version for kinship and maybe add LOCO
 * [+] Create JSON metadata controller for every compute incl. type of content
-* [+] Create genotype archive
-* [+] Create kinship archive
+* [X] Create genotype archive
+* [X] Create kinship archive
 * [+] Create trait archives
-* [+] Kick off lmm9 step
+* [X] Kick off lmm9 step
 * [ ] Update DB step v1
 
 # Step p1: list traits to compute
@@ -62,7 +75,7 @@ At this point we can write
 {"2":9.40338,"3":10.196,"4":10.1093,"5":9.42362,"6":9.8285,"7":10.0808,"8":9.17844,"9":10.1527,"10":10.1167,"11":9.88551,"13":9.58127,"15":9.82312,"17":9.88005,"19":10.0761,"20":10.2739,"21":9.54171,"22":10.1056,"23":10.5702,"25":10.1433,"26":9.68685,"28":9.98464,"29":10.132,"30":9.96049,"31":10.2055,"35":10.1406,"36":9.94794,"37":9.96864,"39":9.31048}
 ```
 
-Note that it (potentially) includes the parents. Also the strain-id is a string and we may want to plug in the strain name. To allow for easy comparison downstream. Finally we may want to store a checksum of sorts. In Guile this can be achieved with:
+Note that it (potentially) includes the parents; this is corrected when generating the phenotype file for GEMMA. Also, the strain-id is a string and we may want to plug in the strain name to allow for easy comparison downstream. Finally, we may want to store a checksum of sorts. In Guile this can be achieved with:
 
 ```scheme
 (use-modules  (rnrs bytevectors)