From c5923437c8abea6fb7bc940e119923c8566e80cf Mon Sep 17 00:00:00 2001
From: Pjotr Prins
Date: Fri, 24 Nov 2023 15:37:15 +0100
Subject: Precompute update

---
 .../mariadb/precompute-mapping-input-data.gmi      | 59 ++++++++++++++++++++--
 1 file changed, 56 insertions(+), 3 deletions(-)

(limited to 'topics/systems/mariadb/precompute-mapping-input-data.gmi')

diff --git a/topics/systems/mariadb/precompute-mapping-input-data.gmi b/topics/systems/mariadb/precompute-mapping-input-data.gmi
index 85a11ee..caad794 100644
--- a/topics/systems/mariadb/precompute-mapping-input-data.gmi
+++ b/topics/systems/mariadb/precompute-mapping-input-data.gmi
@@ -916,9 +916,62 @@ select * from StrainXRef WHERE InbredSetId=1 AND Used_for_mapping="Y"
 276 rows
 ```
 
-so it looks like we need some extra logic.
-
-
+so it looks like we need some extra logic to fetch the used individuals from the actual genotype file(s).
+
+## Writing the phenotype file
+
+For gemma we need to feed a phenotype file that has only the individuals that are in the genotype file (the other 'missing' phenotype values should be NAs).
+
+The genotype header in GN2 is read from a `BXD.json` file. Code is at
+wqflask/wqflask/marker_regression/run_mapping.py:get_genofile_samplelist(dataset)
+
+Zach writes:
+
+> The JSON files were basically just a "stopgap" solution to allow for
+> multiple genotype files for any given group (which previously wasn't
+> possible), and they don't even exist for most groups - the code uses
+> them only if they exist for a given group. They contain the
+> title/label, filename, and sample lists for situations where the other
+> genotype files only contain a subset of the samples from the full
+> sample list, but otherwise (and usually) the samplelist is just fetched
+> from the .geno file. It's also possible to fetch that subset of samples
+> from the other/alternate .geno files, but it's easier to just fetch the
+> samples as a list from that .json file than it is to parse it out of
+> the other .geno files (and the JSON file would still be used to to get
+> the "title" and filenames, so it's already being read in the code).
+> Basically the JSON files are just a sort of "genotype file metadata"
+> thing to deal with the existence of multiple genotype files for a
+> single group and provide the "attributes" of the files in question
+> (namely the title and filename - the sample lists were only added later
+> once we started to get genotype files that didn't contain the full
+> samplelist for the group in question). The "titles" appear in the
+> drop-down on the trait page* - in their absence it will just use
+> "{group_name}.geno" (for example BXD.geno). So if you deleted BXD.json,
+> mapping would still work for BXD, but it would just default to BXD.geno
+> and wouldn't give the option of other files. This information could
+> probably also be stored somehow in the .geno file, but I'm not sure how
+> to easily do so since that's a weird proprietary format and would
+> seemingly require some sort of direct parsing in the GN code (while
+> JSON is more simple to deal with).
+
+=> https://genenetwork.org/show_trait?trait_id=1427571_at&dataset=HC_M2_0606_P
+
+> If you scroll down to mapping and Inspect Element on the
+> Genotypes drop-down you'll see how the values are the actual filenames
+> while the text is the title from the JSON file
+
+So this is genotype metadata which could go into virtuoso (where all metadata belongs).
+Anyway, we'll get there when the time comes, but for now I can use it to see what genotypes we have in the genotype file and write the phenotype file accordingly.
+
+I note we also have a GenoData table in mariadb which has allele info. The script
+
+=> https://github.com/genenetwork/genenetwork2/blob/testing/scripts/maintenance/load_genotypes.py
+
+loads these from the BXD.geno-type files. As far as I can tell GenoData is not used in GN2+GN3. Other Geno files are basically used to fetch metadata on genotypes.
+
+Genotype state lives in 4 places. Time to create a 5th one with lmdb ;). At least the BXD.json file lines up with BXD.8.geno and the BIMBAM version with 235 inds.
+
+Using this information we created our first phenotype file and GEMMA run!
 
 ## More complicated datasets
 
-- 
cgit v1.2.3