author	Pjotr Prins	2024-05-07 11:24:27 +0200
committer	Pjotr Prins	2024-05-07 11:24:27 +0200
commit	a33b062cb2631064c9d5323ddc208933caf28624 (patch)
tree	e5e98f9f2c14f72ccea4e499c0c0984ae64b0950 /topics
parent	a7bf2095c48bb8cd8fead123849d7e64c85b5bfb (diff)
download	gn-gemtext-a33b062cb2631064c9d5323ddc208933caf28624.tar.gz
Precompute
Diffstat (limited to 'topics')
-rw-r--r--	topics/data/precompute/steps.gmi	44
-rw-r--r--	topics/systems/mariadb/precompute-mapping-input-data.gmi	57
2 files changed, 95 insertions, 6 deletions
diff --git a/topics/data/precompute/steps.gmi b/topics/data/precompute/steps.gmi
new file mode 100644
index 0000000..e72366e
--- /dev/null
+++ b/topics/data/precompute/steps.gmi
@@ -0,0 +1,44 @@
+# Precompute steps
+
+At this stage precompute fetches a trait from the DB and runs GEMMA. Next it tarballs the vector for later use. It also updates the database with the latest info.
+
+To actually kick off compute on machines that do not have access to the DB, I realize now we need a step-wise approach: shift files around without connecting to a DB, then update the DB whenever it is convenient. So we are going to make this a multi-step procedure.
+
+We will track precompute steps here. We will have:
+
+* [ ] steps g: genotype archives (first we only do BXD-latest, include BXD.json)
+* [ ] steps k: kinship archives (first we only do BXD-latest)
+* [ ] steps p: trait archives (first we do p1-4)
+
+Trait archives will have steps for
+
+* [ ] step p1: list-traits-to-compute
+* [ ] step p2: trait-values-export: get trait values from mariadb
+* [ ] step p3: gemma-lmm9-loco-output: Compute standard GEMMA lmm9 LOCO vector with gemma-wrapper
+* [ ] step p4: gemma-to-lmdb: create a clean vector
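The four trait steps above can be sketched as a simple driver. This is only a shape sketch: every function body, trait id and file name below is a hypothetical placeholder, and the real p2/p3 steps talk to mariadb and shell out to gemma-wrapper.

```python
# Sketch of the p1-p4 trait pipeline. Step names come from the list
# above; all bodies, ids and file names are hypothetical placeholders.

def list_traits_to_compute():          # step p1
    # Real version queries mariadb (see list-traits-to-compute.scm)
    return [115467, 115468]

def trait_values_export(data_id):      # step p2
    # Export trait values from mariadb into a phenotype file for GEMMA
    return f"pheno-{data_id}.txt"

def gemma_lmm9_loco(pheno_file):       # step p3
    # Real version shells out to gemma-wrapper with LOCO enabled;
    # here we only name the resulting association file
    return pheno_file.replace("pheno", "assoc")

def gemma_to_lmdb(assoc_file):         # step p4
    # Reduce the GEMMA output to a clean vector for storage
    return assoc_file.replace(".txt", ".lmdb")

def run_pipeline():
    return [gemma_to_lmdb(gemma_lmm9_loco(trait_values_export(t)))
            for t in list_traits_to_compute()]
```

The point of the chaining is that each step only consumes the files of the previous one, so steps p2-p4 can run on a machine without DB access.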
+
+The DB itself can be updated from these:
+
+* [ ] step p5: updated-db-v1: update DB using single LOD score, number of samples and
+
+Later
+
+* [ ] bulklmm: Compute bulklmm vector
+
+# Tags
+
+* assigned: pjotrp
+* type: precompute, gemma
+* status: in progress
+* priority: high
+* keywords: ui, correlations
+
+# Tasks
+
+* [ ] Check Artyom's LMDB version for kinship and maybe add LOCO
+* [ ] Create JSON metadata controller for every compute incl. type of content
+* [ ] Create genotype archive
+* [ ] Create kinship archive
+* [ ] Create trait archives
+* [ ] Kick off lmm9 step
+* [ ] Update DB step v1
diff --git a/topics/systems/mariadb/precompute-mapping-input-data.gmi b/topics/systems/mariadb/precompute-mapping-input-data.gmi
index d8ebe15..2e73590 100644
--- a/topics/systems/mariadb/precompute-mapping-input-data.gmi
+++ b/topics/systems/mariadb/precompute-mapping-input-data.gmi
@@ -2,6 +2,12 @@
GN relies on precomputed mapping scores for search and other functionality. Here we prepare for a new generation of functionality that introduces LMMs for compute and multiple significant scores for queries.
+At this stage we precompute GEMMA output and tarball or lmdb it. As a project is never complete, we need to add a metadata record in each tarball that tracks the status of the 'package'. Also, to offload compute to machines without DB access, we need to prepare a first step that packages genotypes and phenotypes for compute. The genotypes have to be shared, as well as the computed kinship with and without LOCO. See
+
+=> /topics/data/precompute/steps
+
+The mariadb database gets (re)updated from the computed data, parsing the metadata and forcing an update. We only need to handle state to track the creation of batches for (offline) compute. As this can happen on a single machine we don't need anything real fancy: just write out the last updated trait id, i.e., the DataId from ProbeSetXRef, which is indexed. Before doing anything we use Locus_old to identify updates, by setting it to NULL. See the header of list-traits-to-compute.scm.
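The Locus_old bookkeeping can be sketched in a few lines. This uses sqlite3 as a stand-in for mariadb and trims ProbeSetXRef to the relevant columns; the ids and loci are made-up examples, but the NULL-marking and the indexed ORDER BY DataId scan mirror the scheme described above.

```python
import sqlite3

# Stand-in for mariadb: ProbeSetXRef trimmed to the columns that matter here.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ProbeSetXRef (DataId INTEGER, Locus TEXT, Locus_old TEXT)")
db.executemany("INSERT INTO ProbeSetXRef VALUES (?,?,?)",
               [(115467, "rs6394483", "rs6394483"),
                (115468, "rs3663003", "rs3663003")])

# Mark a trait as needing (re)compute by NULLing Locus_old
db.execute("UPDATE ProbeSetXRef SET Locus_old=NULL WHERE DataId=115467")

# The batch builder picks up everything with Locus_old IS NULL, ordered by
# the indexed DataId, so the last handled id is all the state we track
todo = db.execute("SELECT DataId FROM ProbeSetXRef "
                  "WHERE DataId>0 AND Locus_old IS NULL "
                  "ORDER BY DataId").fetchall()
```

After a successful compute-and-update, writing Locus back into Locus_old takes the trait off the todo list again.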
+
# Tags
* assigned: pjotrp
@@ -12,16 +18,15 @@ GN relies on precomputed mapping scores for search and other functionality. Here
# Tasks
-* [ ] Start using GEMMA for precomputed values as a background pipeline on a different machine
-* [ ] Update the table values using GEMMA output (single highest score)
+See
-Above is the quick win for plugging in GEMMA values. We will make sure not to recompute the values that are already up to date.
-This is achieved by naming the input and output files as a hash on their DB inputs.
+=> /topics/data/precompute/steps
-Next for running the full batch:
+Next, for running the full batch:
* [X] Store all GEMMA values efficiently
* [ ] Include metadata record in lmdb and as JSON file
+* [ ] Include metadata record on compute status
* [ ] Remove junk from tarball
* [ ] List significant markers as metadata
* [ ] Reread below info
@@ -72,7 +77,7 @@ MariaDB [db_webqtl]> select Id, Name from InbredSet limit 5;
+----+----------+
```
-and expands them to a .geno file, e.g. BXD.geno. Note that the script does not compute with the many variations of .geno files we have today. Next it sets the Id for ProbeSetFreeze which is the same as the InbredSet Id. So, ProbeSetFreeze.Id == IndbredSet.Id.
+and expands them to a .geno file, e.g. BXD.geno. Note that the script does not cope with the many variations of .geno files we have today (let alone the latest?). Next it sets the Id for ProbeSetFreeze, which is the same as the InbredSet Id. So, ProbeSetFreeze.Id == InbredSet.Id.
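For illustration, a minimal reader for the classic .geno layout. The @-metadata keys and the B/D/H/U genotype codes shown are the common BXD conventions, and the sample marker is invented; real files vary, which is exactly the problem noted above.

```python
def read_geno(text):
    """Parse a classic .geno file into (metadata, header, rows).
    '#' lines are comments, '@key:value' lines are metadata, the first
    remaining line is the column header (Chr Locus cM Mb strains...)."""
    meta, header, rows = {}, None, []
    for line in text.splitlines():
        line = line.rstrip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("@"):
            key, _, value = line[1:].partition(":")
            meta[key.strip()] = value.strip()
        elif header is None:
            header = line.split("\t")
        else:
            rows.append(line.split("\t"))
    return meta, header, rows

sample = """@name:BXD
@mat:B
@pat:D
@het:H
@unk:U
Chr\tLocus\tcM\tMb\tBXD1\tBXD2
1\trs31443144\t1.50\t3.010274\tB\tD
"""
meta, header, rows = read_geno(sample)
```

A robust version would have to special-case the older file variants (missing Mb columns, different unknown codes, etc.) that this sketch ignores.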
There are groups/collections, such as "Hippocampus_M430_V2_BXD_PDNN_Jun06"
@@ -1085,6 +1090,9 @@ For one result we have
}
```
+Continue with steps:
+
+=> /topics/data/precompute/steps
# Notes
@@ -1112,6 +1120,9 @@ The l_lrt runs between 0.0 and 1.0. The smaller the number the higher the log10
The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. In our case it is always exactly 10^5 or -10^5 for some reason. According to GEMMA doc that means it is a pure genetic effect when positive.
+## Xapian
+
+We want to add the log values, effect size and number of individuals to the xapian search index.
## NAs in GN
@@ -1146,3 +1157,37 @@ because it has 71 BXD samples and 32 other samples.
This paper discusses a number of approaches that may be interesting:
=> https://biodatamining.biomedcentral.com/articles/10.1186/s13040-023-00331-3 Automated quantitative trait locus analysis (AutoQTL)
+
+## Fetch strain IDs
+
+The following join fetches the StrainId values in a dataset:
+
+```
+MariaDB [db_webqtl]> select StrainId, Locus, DataId, ProbeSetId, ProbeSetFreezeId from ProbeSetXRef INNER JOIN ProbeSetData ON ProbeSetXRef.DataId=ProbeSetData.Id where DataId>0 AND Locus_old is NULL ORDER BY DataId LIMIT 5;
++----------+-----------+--------+------------+------------------+
+| StrainId | Locus | DataId | ProbeSetId | ProbeSetFreezeId |
++----------+-----------+--------+------------+------------------+
+| 1 | rs6394483 | 115467 | 13825 | 8 |
+| 2 | rs6394483 | 115467 | 13825 | 8 |
+| 3 | rs6394483 | 115467 | 13825 | 8 |
+| 4 | rs6394483 | 115467 | 13825 | 8 |
+| 5 | rs6394483 | 115467 | 13825 | 8 |
++----------+-----------+--------+------------+------------------+
+5 rows in set (0.205 sec)
+```
+
+## Count data sets
+
+Using the above query we can count the number of times BXD1 was used:
+
+```
+MariaDB [db_webqtl]> select count(StrainId) from ProbeSetXRef INNER JOIN ProbeSetData ON ProbeSetXRef.DataId=ProbeSetData.Id where DataId>0 AND Locus_old is NULL AND StrainId=1 ORDER BY DataId LIMIT 5;
++-----------------+
+| count(StrainId) |
++-----------------+
+| 10432197 |
++-----------------+
+1 row in set (39.545 sec)
+```
+
+Or the main BXDs