author: Munyoki Kilyungi, 2023-05-03 19:42:57 +0300
committer: Munyoki Kilyungi, 2023-05-03 19:42:57 +0300
commit: 0dc9584a9c464df7c150eefaf4c70fe4cf7b3db5
tree: e5e223d6492a99a95b45e8f3afa2325d1b7a7b42
parent: 9e19f27929ed20f52394b65f21db0084b7ff8235
Update issue on dumping sample data to LMDB
Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
-rw-r--r-- issues/dump-sample-data-to-lmdb.gmi | 27
1 files changed, 20 insertions, 7 deletions
diff --git a/issues/dump-sample-data-to-lmdb.gmi b/issues/dump-sample-data-to-lmdb.gmi
index 18ac808..d87c3f3 100644
--- a/issues/dump-sample-data-to-lmdb.gmi
+++ b/issues/dump-sample-data-to-lmdb.gmi
@@ -2,13 +2,9 @@
* assigned: bonfacem
* priority: high
-* status: in progress
+* status: stalled
* keywords: lmdb, rdf
-## Description
-
-For GeneNetwork2, a dataset is made up of multiple traits, each with its own sample data. The trait's name is a combination of the species name and the trait's ID (for genotypes/probesets this may not be the case), which is obtained from a SQL table. The objective of this task is to store each trait's sample data in LMDB, allowing it to be accessed quickly in GN2/3 via RDF, which will decouple the data from the python-base classes/objects it is associated with, significantly improving sample data access speed.
-
## Tasks
Dump data and add relevant RDF Metadata for LMDB for:
@@ -16,8 +12,11 @@ Dump data and add relevant RDF Metadata for LMDB for:
* [ ] probesets
* [ ] genotypes
* [ ] GN2/3 Integration
+* [ ] Have files and named files available through RDF
-## General Notes
+## Description
+
+For GeneNetwork2, a dataset is made up of multiple traits, each with its own sample data. The trait's name is a combination of the species name and the trait's ID (for genotypes/probesets this may not be the case), which is obtained from a SQL table. The objective of this task is to store each trait's sample data in LMDB, allowing it to be accessed quickly in GN2/3 via RDF, which will decouple the data from the python-base classes/objects it is associated with, significantly improving sample data access speed.
To fetch all data, including case-attributes data, for published phenotypes in SQL (using BXD_10007 as an example), you would use the following:
@@ -41,4 +40,18 @@ GROUP BY InbredSetId, cxref.StrainId) B ON A.StrainId = B.StrainId;
See this answer for how a join was performed on 2 different queries:
-=> https://dba.stackexchange.com/questions/146509/joining-results-of-two-mysql-queries
+=> https://dba.stackexchange.com/questions/146509/joining-results-of-two-mysql-queries Joining results of two mysql queries
+
+#### Correlations
+
+Correlations are slow. As of Tuesday April 4, 2023 at 1:37pm:
+
+GN1 took *29 sec* (completed) vs GN2 *38 sec* (completed) for 1457545_at in the Hippocampus Consortium M430v2 (Jun06) PDNN
+
+=> http://genenetwork.org/webqtl/main.py?FormID=sharinginfo&GN_AccessionId=112 Hippocampus Consortium M430v2 (Jun06) PDNN
+
+GN1 took *1.56 mins* (completed) vs GN2 *5.24 mins* (completed) for 10528873 in the UTHSC BXD Aged Hippocampus Affy Mouse Gene 1.0 ST (Sep12) RMA Exon Level
+
+=> https://gn1.genenetwork.org/webqtl/main.py?FormID=sharinginfo&InfoPageName=UTHSC_BXD_H_0912 UTHSC BXD Aged Hippocampus Affy Mouse Gene 1.0 ST (Sep12) RMA Exon Level
+
+Research Question: What effect would using LMDB have on correlations over text-file caching and sql fetches?
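
The issue's goal of storing each trait's sample data in LMDB under the trait's name could be sketched as follows. This is a minimal illustration and not the project's actual dump format: the byte layout, the function names `encode_samples` and `decode_samples`, and the strain values are all assumptions. A plain dict stands in for the LMDB environment so the sketch runs without extra dependencies.

```python
import struct


def encode_samples(samples):
    """Pack {strain_name: value} into one bytes value.

    Assumed layout (hypothetical): uint32 count, then for each sample a
    uint16 name length, the UTF-8 name, and a float64 value.
    """
    parts = [struct.pack("<I", len(samples))]
    for name, value in samples.items():
        raw = name.encode("utf-8")
        parts.append(struct.pack("<H", len(raw)))
        parts.append(raw)
        parts.append(struct.pack("<d", value))
    return b"".join(parts)


def decode_samples(blob):
    """Inverse of encode_samples: rebuild the {strain_name: value} dict."""
    (count,) = struct.unpack_from("<I", blob, 0)
    offset = 4
    samples = {}
    for _ in range(count):
        (name_len,) = struct.unpack_from("<H", blob, offset)
        offset += 2
        name = blob[offset:offset + name_len].decode("utf-8")
        offset += name_len
        (value,) = struct.unpack_from("<d", blob, offset)
        offset += 8
        samples[name] = value
    return samples


# A dict stands in for the key-value store. With py-lmdb, the write would
# instead go through a write transaction (env.begin(write=True), txn.put).
store = {}
trait_key = "BXD_10007".encode("utf-8")  # trait name used as the key
store[trait_key] = encode_samples({"BXD1": 61.4, "BXD2": 49.0})  # made-up values

restored = decode_samples(store[trait_key])
```

Keeping one value per trait means a correlation run can fetch a trait's samples with a single key lookup, which is the access pattern the research question would compare against text-file caching and SQL fetches.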