Diffstat (limited to 'issues/dump-sample-data-to-lmdb.gmi')
-rw-r--r-- issues/dump-sample-data-to-lmdb.gmi 27
1 files changed, 20 insertions, 7 deletions
diff --git a/issues/dump-sample-data-to-lmdb.gmi b/issues/dump-sample-data-to-lmdb.gmi
index 18ac808..d87c3f3 100644
--- a/issues/dump-sample-data-to-lmdb.gmi
+++ b/issues/dump-sample-data-to-lmdb.gmi
@@ -2,13 +2,9 @@
* assigned: bonfacem
* priority: high
-* status: in progress
+* status: stalled
* keywords: lmdb, rdf
-## Description
-
-For GeneNetwork2, a dataset is made up of multiple traits, each with its own sample data. The trait's name is a combination of the species name and the trait's ID (for genotypes/probesets this may not be the case), which is obtained from a SQL table. The objective of this task is to store each trait's sample data in LMDB, allowing it to be accessed quickly in GN2/3 via RDF, which will decouple the data from the python-base classes/objects it is associated with, significantly improving sample data access speed.
-
## Tasks
Dump data and add relevant RDF Metadata for LMDB for:
@@ -16,8 +12,11 @@ Dump data and add relevant RDF Metadata for LMDB for:
* [ ] probesets
* [ ] genotypes
* [ ] GN2/3 Integration
+* [ ] Have files and named files available through RDF
-## General Notes
+## Description
+
+For GeneNetwork2, a dataset is made up of multiple traits, each with its own sample data. The trait's name is a combination of the species name and the trait's ID (for genotypes/probesets this may not be the case), which is obtained from a SQL table. The objective of this task is to store each trait's sample data in LMDB, allowing it to be accessed quickly in GN2/3 via RDF, which will decouple the data from the python-base classes/objects it is associated with, significantly improving sample data access speed.
To fetch all data, including case-attributes data, for published phenotypes in SQL (using BXD_10007 as an example), you would use the following:
@@ -41,4 +40,18 @@ GROUP BY InbredSetId, cxref.StrainId) B ON A.StrainId = B.StrainId;
See this answer for how a join was performed on 2 different queries:
-=> https://dba.stackexchange.com/questions/146509/joining-results-of-two-mysql-queries
+=> https://dba.stackexchange.com/questions/146509/joining-results-of-two-mysql-queries Joining results of two mysql queries
+
+#### Correlations
+
+Correlations are slow. As of Tuesday April 4, 2023 at 1:37pm:
+
+GN1 took *29 sec* (completed) vs GN2 *38 sec* (completed) for 1457545_at in the Hippocampus Consortium M430v2 (Jun06) PDNN
+
+=> http://genenetwork.org/webqtl/main.py?FormID=sharinginfo&GN_AccessionId=112 Hippocampus Consortium M430v2 (Jun06) PDNN
+
+GN1 took *1.56 mins* (completed) vs GN2 *5.24 mins* (completed) for 10528873 in the UTHSC BXD Aged Hippocampus Affy Mouse Gene 1.0 ST (Sep12) RMA Exon Level
+
+=> https://gn1.genenetwork.org/webqtl/main.py?FormID=sharinginfo&InfoPageName=UTHSC_BXD_H_0912 UTHSC BXD Aged Hippocampus Affy Mouse Gene 1.0 ST (Sep12) RMA Exon Level
+
+Research Question: What effect would using LMDB have on correlations over text-file caching and sql fetches?
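
One way to approach the research question above is a small timing harness that runs the same sample-data fetch against each storage backend. The sketch below is illustrative only and is not part of this commit: `time_fetch` and `make_backends` are hypothetical names, the one-JSON-file-per-trait layout is an assumed stand-in for the text-file cache, and `sqlite3` stands in for the production MySQL database. An LMDB-backed fetch function could be added to the returned dictionary and timed the same way.

```python
import json
import os
import sqlite3
import time
from typing import Callable, Dict, List

Fetch = Callable[[str], List[float]]


def time_fetch(fetch: Fetch, trait: str, repeats: int = 100) -> float:
    """Return the mean wall-clock seconds per call of fetch(trait)."""
    start = time.perf_counter()
    for _ in range(repeats):
        fetch(trait)
    return (time.perf_counter() - start) / repeats


def make_backends(workdir: str, data: Dict[str, List[float]]) -> Dict[str, Fetch]:
    """Build one fetch function per backend, all holding the same data."""
    # Assumed text-file cache layout: one JSON file per trait.
    for trait, values in data.items():
        with open(os.path.join(workdir, trait + ".json"), "w") as handle:
            json.dump(values, handle)

    def from_text_file(trait: str) -> List[float]:
        with open(os.path.join(workdir, trait + ".json")) as handle:
            return json.load(handle)

    # sqlite3 is only a stand-in for the real MySQL database.
    conn = sqlite3.connect(os.path.join(workdir, "samples.db"))
    conn.execute("CREATE TABLE sample (trait TEXT, position INTEGER, value REAL)")
    conn.executemany(
        "INSERT INTO sample VALUES (?, ?, ?)",
        [(t, i, v) for t, vals in data.items() for i, v in enumerate(vals)],
    )
    conn.commit()

    def from_sql(trait: str) -> List[float]:
        rows = conn.execute(
            "SELECT value FROM sample WHERE trait = ? ORDER BY position",
            (trait,),
        )
        return [row[0] for row in rows]

    return {"text-file": from_text_file, "sql": from_sql}
```

Calling `time_fetch(fetch, "BXD_10007")` for each entry returned by `make_backends` gives per-backend means that can be compared directly. Real measurements would need the production data volumes and a cold versus warm cache distinction, neither of which this sketch models.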