summaryrefslogtreecommitdiff
path: root/issues/dump-sample-data-to-lmdb.gmi
diff options
context:
space:
mode:
authorMunyoki Kilyungi2022-12-15 19:42:13 +0300
committerMunyoki Kilyungi2022-12-15 19:42:13 +0300
commit1ee54541862151bbaa65552a5c3058a0ebff6de6 (patch)
treea29d833f38d6082694dffc68031321b320373ff7 /issues/dump-sample-data-to-lmdb.gmi
parent126b14a652d737737f7136cfe03acee60ad893a3 (diff)
downloadgn-gemtext-1ee54541862151bbaa65552a5c3058a0ebff6de6.tar.gz
Dump sample data to LMDB for GN2/3
* issues/dump-sample-data-to-lmdb.gmi: New issue.
Diffstat (limited to 'issues/dump-sample-data-to-lmdb.gmi')
-rw-r--r--issues/dump-sample-data-to-lmdb.gmi44
1 files changed, 44 insertions, 0 deletions
diff --git a/issues/dump-sample-data-to-lmdb.gmi b/issues/dump-sample-data-to-lmdb.gmi
new file mode 100644
index 0000000..18ac808
--- /dev/null
+++ b/issues/dump-sample-data-to-lmdb.gmi
@@ -0,0 +1,44 @@
+# Dump sample data to LMDB
+
+* assigned: bonfacem
+* priority: high
+* status: in progress
+* keywords: lmdb, rdf
+
+## Description
+
+For GeneNetwork2, a dataset is made up of multiple traits, each with its own sample data. The trait's name is a combination of the species name and the trait's ID (for genotypes/probesets this may not be the case), which is obtained from a SQL table. The objective of this task is to store each trait's sample data in LMDB, allowing it to be accessed quickly in GN2/3 via RDF, which will decouple the data from the python-base classes/objects it is associated with, significantly improving sample data access speed.
+
+## Tasks
+Dump data and add relevant RDF Metadata for LMDB for:
+
+* [-] [un-]published phenotypes
+* [ ] probesets
+* [ ] genotypes
+* [ ] GN2/3 Integration
+
+## General Notes
+
+To fetch all data, including case-attributes data, for published phenotypes in SQL (using BXD_10007 as an example), you would use the following:
+
+=> http://genenetwork.org/show_trait?trait_id=10007&dataset=BXDPublish
+
+```
+SELECT A.Name, A.Value, A.SE, A.Count, B.CaseAttributes FROM
+(SELECT DISTINCT st.Name as 'Name', ifnull(pd.value, 'x') as 'Value',
+ifnull(ps.error, 'x') as 'SE', ifnull(ns.count, 'x') as 'Count', ps.StrainId as 'StrainId'
+FROM PublishFreeze pf JOIN PublishXRef px ON px.InbredSetId = pf.InbredSetId
+JOIN PublishData pd ON pd.Id = px.DataId JOIN Strain st ON pd.StrainId = st.Id
+LEFT JOIN PublishSE ps ON ps.DataId = pd.Id AND ps.StrainId = pd.StrainId
+LEFT JOIN NStrain ns ON ns.DataId = pd.Id AND ns.StrainId = pd.StrainId
+WHERE px.PhenotypeId = 35 ORDER BY st.Name) A
+JOIN
+(SELECT cxref.StrainId as StrainId, group_concat(ca.Name, '=', cxref.Value) as "CaseAttributes"
+FROM CaseAttributeXRefNew cxref LEFT JOIN CaseAttribute ca
+ON ca.Id = cxref.CaseAttributeId
+GROUP BY InbredSetId, cxref.StrainId) B ON A.StrainId = B.StrainId;
+```
+
+See this answer for how a join was performed on 2 different queries:
+
+=> https://dba.stackexchange.com/questions/146509/joining-results-of-two-mysql-queries