From 2edb63e9f09172755c87b148d781b4bb75033c53 Mon Sep 17 00:00:00 2001
From: Frederick Muriuki Muriithi
Date: Mon, 4 Apr 2022 11:11:36 +0300
Subject: Add a technical specification for GeneNetwork

* Issue:
https://issues.genenetwork.org/topics/documentation/gn-hacking-documentation.html

Add a technical specification document for GeneNetwork to detail the
considerations taken, and technologies used in implementing the system.
---
 docs/images/data_flow.png        | Bin 0 -> 23017 bytes
 docs/technical_specification.org | 221 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 221 insertions(+)
 create mode 100644 docs/images/data_flow.png
 create mode 100644 docs/technical_specification.org
diff --git a/docs/images/data_flow.png b/docs/images/data_flow.png
new file mode 100644
index 0000000..421f908
Binary files /dev/null and b/docs/images/data_flow.png differ
diff --git a/docs/technical_specification.org b/docs/technical_specification.org
new file mode 100644
index 0000000..049c78b
--- /dev/null
+++ b/docs/technical_specification.org
@@ -0,0 +1,221 @@
+#+STARTUP: contents inlineimages shrink
+#+OPTIONS: ^:{}
+#+TITLE: Technical Specification for GeneNetwork
+#+AUTHOR: Frederick Muriuki Muriithi
+
+* Introduction
+
+This document details the internal implementation of the software. It
+covers:
+
+- Data structures
+- Programming language(s)
+- Database schema(s)
+
+* Data
+
+Below is a flowchart of the data and the dependencies between the data:
+
+** TODO Complete this flowchart
+
+#+BEGIN_SRC dot :cmd dot :cmdline -Tpng :file images/data_flow.png :exports results
+  digraph Data {
+        Node[shape=none,margin=0];
+
+        species [label=<
+        <b>Species</b><br />
+        <i>e.g. mouse, rat, human, etc.</i>
+        >];
+
+        group [label=<<b>Group</b>>];
+
+        datasets [label=<<b>Datasets</b>>];
+
+        traits [label=<<b>Traits</b>>];
+
+        sample_data [label=<
+        <b>Sample Data</b><br />
+        <i>The actual data for each sample (individual/strain mean).</i>
+        >];
+
+
+          species -> group [style=dotted];
+          group -> datasets [style=dotted];
+          species -> datasets;
+          datasets -> traits;
+          traits -> sample_data;
+  }
+#+END_SRC
+
+#+RESULTS:
+[[file:images/data_flow.png]]
+  
+** Species
+
+This is a string that denotes the biological species to which the data under consideration belongs.
+Examples of these are:
+- mouse
+- rat
+- human
+
+** Group
+
+A "category" for datasets with unique samplelists.
+
+A group can contain data from a single study or from many studies.
+
+This is a crutch due to the database design, and is not a true representation of
+the data. We should strive to get rid of it.
+
+Group information is found in the *InbredSet* table in the database.
+
+** TODO Datasets
+
+These are groupings for the traits. The main differences between the datasets
+have to do with the relevant metadata for each grouping.
+
+There are three main groupings, which are:
+
+- Phenotype
+- Genotype
+- mRNA Expression (ProbeSet)
+
+*** Phenotype
+
+This grouping is for traits that are *NOT* expressions of genotype data (e.g.
+sex, weight, etc).
+
+Phenotype traits are authenticated at the trait level, since there are no
+sub-groupings of them within a group.
+
+*** Genotype
+
+The traits in this category are markers. Their sample data are the genotypes
+(encoded -1/0/1, though genotype probabilities could also be stored).
+
+This data also exists in genotype files (in both .geno and BIMBAM formats). This
+data is stored in the database to allow for its use as cofactors when mapping.
+
+*** mRNA Expression (ProbeSet)
+
+The traits in this category are mRNA transcripts from locations in the genome.
+The sample data for these traits are expression values for the traits.
+
+Individual datasets are from individual analyses, split by tissue.
+
+*** Temp
+
+This is a pseudo-grouping which consists of user-generated data. These have no
+real metadata. They are just lists of samples/strains (from a specific group)
+and values.
+
+The data for these is entered directly via the "Submit Trait" page, or it is derived from the PCA traits generated from the "Correlation Matrix" page.
+
+**** TODO Samplelists: What are they? Are they related to sample data in traits?
+
+** TODO Traits
+
+A trait is a set of values for each sample/strain.
+
+A trait can have metadata. The metadata varies with the dataset to which the trait belongs.
+
+The metadata is not useful, for the most part, in any computations, but it is
+useful for the display of the traits to the user, and in search.
+
+*** Sample Data
+
+This is the actual data for each sample.
+
+A sample can be an individual, or the strain mean (average of multiple
+individuals).
+
+The following are the expected "fields" for each instance of sample data:
+
+- *value*: a numerical value. For "strain mean" items, it is the mean value.
+- *variance*: /variance/ for an individual and /standard-error/ for "strain mean"
+- *ndata*/*num_cases*: Number of samples for "strain mean" values. NIL for
+  individuals (should probably be set to 1)
+
+Sample data can have metadata, referred to as "*Case Attributes*". This is
+mostly non-numeric data e.g. sex, treatment type, etc.
+
+**** TODO Can there be a trait with no sample data? What are the consequences of such a trait existing?
+
+*** Common Metadata
+
+- *num_overlap*: When doing correlations, it sets the number of overlapping
+  samples for all target traits (number of samples where both the primary and
+  target traits have values). This is not a true attribute; rather, it is a
+  temporary attribute assigned to a trait "object".
+- *haveinfo*: Get rid of this. An implementation detail. Indicates whether there
+  was information found in the database for the given trait.
+- *mean*: Average of all sample data values
+- *description_display*/*abbreviation*: Phenotype description and abbreviated
+  version (for display in figures or as a "trait symbol"). Depends on the
+  authentication.
+
+*** Phenotype Metadata
+
+- *name*: PublishXRef ID. In GeneNetwork, a combination of a three-letter group code (InbredSet.InbredSetCode) and this ID makes up the displayed name.
+- *lrs*: Likelihood Ratio Statistic. In the code, this is the maximum QTL[fn:1]
+  association value when the trait in question is mapped using qtlreaper[fn:2]
+- *LRS_location_repr*: The location of the top *QTL* (above) as a printable
+  string (chromosome + megabase location) e.g. "Chr2:20.019888"
+- *pvalue*: P-value for the maximum *QTL*. There's a chance that the
+  *lrs score* is derived from this. *VERIFY!!!!!*
+- *additive*: additive effect[fn:3] for the maximum *QTL*
+- *pre_publication_description*: Description of the phenotype before publication.
+  Depending on the authentication, the *pre_publication_description* (which is
+  not really human readable) is shown instead of the
+  *post_publication_description*.
+- *post_publication_description*: For published phenotypes, this is the
+  description of the phenotype after publication.
+- *pubmed_id*/*pubmed_link*/*pubmed_text*: For published phenotypes, these are
+  the identifier, link to, and text of the publication on the PubMed[fn:4] site
+
+*** mRNA Expression (ProbeSet) Metadata
+
+- *name*: Transcript name
+- *location*: Location of the mRNA transcript in the genome (chromosome +
+  megabase location)
+- *lrs*: Likelihood Ratio Statistic. In the code, this is the maximum QTL[fn:1]
+  association value when the trait in question is mapped using qtlreaper[fn:2]
+- *LRS_location_repr*: The location of the top *QTL* (above) as a printable
+  string (chromosome + megabase location) e.g. "Chr2:20.019888"
+- *pvalue*: P-value for the maximum *QTL*. There's a chance that the
+  *lrs score* is derived from this. *VERIFY!!!!!*
+- *additive*: additive effect[fn:3] for the maximum *QTL*
+- *symbol*: gene symbol
+- *sequence*: The DNA sequence
+- *probe_target_description*/*strand_probe*/*cell_id*: *VERIFY!!!!*
+
+*** Genotype Metadata
+
+- *location*: Location of the genetic marker
+
+* Operations on Traits
+
+**** TODO What are the valid operations on traits?
+**** TODO What data is expected for each operation?
+
+* Operations on Datasets
+
+**** TODO Are there operations on whole datasets?
+**** TODO If yes, what are they and what data does each operation expect?
+
+* Database Design and Schema
+
+....
+
+* Other headlines go here ...
+
+* Footnotes
+
+[fn:1] Quantitative Trait Locus (QTL): A genetic region that most strongly modulates a particular trait. https://en.wikipedia.org/wiki/Genome-wide_association_study (-log10 p-value is basically an alternative measure of association to LRS). The mapping results that lrs/LRS_location_repr/additive are derived from are usually represented by a Manhattan Plot (as shown in the first figure of the wiki page). The max QTL would be the peak value in that figure (and its associated chromosome location)
+
+[fn:2] qtlreaper: <provide link to qtlreaper>
+
+[fn:3] https://genenetwork.org/glossary/#a (Additive Allele Effect) – the GN glossary entry for this can explain it better than I can.
+It’s basically another value associated with the max QTL.
+
+[fn:4] PubMed https://pubmed.ncbi.nlm.nih.gov/
-- 
cgit 1.4.1