From 2edb63e9f09172755c87b148d781b4bb75033c53 Mon Sep 17 00:00:00 2001
From: Frederick Muriuki Muriithi
Date: Mon, 4 Apr 2022 11:11:36 +0300
Subject: Add a technical specification for GeneNetwork

* Issue: https://issues.genenetwork.org/topics/documentation/gn-hacking-documentation.html

Add a technical specification document for GeneNetwork to detail the considerations taken, and technologies used in implementing the system.
---
 docs/images/data_flow.png        | Bin 0 -> 23017 bytes
 docs/technical_specification.org | 221 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 221 insertions(+)
 create mode 100644 docs/images/data_flow.png
 create mode 100644 docs/technical_specification.org

diff --git a/docs/images/data_flow.png b/docs/images/data_flow.png
new file mode 100644
index 0000000..421f908
Binary files /dev/null and b/docs/images/data_flow.png differ
diff --git a/docs/technical_specification.org b/docs/technical_specification.org
new file mode 100644
index 0000000..049c78b
--- /dev/null
+++ b/docs/technical_specification.org
@@ -0,0 +1,221 @@
+#+STARTUP: contents inlineimages shrink
+#+OPTIONS: ^:{}
+#+TITLE: Technical Specification for GeneNetwork
+#+AUTHOR: Frederick Muriuki Muriithi
+
+* Introduction
+
+This document will detail the internal implementation for the software.
+
+- Data structures
+- Programming languages
+- Database schema(s)
+- Programming Language(s)
+
+* Data
+
+Below is a flowchart of the data and the dependencies between the data:
+
+** TODO Complete this flowchart
+
+#+BEGIN_SRC dot :cmd dot :cmdline -Tpng :file images/data_flow.png :exports results
+  digraph Data {
+    Node[shape=none,margin=0];
+
+    species [label=<
+      Species
+      e.g. mouse, rat, human, etc.
+    >];
+
+    group [label=<Group>];
+
+    datasets [label=<Datasets>];
+
+    traits [label=<Traits>];
+
+    sample_data [label=<
+      Sample Data
+      The actual data for each sample (individual/strain mean).
+    >];
+
+
+    species -> group [style=dotted];
+    group -> datasets [style=dotted];
+    species -> datasets;
+    datasets -> traits;
+    traits -> sample_data;
+  }
+#+END_SRC
+
+#+RESULTS:
+[[file:images/data_flow.png]]
+
+** Species
+
+This is a string that denotes which biological species the data under consideration belongs to.
+Examples of these are:
+- mouse
+- rat
+- human
+
+** Group
+
+A "category" for datasets with unique samplelists.
+
+A group can contain data from a single study or from many studies.
+
+This is a crutch due to the database design, and is not a true representation of
+the data. We should strive to get rid of it.
+
+Group information is found in the *InbredSet* table in the database.
+
+** TODO Datasets
+
+These are groupings for the traits. The main differences between the datasets
+have to do with the relevant metadata for each grouping.
+
+There are three main groupings, which are:
+
+- Phenotype
+- Genotype
+- mRNA Expression (ProbeSet)
+
+*** Phenotype
+
+This grouping is for traits that are *NOT* expressions of genotype data (e.g.
+sex, weight, etc.).
+
+Phenotype traits are authenticated at the trait level, since there are no
+sub-groupings of them within a group.
+
+*** Genotype
+
+The traits in this category are markers. Their sample data are the genotypes
+(encoded -1/0/1, though genotype probabilities could also be stored).
+
+This data also exists in genotype files (in both .geno and BIMBAM formats). This
+data is stored in the database to allow for its use as cofactors when mapping.
+
+*** mRNA Expression (ProbeSet)
+
+The traits in this category are mRNA transcripts from locations in the genome.
+The sample data for these traits are expression values for the traits.
+
+Individual datasets are from individual analyses, split by tissue.
+
+*** Temp
+
+This is a pseudo-grouping which consists of user-generated data. These have no
+real metadata.
+They are just lists of samples/strains (from a specific group)
+and values.
+
+The data for these is entered directly via the "Submit Trait" page, or is derived from the PCA traits generated from the "Correlation Matrix" page.
+
+**** TODO Samplelists: What are they? Are they related to sample data in traits?
+
+** TODO Traits
+
+A trait is a set of values for each sample/strain.
+
+A trait can have metadata. The metadata varies with the dataset to which the trait belongs.
+
+The metadata is not useful, for the most part, in any computations, but it is useful
+for the display of the traits to the user, and in search.
+
+*** Sample Data
+
+This is the actual data for each sample.
+
+A sample can be an individual, or the strain mean (average of multiple
+individuals).
+
+The following are the expected "fields" for each instance of sample data:
+
+- *value*: a numerical value. For "strain mean" items, it is the mean value.
+- *variance*: /variance/ for an individual and /standard-error/ for "strain mean"
+- *ndata*/*num_cases*: Number of samples for "strain mean" values. NIL for
+  individuals (should probably be set to 1)
+
+Sample data can have metadata, referred to as "*Case Attributes*". This is
+mostly non-numeric data e.g. sex, treatment type, etc.
+
+**** TODO Can there be a trait with no sample data? What are the consequences of such a trait existing?
+
+*** Common Metadata
+
+- *num_overlap*: When doing correlations, it sets the number of overlapping
+  samples for all target traits (number of samples where both the primary and
+  target traits have values). This is not a true attribute, rather, it is a
+  temporary attribute assigned to a trait "object".
+- *haveinfo*: Get rid of this. An implementation detail. Indicates whether there
+  was information found in the database for the given trait.
+- *mean*: Average of all sample data values
+- *description_display*/*abbreviation*: Phenotype description and abbreviated
+  version (for display in figures or as a "trait symbol"). Depends on the
+  authentication.
+
+*** Phenotype Metadata
+
+- *name*: PublishXRef ID. In GeneNetwork, the displayed name is a combination of a three-letter group code (InbredSet.InbredSetCode) and this ID.
+- *lrs*: Likelihood Ratio Statistic. In the code, this is the maximum QTL[fn:1]
+  association value when the trait in question is mapped using qtlreaper[fn:2]
+- *LRS_location_repr*: The location of the top *QTL* (above) as a printable
+  string (chromosome + megabase location) e.g. "Chr2:20.019888"
+- *pvalue*: P-value for the maximum *QTL*. There's a chance that the
+  *lrs score* is derived from this. *VERIFY!!!!!*
+- *additive*: additive effect[fn:3] for the maximum *QTL*
+- *pre_publication_description*: Description of the phenotype before publication.
+  Depending on the authentication, the *pre_publication_description* (which is
+  not really human readable) is shown instead of the
+  *post_publication_description*.
+- *post_publication_description*: For published phenotypes, this is the
+  description of the phenotype after publication.
+- *pubmed_id*/*pubmed_link*/*pubmed_text*: For published phenotypes, these are
+  the identifier, link to, and text of the publication on the PubMed[fn:4] site
+
+*** mRNA Expression (ProbeSet) Metadata
+
+- *name*: Transcript name
+- *location*: Location of the mRNA transcript in the genome (chromosome +
+  megabase location)
+- *lrs*: Likelihood Ratio Statistic. In the code, this is the maximum QTL[fn:1]
+  association value when the trait in question is mapped using qtlreaper[fn:2]
+- *LRS_location_repr*: The location of the top *QTL* (above) as a printable
+  string (chromosome + megabase location) e.g. "Chr2:20.019888"
+- *pvalue*: P-value for the maximum *QTL*. There's a chance that the
+  *lrs score* is derived from this.
+  *VERIFY!!!!!*
+- *additive*: additive effect[fn:3] for the maximum *QTL*
+- *symbol*: gene symbol
+- *sequence*: The DNA sequence
+- *probe_target_description*/*strand_probe*/*cell_id*: *VERIFY!!!!*
+
+*** Genotype
+
+- *location*: Location of the genetic marker
+
+* Operations on Traits
+
+**** TODO What are the valid operations on traits?
+**** TODO What data is expected for each operation?
+
+* Operations on Datasets
+
+**** TODO Are there operations on whole datasets?
+**** TODO If yes, what are they and what data does each operation expect?
+
+* Database Design and Schema
+
+....
+
+* Other headlines go here ...
+
+* Footnotes
+
+[fn:1] Quantitative Trait Locus (QTL): A genetic region that most strongly modulates a particular trait. https://en.wikipedia.org/wiki/Genome-wide_association_study (-log10 p-value is basically an alternative measure of association to LRS). The mapping results that lrs/LRS_location_repr/additive are derived from are usually represented by a Manhattan Plot (as shown in the first figure of the wiki page). The max QTL would be the peak value in that figure (and its associated chromosome location)
+
+[fn:2] qtlreaper:
+
+[fn:3] https://genenetwork.org/glossary/#a (Additive Allele Effect) – the GN glossary entry for this can explain it better than I can.
+It’s basically another value associated with the max QTL.
+
+[fn:4] PubMed https://pubmed.ncbi.nlm.nih.gov/
--
cgit v1.2.3