Introduction
This document details the internal implementation of the software, covering:
- Data structures
- Programming language(s)
- Database schema(s)
Data
Below is a flowchart of the data and the dependencies between the data:
TODO Complete this flowchart
Species
This is a string that denotes the biological species to which the data under consideration belongs. Examples are:
- mouse
- rat
- human
Group
A "category" for datasets with unique sample lists.
A group can contain data from a single study or from many studies.
This is a crutch due to the database design, and is not a true representation of the data. We should strive to get rid of it.
Group information is found in the InbredSet table in the database.
TODO Datasets
These are groupings for the traits. The main differences between the datasets have to do with the relevant metadata for each grouping.
There are three main groupings:
- Phenotype
- Genotype
- mRNA Expression (ProbeSet)
Phenotype
This grouping is for traits that are NOT expressions of genotype data (e.g. sex, weight, etc.).
Phenotype traits are authenticated at the trait level, since there are no sub-groupings of them within a group.
Genotype
The traits in this category are markers. Their sample data are the genotypes (encoded -1/0/1, though genotype probabilities could also be stored).
This data also exists in genotype files (in both .geno and BIMBAM formats). This data is stored in the database to allow for its use as cofactors when mapping.
mRNA Expression (ProbeSet)
The traits in this category are mRNA transcripts from locations in the genome. The sample data for these traits are expression values for the traits.
Individual datasets are from individual analyses, split by tissue.
Temp
This is a pseudo-grouping which consists of user-generated data. These have no real metadata. They are just lists of samples/strains (from a specific group) and values.
The data for these is either entered directly via the "Submit Trait" page or derived from the PCA traits generated from the "Correlation Matrix" page.
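The groupings above could be modelled as a small enumeration. This is only a sketch: the names DatasetType and has_metadata are hypothetical and do not come from the existing code.

```python
from enum import Enum


class DatasetType(Enum):
    """Hypothetical enumeration of the dataset groupings described above."""

    PHENOTYPE = "Phenotype"
    GENOTYPE = "Genotype"
    PROBESET = "ProbeSet"  # mRNA Expression
    TEMP = "Temp"          # pseudo-grouping for user-generated data

    @property
    def has_metadata(self) -> bool:
        # Temp traits are just samples/strains and values, with no real metadata.
        return self is not DatasetType.TEMP
```

A value read from elsewhere (for example the string "ProbeSet") can then be converted with DatasetType("ProbeSet"), which raises ValueError for unknown groupings.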
TODO Traits
A trait is a set of values for each sample/strain.
A trait can have metadata. The metadata varies with the dataset in which the trait belongs.
The metadata is, for the most part, not useful in any computations, but it is useful for displaying the traits to the user, and in search.
Sample Data
This is the actual data for each sample.
A sample can be an individual, or the strain mean (the average of multiple individuals).
The following are the expected "fields" for each instance of sample data:
- value: a numerical value. For "strain mean" items, it is the mean value.
- variance: the variance for an individual, or the standard error for "strain mean" items
- ndata/num_cases: Number of samples for "strain mean" values. NIL for individuals (should probably be set to 1)
Sample data can have metadata, referred to as "Case Attributes". This is mostly non-numeric data e.g. sex, treatment type, etc.
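As a minimal sketch of the fields above, sample data could be held in a structure like the following. The names SampleData, sample_name, case_attributes and strain_mean are hypothetical, and computing the standard error from the sample standard deviation is an assumption, not a description of the current implementation.

```python
from dataclasses import dataclass, field
from math import sqrt
from statistics import mean, stdev
from typing import Dict, Optional, Sequence


@dataclass
class SampleData:
    """One sample's data for a trait (hypothetical structure)."""

    sample_name: str
    value: Optional[float]
    # Variance for an individual, standard error for a "strain mean" item.
    variance: Optional[float] = None
    # Number of individuals behind a "strain mean"; None (NIL) for individuals.
    ndata: Optional[int] = None
    # "Case Attributes", e.g. {"sex": "F"}; mostly non-numeric.
    case_attributes: Dict[str, str] = field(default_factory=dict)


def strain_mean(strain: str, values: Sequence[float]) -> SampleData:
    """Collapse the values of several individuals into one "strain mean" item."""
    n = len(values)
    if n == 0:
        raise ValueError("at least one value is required")
    # The standard error is undefined for a single observation.
    std_err = stdev(values) / sqrt(n) if n > 1 else None
    return SampleData(strain, mean(values), std_err, n)
```

For instance, strain_mean("BXD1", [10.0, 12.0, 14.0]) yields an item whose value is the mean of the three individuals and whose ndata is 3; the strain name here is purely illustrative.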
Common Metadata
- num_overlap: When doing correlations, it sets the number of overlapping samples for all target traits (number of samples where both the primary and target traits have values). This is not a true attribute, rather, it is a temporary attribute assigned to a trait "object".
- haveinfo: Get rid of this. An implementation detail. Indicates whether there was information found in the database for the given trait.
- mean: Average of all sample data values
- description_display/abbreviation: Phenotype description and abbreviated version (for display in figures or as a "trait symbol"). Depends on the authentication.
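The two computed attributes, num_overlap and mean, can be sketched as below. This assumes a trait's sample data is available as a mapping from sample name to value, with None for missing values; the type alias and function names are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical representation: sample name -> value (None when missing).
SampleValues = Dict[str, Optional[float]]


def num_overlap(primary: SampleValues, target: SampleValues) -> int:
    """Count the samples for which both traits have a value."""
    return sum(
        1
        for sample, value in primary.items()
        if value is not None and target.get(sample) is not None
    )


def trait_mean(values: SampleValues) -> Optional[float]:
    """Average of all present sample values, or None if there are none."""
    present = [v for v in values.values() if v is not None]
    return sum(present) / len(present) if present else None
```

Keeping num_overlap as a standalone function, rather than a stored field, reflects that it is a temporary attribute that only makes sense relative to a particular primary trait.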
Phenotype Metadata
- name: PublishXRef ID. In GeneNetwork, the displayed name is a combination of a three-letter group code (InbredSet.InbredSetCode) and this ID
- lrs: Likelihood Ratio Statistic. In the code, this is the maximum QTL [1] association value when the trait in question is mapped using qtlreaper [2]
- LRS_location_repr: The location of the top QTL (above) as a printable string (chromosome + megabase location) e.g. "Chr2:20.019888"
- pvalue: P-value for the maximum QTL. There's a chance that the lrs score is derived from this. VERIFY!!!!!
- additive: additive effect [3] for the maximum QTL
- pre_publication_description: Description of the phenotype before publication. Depending on the authentication, the pre_publication_description (which is not really human readable) is shown instead of the post_publication_description.
- post_publication_description: For published phenotypes, this is the description of the phenotype after publication.
- pubmed_id/pubmed_link/pubmed_text: For published phenotypes, these are the identifier, link to, and text of the publication on the PubMed [4] site
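The two derived display strings in this list, the displayed name and LRS_location_repr, can be sketched as follows. The function names, the underscore separator in the displayed name, and the six-decimal formatting (inferred from the "Chr2:20.019888" example) are assumptions, not confirmed details of the implementation.

```python
def display_name(inbredset_code: str, publishxref_id: int) -> str:
    """Combine the group code and the PublishXRef ID into a displayed name.

    The "_" separator is an assumption.
    """
    return f"{inbredset_code}_{publishxref_id}"


def lrs_location_repr(chromosome: str, megabases: float) -> str:
    """Render the top QTL location as chromosome + megabase position.

    Six decimal places is an assumption based on the example in the text.
    """
    return f"Chr{chromosome}:{megabases:.6f}"
```

Taking the chromosome as a string rather than an integer leaves room for non-numeric chromosomes such as "X".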
mRNA Expression (ProbeSet) Metadata
- name: Transcript name
- location: Location of the mRNA transcript in the genome (chromosome + megabase location)
- lrs: Likelihood Ratio Statistic. In the code, this is the maximum QTL [1] association value when the trait in question is mapped using qtlreaper [2]
- LRS_location_repr: The location of the top QTL (above) as a printable string (chromosome + megabase location) e.g. "Chr2:20.019888"
- pvalue: P-value for the maximum QTL. There's a chance that the lrs score is derived from this. VERIFY!!!!!
- additive: additive effect [3] for the maximum QTL
- symbol: gene symbol
- sequence: The DNA sequence
- probe_target_description/strand_probe/cell_id: VERIFY!!!!
Genotype Metadata
- location: Location of the genetic marker
Operations on Traits
Operations on Datasets
Database Design and Schema
….
Other headlines go here …
Footnotes:
[1] Quantitative Trait Locus (QTL): A genetic region that most strongly modulates a particular trait. See https://en.wikipedia.org/wiki/Genome-wide_association_study (the -log10 p-value is basically an alternative measure of association to LRS). The mapping results from which lrs, LRS_location_repr and additive are derived are usually represented by a Manhattan plot (as shown in the first figure of the wiki page). The max QTL is the peak value in that figure (and its associated chromosome location).
[2] qtlreaper: <provide link to qtlreaper>
[3] Additive Allele Effect: see https://genenetwork.org/glossary/#a. The GN glossary entry for this can explain it better than I can. It’s basically another value associated with the max QTL.