topics/documentation/gn-hacking-documentation.gmi


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158

# GeneNetwork Hacking Documentation

In addition to the user/client documentation, we need documentation for would be contributors to the project. This documentation can also reside in the

=> https://github.com/genenetwork/gn-docs GeneNetwork Documentation repository

This document is concerned with the system design and the documentation of the system that is relevant te the implementation of the system. It explores some areas of GeneNetwork that might be unclear, or require some documentation. It is a discussion document to help with clarifying some concepts and critiquing the documentation.

The goal of this document is to encourage and track the documentation that assist with the development of the system.

## Tags

* assigned: fredm
* type: documentation
* status: in progress
* priority: medium
* keywords: traits, datasets, hacking


## Unifying Principles

Like GNU Guix has the concepts of Packages, Monads, G-Expressions and the like as some of the unifying principles that help with understanding, contributing to and extending the project, GeneNetwork has some, albeit undocumented, principles unifying the system. Understanding these principles is crucial for the would-be contributor.

As far as I (fredm) can tell [as of 2022-03-11], the unifying principles for the system are:

* datasets
* traits

### Datasets

Datasets 'contain' or organise traits. The do not have much in terms of direct operations on them - most operations against a dataset operate agaist the traits within that dataset.

They can be envisioned as a bag of traits.

Common dataset traits are:

* dataset_name <string>: Name of the dataset
* dataset_type <string>: Type of dataset. Valid values are 'Temp', 'Publish', 'ProbeSet' and 'Genotype'
* group_name <string>: ??

### Traits

A trait is a abstract concept - with the somewhat concrete forms being

* Genotype
* ProbeSet
* Publish
* Temp

From the GeneNetwork2 repository, specifically the `wqflask.base.trait` module:

```
... a trait in webqtl, can be either Microarray, Published phenotype, genotype,
or user input trait
```

From the `wqflask.base.trait.GeneralTrait` class, the common properties for all the trait types above are:

* dataset <Dataset>: a pointer to the dataset that the trait is a member of
* trait_name <string>: the name of the trait
* cellid <?>: ?
* identification: <string?>: ?
* haveinfo <boolean>: ?
* sequence <?>: ?
* data <dict>: ? - In GN2, retrieval of this is indirect, via the dataset but it is a trait property.
* view <boolean>: ?
* locus <None or ?>: ?
* lrs <None or real number?>: Lifetime reproductive success?
* pValue <None or real number?>: ?
* mean <None or real number?>: ?
* additive <None or real number?>: ?
* num_overlap <None or integer?>: ?
* strand_probe <None or ?>: ?
* symbol <None or ?>: ?
* display_name <string>: a name to use in the display of the trait on the UI
* LRS_score_repr <string>: ?


The *data* property of a trait has items with at least the following important properties:

* sample/strain name- some sort of name e.g. BXD12
* value - a numerical value corresponding to the sample/strain
* variance - a numerical value corresponding to the sample/strain
* ndata - a numerical value

the trait properties above are the ones I have run into that seem to be used in computations mostly.

There are other properties like:

* mb <?>: Megabases?
* chr <?>: Chromosome?
* location <?>: ?

that are used less often.

Some extra properties for 'ProbeSet' traits:

* description <None or string>: ?
* probe_target_description <None or string>: ?

Some extra properties for 'Publish' traits:

* confidential <?>: ?
* pre_publication_description <string>: ?
* post_publication_description <string>: ?

When doing computations, it is unnecessary to load the display-only properties of a trait, deferring this to when/if we need to display such to the user/client.

### Molecular data vs. Phenotypes

How do these factor into the system?

According to my current understanding, these are different views into the data, and there might not be a clear distinction between them.

What is the mapping between these and the traits above?


## Update 2022-04-04

Work has began on writing a technical specification document detailing the current state of the system, and the possible future implementation goals of the system. See branch `technical-specification` in the GeneNetwork3 repository.


## Update 2022-07-18

### Platforms

Stored in the *GeneChip* table in the database.

* TODO: Elaborate what these are once you understand them

### Groups

These are in the *InbredSet* table in the database.

* What are they?

Groups are linked to the Species.

### Studies

Stored in the *ProbeFreeze* table in the database.

Linked to platforms (ChipId), groups (InbredSetId) and tissues (TissueId).

A study can have multiple experiments, each of which provides a dataset.

### Datasets

Stored in the *ProbeSetFreeze* table in the database.

Linked to studies (ProbeFreezeId) and averaging methods (AvgId).

From my understanding so far, a dataset is the results from a single experiment within a study.

### Annotations

Stored in *ProbeSet* table in the database