# Data structures


* Species, e.g. 'Mouse', are split into groups, such as 'BXD bone studies'
* Experiments are described in metadata
* A group can contain multiple families (see rat below) divided into subgroups
* A trait, e.g. 'body weight', is a vector of measured data points that belongs to a study
* A genotype vector can also be a trait
* A trait is always a member of a group
* A trait is part of a study/sample described in metadata
* Theoretically traits can belong to multiple groups
* An attribute can be a trait
* An attribute can be a cofactor (also a vector)
* An attribute is like a trait, but not used in computations, other than as a cofactor
* Attributes are editable by group owners
* We can have shared vocabulary for traits and attributes

But

* A trait is shown with attributes as cofactors in the mapping tool
* A cofactor can be a trait
* A cofactor can be an attribute
* A cofactor is not stored in the database - it is an optional vector

(i.e., in terminology cofactors and attributes and traits somewhat overlap)

## Groups

In GN, datasets are organised in groups. On the main menu you can see
BXD datasets are grouped into 'BXD aged hippocampus' or 'BXD bone studies'
reflecting higher level interests. Groups are formed around a strain
(here BXD) and are linked to experiments and sample lists.

A group, family, cohort, population is almost always a set of N cases or
individuals or isogenic animals treated as 'individuals'.  The BXD family
of strains is a good and complex example. We can treat the 100+ BXD strains
as if they were 100 genetic individuals and collapse traits for 10
animals each into one value with an error term. Or we can treat all 100 x
10 animals as actual individuals. Even though we use the same animals in
both cases, they are treated in GN as two separate GROUPS.

From a computational perspective, a GROUP is a set of cases that can be used,
in particular, to compute correlations among traits. Coming back to the two
BXD groups (N = 100 strain means, or N = 1000 individuals), we can only
compute correlations within the mean data or within the individual data,
not across the two.
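
As a minimal sketch of the two views of the same animals, the snippet below collapses per-animal values into one value per strain with an error term. The function name, the strain labels, and the body weights are made up for illustration; this is not how GN stores or computes these values.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def collapse_to_strain_means(individual_values):
    """Collapse (strain, value) pairs into {strain: (mean, se, n)}."""
    by_strain = defaultdict(list)
    for strain, value in individual_values:
        by_strain[strain].append(value)

    summary = {}
    for strain, values in by_strain.items():
        n = len(values)
        # The standard error of the mean needs at least two replicates.
        se = stdev(values) / sqrt(n) if n > 1 else None
        summary[strain] = (mean(values), se, n)
    return summary


# Hypothetical body weights (g) for individual animals: the "individuals" group.
individuals = [
    ("BXD1", 21.0), ("BXD1", 23.0), ("BXD1", 22.0),
    ("BXD2", 27.0), ("BXD2", 29.0),
]

# The "strain means" group: one value per strain, with SE and N.
strain_means = collapse_to_strain_means(individuals)
```

The two structures have different lengths and different case identifiers, which is why a correlation can only be computed inside one of them.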

Groups are maintained in the inaptly named 'InbredSet' table. E.g.

```
MariaDB [db_webqtl]> select * from InbredSet limit 3;
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+-------------+--------------------------------------------------+-------------+-------------+---------------+
| Id | InbredSetId | InbredSetName     | Name   | SpeciesId | FullName          | public | MappingMethodId | GeneticType | Family                                           | FamilyOrder | MenuOrderId | InbredSetCode |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+-------------+--------------------------------------------------+-------------+-------------+---------------+
|  1 |           1 | BXD               | BXD    |         1 | BXD Family        |      2 | 1               | riset       | Reference Populations (replicate average, SE, N) |           1 |           0 | BXD           |
|  2 |           2 | B6D2F2 OHSU Brain | B6D2F2 |         1 | B6D2F2 OHSU Brain |      2 | 1               | intercross  | Crosses, AIL, HS                                 |           3 |           0 | NULL          |
|  4 |           4 | AXB/BXA           | AXBXA  |         1 | AXB/BXA Family    |      2 | 1               | NULL        | Reference Populations (replicate average, SE, N) |           1 |           0 | AXB           |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+-------------+--------------------------------------------------+-------------+-------------+---------------+
3 rows in set (0.000 sec)
```


## What is a trait?

A trait is a vector of floats or integers for a GROUP. Body weight
is a simple example of a trait. Eye color, if coded numerically, is a
trait.

A trait will usually have metadata, but the trait data itself boils
down to a single vector of values for a specific group that can be used
to compute univariate statistics (means, variances, etc.),
correlations between traits within a group, maps of trait variance for
that group, and higher order properties. A trait could be a vector of
more complex numerical types than just scalars. But up to now all
traits that we have mapped or studied in GN are just simple vectors of
numbers.
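
To make "correlations between traits within a group" concrete, here is a small sketch that treats each trait as a mapping from case to value and correlates only the cases that both traits share. The dictionary representation, the function name, and the numbers are assumptions for illustration, not the GN implementation.

```python
from math import sqrt


def pearson_within_group(trait_a, trait_b):
    """Pearson r over the cases for which both traits have a value."""
    shared = [
        case for case in trait_a
        if trait_a[case] is not None and trait_b.get(case) is not None
    ]
    if len(shared) < 3:
        raise ValueError("need at least three shared cases")

    xs = [trait_a[case] for case in shared]
    ys = [trait_b[case] for case in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a trait with zero variance has no correlation")
    return sxy / sqrt(sxx * syy)


# Two hypothetical traits measured in the same group; None marks a missing value.
body_weight = {"BXD1": 22.0, "BXD2": 28.0, "BXD5": 25.0, "BXD6": None}
brain_weight = {"BXD1": 0.44, "BXD2": 0.50, "BXD5": 0.47, "BXD6": 0.45}

r = pearson_within_group(body_weight, brain_weight)
```

Keying values by case rather than by position keeps the two vectors aligned even when one trait is missing a case.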

Traits can also be genotypes that are coded as integers (usually). Some
genotypes are coded as floats if they represent genotype probabilities.

In GeneNetwork a single trait value (a scalar) always belongs to a
genetically-defined unit/case/individual/clone/strain/F1 hybrid. A single
trait vector (what I usually mean when I talk about a trait) always belongs
to at least one GROUP.

A trait vector can belong to multiple Groups if the groups overlap in
membership. For example, the rat Hybrid Rat Diversity Population (HRDP)
consists of the HXB family, the LEXF family, and a bunch of other inbred
rat strains. HRDP traits can therefore be split into subgroups. This is a
pain from a programming perspective, since a data matrix of TRAITS-by-GROUP
may be a sparse matrix. And the GUI becomes more complex, since the user may
want to slice and dice the GROUP in multiple ways, for example—just map the
HXB family, just map the LEXF family, or map everything together.
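
A rough sketch of the slicing problem, assuming a trait is stored sparsely as a strain-to-value mapping and subgroup membership is a set of strain names. The membership table, the strain names, and the values below are placeholders for illustration only.

```python
# Hypothetical subgroup membership; the strain names are placeholders.
SUBGROUPS = {
    "HXB": {"HXB1", "HXB2"},
    "LEXF": {"LEXF1A", "LEXF2B"},
}


def slice_trait(trait, subgroup=None):
    """Return the trait restricted to one subgroup, or a copy of everything."""
    if subgroup is None:
        return dict(trait)
    members = SUBGROUPS[subgroup]
    return {strain: value for strain, value in trait.items() if strain in members}


# A sparse trait: only some strains of the wider group were measured.
hrdp_trait = {"HXB1": 1.2, "HXB2": 1.5, "LEXF1A": 0.9, "SHR": 2.1}

hxb_only = slice_trait(hrdp_trait, "HXB")    # just map the HXB family
lexf_only = slice_trait(hrdp_trait, "LEXF")  # just map the LEXF family
everything = slice_trait(hrdp_trait)         # map everything together
```

Storing only measured strains avoids filling a TRAITS-by-GROUP matrix with empty cells, at the cost of an explicit membership lookup for every slice.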

## Case attributes

An attribute can (theoretically) be any trait as defined above, or it
can be a short alphanumeric code used primarily as a cofactor in
analysis. Sex is a good example of an attribute that can be coded as
an integer (0 or 1 or x=unknown) and used computationally as if it
were any other trait, or it can be coded as M and F and used for
display and as a cofactor. But some attributes are not even
cofactors. For example, an Attribute column may define which strains
or cases were used in Study X by Roy et al in 2021. In this situation,
the GUI and the attribute are used to quickly sort or select or
exclude particular cases.
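
The two uses of an attribute described above can be sketched as follows. The case records, the `in_study_x` column name, and the choice of F = 0 and M = 1 are assumptions for illustration; the text only says that sex can be coded as 0 or 1, with x for unknown.

```python
# Hypothetical case attributes for one group.
cases = [
    {"id": "A1", "strain": "BXD1", "sex": "F", "in_study_x": True},
    {"id": "A2", "strain": "BXD1", "sex": "M", "in_study_x": False},
    {"id": "A3", "strain": "BXD2", "sex": "x", "in_study_x": True},
]

# Assumed coding: which sex maps to 0 and which to 1 is arbitrary here.
SEX_CODES = {"F": 0, "M": 1}


def sex_cofactor(cases):
    """Code sex numerically for use as a cofactor; unknown becomes None."""
    return [SEX_CODES.get(case["sex"]) for case in cases]


def select_cases(cases, attribute):
    """Return the ids of cases flagged by a non-computational attribute."""
    return [case["id"] for case in cases if case.get(attribute)]


cofactor = sex_cofactor(cases)                   # numeric vector, one entry per case
study_x_ids = select_cases(cases, "in_study_x")  # used only to filter or sort
```

The first function yields a vector that can enter a computation; the second only picks rows, which is all that a study-membership column is good for.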

Attributes are a recent addition to GeneNetwork. The motivation was to
provide the user with a display of the most important cofactors of a
study.  For example, in our large study of lifespan in the BXD mice,
we wanted to provide "low level" data on each individual animal.

In this situation, the sex, strain, diet, ear tag number, resource
reference ID, the epoch of the BXD strain (when the BXD strain was made),
and even the study in which cases were specifically used—all of those are
considered attributes.

The last three attribute columns that you see in the screenshot below
(KM20, SR21, EW21) refer to three papers (e.g. SR21 = Suheeta Roy 2021)
that have used subsets of these animals. None of these attributes are used
directly in computations. They are used to sort and filter. But notice that
one of these ATTRIBUTES is also the most important trait in this study: the
Longevity column attribute is the same as the VALUE (Trait BDL_10001). This
highlights the fact that a trait can become an attribute, but not all
attributes can become traits. Who would compute a correlation against ear
tag number?

=> https://genenetwork.org/show_trait?trait_id=10001&dataset=BXD-LongevityPublish

Attributes generally belong to a GROUP, not to an individual TRAIT. But
for display purposes, every trait will show a set of ATTRIBUTES. This is a
source of potential confusion.

Who can edit case attributes?  Attributes should only be editable by
the GROUP owner or perhaps by GeneNetwork curators.

How do we make sure we can compare attributes between datasets if the
naming is haphazard?  Attributes are only a GROUP property (e.g. BXD
Family, AKXD Family, GTEx). The way I think about them today, they
cannot be used computationally across GROUPS. They can be used across
traits within GROUPS.

Can we have global case attributes?  We could have shared vocabulary
for attributes, but I do not know how a global case attribute would be
used computationally.  For example, sex, age, body weight, lab
identifiers, date of analysis, will almost always be useful attributes
(and also some of those are traits) no matter what the GROUP, provided
the GROUP consists of true individuals.  So a common vocabulary of
ATTRIBUTES makes great sense, but computationally ATTRIBUTES, as I think
about them today, belong just to a group (or overlapping set of
groups).

However, it would be cool to compare differences in gene expression in
the liver of BXD mice, HXB rats, and GTEx humans as a function of sex
and age.

## RDF and relationship databases

Naming is hard. Properties might have been a more descriptive term than attributes. But is
a publication or a cage name a property or attribute of an individual
mouse? Not really. RDFS has a vocabulary for such relationships:

=> https://www.sti-innsbruck.at/sites/default/files/courses/fileadmin/documents/semweb13-14/SW-Lecture7.pdf

I think that, rather than bringing everything under one term, we should be using relationships. So a publication would be a property of a trait and therefore of a group.

* publication belongs to trait
* individual partOf trait

That way publications are connected to an individual mouse.

* trait belongs to group

That way publications are connected to the group.

That also makes the ownership path clear. Publications are handled at the trait level, not at the group level. This makes the discussion around how attributes are handled *much* clearer. Meanwhile

* individual partOf epoch
* status partOf individual

Status is also arbitrary. Here status might be individualLocation or
something more descriptive. I.e.

* individual hasLocation location

But it is also clear that the 'status' attribute is handled at the individual level and not at the group level. Maybe I get this wrong, but then we can decide at what level 'status' belongs for editing. Right?

Anyway, you can see I am talking about relationships that are descriptive and can be parsed by AI. This also brings out where today's attributes belong.
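
The relationships listed above can be sketched as plain subject-predicate-object triples. This uses ordinary Python tuples rather than an RDF library, and every identifier (`publication:P1`, `trait:T1`, and so on) is a made-up placeholder; only the predicate names follow the bullets above.

```python
# Hypothetical triples; identifiers are placeholders, predicates follow the text.
TRIPLES = [
    ("publication:P1", "belongsTo", "trait:T1"),
    ("individual:M1", "partOf", "trait:T1"),
    ("trait:T1", "belongsTo", "group:G1"),
    ("individual:M1", "partOf", "epoch:E1"),
    ("individual:M1", "hasLocation", "location:L1"),
]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is a triple."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is a triple."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def publications_for_individual(individual):
    """Publications reached through the traits an individual is part of."""
    found = set()
    for trait in objects(individual, "partOf"):
        found |= subjects("belongsTo", trait)
    return found


def groups_for_publication(publication):
    """Groups reached through the traits a publication belongs to."""
    found = set()
    for trait in objects(publication, "belongsTo"):
        found |= objects(trait, "belongsTo")
    return found
```

In this sketch a publication is never attached to a group or an individual directly; both connections are derived by walking through the trait, which is the ownership path argued for above.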

Anything enumerable can be used as a covariate; that includes location, handler, etc.

The RRID is a strain level property that we display as if it is an individual level property.