topics/data-uploads/gn-uploader-requirements.gmi



# GeneNetwork Uploader Requirements

*OR*: "Reattaching the Head on Frederick's Headless-Chicken Development :-)"

## Tags

* assigned: fredm, flisso
* type: doc, requirements
* priority: high
* status: open
* keywords: data uploads, gn-uploader, uploader, data uploader
* interested: acenteno, pjotrp, zsloan, robw, bonfacekilz

## Introduction

I (Frederick M. Muriithi) have been building the
=> https://git.genenetwork.org/gn-uploader/ GeneNetwork Data Uploader
project.

As part of that work, we have come across a number of requirements, both implicit and explicit, that must be met before users can upload new data to the system. This document discusses these requirements and offers possible solutions for some of them.

NOTE: This is an evolving document, and will change as the requirements, technology and understanding change.

### Direction

The purpose of the system is to allow users to upload their data and be able to analyse it with the system.

There are two major schools of thought regarding this:

* Basic upload of data for testing and analysis, with (a) curation step(s) later to make it conform to standards
* Upload with strict checks to ensure data conforms to standards before upload is successful

The first of the schools of thought will allow the users to upload data and play around with it, correcting any errors before it is finally acceptable enough to upload to the main database. This implies use of a data staging area, or even a separate testing database to hold the data. There might need to be a GeneNetwork system with access to the staging area or testing database to allow the user(s) to do analysis of these "incomplete" data.

The second school of thought requires that the data the user uploads be complete, i.e. the numerical data is correct and there is complete accompanying metadata for the data, including descriptions that fulfil all requirements. We might need a curation step here too.
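
A minimal sketch of how the two schools of thought differ in practice, assuming a file has already been parsed into rows of values. The names `CheckResult`, `check_rows` and `choose_destination`, and the destination labels "main", "staging" and "rejected", are all hypothetical and are not part of gn-uploader.

```python
from dataclasses import dataclass, field


@dataclass
class CheckResult:
    """Outcome of checking one uploaded file (hypothetical structure)."""
    errors: list = field(default_factory=list)

    @property
    def ok(self):
        return not self.errors


def check_rows(rows):
    """Collect one error message for every value that is not numeric."""
    result = CheckResult()
    for lineno, row in enumerate(rows, start=1):
        for column, value in row.items():
            try:
                float(value)
            except (TypeError, ValueError):
                result.errors.append(
                    f"row {lineno}, column {column!r}: {value!r} is not a number")
    return result


def choose_destination(result, strict):
    """Decide where an upload goes under each school of thought."""
    if strict:
        # Second school: only data that passes every check is accepted.
        return "main" if result.ok else "rejected"
    # First school: everything lands in the staging area for later curation.
    return "staging"
```

Under this sketch the checks are identical in both modes; only the consequence of a failed check changes, which would let the same error report be shown to the user either way.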

## General Notes

We should probably start modifying our tables to use FULL IDENTIFIERS for the data. The full identifier includes ALL identifiers for any parent tables/data in addition to the specific record identifier.

e.g. For data, say in a ProbeSet table, instead of simply ProbeSet(`Id`) as an identifier, the complete identifier would be something like:

> ProbeSet(`SpeciesId`, `PlatformId`, `Id`)

while for the ProbeFreeze table it would be something like:

> ProbeFreeze(`SpeciesId`, `PlatformId`, `PopulationId`, `ProbeSetId`, `Id`)

and for data in ProbeSetData, it would be something like:

> ProbeSetData(`SpeciesId`, `PlatformId`, `PopulationId`, `ProbeSetId`, `ProbeFreezeId`, `ProbeSetFreezeId`)

We can then have table indexes composed of one or more of the elements of the *FULL IDENTIFIER* for faster queries.

**NOTE 01**: The FULL IDENTIFIERS above should be hierarchical, beginning with the "oldest" ancestor and ending with the current record's ID.

**NOTE 02**: The examples of the FULL IDENTIFIERS above might not be complete. I'll update them as I tease more information from the database.
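
A minimal sketch of the ProbeSet full identifier as a composite key. It uses an in-memory SQLite database as a stand-in for MariaDB, and the `Name` column, the sample rows and the helper `probesets_for_species` are made up for illustration. Because the key starts with the oldest ancestor, a lookup on a leading part of it (here `SpeciesId`) can use the key's own index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ProbeSet (
        SpeciesId  INTEGER NOT NULL,
        PlatformId INTEGER NOT NULL,
        Id         INTEGER NOT NULL,
        Name       TEXT,
        -- Full identifier: oldest ancestor first, the record's own Id last.
        PRIMARY KEY (SpeciesId, PlatformId, Id)
    )
    """
)
# Illustrative rows only. Note that Id 1 appears under two different parents.
conn.executemany(
    "INSERT INTO ProbeSet VALUES (?, ?, ?, ?)",
    [(1, 10, 1, "probe_a"), (1, 10, 2, "probe_b"), (2, 20, 1, "probe_c")],
)


def probesets_for_species(conn, species_id):
    """Return the full identifiers of every ProbeSet record for one species."""
    cursor = conn.execute(
        "SELECT SpeciesId, PlatformId, Id FROM ProbeSet "
        "WHERE SpeciesId = ? ORDER BY PlatformId, Id",
        (species_id,),
    )
    return cursor.fetchall()
```

One consequence worth noting: with a full identifier, the record's own `Id` only needs to be unique within its parents, not across the whole table.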

## Data Categories

There are different data "categories" that could be uploaded into the system. Some of them can only be uploaded once the categories they depend on already exist in the system. The "categories" are:

### Species information

All the various data of interest to the system are grouped under one species or another. This means that there is a possibility that the user could want to upload data that belongs to a species that does not already exist on GeneNetwork. We might therefore need a way to allow the upload of new Species information, maybe with a verification step before the data hits the database.

> Species --> {{{ data of various sorts }}}

The important species information we need to collect is:

* Species name e.g. mouse, rat, blue gum
* Scientific name e.g. Mus musculus, Rattus norvegicus, Eucalyptus globulus, etc
* Family e.g. Vertebrates, Plants, etc.

Maybe we should do the whole "Classification" thing (Domain, Kingdom, Phylum, Class, Order, Family, Genus, Species) e.g. for mouse, Eukaryota, Animalia, Chordata, Mammalia, Rodentia, Muridae, Mus, Mus musculus. I do note, however, that this might be too much, what with the subphyla, subgenera, etc.

There is a benefit, however, of having such information: we can index the data by some of these fields, enabling us to query by say, 'Class' to quickly get all species of mammals in the system.
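
A minimal sketch of querying species by a classification rank, assuming each species record carries a few of the ranks as plain fields. The `Species` class, its field names and `index_by` are hypothetical; the class listed for blue gum follows the traditional convention of placing flowering plants in Magnoliopsida and should be checked against whichever taxonomy source we settle on.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Species:
    """A species record with a subset of classification ranks (illustrative)."""
    name: str
    scientific_name: str
    taxon_class: str
    order: str
    family: str


def index_by(species_list, rank):
    """Group species names by the value of one classification rank."""
    index = defaultdict(list)
    for species in species_list:
        index[getattr(species, rank)].append(species.name)
    return dict(index)


SPECIES = [
    Species("mouse", "Mus musculus", "Mammalia", "Rodentia", "Muridae"),
    Species("rat", "Rattus norvegicus", "Mammalia", "Rodentia", "Muridae"),
    # Class is an assumption here; plant classification varies by system.
    Species("blue gum", "Eucalyptus globulus", "Magnoliopsida", "Myrtales", "Myrtaceae"),
]
```

With this, `index_by(SPECIES, "taxon_class")["Mammalia"]` gives the mammals in the list. In the database the same effect would come from an index on the corresponding column.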

### Platform Information

Hierarchy
> Species --> Platform --> {{{ data of various sorts }}}

These are (sequencing?) platforms used for the generation of the data that is then to be uploaded. Some of the platforms are registered with
=> https://www.ncbi.nlm.nih.gov/geo/browse/?view=platforms NCBI's Gene Expression Omnibus system
and some are not.

Each platform also seems to be tied to a specific organism/species (please confirm).

The information regarding the platforms that we need is:

* SpeciesId (see 'Species information' above)
* Platform Name
* Name
* Title
* GEOPlatform (Optional): NCBI's GEO identifier for the platform e.g. GPL34216

=> https://gn1.genenetwork.org/webqtl/main.py?FormID=schemaShowPage#GeneChip rules for platform names.

The Platform details on NCBI also include sample data. If the user specifies a GEO platform ID, we could perhaps fetch such details and auto-populate the relevant tables. We'd (well, mostly I, Fred) need to figure out whether NCBI provides an API for such.
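
A minimal sketch of a first step towards that: checking that a user-supplied value looks like a GEO platform accession, and building a URL from which its details might be fetched as text. The helper names are hypothetical. The query parameters (`acc`, `targ`, `form`, `view`) reflect my understanding of GEO's accession display page and are an assumption to confirm against NCBI's documentation before relying on them. No network request is made here.

```python
import re
from urllib.parse import urlencode

# GEO's accession display page; parameter names below are assumptions to verify.
GEO_ACCESSION_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"
GPL_PATTERN = re.compile(r"GPL\d+")


def is_geo_platform_id(value):
    """Return True if the value looks like a GEO platform accession (GPL + digits)."""
    return GPL_PATTERN.fullmatch(value.strip()) is not None


def geo_platform_url(accession):
    """Build a URL requesting a brief, plain-text record for one platform."""
    if not is_geo_platform_id(accession):
        raise ValueError(f"not a GEO platform accession: {accession!r}")
    query = urlencode({
        "acc": accession.strip(),
        "targ": "self",
        "form": "text",
        "view": "brief",
    })
    return f"{GEO_ACCESSION_URL}?{query}"
```

Validating the accession up front means a typo in the optional GEOPlatform field can be reported to the user before any lookup is attempted.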

### Genotype Information

Hierarchy
> Species --> Genotype --> {{{ data of various sorts }}}

* SpeciesId (see 'Species information' above): This is an internal identifier for the Species information we have collected before.
* Name: Name of genotype
* Marker name:
* Chromosome:
* Megabases: This is location information in megabase pairs
* Sequence:
* Source: Provider of the information, e.g. an institute, person, etc. Is this 

We could index the genotype information by the following fields:

* SpeciesId: For faster queries for a particular species' genotypes
* ...

### Assembly Information

* mm8
* mm10
* mm11
* ...
etc.

I still do not wholly comprehend this. This might be related to the platform information.

From the 'Geno' table, I see the fields 'Mb_mm8' and 'Chr_mm8' that indicate that this information can affect data possibly downstream to the Geno data.

We probably need a way to separate these from the Geno table, while maintaining the link to the downstream data.

Tables affected by this information:

* Geno
* Chr_Length
* ...

### Population Information

This is the second major organisational grouping of the data under the Species, i.e. data is organised hierarchically under Species, then Population:

> Species --> Population --> {{{ data of various sorts }}}

* SpeciesId (see 'Species information' above)
* Population name: InbredSetName, Name, and FullName. What are the differences?
* GeneticType: e.g. riset, intercross, etc.
* Family:
* MappingMethodId:

### Samples/Cases/Individuals Information

Hierarchy
> Species --> Population --> Samples --> {{{ data of various sorts }}}

The data we need to collect/have for the samples are:

* SpeciesId (see 'Species Information' above)
* PopulationId (see 'Population Information' above)
* Name: Official sample name/symbol
* Alias: An alias for the sample (Optional)
* Symbol: short strain symbol used in graphs and tables - looks like a display thing; look into this.

** Samples might also be related to the platform: see 'Platform Information' above.

From the existing `Strain` table, it seems you can only have one-and-only-one sample for a particular species with a specific name.

> MariaDB [db_webqtl]> SHOW CREATE TABLE Strain;
> ...
> | Strain | CREATE TABLE `Strain` (
>   `Id` int(20) NOT NULL AUTO_INCREMENT,
>   `Name` varchar(100) DEFAULT NULL,
>   `Name2` varchar(100) DEFAULT NULL,
>   `SpeciesId` smallint(5) unsigned NOT NULL DEFAULT 0,
>   `Symbol` varchar(20) DEFAULT NULL,
>   `Alias` varchar(255) DEFAULT NULL,
>   PRIMARY KEY (`Id`),
>   UNIQUE KEY `Name` (`Name`,`SpeciesId`),
>   KEY `Symbol` (`Symbol`)
> ) ENGINE=InnoDB AUTO_INCREMENT=180927 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci |
> ...
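
A minimal sketch of what the `UNIQUE KEY (Name, SpeciesId)` constraint means for the uploader. It uses an in-memory SQLite database as a stand-in for MariaDB and keeps only the columns involved in the constraint; `add_sample` is a hypothetical helper, and the sample names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Strain (
        Id        INTEGER PRIMARY KEY AUTOINCREMENT,
        Name      TEXT,
        SpeciesId INTEGER NOT NULL DEFAULT 0,
        UNIQUE (Name, SpeciesId)
    )
    """
)


def add_sample(conn, name, species_id):
    """Insert a sample; return False if (Name, SpeciesId) already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO Strain (Name, SpeciesId) VALUES (?, ?)",
                (name, species_id),
            )
    except sqlite3.IntegrityError:
        return False
    return True
```

The same name is accepted again under a different species, but a second insert of the same name for the same species is refused. The uploader would therefore need to look up existing samples by name and species rather than blindly inserting them.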

We could index this information by any one of, or a combination of, the following fields:

* SpeciesId
* PopulationId

and maybe even drop the need for the 'StrainXRef' table. (*To be considered*)

### Tissue Information

Hierarchy
> Species --> ?? ... ?? --> Tissue --> {{{ data of various sorts }}}

Felix discovered the need for this when uploading the Arabidopsis thaliana data into the test database with the uploader. Expression data to be uploaded has to be linked to a tissue, and the existing tissue information (as of before 2024-02-22T09:45+03:00UTC) seems to only belong to vertebrates, not plants.

**Find out more about tissue and linkages to other data**

Tables:

* Tissue
* TissueProbeFreeze
* TissueProbeSetData
* TissueProbeSetFreeze
* TissueProbeSetXRef

...

### Expression Data Information


Hierarchy
> Species --> ?? ... ?? --> Expression Data --> {{{ data of various sorts }}}

The ' --> ?? ... ?? --> ' section winds through Platform, Population, Genotype, Tissue, Samples etc. before making its way to the expression data information. I still need to unwind the hierarchy and list the paths here.

Affects the following database tables:

* ProbeSet
* ProbeFreeze
* ProbeSetXRef
* ProbeSetData: Data matrix - numerical values for use in analyses
* ProbeSetSE: Standard error values for use in analyses

Some mandatory data we need:

* SpeciesId (see 'Species Information' above)
* PlatformId (see 'Platform Information' above)
* Name: Phenotype identifier for the platform above
* Gene Symbol: ...
* Chromosome:
* Megabases:
* Description: A description for the phenotype
* GeneId: Entrez gene ID from NCBI
* Strand_Gene/Strand_Probe: The DNA strand (+ or -) of the gene assigned to the phenotype. Leading or lagging strand.

Maybe the *Chromosome* and *Megabases* values could be replaced by a single link to a ChromosomeId or such... maybe a table linking the chromosome to its specific assembly e.g.

> Probeset(ChromosomeAssemblyId) --> (Id)ChromosomeAssembly(ChromosomeId) --> Chromosome(Id)
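
A minimal sketch of that linkage, using an in-memory SQLite database as a stand-in for MariaDB. Every column other than the three link columns named above is an assumption, as are the sample rows. In particular, this sketch keeps a per-record `Mb` position on the ProbeSet row, which is only one possible reading of the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Chromosome (
        Id        INTEGER PRIMARY KEY,
        SpeciesId INTEGER NOT NULL,
        Name      TEXT NOT NULL
    );
    -- One row per (chromosome, assembly) pair, e.g. chromosome 1 in mm8 and mm10.
    CREATE TABLE ChromosomeAssembly (
        Id           INTEGER PRIMARY KEY,
        ChromosomeId INTEGER NOT NULL REFERENCES Chromosome (Id),
        Assembly     TEXT NOT NULL
    );
    CREATE TABLE ProbeSet (
        Id                   INTEGER PRIMARY KEY,
        Name                 TEXT NOT NULL,
        ChromosomeAssemblyId INTEGER NOT NULL REFERENCES ChromosomeAssembly (Id),
        Mb                   REAL
    );
    INSERT INTO Chromosome VALUES (1, 1, '1');
    INSERT INTO ChromosomeAssembly VALUES (1, 1, 'mm8'), (2, 1, 'mm10');
    INSERT INTO ProbeSet VALUES (1, 'probe_a', 2, 12.5);
    """
)


def probeset_location(conn, name):
    """Return (chromosome name, assembly, Mb) for a ProbeSet record, or None."""
    return conn.execute(
        """
        SELECT c.Name, ca.Assembly, p.Mb
        FROM ProbeSet AS p
        JOIN ChromosomeAssembly AS ca ON ca.Id = p.ChromosomeAssemblyId
        JOIN Chromosome AS c ON c.Id = ca.ChromosomeId
        WHERE p.Name = ?
        """,
        (name,),
    ).fetchone()
```

The point of the middle table is that the assembly becomes data rather than part of a column name, unlike fields such as 'Mb_mm8' in the Geno table.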

...

### Publish Phenotype Data

We need a way for the uploader to distinguish between "Expression Data Phenotypes" and these "Publish Phenotypes".

I have not previously dealt with uploading "Publish Phenotype" (or "Classic Phenotype") data. This section begins an exploration of how that would work.

Database tables affected:

* Phenotype
* Publication
* PublishData
* PublishFreeze: Links to population
* PublishSE
* PublishXRef: Links to population, phenotype, publication, and PublishData

These have a form very similar to the expression data.

Some important data required:

* Units: Units of measurement for the phenotype
=> https://info.genenetwork.org/faq.php#q-22 Description for "Publish Phenotypes"
* Others? ...

...

## Descriptions

For "Publish Phenotypes", the descriptions have strict requirements, listed at the link below:
=> https://info.genenetwork.org/faq.php#q-22

=> /topics/data-uploads/data_dictionaries_20230222.txt.zip The list of "General Category and Ontology Terms" as of 2024-02-22T11:58+03:00UTC
is only usable for vertebrates. We will need an extended list for other species, e.g. plant species, invertebrates, etc.

## Display

Data should be saved to the database with as accurate and precise information as possible. This means the data in the database could have a large-ish number of decimal places.

The UI (User interface) should then truncate or round off those decimal places as needed to give the user a nice display of the data, and maybe a table or key with the non-modified values as necessary.
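
A minimal sketch of that separation, assuming values are stored as floating-point numbers and only shortened at display time. The function names and the default of three decimal places are illustrative choices, not existing behaviour.

```python
import math


def display_rounded(value, places=3):
    """Format a stored value rounded to a fixed number of decimal places."""
    return f"{value:.{places}f}"


def display_truncated(value, places=3):
    """Format a stored value with the extra decimal places cut off, not rounded."""
    factor = 10 ** places
    return f"{math.trunc(value * factor) / factor:.{places}f}"


stored = 12.3456789  # kept in the database at full precision
```

Here `display_rounded(stored)` gives "12.346" and `display_truncated(stored)` gives "12.345", while `stored` itself is never modified. Whether the UI rounds or truncates is a presentation decision that can change later without touching the data.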