topics/data-uploads/inserting-data.gmi


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169

# Data Uploads: Inserting Data

## Tags

* assigned: bonfacem, zachs
* type: bug
* priority: high
* status: in progress
* keywords: data uploads

### Introduction

The current uploader work documented in `editing-data.gmi` only caters
for the following operations by people with the right access:

- Editing phenotype and probeset metadata

- Editing sample data from a published phenotype

- Deleting sample data from published phenotypes

- Inserting data from strains that already exist in a ".geno" file

However, one of our beta users ran into a problem when attempting to
insert new trait data for BDL_10001.  ATM, we can't add new samples.
Also, adding case attributes for new samples is a manual process.  New
samples cannot be added because new genotype files need to be
generated when new strains/ samples are added.  Addition of these
genotype files has always been manual.  We can add strain data
(inserting new strains into the Strain table(s) if it's not already
there) by hacking existing code.  However, this will show up-- the
strains-- in a separate "sample group" on the trait page and won't be
used for mapping until the new .geno file containing the new strains
is generated.

How is this genotype file generated?

[From Rob] Genotypes should be added by code.  For some species like
humans, this won't happen; but for experimental animals and plant,
many families may grow and spread e.g. BXDs grow to the BXD Dax, then
they expand tot he BXDx Collaborative Cross DAX.

New strains are added by entering strains based on groups(InbredSetId) and SpeciesId.  This is why we have the "StrainXRef" table.  This is demonstrated below:

```
MariaDB [db_webqtl]> desc Strain;

+-----------+----------------------+------+-----+---------+----------------+

| Field     | Type                 | Null | Key | Default | Extra          |

+-----------+----------------------+------+-----+---------+----------------+

| Id        | int(20)              | NO   | PRI | NULL    | auto_increment |

| Name      | varchar(100)         | YES  | MUL | NULL    |                |

| Name2     | varchar(100)         | YES  |     | NULL    |                |

| SpeciesId | smallint(5) unsigned | NO   |     | 0       |                |

| Symbol    | varchar(20)          | YES  | MUL | NULL    |                |

| Alias     | varchar(255)         | YES  |     | NULL    |                |

+-----------+----------------------+------+-----+---------+----------------+

6 rows in set (0.001 sec)

MariaDB [db_webqtl]> desc StrainXRef;

+------------------+----------------------+------+-----+---------+-------+

| Field            | Type                 | Null | Key | Default | Extra |

+------------------+----------------------+------+-----+---------+-------+

| InbredSetId      | smallint(5) unsigned | NO   | PRI | 0       |       |

| StrainId         | int(20)              | NO   | PRI | NULL    |       |

| OrderId          | int(20)              | YES  |     | NULL    |       |

| Used_for_mapping | char(1)              | YES  |     | N       |       |

| PedigreeStatus   | varchar(255)         | YES  |     | NULL    |       |

+------------------+----------------------+------+-----+---------+-------+

5 rows in set (0.001 sec)

MariaDB [db_webqtl]> select max(Id) from Strain;

+---------+

| max(Id) |

+---------+

|   66085 |

+---------+

1 row in set (0.000 sec)

MariaDB [db_webqtl]> insert into Strain (Name,
Name2,SpeciesId,Symbol,Alias) value ("Test1","Test1",30,"Test1","Test1");

Query OK, 1 row affected (0.000 sec)

MariaDB [db_webqtl]> select max(Id) from Strain;

+---------+

| max(Id) |

+---------+

|   66086 |

+---------+

1 row in set (0.000 sec)

MariaDB [db_webqtl]> select * from Strain where Id=66086;

+-------+-------+-------+-----------+--------+-------+

| Id    | Name  | Name2 | SpeciesId | Symbol | Alias |

+-------+-------+-------+-----------+--------+-------+

| 66086 | Test1 | Test1 |        30 | Test1  | Test1 |

+-------+-------+-------+-----------+--------+-------+

1 row in set (0.000 sec)

```

### Problems

- Integration with genotype files complicates things e.g. we can only
  generate individual BXD genotypes from the "main" BXD genotype files
  iff when we have the strains of each individual stored somewhere.

- CaseAttributes need to be updated manually.  One option is to enable
  this using the UI.  ATM we need to query the Case Attribute tables
  to look the BXD strain for each individual when generating the
  genotype files.

- We should ideally be able to generate a set of genotype files or a
  set of DB tables with all possible 20,000 BXD family genomes.  As a
  reference see David's smoothed WGS-based genotype files.

- Genotype files will most likely not be static.  Soon, we should be
  able to support the need for them to change during runtime.

- When the sample list for BDL_10001 is updated, will the sample list
  for all other records be synchronized automatically.


### Goal

- (Low hanging fruit) Ability to add insert new strains.

- There are 1000s of mice with computable genomes that have never been
  born yet.  Can we compute their phenotype?  This is as difficult as
  getting to Mars.