summaryrefslogtreecommitdiff
path: root/topics/systems/mariadb/ProbeSetXRef.gmi
blob: cdf08f4ecddfea928322c1bb018ac16c381194c7 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
# ProbeSetXRef

## Tags

* keywords: ProbeSetXRef, database, mariadb, innodb
* type: enhancement, documentation
* assigned: pjotrp
* status: unclear
* priority: medium

## Table ProbeSetXRef

Juggling indexes and transforming to InnoDB led to a massive speed increase for typical ProbeSetXRef queries. Global search for brca2 went down from 13s to 4s! Try

=> http://genenetwork.org/gsearch?type=gene&terms=brca2

Not only global search is faster, *all* search is significantly faster.

=> http://genenetwork.org/search?species=mouse&group=BXD&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or=sh*&search_terms_and=&FormID=searchResult

File data size increased from 3.3GB to 5.4GB (1.6x).

Note that the full wildcard '*' search throws an error or empty results. That needs a fix.

A primary key index was introduced (required by innodb) and joined indexes were split in two. I also reduced the text index to 5 characters for each locus to reduce the data pump.

It turned out that a join on a mixed latin1 and utf8 field does not work which makes sense when you think about it. Here ProbeSetXRef.locus joins against geno.name.

The short of it is that it is ill advised to change the charset table by table! So, we'll stick with latin1 until we convert all tables. Needs proper testing anyway.

Some optimisation is still possible - removing the Old columns and perhaps the Locus VARCHAR column can be reduced in size.

Using prototocol from

=> ./move-to-innodb.gmi

### Check table structure

```
SHOW CREATE TABLE ProbeSetXRef;
```

```
| ProbeSetXRef | CREATE TABLE 'ProbeSetXRef' (
  'ProbeSetFreezeId' smallint(5) unsigned NOT NULL DEFAULT 0,
  'ProbeSetId' int(10) unsigned NOT NULL DEFAULT 0,
  'DataId' int(10) unsigned NOT NULL DEFAULT 0,
  'Locus_old' char(20) DEFAULT NULL,
  'LRS_old' double DEFAULT NULL,
  'pValue_old' double DEFAULT NULL,
  'mean' double DEFAULT NULL,
  'se' double DEFAULT NULL,
  'Locus' varchar(50) DEFAULT NULL,
  'LRS' double DEFAULT NULL,
  'pValue' double DEFAULT NULL,
  'additive' double DEFAULT NULL,
  'h2' float DEFAULT NULL,
  UNIQUE KEY 'ProbeSetId' ('ProbeSetFreezeId','ProbeSetId'),
  UNIQUE KEY 'DataId_IDX' ('DataId'),
  KEY 'Locus_IDX' ('Locus')
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
```

For every probe set (read dataset measuring point):


```
select * from ProbeSetXRef limit 2;
| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old  | LRS_old            | pValue_old | mean             | se                  | Locus      | LRS              | pValue | additive           | h2   |
+------------------+------------+--------+------------+--------------------+------------+------------------+---------------------+------------+------------------+--------+--------------------+------+
|                1 |          1 |      1 | 10.095.400 |   13.3971627898894 |      0.163 | 5.48794285714286 | 0.08525787814808819 | rs13480619 |  12.590069931048 |  0.269 |        -0.28515625 | NULL |
|                1 |          2 |      2 | D15Mit189  | 10.042057464356201 |      0.431 | 9.90165714285714 |  0.0374686634976217 | rs29535974 | 10.5970737900941 |  0.304 | -0.116783333333333 | NULL |
+------------------+------------+--------+------------+--------------------+------------+------------------+---------------------+------------+------------------+--------+--------------------+------+
```

where ProbeSetFreezeId is the dataset (experiment). ProbesetId refers
to the probe set information (measuring point). DataId points to the
data point. The other values are used for search.

```
MariaDB [db_webqtl]> select count(*) from ProbeSetXRef;
+----------+
| count(*) |
+----------+
| 48076905 |
+----------+
```

### Check GN1,2,3 code for use of table

```
rg ProbeSetXRef --color=always |less -R
rg ProbeSetXRef --type=py -l|fzf --preview="rg --color=always -A 20 ProbeSetXRef {}" --preview-window=right:85%:wrap
```

In SQL where statements these are usually combined

ProbeSetXRef.ProbeSetFreezeId
ProbeSetXRef.ProbeSetId

and sometimes

ProbeSetXRef.DataId
ProbeSetXRef.Locus

As can be seen from above table definition the indices are matching

```
  UNIQUE KEY 'ProbeSetId' ('ProbeSetFreezeId','ProbeSetId'),
  UNIQUE KEY 'DataId_IDX' ('DataId'),
  KEY 'Locus_IDX' ('Locus')
```

The combination of

```
select count(distinct ProbeSetFreezeId,ProbeSetId) from ProbeSetXRef limit 10;
```

is unique. Now we should also notice that DataId is unique and will make a smaller primary index. It is wort trying to split the combined ('ProbeSetFreezeId','ProbeSetId') into two indices. I'll do that once we are on innodb. Oh wait, I have to set the primary key first, I don't think I can change that. Let's create a new table for testing:

```
CREATE TABLE mytest AS SELECT * FROM ProbeSetXRef;
ALTER TABLE mytest ADD PRIMARY KEY(DataId);
ALTER TABLE mytest ADD KEY(ProbeSetFreezeId);
ALTER TABLE mytest ADD KEY(ProbeSetId);
SHOW CREATE TABLE mytest;
```


### Create test

Some select statement and maybe a page of GN2.

```
MariaDB [db_webqtl]> select count(*) from ProbeSetXRef where ProbeSetFreezeId<200 and ProbeSetId<1000 and pValue>0.5;
+----------+
| count(*) |
+----------+
|    19068 |
+----------+
1 row in set (1.752 sec)

MariaDB [db_webqtl]> select count(*) from ProbeSetXRef where ProbeSetFreezeId<300 and ProbeSetId<1000 and pValue>0.5;
+----------+
| count(*) |
+----------+
|    19068 |
+----------+
1 row in set (13.781 sec)
```

And InnoDB mytest has a 3x speedup for the real query:

```
MariaDB [db_webqtl]>  select count(*) from ProbeSetXRef where ProbeSetFreezeId<200 and ProbeSetId<1000 and pValue>0.5;
+----------+
| count(*) |
+----------+
|    19068 |
+----------+
1 row in set (1.748 sec)

MariaDB [db_webqtl]>  select count(*) from mytest where ProbeSetFreezeId<200 and ProbeSetId<1000 and pValue>0.5;
+----------+
| count(*) |
+----------+
|    19068 |
+----------+
1 row in set (0.445 sec)
```

Note the second query traverses the full file and the mytest version is the same speed.

### Create primary key

```
ALTER TABLE ProbeSetXRef
  ADD PRIMARY KEY(ProbeSetFreezeId,ProbeSetId);
```

```
MariaDB [db_webqtl]> DROP INDEX ProbeSetId ON ProbeSetXRef;
MariaDB [db_webqtl]> show index from ProbeSetXRef;
```


### Create indices
### Convert to innodb

```
ALTER TABLE ProbeSetXRef ENGINE = InnoDB;
```

### Change charset

There are some text fields for locus, I think it is safe to translate those to utf8.

```
ALTER TABLE mytest CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
ALTER TABLE mytest ADD KEY(Locus(5));
```

Note I specified a size for Locus to keep the index nimble.

### Update fulltext

For those cases see bottom of move-to-innodb.gmi.

### Run optimiser

```
OPTIMIZE NO_WRITE_TO_BINLOG TABLE mytest;
```

### Final table

```
RENAME TABLE ProbeSetXRef TO ProbeSetXRef_old, mytest TO ProbeSetXRef;
```

```
| mytest | CREATE TABLE 'mytest' (
  'ProbeSetFreezeId' smallint(5) unsigned NOT NULL DEFAULT 0,
  'ProbeSetId' int(10) unsigned NOT NULL DEFAULT 0,
  'DataId' int(10) unsigned NOT NULL DEFAULT 0,
  'Locus_old' char(20) DEFAULT NULL,
  'LRS_old' double DEFAULT NULL,
  'pValue_old' double DEFAULT NULL,
  'mean' double DEFAULT NULL,
  'se' double DEFAULT NULL,
  'Locus' varchar(50) DEFAULT NULL,
  'LRS' double DEFAULT NULL,
  'pValue' double DEFAULT NULL,
  'additive' double DEFAULT NULL,
  'h2' float DEFAULT NULL,
  PRIMARY KEY ('DataId'),
  KEY 'ProbeSetFreezeId' ('ProbeSetFreezeId'),
  KEY 'ProbeSetId' ('ProbeSetId'),
  KEY 'Locus_2' ('Locus'(5))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 |
```


### Check test

```
MariaDB [db_webqtl]> select count(*) from ProbeSetXRef where ProbeSetFreezeId<200 and ProbeSetId<1000 and pValue>0.5;
+----------+
| count(*) |
+----------+
|    19068 |
+----------+
1 row in set (0.058 sec)
```

Now I had 3 pages open

=> http://genenetwork.org/show_trait?trait_id=1427571_at&dataset=HC_M2_0606_P
=> http://genenetwork.org/search?species=mouse&group=BXD&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or=shh&search_terms_and=&FormID=searchResult
=> http://genenetwork.org/gsearch?type=gene&terms=brca2

The first two pages loaded the same. But the third (global search) took much longer than the 15 seconds it did before I modified ProbeSetXRef. Doh! It should be

```
It took the server 15.71517s seconds to process this page.
It took your browser 0.598 second(s) to render this page
```

The query timed out, so I could look in the slow query log we have.

(But first I added a standard test on sheepdog to make sure we see these regressions.)

```
vim /var/log/mysql/mysql-slow.log

 # Time: 211230  8:20:36
 # User@Host: webqtlout[webqtlout] @ localhost []
 # Thread_id: 11771  Schema: db_webqtl  QC_hit: No
 # Query_time: 470.965035  Lock_time: 0.000258  Rows_sent: 1017  Rows_examined: 115097830
 # Rows_affected: 0  Bytes_sent: 253724
SELECT ProbeSetFreeze.'Name', ProbeSetFreeze.'FullName', ProbeSet.'Name', ProbeSet.'Symbol', CAST(ProbeSet.'description' AS BINARY), CAST(ProbeSet.'Probe_Target_Description' AS BINARY), ProbeSet.'Chr', ProbeSet.'Mb', ProbeSetXRef.'Mean', ProbeSetXRef.'LRS', ProbeSetXRef.'Locus', ProbeSetXRef.'pValue', ProbeSetXRef.'additive', Geno.'Chr' as geno_chr, Geno.'Mb' as geno_mb FROM Species
  INNER JOIN InbredSet ON InbredSet.'SpeciesId'= Species.'Id'
  INNER JOIN ProbeFreeze ON ProbeFreeze.'InbredSetId' = InbredSet.'Id' INNER JOIN Tissue ON ProbeFreeze.'TissueId' = Tissue.'Id'
  INNER JOIN ProbeSetFreeze ON ProbeSetFreeze.'ProbeFreezeId' = ProbeFreeze.'Id'
  INNER JOIN ProbeSetXRef ON ProbeSetXRef.'ProbeSetFreezeId' = ProbeSetFreeze.'Id'
  INNER JOIN ProbeSet ON ProbeSet.'Id' = ProbeSetXRef.'ProbeSetId'
  LEFT JOIN Geno ON ProbeSetXRef.'Locus' = Geno.'Name' AND Geno.'SpeciesId' = Species.'Id'
  WHERE ((((MATCH (ProbeSet.Name, ProbeSet.description, ProbeSet.symbol, alias, GenbankId, UniGeneId, Probe_Target_Description) AGAINST ('sh*' IN BOOLEAN MODE))) AND ProbeSet.Id = ProbeSetXRef.ProbeSetId and ProbeSetXRef.ProbeSetFreezeId = 112 )) and ProbeSet.Id = ProbeSetXRef.ProbeSetId and ProbeSetXRef.ProbeSetFreezeId = 112 ORDER BY ProbeSet.symbol ASC;
```

this is a different query which logs after a >60s query (it is one of my monitors looking for 'sh*'. This is an interesting one to look at, but it does not match my query.

The actual query showed up as

```
SELECT
                Species.'Name' AS species_name,
                InbredSet.'Name' AS inbredset_name,
                Tissue.'Name' AS tissue_name,
                ProbeSetFreeze.Name AS probesetfreeze_name,
                ProbeSetFreeze.FullName AS probesetfreeze_fullname,
                ProbeSet.Name AS probeset_name,
                ProbeSet.Symbol AS probeset_symbol,
                CAST(ProbeSet.'description' AS BINARY) AS probeset_description,
                ProbeSet.Chr AS chr,
                ProbeSet.Mb AS mb,
                ProbeSetXRef.Mean AS mean,
                ProbeSetXRef.LRS AS lrs,
                ProbeSetXRef.'Locus' AS locus,
                ProbeSetXRef.'pValue' AS pvalue,
                ProbeSetXRef.'additive' AS additive,
                ProbeSetFreeze.Id AS probesetfreeze_id,
                Geno.Chr as geno_chr,
                Geno.Mb as geno_mb
                FROM Species
                INNER JOIN InbredSet ON InbredSet.'SpeciesId'=Species.'Id'
                INNER JOIN ProbeFreeze ON ProbeFreeze.InbredSetId=InbredSet.'Id'
                INNER JOIN Tissue ON ProbeFreeze.'TissueId'=Tissue.'Id'
                INNER JOIN ProbeSetFreeze ON ProbeSetFreeze.ProbeFreezeId=ProbeFreeze.Id
                INNER JOIN ProbeSetXRef ON ProbeSetXRef.ProbeSetFreezeId=ProbeSetFreeze.Id
                INNER JOIN ProbeSet ON ProbeSet.Id = ProbeSetXRef.ProbeSetId
                LEFT JOIN Geno ON ProbeSetXRef.Locus = Geno.Name AND Geno.SpeciesId = Species.Id
                WHERE ( MATCH (ProbeSet.Name,ProbeSet.description,ProbeSet.symbol,ProbeSet.alias,ProbeSet.GenbankId, ProbeSet.UniGeneId, ProbeSet.Probe_Target_Description) AGAINST ('brca2' IN BOOLEAN MODE) )
                AND ProbeSetFreeze.confidentiality < 1
                AND ProbeSetFreeze.public > 0
                ORDER BY species_name, inbredset_name, tissue_name, probesetfreeze_name, probeset_name
                LIMIT 6000;
```

which looks somewhat similar. There are some joins and I think my index size restriction of Locus(5) may be the problem here. The INNER JOIN keyword selects records that have matching values in both tables, so maybe the Locus has trouble matching.

Add a full index:

```
ALTER TABLE ProbeSetXRef ADD KEY(Locus);
```

The alternative is that mixed innodb and myisam joins are expensive, or more likely the mixed charsets are not playing well. I can try the EXPLAIN statement next.

Well, it turned out that a join on a mixed latin1 and utf8 field does not work which makes sense when you think about it. It joins against geno.name. Now in GN geno.name is usally compared against a string or ProbeSetXRef.locus.

The short of it is that it is ill advised to change the charset table by table! So, we'll stick with latin1 until we convert all tables.


```
SELECT
                Species.'Name' AS species_name,
                InbredSet.'Name' AS inbredset_name,
                Tissue.'Name' AS tissue_name,
                ProbeSetFreeze.Name AS probesetfreeze_name,
                ProbeSetFreeze.FullName AS probesetfreeze_fullname,
                ProbeSet.Name AS probeset_name,
                ProbeSet.Symbol AS probeset_symbol,
                CAST(ProbeSet.'description' AS BINARY) AS probeset_description,
                ProbeSet.Chr AS chr,
                ProbeSet.Mb AS mb,
                mytest.Mean AS mean,
                mytest.LRS AS lrs,
                mytest.'Locus' AS locus,
                mytest.'pValue' AS pvalue,
                mytest.'additive' AS additive,
                ProbeSetFreeze.Id AS probesetfreeze_id,
                Geno.Chr as geno_chr,
                Geno.Mb as geno_mb
                FROM Species
                INNER JOIN InbredSet ON InbredSet.'SpeciesId'=Species.'Id'
                INNER JOIN ProbeFreeze ON ProbeFreeze.InbredSetId=InbredSet.'Id'
                INNER JOIN Tissue ON ProbeFreeze.'TissueId'=Tissue.'Id'
                INNER JOIN ProbeSetFreeze ON ProbeSetFreeze.ProbeFreezeId=ProbeFreeze.Id
                INNER JOIN mytest ON mytest.ProbeSetFreezeId=ProbeSetFreeze.Id
                INNER JOIN ProbeSet ON ProbeSet.Id = mytest.ProbeSetId
                LEFT JOIN Geno ON mytest.Locus = Geno.Name AND Geno.SpeciesId = Species.Id
                WHERE ( MATCH (ProbeSet.Name,ProbeSet.description,ProbeSet.symbol,ProbeSet.alias,ProbeSet.GenbankId, ProbeSet.UniGeneId, ProbeSet.Probe_Target_Description) AGAINST ("brca2" IN BOOLEAN MODE) )
                AND ProbeSetFreeze.confidentiality < 1
                AND ProbeSetFreeze.public > 0
                ORDER BY species_name, inbredset_name, tissue_name, probesetfreeze_name,  probeset_name LIMIT 600;
```

Should run 600 rows in set (11.638 sec)

Changing the index now does 6000 rows in set (0.240 sec)!

The new table structure is

```
| ProbeSetXRef | CREATE TABLE 'ProbeSetXRef' (
  'ProbeSetFreezeId' smallint(5) unsigned NOT NULL DEFAULT 0,
  'ProbeSetId' int(10) unsigned NOT NULL DEFAULT 0,
  'DataId' int(10) unsigned NOT NULL DEFAULT 0,
  'Locus_old' char(20) DEFAULT NULL,
  'LRS_old' double DEFAULT NULL,
  'pValue_old' double DEFAULT NULL,
  'mean' double DEFAULT NULL,
  'se' double DEFAULT NULL,                                                                                              'Locus' varchar(50) DEFAULT NULL,                                                                                      'LRS' double DEFAULT NULL,
  'pValue' double DEFAULT NULL,
  'additive' double DEFAULT NULL,                                                                                        'h2' float DEFAULT NULL,
  PRIMARY KEY ('DataId'),
  KEY 'ProbeSetFreezeId' ('ProbeSetFreezeId'),
  KEY 'ProbeSetId' ('ProbeSetId'),
  KEY 'Locus_2' ('Locus'(5))
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
```