# ProbeSetData

Here we take a closer look at the ProbeSetData table and its indices. In the end we reduced the size by 25% and decided to keep the MyISAM format for this single table. In the future it may be replaced by lmdb or similar.

## Tags

* assigned: pjotrp
* status: unclear
* priority: medium
* type: enhancement
* keywords: database, mariadb, innodb

## Description

This is by far the largest table (~200Gb). I need to add disk space to be able to host it on the NVME storage, which means moving stuff around. The final move is the GN2 code; after that we have over 400Gb free.

This time I failed to port it to InnoDB (see Migration below):

* [X] Move database to larger drive (stop Mariadb for final stage)
* [X] Stop binary log (SET sql_log_bin = 0;)
* [X] Run second instance of mariadb using a Guix container, upgrade too?
* [X] Drop the indices
* [X] Try different sizes of innodb exports
* [X] Make (ordered) conversion and test performance
* [X] Rebuild indices
* [X] Test performance
* [ ] Muck out ibdata1 and transaction logs
* [ ] Restart binary log (SET sql_log_bin = 1;)

I disabled these and they need to be restored:

* [X] binary log
* [X] backup script
* [X] flush db - leave that off for now
* [X] cron system-check
* [X] systemd MYSQLD_OPTS setting

The following fetches the data to be used for mapping:

```
SELECT Strain.Name, ProbeSetData.value, ProbeSetSE.error, NStrain.count, Strain.Name2
FROM (ProbeSetData, ProbeSetFreeze, Strain, ProbeSet, ProbeSetXRef)
LEFT JOIN ProbeSetSE ON
  (ProbeSetSE.DataId = ProbeSetData.Id AND ProbeSetSE.StrainId = ProbeSetData.StrainId)
LEFT JOIN NStrain ON
  (NStrain.DataId = ProbeSetData.Id AND NStrain.StrainId = ProbeSetData.StrainId)
WHERE
  ProbeSet.Name = '4336695' AND
  ProbeSetXRef.ProbeSetId = ProbeSet.Id AND
  ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id AND
  ProbeSetFreeze.Name = 'UMUTAffyExon_0209_RMA' AND
  ProbeSetXRef.DataId = ProbeSetData.Id AND
  ProbeSetData.StrainId = Strain.Id
ORDER BY Strain.Name;
```
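As a quick sanity check (not part of the original protocol), point lookups on (Id, StrainId) should hit the DataId unique key; EXPLAIN confirms this:

```
-- Should report the DataId (Id,StrainId) unique key in the 'key' column:
EXPLAIN SELECT value FROM ProbeSetData WHERE Id = 1 AND StrainId = 1;
```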

Protocol from

=> ./move-to-innodb.gmi

Every table update has to follow the template:

## Reboot Tux01 and Tux02

=> ../reboot-tux01-tux02.gmi

## Table upgrade

### Check recent backups

* [X] ascertain there is a backup
* [X] copy original files

### Make temporary table

```
mysql 1.5K Jun 27  2021 ProbeSetData.frm
mysql  63G Dec  4 22:15 ProbeSetData.MYD
mysql 118G Dec 28 08:52 ProbeSetData.MYI
mysql 1.5K Jan  1 20:45 mytest.frm
mysql  20G Jan  1 21:34 mytest.ibd
```

If you have enough space, you can create a copy of the actual table and do the work on that:

Original:

```
| ProbeSetData | CREATE TABLE `ProbeSetData` (
  `Id` int(10) unsigned NOT NULL DEFAULT 0,
  `StrainId` int(20) NOT NULL,
  `value` float NOT NULL,
  UNIQUE KEY `DataId` (`Id`,`StrainId`),
  KEY `strainid` (`StrainId`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
```

New:

```
DROP table mytest2;
CREATE table mytest2(
  Id INT unsigned NOT NULL DEFAULT 0,
  StrainId MEDIUMINT unsigned NOT NULL DEFAULT 0,
  value float NOT NULL,
  primary key(Id,StrainId),
  key (StrainId)) CHARACTER SET utf8mb4 ;
```

Note that loading latin1 is slower than utf8! I tried.

And fill it, killing quite a few birds with one stone :). You may want to disable the binary logger before running:

```
INSERT INTO mytest2 (Id,StrainId,value) SELECT Id,StrainId,value FROM ProbeSetData LIMIT 51631478;
```
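For example, the bulk copy can be wrapped in a session-level binary-log toggle, reusing the SET sql_log_bin switch from the checklist above (a sketch; requires SUPER privilege):

```
SET sql_log_bin = 0;  -- session-level: skip binary logging for this bulk copy
INSERT INTO mytest2 (Id,StrainId,value)
  SELECT Id,StrainId,value FROM ProbeSetData LIMIT 51631478;
SET sql_log_bin = 1;  -- re-enable for the session
```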


The INSERT above makes a table at 1% of the original size, about 2.3GB. I wonder why the full conversion took up 330Gb before we ran out of disk space.

Show progress with:

```
SELECT TABLE_SCHEMA,TABLE_NAME,DATA_LENGTH,AVG_ROW_LENGTH,TABLE_ROWS
  FROM information_schema.TABLES
  WHERE DATA_LENGTH > 1000000000 ORDER BY DATA_LENGTH;
```

We may add ORDER BY Id,StrainId next time, as in the sketch below.
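A sketch of that ordered variant, which should give better key locality in the new table:

```
INSERT INTO mytest2 (Id,StrainId,value)
  SELECT Id,StrainId,value FROM ProbeSetData
  ORDER BY Id,StrainId
  LIMIT 51631478;
```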

Once the process is done, you can rename the tables:

```
DROP TABLE ProbeSetData;
RENAME TABLE mytest2 TO ProbeSetData;
```

### Check table structure

```
SHOW CREATE TABLE ProbeSetData;
```

This is a really simple table:

```
| ProbeSetData | CREATE TABLE `ProbeSetData` (
  `Id` int(10) unsigned NOT NULL DEFAULT 0,
  `StrainId` int(20) NOT NULL,
  `value` float NOT NULL,
  UNIQUE KEY `DataId` (`Id`,`StrainId`),
  KEY `strainid` (`StrainId`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
```

```
MariaDB [db_webqtl]> select * from ProbeSetData limit 2;
+----+----------+-------+
| Id | StrainId | value |
+----+----------+-------+
|  1 |        1 | 5.742 |
|  1 |        2 | 5.006 |
+----+----------+-------+
2 rows in set (0.000 sec)
```

But large:

```
MariaDB [db_webqtl]> select count(*) from ProbeSetData;
+------------+
| count(*)   |
+------------+
| 5141631478 |
+------------+
1 row in set (0.000 sec)

```

This scans the full table (MyISAM version):

```
MariaDB [db_webqtl]> select count(distinct id) from ProbeSetData;
+--------------------+
| count(distinct id) |
+--------------------+
|           48119047 |
+--------------------+
1 row in set (4 min 40.035 sec)
```

Do this also on Penguin2.

```
| ProbeSetData | CREATE TABLE `ProbeSetData` (
  `Id` int(10) unsigned NOT NULL DEFAULT 0,
  `StrainId` smallint(5) unsigned NOT NULL DEFAULT 0,
  `value` float NOT NULL,
  UNIQUE KEY `DataId` (`Id`,`StrainId`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
```

The table layout is a bit puzzling.

```
MariaDB [db_webqtl]> select max(StrainId) from ProbeSetData;
+---------------+
| max(StrainId) |
+---------------+
|         65777 |
+---------------+
MariaDB [db_webqtl]> select max(Id) from ProbeSetData;
+----------+
| max(Id)  |
+----------+
| 91406791 |
+----------+
```

For StrainId, MEDIUMINT is a better fit: its unsigned range is 0 to 16,777,215. Id should be an INT (instead of a BIGINT) for the time being. That will make the table and its indices smaller and faster. Note that MEDIUMINT does not save RAM, just disk space, but that can improve IO performance by 25%.

When a value overflows, MariaDB throws an error, so it is safe to pick the smallest possible size: even if we outgrow the range we can quickly resize the table in, say, 10 years. That is, if we are still using SQL.
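What that overflow protection looks like, as a hypothetical check (strict mode is the MariaDB default since 10.2.4):

```
-- MEDIUMINT UNSIGNED tops out at 16777215; one more is rejected:
INSERT INTO mytest2 (Id, StrainId, value) VALUES (1, 16777216, 0.0);
-- ERROR 1264 (22003): Out of range value for column 'StrainId' at row 1
```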

### Check GN1,2,3 code for use of table

```
rg ProbeSetData --color=always |less -R
rg ProbeSetData --type=py -l|fzf --preview="rg --color=always -A 20 ProbeSetData {}" --preview-window=right:85%:wrap
```

No surprises from the code scan.

### Create test

Some select statement and maybe a page of GN2. We can share the tests from ProbeSetXRef.

Page of GN2 is the mapping page, e.g.

=> http://genenetwork.org/show_trait?trait_id=1427571_at&dataset=HC_M2_0606_P

### Create primary key

```
ALTER TABLE ProbeSetData
  ADD PRIMARY KEY(Id,StrainId);
```

(no quotes for column names)

### Create indices

```
SHOW INDEX FROM ProbeSetData;
DROP INDEX strainid ON ProbeSetData;
```

### Convert to InnoDB

```
ALTER TABLE ProbeSetData ENGINE = InnoDB;
```
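The conversion copies the whole table and can take hours. MariaDB reports how far along a long-running ALTER TABLE is in the Progress column of the process list, so it can be watched with:

```
-- Watch the copy progress of the running ALTER TABLE (Progress column):
SHOW PROCESSLIST;
```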

### Change charset

```
ALTER TABLE ProbeSetData CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
```

Use utf8 instead!
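That is, something like this sketch (note this table has only numeric columns, so the charset mostly affects metadata):

```
ALTER TABLE ProbeSetData CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
```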

### Update fulltext

For those cases see bottom of move-to-innodb.gmi.

### Run optimiser

```
OPTIMIZE NO_WRITE_TO_BINLOG TABLE ProbeSetData;
```

After running the optimiser, rename the tables:

```
RENAME TABLE orig TO orig_old, mytest TO orig;
```

### Check test

## Migration (20230312)

ProbeSetData is the final table that needs to be migrated to InnoDB. The main problem is that it is huge and transforming the table runs out of disk space. As a first step I made a borg snapshot:

```
root@tux01:/export2/backup# borg create --progress --stats borg-mariadb-snapshot::migrate1a /export/local/home/mariadb/
```

Next, copy the database to a new partition:

```
root@tux01:/export4/local/home/mariadb/database/db_webqtl# rsync -vaP /var/lib/mysql/db_webqtl/* . --delete --bwlimit=20M
```

Note that I throttle the speed because the system can become quite unusable at full copy speed. Next I stopped MariaDB and made sure the copy was complete. After restarting MariaDB I could continue work on the copy using a guix shell instance, as described in

=> setting-up-local-development-database.gmi

Steps, run as a normal user:

```
tux01:/export4/local/home/mariadb$ ~/opt/guix-pull/bin/guix pull -p ~/opt/guix-latest
. ~/opt/guix-latest/etc/profile
mkdir var
guix shell -C -N coreutils sed mariadb --share=var=/var
mysqld_safe --datadir='./database' --user=$USER --nowatch --socket=/var/run/mysqld/mysqld.sock
mysql --socket=/var/run/mysqld/mysqld.sock -uwebqtlout -p db_webqtl
```

OK, now it is running and we can start experimenting with the table outside the main database setup. Remember we had:

```
ProbeSetData | CREATE TABLE `ProbeSetData` (
  `Id` int(10) unsigned NOT NULL DEFAULT 0,
  `StrainId` int(20) NOT NULL,
  `value` float NOT NULL,
  UNIQUE KEY `DataId` (`Id`,`StrainId`),
  KEY `strainid` (`StrainId`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
```

First drop the indices:

```
DROP INDEX strainid ON ProbeSetData;
DROP INDEX DataId ON ProbeSetData;
```

Of course dropping an index starts making a copy of the whole table and takes hours(!). This is why we need over 200Gb free both in the DB directory and in the tempdir of the MariaDB server.
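Both locations can be checked from the server itself (a sketch):

```
-- Where the rebuild writes its temporary copy, and where the table lives:
SHOW VARIABLES LIKE 'tmpdir';
SHOW VARIABLES LIKE 'datadir';
```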

```
select count(Id) from ProbeSetData;
+------------+
| count(Id)  |
+------------+
| 5173425135 |
+------------+
MariaDB [db_webqtl]> select max(Id),max(StrainId) from ProbeSetData;
+----------+---------------+
| max(Id)  | max(StrainId) |
+----------+---------------+
| 92199993 |         71224 |
+----------+---------------+
MariaDB [db_webqtl]> select * from ProbeSetData limit 4;
+----+----------+-------+
| Id | StrainId | value |
+----+----------+-------+
|  1 |        1 | 5.742 |
|  1 |        2 | 5.006 |
|  1 |        3 | 6.079 |
|  1 |        4 | 6.414 |
+----+----------+-------+
```

```
ALTER TABLE ProbeSetData MODIFY StrainId mediumint UNSIGNED NOT NULL;
```

Now the table is 58Gb without indices. Convert to InnoDB and add the indices:

```
CREATE INDEX id_index ON ProbeSetData(Id);
CREATE INDEX strainid_index ON ProbeSetData(StrainId);
ALTER TABLE ProbeSetData ENGINE = InnoDB;
```
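To see where the size lands, the same information_schema query used for progress above can be narrowed to this table (a sketch):

```
SELECT TABLE_NAME,
       ROUND(DATA_LENGTH/1024/1024/1024)  AS data_gb,
       ROUND(INDEX_LENGTH/1024/1024/1024) AS index_gb
  FROM information_schema.TABLES
  WHERE TABLE_NAME = 'ProbeSetData';
```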

Now we are at 150Gb.

The conversion to InnoDB, annoyingly, leads to an exceedingly large file. I think we'll keep the MyISAM version of ProbeSetData for the time being. It does not do text comparisons, so we should see no slowdowns from mixing it with InnoDB tables. From this exercise I know we can reduce the size a little. ProbeSetData can probably be made smaller by combining Id+StrainId (where Id is the dataset id), and perhaps even by compressing the value into a smaller float. Using lmdb may not be that beneficial because, even though the data is a vector, it would be costly to traverse by StrainId. Checking the code base, however, that rarely happens without selecting on Id first, so keys on Id pointing to vectors of strain ids and values may work. Introducing lmdb would require rewriting about 5 queries in each of GN1 and GN2.
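For illustration, the dominant access pattern looks like this: fetch the whole vector for one dataset Id, which maps naturally onto a store keyed on Id alone:

```
-- Fetch one data vector; StrainId is only used within an already-selected Id:
SELECT StrainId, value FROM ProbeSetData WHERE Id = 1 ORDER BY StrainId;
```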

## Notes

I found it is a bad idea to remove large .ibd files by hand, because MariaDB wants to recreate them to replay the transaction log.