summaryrefslogtreecommitdiff
path: root/issues/systems/mariadb/ProbeSetData.gmi
diff options
context:
space:
mode:
authorArun Isaac2022-07-01 20:13:04 +0530
committerArun Isaac2022-07-01 20:13:04 +0530
commit0386602e96a1774ce436e0d9263d40583328da79 (patch)
treeeefde10b191cbdac9dfd236af701e951c76ba743 /issues/systems/mariadb/ProbeSetData.gmi
parent2f84a722e5944bd4458abbd45fd637a9286bab04 (diff)
downloadgn-gemtext-0386602e96a1774ce436e0d9263d40583328da79.tar.gz
Rescue issues lost in the topics/systems/mariadb directory.
Diffstat (limited to 'issues/systems/mariadb/ProbeSetData.gmi')
-rw-r--r--issues/systems/mariadb/ProbeSetData.gmi289
1 files changed, 289 insertions, 0 deletions
diff --git a/issues/systems/mariadb/ProbeSetData.gmi b/issues/systems/mariadb/ProbeSetData.gmi
new file mode 100644
index 0000000..672b64c
--- /dev/null
+++ b/issues/systems/mariadb/ProbeSetData.gmi
@@ -0,0 +1,289 @@
+# ProbeSetData
+
+## Tags
+
+* assigned: pjotrp
+* status: unclear
+* priority: medium
+* type: enhancement
+* keywords: database, mariadb, innodb
+
+## Description
+
+This is by far the largest table (~200Gb). I need to add disk space to be able to host it on the NVME and move stuff around. Final move is GN2 code and we have over 400Gb free.
+
+This time I failed porting to InnoDB. Next time:
+
+* [ ] Move database to large drive
+* [ ] Run second instance of mariadb, upgrade too
+* [ ] Export a small version of ISAM to ISAM
+* [ ] Try different sizes of innodb exports
+* [ ] Make (ordered) conversion and test performance
+* [ ] Muck out ibdata1 and transaction logs
+
+I disabled these and they need to be restored:
+
+* [X] binary log
+* [X] backup script
+* [ ] flush db - leave that off for now
+* [X] cron system-check
+* [X] systemd MYSQLD_OPTS setting
+
+The following fetches the data to be used for mapping:
+
+```
+SELECT
+ Strain.Name, ProbeSetData.value, ProbeSetSE.error, NStrain.count, Strain.Name2
+ FROM
+ (ProbeSetData, ProbeSetFreeze,
+ Strain, ProbeSet, ProbeSetXRef)
+ left join ProbeSetSE on
+ (ProbeSetSE.DataId = ProbeSetData.Id AND ProbeSetSE.StrainId = ProbeSetData.StrainId)
+ left join NStrain on
+ (NStrain.DataId = ProbeSetData.Id AND
+ NStrain.StrainId = ProbeSetData.StrainId)
+ WHERE
+ ProbeSet.Name = '4336695' AND ProbeSetXRef.ProbeSetId = ProbeSet.Id AND
+ ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id AND
+ ProbeSetFreeze.Name = 'UMUTAffyExon_0209_RMA' AND
+ ProbeSetXRef.DataId = ProbeSetData.Id AND
+ ProbeSetData.StrainId = Strain.Id
+ Order BY
+ Strain.Name;
+```
+
+Prototocol from
+
+=> ./move-to-innodb.gmi
+
+Every table update has to follow the template:
+
+## Reboot Tux01 and Tux02
+
+=> ../reboot-tux01-tux02.gmi
+
+
+## Table upgrade
+
+### Check recent backups
+
+* [X] ascertain there is a backup
+* [X] copy original files
+
+### Make temporary table
+
+```
+mysql 1.5K Jun 27 2021 ProbeSetData.frm
+mysql 63G Dec 4 22:15 ProbeSetData.MYD
+mysql 118G Dec 28 08:52 ProbeSetData.MYI
+mysql 1.5K Jan 1 20:45 mytest.frm
+mysql 20G Jan 1 21:34 mytest.ibd
+```
+
+If you have enough space, you can create a copy of the actual table and do the work on that:
+
+Original:
+
+```
+| ProbeSetData | CREATE TABLE 'ProbeSetData' (
+ 'Id' int(10) unsigned NOT NULL DEFAULT 0,
+ 'StrainId' int(20) NOT NULL,
+ 'value' float NOT NULL,
+ UNIQUE KEY 'DataId' ('Id','StrainId'),
+ KEY 'strainid' ('StrainId')
+) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+```
+
+New:
+
+```
+DROP table mytest2;
+CREATE table mytest2(
+ Id INT unsigned NOT NULL DEFAULT 0,
+ StrainId MEDIUMINT unsigned NOT NULL DEFAULT 0,
+ value float NOT NULL,
+ primary key(Id,StrainId),
+ key (StrainId)) CHARACTER SET utf8mb4 ;
+```
+
+Note loading latin1 is slower than utf8! I tried.
+
+And fill it killing quite a few birds with one stone :). You may want to disable the binary logger before running:
+
+```
+INSERT INTO mytest2 (Id,StrainId,value) SELECT Id,StrainId,value FROM ProbeSetData LIMIT 51631478;
+```
+
+
+Makes a table at 1% of size - a table of 2.3GB. I wonder why the full thing took up 330Gb before we ran out of disk space.
+
+Show progress
+
+```
+SELECT TABLE_SCHEMA,TABLE_NAME,DATA_LENGTH,AVG_ROW_LENGTH,TABLE_ROWS FROM information_schema.TABLE
+S WHERE DATA_LENGTH>1000000000 order by data_length;
+```
+
+May add order by Id,StrainID next time.
+
+Once the process is done, you can rename the tables:
+
+```
+DROP ProbeSetData;
+RENAME TABLE tbl_name TO new_tbl_name, tbl_name2 TO ProbeSetData;
+```
+
+### Check table structure
+
+```
+SHOW CREATE TABLE ProbeSetData;
+```
+
+This is a really simple table
+
+```
+| ProbeSetData | CREATE TABLE 'ProbeSetData' (
+ 'Id' int(10) unsigned NOT NULL DEFAULT 0,
+ 'StrainId' int(20) NOT NULL,
+ 'value' float NOT NULL,
+ UNIQUE KEY 'DataId' ('Id','StrainId'),
+ KEY 'strainid' ('StrainId')
+) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+```
+
+```
+MariaDB [db_webqtl]> select * from ProbeSetData limit 2;
++----+----------+-------+
+| Id | StrainId | value |
++----+----------+-------+
+| 1 | 1 | 5.742 |
+| 1 | 2 | 5.006 |
++----+----------+-------+
+2 rows in set (0.000 sec)
+```
+
+But large
+
+```
+MariaDB [db_webqtl]> select count(*) from ProbeSetData;
++------------+
+| count(*) |
++------------+
+| 5141631478 |
++------------+
+1 row in set (0.000 sec)
+
+```
+
+This parses the full table (MYISAM version)
+
+```
+MariaDB [db_webqtl]> select count(distinct id) from ProbeSetData;
++--------------------+
+| count(distinct id) |
++--------------------+
+| 48119047 |
++--------------------+
+1 row in set (4 min 40.035 sec)
+```
+
+Do this also on Penguin2.
+
+```
+| ProbeSetData | CREATE TABLE 'ProbeSetData' (
+ 'Id' int(10) unsigned NOT NULL DEFAULT 0,
+ 'StrainId' smallint(5) unsigned NOT NULL DEFAULT 0,
+ 'value' float NOT NULL,
+ UNIQUE KEY 'DataId' ('Id','StrainId')
+) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+```
+
+The table layout is a bit puzzling.
+
+```
+MariaDB [db_webqtl]> select max(StrainId) from ProbeSetData;
++---------------+
+| max(StrainId) |
++---------------+
+| 65777 |
++---------------+
+MariaDB [db_webqtl]> select max(Id) from ProbeSetData;
++----------+
+| max(Id) |
++----------+
+| 91406791 |
++----------+
+```
+
+For StrainID MediumInt is a better fit with unsigned range is 0 to 16777215 and Id should be a INT (instead of BIGINT for the time being). That will make the table+indices smaller and faster. Note that MediumInt does not save RAM, just disk space. But that can impact IO performance by 25%.
+
+When a value overflows MariaDB will throw an error. Therefore it is safe to pick the smallest possible size. So even if we grow out of values we can quickly resize the table in, say, 10 years. That is, if we are still using SQL.
+
+### Check GN1,2,3 code for use of table
+
+```
+rg ProbeSetData --color=always |less -R
+rg ProbeSetData --type=py -l|fzf --preview="rg --color=always -A 20 ProbeSetData {}" --preview-window=right:85%:wrap
+```
+
+No surprises from the code scan.
+
+### Create test
+
+Some select statement and maybe a page of GN2.
+
+The tests we can share from ProbeSetXRef.
+
+Page of GN2 is the mapping page, e.g.
+
+=> http://genenetwork.org/show_trait?trait_id=1427571_at&dataset=HC_M2_0606_P
+
+### Create primary key
+
+```
+ALTER TABLE ProbeSetData
+ ADD PRIMARY KEY(Id,StrainId);
+```
+
+(no quotes for column names)
+
+### Create indices
+
+```
+SHOW INDEX FROM ProbeSetData;
+DROP INDEX ProbeSetId ON ProbeSetData;
+```
+
+### Convert to innodb
+
+```
+ALTER TABLE ProbeSetData ENGINE = InnoDB;
+```
+
+### Change charset
+
+```
+ALTER TABLE ProbeSetSE CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
+```
+
+### Update fulltext
+
+For those cases see bottom of move-to-innodb.gmi.
+
+### Run optimiser
+
+```
+OPTIMIZE NO_WRITE_TO_BINLOG TABLE ProbeSetData;
+```
+
+After running the optimiser rename the tables
+
+```
+RENAME TABLE orig TO orig_old, mytest TO orig;
+```
+
+### Check test
+
+## Notes
+
+I found it is a bad idea to remove large .ibd files by hand because mariadb wants to recreate them to play the transaction log.