Diffstat (limited to 'issues/systems')
-rw-r--r-- | issues/systems/mariadb/ProbeData.gmi | 48 |
-rw-r--r-- | issues/systems/mariadb/ProbeSE.gmi | 15 |
-rw-r--r-- | issues/systems/mariadb/ProbeSetData.gmi | 289 |
-rw-r--r-- | issues/systems/mariadb/ProbeSetXRef.gmi | 403 |
-rw-r--r-- | issues/systems/mariadb/cleanup.gmi | 43 |
-rw-r--r-- | issues/systems/mariadb/move-to-innodb.gmi | 594 |
6 files changed, 1392 insertions, 0 deletions
diff --git a/issues/systems/mariadb/ProbeData.gmi b/issues/systems/mariadb/ProbeData.gmi new file mode 100644 index 0000000..c339571 --- /dev/null +++ b/issues/systems/mariadb/ProbeData.gmi @@ -0,0 +1,48 @@ +# ProbeData + +## Tags + +* assigned: pjotrp, aruni +* status: unclear +* priority: medium +* type: enhancement +* keywords: database, mariadb, innodb, ProbeData + +## Description + +Probe level data is used to examine the correlation structure among the +N probes that have the same nominal target. Sometimes several probes +are badly behaved or contain SNPs or indels. +The well-behaved probes were then used in GN1, at the user's +discretion, to make an eigengene that sometimes performs quite a bit +better than the Affymetrix probeset. Essentially, the user could design +their own probesets. And the probe level data is quite fascinating for +dissecting some types of cis-eQTLs—the COMT story I have attached is a +good example. Here is figure 1 that exploits this unique feature: + +Ideally, the probe level data would be in GN2 with the same basic +functions as in GN1. + +All we need in GN2/3 is a new table to display the probe level +expression (mean) with their metadata (melting temperature, sequence, +location, etc). The probeset ID is the table header and name (the +parent), and the probes in the table are the children. Using our now +standard DataTable format should work well. +We have a similar parent-child relation among traits with peptides and +proteins. All of the peptides of a single protein should have +the same parent probeset/protein. And peptides could be entered as +"probes" in the same way that we did for Affymetrix. + +Arun—I wonder whether this hierarchy could be usefully combined to +handle time-series data. Probably not ;-) +In the case of probes and probesets there is almost never any overlap +of probe sequence—all are disjoint. That is also usually true of +peptides and proteins.
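The parent-child display described above could be sketched with a query along these lines. This is a hypothetical sketch: it assumes a Probe table carrying a ProbeSetId foreign key and per-probe metadata columns (Sequence, Tm, Mb); the actual GN schema and column names may differ and should be checked first.

```
-- Sketch: list the child probes (with metadata) under one parent probeset.
-- Table and column names are assumptions, not verified against the GN schema.
SELECT ps.Name AS probeset,        -- parent: becomes the DataTable header
       p.Name  AS probe,           -- child rows
       p.Sequence, p.Tm, p.Mb      -- per-probe metadata (assumed columns)
  FROM ProbeSet ps
  JOIN Probe p ON p.ProbeSetId = ps.Id
 WHERE ps.Name = '1427571_at'      -- example probeset used elsewhere in these notes
 ORDER BY p.Name;
```

The same shape would serve the peptide/protein case: the "parent" key just points at the protein record instead of the probeset.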
+ +Pjotr, the reason we have not added much probe level data to GN1 or GN2 +is because we did not have the bandwidth. Arthur simply did not have +time and I did not push the issue. Instead we just started loading the +probe level data separately as if they were probesets. This is what we +have done for peptide data and the reason that there are now "parallel" +data sets—one labeled "protein" and another as "peptide" or as "gene +level" and "exon level". We just collapse the hierarchy. diff --git a/issues/systems/mariadb/ProbeSE.gmi b/issues/systems/mariadb/ProbeSE.gmi new file mode 100644 index 0000000..2c36514 --- /dev/null +++ b/issues/systems/mariadb/ProbeSE.gmi @@ -0,0 +1,15 @@ +# ProbeSE + +## Tags + +* assigned: pjotrp +* status: unclear +* priority: medium +* type: enhancement +* keywords: database, mariadb, innodb, ProbeSE + +## Description + +Zach pointed out that ProbeSE is used on GN1 with + +=> http://gn1.genenetwork.org/webqtl/main.py?FormID=showProbeInfo&database=HC_M2_0606_P&ProbeSetID=1427571_at&CellID=None&RISet=BXD&incparentsf1=ON diff --git a/issues/systems/mariadb/ProbeSetData.gmi b/issues/systems/mariadb/ProbeSetData.gmi new file mode 100644 index 0000000..672b64c --- /dev/null +++ b/issues/systems/mariadb/ProbeSetData.gmi @@ -0,0 +1,289 @@ +# ProbeSetData + +## Tags + +* assigned: pjotrp +* status: unclear +* priority: medium +* type: enhancement +* keywords: database, mariadb, innodb + +## Description + +This is by far the largest table (~200Gb). I need to add disk space to be able to host it on the NVME and move stuff around. Final move is GN2 code and we have over 400Gb free. + +This time I failed porting to InnoDB. 
Next time: + +* [ ] Move database to large drive +* [ ] Run second instance of mariadb, upgrade too +* [ ] Export a small version of ISAM to ISAM +* [ ] Try different sizes of innodb exports +* [ ] Make (ordered) conversion and test performance +* [ ] Muck out ibdata1 and transaction logs + +I disabled these and they need to be restored: + +* [X] binary log +* [X] backup script +* [ ] flush db - leave that off for now +* [X] cron system-check +* [X] systemd MYSQLD_OPTS setting + +The following fetches the data to be used for mapping: + +``` +SELECT + Strain.Name, ProbeSetData.value, ProbeSetSE.error, NStrain.count, Strain.Name2 + FROM + (ProbeSetData, ProbeSetFreeze, + Strain, ProbeSet, ProbeSetXRef) + left join ProbeSetSE on + (ProbeSetSE.DataId = ProbeSetData.Id AND ProbeSetSE.StrainId = ProbeSetData.StrainId) + left join NStrain on + (NStrain.DataId = ProbeSetData.Id AND + NStrain.StrainId = ProbeSetData.StrainId) + WHERE + ProbeSet.Name = '4336695' AND ProbeSetXRef.ProbeSetId = ProbeSet.Id AND + ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id AND + ProbeSetFreeze.Name = 'UMUTAffyExon_0209_RMA' AND + ProbeSetXRef.DataId = ProbeSetData.Id AND + ProbeSetData.StrainId = Strain.Id + Order BY + Strain.Name; +``` + +Protocol from + +=> ./move-to-innodb.gmi + +Every table update has to follow the template: + +## Reboot Tux01 and Tux02 + +=> ../reboot-tux01-tux02.gmi + + +## Table upgrade + +### Check recent backups + +* [X] ascertain there is a backup +* [X] copy original files + +### Make temporary table + +``` +mysql 1.5K Jun 27 2021 ProbeSetData.frm +mysql 63G Dec 4 22:15 ProbeSetData.MYD +mysql 118G Dec 28 08:52 ProbeSetData.MYI +mysql 1.5K Jan 1 20:45 mytest.frm +mysql 20G Jan 1 21:34 mytest.ibd +``` + +If you have enough space, you can create a copy of the actual table and do the work on that: + +Original: + +``` +| ProbeSetData | CREATE TABLE 'ProbeSetData' ( + 'Id' int(10) unsigned NOT NULL DEFAULT 0, + 'StrainId' int(20) NOT NULL, + 'value' float NOT NULL,
+ UNIQUE KEY 'DataId' ('Id','StrainId'), + KEY 'strainid' ('StrainId') +) ENGINE=MyISAM DEFAULT CHARSET=latin1 | +``` + +New: + +``` +DROP table mytest2; +CREATE table mytest2( + Id INT unsigned NOT NULL DEFAULT 0, + StrainId MEDIUMINT unsigned NOT NULL DEFAULT 0, + value float NOT NULL, + primary key(Id,StrainId), + key (StrainId)) CHARACTER SET utf8mb4 ; +``` + +Note loading latin1 is slower than utf8! I tried. + +And fill it, killing quite a few birds with one stone :). You may want to disable the binary logger before running: + +``` +INSERT INTO mytest2 (Id,StrainId,value) SELECT Id,StrainId,value FROM ProbeSetData LIMIT 51631478; +``` + + +Makes a table at 1% of size - a table of 2.3GB. I wonder why the full thing took up 330Gb before we ran out of disk space. + +Show progress: + +``` +SELECT TABLE_SCHEMA,TABLE_NAME,DATA_LENGTH,AVG_ROW_LENGTH,TABLE_ROWS FROM information_schema.TABLES WHERE DATA_LENGTH>1000000000 order by data_length; +``` + +May add order by Id,StrainID next time. + +Once the process is done, you can rename the tables: + +``` +DROP TABLE ProbeSetData; +RENAME TABLE mytest2 TO ProbeSetData; +``` + +### Check table structure + +``` +SHOW CREATE TABLE ProbeSetData; +``` + +This is a really simple table + +``` +| ProbeSetData | CREATE TABLE 'ProbeSetData' ( + 'Id' int(10) unsigned NOT NULL DEFAULT 0, + 'StrainId' int(20) NOT NULL, + 'value' float NOT NULL, + UNIQUE KEY 'DataId' ('Id','StrainId'), + KEY 'strainid' ('StrainId') +) ENGINE=MyISAM DEFAULT CHARSET=latin1 | +``` + +``` +MariaDB [db_webqtl]> select * from ProbeSetData limit 2; ++----+----------+-------+ +| Id | StrainId | value | ++----+----------+-------+ +| 1 | 1 | 5.742 | +| 1 | 2 | 5.006 | ++----+----------+-------+ +2 rows in set (0.000 sec) +``` + +But large + +``` +MariaDB [db_webqtl]> select count(*) from ProbeSetData; ++------------+ +| count(*) | ++------------+ +| 5141631478 | ++------------+ +1 row in set (0.000 sec) + +``` + +This parses the full table
(MYISAM version) + +``` +MariaDB [db_webqtl]> select count(distinct id) from ProbeSetData; ++--------------------+ +| count(distinct id) | ++--------------------+ +| 48119047 | ++--------------------+ +1 row in set (4 min 40.035 sec) +``` + +Do this also on Penguin2. + +``` +| ProbeSetData | CREATE TABLE 'ProbeSetData' ( + 'Id' int(10) unsigned NOT NULL DEFAULT 0, + 'StrainId' smallint(5) unsigned NOT NULL DEFAULT 0, + 'value' float NOT NULL, + UNIQUE KEY 'DataId' ('Id','StrainId') +) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +``` + +The table layout is a bit puzzling. + +``` +MariaDB [db_webqtl]> select max(StrainId) from ProbeSetData; ++---------------+ +| max(StrainId) | ++---------------+ +| 65777 | ++---------------+ +MariaDB [db_webqtl]> select max(Id) from ProbeSetData; ++----------+ +| max(Id) | ++----------+ +| 91406791 | ++----------+ +``` + +For StrainId a MEDIUMINT is a better fit, with an unsigned range of 0 to 16777215, and Id should be an INT (instead of a BIGINT for the time being). That will make the table+indices smaller and faster. Note that MEDIUMINT does not save RAM, just disk space. But that can impact IO performance by 25%. + +When a value overflows MariaDB will throw an error. Therefore it is safe to pick the smallest possible size. So even if we grow out of values we can quickly resize the table in, say, 10 years. That is, if we are still using SQL. + +### Check GN1,2,3 code for use of table + +``` +rg ProbeSetData --color=always |less -R +rg ProbeSetData --type=py -l|fzf --preview="rg --color=always -A 20 ProbeSetData {}" --preview-window=right:85%:wrap +``` + +No surprises from the code scan. + +### Create test + +Some select statement and maybe a page of GN2. + +The tests we can share from ProbeSetXRef. + +The GN2 page to test is the mapping page, e.g.
+ +=> http://genenetwork.org/show_trait?trait_id=1427571_at&dataset=HC_M2_0606_P + +### Create primary key + +``` +ALTER TABLE ProbeSetData + ADD PRIMARY KEY(Id,StrainId); +``` + +(no quotes for column names) + +### Create indices + +``` +SHOW INDEX FROM ProbeSetData; +DROP INDEX ProbeSetId ON ProbeSetData; +``` + +### Convert to innodb + +``` +ALTER TABLE ProbeSetData ENGINE = InnoDB; +``` + +### Change charset + +``` +ALTER TABLE ProbeSetSE CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci; +``` + +### Update fulltext + +For those cases see bottom of move-to-innodb.gmi. + +### Run optimiser + +``` +OPTIMIZE NO_WRITE_TO_BINLOG TABLE ProbeSetData; +``` + +After running the optimiser rename the tables + +``` +RENAME TABLE orig TO orig_old, mytest TO orig; +``` + +### Check test + +## Notes + +I found it is a bad idea to remove large .ibd files by hand because mariadb wants to recreate them to play the transaction log. diff --git a/issues/systems/mariadb/ProbeSetXRef.gmi b/issues/systems/mariadb/ProbeSetXRef.gmi new file mode 100644 index 0000000..cdf08f4 --- /dev/null +++ b/issues/systems/mariadb/ProbeSetXRef.gmi @@ -0,0 +1,403 @@ +# ProbeSetXRef + +## Tags + +* keywords: ProbeSetXRef, database, mariadb, innodb +* type: enhancement, documentation +* assigned: pjotrp +* status: unclear +* priority: medium + +## Table ProbeSetXRef + +Juggling indexes and transforming to InnoDB led to a massive speed increase for typical ProbeSetXRef queries. Global search for brca2 went down from 13s to 4s! Try + +=> http://genenetwork.org/gsearch?type=gene&terms=brca2 + +Not only global search is faster, *all* search is significantly faster. + +=> http://genenetwork.org/search?species=mouse&group=BXD&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or=sh*&search_terms_and=&FormID=searchResult + +File data size increased from 3.3GB to 5.4GB (1.6x). + +Note that the full wildcard '*' search throws an error or empty results. That needs a fix. 
+ +A primary key index was introduced (required by innodb) and joined indexes were split in two. I also reduced the text index to 5 characters for each locus to reduce the data pump. + +It turned out that a join on a mixed latin1 and utf8 field does not work, which makes sense when you think about it. Here ProbeSetXRef.locus joins against geno.name. + +The short of it is that it is ill-advised to change the charset table by table! So, we'll stick with latin1 until we convert all tables. Needs proper testing anyway. + +Some optimisation is still possible - removing the Old columns, and perhaps the Locus VARCHAR column can be reduced in size. + +Using the protocol from + +=> ./move-to-innodb.gmi + +### Check table structure + +``` +SHOW CREATE TABLE ProbeSetXRef; +``` + +``` +| ProbeSetXRef | CREATE TABLE 'ProbeSetXRef' ( + 'ProbeSetFreezeId' smallint(5) unsigned NOT NULL DEFAULT 0, + 'ProbeSetId' int(10) unsigned NOT NULL DEFAULT 0, + 'DataId' int(10) unsigned NOT NULL DEFAULT 0, + 'Locus_old' char(20) DEFAULT NULL, + 'LRS_old' double DEFAULT NULL, + 'pValue_old' double DEFAULT NULL, + 'mean' double DEFAULT NULL, + 'se' double DEFAULT NULL, + 'Locus' varchar(50) DEFAULT NULL, + 'LRS' double DEFAULT NULL, + 'pValue' double DEFAULT NULL, + 'additive' double DEFAULT NULL, + 'h2' float DEFAULT NULL, + UNIQUE KEY 'ProbeSetId' ('ProbeSetFreezeId','ProbeSetId'), + UNIQUE KEY 'DataId_IDX' ('DataId'), + KEY 'Locus_IDX' ('Locus') +) ENGINE=MyISAM DEFAULT CHARSET=latin1 | +``` + +For every probe set (read: dataset measuring point): + + +``` +select * from ProbeSetXRef limit 2; +| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old | LRS_old | pValue_old | mean | se | Locus | LRS | pValue | additive | h2 | ++------------------+------------+--------+------------+--------------------+------------+------------------+---------------------+------------+------------------+--------+--------------------+------+ +| 1 | 1 | 1 | 10.095.400 | 13.3971627898894 | 0.163 | 5.48794285714286 |
0.08525787814808819 | rs13480619 | 12.590069931048 | 0.269 | -0.28515625 | NULL | +| 1 | 2 | 2 | D15Mit189 | 10.042057464356201 | 0.431 | 9.90165714285714 | 0.0374686634976217 | rs29535974 | 10.5970737900941 | 0.304 | -0.116783333333333 | NULL | ++------------------+------------+--------+------------+--------------------+------------+------------------+---------------------+------------+------------------+--------+--------------------+------+ +``` + +where ProbeSetFreezeId is the dataset (experiment). ProbeSetId refers +to the probe set information (measuring point). DataId points to the +data point. The other values are used for search. + +``` +MariaDB [db_webqtl]> select count(*) from ProbeSetXRef; ++----------+ +| count(*) | ++----------+ +| 48076905 | ++----------+ +``` + +### Check GN1,2,3 code for use of table + +``` +rg ProbeSetXRef --color=always |less -R +rg ProbeSetXRef --type=py -l|fzf --preview="rg --color=always -A 20 ProbeSetXRef {}" --preview-window=right:85%:wrap +``` + +In SQL WHERE statements these are usually combined: + +ProbeSetXRef.ProbeSetFreezeId +ProbeSetXRef.ProbeSetId + +and sometimes + +ProbeSetXRef.DataId +ProbeSetXRef.Locus + +As can be seen from the above table definition, the indices match: + +``` + UNIQUE KEY 'ProbeSetId' ('ProbeSetFreezeId','ProbeSetId'), + UNIQUE KEY 'DataId_IDX' ('DataId'), + KEY 'Locus_IDX' ('Locus') +``` + +The combination of + +``` +select count(distinct ProbeSetFreezeId,ProbeSetId) from ProbeSetXRef limit 10; +``` + +is unique. Now we should also notice that DataId is unique and will make a smaller primary index. It is worth trying to split the combined ('ProbeSetFreezeId','ProbeSetId') into two indices. I'll do that once we are on innodb. Oh wait, I have to set the primary key first, I don't think I can change that.
Let's create a new table for testing: + +``` +CREATE TABLE mytest AS SELECT * FROM ProbeSetXRef; +ALTER TABLE mytest ADD PRIMARY KEY(DataId); +ALTER TABLE mytest ADD KEY(ProbeSetFreezeId); +ALTER TABLE mytest ADD KEY(ProbeSetId); +SHOW CREATE TABLE mytest; +``` + + +### Create test + +Some select statement and maybe a page of GN2. + +``` +MariaDB [db_webqtl]> select count(*) from ProbeSetXRef where ProbeSetFreezeId<200 and ProbeSetId<1000 and pValue>0.5; ++----------+ +| count(*) | ++----------+ +| 19068 | ++----------+ +1 row in set (1.752 sec) + +MariaDB [db_webqtl]> select count(*) from ProbeSetXRef where ProbeSetFreezeId<300 and ProbeSetId<1000 and pValue>0.5; ++----------+ +| count(*) | ++----------+ +| 19068 | ++----------+ +1 row in set (13.781 sec) +``` + +And InnoDB mytest has a 3x speedup for the real query: + +``` +MariaDB [db_webqtl]> select count(*) from ProbeSetXRef where ProbeSetFreezeId<200 and ProbeSetId<1000 and pValue>0.5; ++----------+ +| count(*) | ++----------+ +| 19068 | ++----------+ +1 row in set (1.748 sec) + +MariaDB [db_webqtl]> select count(*) from mytest where ProbeSetFreezeId<200 and ProbeSetId<1000 and pValue>0.5; ++----------+ +| count(*) | ++----------+ +| 19068 | ++----------+ +1 row in set (0.445 sec) +``` + +Note the second query traverses the full file and the mytest version is the same speed. + +### Create primary key + +``` +ALTER TABLE ProbeSetXRef + ADD PRIMARY KEY(ProbeSetFreezeId,ProbeSetId); +``` + +``` +MariaDB [db_webqtl]> DROP INDEX ProbeSetId ON ProbeSetXRef; +MariaDB [db_webqtl]> show index from ProbeSetXRef; +``` + + +### Create indices +### Convert to innodb + +``` +ALTER TABLE ProbeSetXRef ENGINE = InnoDB; +``` + +### Change charset + +There are some text fields for locus, I think it is safe to translate those to utf8. + +``` +ALTER TABLE mytest CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci; +ALTER TABLE mytest ADD KEY(Locus(5)); +``` + +Note I specified a size for Locus to keep the index nimble. 
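Whether a 5-character prefix is selective enough for the Locus index can be estimated before committing to it. A sketch (run against the mytest copy; if the prefix cardinality is close to the full cardinality, little selectivity is lost):

```
-- Sketch: compare cardinality of the full column vs the 5-char prefix.
SELECT COUNT(DISTINCT Locus)          AS full_cardinality,
       COUNT(DISTINCT LEFT(Locus, 5)) AS prefix_cardinality
  FROM mytest;
```

The closer the two numbers are, the safer the prefix index; a large gap means many loci share a 5-character prefix and the index will fan out to more rows per lookup.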
+ +### Update fulltext + +For those cases see bottom of move-to-innodb.gmi. + +### Run optimiser + +``` +OPTIMIZE NO_WRITE_TO_BINLOG TABLE mytest; +``` + +### Final table + +``` +RENAME TABLE ProbeSetXRef TO ProbeSetXRef_old, mytest TO ProbeSetXRef; +``` + +``` +| mytest | CREATE TABLE 'mytest' ( + 'ProbeSetFreezeId' smallint(5) unsigned NOT NULL DEFAULT 0, + 'ProbeSetId' int(10) unsigned NOT NULL DEFAULT 0, + 'DataId' int(10) unsigned NOT NULL DEFAULT 0, + 'Locus_old' char(20) DEFAULT NULL, + 'LRS_old' double DEFAULT NULL, + 'pValue_old' double DEFAULT NULL, + 'mean' double DEFAULT NULL, + 'se' double DEFAULT NULL, + 'Locus' varchar(50) DEFAULT NULL, + 'LRS' double DEFAULT NULL, + 'pValue' double DEFAULT NULL, + 'additive' double DEFAULT NULL, + 'h2' float DEFAULT NULL, + PRIMARY KEY ('DataId'), + KEY 'ProbeSetFreezeId' ('ProbeSetFreezeId'), + KEY 'ProbeSetId' ('ProbeSetId'), + KEY 'Locus_2' ('Locus'(5)) +) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 | +``` + + +### Check test + +``` +MariaDB [db_webqtl]> select count(*) from ProbeSetXRef where ProbeSetFreezeId<200 and ProbeSetId<1000 and pValue>0.5; ++----------+ +| count(*) | ++----------+ +| 19068 | ++----------+ +1 row in set (0.058 sec) +``` + +Now I had 3 pages open + +=> http://genenetwork.org/show_trait?trait_id=1427571_at&dataset=HC_M2_0606_P +=> http://genenetwork.org/search?species=mouse&group=BXD&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or=shh&search_terms_and=&FormID=searchResult +=> http://genenetwork.org/gsearch?type=gene&terms=brca2 + +The first two pages loaded the same. But the third (global search) took much longer than the 15 seconds it did before I modified ProbeSetXRef. Doh! It should be + +``` +It took the server 15.71517s seconds to process this page. +It took your browser 0.598 second(s) to render this page +``` + +The query timed out, so I could look in the slow query log we have. + +(But first I added a standard test on sheepdog to make sure we see these regressions.) 
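The slow query log referred to here is enabled with settings along these lines. This is a sketch; the threshold and file path are assumptions that should be checked against the live my.cnf on Tux01:

```
-- Sketch: runtime equivalents of the my.cnf slow-log settings (assumed values).
-- In my.cnf these would be: slow_query_log=1, long_query_time=60,
-- slow_query_log_file=/var/log/mysql/mysql-slow.log
SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 60;
SHOW VARIABLES LIKE 'slow_query_log%';
```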
+ +``` +vim /var/log/mysql/mysql-slow.log + + # Time: 211230 8:20:36 + # User@Host: webqtlout[webqtlout] @ localhost [] + # Thread_id: 11771 Schema: db_webqtl QC_hit: No + # Query_time: 470.965035 Lock_time: 0.000258 Rows_sent: 1017 Rows_examined: 115097830 + # Rows_affected: 0 Bytes_sent: 253724 +SELECT ProbeSetFreeze.'Name', ProbeSetFreeze.'FullName', ProbeSet.'Name', ProbeSet.'Symbol', CAST(ProbeSet.'description' AS BINARY), CAST(ProbeSet.'Probe_Target_Description' AS BINARY), ProbeSet.'Chr', ProbeSet.'Mb', ProbeSetXRef.'Mean', ProbeSetXRef.'LRS', ProbeSetXRef.'Locus', ProbeSetXRef.'pValue', ProbeSetXRef.'additive', Geno.'Chr' as geno_chr, Geno.'Mb' as geno_mb FROM Species + INNER JOIN InbredSet ON InbredSet.'SpeciesId'= Species.'Id' + INNER JOIN ProbeFreeze ON ProbeFreeze.'InbredSetId' = InbredSet.'Id' INNER JOIN Tissue ON ProbeFreeze.'TissueId' = Tissue.'Id' + INNER JOIN ProbeSetFreeze ON ProbeSetFreeze.'ProbeFreezeId' = ProbeFreeze.'Id' + INNER JOIN ProbeSetXRef ON ProbeSetXRef.'ProbeSetFreezeId' = ProbeSetFreeze.'Id' + INNER JOIN ProbeSet ON ProbeSet.'Id' = ProbeSetXRef.'ProbeSetId' + LEFT JOIN Geno ON ProbeSetXRef.'Locus' = Geno.'Name' AND Geno.'SpeciesId' = Species.'Id' + WHERE ((((MATCH (ProbeSet.Name, ProbeSet.description, ProbeSet.symbol, alias, GenbankId, UniGeneId, Probe_Target_Description) AGAINST ('sh*' IN BOOLEAN MODE))) AND ProbeSet.Id = ProbeSetXRef.ProbeSetId and ProbeSetXRef.ProbeSetFreezeId = 112 )) and ProbeSet.Id = ProbeSetXRef.ProbeSetId and ProbeSetXRef.ProbeSetFreezeId = 112 ORDER BY ProbeSet.symbol ASC; +``` + +this is a different query which logs after a >60s query (it is one of my monitors looking for 'sh*'. This is an interesting one to look at, but it does not match my query. 
+ +The actual query showed up as + +``` +SELECT + Species.'Name' AS species_name, + InbredSet.'Name' AS inbredset_name, + Tissue.'Name' AS tissue_name, + ProbeSetFreeze.Name AS probesetfreeze_name, + ProbeSetFreeze.FullName AS probesetfreeze_fullname, + ProbeSet.Name AS probeset_name, + ProbeSet.Symbol AS probeset_symbol, + CAST(ProbeSet.'description' AS BINARY) AS probeset_description, + ProbeSet.Chr AS chr, + ProbeSet.Mb AS mb, + ProbeSetXRef.Mean AS mean, + ProbeSetXRef.LRS AS lrs, + ProbeSetXRef.'Locus' AS locus, + ProbeSetXRef.'pValue' AS pvalue, + ProbeSetXRef.'additive' AS additive, + ProbeSetFreeze.Id AS probesetfreeze_id, + Geno.Chr as geno_chr, + Geno.Mb as geno_mb + FROM Species + INNER JOIN InbredSet ON InbredSet.'SpeciesId'=Species.'Id' + INNER JOIN ProbeFreeze ON ProbeFreeze.InbredSetId=InbredSet.'Id' + INNER JOIN Tissue ON ProbeFreeze.'TissueId'=Tissue.'Id' + INNER JOIN ProbeSetFreeze ON ProbeSetFreeze.ProbeFreezeId=ProbeFreeze.Id + INNER JOIN ProbeSetXRef ON ProbeSetXRef.ProbeSetFreezeId=ProbeSetFreeze.Id + INNER JOIN ProbeSet ON ProbeSet.Id = ProbeSetXRef.ProbeSetId + LEFT JOIN Geno ON ProbeSetXRef.Locus = Geno.Name AND Geno.SpeciesId = Species.Id + WHERE ( MATCH (ProbeSet.Name,ProbeSet.description,ProbeSet.symbol,ProbeSet.alias,ProbeSet.GenbankId, ProbeSet.UniGeneId, ProbeSet.Probe_Target_Description) AGAINST ('brca2' IN BOOLEAN MODE) ) + AND ProbeSetFreeze.confidentiality < 1 + AND ProbeSetFreeze.public > 0 + ORDER BY species_name, inbredset_name, tissue_name, probesetfreeze_name, probeset_name + LIMIT 6000; +``` + +which looks somewhat similar. There are some joins and I think my index size restriction of Locus(5) may be the problem here. The INNER JOIN keyword selects records that have matching values in both tables, so maybe the Locus has trouble matching. 
+ +Add a full index: + +``` +ALTER TABLE ProbeSetXRef ADD KEY(Locus); +``` + +The alternative is that mixed innodb and myisam joins are expensive, or more likely the mixed charsets are not playing well. I can try the EXPLAIN statement next. + +Well, it turned out that a join on a mixed latin1 and utf8 field does not work, which makes sense when you think about it. It joins against geno.name. Now in GN geno.name is usually compared against a string or ProbeSetXRef.locus. + +The short of it is that it is ill-advised to change the charset table by table! So, we'll stick with latin1 until we convert all tables. + + +``` +SELECT + Species.'Name' AS species_name, + InbredSet.'Name' AS inbredset_name, + Tissue.'Name' AS tissue_name, + ProbeSetFreeze.Name AS probesetfreeze_name, + ProbeSetFreeze.FullName AS probesetfreeze_fullname, + ProbeSet.Name AS probeset_name, + ProbeSet.Symbol AS probeset_symbol, + CAST(ProbeSet.'description' AS BINARY) AS probeset_description, + ProbeSet.Chr AS chr, + ProbeSet.Mb AS mb, + mytest.Mean AS mean, + mytest.LRS AS lrs, + mytest.'Locus' AS locus, + mytest.'pValue' AS pvalue, + mytest.'additive' AS additive, + ProbeSetFreeze.Id AS probesetfreeze_id, + Geno.Chr as geno_chr, + Geno.Mb as geno_mb + FROM Species + INNER JOIN InbredSet ON InbredSet.'SpeciesId'=Species.'Id' + INNER JOIN ProbeFreeze ON ProbeFreeze.InbredSetId=InbredSet.'Id' + INNER JOIN Tissue ON ProbeFreeze.'TissueId'=Tissue.'Id' + INNER JOIN ProbeSetFreeze ON ProbeSetFreeze.ProbeFreezeId=ProbeFreeze.Id + INNER JOIN mytest ON mytest.ProbeSetFreezeId=ProbeSetFreeze.Id + INNER JOIN ProbeSet ON ProbeSet.Id = mytest.ProbeSetId + LEFT JOIN Geno ON mytest.Locus = Geno.Name AND Geno.SpeciesId = Species.Id + WHERE ( MATCH (ProbeSet.Name,ProbeSet.description,ProbeSet.symbol,ProbeSet.alias,ProbeSet.GenbankId, ProbeSet.UniGeneId, ProbeSet.Probe_Target_Description) AGAINST ("brca2" IN BOOLEAN MODE) ) + AND ProbeSetFreeze.confidentiality < 1 + AND ProbeSetFreeze.public > 0 + ORDER BY
species_name, inbredset_name, tissue_name, probesetfreeze_name, probeset_name LIMIT 600; +``` + +This should return: 600 rows in set (11.638 sec) + +Changing the index now does 6000 rows in set (0.240 sec)! + +The new table structure is + +``` +| ProbeSetXRef | CREATE TABLE 'ProbeSetXRef' ( + 'ProbeSetFreezeId' smallint(5) unsigned NOT NULL DEFAULT 0, + 'ProbeSetId' int(10) unsigned NOT NULL DEFAULT 0, + 'DataId' int(10) unsigned NOT NULL DEFAULT 0, + 'Locus_old' char(20) DEFAULT NULL, + 'LRS_old' double DEFAULT NULL, + 'pValue_old' double DEFAULT NULL, + 'mean' double DEFAULT NULL, + 'se' double DEFAULT NULL, + 'Locus' varchar(50) DEFAULT NULL, + 'LRS' double DEFAULT NULL, + 'pValue' double DEFAULT NULL, + 'additive' double DEFAULT NULL, + 'h2' float DEFAULT NULL, + PRIMARY KEY ('DataId'), + KEY 'ProbeSetFreezeId' ('ProbeSetFreezeId'), + KEY 'ProbeSetId' ('ProbeSetId'), + KEY 'Locus_2' ('Locus'(5)) +) ENGINE=InnoDB DEFAULT CHARSET=latin1 | +``` diff --git a/issues/systems/mariadb/cleanup.gmi b/issues/systems/mariadb/cleanup.gmi new file mode 100644 index 0000000..0639e40 --- /dev/null +++ b/issues/systems/mariadb/cleanup.gmi @@ -0,0 +1,43 @@ +# Clean Up + +## Tags + +* assigned: pjotrp, robw +* status: unclear +* priority: unclear +* type: database administration +* keywords: database, mariadb + +## Description + +Find all larger tables: + +``` +SELECT TABLE_SCHEMA,TABLE_NAME,DATA_LENGTH FROM information_schema.TABLES WHERE DATA_LENGTH>10000; +``` + +The following tables already live in InnoDB: + +``` +MariaDB [mysql]> SELECT TABLE_SCHEMA,TABLE_NAME,DATA_LENGTH FROM information_schema.TABLES WHERE ENGINE = 'InnoDB'; ++--------------+----------------------+-------------+ +| TABLE_SCHEMA | TABLE_NAME | DATA_LENGTH | ++--------------+----------------------+-------------+ +| db_webqtl | TraitMetadata | 16384 | +| db_webqtl | ProbeSetSE | 24177016832 | +| db_webqtl | metadata_audit | 49152 | +| db_webqtl | mytest1 | 1567621120 | +| db_webqtl | ProbeSetXRef | 2836398080 | +| db_webqtl | GeneInfo
| 23642112 | +| db_webqtl | mytest2 | 56524537856 | +| mysql | transaction_registry | 16384 | +| mysql | innodb_index_stats | 16384 | +| mysql | innodb_table_stats | 16384 | +| mysql | gtid_slave_pos | 16384 | ++--------------+----------------------+-------------+ +11 rows in set (0.008 sec) +``` + +```sh +for x in innodb_index_stats innodb_table_stats gtid_slave_pos transaction_registry ; do echo $x ; mysqldump -u webqtlout -pwebqtlout mysql $x > /export3/$x.sql ; done +``` diff --git a/issues/systems/mariadb/move-to-innodb.gmi b/issues/systems/mariadb/move-to-innodb.gmi new file mode 100644 index 0000000..134a50a --- /dev/null +++ b/issues/systems/mariadb/move-to-innodb.gmi @@ -0,0 +1,594 @@ +# MariaDB: Move to InnoDB Engine + +We are going to move from myisam to innodb. Penguin2 has been happily running innodb for some time. + +The main problem is the fulltext columns; the text from Trello is captured below. This concerns the following tables: + +* ProbeSet +* GeneRIF_BASIC +* pubmedsearch + +Initial good candidates are (from issues/database-not-responding): + +* Good candidates + + 2.1G Dec 4 22:15 ProbeSetXRef.MYD (done!) + + 2.3G Dec 18 14:56 ProbeSet.MYD - with fulltext column + + 2.6G Aug 27 2019 ProbeSE.MYD (used?) + + 7.1G Nov 2 05:07 ProbeSetSE.MYD (done!) + + 11G Aug 27 2019 ProbeData.MYD (used?) + + 63G Dec 4 22:15 ProbeSetData.MYD + +I am starting with the two SE tables first - because they are small. + +Actually ProbeData and ProbeSE (containing the Affy data) are not referenced in GN2. I need to check that. + +## Report + +With the SQL database we need to move from myisam to innodb format, +mostly to stop the problem of full table locks. Also I expect the +occasional crashes we see to go away. + +Today, as a start, I moved the ProbeSetSE table to innodb. The result +is that the disk representation is 3x the size and a full table scan +takes 3x the time (somewhat unsurprising). This may impact +correlation, i.e. those routines that go through all the data.
We'll +have to test that carefully. For the SE I don't think we do that, so +it is no biggie. + +There are no real performance gains that I can tell, though for the +mapping page I see no slowing down either. + +I'll need to free up space on the NVME storage to do the larger +tables. + +I also took the opportunity to add a primary key because InnoDB +requires it and I changed the *table* language from latin1-sweden to +utf8. These are major upgrades. + +In the coming week I'll have to reboot the server to add a disk, move +stuff across to free up space on the NVME, and convert a few more +tables. I aim to get the largest tables done - because these are the +ones that hurt when locked: + +* Good candidates with update times + + 2.1G Dec 4 22:15 ProbeSetXRef.MYD + + 2.3G Dec 18 14:56 ProbeSet.MYD - with fulltext column + + 2.6G Aug 27 2019 ProbeSE.MYD + + 7.1G Nov 2 05:07 ProbeSetSE.MYD + + 11G Aug 27 2019 ProbeData.MYD + + 63G Dec 4 22:15 ProbeSetData.MYD + +(note each is about double that size because of indexes) + +ProbeSetSE is done. So ProbeData and ProbeSetData are the main +candidates right now.
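A quick way to rank the remaining conversion candidates (and see which tables are already converted) is to ask information_schema directly, rather than reading the .MYD files. A sketch:

```
-- Sketch: rank db_webqtl tables by total on-disk size, with engine shown.
SELECT TABLE_NAME, ENGINE,
       ROUND(DATA_LENGTH  / 1024 / 1024 / 1024, 1) AS data_gb,
       ROUND(INDEX_LENGTH / 1024 / 1024 / 1024, 1) AS index_gb
  FROM information_schema.TABLES
 WHERE TABLE_SCHEMA = 'db_webqtl'
 ORDER BY DATA_LENGTH + INDEX_LENGTH DESC
 LIMIT 10;
```

Note that for InnoDB these numbers are estimates, but they are good enough to pick the next candidate.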
+ +Note I set buffers to 16GB for now: + +``` ++innodb_buffer_pool_size=16G ++innodb_ft_min_token_size=3 ++# innodb_use_sys_malloc=0 ++innodb_file_per_table=ON +``` + +## Check database + +``` +time mysqlcheck -c -u webqtlout -pwebqtlout db_webqtl +``` + +To check one single table, Probe: + +``` +root@tux01:/var/lib/mysql/db_webqtl# mysqlcheck -c db_webqtl -u webqtlout -pwebqtlout Probe +db_webqtl.Probe OK +``` + +Make sure we do not have a FULLTEXT column we do not know about (note the backquotes around the ls command substitution): + +``` +root@tux01:/var/lib/mysql/db_webqtl# for x in `ls -1 *.MYD|sed -e 's,\.MYD,,'` ; do echo $x ; mysql -u webqtlout -pwebqtlout db_webqtl -e "select group_concat(distinct column_name) from information_schema.STATISTICS where table_schema = 'db_webqtl' and table_name = '$x' and index_type = 'FULLTEXT'" ; done|less +GeneRIF_BASIC +ProbeSet +pubmedsearch +``` + +Still the same. + +## Create a test + +Start with ProbeSetSE: + +``` +-rw-rw---- 1 mysql mysql 8.1G Dec 28 08:39 ProbeSetSE.MYI +-rw-rw---- 1 mysql mysql 7.1G Nov 2 05:07 ProbeSetSE.MYD +-rw-rw---- 1 mysql mysql 8.5K Feb 3 2021 ProbeSetSE.frm +``` + +``` +MariaDB [db_webqtl]> select * from ProbeSetSE limit 2; ++--------+----------+----------+ +| DataId | StrainId | error | ++--------+----------+----------+ +| 1 | 1 | 0.681091 | +| 1 | 2 | 0.361151 | ++--------+----------+----------+ +2 rows in set (0.001 sec) + +MariaDB [db_webqtl]> select count(*) from ProbeSetSE limit 2; ++-----------+ +| count(*) | ++-----------+ +| 688744613 | ++-----------+ +1 row in set (0.000 sec) +``` + +``` +MariaDB [db_webqtl]> flush tables ProbeSetSE; +Query OK, 0 rows affected (0.002 sec) +``` + +``` +MariaDB [db_webqtl]> select count(*) from ProbeSetSE where error<0.36; ++-----------+ +| count(*) | ++-----------+ +| 601603553 | ++-----------+ +1 row in set (1 min 1.189 sec) +``` + +Some testing shows strainid and error are not indexed.
Test query on myisam:

```
MariaDB [db_webqtl]> select count(*) from ProbeSetSE where strainid<20 and error<0.10;
+----------+
| count(*) |
+----------+
| 61625074 |
+----------+
1 row in set (58.301 sec)
```

Check index and primary key (PK):

```
SHOW CREATE TABLE ProbeSetSE;

| ProbeSetSE | CREATE TABLE "ProbeSetSE" (
  "DataId" int(10) unsigned NOT NULL DEFAULT 0,
  "StrainId" smallint(5) unsigned NOT NULL DEFAULT 0,
  "error" float NOT NULL,
  UNIQUE KEY "DataId" ("DataId","StrainId")
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
```

Note the latin1 charset which we don't need. Also InnoDB needs a primary key.

The mapping page retrieves SE through a call to

```
retrieve_sample_data(this_trait, dataset)
```

We also have an API endpoint

```
@app.route("/api/v_{}/sample_data/<path:dataset_name>".format(version))
@app.route("/api/v_{}/sample_data/<path:dataset_name>.<path:file_format>".format(version))
def all_sample_data(dataset_name, file_format="csv"):
```

But I can't get it to work for a trait.

```
curl "http://genenetwork.org/api/v_pre1/sample_data/1427571_at"
```

So, instead, I exported the CSV from

=> http://genenetwork.org/show_trait?trait_id=1427571_at&dataset=HC_M2_0606_P

It took the server 2.02924 seconds to process this page.
It took your browser 1.577 second(s) to render this page.

This is our test 'setup'.

## Backup and convert table

On Tux01 the database is hosted on a drive with 111 GB free. Not enough for all conversions. There should be an extra drive in there which requires configuration with a reboot. But first we can convert this small table. Backups we have already automated, but I'll add

```
tar cvzf /home/wrk/ProbeSetSE.tgz ProbeSetSE*
```

To convert to InnoDB we should:

* ascertain primary key
* change charset
* change engine to InnoDB
* make space on disk drive(s)

### Set primary key

For ProbeSetSE note the UNIQUE key was already defined.
So set a primary key:

```
ALTER TABLE ProbeSetSE
  ADD PRIMARY KEY(DataId,StrainId);
Query OK, 688744613 rows affected (15 min 13.830 sec)
Records: 688744613  Duplicates: 0  Warnings: 0
```

Unsurprisingly the index grew

```
-rw-rw---- 1 mysql mysql 16G Dec 28 11:06 ProbeSetSE.MYI
```

### Change charset

```
MariaDB [db_webqtl]> SHOW CHARACTER SET LIKE 'utf8mb4';
+---------+---------------+--------------------+--------+
| Charset | Description   | Default collation  | Maxlen |
+---------+---------------+--------------------+--------+
| utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci |      4 |
+---------+---------------+--------------------+--------+
1 row in set (0.000 sec)
```

```
MariaDB [db_webqtl]> ALTER TABLE ProbeSetSE CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
Stage: 2 of 2 'Enabling keys'      0% of stage done
Query OK, 688744613 rows affected (15 min 14.380 sec)
Records: 688744613  Duplicates: 0  Warnings: 0
```

For this table it has no effect since there are no text fields. Still this looks good:

```
SHOW CREATE TABLE ProbeSetSE;
| ProbeSetSE | CREATE TABLE "ProbeSetSE" (
  "DataId" int(10) unsigned NOT NULL DEFAULT 0,
  "StrainId" smallint(5) unsigned NOT NULL DEFAULT 0,
  "error" float NOT NULL,
  PRIMARY KEY ("DataId","StrainId"),
  UNIQUE KEY "DataId" ("DataId","StrainId")
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 |
```

Looking good! Now there are two identical indexes, so drop one:

```
MariaDB [db_webqtl]> DROP INDEX DataId ON ProbeSetSE;
```

```
MariaDB [db_webqtl]> show index from ProbeSetSE;
2 rows in set (0.000 sec)
```


### Convert to innodb

So, on to converting to innodb

=> https://mariadb.com/kb/en/converting-tables-from-myisam-to-innodb/

=> https://dataedo.com/kb/query/mysql/list-innodb-tables Show existing innodb tables

Unfortunately we can't switch to 4K page tables because we have existing tables. We'll do that later some day in a controlled fashion.

```
MariaDB [db_webqtl]> ALTER TABLE ProbeSetSE ENGINE = InnoDB;
Query OK, 688744613 rows affected (1 hour 51 min 36.273 sec)
Records: 688744613  Duplicates: 0  Warnings: 0
```

2 hours for a lousy table conversion!

The new file sizes are:

```
-rw-rw---- 1 mysql mysql 1.5K Dec 28 11:33 ProbeSetSE.frm
-rw-rw---- 1 mysql mysql  51G Dec 28 13:25 ProbeSetSE.ibd
```

So the size has tripled (including a new index) and during conversion both the old and the new table are on disk. I'll need to make space for this baby.

Let's try OPTIMIZE

```
OPTIMIZE NO_WRITE_TO_BINLOG TABLE ProbeSetSE;
```

Ah, now the size is similar to myisam and loading the mapping page is slightly faster.

## Run tests again

What about performance?

```
MariaDB [db_webqtl]> select count(*) from ProbeSetSE where strainid<20 and error<0.10;
+----------+
| count(*) |
+----------+
| 61625074 |
+----------+
1 row in set (3 min 22.958 sec)
```

Whoah. 3 times slower - which makes sense if you know the physical size of the data. Full table scans should be rare, but we need to make sure we don't slow them down that much!

Again I exported the CSV from

=> http://genenetwork.org/show_trait?trait_id=1427571_at&dataset=HC_M2_0606_P

It took the server 1.46351 seconds to process this page.
It took your browser 2.119 second(s) to render this page.

and it shows practically the same results.
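The "3 times slower" figure follows directly from the two scan timings above; a quick check:

```python
# Full-table-scan timings for the same count(*) query, from the runs above.
myisam_s = 58.301            # MyISAM: 58.301 sec
innodb_s = 3 * 60 + 22.958   # InnoDB: 3 min 22.958 sec
ratio = innodb_s / myisam_s
print(f"InnoDB scan is {ratio:.2f}x slower")  # → InnoDB scan is 3.48x slower
```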


## Table template

Protocol from

=> ./move-to-innodb.gmi

Every table update has to follow the template:

### Check recent backups

* [ ] ascertain there is a backup
* [ ] copy original files

### Make temporary table

If you have enough space, you can create a copy of the actual table and do the work on that:

```
CREATE TABLE new_tbl [AS] SELECT * FROM MYTABLE;
```

Then you can change the column as desired:

```
ALTER TABLE tbl_name MODIFY COLUMN col_name BIGINT AUTO_INCREMENT;
```

Once the process is done, you can rename the tables:

```
DROP TABLE MYTABLE;
RENAME TABLE tbl_name TO new_tbl_name, tbl_name2 TO MYTABLE;
```

### Check table structure

```
SHOW CREATE TABLE MYTABLE;
```

```
select * from MYTABLE limit 2;
```

```
select count(*) from MYTABLE;
```

Do this also on Penguin2.

### Check GN1,2,3 code for use of table

```
rg MYTABLE --color=always |less -R
rg MYTABLE --type=py -l|fzf --preview="rg --color=always -A 20 MYTABLE {}" --preview-window=right:85%:wrap
```

### Create test

Some select statement and maybe a page of GN2.

### Create primary key

```
ALTER TABLE MYTABLE
  ADD PRIMARY KEY(col1,col2);
```

### Create indices

```
SHOW INDEX FROM MYTABLE;
DROP INDEX ProbeSetId ON MYTABLE;
```

### Convert to innodb

```
ALTER TABLE MYTABLE ENGINE = InnoDB;
```

### Change charset

```
ALTER TABLE MYTABLE CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
```

### Update fulltext

For those cases see bottom of move-to-innodb.gmi.

### Run optimiser

```
OPTIMIZE NO_WRITE_TO_BINLOG TABLE MYTABLE;
```

After running the optimiser rename the tables

```
RENAME TABLE orig TO orig_old, mytest TO orig;
```

### Check test

## ProbeSetXRef

=> ProbeSetXRef.gmi

## Notes captured from Trello:

In an earlier track I wrote how to deal with fulltext fields.

pjotrp 16 Oct 2019 at 10:47

Fulltext

To list fulltext info for one table do:

```
select group_concat(distinct column_name) from information_schema.STATISTICS where table_schema = 'db_webqtl' and table_name = 'ProbeSet' and index_type = 'FULLTEXT';
```

Or

```
SHOW CREATE TABLE ProbeSet;

E.g.

  FULLTEXT KEY `SEARCH_GENE_IDX` (`Symbol`,`alias`),
  FULLTEXT KEY `SEARCH_FULL_IDX` (`Name`,`description`,`Symbol`,`alias`,`GenbankId`,`UniGeneId`,`Probe_Target_Description`),
  FULLTEXT KEY `RefSeq_FULL_IDX` (`RefSeq_TranscriptId`)
```

To see all:

```
root@tux02:/var/lib/mysql/db_webqtl# for x in `ls -1 *.MYD|sed -e 's,\.MYD,,'` ; do echo $x ; mysql -u webqtlout -pwebqtlout db_webqtl -e "select group_concat(distinct column_name) from information_schema.STATISTICS where table_schema = 'db_webqtl' and table_name = '$x' and index_type = 'FULLTEXT'" ; done
```

So we have to fix only

```
ProbeSet
GeneRIF_BASIC
pubmedsearch
```

```
REPAIR TABLE ProbeSet QUICK;
REPAIR TABLE GeneRIF_BASIC QUICK;
REPAIR TABLE pubmedsearch QUICK;
```

Note that GN1 search only appears to use ProbeSet. Reindexing takes
about 10 minutes on Tux02.
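For scripting these checks, the information_schema query above can be generated for any table. A minimal sketch (the helper name is made up, and real code should use parameterized queries instead of string interpolation):

```python
# Hypothetical helper: build the information_schema query used above to
# list the FULLTEXT-indexed columns of one table. Table and schema names
# here are only interpolated for illustration; in production, bind them
# as query parameters.
def fulltext_columns_sql(table, schema="db_webqtl"):
    return (
        "select group_concat(distinct column_name) "
        "from information_schema.STATISTICS "
        f"where table_schema = '{schema}' "
        f"and table_name = '{table}' "
        "and index_type = 'FULLTEXT'"
    )

print(fulltext_columns_sql("ProbeSet"))
```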

After updating ProbeSet to InnoDB the following query failed

```
SELECT distinct ProbeSet.Name as TNAME, 0 as thistable, ProbeSetXRef.Mean as TMEAN, ProbeSetXRef.LRS as TLRS, ProbeSetXRef.PVALUE as TPVALUE, ProbeSet.Chr_num as TCHR_NUM, ProbeSet.Mb as TMB, ProbeSet.Symbol as TSYMBOL, ProbeSet.name_num as TNAME_NUM FROM ProbeSetXRef, ProbeSet WHERE ((MATCH (ProbeSet.Name, ProbeSet.description, ProbeSet.symbol, alias, GenbankId, UniGeneId, Probe_Target_Description) AGAINST ('shh' IN BOOLEAN MODE)) or (MATCH (ProbeSet.symbol) AGAINST ('"Hx" "ShhNC" "9530036O11Rik" "Dsh" "Hhg1" "Hxl3" "M100081"' IN BOOLEAN MODE))) and ProbeSet.Id = ProbeSetXRef.ProbeSetId and ProbeSetXRef.ProbeSetFreezeId = 112 ORDER BY ProbeSet.symbol ASC;
```

with

```
ERROR 1191 (HY000): Can't find FULLTEXT index matching the column list
```

The shorter version works

```
SELECT distinct ProbeSet.Name as TNAME, 0 as thistable, ProbeSetXRef.Mean as TMEAN, ProbeSetXRef.LRS as TLRS, ProbeSetXRef.PVALUE as TPVALUE, ProbeSet.Chr_num as TCHR_NUM, ProbeSet.Mb as TMB, ProbeSet.Symbol as TSYMBOL, ProbeSet.name_num as TNAME_NUM FROM ProbeSetXRef, ProbeSet WHERE ((MATCH (ProbeSet.Name, ProbeSet.description, ProbeSet.symbol, alias, GenbankId, UniGeneId, Probe_Target_Description) AGAINST ('"Shh" "ShhNC" "9530036O11Rik" "Dsh" "Hhg1"' IN BOOLEAN MODE))) and ProbeSet.Id = ProbeSetXRef.ProbeSetId and ProbeSetXRef.ProbeSetFreezeId = 112;
```

When you remove any column from the MATCH statement you get this error, which makes sense: the MATCH column list has to correspond exactly to an existing FULLTEXT index. We need to add an index for the single ProbeSet.symbol match.
Create it with

```
CREATE FULLTEXT INDEX ft_ProbeSet_Symbol ON ProbeSet(Symbol);
```

Now we have

```
FULLTEXT KEY `SEARCH_GENE_IDX` (`Symbol`,`alias`),
FULLTEXT KEY `SEARCH_FULL_IDX` (`Name`,`description`,`Symbol`,`alias`,`GenbankId`,`UniGeneId`,`Probe_Target_Description`),
FULLTEXT KEY `RefSeq_FULL_IDX` (`RefSeq_TranscriptId`),
FULLTEXT KEY `ft_ProbeSet_Symbol` (`Symbol`)
```

and the query works.

pjotrp 16 Oct 2019 at 09:24 (edited)

Full text search is the least straightforward, see

and

Now for ProbeSet - one of the critical locked tables - we need this.

pjotrp 15 Oct 2019 at 16:38 (edited)

```
time mysqlcheck -c -u webqtlout -pwebqtlout db_webqtl
db_webqtl.Docs
warning  : 1 client is using or hasn't closed the table properly
status   : OK

real    16m52.567s
```

## Convert to InnoDB

The largest tables are

```
1.6G Aug 27  2019 Probe.MYD
2.1G Aug 27  2019 LCorrRamin3.MYD
2.1G Dec  4 22:15 ProbeSetXRef.MYD
2.3G Dec 18 14:56 ProbeSet.MYD
2.6G Aug 27  2019 ProbeSE.MYD
7.1G Nov  2 05:07 ProbeSetSE.MYD
8.3G Aug 28  2019 SnpPattern.MYD
 11G Aug 27  2019 ProbeData.MYD
 11G May 22  2020 GenoData.MYD
 11G Aug 27  2019 SnpAll.MYD
 63G Dec  4 22:15 ProbeSetData.MYD
```

On Penguin2 we are already running ProbeSetData as

```
238G Jul 10  2020 ProbeSetData.ibd
```

which is pretty massive! It includes the index, which is 180G, so the difference is not that great. Also we should try a 4KB page size. Also make sure to enable innodb_file_per_table.


## Tags

* assigned: pjotr
* type: enhancement, documentation
* status: unclear
* priority: medium
* keywords: database, mariadb, innodb