author | Pjotr Prins | 2023-03-12 23:37:06 +0100 |
---|---|---|
committer | Pjotr Prins | 2023-03-12 23:37:06 +0100 |
commit | 7223e7514b98f270f7717f48307a5114d7bf0e11 (patch) | |
tree | b2fae2568da6cc161436a0f182ebf45296455722 /issues | |
parent | 974f71509292c68274d6800f3cdfa53144ee24af (diff) | |
download | gn-gemtext-7223e7514b98f270f7717f48307a5114d7bf0e11.tar.gz |
DB stuff
Diffstat (limited to 'issues')
-rw-r--r-- | issues/database-long-query-after-innodb-migration.gmi | 133 |
-rw-r--r-- | issues/systems/mariadb/ProbeSetData.gmi | 83 |
2 files changed, 209 insertions, 7 deletions
diff --git a/issues/database-long-query-after-innodb-migration.gmi b/issues/database-long-query-after-innodb-migration.gmi
new file mode 100644
index 0000000..65d6b0d
--- /dev/null
+++ b/issues/database-long-query-after-innodb-migration.gmi
@@ -0,0 +1,133 @@
+# slow text search query
+
+A slow query turned out to do a join on latin1 and utf8 columns. That was
+very slow!
+
+The query contains
+
+```
+WHERE (((Phenotype.Post_publication_description
+LIKE "%liver%" OR Phenotype.Pre_publication_description LIKE "%liver%" OR
+Phenotype.Pre_publication_abbreviation LIKE "%liver%" OR
+Phenotype.Post_publication_abbreviation LIKE "%liver%" OR
+Phenotype.Lab_code LIKE "%liver%" OR Publication.PubMed_ID LIKE "%liver%"
+OR Publication.Abstract LIKE "%liver%" OR Publication.Title LIKE "%liver%"
+OR Publication.Authors LIKE "%liver%" OR PublishXRef.Id LIKE "%liver%") ))
+```
+
+The page below describes the issue. Essentially an index won't help and
+mariadb will scan the whole file for every query. Not good.
+
+=> https://stackoverflow.com/questions/2042269/how-to-speed-up-select-like-queries-in-mysql-on-multiple-columns
+
+This is a typical candidate for FULLTEXT searches where we do a multi
+match against the larger fields, e.g.
+
+  Add a full text index on the columns that you need:
+
+    ALTER TABLE table ADD FULLTEXT INDEX index_table_on_x_y_z (x, y, z);
+
+  Then query those columns:
+
+    SELECT * FROM table WHERE MATCH(x,y,z) AGAINST("text")
+
+I think we can try creating a fulltext index for Abstract, Title
+and Authors - since these are longer strings.
+
+Again, I note we are doing this the wrong way. We'll unify xapian -
+have you seen how fast that is? But Arun and I need more time to get
+the menu search in place.
+
+So, let's try some things.
+
+```
+ALTER TABLE Publication ADD FULLTEXT INDEX index_table (Title, Abstract, Authors);
+SELECT * FROM Publication WHERE MATCH(Title, Abstract, Authors) AGAINST("diabetes");
+```
+
+renders 23 rows in 0.001 seconds. The combined is still slow, so let's check the Phenotype table too. It has
+
+```
+Phenotype.Post_publication_description
+Phenotype.Pre_publication_description
+Phenotype.Pre_publication_abbreviation
+Phenotype.Post_publication_abbreviation
+Phenotype.Lab_code
+Publication.PubMed_ID
+```
+
+not sure why we need most of these, but let's create an index
+
+```
+ALTER TABLE Phenotype ADD FULLTEXT INDEX index_table (Post_publication_description,Pre_publication_description,Pre_publication_abbreviation,Post_publication_abbreviation,Lab_code);
+SELECT * FROM Phenotype WHERE MATCH(Post_publication_description,Pre_publication_description,Pre_publication_abbreviation,Post_publication_abbreviation,Lab_code) AGAINST("liver");
+```
+
+and that is fast too. Let's combine these. Still slow (darn!). So it must be on the joins.
+
+```
+    INNER JOIN InbredSet ON InbredSet.`SpeciesId` =
+        Species.`Id`
+    INNER JOIN PublishXRef ON PublishXRef.`InbredSetId` =
+        InbredSet.`Id`
+    INNER JOIN PublishFreeze ON PublishFreeze.`InbredSetId` =
+        InbredSet.`Id`
+    INNER JOIN Publication ON Publication.`Id` =
+        PublishXRef.`PublicationId`
+    INNER JOIN Phenotype ON Phenotype.`Id` =
+        PublishXRef.`PhenotypeId`
+    LEFT JOIN Geno ON PublishXRef.Locus = Geno.Name AND
+        Geno.SpeciesId = Species.Id
+```
+
+when I remove the final left join the query is fast. That means we can focus on Geno and PublishXRef tables.
+
+First for some reason Geno was still latin1:
+
+```
+ALTER TABLE Geno CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
+```
+
+After that the search is fast.
+
+A nice search now:
+
+```
+SELECT PublishXRef.Id,
+    CAST(Phenotype.`Pre_publication_description` AS BINARY),
+    CAST(Phenotype.`Post_publication_description` AS BINARY),
+    Publication.`Authors`,
+    Publication.`Year`,
+    Publication.`PubMed_ID`,
+    PublishXRef.`mean`,
+    PublishXRef.`LRS`,
+    PublishXRef.`additive`,
+    PublishXRef.`Locus`,
+    InbredSet.`InbredSetCode`,
+    Geno.`Chr`,
+    Geno.`Mb`
+    FROM Species
+    INNER JOIN InbredSet ON InbredSet.`SpeciesId` =
+        Species.`Id`
+    INNER JOIN PublishXRef ON PublishXRef.`InbredSetId` =
+        InbredSet.`Id`
+    INNER JOIN PublishFreeze ON PublishFreeze.`InbredSetId` =
+        InbredSet.`Id`
+    INNER JOIN Publication ON Publication.`Id` =
+        PublishXRef.`PublicationId`
+    INNER JOIN Phenotype ON Phenotype.`Id` =
+        PublishXRef.`PhenotypeId`
+    LEFT JOIN Geno ON PublishXRef.Locus = Geno.Name AND
+        Geno.SpeciesId = Species.Id
+    WHERE (((
+    MATCH(Post_publication_description,Pre_publication_description,Pre_publication_abbreviation,Post_publication_abbreviat
+ion,Lab_code) AGAINST("liver")
+    OR Publication.PubMed_ID LIKE "%liver%"
+    OR MATCH(Title, Abstract, Authors) AGAINST("liver")
+    OR PublishXRef.Id LIKE "%liver%") ))
+    and PublishXRef.InbredSetId = 1
+    and PublishXRef.PhenotypeId = Phenotype.Id
+    and PublishXRef.PublicationId = Publication.Id
+    and PublishFreeze.Id = 1
+    ORDER BY PublishXRef.Id
+```
diff --git a/issues/systems/mariadb/ProbeSetData.gmi b/issues/systems/mariadb/ProbeSetData.gmi
index 91179c3..9fceea9 100644
--- a/issues/systems/mariadb/ProbeSetData.gmi
+++ b/issues/systems/mariadb/ProbeSetData.gmi
@@ -14,15 +14,16 @@ This is by far the largest table (~200Gb). I need to add disk space to be able t
 
 This time I failed porting to InnoDB (see Migration below):
 
-* [ ] Move database to larger drive
-* [ ] Run second instance of mariadb using a Guix container, upgrade too?
-* [ ] Stop binary log
-* [ ] Drop the indices
-* [ ] Try different sizes of innodb exports
+* [X] Move database to larger drive (stop Mariadb for final stage)
+* [X] Stop binary log (SET sql_log_bin = 0;)
+* [X] Run second instance of mariadb using a Guix container, upgrade too?
+* [X] Drop the indices
+* [X] Try different sizes of innodb exports
 * [ ] Make (ordered) conversion and test performance
 * [ ] Rebuild indices
-* [ ] Restart binary log
+* [ ] Test performance
 * [ ] Muck out ibdata1 and transaction logs
+* [ ] Restart binary log (SET sql_log_bin = 1;)
 
 I disabled these and they need to be restored:
 
@@ -302,7 +303,75 @@ next copy the database to a new partition:
 root@tux01:/export4/local/home/mariadb/database/db_webqtl# rsync -vaP /var/lib/mysql/db_webqtl/* . --delete --bwlimit=20M
 ```
 
-Note I throttle the speed because the system can become quite unusable at full copy speed.
+Note I throttle the speed because the system can become quite unusable at full copy speed. Next I stopped Mariadb and made sure the copy is completed. After restarting mariadb I could continue work on the copy using a guix shell instance as described in
+
+=> setting-up-local-development-database.gmi
+
+Steps were as a normal user
+
+```
+tux01:/export4/local/home/mariadb$ ~/opt/guix-pull/bin/guix pull -p ~/opt/guix-latest
+. ~/opt/guix-latest/etc/profile
+mkdir var
+guix shell -C -N coreutils sed mariadb --share=var=/var
+mysqld_safe --datadir='./database' --user=$USER --nowatch --socket=/var/run/mysqld/mysqld.sock
+mysql --socket=/var/run/mysqld/mysqld.sock -uwebqtlout -p db_webqtl
+```
+
+OK, now it is running and we can start experimenting with the table outside the main database setup. Remember we had
+
+```
+ProbeSetData | CREATE TABLE `ProbeSetData` (
+  `Id` int(10) unsigned NOT NULL DEFAULT 0,
+  `StrainId` int(20) NOT NULL,
+  `value` float NOT NULL,
+  UNIQUE KEY `DataId` (`Id`,`StrainId`),
+  KEY `strainid` (`StrainId`)
+) ENGINE=MyISAM DEFAULT CHARSET=latin1
+```
+
+```
+DROP INDEX strainid ON ProbeSetData;
+DROP INDEX DataId ON ProbeSetData;
+```
+
+of course it starts making a copy of the whole table and takes hours(!) This is why we need over 200Gb free both on the DB directory and the tempdir of the mariadb server.
+
+```
+select count(Id) from ProbeSetData;
++------------+
+| count(Id)  |
++------------+
+| 5173425135 |
++------------+
+MariaDB [db_webqtl]> select max(Id),max(StrainId) from ProbeSetData;
++----------+---------------+
+| max(Id)  | max(StrainId) |
++----------+---------------+
+| 92199993 |         71224 |
++----------+---------------+
+MariaDB [db_webqtl]> select * from ProbeSetData limit 4;
++----+----------+-------+
+| Id | StrainId | value |
++----+----------+-------+
+|  1 |        1 | 5.742 |
+|  1 |        2 | 5.006 |
+|  1 |        3 | 6.079 |
+|  1 |        4 | 6.414 |
++----+----------+-------+
+```
+
+```
+ALTER TABLE ProbeSetData MODIFY StrainId mediumint UNSIGNED NOT NULL;
+```
+
+Now the table is 58Gb without indices. Convert to innodb and add indices
+
+```
+CREATE INDEX id_index ON ProbeSetData(Id);
+ALTER TABLE ProbeSetData ENGINE = InnoDB;
+```
+
 
 ## Notes
 
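Editor's sketch, not part of the commit: the ProbeSetData change narrows StrainId to an unsigned mediumint, and the query output in the diff gives the observed maxima and the row count. The snippet below checks that those maxima fit the chosen types and estimates the raw payload size. The helper name and the constants are illustrative assumptions; the byte widths are the standard MariaDB/MySQL storage sizes, and per-row engine overhead is deliberately not modelled, so the result is only expected to land in the same range as the 58Gb quoted above.

```python
# Standard unsigned ranges and storage widths for MariaDB/MySQL numeric types.
UNSIGNED_MAX = {"SMALLINT": 2**16 - 1, "MEDIUMINT": 2**24 - 1, "INT": 2**32 - 1}
STORAGE_BYTES = {"SMALLINT": 2, "MEDIUMINT": 3, "INT": 4, "FLOAT": 4}


def smallest_unsigned_type(max_value):
    """Return the narrowest unsigned integer type that can hold max_value.

    Hypothetical helper for illustration only.
    """
    for name in ("SMALLINT", "MEDIUMINT", "INT"):
        if max_value <= UNSIGNED_MAX[name]:
            return name
    raise ValueError(f"{max_value} does not fit in an unsigned INT")


# Values copied from the query output shown in the diff.
ROWS = 5_173_425_135
MAX_ID = 92_199_993
MAX_STRAIN_ID = 71_224

id_type = smallest_unsigned_type(MAX_ID)
strain_type = smallest_unsigned_type(MAX_STRAIN_ID)

# Payload per row: Id + StrainId + value (float). Engine overhead is ignored.
row_bytes = STORAGE_BYTES[id_type] + STORAGE_BYTES[strain_type] + STORAGE_BYTES["FLOAT"]
raw_bytes = ROWS * row_bytes

print(f"Id fits in {id_type}, StrainId fits in {strain_type}")
print(f"{row_bytes} bytes/row, about {raw_bytes / 1e9:.1f} GB of raw column data")
```

The check shows why Id has to stay a 4-byte int while StrainId can drop to 3 bytes: 71224 is above the unsigned smallint limit of 65535 but far below the unsigned mediumint limit of 16777215.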