DB stuff

author: Pjotr Prins 2023-03-12 23:37:06 +0100
committer: Pjotr Prins 2023-03-12 23:37:06 +0100
commit: 7223e7514b98f270f7717f48307a5114d7bf0e11 (patch)
tree: b2fae2568da6cc161436a0f182ebf45296455722 /issues
parent: 974f71509292c68274d6800f3cdfa53144ee24af (diff)
download: gn-gemtext-7223e7514b98f270f7717f48307a5114d7bf0e11.tar.gz
2 files changed, 209 insertions, 7 deletions
diff --git a/issues/database-long-query-after-innodb-migration.gmi b/issues/database-long-query-after-innodb-migration.gmi
new file mode 100644
index 0000000..65d6b0d
--- /dev/null
+++ b/issues/database-long-query-after-innodb-migration.gmi
@@ -0,0 +1,133 @@
+# slow text search query
+
+A slow query turned out to do a join on latin1 and utf8 columns. That was
+very slow!
+
+The query contains
+
+```
+WHERE (((Phenotype.Post_publication_description
+LIKE "%liver%" OR Phenotype.Pre_publication_description LIKE "%liver%" OR
+Phenotype.Pre_publication_abbreviation LIKE "%liver%" OR
+Phenotype.Post_publication_abbreviation LIKE "%liver%" OR
+Phenotype.Lab_code LIKE "%liver%" OR Publication.PubMed_ID LIKE "%liver%"
+OR Publication.Abstract LIKE "%liver%" OR Publication.Title LIKE "%liver%"
+OR Publication.Authors LIKE "%liver%" OR PublishXRef.Id LIKE "%liver%") ))
+```
+
+Below page describes the issue. Essentially an index won't help and
+mariadb will scan the whole file for every query. Not good.
+
+=> https://stackoverflow.com/questions/2042269/how-to-speed-up-select-like-queries-in-mysql-on-multiple-columns
+
+This is a typical candidate for FULLTEXT searches where we do a multi
+match against the larger fields, e.g.
+
+    Add a full text index on the columns that you need:
+
+        ALTER TABLE table ADD FULLTEXT INDEX index_table_on_x_y_z (x, y, z);
+
+    Then query those columns:
+
+        SELECT * FROM table WHERE MATCH(x,y,z) AGAINST("text")
+
+I think we can try creating a fulltext for index for Abstract, Title
+and Authors - since these are longer strings.
+
+Again, I note we are doing this the wrong way. We'll unify xapian -
+have you seen how fast that is? But Arun and I need more time to get
+the menu search in place.
+
+So, let's try some things.
+
+```
+ALTER TABLE Publication ADD FULLTEXT INDEX index_table (Title, Abstract, Authors);
+SELECT * FROM Publication WHERE MATCH(Title, Abstract, Authors) AGAINST("diabetes");
+```
+
+renders 23 rows in 0.001 seconds. The combined is still slow, so let's check the Phenotype table too. It has
+
+```
+Phenotype.Post_publication_description
+Phenotype.Pre_publication_description
+Phenotype.Pre_publication_abbreviation
+Phenotype.Post_publication_abbreviation
+Phenotype.Lab_code
+Publication.PubMed_ID
+```
+
+not sure why we need most of these, but let's create an index
+
+```
+ALTER TABLE Phenotype ADD FULLTEXT INDEX index_table (Post_publication_description,Pre_publication_description,Pre_publication_abbreviation,Post_publication_abbreviation,Lab_code);
+SELECT * FROM Phenotype WHERE MATCH(Post_publication_description,Pre_publication_description,Pre_publication_abbreviation,Post_publication_abbreviation,Lab_code) AGAINST("liver");
+```
+
+and that is fast too. Let's combine these. Still slow (darn!). So it must be on the joins.
+
+```
+                INNER JOIN InbredSet ON InbredSet.'SpeciesId' =
+ Species.'Id'
+                 INNER JOIN PublishXRef ON PublishXRef.'InbredSetId' =
+ InbredSet.'Id'
+                 INNER JOIN PublishFreeze ON PublishFreeze.'InbredSetId' =
+ InbredSet.'Id'
+                 INNER JOIN Publication ON Publication.'Id' =
+ PublishXRef.'PublicationId'
+                 INNER JOIN Phenotype ON Phenotype.'Id' =
+ PublishXRef.'PhenotypeId'
+                 LEFT JOIN Geno ON PublishXRef.Locus = Geno.Name AND
+ Geno.SpeciesId = Species.Id
+```
+
+when I remove the final left join the query is fast. That means we can focus on Geno and PublishXRef tables.
+
+First for some reason Geno was still latin1:
+
+```
+ALTER TABLE Geno CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
+```
+
+After that the search is fast.
+
+A nice search now:
+
+```
+SELECT PublishXRef.Id,
+                 CAST(Phenotype.'Pre_publication_description' AS BINARY),
+                 CAST(Phenotype.'Post_publication_description' AS BINARY),
+                 Publication.'Authors',
+                 Publication.'Year',
+                 Publication.'PubMed_ID',
+                 PublishXRef.'mean',
+                 PublishXRef.'LRS',
+                 PublishXRef.'additive',
+                 PublishXRef.'Locus',
+                 InbredSet.'InbredSetCode',
+                 Geno.'Chr',
+                 Geno.'Mb'
+                 FROM Species
+                 INNER JOIN InbredSet ON InbredSet.'SpeciesId' =
+ Species.'Id'
+                 INNER JOIN PublishXRef ON PublishXRef.'InbredSetId' =
+ InbredSet.'Id'
+                 INNER JOIN PublishFreeze ON PublishFreeze.'InbredSetId' =
+ InbredSet.'Id'
+                 INNER JOIN Publication ON Publication.'Id' =
+ PublishXRef.'PublicationId'
+                 INNER JOIN Phenotype ON Phenotype.'Id' =
+ PublishXRef.'PhenotypeId'
+                 LEFT JOIN Geno ON PublishXRef.Locus = Geno.Name AND
+ Geno.SpeciesId = Species.Id
+                         WHERE (((
+ MATCH(Post_publication_description,Pre_publication_description,Pre_publication_abbreviation,Post_publication_abbreviat
+ion,Lab_code) AGAINST("liver")
+ OR Publication.PubMed_ID LIKE "%liver%"
+ OR MATCH(Title, Abstract, Authors) AGAINST("liver")
+ OR PublishXRef.Id LIKE "%liver%") ))
+                         and PublishXRef.InbredSetId = 1
+                         and PublishXRef.PhenotypeId = Phenotype.Id
+                         and PublishXRef.PublicationId = Publication.Id
+                         and PublishFreeze.Id = 1
+                         ORDER BY PublishXRef.Id
+```
diff --git a/issues/systems/mariadb/ProbeSetData.gmi b/issues/systems/mariadb/ProbeSetData.gmi
index 91179c3..9fceea9 100644
--- a/issues/systems/mariadb/ProbeSetData.gmi
+++ b/issues/systems/mariadb/ProbeSetData.gmi
@@ -14,15 +14,16 @@ This is by far the largest table (~200Gb). I need to add disk space to be able t
 
 This time I failed porting to InnoDB (see Migration below):
 
-* [ ] Move database to larger drive
-* [ ] Run second instance of mariadb using a Guix container, upgrade too?
-* [ ] Stop binary log
-* [ ] Drop the indices
-* [ ] Try different sizes of innodb exports
+* [X] Move database to larger drive (stop Mariadb for final stage)
+* [X] Stop binary log (SET sql_log_bin = 0;)
+* [X] Run second instance of mariadb using a Guix container, upgrade too?
+* [X] Drop the indices
+* [X] Try different sizes of innodb exports
 * [ ] Make (ordered) conversion and test performance
 * [ ] Rebuild indices
-* [ ] Restart binary log
+* [ ] Test performance
 * [ ] Muck out ibdata1 and transaction logs
+* [ ] Restart binary log (SET sql_log_bin = 1;)
 
 I disabled these and they need to be restored:
 
@@ -302,7 +303,75 @@ next copy the database to a new partition:
 root@tux01:/export4/local/home/mariadb/database/db_webqtl# rsync -vaP /var/lib/mysql/db_webqtl/* . --delete --bwlimit=20M
 ```
 
-Note I throttle the speed because the system can become quite unusable at full copy speed.
+Note I throttle the speed because the system can become quite unusable at full copy speed. Next I stopped Mariadb and made sure the copy is completed. After restarding mariadb I could continue work on the copy using a guix shell instance as described in
+
+=> setting-up-local-development-database.gmi
+
+Steps were as a normal user
+
+```
+tux01:/export4/local/home/mariadb$ ~/opt/guix-pull/bin/guix pull -p ~/opt/guix-latest
+. ~/opt/guix-latest/etc/profile
+mkdir var
+guix shell -C -N coreutils sed mariadb --share=var=/var
+mysqld_safe --datadir='./database' --user=$USER --nowatch --socket=/var/run/mysqld/mysqld.sock
+mysql --socket=/var/run/mysqld/mysqld.sock -uwebqtlout -p db_webqtl
+```
+
+OK, now it is running and we can start experimenting with the table outside the main database setup. Remember we had
+
+```
+ProbeSetData | CREATE TABLE 'ProbeSetData' (
+  'Id' int(10) unsigned NOT NULL DEFAULT 0,
+  'StrainId' int(20) NOT NULL,
+  'value' float NOT NULL,
+  UNIQUE KEY 'DataId' ('Id','StrainId'),
+  KEY 'strainid' ('StrainId')
+) ENGINE=MyISAM DEFAULT CHARSET=latin1
+```
+
+```
+DROP INDEX strainid ON ProbeSetData;
+DROP INDEX DataId ON ProbeSetData;
+```
+
+of course it starts making a copy of the whole table and takes hours(!) This is why we need over 200Gb free both on the DB directory and the tempdir of the mariadb server.
+
+```
+select count(Id) from ProbeSetData;
++------------+
+| count(Id)  |
++------------+
+| 5173425135 |
++------------+
+MariaDB [db_webqtl]> select max(Id),max(StrainId) from ProbeSetData;
++----------+---------------+
+| max(Id)  | max(StrainId) |
++----------+---------------+
+| 92199993 |         71224 |
++----------+---------------+
+MariaDB [db_webqtl]> select * from ProbeSetData limit 4;
++----+----------+-------+
+| Id | StrainId | value |
++----+----------+-------+
+|  1 |        1 | 5.742 |
+|  1 |        2 | 5.006 |
+|  1 |        3 | 6.079 |
+|  1 |        4 | 6.414 |
++----+----------+-------+
+```
+
+```
+ALTER TABLE ProbeSetData MODIFY StrainId mediumint UNSIGNED NOT NULL;
+```
+
+Now the table is 58Gb without indices. Convert to innodb and add indices
+
+```
+CREATE INDEX id_index ON ProbeSetData(Id);
+ALTER TABLE ProbeSetData ENGINE = InnoDB;
+```
+
 
 ## Notes
author	Pjotr Prins	2023-03-12 23:37:06 +0100
committer	Pjotr Prins	2023-03-12 23:37:06 +0100
commit	7223e7514b98f270f7717f48307a5114d7bf0e11 (patch)
tree	b2fae2568da6cc161436a0f182ebf45296455722 /issues
parent	974f71509292c68274d6800f3cdfa53144ee24af (diff)
download	gn-gemtext-7223e7514b98f270f7717f48307a5114d7bf0e11.tar.gz