From df8c16f83dd1ff02da562c347566a91f5adf7808 Mon Sep 17 00:00:00 2001
From: Pjotr Prins
Date: Sun, 28 May 2023 09:18:18 -0500
Subject: mariadb indexing

---
 issues/slow-sql-queary-for-xapian-indexing.gmi |  84 -----------------
 issues/slow-sql-query-for-xapian-indexing.gmi  | 122 +++++++++++++++++++++++++
 issues/systems/mariadb/move-to-innodb.gmi      |   6 +-
 tasks/pjotrp.gmi                               |   2 +
 4 files changed, 125 insertions(+), 89 deletions(-)
 delete mode 100644 issues/slow-sql-queary-for-xapian-indexing.gmi
 create mode 100644 issues/slow-sql-query-for-xapian-indexing.gmi

diff --git a/issues/slow-sql-queary-for-xapian-indexing.gmi b/issues/slow-sql-queary-for-xapian-indexing.gmi
deleted file mode 100644
index 5ced977..0000000
--- a/issues/slow-sql-queary-for-xapian-indexing.gmi
+++ /dev/null
@@ -1,84 +0,0 @@
-# Slow query
-
-# Tags
-
-* assigned: pjotrp, aruni
-* priority: high
-* status: in progress
-* keywords: slow query
-
-# Description
-
-Since moving to innodb we have this very slow query:
-
-```
-SELECT ProbeSet.Name AS name, ProbeSet.Symbol AS symbol, ProbeSet.description AS description, ProbeSet.Chr AS chr, ProbeSet.Mb as mb, ProbeSet.alias AS alias, ProbeSet.GenbankId AS genbankid, ProbeSet.UniGeneId AS unigeneid, ProbeSet.Probe_Target_Description AS probe_target_description, ProbeSetFreeze.Name AS dataset, ProbeSetFreeze.FullName AS dataset_fullname, Species.Name AS species, InbredSet.Name AS "group", Tissue.Name AS tissue, ProbeSetXRef.Mean AS mean, ProbeSetXRef.LRS AS lrs, ProbeSetXRef.additive AS additive, Geno.Chr AS geno_chr, Geno.Mb as geno_mb FROM Species INNER JOIN InbredSet ON InbredSet.SpeciesId = Species.Id INNER JOIN ProbeFreeze ON ProbeFreeze.InbredSetId = InbredSet.Id INNER JOIN Tissue ON ProbeFreeze.TissueId = Tissue.Id INNER JOIN ProbeSetFreeze ON ProbeSetFreeze.ProbeFreezeId = ProbeFreeze.Id INNER JOIN ProbeSetXRef ON ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id INNER JOIN ProbeSet ON ProbeSet.Id = ProbeSetXRef.ProbeSetId LEFT JOIN Geno ON ProbeSetXRef.Locus = Geno.Name AND Geno.SpeciesId = Species.Id WHERE ProbeSetFreeze.confidentiality < 1 AND ProbeSetFreeze.public > 0
-```
-
-And this one is also slow - as used in old style global search
-
-```
-SELECT Species."Name" AS species_name,InbredSet."Name" AS inbredset_name, Tissue."Name" AS tissue_name,ProbeSetFreeze.Name AS probesetfreeze_name, ProbeSetFreeze.FullName AS probesetfreeze_fullname, ProbeSet.Name AS probeset_name, ProbeSet.Symbol AS probeset_symbol, CAST(ProbeSet."description" AS BINARY) AS probeset_description, ProbeSet.Chr AS chr, ProbeSet.Mb AS mb,ProbeSetXRef.Mean AS mean,ProbeSetXRef.LRS AS lrs, ProbeSetXRef."Locus" AS locus, ProbeSetXRef."pValue" AS pvalue,ProbeSetXRef."additive" AS additive,ProbeSetFreeze.Id AS probesetfreeze_id, Geno.Chr as geno_chr, Geno.Mb as geno_mbFROM Species INNER JOIN InbredSet ON InbredSet."SpeciesId"=Species."Id" INNER JOIN ProbeFreeze ON ProbeFreeze.InbredSetId=InbredSet."Id" INNER JOIN Tissue ON ProbeFreeze."TissueId"=Tissue."Id"INNER JOIN ProbeSetFreeze ON ProbeSetFreeze.ProbeFreezeId=ProbeFreeze.Id INNER JOIN ProbeSetXRef ON ProbeSetXRef.ProbeSetFreezeId=ProbeSetFreeze.Id INNER JOIN ProbeSet ON ProbeSet.Id = ProbeSetXRef.ProbeSetId LEFT JOIN Geno ON ProbeSetXRef.Locus = Geno.Name AND Geno.SpeciesId = Species.Id WHERE (MATCH (ProbeSet.Name,ProbeSet.description,ProbeSet.symbol,ProbeSet.alias,ProbeSet.GenbankId, ProbeSet.UniGeneId, ProbeSet.Probe_Target_Description) AGAINST ('brca2' IN BOOLEAN MODE) )AND ProbeSetFreeze.confidentiality < 1 AND ProbeSetFreeze.public > 0 ORDER BY species_name, inbredset_name, tissue_name, probesetfreeze_name, probeset_name LIMIT 600;
-```
-
-To find out what index is problematic we first look at sizes. Tables in use are
-
-```
-ProbeSet
-ProbeSetFreeze
-ProbeSetXRef
-ProbeFreeze
-InbredSet
-Tissue
-Species
-Geno
-```
-
-The big ones are:
-
-```
-ProbeSet     4G
-ProbeSetXRef 5G
-```
-
-Then I had a peek at the configuration. query_cache_size and query_cache_limit only cache results. Mariadb uses about 20G of RAM on tux02 - that all looks healthy.
-
-## Buffers and settings
-
-innodb_buffer_pool_size is set to 1Gb - and that may be a bit small, but it shoud not be critical here. Though I found
-
-BUFFER CACHING:
-
-The InnoDB Buffer Pool caches data and index pages. MyISAM only caches index pages.
-
-Just in this area alone, MyISAM does not waste time caching data. That's because it's not designed to cache data. InnoDB caches every data page and index page (and its grandmother) it touches. If your InnoDB Buffer Pool is too small, you could be caching pages, invalidating pages, and removing pages all in one query.
-
-I note that the mysql configuration on tux01 differs from tux02. But both these queries run slow on both machines, so not much point in trying to tweak that first.
-
-## Text formats
-
-Next step, as these are text searches, is to check the character type with something like
-
-```
-for t in 'cat tables.lst' ; do mysql -uwebqtlout -pwebqtlout db_webqtl -e "show create table $t" ; done
-```
-
-which shows that ProbeSet and Geno have utf8 and the others utf8mb4. That may be worth checking. Different text types add overheads to searches (it translates from one to the other).
-
-## Indices
-
-ProbeSet has a large number of columns, indices and full text indices(!)
-
-## Nailing it down
-
-The obvious thing to try is to reduce the query and see what speeds it up again. So, let's try removing the utf8 tables and indices first. E.g.
-
-```
-ALTER TABLE ProbeSet CONVERT TO CHARACTER SET utf8mb4 ;
-```
-
-
-## Further reading
-
-
-=> https://dba.stackexchange.com/questions/75091/why-are-simple-selects-on-innodb-100x-slower-than-on-myisam
diff --git a/issues/slow-sql-query-for-xapian-indexing.gmi b/issues/slow-sql-query-for-xapian-indexing.gmi
new file mode 100644
index 0000000..914d1d1
--- /dev/null
+++ b/issues/slow-sql-query-for-xapian-indexing.gmi
@@ -0,0 +1,122 @@
+# Slow query
+
+# Tags
+
+* assigned: pjotrp, aruni
+* priority: high
+* status: in progress
+* keywords: slow query
+
+# Description
+
+Since moving to innodb we have this very slow query:
+
+```
+SELECT ProbeSet.Name AS name, ProbeSet.Symbol AS symbol, ProbeSet.description AS description, ProbeSet.Chr AS chr, ProbeSet.Mb as mb, ProbeSet.alias AS alias, ProbeSet.GenbankId AS genbankid, ProbeSet.UniGeneId AS unigeneid, ProbeSet.Probe_Target_Description AS probe_target_description, ProbeSetFreeze.Name AS dataset, ProbeSetFreeze.FullName AS dataset_fullname, Species.Name AS species, InbredSet.Name AS "group", Tissue.Name AS tissue, ProbeSetXRef.Mean AS mean, ProbeSetXRef.LRS AS lrs, ProbeSetXRef.additive AS additive, Geno.Chr AS geno_chr, Geno.Mb as geno_mb FROM Species INNER JOIN InbredSet ON InbredSet.SpeciesId = Species.Id INNER JOIN ProbeFreeze ON ProbeFreeze.InbredSetId = InbredSet.Id INNER JOIN Tissue ON ProbeFreeze.TissueId = Tissue.Id INNER JOIN ProbeSetFreeze ON ProbeSetFreeze.ProbeFreezeId = ProbeFreeze.Id INNER JOIN ProbeSetXRef ON ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id INNER JOIN ProbeSet ON ProbeSet.Id = ProbeSetXRef.ProbeSetId LEFT JOIN Geno ON ProbeSetXRef.Locus = Geno.Name AND Geno.SpeciesId = Species.Id WHERE ProbeSetFreeze.confidentiality < 1 AND ProbeSetFreeze.public > 0
+```
+
+And this one is also slow - as used in old style global search
+
+```
+SELECT Species."Name" AS species_name,InbredSet."Name" AS inbredset_name, Tissue."Name" AS tissue_name,ProbeSetFreeze.Name AS probesetfreeze_name, ProbeSetFreeze.FullName AS probesetfreeze_fullname, ProbeSet.Name AS probeset_name, ProbeSet.Symbol AS probeset_symbol, CAST(ProbeSet."description" AS BINARY) AS probeset_description, ProbeSet.Chr AS chr, ProbeSet.Mb AS mb,ProbeSetXRef.Mean AS mean,ProbeSetXRef.LRS AS lrs, ProbeSetXRef."Locus" AS locus, ProbeSetXRef."pValue" AS pvalue,ProbeSetXRef."additive" AS additive,ProbeSetFreeze.Id AS probesetfreeze_id, Geno.Chr as geno_chr, Geno.Mb as geno_mbFROM Species INNER JOIN InbredSet ON InbredSet."SpeciesId"=Species."Id" INNER JOIN ProbeFreeze ON ProbeFreeze.InbredSetId=InbredSet."Id" INNER JOIN Tissue ON ProbeFreeze."TissueId"=Tissue."Id"INNER JOIN ProbeSetFreeze ON ProbeSetFreeze.ProbeFreezeId=ProbeFreeze.Id INNER JOIN ProbeSetXRef ON ProbeSetXRef.ProbeSetFreezeId=ProbeSetFreeze.Id INNER JOIN ProbeSet ON ProbeSet.Id = ProbeSetXRef.ProbeSetId LEFT JOIN Geno ON ProbeSetXRef.Locus = Geno.Name AND Geno.SpeciesId = Species.Id WHERE (MATCH (ProbeSet.Name,ProbeSet.description,ProbeSet.symbol,ProbeSet.alias,ProbeSet.GenbankId, ProbeSet.UniGeneId, ProbeSet.Probe_Target_Description) AGAINST ('brca2' IN BOOLEAN MODE) )AND ProbeSetFreeze.confidentiality < 1 AND ProbeSetFreeze.public > 0 ORDER BY species_name, inbredset_name, tissue_name, probesetfreeze_name, probeset_name LIMIT 600;
+```
+
+To find out what index is problematic we first look at sizes. Tables in use are
+
+```
+ProbeSet
+ProbeSetFreeze
+ProbeSetXRef
+ProbeFreeze
+InbredSet
+Tissue
+Species
+Geno
+```
+
+The big ones are:
+
+```
+ProbeSet     4G
+ProbeSetXRef 5G
+```
+
+Then I had a peek at the configuration. query_cache_size and query_cache_limit only cache results. Mariadb uses about 20G of RAM on tux02 - that all looks healthy.
+
+## Buffers and settings
+
+innodb_buffer_pool_size is set to 1Gb - and that may be a bit small, but it shoud not be critical here. Though I found
+
+BUFFER CACHING:
+
+The InnoDB Buffer Pool caches data and index pages. MyISAM only caches index pages.
+
+Just in this area alone, MyISAM does not waste time caching data. That's because it's not designed to cache data. InnoDB caches every data page and index page (and its grandmother) it touches. If your InnoDB Buffer Pool is too small, you could be caching pages, invalidating pages, and removing pages all in one query.
+
+I note that the mysql configuration on tux01 differs from tux02. But both these queries run slow on both machines, so not much point in trying to tweak that first.
+
+## Text formats
+
+Next step, as these are text searches, is to check the character type with something like
+
+```
+for t in 'cat tables.lst' ; do mysql -uwebqtlout -pwebqtlout db_webqtl -e "show create table $t" ; done
+```
+
+which shows that ProbeSet and Geno have utf8 and the others utf8mb4. That may be worth checking. Different text types add overheads to searches (it translates from one to the other).
+
+In MySQL utf8 is an alias for utf8mb3 which is deprecated and will be removed in a future MySQL release. At that point utf8 will become a reference to utf8mb4. utf8mb4 is a UTF-8 encoding of the Unicode character set using one to four bytes per character.
+
+So we should be using utf8mb4.
+
+This also fixed the global search query.
+
+## Indices
+
+ProbeSet has a large number of columns, indices and full text indices(!)
+
+## Nailing it down
+
+The obvious thing to try is to reduce the query and see what speeds it up again. So, let's try removing the utf8 tables and indices first. E.g.
+
+```
+ALTER TABLE ProbeSet CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
+ALTER TABLE Geno CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
+```
+
+Unfortunately I was getting 'Specified key was too long; max key length is 3072 bytes' again as in
+
+=> ./move-to-innodb.gmi
+
+We use the full length of the text fields to create the index in
+
+```
+FULLTEXT KEY `ft_ProbeSet_full` (`Name`,`description`,`Symbol`,`alias`,`GenbankId`,`UniGeneId`,`Probe_Target_Description`),
+```
+
+BLOB and TEXT columns also can be indexed, but a prefix length must be given not to exceed the maximum length. In global search we used the match function - that has to scan the full text.
+
+```
+MATCH (ProbeSet.Name,ProbeSet.description,ProbeSet.symbol,ProbeSet.alias,Probe Set.GenbankId, ProbeSet.UniGeneId, ProbeSet.Probe_Target_Description) AGAINST ('brca2' IN BOOLEAN MODE)
+```
+
+With xapian search that is no longer done through SQL. We'll need to toy a bit with this index. We need to check what 'menu' search does, though we want to consolidate that with xapian (anyhow).
+
+```
+DROP INDEX ft_ProbeSet_full ON ProbeSet;
+```
+
+and recreate
+
+```
+CREATE FULLTEXT INDEX ft_ProbeSet_Alias ON ProbeSet (alias);
+CREATE FULLTEXT INDEX ft_ProbeSet_Alias ON ProbeSet (alias);
+CREATE FULLTEXT INDEX ft_ProbeSet_full ON ProbeSet (Name,description,Symbol,alias,GenbankId,UniGeneId,Probe_Target_Description);
+```
+
+This made the query fast again (100K results in 2s).
+
+## Further reading
+
+
+=> https://dba.stackexchange.com/questions/75091/why-are-simple-selects-on-innodb-100x-slower-than-on-myisam
diff --git a/issues/systems/mariadb/move-to-innodb.gmi b/issues/systems/mariadb/move-to-innodb.gmi
index fdd3c28..4218c4f 100644
--- a/issues/systems/mariadb/move-to-innodb.gmi
+++ b/issues/systems/mariadb/move-to-innodb.gmi
@@ -583,20 +583,16 @@ Turned out simple utf8 worked:
 ALTER TABLE ProbeSet CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
 ```
 
-and utf8mb is not recommended. We'll need fix that FIXME. See
+and utf8mb3 is not recommended. We'll need fix that FIXME. See
 
 => https://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci
 
-pjotrp 16 Oct 2019 at 09:24 (edited)
-
 Full text search is the least straightforward, see
 
 and
 
 Now for ProbeSet - one of the critical locked tables we need this.
 
-pjotrp 15 Oct 2019 at 16:38 (edited)
-
 ```
 time mysqlcheck -c -u webqtlout -pwebqtlout db_webqtl
 db_webqtl.Docs
diff --git a/tasks/pjotrp.gmi b/tasks/pjotrp.gmi
index f9bc57a..998f391 100644
--- a/tasks/pjotrp.gmi
+++ b/tasks/pjotrp.gmi
@@ -41,6 +41,8 @@ Later
 
 => ./machine-room.gmi machine room
 
+* [ ] Fix mariadb index search - see
+=> slow-sq
 * [ ] Frontend for GN4MSK
 * [.] Check Tony's list and improve search for SNPs and Hs
 * [ ] GeneNetwork consortium (paper & CTC)
-- 
cgit 1.4.1