From e174551e624e8d9df5d9f143fb0571c78994940e Mon Sep 17 00:00:00 2001
From: Pjotr Prins
Date: Tue, 18 May 2021 04:03:48 -0400
Subject: Slow SQL: add note

---
 docs/performance/slow-probesetdata-sql.org | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/docs/performance/slow-probesetdata-sql.org b/docs/performance/slow-probesetdata-sql.org
index e1693ee..3b7ed03 100644
--- a/docs/performance/slow-probesetdata-sql.org
+++ b/docs/performance/slow-probesetdata-sql.org
@@ -7,7 +7,7 @@ drives, RAID5) the query took 1 hour. On Tux01 (solid state NVME) it took 6
 minutes. Adding an index for StrainId (see explorations below) reduced that
 query to 3 minutes - which is kinda acceptable. The real problem, however is
 that this is a quadratic search - so it will get worse quickly - and we need
-to solve it.
+to solve it. This table has doubled in size in the last 5 years.
 
 So what does ProbeSetData contain?
 
@@ -26,7 +26,8 @@ select * from ProbeSetData limit 5;
 
 You can see Id is sectioned in the file (and there are not that many Ids) but
 StrainId is *distributed* through the database file and some 'StrainIds' match
-many data points. Id stands for Dataset and StrainId really means Measurement(!)
+many data points. Id stands for Dataset (item) and StrainId really means
+measurement type or trait measured(!)
 
 Our query looked for 1,236,088 measurement distributed over a 53Gb file (and an
 even larger index file). Turns out the full table is read many many times
@@ -41,9 +42,11 @@ We have the following options:
 
 *** Reorder the table
 
-We could reorder the table on StrainID which would make this search much
-faster but it would many common (dataset) queries slower. So, that is not a
-great idea. One thing we could try is create a copy of the first table.
+We could reorder the table on StrainID which would make this search much faster
+but it would make many common (dataset) queries slower. So, that is not a great
+idea. One thing we could try is add a copy of the first table. Not exactly
+elegant but a quick fix for sure. We'll need an embedded procedure to keep it
+up-to-date.
 
 *** Use column based storage
 
-- 
cgit v1.2.3