aboutsummaryrefslogtreecommitdiff
path: root/docs
diff options
context:
space:
mode:
authorPjotr Prins2021-05-18 04:03:48 -0400
committerPjotr Prins2021-05-18 04:03:48 -0400
commite174551e624e8d9df5d9f143fb0571c78994940e (patch)
tree4cef08db4305ca5657b67b685422c7a7ce52b212 /docs
parent71b9b1e81f3140e66801b4cc78362cee94329243 (diff)
downloadgenenetwork3-e174551e624e8d9df5d9f143fb0571c78994940e.tar.gz
Slow SQL: add note
Diffstat (limited to 'docs')
-rw-r--r--docs/performance/slow-probesetdata-sql.org13
1 files changed, 8 insertions, 5 deletions
diff --git a/docs/performance/slow-probesetdata-sql.org b/docs/performance/slow-probesetdata-sql.org
index e1693ee..3b7ed03 100644
--- a/docs/performance/slow-probesetdata-sql.org
+++ b/docs/performance/slow-probesetdata-sql.org
@@ -7,7 +7,7 @@ drives, RAID5) the query took 1 hour. On Tux01 (solid state NVME) it took 6
minutes. Adding an index for StrainId (see explorations below) reduced that
query to 3 minutes - which is kinda acceptable. The real problem, however is
that this is a quadratic search - so it will get worse quickly - and we need
-to solve it.
+to solve it. This table has doubled in size in the last 5 years.
So what does ProbeSetData contain?
@@ -26,7 +26,8 @@ select * from ProbeSetData limit 5;
You can see Id is sectioned in the file (and there are not that many Ids) but
StrainId is *distributed* through the database file and some 'StrainIds' match
-many data points. Id stands for Dataset and StrainId really means Measurement(!)
+many data points. Id stands for Dataset (item) and StrainId really means
+measurement type or trait measured(!)
Our query looked for 1,236,088 measurement distributed over a 53Gb file (and
an even larger index file). Turns out the full table is read many many times
@@ -41,9 +42,11 @@ We have the following options:
*** Reorder the table
-We could reorder the table on StrainID which would make this search much
-faster but it would many common (dataset) queries slower. So, that is not a
-great idea. One thing we could try is create a copy of the first table.
+We could reorder the table on StrainID which would make this search much faster
+but it would many common (dataset) queries slower. So, that is not a great
+idea. One thing we could try is add a copy of the first table. Not exactly
+elegant but a quick fix for sure. We'll need an embedded procedure to keep it
+up-to-data.
*** Use column based storage