diff options
author | Pjotr Prins | 2021-05-18 04:03:48 -0400 |
---|---|---|
committer | Pjotr Prins | 2021-05-18 04:03:48 -0400 |
commit | e174551e624e8d9df5d9f143fb0571c78994940e (patch) | |
tree | 4cef08db4305ca5657b67b685422c7a7ce52b212 /docs/performance | |
parent | 71b9b1e81f3140e66801b4cc78362cee94329243 (diff) | |
download | genenetwork3-e174551e624e8d9df5d9f143fb0571c78994940e.tar.gz |
Slow SQL: add note
Diffstat (limited to 'docs/performance')
-rw-r--r-- | docs/performance/slow-probesetdata-sql.org | 13 |
1 files changed, 8 insertions, 5 deletions
diff --git a/docs/performance/slow-probesetdata-sql.org b/docs/performance/slow-probesetdata-sql.org index e1693ee..3b7ed03 100644 --- a/docs/performance/slow-probesetdata-sql.org +++ b/docs/performance/slow-probesetdata-sql.org @@ -7,7 +7,7 @@ drives, RAID5) the query took 1 hour. On Tux01 (solid state NVME) it took 6 minutes. Adding an index for StrainId (see explorations below) reduced that query to 3 minutes - which is kinda acceptable. The real problem, however is that this is a quadratic search - so it will get worse quickly - and we need -to solve it. +to solve it. This table has doubled in size in the last 5 years. So what does ProbeSetData contain? @@ -26,7 +26,8 @@ select * from ProbeSetData limit 5; You can see Id is sectioned in the file (and there are not that many Ids) but StrainId is *distributed* through the database file and some 'StrainIds' match -many data points. Id stands for Dataset and StrainId really means Measurement(!) +many data points. Id stands for Dataset (item) and StrainId really means +measurement type or trait measured(!) Our query looked for 1,236,088 measurement distributed over a 53Gb file (and an even larger index file). Turns out the full table is read many many times @@ -41,9 +42,11 @@ We have the following options: *** Reorder the table -We could reorder the table on StrainID which would make this search much -faster but it would many common (dataset) queries slower. So, that is not a -great idea. One thing we could try is create a copy of the first table. +We could reorder the table on StrainID which would make this search much faster +but it would many common (dataset) queries slower. So, that is not a great +idea. One thing we could try is add a copy of the first table. Not exactly +elegant but a quick fix for sure. We'll need an embedded procedure to keep it +up-to-data. *** Use column based storage |