genenetwork3 - GeneNetwork3 REST API for data science and machine learning

Slow ProbeSetData SQL query

The case

We were looking to optimize a query for correlations. On Penguin2 (standard drives, RAID5) the query took 1 hour. On Tux01 (solid state NVME) it took 6 minutes. Adding an index for StrainId (see explorations below) reduced that query to 3 minutes - which is kinda acceptable. The real problem, however is that this is a quadratic search - so it will get worse quickly - and we need to solve it. This table has doubled in size in the last 5 years.

So what does ProbeSetData contain?

select * from ProbeSetData limit 5;
+----+----------+-------+
| Id | StrainId | value |
+----+----------+-------+
|  1 |        1 | 5.742 |
|  1 |        2 | 5.006 |
|  1 |        3 | 6.079 |
|  1 |        4 | 6.414 |
|  1 |        5 | 4.885 |
+----+----------+-------+

You can see Id is sectioned in the file (and there are not that many Ids) but StrainId is distributed through the database file and some 'StrainIds' match many data points. Id stands for Dataset (item) and StrainId really means measurement type or trait measured(!)

Our query looked for 1,236,088 measurement distributed over a 53Gb file (and an even larger index file). Turns out the full table is read many many times over for one particular query pivoting on strainid…

We have the following options:

Reorder the table
Use column based storage
Use compression
Use a different storage layout

Reorder the table

We could reorder the table on StrainID which would make this search much faster but it would many common (dataset) queries slower. So, that is not a great idea. One thing we could try is add a copy of the first table. Not exactly elegant but a quick fix for sure. We'll need an embedded procedure to keep it up-to-data.

Use column based storage

Column-based storage works when you need a subset of the data in the table. In this case it won't help much because, even though the pivot itself would be faster, we still traverse all data to get IDs and values.

Use compression

Compression reduces the size on disk and may be beneficial. Real life metrics on the internet don't show that much improvement, but we could try native compression and/or ZFS.

Use a different storage layout

My prediction is that we can not get around this.

Result

The final query is shown at the bottom of 'Exploration'. It takes 2 minutes on Penguin2 and Tux01. Essentially I took out the joins (which parsed the same table repeatedly) and added an index. The trick is to keep minimizing the query.

The 2 minute query will do for now and it probably is no longer quadratic.

We can probably improve things in the future by changing the way ProbeSetData is stored.

Exploration

In the following steps I reduce the complex case to a simple case explaining the performance bottleneck we are seeing today. I did not add comments, but you can see what I did.

MariaDB [db_webqtl]> select * from ProbeSetXRef limit 2;
+------------------+------------+--------+------------+--------------------+------------+------------------+---------------------+------------+------------------+--------+--------------------+------+
| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old  | LRS_old            | pValue_old | mean             | se                  | Locus      | LRS              | pValue | additive           | h2   |
+------------------+------------+--------+------------+--------------------+------------+------------------+---------------------+------------+------------------+--------+--------------------+------+
|                1 |          1 |      1 | 10.095.400 |   13.3971627898894 |      0.163 | 5.48794285714286 | 0.08525787814808819 | rs13480619 |  12.590069931048 |  0.269 |        -0.28515625 | NULL |
|                1 |          2 |      2 | D15Mit189  | 10.042057464356201 |      0.431 | 9.90165714285714 |  0.0374686634976217 | rs29535974 | 10.5970737900941 |  0.304 | -0.116783333333333 | NULL |
+------------------+------------+--------+------------+--------------------+------------+------------------+---------------------+------------+------------------+--------+--------------------+------+
2 rows in set (0.001 sec)

MariaDB [db_webqtl]> SELECT ProbeSet.Name,T4.value FROM (ProbeSet, ProbeSetXRef, ProbeSetFreeze)  left join ProbeSetData as T4 on T4.Id = ProbeSetXRef.DataId
    ->                         and T4.StrainId=4
    -> limit 5;
+------+-------+
| Name | value |
+------+-------+
| NULL | 6.414 |
| NULL | 6.414 |
| NULL | 6.414 |
| NULL | 6.414 |
| NULL | 6.414 |
+------+-------+

MariaDB [db_webqtl]> SELECT ProbeSet.Name,T4.value FROM (ProbeSet, ProbeSetXRef)
  left join ProbeSetData as T4 on T4.StrainId=4 and T4.Id = ProbeSetXRef.DataId
  WHERE ProbeSet.Id = ProbeSetXRef.ProbeSetId order by ProbeSet.Id limit 5;
+-----------+-------+
| Name      | value |
+-----------+-------+
| 100001_at | 6.414 |
| 100001_at | 6.414 |
| 100001_at | 6.414 |
| 100001_at | 8.414 |
| 100001_at | 8.414 |
+-----------+-------+
5 rows in set (20.064 sec)
#+END_SRC SQL

#+BEGIN_SRC SQL
MariaDB [db_webqtl]> SELECT ProbeSet.Name,T4.value,T5.value FROM (ProbeSet, ProbeSetXRef, ProbeSetFreeze)
left join ProbeSetData as T4 on T4.Id = ProbeSetXRef.DataId and T4.StrainId=4
left join ProbeSetData as T5 on T5.Id = ProbeSetXRef.DataId and T5.StrainId=5
WHERE ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id
and ProbeSetFreeze.Name = 'UMUTAffyExon_0209_RMA'
and ProbeSet.Id = ProbeSetXRef.ProbeSetId
order by ProbeSet.Id  limit 5;

+---------+---------+
| Name    | value   |
+---------+---------+
| 4331726 | 5.52895 |
| 5054239 | 6.29465 |
| 4642578 | 9.13706 |
| 4398221 | 6.77672 |
| 5543360 | 4.30016 |
+---------+---------+

SELECT ProbeSet.Name,T4.value FROM (ProbeSet, ProbeSetXRef, ProbeSetFreeze)
left join ProbeSetData as T4 on T4.Id = ProbeSetXRef.DataId and T4.StrainId=4
WHERE ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id
and ProbeSetFreeze.Name = 'UMUTAffyExon_0209_RMA'
and ProbeSet.Id = ProbeSetXRef.ProbeSetId
order by ProbeSet.Id ;

1236087 rows in set (19.173 sec)

SELECT ProbeSet.Name,T4.value FROM (ProbeSet, ProbeSetXRef, ProbeSetFreeze)
left join ProbeSetData as T4 on T4.StrainId=4 and T4.Id = ProbeSetXRef.DataId
WHERE ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id
and ProbeSetFreeze.Name = 'UMUTAffyExon_0209_RMA'
and ProbeSet.Id = ProbeSetXRef.ProbeSetId
order by ProbeSet.Id ;

1236087 rows in set (19.173 sec)

SELECT ProbeSet.Name FROM (ProbeSet, ProbeSetXRef, ProbeSetFreeze)
WHERE ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id
and ProbeSetFreeze.Name = 'UMUTAffyExon_0209_RMA'
and ProbeSet.Id = ProbeSetXRef.ProbeSetId
order by ProbeSet.Id ;

Find all the probeset 'names' (probe sequence included) for one dataset:

SELECT count(DISTINCT ProbeSet.Name) FROM (ProbeSet, ProbeSetXRef, ProbeSetFreeze)          WHERE ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id                       and ProbeSetFreeze.Name = 'UMUTAffyExon_0209_RMA'            and ProbeSet.Id = ProbeSetXRef.ProbeSetId                                   order by ProbeSet.Id;
+-------------------------------+
| count(DISTINCT ProbeSet.Name) |
+-------------------------------+
|                       1236087 |
+-------------------------------+

Now for each of those probesets:

SELECT ProbeSet.Name,T4.value FROM (ProbeSet, ProbeSetXRef)
left join ProbeSetData as T4 on T4.StrainId=4 and T4.Id = ProbeSetXRef.DataId
WHERE ProbeSet.Id = ProbeSetXRef.ProbeSetId
order by ProbeSet.Id limit 5;

ProbeSetXRef contains the p-values:

select * from ProbeSetXRef limit 5;
+------------------+------------+--------+------------+--------------------+------------+-------------------+---------------------+------------+------------------+--------+--------------------+------+
| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old  | LRS_old            | pValue_old | mean              | se                  | Locus      | LRS              | pValue | additive           | h2   |
+------------------+------------+--------+------------+--------------------+------------+-------------------+---------------------+------------+------------------+--------+--------------------+------+
|                1 |          1 |      1 | 10.095.400 |   13.3971627898894 |      0.163 |  5.48794285714286 | 0.08525787814808819 | rs13480619 |  12.590069931048 |  0.269 |        -0.28515625 | NULL |
|                1 |          2 |      2 | D15Mit189  | 10.042057464356201 |      0.431 |  9.90165714285714 |  0.0374686634976217 | rs29535974 | 10.5970737900941 |  0.304 | -0.116783333333333 | NULL |

SELECT count(T4.value) FROM (ProbeSet, ProbeSetXRef)
left join ProbeSetData as T4 on T4.StrainId=4 and T4.Id = ProbeSetXRef.DataId
WHERE ProbeSet.Id = ProbeSetXRef.ProbeSetId ;

SELECT count(T4.value) FROM (ProbeSet, ProbeSetXRef) left join ProbeSetData as T4 on T4.StrainId=4 limit 5;

select value from (ProbeSetData) where StrainId=4 limit 5;

So, this is the sloooow baby:

select count(id) from (ProbeSetData) where StrainId=4;

| ProbeSetData |          0 | DataId   |            2 | StrainId    | A         |  4852908856 |     NULL | NULL   |      | BTREE      |         |               |

-rw-rw---- 1 mysql mysql  53G Mar  3 23:49 ProbeSetData.MYD
-rw-rw---- 1 mysql mysql  66G Mar  4 03:00 ProbeSetData.MYI

create index strainid on ProbeSetData(StrainId);
Stage: 1 of 2 'Copy to tmp table'   8.77% of stage done
Stage: 2 of 2 'Enabling keys'      0% of stage done

MariaDB [db_webqtl]> create index strainid on ProbeSetData(StrainId);
Query OK, 5111384047 rows affected (2 hours 56 min 25.807 sec)
Records: 5111384047  Duplicates: 0  Warnings: 0

MariaDB [db_webqtl]> select count(id) from (ProbeSetData) where StrainId=4;

+-----------+
| count(id) |
+-----------+
|  14267545 |
+-----------+
1 row in set (19.707 sec)

MariaDB [db_webqtl]> select count(*) from ProbeSetData where strainid = 140;
+----------+
| count(*) |
+----------+
| 10717771 |
+----------+
1 row in set (10.161 sec)

MariaDB [db_webqtl]> select count(*) from ProbeSetData where strainid = 140 and id=4;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.000 sec)

MariaDB [db_webqtl]> select count(*) from ProbeSetData where strainid = 4 and id=4;
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (0.000 sec)

select id from ProbeSetFreeze where id=1;

WHERE ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id
and ProbeSetFreeze.Name = 'UMUTAffyExon_0209_RMA'
and ProbeSet.Id = ProbeSetXRef.ProbeSetId
order by ProbeSet.Id  limit 5;

select count(ProbeSetId) from ProbeSetXRef where ProbeSetFreezeId=1;
+-------------------+
| count(ProbeSetId) |
+-------------------+
|             12422 |
+-------------------+
1 row in set (0.006 sec)

select count(ProbeSetId) from (ProbeSetXRef,ProbeSetFreeze) where
ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id
and ProbeSetFreeze.Name = 'UMUTAffyExon_0209_RMA';

MariaDB [db_webqtl]> select count(ProbeSetId) from (ProbeSetXRef,ProbeSetFreeze) where
    -> ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id
    -> and ProbeSetFreeze.Name = 'UMUTAffyExon_0209_RMA';
+-------------------+
| count(ProbeSetId) |
+-------------------+
|           1236087 |
+-------------------+
1 row in set (0.594 sec)

ProbeSetXRef.ProbeSetFreezeId is 206, so

MariaDB [db_webqtl]> select count(ProbeSetId) from (ProbeSetXRef) where ProbeSetXRef.ProbeSetFreezeId = 206;
+-------------------+
| count(ProbeSetId) |
+-------------------+
|           1236087 |
+-------------------+
1 row in set (0.224 sec)

MariaDB [db_webqtl]> select count(*) from ProbeSetData where strainid = 1 and id=4;
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (0.000 sec)

Now this query is fast because it traverses the ProbeSetData table only once and uses id as a starting point:

MariaDB [db_webqtl]>  select count(*) from (ProbeSetData,ProbeSetXRef) where ProbeSetXRef.ProbeSetFreezeId = 206 and id=ProbeSetId;
+----------+
| count(*) |
+----------+
| 10699448 |
+----------+
1 row in set (4.429 sec)

select id,strainid,value from (ProbeSetData,ProbeSetXRef) where
  ProbeSetXRef.ProbeSetFreezeId = 206 and id=ProbeSetId
limit 5;
+--------+----------+-------+
| id     | strainid | value |
+--------+----------+-------+
| 225088 |        1 |  7.33 |
| 225088 |        2 | 7.559 |
| 225088 |        3 |  7.84 |
| 225088 |        4 | 7.835 |
| 225088 |        5 | 7.652 |
+--------+----------+-------+
5 rows in set (0.001 sec)

select id,strainid,value from (ProbeSetData,ProbeSetXRef) where   ProbeSetXRef.ProbeSetFreezeId = 206 and id=ProbeSetId   and strainid=4 limit 5;
+--------+----------+-------+
| id     | strainid | value |
+--------+----------+-------+
| 225088 |        4 | 7.835 |
| 225089 |        4 | 9.595 |
| 225090 |        4 | 8.982 |
| 225091 |        4 | 8.153 |
| 225092 |        4 | 7.111 |
+--------+----------+-------+
5 rows in set (0.000 sec)

#+BEGIN_SRC SQL select ProbeSet.name,strainid,probesetfreezeid,value from (ProbeSet,ProbeSetData,ProbeSetFreeze,ProbeSetXRef) where ProbeSetData.id=ProbeSetId