summaryrefslogtreecommitdiff
path: root/issues/materialised-views-for-correlations.gmi
blob: 5755e6ec8941dfadbd4478eb53edc3094f738c34 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
# Materialised Views for Correlations

## Tags

* assigned: alex, fred, zachs
* type: feature, improvement
* priority: high
* keywords: correlations, materialised views, optimisation
* status: closed, completed

## Description

Running correlations on some traits, for example
=> https://genenetwork.org/show_trait?trait_id=5612792&dataset=UMUTAffyExon_0209_RMA
takes such a huge amount of time, that the either the back-end server, or the browser times out.

As part of the optimisation effort, we need to look into materialised views to speed up fetching of the data.

MySQL/MariaDB does not support materialised views (there is no `CREATE MATERIALIZED VIEW ...` command or equivalent) but there are ways around that, to simulate the materialized views.

There is some work on
=> /topics/genotype-database the genotype database
that should allow intermediate materialised views to be stored in lmdb

There might need to be multiple materialised views for the different types of datasets/traits, i.e.
* Phenotypes (Publish)
* Genotypes (Geno)
* mRNA (ProbeSet)
* Temp (Not sure how this would be handled)

### Implementation Concerns

The implementation of the materialised views will need to concern itself with the following issues:

* Indexing the view to speed up the queries significantly
* Refreshing the data in the view if data in source table(s) changes (ideally) or periodically (less ideal).

### Other Concerns

Maybe, if we are going down the materialised views road, we should also have migrations in place - that way we can keep track of what schema, triggers, views, etc are in place, and be able to reproduce them automatically (or semi-automatically).

The reason for the concern above is, for example, The developers (alex, arun, bonz, fredm, etc) can implement whatever schema changes, triggers and views locally, but they do not (and probably should not) have access to production. As such, there should be provided a sort of CLI endpoint for the deployment system (or person) to run to implement the changes on the production (and CI/CD) database to get similar results.

The problem here, is that the migration might be moot, if the data is then moved out of the database, as is being planned.

### Queries to Materialise

Possible candidate queries for materialisation are:

=> https://github.com/genenetwork/genenetwork2/blob/a2b837801d479ed2fb06ca33c07de9c271532c46/wqflask/base/data_set/dataset.py#L200-L292

The method above is doing way too much - it should probably be split into separate methods for each class, to simplify the code a little and make it clearer what each part does before reworking the queries for the materialized view.


=> https://github.com/genenetwork/genenetwork2/blob/a2b837801d479ed2fb06ca33c07de9c271532c46/wqflask/base/trait.py#L386-L610	

The method above is also doing way too much.


Both methods above do not have the metadata, so probably also have a look at adding the metadata to the materialized views

### Possible "Entities" in Materialised Views

In my early (2022-10-10) assessment, for each of the different types of datasets/traits mentioned in the description, we might need the data for the following "entities" in the materialized views:

* Species data (db tables: Species)
* Strains data (db tables: Strain, NStrain)
* Group info (db tables: InbredSet)
* Dataset Info and Metadata
* Trait data and metadata

The following are the database tables classified into their various functions (in the order order: Genotypes, mRNA, Phenotypes, Temp):

#### Dataset Information Tables

* GenoFreeze
* ProbeSetFreeze
* PublishFreeze

#### Dataset Metadata Tables

* Geno
* ProbeSet
* Phenotype
* Temp

#### Sample Data

* GenoData and GenoSE
* ProbeSetData and ProbeSetSE
* PublishData and PublishSE
* TempData


There might also need to be materialised views for the `Tissue*` tables, i.e.:

* Tissue
* TissueProbeFreeze
* TissueProbeSetData
* TissueProbeSetFreeze
* TissueProbeSetXRef