summaryrefslogtreecommitdiff
path: root/issues/slow-correlations.gmi
blob: 039e06d74136ed7c98fc906e9d68391a4e5fa68f (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
# Slow Correlations and UI crashes

Correlation in gn2 has regressed when compared  gn1

Issue experienced by users include

* Correlation being slow 

* Browser crush and timeout for huge datasets like (exo dataset)



## Tags

* type: bug
* priority: critical
* assigned: alexm, bonfacekilz
* keywords: correlations, ui, crash
* status: closed

## Tasks

[x] separation of concerns
split between correlation code and code to database part
for easier debug

[x] optimize db queries

[x] Cache for huge datasets in text files

[x] Cache for traits metadata

[x] refactor data structures used

[x] limit number of results rendered to user

[] implement parralel computation for correlation

[] Server side pagination


### Background on the issue

As Rob has pointed out before, gn2 is much much slower than
gn1. Before, we mistakenly thought that it was because that it only
computed one of the correlations; but Zach correctly pointed out that
it, gn1, did in fact still compute all correlations in a similar
fashion to gn2.

The problems we have with gn2 are 2-fold:

- Slow computations
- UI crashing on our users for huge datasets
  
We took a step back; tried to probe deeper how we do correlations. To
do a correlation, we need to run a query on the entire dataset. After
running a query on this dataset, we additionally fetch metadata on
this dataset as seen here:

=> https://github.com/genenetwork/genenetwork2/blob/70f8ed53f85cfb42ca81ed6c3b4c9cf1060940e5/wqflask/wqflask/correlation/show_corr_results.py#L88

This takes a long time: it's our biggest bottleneck. 

For sample correlation we call this function to fetch the data:

=> https://github.com/genenetwork/genenetwork2/blob/70f8ed53f85cfb42ca81ed6c3b4c9cf1060940e5/wqflask/base/data_set.py#L731

IMO this seems to be the main issue among all queries.

For tissue correlation we call this function to fetch the data this
doesn't take much time less than 20 seconds to create instance and
fetch results.

=> https://github.com/genenetwork/genenetwork2/blob/70f8ed53f85cfb42ca81ed6c3b4c9cf1060940e5/wqflask/base/mrna_assay_tissue_data.py#L78>

For lit correlation, we fetch the correlation from the DB no
computation happens


Assume a user selects "sample correlation" in the form with limit
2000, they will fetch the results for the entire sample dataset to
compute the sample correlation; then filter the top 2000 traits. Fetch
the tissue input for them then do the correlation then fetch lit
results for them.

ATM, we know that our datasets are immutable unless @Acenteno updates
things. So why don't we just cache the results of such queries in
Redis, or in some json file. And use those instead of running the
query on every computation? A file look-up or a Redis look-up would be
much faster than what we already have.

Also, another thing that could be improved on is replacing some basic
data-structures used during the computations with more efficient
ones. As an example, it makes little sense to use a list that holds a
huge number of elements, when we could use a generator instead, or
depending on the application, a more appropriate structure. That could
shave some more seconds.

Something else worth mentioning is that the fast correlations that
used parallelisation produced bugs in gn2 could be re-written in a
more reliable way using threads-- that's what IIRC what gn1 did. So
that's something worth exploring too.

WRT the UI crashing, we rely too much on Javascript
(DataTables). AFAICT, the massive results we get from the correlations
are sent to DataTables to process. That's asking too much! We
brainstormed on some high level ideas on how to do this. One of them
being to have the results stored somewhere. And to build pagination
off those results. Now that's up to Alex to decide how to go about it.

Something cool that Alex pointed is an interesting "manual" testing
mechanism which he can feel free to try out: Separate the actual
"computation" and the "pre-fetching" in code. And see what takes
time.



# Updates

### Mon 18 Oct 2021 12:42:17 PM EAT

Atm GN2 is un-usable for Rob for basic tours and show-and-tells, and
it is a persistent problem that is getting worse the more he
complains.  Correlation is slower than it was ever before; and search
is broken. For a simple search of 10,000 phenotypes, it takes a lot of
time to compute.

According to Rob, GN1 does not rely on a cache. Instead it is
computing from a materialized view of the database that is
intentionally designed for a fast web service.




# Notes

### Tue, 12 April 2022

Most of the above issue have been addressed

correlation speed has greatly improved no complain't
on the issue as of 12/04/2022

for example the dataset below no longer crashes for this datashe computa

=> http://gn2.genenetwork.org/show_trait?trait_id=ENSG00000244734&dataset=GTEXv8_Wbl_tpm_0220

Also, wrt to parralel computation
implementation in python leads
to memory error for forked processes and
is best implemented in a different
language if the issue arises


# Notes

### Friday, 8 July 2022

Closing down the issue,to speed up things the gn2 correlation computation
was to be rewritten using rust

Added an issue tracker for this

=> https://issues.genenetwork.org/issues/implement-parallel-correlation-with-rust.gmi