summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
-rw-r--r--issues/slow-correlations.gmi93
1 files changed, 93 insertions, 0 deletions
diff --git a/issues/slow-correlations.gmi b/issues/slow-correlations.gmi
new file mode 100644
index 0000000..30cd980
--- /dev/null
+++ b/issues/slow-correlations.gmi
@@ -0,0 +1,93 @@
+# Slow Correlations and UI crashes
+
+Correlations for huge data set(like the exo dataset) is very slow; and the UI crashes.
+
+## Tasks
+
+* assigned: bonfacekilz, alex
+* keywords: critical bug, in progress
+
+* [ ] Caching slow queries
+* [ ] Server side pagination
+
+## Background
+
+First, what we've done:
+
+- Optimised a bunch of SQL.
+- https://mariadb.com/kb/en/query-cache/
+- General code clean-up in some places.
+- Futile experiments with code parallelisation.
+- Add a "test compute" button. More on this later.
+
+As Rob has pointed out before, gn2 is much much slower than
+gn1. Before, we mistakenly thought that it was because that it only
+computed one of the correlations; but Zach correctly pointed out that
+it, gn1, did in fact still compute all correlations in a similar
+fashion to gn2.
+
+The problems we have with gn2 are 2-fold:
+
+- Slow computations
+- UI crashing on our users for huge datasets
+
+We took a step back; tried to probe deeper how we do correlations. To
+do a correlation, we need to run a query on the entire dataset. After
+running a query on this dataset, we additionally fetch metadata on
+this dataset as seen here:
+
+=> https://github.com/genenetwork/genenetwork2/blob/70f8ed53f85cfb42ca81ed6c3b4c9cf1060940e5/wqflask/wqflask/correlation/show_corr_results.py#L88
+
+This takes a long time: it's our biggest bottleneck.
+
+For sample correlation we call this function to fetch the data:
+
+=> https://github.com/genenetwork/genenetwork2/blob/70f8ed53f85cfb42ca81ed6c3b4c9cf1060940e5/wqflask/base/data_set.py#L731
+
+IMO this seems to be the main issue among all queries.
+
+For tissue correlation we call this function to fetch the data this
+doesn't take much time less than 20 seconds to create instance and
+fetch results.
+
+=> https://github.com/genenetwork/genenetwork2/blob/70f8ed53f85cfb42ca81ed6c3b4c9cf1060940e5/wqflask/base/mrna_assay_tissue_data.py#L78>
+
+For lit correlation, we fetch the correlation from the DB no
+computation happens
+
+
+Assume a user selects "sample correlation" in the form with limit
+2000, they will fetch the results for the entire sample dataset to
+compute the sample correlation; then filter the top 2000 traits. Fetch
+the tissue input for them then do the correlation then fetch lit
+results for them.
+
+ATM, we know that our datasets are immutable unless @Acenteno updates
+things. So why don't we just cache the results of such queries in
+Redis, or in some json file. And use those instead of running the
+query on every computation? A file look-up or a Redis look-up would be
+much faster than what we already have.
+
+Also, another thing that could be improved on is replacing some basic
+data-structures used during the computations with more efficient
+ones. As an example, it makes little sense to use a list that holds a
+huge number of elements, when we could use a generator instead, or
+depending on the application, a more appropriate structure. That could
+shave some more seconds.
+
+Something else worth mentioning is that the fast correlations that
+used parallelisation produced bugs in gn2 could be re-written in a
+more reliable way using threads-- that's what IIRC what gn1 did. So
+that's something worth exploring too.
+
+WRT the UI crashing, we rely too much on Javascript
+(DataTables). AFAICT, the massive results we get from the correlations
+are sent to DataTables to process. That's asking too much! We
+brainstormed on some high level ideas on how to do this. One of them
+being to have the results stored somewhere. And to build pagination
+off those results. Now that's up to Alex to decide how to go about it.
+
+Something cool that Alex pointed is an interesting "manual" testing
+mechanism which he can feel free to try out: Separate the actual
+"computation" and the "pre-fetching" in code. And see what takes
+time.