summaryrefslogtreecommitdiff
path: root/topics/gn1-migration-to-gn2/clustering.gmi
blob: d7b411b1ec3a8f8a21d35b36cfd906e2d50e79b2 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
# Migrate GN1 Clustering

GeneNetwork1 has clustering output that we want to migrate to GN2. For example, go to

=> http://gn1-lily.genenetwork.org/ GN1 on Lily

Select Type -> CRTD mRNA data and do a search for 'synap*'. Click on the top 10 results checkboxes and Add to the basket (add is an icon). Now click on 'Select all' and the 'Heat map' icons. Computation may take a while.

Next click on 'Cluster traits' and 'Redraw' map. This is the one we want in GN3/GN2!

=> ./heatmap.png

## Members

* fredm
* pjotr

## Kanban

* fredm

## Tags

* assigned: fredm

## Useful links

=> https://github.com/genenetwork/genenetwork1/tree/master/web/webqtl/heatmap Implementation

## Implementation

As a first step the computations should move to GN3 with proper regression/unit testing scripts. Next wire them up as REST API endpoints and add it to the GN2 web interface:

- [ ] Move computation to GN3
- [ ] Create REST API endpoints
- [ ] Add regression/unit tests in GN3
- [ ] Add to GN2 web interface


@pjotr I couldn't quite figure out what "CRTD mRNA", so I selected "Cartilage mRNA" , to try and produce a heat map, and I did get some results.

The selected data was:

* Species: Mouse (mm10)
* Group: BXD Family
* Type: Cartilage mRNA

I got
=> http://gn1-lily.genenetwork.org/image/Heatmap_8CJbN7Iy.png the initial heatmap
and
=> http://gn1-lily.genenetwork.org/image/Heatmap_syjdiq5z.png the 'Cluster Traits' heatmap

is that anything like what I should expect?



I (@fredmanglis) have noticed that the computation of the heatmap data is intertwined with the drawing of the heatmap.

I think this will be among the first things to separate, as part of moving the computation over to GN3. My initial impression of this, is that the GN3 should handle the computation and GN2 the display.

Please correct me if I am wrong.


## 2021-08-11

I have had a look at the Python implementation of Plotly, and tried out a few of the examples to get a feel for the library and its capabilities. I think it is possible to use the library to draw the heatmaps, though I'll need more time to fully grasp how it works, and how I could use it to actually provide some of the features in the clustering heatmaps that do not seem to be out-of-the-box with plotly, e.g. the lines with clustering distance.

It does also seem like the library might provide more complex heatmaps, such as

=> https://plotly.com/python/imshow/#display-an-xarray-image-with-pximshow heatmap of xarray data

which sort of approximates the heatmap example linked in

=> https://github.com/genenetwork/genenetwork3/pull/31#issuecomment-891539359 zsloan's comment

I wrote a simple script to test out the library as shown:

```
import random
import plotly.express as px

def generate_random_data(width=10, height=30):
    return [[random.uniform(0,2) for i in range(0, width)]
            for j in range(0, height)]

def main():
    fig = px.imshow(generate_random_data())
    fig.show()

if __name__ == "__main__":
    main()
```

Thankfully, the following python packages seem to already be packaged with guix-bioinformatics:

* python-plotly
* python-pandas
* python-xarray
* python-scipy

The only (seemingly) missing package is `python-pooch`

## python-pooch

python-pooch is in guix as of commit 211c933 in master from this March:

=> https://guix.gnu.org/packages/python-pooch-1.3.0/ python-pooch package listing
=> https://issues.guix.gnu.org/47022 python-pooch patch thread


## 2021-08-12

When the "Single Spectrum" colour scheme is selected, the heatmap's "colour-scale" in genenetwork1 is a single spectrum that flows from Blue, through green, to red. This one is easy to reproduce somewhat by setting the colour scale with something like:

```
fig.update_coloraxes(colorscale=[
    [0.0, '#0000FF'],
    [0.5, '#00FF00'],
    [1.0, '#FF0000']])
```

When the "Blue + Red" colour scheme is selected, the heatmap's "colour-scale" in genenetwork1 is split into 2 separate colour scales depending on whether the data value corresponds to one of:

* C57BL/6J +
* DBA/2J +

When the "Grey + Blue + Red" colour scheme is selected, the heatmap's "colour-scale" in genenetwork1 goes from a dark-grey at 0 to a light-grey at 0.5, before splitting into 2 separate colour scales for values greater than 0.5, depending on whether the data value corresponds to one of:

* C57BL/6J +
* DBA/2J +

I (@fredmanglis) have not yet figured out how to represent these more complex splits on the Plotly heatmaps, but I have a suspicion this might be achieved if there is a way to label the data as it goes into the `px.imshow(...)` call.

Maybe look into using the

=> https://plotly.com/python/creating-and-updating-figures/#conditionally-updating-traces conditional trace update

feature to set up the colours as appropriate, when different colour-schemes are selected. Failing that, have a look at the

=> https://plotly.com/python/colorscales/ colour scales documentation
=> https://plotly.com/python/plotly-fundamentals/ plotly fundamentals page
=> https://plotly.com/python/categorical-axes/ categorical axes

## 2021-08-17

Tried providing the data-points as a dictionary instead of number, to test out the categorical layout of the plots: something along the lines of:

```
data = [[{'value': 0.07039128483638035, 'category': 'C57BL/6J +'}, {'value': 1.8493427990767048, 'category': 'C57BL/6J +'}, ..., {'value': 1.3089193078506003, 'category': 'C57BL/6J +'}]]
```

but that did work as expected.

Paused on heatmap generation to first test out the database access code.

Added tests and fixed issues with older db-access code to get a sample of the data for drawing heatmaps.

## 2021-08-20

The data that seems to be used for drawing the actual heatmap is the following data from strains:

* value
* variance
* N (I'm not sure what N is)

this is retrieved with the

=> https://github.com/genenetwork/genenetwork3/blob/main/gn3/db/traits.py#L627-L668 `retrieve_trait_data` function

which is then processed with the

=> https://github.com/genenetwork/genenetwork3/blob/main/gn3/computations/heatmap.py#L12-L77 `export_trait_data` function

into a list of lists the example of which is as shown:

```
[(7.51879, 7.77141, 8.39265, 8.17443, 8.30401, 7.80944), (6.1427, 6.50588, 7.73705, 6.68328, 7.49293, 7.27398), (8.4211, 8.30581, 9.24076, 8.51173, 9.18455, 8.36077), (10.0904, 10.6509, 9.36716, 9.91202, 8.57444, 10.5731), (10.188, 9.76652, 9.54813, 9.05074, 9.52319, 9.10505), (6.74676, 7.01029, 7.54169, 6.48574, 7.01427, 7.26815), (6.39359, 6.85321, 5.78337, 7.11141, 6.22101, 6.16544), (6.84118, 7.08432, 7.59844, 7.08229, 7.26774, 7.24991), (9.45215, 10.6943, 8.64719, 10.1592, 7.75044, 8.78615), (7.04737, 6.87185, 7.58586, 6.92456, 6.84243, 7.36913)]
```

clustering the example data above with

=> https://github.com/genenetwork/genenetwork3/blob/main/gn3/computations/heatmap.py#L104-L126 the `cluster_traits` function

gives

```
((0.0, 0.20337048635536847, 0.16381088984330505, 1.7388553629398245, 1.5025235756329178, 0.6952839500255574, 1.271661230252733, 0.2100487290977544, 1.4699690641062024, 0.7934461515867415), (0.20337048635536847, 0.0, 0.2198321044997198, 1.5753041735592204, 1.4815755944537086, 0.26087293140686374, 1.6939790104301427, 0.06024619831474998, 1.7430082449189215, 0.4497104244247795), (0.16381088984330505, 0.2198321044997198, 0.0, 1.9073926868549234, 1.0396738891139845, 0.5278328671176757, 1.6275069061182947, 0.2636503792482082, 1.739617877037615, 0.7127042590637039), (1.7388553629398245, 1.5753041735592204, 1.9073926868549234, 0.0, 0.9936846292920328, 1.1169999189889366, 0.6007483980555253, 1.430209221053372, 0.25879514152086425, 0.9313185954797953), (1.5025235756329178, 1.4815755944537086, 1.0396738891139845, 0.9936846292920328, 0.0, 1.027827186339337, 1.1441743109173244, 1.4122477962364253, 0.8968250491499363, 1.1683723389247052), (0.6952839500255574, 0.26087293140686374, 0.5278328671176757, 1.1169999189889366, 1.027827186339337, 0.0, 1.8420471110023269, 0.19179284676938602, 1.4875072385631605, 0.23451785425383564), (1.271661230252733, 1.6939790104301427, 1.6275069061182947, 0.6007483980555253, 1.1441743109173244, 1.8420471110023269, 0.0, 1.6540234785929928, 0.2140799896286565, 1.7413442197913358), (0.2100487290977544, 0.06024619831474998, 0.2636503792482082, 1.430209221053372, 1.4122477962364253, 0.19179284676938602, 1.6540234785929928, 0.0, 1.5225640692832796, 0.33370067057028485), (1.4699690641062024, 1.7430082449189215, 1.739617877037615, 0.25879514152086425, 0.8968250491499363, 1.4875072385631605, 0.2140799896286565, 1.5225640692832796, 0.0, 1.3256191648260216), (0.7934461515867415, 0.4497104244247795, 0.7127042590637039, 0.9313185954797953, 1.1683723389247052, 0.23451785425383564, 1.7413442197913358, 0.33370067057028485, 1.3256191648260216, 0.0))
```

and that is then run through the

=> https://github.com/genenetwork/genenetwork3/blob/main/gn3/computations/slink.py#L140-L198 the `slink` function

to give

```
[(((0, 2, 0.16381088984330505), ((1, 7, 0.06024619831474998), 5, 0.19179284676938602), 0.20337048635536847), 9, 0.23451785425383564), ((3, (6, 8, 0.2140799896286565), 0.25879514152086425), 4, 0.8968250491499363), 0.9313185954797953]
```

this, "slinked" data, I think, is what is used to draw the "distance" lines in

=> ./heatmap.png the 'Cluster Traits' heatmap diagram


For the actual heatmap representation, it looks to me like the `neworder` variable initialised to an empty list in

=> https://github.com/genenetwork/genenetwork1/blob/master/web/webqtl/heatmap/Heatmap.py#L120 GN1's `buildCanvas` function

is what is populated and used to draw the "cells" of the heatmap diagram: see

=> https://github.com/genenetwork/genenetwork1/blob/master/web/webqtl/heatmap/Heatmap.py#L206-L316 this chunk.

This has not yet been migrated over.

Within the loop that uses `neworder`, there is a call to `genotype.regression(...)` function.

From what I (fredmanglis) can tell, it seems this `regression` function might be the one defined in the

=> https://github.com/genenetwork/QTLReaper/blob/master/Src/dataset.c#L416 QTLReaper library.

I am not entirely sure where we stand on the use of QTLReaper. I think there was a move away from using this library some time in the past.

There **might** be need to migrate the

=> https://github.com/genenetwork/genenetwork1/blob/master/web/webqtl/heatmap/Heatmap.py#L419-L438 `getNearestMarker` function out

So, it does seem like I had previously missed out on a lot of extra computation that still needs migration.

### 2021-08-20 14:15

The `neworder` variable setup has been partially migrated. The use of the `xoffset` and `d_1` variables has not been cleared up at this point - they could be removed in the future.

Also migrated retrieval of strains with non-NoneType values

Awaiting response on use of QTLReaper.

## 2021-08-26

Reading through the Genenetwork2 code to figure out how the ~rust-qtlreaper~ module is used.

I cloned

=> https://github.com/chfi/rust-qtlreaper ~rust-qtlreaper~

and searched through the code for python bindings (py_module_initializer!) but could not find it in either the rust-qtlreaper or genenetwork2 repositories

This causes me to suspect that when ~import reaper~ appears in the Genenetwork2 code, it actually uses the older QTLReaper implemented in C. This suspicion is further strengthened by the fact that when

=> https://github.com/genenetwork/genenetwork3/commit/7aa5f5422908b4dbfc80f3f73b008507878a34aa I added ~rust-qtlreaper~

to the dependencies for the Genenetwork3 environment,  I could not do ~import reaper~ successfully:

```
$ env GUIX_PACKAGE_PATH=~/genenetwork/guix-bioinformatics/ ~/.config/guix/current/bin/guix environment -C --network --load=guix.scm
frederick@galadriel-ubuntu ~/genenetwork/genenetwork3 [env]$ python3
Python 3.8.2 (default, Jan  1 1970, 00:00:01)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import reaper
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'reaper'
```

There's a chance we might have to write a separate rust library whose entire purpose is to add the python bindings, or maybe add the bindings to the rust-qtlreaper library itself.

## 2021-08-30

### 08:38

I built a new module to act as an interface between /rust-qtlreaper/ and the Genenetwork3 python code.

I also looked at how the genotype file is identified in GN2 and compared with the data in GN3. It seems like the corresponding value in GN3 to identify the genotype file would be the ~riset~ field in the trait.

For the examples used to test the data out, all the values end up being "BXD". There is a chance that the ~riset~ field's value could be different for each trait, depending on what the user runs the code against.

I think this is part of what Rob was talking about

=> https://github.com/genenetwork/genenetwork3/pull/31#issuecomment-890907828 here.

The issues to consider are:

* If all the traits are from a single group, then ALL the heatmap functions can be run
* If all the traits are from a single species, but different groups, then only the 'mapping' function of the heatmap should be enabled
* If the traits are from two or more species, then NO heatmap functions can be run.

Still need to identify how the groups are identified/formed. From a cursory inspection of

=> https://github.com/genenetwork/genenetwork2/blob/testing/wqflask/base/data_set.py#L319 GN2

it seems like the groups correspond to the ~riset~ field in GN3, since they share the name

I might need to figure out how the traits correspond to a species.

For the time being, however, I make the assumption that the ~riset~ field for all the traits is the same value, and use that to get the genotype file for use in computation of the QTL values with /rust-qtlreaper/.

### 11:37

The traits files should only contain the strains that can be found in the corresponding genotype file used for computation of the QTLs. If the trait file contains strains not present in the genotype file, then /rust-qtlreaper/ panics and fails.

I should also look into reworking ~gn3.computations.export_trait_data~ function, so that each value is linked to the strain that it corresponds to, so that we do not rely on numerical order. This should help reduce chances of bugs where the strains and the values are not in the same order.

I also need to figure out how the strains in the genotype are passed around in GN1. I have mostly ignored that since in GN1, those details were passed around in a class object.


## 2021-08-31

### 06:59

Succeeded in

* figuring out how to link the genotype files to traits
* loading samples/strains from the genotype files
* testing the /rust-qtlreaper/ interface code

Now to integrate the results into the main code, and start looking into parsing the results of running qtlreaper into data we can use to generate heatmaps.

## 2021-09-06

The results are parsed and the following information has been gleaned so far, considering:

=> ./sample_clustering_heatmap.png the 'Cluster Traits' heatmap

The values on the vertical axis [0 .. 19, "X"] come from the "Chr" field of the parsed QTLReaper results

There are 5 additional values that exist in the QTLReaper results: these are, "cM", "Mb", "LRS", "Additive" and "pValue" - of these, it seems like the "LRS" value is what is plotted on the heatmap.

The values for "Additive", "Locus", and "Chr" seem to modify the colours, and where the values are plotted on the chart. They might not find much use in GN3, from the look of things.

## 2021-09-08

After further analysis of the data, and comparing it with the sample heatmaps generated from GN1, it seems like the GN1 heatmap diagram is actually a number of heatmaps, all sharing the same axis (X) for the traits, but having different Chromosome axes (Y).

This means that I need to organise the data provided in a way that we can generate these set of heatmaps, and also figure out how to merge them in such a way that they share the same traits axis.

## 2021-09-09

### 06:10

Finalise on generation of the individual chromosome heatmaps.

Start figuring out how to combine the individual heatmaps into one large one

### 11:05

Combining the plots with ~plotly.subplots.make_subplots~ function seems to get us some way towards getting the final heatmap.

The clustering lines, are still not drawn however. There were explorations of dendrograms done to see whether that can help with that.