# Migrate GN1 Clustering GeneNetwork1 has clustering output that we want to migrate to GN2. For example, go to => http://gn1-lily.genenetwork.org/ GN1 on Lily Select Type -> CRTD mRNA data and do a search for 'synap*'. Click on the top 10 results checkboxes and Add to the basket (add is an icon). Now click on 'Select all' and the 'Heat map' icons. Computation may take a while. Next click on 'Cluster traits' and 'Redraw' map. This is the one we want in GN3/GN2! => ./heatmap.png ## Members * fredm * pjotr ## Useful links => https://github.com/genenetwork/genenetwork1/tree/master/web/webqtl/heatmap Implementation ## Implementation As a first step the computations should move to GN3 with proper regression/unit testing scripts. Next wire them up as REST API endpoints and add it to the GN2 web interface: - [ ] Move computation to GN3 - [ ] Create REST API endpoints - [ ] Add regression/unit tests in GN3 - [ ] Add to GN2 web interface @pjotr I couldn't quite figure out what "CRTD mRNA", so I selected "Cartilage mRNA" , to try and produce a heat map, and I did get some results. The selected data was: * Species: Mouse (mm10) * Group: BXD Family * Type: Cartilage mRNA I got => http://gn1-lily.genenetwork.org/image/Heatmap_8CJbN7Iy.png the initial heatmap and => http://gn1-lily.genenetwork.org/image/Heatmap_syjdiq5z.png the 'Cluster Traits' heatmap is that anything like what I should expect? I (@fredmanglis) have noticed that the computation of the heatmap data is intertwined with the drawing of the heatmap. I think this will be among the first things to separate, as part of moving the computation over to GN3. My initial impression of this, is that the GN3 should handle the computation and GN2 the display. Please correct me if I am wrong. ## 2021-08-11 I have had a look at the Python implementation of Plotly, and tried out a few of the examples to get a feel for the library and its capabilities. I think it is possible to use the library to draw the heatmaps, though I'll need more time to fully grasp how it works, and how I could use it to actually provide some of the features in the clustering heatmaps that do not seem to be out-of-the-box with plotly, e.g. the lines with clustering distance. It does also seem like the library might provide more complex heatmaps, such as => https://plotly.com/python/imshow/#display-an-xarray-image-with-pximshow heatmap of xarray data which sort of approximates the heatmap example linked in => https://github.com/genenetwork/genenetwork3/pull/31#issuecomment-891539359 zsloan's comment I wrote a simple script to test out the library as shown: ``` import random import plotly.express as px def generate_random_data(width=10, height=30): return [[random.uniform(0,2) for i in range(0, width)] for j in range(0, height)] def main(): fig = px.imshow(generate_random_data()) fig.show() if __name__ == "__main__": main() ``` Thankfully, the following python packages seem to already be packaged with guix-bioinformatics: * python-plotly * python-pandas * python-xarray * python-scipy The only (seemingly) missing package is `python-pooch` ## python-pooch python-pooch is in guix as of commit 211c933 in master from this March: => https://guix.gnu.org/packages/python-pooch-1.3.0/ python-pooch package listing => https://issues.guix.gnu.org/47022 python-pooch patch thread ## 2021-08-12 When the "Single Spectrum" colour scheme is selected, the heatmap's "colour-scale" in genenetwork1 is a single spectrum that flows from Blue, through green, to red. This one is easy to reproduce somewhat by setting the colour scale with something like: ``` fig.update_coloraxes(colorscale=[ [0.0, '#0000FF'], [0.5, '#00FF00'], [1.0, '#FF0000']]) ``` When the "Blue + Red" colour scheme is selected, the heatmap's "colour-scale" in genenetwork1 is split into 2 separate colour scales depending on whether the data value corresponds to one of: * C57BL/6J + * DBA/2J + When the "Grey + Blue + Red" colour scheme is selected, the heatmap's "colour-scale" in genenetwork1 goes from a dark-grey at 0 to a light-grey at 0.5, before splitting into 2 separate colour scales for values greater than 0.5, depending on whether the data value corresponds to one of: * C57BL/6J + * DBA/2J + I (@fredmanglis) have not yet figured out how to represent these more complex splits on the Plotly heatmaps, but I have a suspicion this might be achieved if there is a way to label the data as it goes into the `px.imshow(...)` call. Maybe look into using the => https://plotly.com/python/creating-and-updating-figures/#conditionally-updating-traces conditional trace update feature to set up the colours as appropriate, when different colour-schemes are selected. Failing that, have a look at the => https://plotly.com/python/colorscales/ colour scales documentation => https://plotly.com/python/plotly-fundamentals/ plotly fundamentals page => https://plotly.com/python/categorical-axes/ categorical axes ## 2021-08-17 Tried providing the data-points as a dictionary instead of number, to test out the categorical layout of the plots: something along the lines of: ``` data = [[{'value': 0.07039128483638035, 'category': 'C57BL/6J +'}, {'value': 1.8493427990767048, 'category': 'C57BL/6J +'}, ..., {'value': 1.3089193078506003, 'category': 'C57BL/6J +'}]] ``` but that did work as expected. Paused on heatmap generation to first test out the database access code. Added tests and fixed issues with older db-access code to get a sample of the data for drawing heatmaps. ## 2021-08-20 The data that seems to be used for drawing the actual heatmap is the following data from strains: * value * variance * N (I'm not sure what N is) this is retrieved with the => https://github.com/genenetwork/genenetwork3/blob/main/gn3/db/traits.py#L627-L668 `retrieve_trait_data` function which is then processed with the => https://github.com/genenetwork/genenetwork3/blob/main/gn3/computations/heatmap.py#L12-L77 `export_trait_data` function into a list of lists the example of which is as shown: ``` [(7.51879, 7.77141, 8.39265, 8.17443, 8.30401, 7.80944), (6.1427, 6.50588, 7.73705, 6.68328, 7.49293, 7.27398), (8.4211, 8.30581, 9.24076, 8.51173, 9.18455, 8.36077), (10.0904, 10.6509, 9.36716, 9.91202, 8.57444, 10.5731), (10.188, 9.76652, 9.54813, 9.05074, 9.52319, 9.10505), (6.74676, 7.01029, 7.54169, 6.48574, 7.01427, 7.26815), (6.39359, 6.85321, 5.78337, 7.11141, 6.22101, 6.16544), (6.84118, 7.08432, 7.59844, 7.08229, 7.26774, 7.24991), (9.45215, 10.6943, 8.64719, 10.1592, 7.75044, 8.78615), (7.04737, 6.87185, 7.58586, 6.92456, 6.84243, 7.36913)] ``` clustering the example data above with => https://github.com/genenetwork/genenetwork3/blob/main/gn3/computations/heatmap.py#L104-L126 the `cluster_traits` function gives ``` ((0.0, 0.20337048635536847, 0.16381088984330505, 1.7388553629398245, 1.5025235756329178, 0.6952839500255574, 1.271661230252733, 0.2100487290977544, 1.4699690641062024, 0.7934461515867415), (0.20337048635536847, 0.0, 0.2198321044997198, 1.5753041735592204, 1.4815755944537086, 0.26087293140686374, 1.6939790104301427, 0.06024619831474998, 1.7430082449189215, 0.4497104244247795), (0.16381088984330505, 0.2198321044997198, 0.0, 1.9073926868549234, 1.0396738891139845, 0.5278328671176757, 1.6275069061182947, 0.2636503792482082, 1.739617877037615, 0.7127042590637039), (1.7388553629398245, 1.5753041735592204, 1.9073926868549234, 0.0, 0.9936846292920328, 1.1169999189889366, 0.6007483980555253, 1.430209221053372, 0.25879514152086425, 0.9313185954797953), (1.5025235756329178, 1.4815755944537086, 1.0396738891139845, 0.9936846292920328, 0.0, 1.027827186339337, 1.1441743109173244, 1.4122477962364253, 0.8968250491499363, 1.1683723389247052), (0.6952839500255574, 0.26087293140686374, 0.5278328671176757, 1.1169999189889366, 1.027827186339337, 0.0, 1.8420471110023269, 0.19179284676938602, 1.4875072385631605, 0.23451785425383564), (1.271661230252733, 1.6939790104301427, 1.6275069061182947, 0.6007483980555253, 1.1441743109173244, 1.8420471110023269, 0.0, 1.6540234785929928, 0.2140799896286565, 1.7413442197913358), (0.2100487290977544, 0.06024619831474998, 0.2636503792482082, 1.430209221053372, 1.4122477962364253, 0.19179284676938602, 1.6540234785929928, 0.0, 1.5225640692832796, 0.33370067057028485), (1.4699690641062024, 1.7430082449189215, 1.739617877037615, 0.25879514152086425, 0.8968250491499363, 1.4875072385631605, 0.2140799896286565, 1.5225640692832796, 0.0, 1.3256191648260216), (0.7934461515867415, 0.4497104244247795, 0.7127042590637039, 0.9313185954797953, 1.1683723389247052, 0.23451785425383564, 1.7413442197913358, 0.33370067057028485, 1.3256191648260216, 0.0)) ``` and that is then run through the => https://github.com/genenetwork/genenetwork3/blob/main/gn3/computations/slink.py#L140-L198 the `slink` function to give ``` [(((0, 2, 0.16381088984330505), ((1, 7, 0.06024619831474998), 5, 0.19179284676938602), 0.20337048635536847), 9, 0.23451785425383564), ((3, (6, 8, 0.2140799896286565), 0.25879514152086425), 4, 0.8968250491499363), 0.9313185954797953] ``` this, "slinked" data, I think, is what is used to draw the "distance" lines in => ./heatmap.png the 'Cluster Traits' heatmap diagram For the actual heatmap representation, it looks to me like the `neworder` variable initialised to an empty list in => https://github.com/genenetwork/genenetwork1/blob/master/web/webqtl/heatmap/Heatmap.py#L120 GN1's `buildCanvas` function is what is populated and used to draw the "cells" of the heatmap diagram: see => https://github.com/genenetwork/genenetwork1/blob/master/web/webqtl/heatmap/Heatmap.py#L206-L316 this chunk. This has not yet been migrated over. Within the loop that uses `neworder`, there is a call to `genotype.regression(...)` function. From what I (fredmanglis) can tell, it seems this `regression` function might be the one defined in the => https://github.com/genenetwork/QTLReaper/blob/master/Src/dataset.c#L416 QTLReaper library. I am not entirely sure where we stand on the use of QTLReaper. I think there was a move away from using this library some time in the past. There **might** be need to migrate the => https://github.com/genenetwork/genenetwork1/blob/master/web/webqtl/heatmap/Heatmap.py#L419-L438 `getNearestMarker` function out So, it does seem like I had previously missed out on a lot of extra computation that still needs migration. ### 2021-08-20 14:15 The `neworder` variable setup has been partially migrated. The use of the `xoffset` and `d_1` variables has not been cleared up at this point - they could be removed in the future. Also migrated retrieval of strains with non-NoneType values Awaiting response on use of QTLReaper.