# Phenotype Correlation Error

## Tags

* type: bug
* priority: high
* status: closed
* assigned: fredm, zachs
* keywords: correlation

## Fixed: Correlation against phenotypes fails (for at least some datasets)

Example: Run a correlation against BXD Published Phenotypes (at the top of the drop-down menu) from here -
=> https://genenetwork.org/show_trait?trait_id=1442370_at&dataset=HC_M2_0606_P

The bug appears to occur in the Rust correlation tool, so I'm not sure how to debug it myself. The last few lines of the stack trace are as follows:

```
  File "/export2/local/home/zas1024/gn2-zach/gene/wqflask/wqflask/correlation/rust_correlation.py", line 262, in __compute_sample_corr__
    return run_correlation(
  File "/usr/local/guix-profiles/gn-latest-20220820/lib/python3.9/site-packages/gn3/computations/rust_correlation.py", line 58, in run_correlation
    subprocess.run(command_list, check=True)
  File "/gnu/store/qar3sks5fwzm91bl3d3ngyrvxs7ipj5z-python-3.9.9/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/local/guix-profiles/gn-latest-20220820/bin/correlation_rust', '/home/zas1024/gn2-zach/tmp/gn2/correlation/IoaglmTgDJ.json', '/home/zas1024/gn2-zach/tmp/gn2']' died with <Signals.SIGSEGV: 11>.
```
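
The trace above only says the command "died with" a signal. As a minimal sketch (not the actual gn3 code), a wrapper around `subprocess.run` could turn the return code into a readable message before it reaches the caller. The names `describe_exit` and `run_external` are hypothetical; the only behaviour relied on is that on POSIX a negative return code means the child was killed by that signal.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative return code -N means "killed by signal N".
        try:
            return f"killed by signal {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by unknown signal {-returncode}"
    return f"exited with status {returncode}"


def run_external(command_list):
    """Hypothetical wrapper: run a command, raise a readable error on failure."""
    try:
        subprocess.run(command_list, check=True)
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(
            f"{command_list[0]} {describe_exit(exc.returncode)}") from exc
```

With something like this, a segfault in the external tool would surface as "killed by signal SIGSEGV" alongside the command name, which makes it easier to tell a crash in the tool apart from a bad exit status.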

## Fixed: Processing for Output too Early

After fixing the interaction with the Rust correlation code, I now run into the following error when running a correlation against the "Hippocampus Consortium M430v2 (Jun06) PDNN" dataset with the same trait from the URI above:
```
Traceback (most recent call last):
  File "/home/frederick/opt/gn_profiles/gn2_latest/lib/python3.9/site-packages/flask/app.py", line 1523, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/frederick/opt/gn_profiles/gn2_latest/lib/python3.9/site-packages/flask/app.py", line 1509, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/home/frederick/genenetwork/genenetwork2/wqflask/wqflask/views.py", line 820, in corr_compute_page
    correlation_results = set_template_vars(request.form, correlation_results)
  File "/home/frederick/genenetwork/genenetwork2/wqflask/wqflask/correlation/show_corr_results.py", line 54, in set_template_vars
    table_json = correlation_json_for_table(correlation_data,
  File "/home/frederick/genenetwork/genenetwork2/wqflask/wqflask/correlation/show_corr_results.py", line 104, in correlation_json_for_table
    target_trait_ob = create_trait(dataset=target_dataset_ob,
  File "/home/frederick/genenetwork/genenetwork2/wqflask/base/trait.py", line 44, in create_trait
    the_trait = retrieve_trait_info(
  File "/home/frederick/genenetwork/genenetwork2/wqflask/base/trait.py", line 599, in retrieve_trait_info
    raise KeyError(repr(trait.name)
KeyError: "'1' information is not found in the database."
```

The error above was caused by processing the data for output too early. This has been fixed.

## Tissue Correlation: Probeset Trait Against Publish/Genotype Dataset

Running "Tissue" correlations on
=> https://genenetwork.org/show_trait?trait_id=1442370_at&dataset=HC_M2_0606_P
against the "BXD Published Phenotypes" dataset fails with the error below. It also fails when run against the "BXD Genotypes" dataset.

```
Traceback (most recent call last):
  File "/usr/local/guix-profiles/gn-latest-20220820/lib/python3.9/site-packages/flask/app.py", line 1523, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/guix-profiles/gn-latest-20220820/lib/python3.9/site-packages/flask/app.py", line 1509, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/home/gn2/production/gene/wqflask/wqflask/views.py", line 820, in corr_compute_page
    correlation_results = set_template_vars(request.form, correlation_results)
  File "/home/gn2/production/gene/wqflask/wqflask/correlation/show_corr_results.py", line 54, in set_template_vars
    table_json = correlation_json_for_table(correlation_data,
  File "/home/gn2/production/gene/wqflask/wqflask/correlation/show_corr_results.py", line 104, in correlation_json_for_table
    target_trait_ob = create_trait(dataset=target_dataset_ob,
  File "/home/gn2/production/gene/wqflask/base/trait.py", line 44, in create_trait
    the_trait = retrieve_trait_info(
  File "/home/gn2/production/gene/wqflask/base/trait.py", line 599, in retrieve_trait_info
    raise KeyError(repr(trait.name)
KeyError: "'1422223_at' information is not found in the database."
```

So far, I have narrowed the issue down to the possibility that the `target_dataset` value is not used
=> https://github.com/genenetwork/genenetwork2/blob/53aa084fd2c9c930ac791ee43affffb3f788547c/wqflask/wqflask/correlation/rust_correlation.py#L271-L289 in this function.

## Literature Correlation: Probeset Trait Against Publish/Genotype Dataset

Run a literature correlation for
=> http://localhost:5033/show_trait?trait_id=1442370_at&dataset=HC_M2_0606_P this trait
against the "BXD Published Phenotypes" dataset and observe the exception below. It also fails when run against the "BXD Genotypes" dataset.

```
ERROR:wqflask:http://localhost:5033/corr_compute ( 4:26AM UTC Oct 03, 2022)
Traceback (most recent call last):
  File "/home/frederick/opt/gn_profiles/gn2_latest/lib/python3.9/site-packages/flask/app.py", line 1523, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/frederick/opt/gn_profiles/gn2_latest/lib/python3.9/site-packages/flask/app.py", line 1509, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/home/frederick/genenetwork/genenetwork2/wqflask/wqflask/views.py", line 818, in corr_compute_page
    correlation_results = compute_correlation(request.form, compute_all=True)
  File "/home/frederick/genenetwork/genenetwork2/wqflask/wqflask/correlation/correlation_gn3_api.py", line 199, in compute_correlation
    return compute_correlation_rust(
  File "/home/frederick/genenetwork/genenetwork2/wqflask/wqflask/correlation/rust_correlation.py", line 326, in compute_correlation_rust
    results = corr_type_fns[corr_type](
  File "/home/frederick/genenetwork/genenetwork2/wqflask/wqflask/correlation/rust_correlation.py", line 299, in __compute_lit_corr__
    (this_trait_geneid, geneid_dict, species) = do_lit_correlation(
  File "/home/frederick/genenetwork/genenetwork2/wqflask/wqflask/correlation/correlation_gn3_api.py", line 237, in do_lit_correlation
    geneid_dict = this_dataset.retrieve_genes("GeneId")
AttributeError: 'PhenotypeDataSet' object has no attribute 'retrieve_genes'
```

The literature correlation computation calls the `retrieve_genes` method, which is only present in the `base.data_set.mrnaassaydataset.MrnaAssayDataSet` class, the class that handles traits of type "ProbeSet".

The code seems to imply that we should not run literature correlations against any dataset that is not of type "ProbeSet".

## Some Reflections

The `target_dataset` is not used in the
=> https://github.com/genenetwork/genenetwork2/blob/c38bee43c1256c3515bbd1d805745d8dfb8ce390/wqflask/wqflask/correlation/rust_correlation.py#L271-L289 tissue correlations, which seems like an error to me (fredm).

In my (fredm) work on partial correlations, before doing the computations,
=> https://github.com/genenetwork/genenetwork3/blob/ff34aee0f39c2e91db243461d7d67405e7aea0e3/gn3/computations/partial_correlations.py#L704-L750 there were error checks
that were run.

Should these be present for the full correlations too?

The failures above with the Publish/Genotype datasets imply one of two things:
* The code is not general enough, or
* We need to handle the exceptions, and present the selection errors as appropriate.

Better yet, we should probably not present invalid choices to the user, i.e. do not offer the user a dataset that would lead to errors if a correlation of a particular type is run against it with the given trait.
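
One way to avoid offering invalid choices is sketched below: a lookup table of which target dataset types each correlation type supports, used to filter what is shown to the user. This is only an illustration. The table name, the function name, the type strings "Publish" and "Geno", and the rule that tissue and literature correlations need a "ProbeSet" target are all assumptions drawn from the failures described in this issue, not from the existing code.

```python
# Hypothetical table: target dataset types each correlation type supports.
# Assumption: tissue and literature correlations only work against
# "ProbeSet" datasets, while sample correlations work against all three.
SUPPORTED_TARGET_TYPES = {
    "sample": {"ProbeSet", "Publish", "Geno"},
    "tissue": {"ProbeSet"},
    "lit": {"ProbeSet"},
}


def allowed_correlation_types(target_dataset_type):
    """Return the correlation types that may be offered for a target type."""
    return sorted(
        corr_type
        for corr_type, dataset_types in SUPPORTED_TARGET_TYPES.items()
        if target_dataset_type in dataset_types)
```

Under these assumptions, `allowed_correlation_types("Publish")` would offer only the sample correlation, so the failing tissue and literature combinations would never be selectable.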

## Trial Against GN1

@zsloan @alexm: Running the failing tissue and literature correlations above with the same trait against the "BXD Published Phenotypes" and the "BXD Genotypes" on
=> http://gn1.genenetwork.org/
I got the error
```
Wrong correlation type

    Sorry! Error occurred while processing your request.

    The nature of the error generated is as follows:

    Correlation Type Error :

        It is not possible to compute the Tissue Correlation (Pearson's r) between your trait and data in this BXDGeno database. Please try again after selecting another type of correlation.
```
for the tissue correlations and
```
Wrong correlation type

    Sorry! Error occurred while processing your request.

    The nature of the error generated is as follows:

    Correlation Type Error :

        It is not possible to compute the SGO Literature Correlation between your trait and data in this BXDPublish database. Please try again after selecting another type of correlation.
```
for the literature correlations.

My initial hunch was correct. We should not be running the tissue and literature correlations in the way we were in the cases above.

We now need to check for these combinations and display an error to the user, as is done in GN1.
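
A rough sketch of such a check follows, with the message modelled on the GN1 output quoted above. `CorrelationTypeError`, `check_correlation_type`, the `"tissue"`/`"lit"` keys, and the assumption that only "ProbeSet" targets are valid for these two correlation types are hypothetical and would need to be matched to the real code.

```python
class CorrelationTypeError(Exception):
    """Hypothetical error for an unsupported correlation/dataset pairing."""


# Assumption: these correlation types require a "ProbeSet" target dataset.
PROBESET_ONLY = {
    "tissue": "Tissue Correlation",
    "lit": "Literature Correlation",
}


def check_correlation_type(corr_type, target_type, target_name):
    """Raise a user-facing error before any computation is attempted."""
    label = PROBESET_ONLY.get(corr_type)
    if label is not None and target_type != "ProbeSet":
        raise CorrelationTypeError(
            f"It is not possible to compute the {label} between your trait "
            f"and data in the {target_name} database. Please try again "
            "after selecting another type of correlation.")
```

Calling this at the start of the correlation request would let the view catch `CorrelationTypeError` and render the message, instead of failing later with a `KeyError` or `AttributeError`.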



The error reported above
```
raise KeyError(
KeyError: "'1' information is not found in the database for dataset 'HC_M2_0606_P' with id '112'."

```
causes the correlations below to fail. For maintainability, and to fix the current bugs, the code that preprocesses the following data needs to be modified:

* Tissue correlation data
* Top n sample correlation data
* Top n tissue correlation data


## Tags
* assigned: alexm, fredm, zsloan
* type: bug
* keywords: correlations
* status: closed, completed
* priority: high

## Notes

* 2022-09-29: Successfully reproduced on production
* 2022-09-29: Fix file format issues
* 2022-09-30: Fix issues in rust correlation code
* 2022-10-03: Fix: avoid processing for output early