# Handling Resource Links in ProbeSet Page

## Tags

* assigned: bonfacem
* priority: high
* type: RDF
* keywords: virtuoso

## Description

While finalizing the UI for the endpoints, specifically the probeset page, the ProbeSet RDF transform was streamlined: unnecessary metadata was removed to reduce the size of the transformed data.  Most of the metadata shown on the probeset page is used to construct resource links.  An example of a probeset page is:

=> https://genenetwork.org/show_trait?trait_id=1435395_s_at&dataset=HC_M2_0606_P Trait Data and Analysis for 1435395_s_at

The "Resource Links" section of this page includes the following URL:

=> https://genemania.org/search/mus-musculus/Atp5j2

Abbreviating this link as the prefixed name "genemania:mus-musculus/Atp5j2" is not valid Turtle, because the local part of a prefixed name cannot contain an unescaped "/".

As a workaround, the full IRI is used and typed as a resource link:

```turtle
gn:probeset1435395_s_at gnt:hasGeneManiaResource <https://genemania.org/search/mus-musculus/Atp5j2> .
<https://genemania.org/search/mus-musculus/Atp5j2> rdf:type gnc:ResourceLink .
```

The straightforward approach would be to construct these links in the front-end.  However, the links are inferred, so their connection to the rest of GN cannot be discovered without visiting the website.  It is therefore preferable to store them in RDF, even though constructing them in the front-end is easy.
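
To illustrate the workaround, here is a minimal Python sketch that emits the two triples for one probeset.  It is not the actual transform, which is written in Guile.  The function name, the base URL constant and the idea of passing the species slug as a parameter are assumptions made for this example; the predicates and class are the ones shown in the Turtle snippet above.

```python
from urllib.parse import quote

# Assumption: base URL taken from the example resource link in this issue.
GENEMANIA_BASE = "https://genemania.org/search"


def genemania_triples(probeset: str, species: str, symbol: str) -> str:
    """Return Turtle triples linking a probeset to its GeneMANIA page.

    Hypothetical helper: the full IRI is written in angle brackets
    because a prefixed name cannot carry an unescaped "/".
    """
    # Percent-encode each path segment so unusual symbols keep the IRI valid.
    iri = f"<{GENEMANIA_BASE}/{quote(species, safe='')}/{quote(symbol, safe='')}>"
    subject = f"gn:probeset{probeset}"
    return (
        f"{subject} gnt:hasGeneManiaResource {iri} .\n"
        f"{iri} rdf:type gnc:ResourceLink ."
    )


print(genemania_triples("1435395_s_at", "mus-musculus", "Atp5j2"))
```

For the example trait this prints the same two triples as the Turtle snippet above.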


### GeneList Metadata

Consider GN's approach for fetching GeneList entries for a specific trait.

=> https://github.com/genenetwork/genenetwork2/blob/371cbaeb1b05a062d7f75083aa4ff7209e4e06b3/wqflask/wqflask/show_trait/show_trait.py#L398 Fetching GeneList for a given trait

GeneSymbol and GeneId are not unique in the GeneList table.  The following queries each match duplicate entries:

```
SELECT * FROM GeneList WHERE GeneSymbol = "AB102723" AND GeneId = 3070 AND SpeciesId = 4\G

SELECT * FROM GeneList WHERE SpeciesId = 1 AND GeneSymbol = "Sp3" AND GeneId = 20687 AND Chromosome = "2"\G
```

Identifying duplicates:

```
SELECT GeneSymbol, GeneId, SpeciesId,
       COUNT(CONCAT(GeneSymbol, "_", GeneId, "_", SpeciesId)) AS `count`
FROM GeneList
GROUP BY BINARY GeneSymbol, GeneId, chromosome, txStart, txEnd
HAVING COUNT(CONCAT(GeneSymbol, "_", GeneId, "_", SpeciesId)) > 1;
```
```
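
The same grouping can be sketched in Python for rows that have already been fetched, which may help when de-duplicating during the transform.  This is only an illustration: `find_duplicates` is a hypothetical helper, the rows are assumed to be dictionaries keyed by the column names used in the query above, and the coordinates in the sample rows are made up.

```python
from collections import Counter


def find_duplicates(rows):
    """Return {key: count} for every grouping key seen more than once.

    The key mirrors the GROUP BY columns of the SQL query.  Python string
    comparison is case-sensitive, which matches GROUP BY BINARY GeneSymbol.
    """
    counts = Counter(
        (r["GeneSymbol"], r["GeneId"], r["chromosome"], r["txStart"], r["txEnd"])
        for r in rows
    )
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample rows; the txStart/txEnd values are invented.
rows = [
    {"GeneSymbol": "Sp3", "GeneId": 20687, "chromosome": "2", "txStart": 100, "txEnd": 200},
    {"GeneSymbol": "Sp3", "GeneId": 20687, "chromosome": "2", "txStart": 100, "txEnd": 200},
    {"GeneSymbol": "sp3", "GeneId": 20687, "chromosome": "2", "txStart": 100, "txEnd": 200},
]

for key, n in find_duplicates(rows).items():
    print(key, n)
```

Only the two identically cased "Sp3" rows are reported; the lower-case "sp3" row forms its own group.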

Transforming the ProbeSet metadata takes a long time.  The exact command used:

```shell
time guix shell guile-dbi guile-hashing -m manifest.scm -- \
     ./pre-inst-env ./examples/probeset.scm \
     --settings conn.scm \
     --output /export/data/genenetwork-virtuoso/probeset-metadata.ttl \
     --documentation ./docs/probeset-metadata.md
```

The first run took:

* real: 89m1.715s
* user: 175m47.684s
* sys: 6m15.076s

A second try:

* real: 87m45.751s
* user: 179m40.676s
* sys: 7m13.456s

The transformed metadata file is 6.0G, which is expected.

Optimisations, perhaps using guile-fibers, can be considered later.
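
As a rough way to read the timings above, dividing CPU time (user + sys) by wall-clock time (real) gives the average number of cores kept busy.  The sketch below does that arithmetic; the helper names are hypothetical and the only inputs are the figures reported in this issue.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time` stamp such as '89m1.715s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def busy_cores(real: str, user: str, system: str) -> float:
    """Average number of busy cores: (user + sys) / real."""
    return (to_seconds(user) + to_seconds(system)) / to_seconds(real)


# (real, user, sys) for the two runs reported above.
runs = [
    ("89m1.715s", "175m47.684s", "6m15.076s"),
    ("87m45.751s", "179m40.676s", "7m13.456s"),
]

for real, user, system in runs:
    print(f"{busy_cores(real, user, system):.2f} cores busy on average")
```

Both runs come out at a little over two cores.  That suggests the transform already has limited parallelism, but it does not show where the time is spent.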