summaryrefslogtreecommitdiff
path: root/issues/transform-genelist-to-rdf.gmi
blob: bc0f1b86ab0b9a5c0d021e33ea6fc59ff654bfed (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
# Add GeneList to Metadata

## Tags

* assigned: bonfacem
* priority: high
* type: RDF
* keywords: virtuoso

## Description:

=> https://www.gene-list.com/about About GeneList


The GeneList table in GN fetches metadata about given Genes under Resource Links.

Example:

=> https://genenetwork.org/show_trait?trait_id=1460303_at&dataset=HC_M2_0606_P Trait Data and Analysis for 1460303_at

When transforming data, it's unclear how resources (GeneMANIA, STRING, PANTHER, etc.)  become links---they are manually constructed in GN's source code.  This transformation is crucial when converting data to RDF.

## GeneList Metadata

Consider GN's approach for fetching GeneList entries for a specific trait.

=> https://github.com/genenetwork/genenetwork2/blob/371cbaeb1b05a062d7f75083aa4ff7209e4e06b3/wqflask/wqflask/show_trait/show_trait.py#L398 Fetching GeneList for a given trait

The GeneList table lacks unique GeneSymbols and GeneIds, as illustrated in the following examples:

```
SELECT * FROM GeneList WHERE SpeciesId = 1 AND GeneSymbol = "Sp3" AND GeneId = 20687 AND Chromosome = "2"\G
```

Duplicate entry examples:

```
SELECT * FROM GeneList WHERE GeneSymbol = "AB102723" AND 
GeneId=3070 AND SpeciesId = 4 \G

SELECT * FROM GeneList WHERE SpeciesId = 1 AND GeneSymbol = "Sp3" AND GeneId = 20687 AND Chromosome = "2"\G
```

Identifying duplicates:

```
SELECT GeneSymbol, GeneId, SpeciesId, COUNT(CONCAT(GeneSymbol, "_", GeneId, "_", SpeciesId)) AS `count` FROM GeneList GROUP BY BINARY GeneSymbol, GeneId, chromosome, txStart, txEnd HAVING COUNT(CONCAT(GeneSymbol, "_", GeneId, "_", SpeciesId)) > 1;
```


## Unique Gene Identifiers

In the GeneList table, some genes share GeneIds and GeneSymbols. GeneIds are unique within a species, while GeneSymbols are unique across species. In cases where GeneSymbols and GeneIDs match, different AlignIDs exist. To create unique identifiers for genes in the GeneList table, we use a query like:

```sql
SELECT CONCAT_WS("_", GeneSymbol, GeneID, AlignID) FROM GeneList;
```

For the GeneList_rn33 table, due to ambiguous cases, we rely on the table's id as a unique identifier. Here's an example of duplicate entries for a gene, differing only in txStart/txEnd/cdsStart/cdsEnd/exonStarts/exonEnd values:

```sql
SELECT * FROM GeneList_rn33 WHERE geneSymbol="Cbara1" AND NM_ID="NM_199412"\G
```


* closed