# Dump GeneWiki Metadata

## Tags

* assigned: bonfacem
* type: feature
* keywords: metadata, RDF

Dump the tables:

* GeneRIF
* GeneRIF_BASIC

## Resources

=> https://www.w3.org/TR/rdf-schema/ RDF Schema 1.1
=> https://www.clearbyte.org/?p=5895&lang=en RDF/S quick walk through
=> https://www.dublincore.org/specifications/dublin-core/dcmi-terms/# DCMI Metadata Terms
=> https://sparql.uniprot.org/.well-known/sparql-examples/ UNIPROT sparql examples

## Checking for duplicates

```
ag "Observational study of gene-disease association" dump.ttl --pager='less -R'
ag "gn:symbol" | sort | less
ag "gn:anonSymbol" | sort | less
```
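
The same check can be scripted when the dump is too large to page through. The following is a minimal sketch, not part of the dump tooling: it counts lines containing a given predicate and reports those that occur more than once. The function name and the sample lines are hypothetical.

```python
from collections import Counter


def duplicated_lines(lines, needle):
    """Return {line: count} for lines containing `needle` that occur more than once.

    Leading and trailing whitespace is ignored, so the same statement at
    different indentation levels counts as a duplicate.
    """
    counts = Counter(line.strip() for line in lines if needle in line)
    return {line: n for line, n in counts.items() if n > 1}


# Hypothetical sample; with a real dump, pass the open file object instead:
#     with open("dump.ttl", encoding="utf-8") as handle:
#         print(duplicated_lines(handle, "gn:symbol"))
sample = [
    'gn:symbol "geneA" ;',
    'gn:symbol "geneB" ;',
    '    gn:symbol "geneA" ;',
]
print(duplicated_lines(sample, "gn:symbol"))
```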

## Issues

* Some entries in the GeneRIF table don't have any entries in the GeneRIF_BASIC table:

```
SELECT * FROM GeneRIF LEFT JOIN GeneRIF_BASIC USING (symbol)
LEFT JOIN GeneRIFXRef ON GeneRIFXRef.GeneRIFId = GeneRIF.Id
LEFT JOIN GeneCategory ON GeneRIFXRef.GeneCategoryId = GeneCategory.Id
WHERE GeneRIF.display > 0 AND GeneRIF.VersionId = 0
AND GeneRIF_BASIC.GeneId IS NULL\G
```
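
The query above is an anti-join: the LEFT JOIN keeps every GeneRIF row, and the IS NULL test keeps only those without a GeneRIF_BASIC match. The sketch below reproduces that pattern on a toy in-memory SQLite database so it can be tried without access to the GN database. The table contents and both function names are made up for illustration, and the GeneRIFXRef and GeneCategory joins are left out.

```python
import sqlite3


def make_toy_db():
    """Build an in-memory database with invented rows (not real GN data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE GeneRIF (Id INTEGER, symbol TEXT, display INTEGER, VersionId INTEGER);
        CREATE TABLE GeneRIF_BASIC (symbol TEXT, GeneId INTEGER);
        INSERT INTO GeneRIF VALUES (1, 'geneA', 1, 0), (2, 'geneB', 1, 0), (3, 'geneC', 0, 0);
        INSERT INTO GeneRIF_BASIC VALUES ('geneA', 1);
        """
    )
    return conn


def symbols_missing_from_basic(conn):
    """Return symbols of displayed, current GeneRIF rows with no GeneRIF_BASIC row."""
    rows = conn.execute(
        """
        SELECT GeneRIF.symbol
        FROM GeneRIF
        LEFT JOIN GeneRIF_BASIC USING (symbol)
        WHERE GeneRIF.display > 0
          AND GeneRIF.VersionId = 0
          AND GeneRIF_BASIC.GeneId IS NULL
        ORDER BY GeneRIF.symbol
        """
    ).fetchall()
    return [symbol for (symbol,) in rows]


print(symbols_missing_from_basic(make_toy_db()))
```

In the toy data, geneA has a match, geneC is filtered out by `display > 0`, and only geneB is reported.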

* Missing data: some GeneIds stored in GN are not in GeneInfo. These can be conceptualised as anonymous GeneWiki entries. One example is the symbol "Mul1" with GeneId 68350; the same symbol also appears in GN under other GeneIds. Example query:


```
SELECT * FROM GeneInfo WHERE GeneId = 68350\G
```

* NEWENTRY: Many entries have the symbol "NEWENTRY". In GN1, these are all shown on one very large page:

=> https://gn1.genenetwork.org/webqtl/main.py?FormID=geneWiki&symbol=NEWENTRY

To query these entries:

```
SELECT * FROM GeneRIF_BASIC WHERE symbol = 'NEWENTRY'\G
```

* Broken UTF-8 sequences: rapper errored out on these, and they had to be fixed manually. The replacements applied, with the broken text on the left and the fix on the right:

```
'(("\x28" . "")
  ("\x29" . "")
  ("\xa0" . " ")
  ("â\x81„" . "/")
  ("â€\x9d" . #\")
  ("’" . #\')
  ("\x02" . "")
  ("\x01" . "")
  ("β" . "β")
  ("α-Â\xad" . "α")
  ("Â\xad" . "")
  ("α" . "α")
  ("–" . "-"))
```
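
For illustration, a replacement table like the one above can be applied as an ordered list of substitutions. This is a minimal Python sketch and not the code used for the dump: it carries over only a subset of the pairs, and the names `REPLACEMENTS` and `clean` are hypothetical. Order matters, because the longer "α-Â\xad" pattern has to be tried before the shorter "Â\xad" one.

```python
# Ordered (broken, fixed) pairs; a subset of the table above.
# Non-ASCII characters are written as escapes to keep the source unambiguous.
REPLACEMENTS = [
    ("\u00a0", " "),                 # non-breaking space -> plain space
    ("\u0002", ""),                  # stray control character
    ("\u0001", ""),                  # stray control character
    ("\u03b1-\u00c2\u00ad", "\u03b1"),  # alpha, hyphen, mojibake soft hyphen -> alpha
    ("\u00c2\u00ad", ""),            # mojibake soft hyphen on its own
]


def clean(text):
    """Apply every replacement in order and return the cleaned string."""
    for broken, fixed in REPLACEMENTS:
        text = text.replace(broken, fixed)
    return text


print(clean("TNF\u00a0signalling\u0001"))
```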

* In the GeneRIF_BASIC table, there are 14,313 rows with an empty symbol:


```
SELECT COUNT(*) FROM GeneRIF_BASIC WHERE symbol = '';
```

* There are comments with identical text but different GeneIds.  Example:

=> https://gn1.genenetwork.org/webqtl/main.py?FormID=geneWiki&symbol=A2m
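
One way to list such cases is to group rows by comment text and keep the comments that map to more than one GeneId. The sketch below assumes the rows are available as (GeneId, comment) pairs; that row shape, the function name, and the sample values are assumptions for illustration only.

```python
from collections import defaultdict


def comments_with_multiple_gene_ids(rows):
    """Map each comment to its sorted GeneIds, keeping only comments
    that appear under more than one distinct GeneId."""
    gene_ids = defaultdict(set)
    for gene_id, comment in rows:
        gene_ids[comment].add(gene_id)
    return {
        comment: sorted(ids)
        for comment, ids in gene_ids.items()
        if len(ids) > 1
    }


# Invented sample rows, not real GeneRIF data.
sample_rows = [
    (1, "same text"),
    (2, "same text"),
    (3, "other text"),
]
print(comments_with_multiple_gene_ids(sample_rows))
```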

* closed