# [gn-transform-databases/ADR-002] Remodel GeneRIF_BASIC (NCBI RIFs) Metadata To Be More Compact
* author: bonfacem
* status: proposal
* reviewed-by: pjotr, jnduli
## Context
Currently, we represent NCBI RIFs as blank nodes that form the object of a given symbol:
```
gn:symbolsspA rdfs:comment [
rdf:type gnc:NCBIWikiEntry ;
rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
gnt:belongsToSpecies gn:Mus_musculus ;
skos:notation taxon:511145 ;
gnt:hasGeneId generif:944744 ;
dct:hasVersion '1'^^xsd:int ;
dct:references pubmed:97295 ;
...
dct:references pubmed:15361618 ;
dct:created "2007-11-06T00:38:00"^^xsd:datetime ;
] .
gn:symbolaraC rdfs:comment [
rdf:type gnc:NCBIWikiEntry ;
rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
gnt:belongsToSpecies gn:Mus_musculus ;
skos:notation taxon:511145 ;
gnt:hasGeneId generif:944780 ;
dct:hasVersion '1'^^xsd:int ;
dct:references pubmed:320034 ;
...
dct:references pubmed:16369539 ;
dct:created "2007-11-06T00:39:00"^^xsd:datetime ;
] .
```
Moreover, we store every version of a comment:
```
mysql> SELECT * FROM GeneRIF_BASIC WHERE SpeciesId=1 AND TaxID=7955 AND GeneId=323473 AND PubMed_ID = 15680355\G
*************************** 1. row ***************************
SpeciesId: 1
TaxID: 7955
GeneId: 323473
symbol: prdm1a
PubMed_ID: 15680355
createtime: 2010-01-21 00:00:00
comment: One of two mutations in which defects are observed in both cell populations: it leads to a complete absence of RB neurons and a reduction in neural crest cells
VersionId: 1
*************************** 2. row ***************************
SpeciesId: 1
TaxID: 7955
GeneId: 323473
symbol: prdm1a
PubMed_ID: 15680355
createtime: 2010-01-21 00:00:00
comment: prdm1 functions to promote the cell fate specification of both neural crest cells and sensory neurons
VersionId: 2
```
## Decision
First, we should store only the latest version of a given RIF entry and ignore all other versions. RIF entries in the GeneRIF_BASIC table are uniquely identified by the columns SpeciesId, GeneId, PubMed_ID, createtime, and VersionId. Because only the latest version is kept, we drop the version identifier during the RDF transform.
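The "latest version only" rule can be sketched as a grouped query. This is a minimal illustration, not the actual transform: it assumes that "latest" means the highest VersionId among rows sharing SpeciesId, GeneId, PubMed_ID, and createtime, it uses an in-memory SQLite table as a stand-in for the MySQL GeneRIF_BASIC table, and the function name `latest_rifs` is hypothetical.

```python
import sqlite3

# Stand-in for the MySQL table, loaded with the two rows shown above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE GeneRIF_BASIC (
           SpeciesId INTEGER, TaxID INTEGER, GeneId INTEGER, symbol TEXT,
           PubMed_ID INTEGER, createtime TEXT, comment TEXT, VersionId INTEGER)"""
)
conn.executemany(
    "INSERT INTO GeneRIF_BASIC VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 7955, 323473, "prdm1a", 15680355, "2010-01-21 00:00:00",
         "One of two mutations in which defects are observed in both cell "
         "populations: it leads to a complete absence of RB neurons and a "
         "reduction in neural crest cells", 1),
        (1, 7955, 323473, "prdm1a", 15680355, "2010-01-21 00:00:00",
         "prdm1 functions to promote the cell fate specification of both "
         "neural crest cells and sensory neurons", 2),
    ],
)


def latest_rifs(connection):
    """Return one row per RIF entry, keeping only the highest VersionId.

    Assumption: versions of the same entry share SpeciesId, GeneId,
    PubMed_ID and createtime. VersionId is dropped from the result.
    """
    query = """
        SELECT r.SpeciesId, r.TaxID, r.GeneId, r.symbol,
               r.PubMed_ID, r.createtime, r.comment
        FROM GeneRIF_BASIC AS r
        JOIN (
            SELECT SpeciesId, GeneId, PubMed_ID, createtime,
                   MAX(VersionId) AS VersionId
            FROM GeneRIF_BASIC
            GROUP BY SpeciesId, GeneId, PubMed_ID, createtime
        ) AS latest
          ON  r.SpeciesId  = latest.SpeciesId
          AND r.GeneId     = latest.GeneId
          AND r.PubMed_ID  = latest.PubMed_ID
          AND r.createtime = latest.createtime
          AND r.VersionId  = latest.VersionId
    """
    return connection.execute(query).fetchall()


rows = latest_rifs(conn)
```

For the prdm1a example, only the row that was stored with VersionId 2 survives, and the version column no longer appears in the result.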
We assign each comment a unique identifier and use it as the QName of the triple's subject:
> gn:rif-<speciesId>-<GeneId>
Finally, instead of:
```
<symbol> predicate <comment metadata>
```
We use:
```
<comment-uid> predicate object ;
... (more metadata) .
```
An example set of triples would take the form:
```
gn:rif-1-511145 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en .
gn:rif-1-511145 rdf:type gnc:NCBIWikiEntry .
gn:rif-1-511145 gnt:belongsToSpecies gn:Mus_musculus .
gn:rif-1-511145 skos:notation taxon:511145 .
gn:rif-1-511145 rdfs:seeAlso [
gnt:hasGeneId generif:944744 ;
gnt:symbol "sspA" ;
dct:references ( pubmed:97295 ... pubmed:15361618 )
] .
gn:rif-1-511145 rdfs:seeAlso [
gnt:hasGeneId generif:944780 ;
gnt:symbol "araC" ;
dct:references ( pubmed:320034 ... pubmed:16369539 )
] .
```
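The grouping behind this layout can be sketched as follows. It is an illustrative outline under stated assumptions, not the project's transform: the input row shape `(comment, gene_id, symbol, pubmed_id)` and the function names `group_rifs` and `render` are hypothetical, the subject QName is passed in rather than derived, only the PubMed IDs visible in the example are used, and escaping of quotes inside comments is ignored.

```python
from collections import defaultdict


def group_rifs(rows):
    """Group (comment, gene_id, symbol, pubmed_id) rows by comment, then gene."""
    grouped = defaultdict(lambda: defaultdict(list))
    for comment, gene_id, symbol, pubmed_id in rows:
        grouped[comment][(gene_id, symbol)].append(pubmed_id)
    return grouped


def render(subject, comment, genes):
    """Emit the label once, then one rdfs:seeAlso blank node per gene."""
    out = [f"{subject} rdfs:label '''{comment}'''@en ."]
    for (gene_id, symbol), pubmed_ids in genes.items():
        refs = " ".join(f"pubmed:{p}" for p in pubmed_ids)
        out.append(
            f"{subject} rdfs:seeAlso [\n"
            f"    gnt:hasGeneId generif:{gene_id} ;\n"
            f'    gnt:symbol "{symbol}" ;\n'
            f"    dct:references ( {refs} )\n"
            f"] ."
        )
    return "\n".join(out)


comment = "N-terminus verified by Edman degradation on mature peptide"
sample_rows = [
    (comment, 944744, "sspA", 97295),
    (comment, 944744, "sspA", 15361618),
    (comment, 944780, "araC", 320034),
    (comment, 944780, "araC", 16369539),
]
grouped = group_rifs(sample_rows)
turtle = render("gn:rif-1-511145", comment, grouped[comment])
```

The comment text is written once, while each gene contributes a single blank node that carries its GeneId, symbol, and reference list.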
To store GeneIds, symbols, and references efficiently, we use blank nodes. This reduces redundancy and simplifies the triples compared to encoding the GeneId in the subject, which repeats the comment and its metadata for every gene:
```
gn:rif-1-511145-944744 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en .
gn:rif-1-511145-944744 rdf:type gnc:NCBIWikiEntry .
gn:rif-1-511145-944744 gnt:belongsToSpecies gn:Mus_musculus .
gn:rif-1-511145-944744 skos:notation taxon:511145 .
gn:rif-1-511145-944744 gnt:hasGeneId generif:944744 .
gn:rif-1-511145-944744 gnt:symbol "sspA" .
gn:rif-1-511145-944744 dct:references ( pubmed:97295 ... pubmed:15361618 ) .
gn:rif-1-511145-944780 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en .
gn:rif-1-511145-944780 rdf:type gnc:NCBIWikiEntry .
gn:rif-1-511145-944780 gnt:belongsToSpecies gn:Mus_musculus .
gn:rif-1-511145-944780 skos:notation taxon:511145 .
gn:rif-1-511145-944780 gnt:hasGeneId generif:944780 .
gn:rif-1-511145-944780 gnt:symbol "araC" .
gn:rif-1-511145-944780 dct:references ( pubmed:320034 ... pubmed:16369539 ) .
```
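A rough count makes the difference between the two layouts concrete. The sketch below is a back-of-the-envelope estimate, not a measurement: it counts predicate-object statements as written in the examples (four shared statements, three per-gene statements), treats each `dct:references` collection as a single object even though an RDF collection expands into further triples, and uses hypothetical function names.

```python
def shared_subject_statements(n_genes, shared=4, per_gene=3):
    """Blank-node layout: shared metadata once, plus one rdfs:seeAlso
    link and `per_gene` statements for every gene."""
    return shared + n_genes * (1 + per_gene)


def per_gene_subject_statements(n_genes, shared=4, per_gene=3):
    """GeneId-in-subject layout: everything repeated for every gene."""
    return n_genes * (shared + per_gene)


two_genes = (shared_subject_statements(2), per_gene_subject_statements(2))
ten_genes = (shared_subject_statements(10), per_gene_subject_statements(10))
```

Under these assumptions the two-gene example needs 12 statements with blank nodes versus 14 with per-gene subjects, and the gap widens as more genes share a comment. For a comment attached to a single gene, the blank-node layout costs one extra statement for the `rdfs:seeAlso` link.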
## Consequences
* The transform requires a more complex SQL query.
* RIF entries are de-duplicated during the transform.
* Because the output is terser, the I/O-heavy operation has less work to do.
* SPARQL on tux01 and tux02 must be updated in lockstep with GN3/GN2 and the Xapian index.