# [gn-transform-databases/ADR-002] Remodel GeneRIF_BASIC (NCBI RIFs) Metadata To Be More Compact

* author: bonfacem
* status: proposal
* reviewed-by: pjotr, jnduli

## Context

Currently, we represent NCBI RIFs as blank nodes that form the object of a given symbol:

```
gn:symbolsspA rdfs:comment [
	rdf:type gnc:NCBIWikiEntry ;
	rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
	gnt:belongsToSpecies gn:Mus_musculus ;
	skos:notation taxon:511145 ;
	gnt:hasGeneId generif:944744 ;
	dct:hasVersion '1'^^xsd:int ;
	dct:references pubmed:97295 ;
	...
	dct:references pubmed:15361618 ;
	dct:created "2007-11-06T00:38:00"^^xsd:datetime ;
] .
gn:symbolaraC rdfs:comment [
	rdf:type gnc:NCBIWikiEntry ;
	rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
	gnt:belongsToSpecies gn:Mus_musculus ;
	skos:notation taxon:511145 ;
	gnt:hasGeneId generif:944780 ;
	dct:hasVersion '1'^^xsd:int ;
	dct:references pubmed:320034 ;
	...
	dct:references pubmed:16369539 ;
	dct:created "2007-11-06T00:39:00"^^xsd:datetime ;
] .
```

We also store every version of a comment:

```
mysql> SELECT * FROM GeneRIF_BASIC WHERE SpeciesId=1 AND TaxID=7955 AND GeneId=323473 AND PubMed_ID = 15680355\G
*************************** 1. row ***************************
 SpeciesId: 1
     TaxID: 7955
    GeneId: 323473
    symbol: prdm1a
 PubMed_ID: 15680355
createtime: 2010-01-21 00:00:00
   comment: One of two mutations in which defects are observed in both cell populations: it leads to a complete absence of RB neurons and a reduction in neural crest cells
 VersionId: 1
*************************** 2. row ***************************
 SpeciesId: 1
     TaxID: 7955
    GeneId: 323473
    symbol: prdm1a
 PubMed_ID: 15680355
createtime: 2010-01-21 00:00:00
   comment: prdm1 functions to promote the cell fate specification of both neural crest cells and sensory neurons
 VersionId: 2
```

## Decision

First, we store only the latest version of a given RIF entry and ignore all earlier versions.  RIF entries in the GeneRIF_BASIC table are uniquely identified by the columns SpeciesId, GeneId, PubMed_ID, createtime, and VersionId.  Since only the latest version is stored, we drop the version identifier during the RDF transform.
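
A minimal sketch of the latest-version selection, using an in-memory SQLite table as a stand-in for the MySQL GeneRIF_BASIC table. It assumes that "latest" means the highest VersionId among rows sharing SpeciesId, GeneId, PubMed_ID and createtime. The names LATEST_RIF_QUERY and latest_rifs are hypothetical, and this is not the query used by the actual transform:

```python
import sqlite3

# Assumption: the latest version is the row with the highest VersionId
# among rows that share SpeciesId, GeneId, PubMed_ID and createtime.
LATEST_RIF_QUERY = """
SELECT r.SpeciesId, r.TaxID, r.GeneId, r.symbol, r.PubMed_ID,
       r.createtime, r.comment
FROM GeneRIF_BASIC AS r
JOIN (
    SELECT SpeciesId, GeneId, PubMed_ID, createtime,
           MAX(VersionId) AS VersionId
    FROM GeneRIF_BASIC
    GROUP BY SpeciesId, GeneId, PubMed_ID, createtime
) AS latest
  ON  r.SpeciesId  = latest.SpeciesId
  AND r.GeneId     = latest.GeneId
  AND r.PubMed_ID  = latest.PubMed_ID
  AND r.createtime = latest.createtime
  AND r.VersionId  = latest.VersionId
"""


def latest_rifs(conn):
    """Return one row per RIF entry, keeping only its highest VersionId."""
    return conn.execute(LATEST_RIF_QUERY).fetchall()


# Stand-in table holding the two versions shown in the MySQL output above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE GeneRIF_BASIC ("
    "SpeciesId INTEGER, TaxID INTEGER, GeneId INTEGER, symbol TEXT, "
    "PubMed_ID INTEGER, createtime TEXT, comment TEXT, VersionId INTEGER)"
)
conn.executemany(
    "INSERT INTO GeneRIF_BASIC VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 7955, 323473, "prdm1a", 15680355, "2010-01-21 00:00:00",
         "One of two mutations in which defects are observed in both cell "
         "populations: it leads to a complete absence of RB neurons and a "
         "reduction in neural crest cells", 1),
        (1, 7955, 323473, "prdm1a", 15680355, "2010-01-21 00:00:00",
         "prdm1 functions to promote the cell fate specification of both "
         "neural crest cells and sensory neurons", 2),
    ],
)

rows = latest_rifs(conn)
for row in rows:
    print(row)
```

The version identifier is left out of the SELECT list on purpose, since the transform drops it.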

We give each comment a unique identifier and use it as the QName of the triple's subject:

> gn:rif-<speciesId>-<GeneId>

Finally, instead of:

```
<symbol> predicate <comment metadata>
```

We use:

```
<comment-uid> predicate object ;
              ... (more metadata) .
```

Example triples would take the form:

```
gn:rif-1-511145 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en .
gn:rif-1-511145 rdf:type gnc:NCBIWikiEntry .
gn:rif-1-511145 gnt:belongsToSpecies gn:Mus_musculus .
gn:rif-1-511145 skos:notation taxon:511145 .
gn:rif-1-511145 rdfs:seeAlso [
    gnt:hasGeneId generif:944744 ;
    gnt:symbol "spA" ;
    dct:references ( pubmed:97295 ... pubmed:15361618 )
] .
gn:rif-1-511145 rdfs:seeAlso [
    gnt:hasGeneId generif:944780 ;
    gnt:symbol "araC" ;
    dct:references ( pubmed:320034 ... pubmed:16369539 )
] .
```
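
The grouping behind this form can be sketched as follows. This is an illustration only, not the transform itself: the row keys (uid, comment, gene_id, symbol, pubmed_id) and the function names are hypothetical, the uid is assumed to be the precomputed comment identifier, rdfs:label is assumed to be the intended label predicate, and the species and taxon triples are omitted for brevity:

```python
from collections import defaultdict


def turtle_string(text):
    """Quote text as a short Turtle string, escaping \\, \" and newlines."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n"))
    return f'"{escaped}"'


def compact_rif_turtle(rows):
    """Emit one subject per comment uid, with one blank node per gene.

    rows: iterable of dicts with the hypothetical keys
    uid, comment, gene_id, symbol and pubmed_id.
    """
    comments = {}               # uid -> comment text
    genes = defaultdict(dict)   # uid -> {(gene_id, symbol): [pubmed ids]}
    for row in rows:
        comments[row["uid"]] = row["comment"]
        key = (row["gene_id"], row["symbol"])
        refs = genes[row["uid"]].setdefault(key, [])
        if row["pubmed_id"] not in refs:   # de-duplicate references
            refs.append(row["pubmed_id"])

    lines = []
    for uid, comment in comments.items():
        lines.append(f"gn:{uid} rdfs:label {turtle_string(comment)}@en ;")
        lines.append("    rdf:type gnc:NCBIWikiEntry ;")
        blank_nodes = []
        for (gene_id, symbol), refs in genes[uid].items():
            ref_list = " ".join(f"pubmed:{ref}" for ref in refs)
            blank_nodes.append(
                "[\n"
                f"        gnt:hasGeneId generif:{gene_id} ;\n"
                f"        gnt:symbol {turtle_string(symbol)} ;\n"
                f"        dct:references ( {ref_list} )\n"
                "    ]"
            )
        lines.append("    rdfs:seeAlso " + " ,\n    ".join(blank_nodes) + " .")
    return "\n".join(lines)


COMMENT = "N-terminus verified by Edman degradation on mature peptide"
sample_rows = [
    {"uid": "rif-1-511145", "comment": COMMENT, "gene_id": 944744,
     "symbol": "spA", "pubmed_id": 97295},
    {"uid": "rif-1-511145", "comment": COMMENT, "gene_id": 944744,
     "symbol": "spA", "pubmed_id": 15361618},
    {"uid": "rif-1-511145", "comment": COMMENT, "gene_id": 944780,
     "symbol": "araC", "pubmed_id": 320034},
    {"uid": "rif-1-511145", "comment": COMMENT, "gene_id": 944780,
     "symbol": "araC", "pubmed_id": 16369539},
]

output = compact_rif_turtle(sample_rows)
print(output)
```

Only the first and last PubMed identifiers shown in the example are used as sample data; the elided ones are not reconstructed.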

To efficiently store GeneIds, symbols and references, we use blank nodes.  This reduces redundancy and simplifies the triples compared to including these details within the subject:

```
gn:rif-1-511145-944744 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en .
gn:rif-1-511145-944744 rdf:type gnc:NCBIWikiEntry .
gn:rif-1-511145-944744 gnt:belongsToSpecies gn:Mus_musculus .
gn:rif-1-511145-944744 skos:notation taxon:511145 .
gn:rif-1-511145-944744 gnt:hasGeneId generif:944744 .
gn:rif-1-511145-944744 gnt:symbol "spA" .
gn:rif-1-511145-944744 dct:references ( pubmed:97295 ... pubmed:15361618 ) .

gn:rif-1-511145-944780 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en .
gn:rif-1-511145-944780 rdf:type gnc:NCBIWikiEntry .
gn:rif-1-511145-944780 gnt:belongsToSpecies gn:Mus_musculus .
gn:rif-1-511145-944780 skos:notation taxon:511145 .
gn:rif-1-511145-944780 gnt:hasGeneId generif:944780 .
gn:rif-1-511145-944780 gnt:symbol "araC" .
gn:rif-1-511145-944780 dct:references ( pubmed:320034 ... pubmed:16369539 ) .
```
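
As a rough check of the redundancy claim, the two layouts can be compared by counting statements. The sketch below assumes four shared triples per comment (label, type, species, taxon) and three per gene (gene id, symbol, references), and counts each dct:references collection as a single statement in both layouts. The function names are hypothetical:

```python
def compact_triples(n_genes, shared=4, per_gene=3):
    """Blank-node layout: shared metadata once, then one rdfs:seeAlso
    link plus per_gene statements for every gene."""
    return shared + n_genes * (1 + per_gene)


def flat_triples(n_genes, shared=4, per_gene=3):
    """Per-gene subject layout: shared metadata is repeated for every gene."""
    return n_genes * (shared + per_gene)


for n in (1, 2, 10):
    print(n, compact_triples(n), flat_triples(n))
```

Under these assumptions the blank-node layout is smaller whenever a comment is shared by two or more genes, and the gap grows by three statements for each additional gene.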

## Consequences

* The transform requires a more complex SQL query.
* RIF entries are de-duplicated during the transform.
* Because the output is terser, the I/O-heavy operation has less work to do.
* SPARQL in tux02 and tux01 must be updated in lockstep with GN3/GN2 and the XAPIAN index.