Diffstat (limited to 'topics')
-rw-r--r-- | topics/ADR/gn-transform-databases/002-remodel-ncbi-transform-to-be-more-compact.gmi | 127 |
1 files changed, 127 insertions, 0 deletions
diff --git a/topics/ADR/gn-transform-databases/002-remodel-ncbi-transform-to-be-more-compact.gmi b/topics/ADR/gn-transform-databases/002-remodel-ncbi-transform-to-be-more-compact.gmi
new file mode 100644
index 0000000..ac06fc1
--- /dev/null
+++ b/topics/ADR/gn-transform-databases/002-remodel-ncbi-transform-to-be-more-compact.gmi
@@ -0,0 +1,127 @@
+# [gn-transform-databases/ADR-002] Remodel GeneRIF_BASIC (NCBI RIFs) Metadata To Be More Compact
+
+* author: bonfacem
+* status: proposal
+* reviewed-by: pjotr, jnduli
+
+## Context
+
+Currently, we represent NCBI RIFs as blank nodes that form the object of a given symbol:
+
+```
+gn:symbolspA rdfs:comment [
+  rdf:type gnc:NCBIWikiEntry ;
+  rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
+  gnt:belongsToSpecies gn:Mus_musculus ;
+  skos:notation taxon:511145 ;
+  gnt:hasGeneId generif:944744 ;
+  dct:hasVersion '1'^^xsd:int ;
+  dct:references pubmed:97295 ;
+  ...
+  dct:references pubmed:15361618 ;
+  dct:created "2007-11-06T00:38:00"^^xsd:dateTime ;
+] .
+gn:symbolaraC rdfs:comment [
+  rdf:type gnc:NCBIWikiEntry ;
+  rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
+  gnt:belongsToSpecies gn:Mus_musculus ;
+  skos:notation taxon:511145 ;
+  gnt:hasGeneId generif:944780 ;
+  dct:hasVersion '1'^^xsd:int ;
+  dct:references pubmed:320034 ;
+  ...
+  dct:references pubmed:16369539 ;
+  dct:created "2007-11-06T00:39:00"^^xsd:dateTime ;
+] .
+```
+
+Moreover, we also store all the different versions of a comment:
+
+```
+mysql> SELECT * FROM GeneRIF_BASIC WHERE SpeciesId=1 AND TaxID=7955 AND GeneId=323473 AND PubMed_ID = 15680355\G
+*************************** 1. row ***************************
+ SpeciesId: 1
+     TaxID: 7955
+    GeneId: 323473
+    symbol: prdm1a
+ PubMed_ID: 15680355
+createtime: 2010-01-21 00:00:00
+   comment: One of two mutations in which defects are observed in both cell populations: it leads to a complete absence of RB neurons and a reduction in neural crest cells
+ VersionId: 1
+*************************** 2. row ***************************
+ SpeciesId: 1
+     TaxID: 7955
+    GeneId: 323473
+    symbol: prdm1a
+ PubMed_ID: 15680355
+createtime: 2010-01-21 00:00:00
+   comment: prdm1 functions to promote the cell fate specification of both neural crest cells and sensory neurons
+ VersionId: 2
+```
+
+## Decision
+
+First, we should only store the latest version of a given RIF entry and ignore all other versions. RIF entries in the GeneRIF_BASIC table are uniquely identified by the columns SpeciesId, GeneId, PubMed_ID, createtime, and VersionId. Since we store only the latest version of a given RIF entry, we drop the version identifier during the RDF transform.
+
+We use a unique identifier for a given comment, and use that as the QName of the triple's subject:
+
+> gn:rif-<speciesId>-<GeneId>
+
+Finally, instead of:
+
+```
+<symbol> predicate <comment metadata>
+```
+
+We use:
+
+```
+<comment-uid> predicate object ;
+  ... (more metadata) .
+```
+
+An example set of triples would take the form:
+
+```
+gn:rif-1-511145 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en .
+gn:rif-1-511145 rdf:type gnc:NCBIWikiEntry .
+gn:rif-1-511145 gnt:belongsToSpecies gn:Mus_musculus .
+gn:rif-1-511145 skos:notation taxon:511145 .
+gn:rif-1-511145 rdfs:seeAlso [
+  gnt:hasGeneId generif:944744 ;
+  gnt:symbol "spA" ;
+  dct:references ( pubmed:97295 ... pubmed:15361618 )
+] .
+gn:rif-1-511145 rdfs:seeAlso [
+  gnt:hasGeneId generif:944780 ;
+  gnt:symbol "araC" ;
+  dct:references ( pubmed:320034 ... pubmed:16369539 )
+] .
+```
+
+To efficiently store GeneIds, symbols and references, we use blank nodes. This reduces redundancy and simplifies the triples compared to including these details within the subject:
+
+```
+gn:rif-1-511145-944744 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en .
+gn:rif-1-511145-944744 rdf:type gnc:NCBIWikiEntry .
+gn:rif-1-511145-944744 gnt:belongsToSpecies gn:Mus_musculus .
+gn:rif-1-511145-944744 skos:notation taxon:511145 .
+gn:rif-1-511145-944744 gnt:hasGeneId generif:944744 .
+gn:rif-1-511145-944744 gnt:symbol "spA" .
+gn:rif-1-511145-944744 dct:references ( pubmed:97295 ... pubmed:15361618 ) .
+
+gn:rif-1-511145-944780 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en .
+gn:rif-1-511145-944780 rdf:type gnc:NCBIWikiEntry .
+gn:rif-1-511145-944780 gnt:belongsToSpecies gn:Mus_musculus .
+gn:rif-1-511145-944780 skos:notation taxon:511145 .
+gn:rif-1-511145-944780 gnt:hasGeneId generif:944780 .
+gn:rif-1-511145-944780 gnt:symbol "araC" .
+gn:rif-1-511145-944780 dct:references ( pubmed:320034 ... pubmed:16369539 ) .
+```
+
+## Consequences
+
+* The transform requires a more complex SQL query.
+* RIF entries are de-duplicated during the transform.
+* The terser output means less work during I/O-heavy operations.
+* SPARQL on tux02 and tux01 must be updated in lockstep with GN3/GN2 and the XAPIAN index.
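Review note, not part of the commit: the two steps in the Decision section (keep only the latest version of each RIF entry, then group entries that share a comment) can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the actual transform. The function names `latest_versions` and `group_by_comment` are hypothetical, rows are assumed to arrive as dicts keyed by the GeneRIF_BASIC column names, and grouping on (SpeciesId, TaxID, comment) is one plausible reading of "a unique identifier for a given comment". The sample rows are the two shown in the MySQL output above.

```python
from collections import defaultdict


def latest_versions(rows):
    """Keep the row with the highest VersionId for each RIF entry.

    Assumption: an entry is identified by the ADR's key columns
    minus VersionId.
    """
    latest = {}
    for row in rows:
        key = (row["SpeciesId"], row["GeneId"], row["PubMed_ID"], row["createtime"])
        if key not in latest or row["VersionId"] > latest[key]["VersionId"]:
            latest[key] = row
    return list(latest.values())


def group_by_comment(rows):
    """Group rows that share a comment; collect PubMed IDs per gene.

    Assumption: (SpeciesId, TaxID, comment) identifies a comment.
    Each (GeneId, symbol) pair corresponds to one blank node.
    """
    grouped = defaultdict(lambda: defaultdict(set))
    for row in rows:
        comment_key = (row["SpeciesId"], row["TaxID"], row["comment"])
        gene_key = (row["GeneId"], row["symbol"])
        grouped[comment_key][gene_key].add(row["PubMed_ID"])
    return {
        comment_key: {gene: sorted(pmids) for gene, pmids in genes.items()}
        for comment_key, genes in grouped.items()
    }


# The two versions of the prdm1a entry from the MySQL output in the ADR.
rows = [
    {
        "SpeciesId": 1, "TaxID": 7955, "GeneId": 323473, "symbol": "prdm1a",
        "PubMed_ID": 15680355, "createtime": "2010-01-21 00:00:00",
        "comment": "One of two mutations in which defects are observed in both "
                   "cell populations: it leads to a complete absence of RB "
                   "neurons and a reduction in neural crest cells",
        "VersionId": 1,
    },
    {
        "SpeciesId": 1, "TaxID": 7955, "GeneId": 323473, "symbol": "prdm1a",
        "PubMed_ID": 15680355, "createtime": "2010-01-21 00:00:00",
        "comment": "prdm1 functions to promote the cell fate specification of "
                   "both neural crest cells and sensory neurons",
        "VersionId": 2,
    },
]

latest = latest_versions(rows)
compact = group_by_comment(latest)
```

In practice the same selection would more likely be pushed into the SQL query that feeds the transform, which is what the first consequence alludes to; the in-memory version here only shows the intended shape of the result.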