# [gn-transform-databases/ADR-002] Remodel GeneRIF_BASIC (NCBI RIFs) Metadata To Be More Compact * author: bonfacem * status: proposal * reviewed-by: pjotr, jnduli ## Context Currently, we represent NCBI RIFs as blank nodes that form the object of a given symbol: ``` gn:symbolsspA rdfs:comment [ rdf:type gnc:NCBIWikiEntry ; rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ; gnt:belongsToSpecies gn:Mus_musculus ; skos:notation taxon:511145 ; gnt:hasGeneId generif:944744 ; dct:hasVersion '1'^^xsd:int ; dct:references pubmed:97295 ; ... dct:references pubmed:15361618 ; dct:created "2007-11-06T00:38:00"^^xsd:datetime ; ] . gn:symbolaraC rdfs:comment [ rdf:type gnc:NCBIWikiEntry ; rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ; gnt:belongsToSpecies gn:Mus_musculus ; skos:notation taxon:511145 ; gnt:hasGeneId generif:944780 ; dct:hasVersion '1'^^xsd:int ; dct:references pubmed:320034 ; ... dct:references pubmed:16369539 ; dct:created "2007-11-06T00:39:00"^^xsd:datetime ; ] . ``` Moreover, we also store all the different versions of a comment: ``` mysql> SELECT * FROM GeneRIF_BASIC WHERE SpeciesId=1 AND TaxID=7955 AND GeneId=323473 AND PubMed_ID = 15680355\G *************************** 1. row *************************** SpeciesId: 1 TaxID: 7955 GeneId: 323473 symbol: prdm1a PubMed_ID: 15680355 createtime: 2010-01-21 00:00:00 comment: One of two mutations in which defects are observed in both cell populations: it leads to a complete absence of RB neurons and a reduction in neural crest cells VersionId: 1 *************************** 2. row *************************** SpeciesId: 1 TaxID: 7955 GeneId: 323473 symbol: prdm1a PubMed_ID: 15680355 createtime: 2010-01-21 00:00:00 comment: prdm1 functions to promote the cell fate specification of both neural crest cells and sensory neurons VersionId: 2 ``` ## Decision First, we should only store the latest version of a given RIF entry and ignore all other versions. RIF entries in the GeneRIF_BASIC table are uniquely identified by the columns: SpeciesId, GeneId, PubMed_ID, createtime, and VersionId. Since we are storing the latest version of a given RIF entry, we drop the version identifier during the RDF transform. We use a unique identifier for a given comment, and use that as a triple's QName: > gn:rif-- Finally instead of: ``` predicate ``` We use: ``` predicate object ; ... (more metadata) . ``` An example triple would take the form: ``` gn:rif-1-511145 rdf:label '''N-terminus verified by Edman degradation on mature peptide'''@en . gn:rif-1-511145 rdf:type gnc:NCBIWikiEntry . gn:rif-1-511145 gnt:belongsToSpecies gn:Mus_musculus . gn:rif-1-511145 skos:notation taxon:511145 . gn:rif-1-511145 rdfs:seeAlso [ gnt:hasGeneId generif:944744 ; gnt:symbol "spA" ; dct:references ( pubmed:97295 ... pubmed:15361618 ) . ] . gn:rif-1-511145 rdfs:seeAlso [ gnt:hasGeneId generif:944780 ; gnt:symbol "araC" ; dct:references ( pubmed:320034 ... pubmed:16369539 ) . ] ``` To efficiently store GeneIds, symbols and references, we use blank nodes. This reduces redundancy and simplifies the triples compared to including these details within the subject: ``` gn:rif-1-511145-944744 rdf:label '''N-terminus verified by Edman degradation on mature peptide'''@en . gn:rif-1-511145-944744 rdf:type gnc:NCBIWikiEntry . gn:rif-1-511145-944744 gnt:belongsToSpecies gn:Mus_musculus . gn:rif-1-511145-944744 skos:notation taxon:511145 . gn:rif-1-511145-944744 gnt:hasGeneId generif:944744 . gn:rif-1-511145-944744 gnt:symbol "spA" . gn:rif-1-511145-944744 dct:references ( pubmed:97295 ... pubmed:15361618 ) . gn:rif-1-511145-944780 rdf:label '''N-terminus verified by Edman degradation on mature peptide'''@en . gn:rif-1-511145-944780 rdf:type gnc:NCBIWikiEntry . gn:rif-1-511145-944780 gnt:belongsToSpecies gn:Mus_musculus . gn:rif-1-511145-944780 skos:notation taxon:511145 . gn:rif-1-511145-944780 gnt:hasGeneId generif:944744 . gn:rif-1-511145-944780 gnt:symbol "spA" . gn:rif-1-511145-944780 dct:references ( pubmed:97295 ... pubmed:15361618 ) . ``` ## Consequences * More complex SQL query required for the transform. * De-duplication of RIF entries during the transform. * Because of the terseness, less work during the I/O heavy operation. * Update SPARQL in tux02, tux01 in lockstep with updating GN3/GN2 and the XAPIAN index.