author: Munyoki Kilyungi, 2024-09-07 17:35:57 +0300
committer: Munyoki Kilyungi, 2024-09-07 17:35:57 +0300
commit: 5a9df614faecd5290779c6009a72de565d1b7512
tree: a3c53934f0fceb91ceb2673d26a298208ff70e0e /topics
parent: 01dbe2ff5aa9452dc190cf604b69a298a3c0e994
New ADR on transforming NCBI RIF metadata.
Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
Diffstat (limited to 'topics')
-rw-r--r-- topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi | 98
1 files changed, 98 insertions, 0 deletions
diff --git a/topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi b/topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi
new file mode 100644
index 0000000..c757f18
--- /dev/null
+++ b/topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi
@@ -0,0 +1,98 @@
+# [ADR/gn-transform-databases] Remodel GeneRIF_BASIC (NCBI Rifs) Metadata Using predicateObject Lists
+
+* author: bonfacem
+* status: proposed
+* reviewed-by: pjotr, jnduli
+
+## Context
+
+We can model RIF comments using predicateObject lists, as described in:
+
+=> https://issues.genenetwork.org/topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists [ADR/gn-transform-databases] Remodel GeneRIF Metadata Using predicateObject Lists
+
+However, currently for NCBI RIFs we represent comments as blank nodes:
+
+```
+gn:symbolsspA rdfs:comment [
+ rdf:type gnc:NCBIWikiEntry ;
+ rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
+ gnt:belongsToSpecies gn:Mus_musculus ;
+ skos:notation taxon:511145 ;
+ gnt:hasGeneId generif:944744 ;
+ dct:hasVersion '1'^^xsd:int ;
+ dct:references pubmed:97295 ;
+ ...
+ dct:references pubmed:15361618 ;
+ dct:created "2007-11-06T00:38:00"^^xsd:datetime ;
+] .
+gn:symbolaraC rdfs:comment [
+ rdf:type gnc:NCBIWikiEntry ;
+ rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
+ gnt:belongsToSpecies gn:Mus_musculus ;
+ skos:notation taxon:511145 ;
+ gnt:hasGeneId generif:944780 ;
+ dct:hasVersion '1'^^xsd:int ;
+ dct:references pubmed:320034 ;
+ ...
+ dct:references pubmed:16369539 ;
+ dct:created "2007-11-06T00:39:00"^^xsd:datetime ;
+] .
+
+```
+
+Here we see a lot of duplication across entries. The 2 entries above are identical except for the subject symbol, the "gnt:hasGeneId" and "dct:references" objects, and a one-minute difference in "dct:created".
+
+## Decision
+
+We use predicateObjectLists with blankNodePropertyLists as an idiom to represent the GeneRIF comments.
+
+=> https://www.w3.org/TR/turtle/#grammar-production-predicateObjectList predicateObjectList
+=> https://www.w3.org/TR/turtle/#grammar-production-blankNodePropertyList blankNodePropertyList
+
+In so doing, we can de-duplicate the entries demonstrated above. A representation of the above RDF Turtle triples would be:
+
+```
+[ rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ]
+rdf:type gnc:NCBIWikiEntry ;
+dct:created "2007-11-06T00:39:00"^^xsd:datetime ;
+gnt:belongsToSpecies gn:Mus_musculus ;
+skos:notation taxon:511145 ;
+dct:hasVersion '1'^^xsd:int ;
+rdfs:seeAlso [
+ gnt:hasGeneId generif:944744 ;
+ gnt:symbol gn:symbolsspA ;
+ dct:references ( pubmed:97295 ... pubmed:15361618 ) ;
+] ;
+rdfs:seeAlso [
+ gnt:hasGeneId generif:944780 ;
+ gnt:symbol gn:symbolaraC ;
+ dct:references ( pubmed:320034 ... pubmed:16369539 ) ;
+] .
+```
+
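+The above would translate to the following triples (blank nodes are labelled here for readability, and the collections are left in their abbreviated form):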
+
+```
+_:comment rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string .
+_:comment rdf:type gnc:NCBIWikiEntry .
+_:comment dct:created "2007-11-06T00:39:00"^^xsd:datetime .
+_:comment gnt:belongsToSpecies gn:Mus_musculus .
+_:comment skos:notation taxon:511145 .
+_:comment dct:hasVersion '1'^^xsd:int .
+_:comment rdfs:seeAlso _:metadata1 .
+_:comment rdfs:seeAlso _:metadata2 .
+_:metadata1 gnt:hasGeneId generif:944744 .
+_:metadata1 gnt:symbol gn:symbolsspA .
+_:metadata1 dct:references ( pubmed:97295 ... pubmed:15361618 ) .
+_:metadata2 gnt:hasGeneId generif:944780 .
+_:metadata2 gnt:symbol gn:symbolaraC .
+_:metadata2 dct:references ( pubmed:320034 ... pubmed:16369539 ) .
+```
+
+Beyond that, we intentionally use an RDF collection (Turtle's parenthesised list syntax) to store the PubMed references as an ordered list.
+
+## Consequences
+
+* De-duplication of comments during the transform while retaining the integrity of the RIF metadata.
+* The indexing script will have to be updated.
+* Because the output is terser, there is less work during the I/O-heavy operation.
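
The de-duplication described in the Decision section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual transform: the record layout, the field names, and the helpers `group_by_comment`, `escape_literal` and `render_turtle` are all hypothetical. The sample data carries only the two PubMed IDs visible on each side of the elision in the example, and "dct:created" is left out because the two source entries disagree on it.

```python
from collections import defaultdict

# Hypothetical flat records, one per (symbol, gene id) pair. Field names are
# assumptions; only the PubMed IDs visible in the ADR example are included.
RECORDS = [
    {
        "comment": "N-terminus verified by Edman degradation on mature peptide",
        "species": "gn:Mus_musculus",
        "taxon": "taxon:511145",
        "version": 1,
        "symbol": "gn:symbolsspA",
        "gene_id": "generif:944744",
        "pubmed": ["pubmed:97295", "pubmed:15361618"],
    },
    {
        "comment": "N-terminus verified by Edman degradation on mature peptide",
        "species": "gn:Mus_musculus",
        "taxon": "taxon:511145",
        "version": 1,
        "symbol": "gn:symbolaraC",
        "gene_id": "generif:944780",
        "pubmed": ["pubmed:320034", "pubmed:16369539"],
    },
]


def group_by_comment(records):
    """Group records that share the same comment-level fields."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["comment"], rec["species"], rec["taxon"], rec["version"])
        groups[key].append(rec)
    return groups


def escape_literal(text):
    """Escape backslashes, double quotes and newlines for a Turtle string."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_turtle(groups):
    """Render one blank-node subject per comment, one rdfs:seeAlso per gene."""
    blocks = []
    for (comment, species, taxon, version), recs in groups.items():
        lines = [
            f'[ rdfs:comment "{escape_literal(comment)}"^^xsd:string ]',
            "rdf:type gnc:NCBIWikiEntry ;",
            f"gnt:belongsToSpecies {species} ;",
            f"skos:notation {taxon} ;",
            f"dct:hasVersion '{version}'^^xsd:int ;",
        ]
        see_also = []
        for rec in recs:
            refs = " ".join(rec["pubmed"])
            see_also.append(
                "rdfs:seeAlso [\n"
                f"    gnt:hasGeneId {rec['gene_id']} ;\n"
                f"    gnt:symbol {rec['symbol']} ;\n"
                f"    dct:references ( {refs} ) ;\n"
                "]"
            )
        # The last seeAlso block closes the whole statement with " ."
        lines.append(" ;\n".join(see_also) + " .")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(render_turtle(group_by_comment(RECORDS)))
```

Grouping on the comment-level fields means the comment text is written once, while each gene keeps its own symbol, gene ID and reference list inside a separate "rdfs:seeAlso" blank node.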