author | Munyoki Kilyungi | 2024-09-07 17:35:57 +0300 |
---|---|---|
committer | Munyoki Kilyungi | 2024-09-07 17:35:57 +0300 |
commit | 5a9df614faecd5290779c6009a72de565d1b7512 (patch) | |
tree | a3c53934f0fceb91ceb2673d26a298208ff70e0e /topics | |
parent | 01dbe2ff5aa9452dc190cf604b69a298a3c0e994 (diff) | |
download | gn-gemtext-5a9df614faecd5290779c6009a72de565d1b7512.tar.gz |
New ADR on transforming NCBI RIF metadata.
Signed-off-by: Munyoki Kilyungi <me@bonfacemunyoki.com>
Diffstat (limited to 'topics')
-rw-r--r-- | topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi | 98 |
1 files changed, 98 insertions, 0 deletions
diff --git a/topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi b/topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi
new file mode 100644
index 0000000..c757f18
--- /dev/null
+++ b/topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi
@@ -0,0 +1,98 @@
+# [ADR/gn-transform-databases] Remodel GeneRIF_BASIC (NCBI Rifs) Metadata Using predicateObject Lists
+
+* author: bonfacem
+* status: proposed
+* reviewed-by: pjotr, jnduli
+
+## Context
+
+We can model RIF comments using predicateObject lists as described in:
+
+=> https://issues.genenetwork.org/topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists [ADR/gn-transform-databases] Remodel GeneRIF Metadata Using predicateObject Lists
+
+However, for NCBI RIFs we currently represent comments as blank nodes:
+
+```
+gn:symbolsspA rdfs:comment [
+    rdf:type gnc:NCBIWikiEntry ;
+    rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
+    gnt:belongsToSpecies gn:Mus_musculus ;
+    skos:notation taxon:511145 ;
+    gnt:hasGeneId generif:944744 ;
+    dct:hasVersion '1'^^xsd:int ;
+    dct:references pubmed:97295 ;
+    ...
+    dct:references pubmed:15361618 ;
+    dct:created "2007-11-06T00:38:00"^^xsd:datetime ;
+] .
+gn:symbolaraC rdfs:comment [
+    rdf:type gnc:NCBIWikiEntry ;
+    rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
+    gnt:belongsToSpecies gn:Mus_musculus ;
+    skos:notation taxon:511145 ;
+    gnt:hasGeneId generif:944780 ;
+    dct:hasVersion '1'^^xsd:int ;
+    dct:references pubmed:320034 ;
+    ...
+    dct:references pubmed:16369539 ;
+    dct:created "2007-11-06T00:39:00"^^xsd:datetime ;
+] .
+
+```
+
+Here we see a lot of duplication across symbols. For the two entries above, the comment text and most of the metadata are identical; the main differences are the "gnt:hasGeneId" and "dct:references" predicates.
+
+## Decision
+
+We use predicateObjectLists with blankNodePropertyLists as an idiom to represent the GeneRIF comments.
+
+=> https://www.w3.org/TR/turtle/#grammar-production-predicateObjectList predicateObjectList
+=> https://www.w3.org/TR/turtle/#grammar-production-blankNodePropertyList blankNodePropertyList
+
+In so doing, we can de-duplicate the entries demonstrated above. The RDF Turtle above would then be represented as:
+
+```
+[ rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ]
+rdf:type gnc:NCBIWikiEntry ;
+dct:created "2007-11-06T00:39:00"^^xsd:datetime ;
+gnt:belongsToSpecies gn:Mus_musculus ;
+skos:notation taxon:511145 ;
+dct:hasVersion '1'^^xsd:int ;
+rdfs:seeAlso [
+    gnt:hasGeneId generif:944744 ;
+    gnt:symbol gn:symbolsspA ;
+    dct:references ( pubmed:97295 ... pubmed:15361618 ) ;
+] ;
+rdfs:seeAlso [
+    gnt:hasGeneId generif:944780 ;
+    gnt:symbol gn:symbolaraC ;
+    dct:references ( pubmed:320034 ... pubmed:16369539 ) ;
+] .
+```
+
+The above would translate to the following triples:
+
+```
+_:comment rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string .
+_:comment rdf:type gnc:NCBIWikiEntry .
+_:comment dct:created "2007-11-06T00:39:00"^^xsd:datetime .
+_:comment gnt:belongsToSpecies gn:Mus_musculus .
+_:comment skos:notation taxon:511145 .
+_:comment dct:hasVersion '1'^^xsd:int .
+_:comment rdfs:seeAlso _:metadata1 .
+_:comment rdfs:seeAlso _:metadata2 .
+_:metadata1 gnt:hasGeneId generif:944744 .
+_:metadata1 gnt:symbol gn:symbolsspA .
+_:metadata1 dct:references ( pubmed:97295 ... pubmed:15361618 ) .
+_:metadata2 gnt:hasGeneId generif:944780 .
+_:metadata2 gnt:symbol gn:symbolaraC .
+_:metadata2 dct:references ( pubmed:320034 ... pubmed:16369539 ) .
+```
+
+Beyond that, we intentionally use a sequence (an RDF collection, written with parentheses in Turtle) to store the list of PubMed references.
+
+## Consequences
+
+* De-duplication of comments during the transform while retaining the integrity of the RIF metadata.
+* The indexing script will have to be updated.
+* Because the output is terser, the I/O-heavy operation does less work.
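
The de-duplication proposed in the ADR can be sketched roughly as below. This is only an illustration in Python, not the project's transform code: the `RifEntry` record, its field names, and `group_by_comment` are hypothetical, and the sketch assumes entries are grouped on the comment text plus the shared species, taxon, and version fields. How differing `dct:created` values should be reconciled is left out, since the ADR does not say.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RifEntry:
    """Hypothetical flat record for one NCBI RIF row (field names are assumptions)."""
    symbol: str
    gene_id: int
    comment: str
    species: str
    taxon: int
    version: int
    pubmed_ids: tuple


def group_by_comment(entries):
    """Group entries that share the same comment text and common metadata.

    Each key plays the role of one comment node; each list item plays the
    role of one rdfs:seeAlso blank node (gene id, symbol, references).
    """
    grouped = defaultdict(list)
    for entry in entries:
        key = (entry.comment, entry.species, entry.taxon, entry.version)
        grouped[key].append(
            {
                "gene_id": entry.gene_id,
                "symbol": entry.symbol,
                "references": list(entry.pubmed_ids),
            }
        )
    return dict(grouped)


COMMENT = "N-terminus verified by Edman degradation on mature peptide"

# Only the first and last PubMed ids shown in the ADR are used here;
# the ids elided in the ADR are not filled in.
entries = [
    RifEntry("sspA", 944744, COMMENT, "Mus_musculus", 511145, 1, (97295, 15361618)),
    RifEntry("araC", 944780, COMMENT, "Mus_musculus", 511145, 1, (320034, 16369539)),
]

grouped = group_by_comment(entries)
```

With the two example entries, `grouped` holds a single key whose value lists both symbols, which corresponds to one comment node carrying two `rdfs:seeAlso` blank nodes.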