Diffstat (limited to 'topics/ADR/gn-transform-databases')
-rw-r--r-- topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists.gmi 74
-rw-r--r-- topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi 102
-rw-r--r-- topics/ADR/gn-transform-databases/002-remodel-ncbi-transform-to-be-more-compact.gmi 127
3 files changed, 303 insertions, 0 deletions
diff --git a/topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists.gmi b/topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists.gmi
new file mode 100644
index 0000000..1e3ee6a
--- /dev/null
+++ b/topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists.gmi
@@ -0,0 +1,74 @@
+# [gn-transform-databases/ADR-000] Remodel GeneRIF Metadata Using predicateObject Lists
+
+* author: bonfacem
+* status: rejected
+* reviewed-by: pjotr, jnduli
+
+## Context
+
+In RDF 1.1 Turtle, the subject of a triple must be an IRI (usually written as a QName), a blank node, or a collection. As such, a string literal cannot form the subject. In simpler terms, this is not possible:
+
+```
+"Unique expression signature of a system that includes the subiculum, layer 6 in cortex ventral and lateral to dorsal striatum, and the endopiriform nucleus. Expression in cerebellum is apparently limited to Bergemann glia ABA" dct:created "2007-08-31T13:00:47"^^xsd:datetime .
+```
+
+As of commit "397745b554e0", the work-around was to manually create a unique identifier for each comment in the GeneRIF table. This identifier was created by combining GeneRIF.Id with GeneRIF.VersionId. One challenge with this is that it couples our identifiers to MySQL's generation of the GeneRIF.Id column. Here's an example of snipped Turtle entries:
+
+```
+gn:wiki-352-0 rdfs:comment "Ubiquitously expressed. Hypomorphic vibrator allele shows degeneration of interneurons and tremor and juvenile lethality; modified by CAST alleles of Nxf1. Knockout has hepatic steatosis and hypoglycemia." .
+gn:wiki-352-0 rdf:type gnc:GNWikiEntry .
+gn:wiki-352-0 gnt:symbol gn:symbolPitpna .
+gn:wiki-352-0 dct:created "2006-03-10T15:39:29"^^xsd:datetime .
+gn:wiki-352-0 gnt:belongsToSpecies gn:Mus_musculus .
+gn:wiki-352-0 dct:hasVersion "0"^^xsd:int .
+gn:wiki-352-0 dct:identifier "352"^^xsd:int .
+gn:wiki-352-0 gnt:initial "BAH" .
+gn:wiki-352-0 foaf:mbox "XXX@XXX.XXX" .
+gn:wiki-352-0 dct:references ( pubmed:9182797 pubmed:12788952 pubmed:14517553 ) .
+gn:wiki-352-0 gnt:belongsToCategory ( "Cellular distribution" "Development and aging" "Expression patterns: mature cells, tissues" "Genetic variation and alleles" "Health and disease associations" "Interactions: mRNA, proteins, other molecules" ) .
+```
+
+## Decision
+
+We want to avoid manually generating a unique identifier for each WIKI comment. We should instead have that UID be a blank node reference that we don't care about and use predicateObjectLists as an idiom for representing string literals that can't be subjects.
+
+=> https://www.w3.org/TR/turtle/#grammar-production-predicateObjectList Predicate Object Lists
+
+The above transform (gn:wiki-352-0) would now be represented as:
+
+```
+[ rdfs:comment '''Ubiquitously expressed. Hypomorphic vibrator allele shows degeneration of interneurons and tremor and juvenile lethality; modified by CAST alleles of Nxf1. Knockout has hepatic steatosis and hypoglycemia.'''@en] rdf:type gnc:GNWikiEntry ;
+ gnt:belongsToSpecies gn:Mus_musculus ;
+ dct:created "2006-03-10 12:39:29"^^xsd:datetime ;
+ dct:references ( pubmed:9182797 pubmed:12788952 pubmed:14517553 ) ;
+ foaf:mbox <XXX@XXX.XXX> ;
+ dct:identifier "352"^^xsd:integer ;
+ dct:hasVersion "0"^^xsd:integer ;
+ gnt:initial "BAH" ;
+ gnt:belongsToCategory ( "Cellular distribution" "Development and aging" "Expression patterns: mature cells, tissues" "Genetic variation and alleles" "Health and disease associations" "Interactions: mRNA, proteins, other molecules" ) ;
+ gnt:symbol gn:symbolPitpna .
+```
+
+The above can be loosely translated as:
+
+```
+_:comment rdfs:comment '''Ubiquitously expressed. Hypomorphic vibrator allele shows degeneration of interneurons and tremor and juvenile lethality; modified by CAST alleles of Nxf1. Knockout has hepatic steatosis and hypoglycemia.'''@en .
+_:comment rdf:type gnc:GNWikiEntry .
+_:comment dct:created "2006-03-10 12:39:29"^^xsd:datetime .
+_:comment dct:references ( pubmed:9182797 pubmed:12788952 pubmed:14517553 ) .
+_:comment foaf:mbox <XXX@XXX.XXX> .
+_:comment dct:identifier "352"^^xsd:integer .
+_:comment dct:hasVersion "0"^^xsd:integer .
+_:comment gnt:initial "BAH" .
+_:comment gnt:belongsToCategory ( "Cellular distribution" "Development and aging" "Expression patterns: mature cells, tissues" "Genetic variation and alleles" "Health and disease associations" "Interactions: mRNA, proteins, other molecules" ) .
+_:comment gnt:symbol gn:symbolPitpna .
+```
+
+## Consequences
+
+* Update SPARQL in tux02, tux01 in lockstep with updating GN3/GN2 and the XAPIAN index.
+* Reduction in size of the final output, and faster transform time, because predicateObjectLists produce terser RDF.
+
+## Rejection Rationale
+
+This proposal was rejected because relying on blank nodes as identifiers is opaque and not human-readable. We want to use human-readable identifiers where possible.
diff --git a/topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi b/topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi
new file mode 100644
index 0000000..073525a
--- /dev/null
+++ b/topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi
@@ -0,0 +1,102 @@
+# [gn-transform-databases/ADR-001] Remodel GeneRIF_BASIC (NCBI RIFs) Metadata Using predicateObject Lists
+
+* author: bonfacem
+* status: rejected
+* reviewed-by: pjotr, jnduli
+
+## Context
+
+We can model RIF comments using predicateObject lists as described in:
+
+=> https://issues.genenetwork.org/topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists [ADR/gn-transform-databases] Remodel GeneRIF Metadata Using predicateObject Lists
+
+However, currently for NCBI RIFs we represent comments as blank nodes:
+
+```
+gn:symbolsspA rdfs:comment [
+ rdf:type gnc:NCBIWikiEntry ;
+ rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
+ gnt:belongsToSpecies gn:Mus_musculus ;
+ skos:notation taxon:511145 ;
+ gnt:hasGeneId generif:944744 ;
+ dct:hasVersion '1'^^xsd:int ;
+ dct:references pubmed:97295 ;
+ ...
+ dct:references pubmed:15361618 ;
+ dct:created "2007-11-06T00:38:00"^^xsd:datetime ;
+] .
+gn:symbolaraC rdfs:comment [
+ rdf:type gnc:NCBIWikiEntry ;
+ rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
+ gnt:belongsToSpecies gn:Mus_musculus ;
+ skos:notation taxon:511145 ;
+ gnt:hasGeneId generif:944780 ;
+ dct:hasVersion '1'^^xsd:int ;
+ dct:references pubmed:320034 ;
+ ...
+ dct:references pubmed:16369539 ;
+ dct:created "2007-11-06T00:39:00"^^xsd:datetime ;
+] .
+
+```
+
+Here we see a lot of duplication across entries for different symbols. For the above 2 entries, everything is the same except for the symbol, the "gnt:hasGeneId" and "dct:references" predicates, and a one-minute difference in "dct:created".
+
+## Decision
+
+We use predicateObjectLists with blankNodePropertyLists as an idiom to represent the GeneRIF comments.
+
+=> https://www.w3.org/TR/turtle/#grammar-production-predicateObjectList predicateObjectList
+=> https://www.w3.org/TR/turtle/#grammar-production-blankNodePropertyList blankNodePropertyList
+
+In so doing, we can de-duplicate the entries demonstrated above. A representation of the above RDF Turtle triples would be:
+
+```
+[ rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ]
+rdf:type gnc:NCBIWikiEntry ;
+dct:created "2007-11-06T00:39:00"^^xsd:datetime ;
+gnt:belongsToSpecies gn:Mus_musculus ;
+skos:notation taxon:511145 ;
+dct:hasVersion '1'^^xsd:int ;
+rdfs:seeAlso [
+ gnt:hasGeneId generif:944744 ;
+ gnt:symbol gn:symbolsspA ;
+ dct:references ( pubmed:97295 ... pubmed:15361618 ) ;
+] ;
+rdfs:seeAlso [
+ gnt:hasGeneId generif:944780 ;
+ gnt:symbol gn:symbolaraC ;
+ dct:references ( pubmed:320034 ... pubmed:16369539 ) ;
+] .
+```
+
+The above would translate to the following triples:
+
+```
+_:comment rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string .
+_:comment rdf:type gnc:NCBIWikiEntry .
+_:comment dct:created "2007-11-06T00:39:00"^^xsd:datetime .
+_:comment gnt:belongsToSpecies gn:Mus_musculus .
+_:comment skos:notation taxon:511145 .
+_:comment dct:hasVersion '1'^^xsd:int .
+_:comment rdfs:seeAlso _:metadata1 .
+_:comment rdfs:seeAlso _:metadata2 .
+_:metadata1 gnt:hasGeneId generif:944744 .
+_:metadata1 gnt:symbol gn:symbolsspA .
+_:metadata1 dct:references ( pubmed:97295 ... pubmed:15361618 ) .
+_:metadata2 gnt:hasGeneId generif:944780 .
+_:metadata2 gnt:symbol gn:symbolaraC .
+_:metadata2 dct:references ( pubmed:320034 ... pubmed:16369539 ) .
+```
+
+Beyond that, we intentionally use an RDF collection (the "( ... )" syntax) to store the list of PubMed references.
+
+## Consequences
+
+* De-duplication of comments during the transform while retaining the integrity of the RIF metadata.
+* Because the output is terser, the I/O-heavy transform does less work.
+* Update SPARQL in tux02, tux01 in lockstep with updating GN3/GN2 and the XAPIAN index.
+
+## Rejection Rationale
+
+This proposal was rejected because relying on blank nodes as identifiers is opaque and not human-readable. We want to use human-readable identifiers where possible.
diff --git a/topics/ADR/gn-transform-databases/002-remodel-ncbi-transform-to-be-more-compact.gmi b/topics/ADR/gn-transform-databases/002-remodel-ncbi-transform-to-be-more-compact.gmi
new file mode 100644
index 0000000..ac06fc1
--- /dev/null
+++ b/topics/ADR/gn-transform-databases/002-remodel-ncbi-transform-to-be-more-compact.gmi
@@ -0,0 +1,127 @@
+# [gn-transform-databases/ADR-002] Remodel GeneRIF_BASIC (NCBI RIFs) Metadata To Be More Compact
+
+* author: bonfacem
+* status: proposal
+* reviewed-by: pjotr, jnduli
+
+## Context
+
+Currently, we represent NCBI RIFs as blank nodes that form the object of a given symbol:
+
+```
+gn:symbolsspA rdfs:comment [
+ rdf:type gnc:NCBIWikiEntry ;
+ rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
+ gnt:belongsToSpecies gn:Mus_musculus ;
+ skos:notation taxon:511145 ;
+ gnt:hasGeneId generif:944744 ;
+ dct:hasVersion '1'^^xsd:int ;
+ dct:references pubmed:97295 ;
+ ...
+ dct:references pubmed:15361618 ;
+ dct:created "2007-11-06T00:38:00"^^xsd:datetime ;
+] .
+gn:symbolaraC rdfs:comment [
+ rdf:type gnc:NCBIWikiEntry ;
+ rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
+ gnt:belongsToSpecies gn:Mus_musculus ;
+ skos:notation taxon:511145 ;
+ gnt:hasGeneId generif:944780 ;
+ dct:hasVersion '1'^^xsd:int ;
+ dct:references pubmed:320034 ;
+ ...
+ dct:references pubmed:16369539 ;
+ dct:created "2007-11-06T00:39:00"^^xsd:datetime ;
+] .
+```
+
+Moreover, we also store all the different versions of a comment:
+
+```
+mysql> SELECT * FROM GeneRIF_BASIC WHERE SpeciesId=1 AND TaxID=7955 AND GeneId=323473 AND PubMed_ID = 15680355\G
+*************************** 1. row ***************************
+ SpeciesId: 1
+ TaxID: 7955
+ GeneId: 323473
+ symbol: prdm1a
+ PubMed_ID: 15680355
+createtime: 2010-01-21 00:00:00
+ comment: One of two mutations in which defects are observed in both cell populations: it leads to a complete absence of RB neurons and a reduction in neural crest cells
+ VersionId: 1
+*************************** 2. row ***************************
+ SpeciesId: 1
+ TaxID: 7955
+ GeneId: 323473
+ symbol: prdm1a
+ PubMed_ID: 15680355
+createtime: 2010-01-21 00:00:00
+ comment: prdm1 functions to promote the cell fate specification of both neural crest cells and sensory neurons
+ VersionId: 2
+```
+
+## Decision
+
+First, we should only store the latest version of a given RIF entry and ignore all other versions. RIF entries in the GeneRIF_BASIC table are uniquely identified by the columns: SpeciesId, GeneId, PubMed_ID, createtime, and VersionId. Since we are storing the latest version of a given RIF entry, we drop the version identifier during the RDF transform.
+
+We use a unique identifier for a given comment, and use that as the subject QName of its triples:
+
+> gn:rif-<speciesId>-<GeneId>
+
+Finally, instead of:
+
+```
+<symbol> predicate <comment metadata>
+```
+
+We use:
+
+```
+<comment-uid> predicate object ;
+ ... (more metadata) .
+```
+
+An example set of triples would take the form:
+
+```
+gn:rif-1-511145 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en .
+gn:rif-1-511145 rdf:type gnc:NCBIWikiEntry .
+gn:rif-1-511145 gnt:belongsToSpecies gn:Mus_musculus .
+gn:rif-1-511145 skos:notation taxon:511145 .
+gn:rif-1-511145 rdfs:seeAlso [
+ gnt:hasGeneId generif:944744 ;
+ gnt:symbol "sspA" ;
+ dct:references ( pubmed:97295 ... pubmed:15361618 ) ;
+] .
+gn:rif-1-511145 rdfs:seeAlso [
+ gnt:hasGeneId generif:944780 ;
+ gnt:symbol "araC" ;
+ dct:references ( pubmed:320034 ... pubmed:16369539 ) ;
+] .
+```
+
+To efficiently store GeneIds, symbols and references, we use blank nodes. This reduces redundancy and simplifies the triples compared to including these details within the subject:
+
+```
+gn:rif-1-511145-944744 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en .
+gn:rif-1-511145-944744 rdf:type gnc:NCBIWikiEntry .
+gn:rif-1-511145-944744 gnt:belongsToSpecies gn:Mus_musculus .
+gn:rif-1-511145-944744 skos:notation taxon:511145 .
+gn:rif-1-511145-944744 gnt:hasGeneId generif:944744 .
+gn:rif-1-511145-944744 gnt:symbol "sspA" .
+gn:rif-1-511145-944744 dct:references ( pubmed:97295 ... pubmed:15361618 ) .
+
+gn:rif-1-511145-944780 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en .
+gn:rif-1-511145-944780 rdf:type gnc:NCBIWikiEntry .
+gn:rif-1-511145-944780 gnt:belongsToSpecies gn:Mus_musculus .
+gn:rif-1-511145-944780 skos:notation taxon:511145 .
+gn:rif-1-511145-944780 gnt:hasGeneId generif:944780 .
+gn:rif-1-511145-944780 gnt:symbol "araC" .
+gn:rif-1-511145-944780 dct:references ( pubmed:320034 ... pubmed:16369539 ) .
+```
+
+## Consequences
+
+* More complex SQL query required for the transform.
+* De-duplication of RIF entries during the transform.
+* Because the output is terser, the I/O-heavy transform does less work.
+* Update SPARQL in tux02, tux01 in lockstep with updating GN3/GN2 and the XAPIAN index.