Diffstat (limited to 'topics/ADR')
-rw-r--r-- topics/ADR/gn-guile/000-markdown-editor-push-to-bare-repo.gmi | 18
-rw-r--r-- topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists.gmi | 74
-rw-r--r-- topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi | 102
-rw-r--r-- topics/ADR/gn-transform-databases/002-remodel-ncbi-transform-to-be-more-compact.gmi | 127
-rw-r--r-- topics/ADR/gn3/000-add-test-cases-for-rdf.gmi | 21
-rw-r--r-- topics/ADR/gn3/001-remove-stace-traces-in-gn3-error-response.gmi | 49
-rw-r--r-- topics/ADR/gn3/002-run-rdf-tests-in-build-container.gmi | 32
7 files changed, 423 insertions, 0 deletions
diff --git a/topics/ADR/gn-guile/000-markdown-editor-push-to-bare-repo.gmi b/topics/ADR/gn-guile/000-markdown-editor-push-to-bare-repo.gmi
new file mode 100644
index 0000000..05b2b6a
--- /dev/null
+++ b/topics/ADR/gn-guile/000-markdown-editor-push-to-bare-repo.gmi
@@ -0,0 +1,18 @@
+# [gn-guile/ADR-000] Extend Markdown Editor to push to Git Bare Repo
+
+* author: bonfacem
+* status: accepted
+* reviewed-by: alexm, jnduli
+
+## Context
+
+The gn-guile markdown editor currently reads from normal git repositories. However, for GN's self-hosted git repository, we use bare repositories. Bare repositories store only the git objects and have no working tree, so we can't edit files in them directly.
+
+## Decision
+
+gn-guile and the cgit instance run on the same server. We will keep two repositories, each configurable: "CURRENT_REPO_PATH", a normal repository that holds the raw files; and "CGIT_REPO_PATH", the bare repository. We will make edits in the normal repository and, once that is done, push locally to the cgit instance.
+
+## Consequences
+
+* When creating the gn-guile container, this introduces extra complexity in that we will have to make sure that the container has the correct write access to the bare repository.
+* With this, we are coupled to our GN git set-up.
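The edit-then-push flow in the Decision above can be sketched as a short sequence of git commands. This is only an illustration in Python, not gn-guile's implementation: the function names, the default branch "master" and the example paths are assumptions, while "CURRENT_REPO_PATH" and "CGIT_REPO_PATH" are the two settings named in the ADR.

```python
import subprocess


def push_commands(current_repo_path, cgit_repo_path, file_path, message, branch="master"):
    """Return the git invocations that commit an edit and push it to the bare repo."""
    git = ["git", "-C", current_repo_path]
    return [
        git + ["add", "--", file_path],
        git + ["commit", "-m", message],
        # A local filesystem path is a valid push destination.
        git + ["push", cgit_repo_path, branch],
    ]


def commit_and_push(current_repo_path, cgit_repo_path, file_path, message, branch="master"):
    """Run each command in order, stopping at the first failure."""
    for argv in push_commands(current_repo_path, cgit_repo_path, file_path, message, branch):
        subprocess.run(argv, check=True)
```

Keeping the command list separate from its execution makes the sequence easy to inspect, and the final push is the step that needs the write access to the bare repository mentioned in the Consequences.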
diff --git a/topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists.gmi b/topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists.gmi
new file mode 100644
index 0000000..1e3ee6a
--- /dev/null
+++ b/topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists.gmi
@@ -0,0 +1,74 @@
+# [gn-transform-databases/ADR-000] Remodel GeneRIF Metadata Using predicateObject Lists
+
+* author: bonfacem
+* status: rejected
+* reviewed-by: pjotr, jnduli
+
+## Context
+
+In RDF 1.1 Turtle, the subject of a triple must be an IRI (for example, a QName) or a blank node. As such, you cannot have a string literal as the subject. In simpler terms, this is not possible:
+
+```
+"Unique expression signature of a system that includes the subiculum, layer 6 in cortex ventral and lateral to dorsal striatum, and the endopiriform nucleus. Expression in cerebellum is apparently limited to Bergemann glia ABA" dct:created "2007-08-31T13:00:47"^^xsd:datetime .
+```
+
+As of commit "397745b554e0", a work-around was to manually create a unique identifier for each comment for the GeneRIF table. This identifier was created by combining GeneRIF.Id with GeneRIF.VersionId. One challenge with this is that we create some coupling with MySQL's unique generation of the GeneRIF.Id column. Here's an example of snipped turtle entries:
+
+```
+gn:wiki-352-0 rdfs:comment "Ubiquitously expressed. Hypomorphic vibrator allele shows degeneration of interneurons and tremor and juvenile lethality; modified by CAST alleles of Nxf1. Knockout has hepatic steatosis and hypoglycemia." .
+gn:wiki-352-0 rdf:type gnc:GNWikiEntry .
+gn:wiki-352-0 gnt:symbol gn:symbolPitpna .
+gn:wiki-352-0 dct:created "2006-03-10T15:39:29"^^xsd:datetime .
+gn:wiki-352-0 gnt:belongsToSpecies gn:Mus_musculus .
+gn:wiki-352-0 dct:hasVersion "0"^^xsd:int .
+gn:wiki-352-0 dct:identifier "352"^^xsd:int .
+gn:wiki-352-0 gnt:initial "BAH" .
+gn:wiki-352-0 foaf:mbox "XXX@XXX.XXX" .
+gn:wiki-352-0 dct:references ( pubmed:9182797 pubmed:12788952 pubmed:14517553 ) .
+gn:wiki-352-0 gnt:belongsToCategory ( "Cellular distribution" "Development and aging" "Expression patterns: mature cells, tissues" "Genetic variation and alleles" "Health and disease associations" "Interactions: mRNA, proteins, other molecules" ) .
+```
+
+## Decision
+
+We want to avoid manually generating a unique identifier for each WIKI comment. We should instead have that UID be a blank node reference that we don't care about and use predicateObjectLists as an idiom for representing string literals that can't be subjects.
+
+=> https://www.w3.org/TR/turtle/#grammar-production-predicateObjectList Predicate Object Lists
+
+The above transform (gn:wiki-352-0) would now be represented as:
+
+```
+[ rdfs:comment '''Ubiquitously expressed. Hypomorphic vibrator allele shows degeneration of interneurons and tremor and juvenile lethality; modified by CAST alleles of Nxf1. Knockout has hepatic steatosis and hypoglycemia.'''@en] rdf:type gnc:GNWikiEntry ;
+ gnt:belongsToSpecies gn:Mus_musculus ;
+ dct:created "2006-03-10 12:39:29"^^xsd:datetime ;
+ dct:references ( pubmed:9182797 pubmed:12788952 pubmed:14517553 ) ;
+ foaf:mbox <XXX@XXX.XXX> ;
+ dct:identifier "352"^^xsd:integer ;
+ dct:hasVersion "0"^^xsd:integer ;
+ gnt:initial "BAH" ;
+ gnt:belongsToCategory ( "Cellular distribution" "Development and aging" "Expression patterns: mature cells, tissues" "Genetic variation and alleles" "Health and disease associations" "Interactions: mRNA, proteins, other molecules" ) ;
+ gnt:symbol gn:symbolPitpna .
+```
+
+The above can be loosely translated as:
+
+```
+_:comment rdfs:comment '''Ubiquitously expressed. Hypomorphic vibrator allele shows degeneration of interneurons and tremor and juvenile lethality; modified by CAST alleles of Nxf1. Knockout has hepatic steatosis and hypoglycemia.'''@en .
+_:comment rdf:type gnc:GNWikiEntry .
+_:comment dct:created "2006-03-10 12:39:29"^^xsd:datetime .
+_:comment dct:references ( pubmed:9182797 pubmed:12788952 pubmed:14517553 ) .
+_:comment foaf:mbox <XXX@XXX.XXX> .
+_:comment dct:identifier "352"^^xsd:integer .
+_:comment dct:hasVersion "0"^^xsd:integer .
+_:comment gnt:initial "BAH" .
+_:comment gnt:belongsToCategory ( "Cellular distribution" "Development and aging" "Expression patterns: mature cells, tissues" "Genetic variation and alleles" "Health and disease associations" "Interactions: mRNA, proteins, other molecules" ) .
+_:comment gnt:symbol gn:symbolPitpna .
+```
+
+## Consequences
+
+* Update SPARQL in tux02, tux01 in lockstep with updating GN3/GN2 and the XAPIAN index.
+* Reduction in size of the final output, and faster transform time, because predicateObjectLists produce terser RDF.
+
+## Rejection Rationale
+
+This proposal was rejected because relying on blank-nodes as an identifier is opaque and not human-readable. We want to use human readable identifiers where possible.
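The shape of the proposed (and rejected) output can be sketched as a small serializer. This is a hypothetical illustration rather than the actual transform code: the function names and the four-space indentation are assumptions, and only the prefixes and predicates come from the example in the Decision.

```python
def turtle_literal(text):
    """Quote text as a Turtle long string literal tagged as English."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'''" + escaped + "'''@en"


def wiki_entry(comment, predicate_objects):
    """Render one entry whose subject is an anonymous blank node."""
    lines = ["[ rdfs:comment " + turtle_literal(comment) + " ] rdf:type gnc:GNWikiEntry"]
    lines += ["    " + predicate + " " + obj for predicate, obj in predicate_objects]
    # Every predicate-object pair but the last ends in ";", the last in ".".
    return " ;\n".join(lines) + " ."
```

No identifier is derived from GeneRIF.Id or GeneRIF.VersionId here, which is the decoupling the proposal aimed for. It also shows what the rejection rationale objects to: the entry has no name that a person or a query can refer to.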
diff --git a/topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi b/topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi
new file mode 100644
index 0000000..073525a
--- /dev/null
+++ b/topics/ADR/gn-transform-databases/001-remodel-ncbi-transform-with-predicateobject-lists.gmi
@@ -0,0 +1,102 @@
+# [gn-transform-databases/ADR-001] Remodel GeneRIF_BASIC (NCBI RIFs) Metadata Using predicateObject Lists
+
+* author: bonfacem
+* status: rejected
+* reviewed-by: pjotr, jnduli
+
+## Context
+
+We can model RIF comments using predicateObject lists as described in:
+
+=> https://issues.genenetwork.org/topics/ADR/gn-transform-databases/000-remodel-rif-transform-with-predicateobject-lists [ADR/gn-transform-databases] Remodel GeneRIF Metadata Using predicateObject Lists
+
+However, currently for NCBI RIFs we represent comments as blank nodes:
+
+```
+gn:symbolsspA rdfs:comment [
+ rdf:type gnc:NCBIWikiEntry ;
+ rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
+ gnt:belongsToSpecies gn:Mus_musculus ;
+ skos:notation taxon:511145 ;
+ gnt:hasGeneId generif:944744 ;
+ dct:hasVersion '1'^^xsd:int ;
+ dct:references pubmed:97295 ;
+ ...
+ dct:references pubmed:15361618 ;
+ dct:created "2007-11-06T00:38:00"^^xsd:datetime ;
+] .
+gn:symbolaraC rdfs:comment [
+ rdf:type gnc:NCBIWikiEntry ;
+ rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
+ gnt:belongsToSpecies gn:Mus_musculus ;
+ skos:notation taxon:511145 ;
+ gnt:hasGeneId generif:944780 ;
+ dct:hasVersion '1'^^xsd:int ;
+ dct:references pubmed:320034 ;
+ ...
+ dct:references pubmed:16369539 ;
+ dct:created "2007-11-06T00:39:00"^^xsd:datetime ;
+] .
+
+```
+
+Here we see a lot of duplication across symbols. For the above 2 entries, apart from the symbol itself, the only differences are the "gnt:hasGeneId" and "dct:references" predicates (and one minute in "dct:created").
+
+## Decision
+
+We use predicateObjectLists with blankNodePropertyLists as an idiom to represent the generif comments.
+
+=> https://www.w3.org/TR/turtle/#grammar-production-predicateObjectList predicateObjectList
+=> https://www.w3.org/TR/turtle/#grammar-production-blankNodePropertyList blankNodePropertyList
+
+In so doing, we can de-duplicate the entries demonstrated above. A representation of the above RDF Turtle triples would be:
+
+```
+[ rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ]
+rdf:type gnc:NCBIWikiEntry ;
+dct:created "2007-11-06T00:39:00"^^xsd:datetime ;
+gnt:belongsToSpecies gn:Mus_musculus ;
+skos:notation taxon:511145 ;
+dct:hasVersion '1'^^xsd:int ;
+rdfs:seeAlso [
+ gnt:hasGeneId generif:944744 ;
+ gnt:symbol gn:symbolsspA ;
+ dct:references ( pubmed:97295 ... pubmed:15361618 ) ;
+] ;
+rdfs:seeAlso [
+ gnt:hasGeneId generif:944780 ;
+ gnt:symbol gn:symbolaraC ;
+ dct:references ( pubmed:320034 ... pubmed:16369539 ) ;
+] .
+```
+
+The above would translate to the following triples:
+
+```
+_:comment rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string .
+_:comment rdf:type gnc:NCBIWikiEntry .
+_:comment dct:created "2007-11-06T00:39:00"^^xsd:datetime .
+_:comment gnt:belongsToSpecies gn:Mus_musculus .
+_:comment skos:notation taxon:511145 .
+_:comment dct:hasVersion '1'^^xsd:int .
+_:comment rdfs:seeAlso _:metadata1 .
+_:comment rdfs:seeAlso _:metadata2 .
+_:metadata1 gnt:hasGeneId generif:944744 .
+_:metadata1 gnt:symbol gn:symbolsspA .
+_:metadata1 dct:references ( pubmed:97295 ... pubmed:15361618 ) .
+_:metadata2 gnt:hasGeneId generif:944780 .
+_:metadata2 gnt:symbol gn:symbolaraC .
+_:metadata2 dct:references ( pubmed:320034 ... pubmed:16369539 ) .
+```
+
+Beyond that, we intentionally use an RDF collection (the parenthesised list) to store the pubmed references.
+
+## Consequences
+
+* De-duplication of comments during the transform while retaining the integrity of the RIF metadata.
+* Because of the terseness, there is less work during the I/O-heavy operation.
+* Update SPARQL in tux02, tux01 in lockstep with updating GN3/GN2 and the XAPIAN index.
+
+## Rejection Rationale
+
+This proposal was rejected because relying on blank-nodes as an identifier is opaque and not human-readable. We want to use human readable identifiers where possible.
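The de-duplication described in the Decision amounts to grouping rows by comment text. A minimal sketch follows, assuming each row is a dict with the hypothetical keys "comment", "gene_id", "symbol" and "pubmed_ids"; the real transform and its column names may differ.

```python
from collections import defaultdict


def group_by_comment(rows):
    """Map each distinct comment to the per-gene metadata that shares it."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["comment"]].append(
            {
                "gene_id": row["gene_id"],
                "symbol": row["symbol"],
                "pubmed_ids": row["pubmed_ids"],
            }
        )
    return dict(grouped)
```

Each key would then be emitted once as the "rdfs:comment" of a blank node, and each item in its list as one "rdfs:seeAlso" blankNodePropertyList.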
diff --git a/topics/ADR/gn-transform-databases/002-remodel-ncbi-transform-to-be-more-compact.gmi b/topics/ADR/gn-transform-databases/002-remodel-ncbi-transform-to-be-more-compact.gmi
new file mode 100644
index 0000000..ac06fc1
--- /dev/null
+++ b/topics/ADR/gn-transform-databases/002-remodel-ncbi-transform-to-be-more-compact.gmi
@@ -0,0 +1,127 @@
+# [gn-transform-databases/ADR-002] Remodel GeneRIF_BASIC (NCBI RIFs) Metadata To Be More Compact
+
+* author: bonfacem
+* status: proposed
+* reviewed-by: pjotr, jnduli
+
+## Context
+
+Currently, we represent NCBI RIFs as blank nodes that form the object of a given symbol:
+
+```
+gn:symbolsspA rdfs:comment [
+ rdf:type gnc:NCBIWikiEntry ;
+ rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
+ gnt:belongsToSpecies gn:Mus_musculus ;
+ skos:notation taxon:511145 ;
+ gnt:hasGeneId generif:944744 ;
+ dct:hasVersion '1'^^xsd:int ;
+ dct:references pubmed:97295 ;
+ ...
+ dct:references pubmed:15361618 ;
+ dct:created "2007-11-06T00:38:00"^^xsd:datetime ;
+] .
+gn:symbolaraC rdfs:comment [
+ rdf:type gnc:NCBIWikiEntry ;
+ rdfs:comment "N-terminus verified by Edman degradation on mature peptide"^^xsd:string ;
+ gnt:belongsToSpecies gn:Mus_musculus ;
+ skos:notation taxon:511145 ;
+ gnt:hasGeneId generif:944780 ;
+ dct:hasVersion '1'^^xsd:int ;
+ dct:references pubmed:320034 ;
+ ...
+ dct:references pubmed:16369539 ;
+ dct:created "2007-11-06T00:39:00"^^xsd:datetime ;
+] .
+```
+
+Moreover, we also store all the different versions of a comment:
+
+```
+mysql> SELECT * FROM GeneRIF_BASIC WHERE SpeciesId=1 AND TaxID=7955 AND GeneId=323473 AND PubMed_ID = 15680355\G
+*************************** 1. row ***************************
+ SpeciesId: 1
+ TaxID: 7955
+ GeneId: 323473
+ symbol: prdm1a
+ PubMed_ID: 15680355
+createtime: 2010-01-21 00:00:00
+ comment: One of two mutations in which defects are observed in both cell populations: it leads to a complete absence of RB neurons and a reduction in neural crest cells
+ VersionId: 1
+*************************** 2. row ***************************
+ SpeciesId: 1
+ TaxID: 7955
+ GeneId: 323473
+ symbol: prdm1a
+ PubMed_ID: 15680355
+createtime: 2010-01-21 00:00:00
+ comment: prdm1 functions to promote the cell fate specification of both neural crest cells and sensory neurons
+ VersionId: 2
+```
+
+## Decision
+
+First, we should only store the latest version of a given RIF entry and ignore all other versions. RIF entries in the GeneRIF_BASIC table are uniquely identified by the columns: SpeciesId, GeneId, PubMed_ID, createtime, and VersionId. Since we are storing the latest version of a given RIF entry, we drop the version identifier during the RDF transform.
+
+We use a unique identifier for a given comment, and use that as the QName of the triple's subject:
+
+> gn:rif-<speciesId>-<GeneId>
+
+Finally instead of:
+
+```
+<symbol> predicate <comment metadata>
+```
+
+We use:
+
+```
+<comment-uid> predicate object ;
+ ... (more metadata) .
+```
+
+An example triple would take the form:
+
+```
+gn:rif-1-511145 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en .
+gn:rif-1-511145 rdf:type gnc:NCBIWikiEntry .
+gn:rif-1-511145 gnt:belongsToSpecies gn:Mus_musculus .
+gn:rif-1-511145 skos:notation taxon:511145 .
+gn:rif-1-511145 rdfs:seeAlso [
+ gnt:hasGeneId generif:944744 ;
+ gnt:symbol "sspA" ;
+ dct:references ( pubmed:97295 ... pubmed:15361618 ) ;
+] .
+gn:rif-1-511145 rdfs:seeAlso [
+ gnt:hasGeneId generif:944780 ;
+ gnt:symbol "araC" ;
+ dct:references ( pubmed:320034 ... pubmed:16369539 ) ;
+] .
+```
+
+To efficiently store GeneIds, symbols and references, we use blank nodes. This reduces redundancy and simplifies the triples compared to including these details within the subject:
+
+```
+gn:rif-1-511145-944744 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en .
+gn:rif-1-511145-944744 rdf:type gnc:NCBIWikiEntry .
+gn:rif-1-511145-944744 gnt:belongsToSpecies gn:Mus_musculus .
+gn:rif-1-511145-944744 skos:notation taxon:511145 .
+gn:rif-1-511145-944744 gnt:hasGeneId generif:944744 .
+gn:rif-1-511145-944744 gnt:symbol "sspA" .
+gn:rif-1-511145-944744 dct:references ( pubmed:97295 ... pubmed:15361618 ) .
+
+gn:rif-1-511145-944780 rdfs:label '''N-terminus verified by Edman degradation on mature peptide'''@en .
+gn:rif-1-511145-944780 rdf:type gnc:NCBIWikiEntry .
+gn:rif-1-511145-944780 gnt:belongsToSpecies gn:Mus_musculus .
+gn:rif-1-511145-944780 skos:notation taxon:511145 .
+gn:rif-1-511145-944780 gnt:hasGeneId generif:944780 .
+gn:rif-1-511145-944780 gnt:symbol "araC" .
+gn:rif-1-511145-944780 dct:references ( pubmed:320034 ... pubmed:16369539 ) .
+```
+
+## Consequences
+
+* More complex SQL query required for the transform.
+* De-duplication of RIF entries during the transform.
+* Because of the terseness, there is less work during the I/O-heavy operation.
+* Update SPARQL in tux02, tux01 in lockstep with updating GN3/GN2 and the XAPIAN index.
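The first step of the Decision, keeping only the latest version of each RIF entry, can be sketched as below. It assumes rows arrive as dicts keyed by the GeneRIF_BASIC column names shown in the Context; the function name is hypothetical, and the real transform may well do this in SQL instead, as the first consequence suggests.

```python
def latest_versions(rows):
    """Keep only the row with the highest VersionId for each RIF entry."""
    latest = {}
    for row in rows:
        # The identifying columns from the ADR, minus VersionId.
        key = (row["SpeciesId"], row["GeneId"], row["PubMed_ID"], row["createtime"])
        if key not in latest or row["VersionId"] > latest[key]["VersionId"]:
            latest[key] = row
    return list(latest.values())
```

Applied to the two prdm1a rows in the Context, only the row with VersionId 2 would be kept.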
diff --git a/topics/ADR/gn3/000-add-test-cases-for-rdf.gmi b/topics/ADR/gn3/000-add-test-cases-for-rdf.gmi
new file mode 100644
index 0000000..43ac2ba
--- /dev/null
+++ b/topics/ADR/gn3/000-add-test-cases-for-rdf.gmi
@@ -0,0 +1,21 @@
+# [gn3/ADR-000] Add RDF Test Cases
+
+* author: bonfacem
+* status: proposed
+* reviewed-by: jnduli
+
+## Context
+
+We have no way of ensuring the integrity of our SPARQL queries in GN3. As such, GN3 is fragile to breaking changes when the TTL files are updated.
+
+## Decision
+
+In Virtuoso, we load all our data to a default named graph: <http://genenetwork.org>. For SPARQL/RDF tests, we should upload test ttl files to a test named graph: <http://cd-test.genenetwork.org>, and run our RDF unit tests against that named graph.
+
+## Consequences
+
+* Extra bootstrapping to load ttl files when running the test.
+* Extra documentation for GN developers on how to run virtuoso locally to get the tests running.
+* Testing against gn-machines to make sure that all things run accordingly.
+* Extra maintenance costs to keep the TTL files in lockstep with the latest RDF changes during re-modeling.
+* Improvement in GN3 reliability.
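One way to point a test query at the test named graph is to add a "FROM" clause to it. The helper below is a hypothetical sketch, not GN3 code: only the graph URI comes from the Decision above, and the function name and example pattern are assumptions.

```python
TEST_GRAPH = "http://cd-test.genenetwork.org"


def select_from_graph(graph_uri, variables, pattern):
    """Build a SELECT query restricted to one named graph."""
    return f"SELECT {' '.join(variables)}\nFROM <{graph_uri}>\nWHERE {{ {pattern} }}"
```

A test could build its query with `select_from_graph(TEST_GRAPH, ["?s"], "?s rdf:type gnc:NCBIWikiEntry")`, so that it never touches the default <http://genenetwork.org> graph.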
diff --git a/topics/ADR/gn3/001-remove-stace-traces-in-gn3-error-response.gmi b/topics/ADR/gn3/001-remove-stace-traces-in-gn3-error-response.gmi
new file mode 100644
index 0000000..0910415
--- /dev/null
+++ b/topics/ADR/gn3/001-remove-stace-traces-in-gn3-error-response.gmi
@@ -0,0 +1,49 @@
+# [gn3/ADR-001] Remove Stack Traces in GN3
+
+* author: bonfacem
+* status: rejected
+* reviewed-by: jnduli, zach, pjotr, fredm
+
+## Context
+
+Currently, GN3 error responses include stack traces:
+
+```
+def add_trace(exc: Exception, jsonmsg: dict) -> dict:
+ """Add the traceback to the error handling object."""
+ return {
+ **jsonmsg,
+ "error-trace": "".join(traceback.format_exception(exc))
+ }
+
+
+def page_not_found(pnf):
+ """Generic 404 handler."""
+ current_app.logger.error("Handling 404 errors", exc_info=True)
+ return jsonify(add_trace(pnf, {
+ "error": pnf.name,
+ "error_description": pnf.description
+ })), 404
+
+
+def internal_server_error(pnf):
+ """Generic 500 handler."""
+ current_app.logger.error("Handling internal server errors", exc_info=True)
+ return jsonify(add_trace(pnf, {
+ "error": pnf.name,
+ "error_description": pnf.description
+ })), 500
+```
+
+
+## Decision
+
+Stack traces can help malicious actors compromise our system by giving them more context. As such, we should send a useful description of what went wrong together with an appropriate error status code, and record the stack traces in our logs. We can use the logs to troubleshoot our system.
+
+## Consequences
+
+* Lockstep update in GN2 UI on how we handle GN3 errors.
+
+## Rejection Rationale
+
+The proposal to remove stack traces from error responses was rejected because they are essential for troubleshooting, especially when issues are difficult to reproduce or production logs are inaccessible. Stack traces provide immediate error context, and removing them would complicate debugging by requiring additional effort to link logs with specific requests; a trade-off we are not willing to make at the moment.
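The difference between the current behaviour and the rejected proposal can be shown in a framework-free sketch. The function name and the "include_trace" flag are hypothetical and are not part of GN3; the payload keys mirror the handlers quoted in the Context.

```python
import logging
import traceback

logger = logging.getLogger(__name__)


def error_payload(exc, name, description, include_trace):
    """Build the JSON-ready error body, always logging the full traceback."""
    trace = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    logger.error("Handling %s: %s", name, trace)
    payload = {"error": name, "error_description": description}
    if include_trace:
        # Current behaviour: the trace is returned to the client.
        payload["error-trace"] = trace
    return payload
```

With "include_trace" set to False, the trace exists only in the logs. That is the trade-off the rejection rationale describes: the logs then have to be accessible and matched to the failing request.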
diff --git a/topics/ADR/gn3/002-run-rdf-tests-in-build-container.gmi b/topics/ADR/gn3/002-run-rdf-tests-in-build-container.gmi
new file mode 100644
index 0000000..a8026ce
--- /dev/null
+++ b/topics/ADR/gn3/002-run-rdf-tests-in-build-container.gmi
@@ -0,0 +1,32 @@
+# [gn3/ADR-002] Move RDF Test Cases to Build Container
+
+* author: bonfacem
+* status: accepted
+* reviewed-by: jnduli
+
+## Context
+
+GN3 RDF tests are run against the CD's virtuoso instance. As such, we need to set special parameters when running tests:
+
+```
+SPARQL_USER = "dba"
+SPARQL_PASSWORD = "dba"
+SPARQL_AUTH_URI="http://localhost:8890/sparql-auth/"
+SPARQL_CRUD_AUTH_URI="http://localhost:8890/sparql-graph-crud-auth"
+FAHAMU_AUTH_TOKEN="XXXXXX"
+```
+
+This extra bootstrapping when running tests needs care, and locks tests to CD or to a special configuration when running locally. This leads to fragile tests that cause CD to break. Moreover, to add tests to CD, we would have to add extra G-expressions to gn-machines.
+
+This ADR is related to:
+
+=> /topics/ADR/gn3/000-add-test-cases-for-rdf.gmi gn3/ADR-000.
+
+## Decision
+
+Move tests to the test build phase of building the genenetwork3 package. These tests are added in the ".guix/genenetwork3-all-tests.scm" file instead of the main "genenetwork3" package definition in guix-bioinformatics. This way, we have all our "light" tests, i.e. unit tests, running in guix-bioinformatics, while having all our heavier tests, in this case RDF tests, running in CD.
+
+## Consequences
+
+* Extra bootstrapping to gn3's .guix/genenetwork3-package.scm to get tests working.
+* GN3 RDF tests refactoring to use a virtuoso instance running in the background while tests are running.
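Tests that rely on a virtuoso instance running in the background need a way to tell whether that instance is up. A minimal sketch of such a check is below, assuming a plain TCP probe is sufficient; the function name is hypothetical and this is not taken from the GN3 test suite.

```python
import socket


def endpoint_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A test module could call `endpoint_reachable("localhost", 8890)`, using the port from the settings in the Context, and skip or delay the RDF tests while it returns False.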