[![dump-genenetwork-database-tests CI badge](https://ci.genenetwork.org/badge/dump-genenetwork-database-tests.svg)](https://ci.genenetwork.org/jobs/dump-genenetwork-database-tests) [![dump-genenetwork-database CI badge](https://ci.genenetwork.org/badge/dump-genenetwork-database.svg)](https://ci.genenetwork.org/jobs/dump-genenetwork-database) This repository contains code to dump the metadata in the GeneNetwork relational database to RDF. It requires a connection to a SQL server. # Using Drop into a development environment with ``` shell $ guix shell -m manifest.scm ``` If the path is not picked up add ``` export PATH=$GUIX_ENVIRONMENT/bin:$PATH ``` Build the sources. ``` shell $ make ``` or for a container ```shell mkdir ./tmp guix shell -C --network --share=/run/mysqld/ --manifest=manifest.scm export GUILE_LOAD_PATH=.:$GUILE_LOAD_PATH guile json-to-ttl.scm etc/sample.json tmp/ ``` That reads the `etc/sample.json` file included in this repository and converts it to an RDF representation that is stored in a file `./tmp/sampledata.ttl`. ## Set up connection parameters Describe the database connection parameters in a file *conn.scm* file as shown below. Take care to replace the placeholders within angle brackets with the appropriate values. ``` scheme ((sql-username . "") (sql-password . "") (sql-database . "") (sql-host . "") (sql-port . ) (virtuoso-port . ) (virtuoso-username . "") (virtuoso-password . "") (sparql-scheme . ) (sparql-host . "") (sparql-port . )) ``` Here's a sample *conn.scm*. ``` scheme ((sql-username . "webqtlout") (sql-password . "my-secret-password") (sql-database . "db_webqtl") (sql-host . "localhost") (sql-port . 3306) (virtuoso-port . 9081) (virtuoso-username . "dba") (virtuoso-password . "my-secret-virtuoso-password") (sparql-scheme . http) (sparql-host . "localhost") (sparql-port . 9082)) ``` ## Transform the database Example: Transform the phenotype from SQL to Terse RDF Triple Language (TTL) ```sh guile -s examples/phenotype.scm \ --settings=conn.scm \ --output=tmp/phenotype.ttl \ --documentation=tmp/phenotype.ttl.md ``` the `-s` option to *guile* runs the `examples/phenotype.scm` file as a script. Everything else on the command line is passed onto the script as command-line arguments. This should create the files: - `tmp/phenotype.ttl`: will contain the data in the database in TTL format - `tmp/phenotype.ttl.md`: will contain a short documentation on the data in the file above. **Note to Devs**: The current `pre-inst-env` script will not work within containers since it assumes the existence of `/usr/bin/env`. We need to fix that if we intend to keep using that. There is a shorter form of the command above: ```sh guile -s examples/phenotype.scm \ -s conn.scm \ -o tmp/phenotype.ttl \ -d tmp/phenotype.ttl.md ``` which does the same thing, but has the potential to be confusing due to the two `-s` options: the first `-s` option is to guile while the second is to the script itself. ## Validate and load dump Then, validate the dumped RDF using `rapper`: ``` shell $ guix shell -m manifest.scm -- rapper --input turtle --count ~/data/dump/dump.ttl ``` If there are no errors, load the relevant RDF files into the `http://genenetwork.org` graph using the `load-rdf.scm` script: ``` shell $ guix shell -m manifest.scm -- ./pre-inst-env ./load-rdf.scm conn.scm ~/data/dump/dump.ttl ``` This `load-rdf.scm` script replaces the existing graph with the ttl files from: "/var/lib/data", and indexes all the text data for quicker searches. ## Upload data to virtuoso See https://issues.genenetwork.org/topics/systems/virtuoso ## Visualize schema Now, you may query virtuoso to visualize the SQL and RDF schema. ``` shell $ guix shell -m manifest.scm -- ./pre-inst-env ./visualize-schema.scm conn.scm ``` This will output graphviz dot files `sql.dot` and `rdf.dot` describing the schema. Render them into SVG images like so. ``` shell $ dot -Tsvg -osql.svg sql.dot $ dot -Tsvg -ordf.svg rdf.dot ``` Or, peruse them interactively with `xdot`. ``` shell $ xdot sql.dot $ xdot rdf.dot ``` The [dump-genenetwork-database](https://ci.genenetwork.org/jobs/dump-genenetwork-database) continuous integration job runs these steps on every commit and publishes its version of [sql.svg](https://ci.genenetwork.org/archive/dump-genenetwork-database/latest/sql.svg) and [rdf.svg](https://ci.genenetwork.org/archive/dump-genenetwork-database/latest/rdf.svg). # LICENSE AGPLv3. See LICENSE.txt # Contributing The main code tree is hosted at https://git.genenetwork.org/. ```sh git remote add gn git.genenetwork.org:/home/git/public/gn-transform-databases ``` See bugs and tasks in BUGS.org.