aboutsummaryrefslogtreecommitdiff

dump-genenetwork-database-tests CI
badge dump-genenetwork-database CI
badge

This repository contains code to dump the metadata in the GeneNetwork relational database to RDF. It requires a connection to a SQL server.

Using

Drop into a development environment with

$ guix shell -m manifest.scm

If the path is not picked up add

export PATH=$GUIX_ENVIRONMENT/bin:$PATH

Build the sources.

$ make

or for a container

mkdir ./tmp
guix shell -C --network --share=/run/mysqld/ --manifest=manifest.scm
export GUILE_LOAD_PATH=.:$GUILE_LOAD_PATH
guile json-to-ttl.scm etc/sample.json tmp/

That reads the etc/sample.json file included in this repository and converts it to an RDF representation that is stored in a file ./tmp/sampledata.ttl.

Set up connection parameters

Describe the database connection parameters in a file conn.scm file as shown below. Take care to replace the placeholders within angle brackets with the appropriate values.

((sql-username . "<sql-username-here>")
 (sql-password . "<sql-password-here>")
 (sql-database . "<sql-database-name-here>")
 (sql-host . "<sql-hostname-here>")
 (sql-port . <sql-port-here>)
 (virtuoso-port . <virtuoso-port-here>)
 (virtuoso-username . "<virtuoso-username-here>")
 (virtuoso-password . "<virtuoso-password-here>")
 (sparql-scheme . <sparql-endpoint-scheme-here>)
 (sparql-host . "<sparql-endpoint-hostname-here>")
 (sparql-port . <sparql-endpoint-port-here>))

Here's a sample conn.scm.

((sql-username . "webqtlout")
 (sql-password . "my-secret-password")
 (sql-database . "db_webqtl")
 (sql-host . "localhost")
 (sql-port . 3306)
 (virtuoso-port . 9081)
 (virtuoso-username . "dba")
 (virtuoso-password . "my-secret-virtuoso-password")
 (sparql-scheme . http)
 (sparql-host . "localhost")
 (sparql-port . 9082))

Transform the database

Example: Transform the phenotype from SQL to Terse RDF Triple Language (TTL)

guile -s examples/phenotype.scm \
      --settings=conn.scm \
      --output=tmp/phenotype.ttl \
      --documentation=tmp/phenotype.ttl.md

the -s option to guile runs the examples/phenotype.scm file as a script. Everything else on the command line is passed onto the script as command-line arguments.

This should create the files: - tmp/phenotype.ttl: will contain the data in the database in TTL format - tmp/phenotype.ttl.md: will contain a short documentation on the data in the file above.

Note to Devs: The current pre-inst-env script will not work within containers since it assumes the existence of /usr/bin/env. We need to fix that if we intend to keep using that.

There is a shorter form of the command above:

guile -s examples/phenotype.scm \
      -s conn.scm \
      -o tmp/phenotype.ttl \
      -d tmp/phenotype.ttl.md

which does the same thing, but has the potential to be confusing due to the two -s options: the first -s option is to guile while the second is to the script itself.

Validate and load dump

Then, validate the dumped RDF using rapper:

$ guix shell -m manifest.scm -- rapper --input turtle --count ~/data/dump/dump.ttl

If there are no errors, load the relevant RDF files into the http://genenetwork.org graph using the load-rdf.scm script:

$ guix shell -m manifest.scm -- ./pre-inst-env ./load-rdf.scm conn.scm ~/data/dump/dump.ttl

This load-rdf.scm script replaces the existing graph with the ttl files from: "/var/lib/data", and indexes all the text data for quicker searches.

Upload data to virtuoso

See https://issues.genenetwork.org/topics/systems/virtuoso

Visualize schema

Now, you may query virtuoso to visualize the SQL and RDF schema.

$ guix shell -m manifest.scm -- ./pre-inst-env ./visualize-schema.scm conn.scm

This will output graphviz dot files sql.dot and rdf.dot describing the schema. Render them into SVG images like so.

$ dot -Tsvg -osql.svg sql.dot
$ dot -Tsvg -ordf.svg rdf.dot

Or, peruse them interactively with xdot.

$ xdot sql.dot
$ xdot rdf.dot

The dump-genenetwork-database continuous integration job runs these steps on every commit and publishes its version of sql.svg and rdf.svg.

LICENSE

AGPLv3. See LICENSE.txt

Contributing

The main code tree is hosted at https://git.genenetwork.org/.

git remote add gn git.genenetwork.org:/home/git/public/gn-transform-databases

See bugs and tasks in BUGS.org.