README.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166

[![dump-genenetwork-database-tests CI
badge](https://ci.genenetwork.org/badge/dump-genenetwork-database-tests.svg)](https://ci.genenetwork.org/jobs/dump-genenetwork-database-tests)
[![dump-genenetwork-database CI
badge](https://ci.genenetwork.org/badge/dump-genenetwork-database.svg)](https://ci.genenetwork.org/jobs/dump-genenetwork-database)

This repository contains code to dump the metadata in the GeneNetwork relational database to RDF. It requires a connection to a SQL server.

# Using

Drop into a development environment with

``` shell
$ guix shell -m manifest.scm
```

If the path is not picked up add

```
export PATH=$GUIX_ENVIRONMENT/bin:$PATH
```

Build the sources.

``` shell
$ make
```

or for a container

```shell
mkdir ./tmp
guix shell -C --network --share=/run/mysqld/ --manifest=manifest.scm
export GUILE_LOAD_PATH=.:$GUILE_LOAD_PATH
guile json-to-ttl.scm etc/sample.json tmp/
```

That reads the `etc/sample.json` file included in this repository and converts it to an RDF representation that is stored in a file `./tmp/sampledata.ttl`.


## Set up connection parameters

Describe the database connection parameters in a file *conn.scm* file as
shown below. Take care to replace the placeholders within angle brackets
with the appropriate values.

``` scheme
((sql-username . "<sql-username-here>")
 (sql-password . "<sql-password-here>")
 (sql-database . "<sql-database-name-here>")
 (sql-host . "<sql-hostname-here>")
 (sql-port . <sql-port-here>)
 (virtuoso-port . <virtuoso-port-here>)
 (virtuoso-username . "<virtuoso-username-here>")
 (virtuoso-password . "<virtuoso-password-here>")
 (sparql-scheme . <sparql-endpoint-scheme-here>)
 (sparql-host . "<sparql-endpoint-hostname-here>")
 (sparql-port . <sparql-endpoint-port-here>))
```

Here's a sample *conn.scm*.
``` scheme
((sql-username . "webqtlout")
 (sql-password . "my-secret-password")
 (sql-database . "db_webqtl")
 (sql-host . "localhost")
 (sql-port . 3306)
 (virtuoso-port . 9081)
 (virtuoso-username . "dba")
 (virtuoso-password . "my-secret-virtuoso-password")
 (sparql-scheme . http)
 (sparql-host . "localhost")
 (sparql-port . 9082))
```

## Transform the database

Example: Transform the phenotype from SQL to Terse RDF Triple Language (TTL)

```sh
guile -s examples/phenotype.scm \
      --settings=conn.scm \
      --output=tmp/phenotype.ttl \
      --documentation=tmp/phenotype.ttl.md
```

the `-s` option to *guile* runs the `examples/phenotype.scm` file as a script. Everything else on the command line is passed onto the script as command-line arguments.

This should create the files:
- `tmp/phenotype.ttl`: will contain the data in the database in TTL format
- `tmp/phenotype.ttl.md`: will contain a short documentation on the data in the file above.

**Note to Devs**: The current `pre-inst-env` script will not work within containers since it assumes the existence of `/usr/bin/env`. We need to fix that if we intend to keep using that.


There is a shorter form of the command above:

```sh
guile -s examples/phenotype.scm \
      -s conn.scm \
      -o tmp/phenotype.ttl \
      -d tmp/phenotype.ttl.md
```

which does the same thing, but has the potential to be confusing due to the two `-s` options: the first `-s` option is to guile while the second is to the script itself.

## Validate and load dump

Then, validate the dumped RDF using `rapper` and load it into
virtuoso. This will load the dumped RDF into the
`http://genenetwork.org` graph, and will delete all pre-existing data
in that graph (FIXME)

``` shell
$ guix shell -m manifest.scm -- rapper --input turtle --count ~/data/dump/dump.ttl
$ guix shell -m manifest.scm -- ./pre-inst-env ./load-rdf.scm conn.scm ~/data/dump/dump.ttl
```


## Upload data to virtuoso

See https://issues.genenetwork.org/topics/systems/virtuoso

## Visualize schema

Now, you may query virtuoso to visualize the SQL and RDF schema.

``` shell
$ guix shell -m manifest.scm -- ./pre-inst-env ./visualize-schema.scm conn.scm
```

This will output graphviz dot files `sql.dot` and `rdf.dot` describing
the schema. Render them into SVG images like so.

``` shell
$ dot -Tsvg -osql.svg sql.dot
$ dot -Tsvg -ordf.svg rdf.dot
```

Or, peruse them interactively with `xdot`.

``` shell
$ xdot sql.dot
$ xdot rdf.dot
```

The
[dump-genenetwork-database](https://ci.genenetwork.org/jobs/dump-genenetwork-database)
continuous integration job runs these steps on every commit and
publishes its version of
[sql.svg](https://ci.genenetwork.org/archive/dump-genenetwork-database/latest/sql.svg)
and
[rdf.svg](https://ci.genenetwork.org/archive/dump-genenetwork-database/latest/rdf.svg).

# LICENSE

AGPLv3. See LICENSE.txt

# Contributing

The main code tree is hosted at https://git.genenetwork.org/.

```sh
git remote add gn git.genenetwork.org:/home/git/public/gn-transform-databases
```

See bugs and tasks in BUGS.org.