README.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93

[![dump-genenetwork-database-tests CI
badge](https://ci.genenetwork.org/badge/dump-genenetwork-database-tests.svg)](https://ci.genenetwork.org/jobs/dump-genenetwork-database-tests)
[![dump-genenetwork-database CI
badge](https://ci.genenetwork.org/badge/dump-genenetwork-database.svg)](https://ci.genenetwork.org/jobs/dump-genenetwork-database)

The GeneNetwork database is being migrated from a relational database to
a plain text and RDF database. This repository contains code to dump the
relational database to plain text.

# Using

Drop into a development environment with

``` shell
$ guix shell
```

Build the sources.

``` shell
$ make
```

Describe the database connection parameters in a file *conn.scm* file as
shown below. Take care to replace the placeholders within angle brackets
with the appropriate values.

``` scheme
((sql-username . "<sql-username-here>")
 (sql-password . "<sql-password-here>")
 (sql-database . "<sql-database-name-here>")
 (sql-host . "<sql-hostname-here>")
 (sql-port . <sql-port-here>)
 (virtuoso-port . <virtuoso-port-here>)
 (virtuoso-username . "<virtuoso-username-here>")
 (virtuoso-password . "<virtuoso-password-here>")
 (sparql-scheme . <sparql-endpoint-scheme-here>)
 (sparql-host . "<sparql-endpoint-hostname-here>")
 (sparql-port . <sparql-endpoint-port-here>))
```

Then, to dump the database to \~/data/dump, run

``` shell
$ ./pre-inst-env ./dump.scm conn.scm ~/data/dump
```

Make sure there is enough free space! It\'s best to dump the database on
penguin2 where disk space and bandwidth are not significant
constraints.

Then, validate the dumped RDF using `rapper` and load it into
virtuoso. This will load the dumped RDF into the
`http://genenetwork.org` graph, and will delete all pre-existing data
in that graph.

``` shell
$ rapper --input turtle --count ~/data/dump/dump.ttl
$ ./pre-inst-env ./load-rdf.scm conn.scm ~/data/dump/dump.ttl
```

Now, you may query virtuoso to visualize the SQL and RDF schema.

``` shell
$ ./pre-inst-env ./visualize-schema.scm conn.scm
```

This will output graphviz dot files `sql.dot` and `rdf.dot` describing
the schema. Render them into SVG images like so.

``` shell
$ dot -Tsvg -osql.svg sql.dot
$ dot -Tsvg -ordf.svg rdf.dot
```

Or, peruse them interactively with `xdot`.

``` shell
$ xdot sql.dot
$ xdot rdf.dot
```

The
[dump-genenetwork-database](https://ci.genenetwork.org/jobs/dump-genenetwork-database)
continuous integration job runs these steps on every commit and
publishes its version of
[sql.svg](https://ci.genenetwork.org/archive/dump-genenetwork-database/latest/sql.svg)
and
[rdf.svg](https://ci.genenetwork.org/archive/dump-genenetwork-database/latest/rdf.svg).

# Contributing

See bugs and tasks in BUGS.org.