# Virtuoso

We run instances of virtuoso for our graph databases. Virtuoso is remarkable software and runs some really large databases, including UniProt. It can sometimes feel old and clunky, but we still prefer it to newer alternatives because it is the only large graph database not written in Java, and Java packages are almost impossible to package in Guix.

=> https://github.com/openlink/virtuoso-opensource
=> https://www.uniprot.org/sparql/

## Running virtuoso
### Running virtuoso in a guix system container

We have a Guix virtuoso service in the guix-bioinformatics channel. The easiest way to run virtuoso is to use this service to run it in a guix system container. The only downside of this method is that guix system containers require root privileges to start up, so you will need root privileges on the machine you are running this on.

Here is a basic guix system configuration that runs virtuoso listening on port 8891, and with its HTTP server listening on port 8892. Among other things, the HTTP server provides a SPARQL endpoint to interact with.
```
(use-modules (gnu)
             (gn services databases))

(operating-system
  (host-name "virtuoso")
  (timezone "UTC")
  (locale "en_US.utf8")
  (bootloader (bootloader-configuration
               (bootloader grub-bootloader)
               ;; It doesn't matter what "/dev/sdX" is
               (targets (list "/dev/sdX"))))
  (file-systems %base-file-systems)
  (users %base-user-accounts)
  (packages %base-packages)
  (services (cons (service virtuoso-service-type
                           (virtuoso-configuration
                            (server-port 8891)
                            (http-server-port 8892)))
                  %base-services)))
```

You can write the above configuration to a file, say virtuoso-os.scm, build a container with it, and run it with the command below. Everything inside the container is ephemeral and vanishes when the container is stopped. In order to persist the database, we mount a host directory /tmp/virtuoso-state at /var/lib/virtuoso in the container. /var/lib/virtuoso is the default state directory used by the Guix virtuoso service.
```
sudo $(guix system container --network --share=/tmp/virtuoso-state=/var/lib/virtuoso virtuoso-os.scm)
```

When running the above command, you will be given the container's PID. Should you want to inspect the container, you can run:
```
sudo nsenter -at PID /run/current-system/profile/bin/bash
```
If you have only one shepherd process running on your system, you may use the following quick hack to get the PID.
```
sudo nsenter -at $(pgrep shepherd) /run/current-system/profile/bin/bash
```

Also note that, in this set-up, the Conductor web interface is not supported by the Guix service in guix-bioinformatics. It isn't required for using virtuoso as a SPARQL server and only adds to the confusion.

### Running virtuoso by invoking it on the command line

You may also choose to run virtuoso the traditional way by invoking it on the command line. Managing long-running instances started from the command line is messy. So, this method works best for temporary instances.

First, we create a new directory for virtuoso and change into it. We will run virtuoso from this directory, and virtuoso will store all its state in this directory.
```
mkdir virtuoso
cd virtuoso
```
Then, we create a configuration file---virtuoso.ini. A basic configuration need only specify the ports to listen on. Here we specify port 8891 for the virtuoso server and port 8892 for the HTTP server that includes the SPARQL endpoint.
```
[Parameters]
ServerPort = localhost:8891

[HTTPServer]
ServerPort = localhost:8892
```
Finally, we start virtuoso.
```
virtuoso-t +foreground +configfile virtuoso.ini
```

Detailed documentation of the virtuoso configuration file format is at
=> http://docs.openlinksw.com/virtuoso/dbadm/#configsrvstupfiles Virtuoso configuration file
In particular, consider setting NumberOfBuffers and MaxDirtyBuffers as described at
=> http://vos.openlinksw.com/owiki/wiki/VOS/VirtRDFPerformanceTuning Performance tuning virtuoso

For a working configuration file, you can also look at /export/virtuoso/var/lib/virtuoso/db/virtuoso.ini on penguin2.

### Running SPARQL Queries using isql

The straightforward way of running SPARQL queries is to use the web interface served by virtuoso's HTTP server:

=> http://localhost:<http-server-port>/sparql/

To use a CLI tool, you can utilise isql by running:

```
guix shell virtuoso-ose -- isql -U dba -P password <server-port>
```

Queries within isql look like:

```
SQL> SPARQL SELECT * WHERE {?s ?p ?o};
```
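Queries can also be sent non-interactively over HTTP, which is convenient in scripts. Here is a minimal sketch using curl. The function name sparql_query is our own invention, and the URL in the usage comment assumes the HTTP server listens on port 8892 as in the configurations above.

```shell
# sparql_query ENDPOINT QUERY
# Send QUERY to a SPARQL endpoint with an HTTP GET and print the results
# as JSON. sparql_query is a hypothetical helper, not part of virtuoso.
sparql_query() {
    curl --silent --fail --get \
         --header 'Accept: application/sparql-results+json' \
         --data-urlencode "query=$2" \
         "$1"
}

# Usage (assumes the HTTP server is on port 8892):
#   sparql_query http://localhost:8892/sparql 'SELECT * WHERE {?s ?p ?o} LIMIT 10'
```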

## Set passwords for virtuoso users

After running virtuoso, you will want to change the default password of the `dba` user. The default password of the `dba` user is `dba`. You can change passwords using the isql command-line client. See
=> http://docs.openlinksw.com/virtuoso/defpasschange/ Virtuoso users and how to set their passwords

In a typical production virtuoso installation, you will want to change the password of the dba user and disable the dav user. Here are the commands to do so. Pay attention to the single versus double quoting.
```
SQL> set password "dba" "rFw,OntlJ@Sz";
SQL> UPDATE ws.ws.sys_dav_user SET u_account_disabled=1 WHERE u_name='dav';
SQL> CHECKPOINT;
```

## Loading data into virtuoso

Virtuoso supports at least three different ways to load RDF.

### Bulk loading using the isql command-line client

=> http://vos.openlinksw.com/owiki/wiki/VOS/VirtBulkRDFLoader Bulk loading using the isql command-line client
Bulk loading using the isql command-line client is usually the fastest. But, it requires correct handling of file system permissions, and cannot work on remote servers.

### SPARQL 1.1 Update

The standard SPARQL protocol allows update of RDF too.
=> https://www.w3.org/TR/sparql11-update/ SPARQL 1.1 Update
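As a rough sketch, an update can be posted with curl as a URL-encoded form, using the "update" parameter defined by the SPARQL 1.1 Protocol. The helper name sparql_update is hypothetical. We assume here that updates go to virtuoso's digest-authenticated /sparql-auth endpoint and that your virtuoso version accepts the "update" parameter there. Please verify both against your installation.

```shell
# sparql_update ENDPOINT USER:PASSWORD UPDATE
# POST a SPARQL 1.1 Update request as a URL-encoded form.
# sparql_update is a hypothetical helper. The "update" parameter comes from
# the SPARQL 1.1 Protocol; that virtuoso accepts it is an assumption.
sparql_update() {
    curl --silent --fail --digest --user "$2" \
         --data-urlencode "update=$3" \
         "$1"
}

# Usage (endpoint path and port are assumptions):
#   sparql_update http://localhost:8892/sparql-auth 'dba:password' \
#     'INSERT DATA { GRAPH <http://genenetwork.org> { <http://example.org/s> <http://example.org/p> "o" } }'
```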

### SPARQL 1.1 Graph Store HTTP Protocol

For ease of implementation, SPARQL 1.1 also specifies an additional REST-like API to update data.
=> https://www.w3.org/TR/sparql11-http-rdf-update/ SPARQL 1.1 Graph Store HTTP Protocol
The virtuoso documentation shows examples of using this protocol with cURL.
=> http://vos.openlinksw.com/owiki/wiki/VOS/VirtGraphProtocolCURLExamples Virtuoso SPARQL 1.1 Graph Store HTTP Protocol examples using cURL
We recap the same here.

When uploading data, the virtuoso server often does not report errors properly. It simply freezes up. So, it is very helpful to validate your RDF before uploading. For this, use rapper from the raptor2 package. To validate data.ttl, a turtle file, run
```
rapper --input turtle --count data.ttl
rapper: Parsing URI file: data.ttl with parser turtle
rapper: Parsing returned 652395 triples
```
Then, upload it to a virtuoso SPARQL endpoint running at port 8892
```
curl -v -X PUT --digest -u 'dba:password' -T data.ttl -G http://localhost:8892/sparql-graph-crud-auth --data-urlencode graph=http://genenetwork.org
```
where http://genenetwork.org is the name of the graph. Single-quote the username and password argument, especially if the password contains special characters, so that the shell does not interpret them.
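Validation and upload can be combined so that a file is never sent to virtuoso unless it parses. This is only a sketch: upload_ttl is a hypothetical helper name, and it relies on rapper exiting with a non-zero status when parsing reports problems.

```shell
# upload_ttl FILE GRAPH ENDPOINT USER:PASSWORD
# Validate the turtle FILE with rapper, then PUT it into GRAPH.
# upload_ttl is a hypothetical helper; the curl invocation mirrors the
# one shown above.
upload_ttl() {
    # Assumption: rapper exits non-zero when parsing reports problems.
    if ! rapper --quiet --input turtle --count "$1" >/dev/null; then
        echo "upload_ttl: $1 failed validation, not uploading" >&2
        return 1
    fi
    curl --fail -X PUT --digest --user "$4" -T "$1" \
         -G "$3" --data-urlencode "graph=$2"
}

# Usage:
#   upload_ttl data.ttl http://genenetwork.org \
#     http://localhost:8892/sparql-graph-crud-auth 'dba:password'
```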

The PUT method replaces the existing data in the graph with the new data, so there is usually no need to manually delete old data before loading new data. To add to the existing data instead, use the POST method. However, virtuoso is slow at deleting millions of triples, and a PUT to a large graph can result in an apparent freeze-up. For large graphs, it is preferable to delete the old data explicitly using a lower-level SQL statement issued via the isql client.

Start isql with something like

```
guix shell --expose=verified-data=/var/lib/data virtuoso-ose -- isql -U dba -P password 8981
```

To delete a graph:

```
$ isql
SQL> DELETE FROM rdf_quad WHERE g = iri_to_id('http://genenetwork.org');
```

To add ttl files through isql:

```
ld_dir('/dir', '*.ttl', 'http://genenetwork.org');
rdf_loader_run();
checkpoint;
```

=> http://vos.openlinksw.com/owiki/wiki/VOS/VirtTipsAndTricksGuideDeleteLargeGraphs How can I delete graphs containing large numbers of triples from the Virtuoso Quad Store?

When virtuoso has just been started up with a clean state (that is, the virtuoso state directory was empty before virtuoso started), uploading large amounts of data using the SPARQL 1.1 Graph Store HTTP Protocol fails the first time. It succeeds only the second time. It is not clear why. I can only recommend retrying as in this commit:

=> https://github.com/genenetwork/dump-genenetwork-database/commit/8f60fde7f5499e5ffe352d7ae98a2de34a91b89f Retry uploading to virtuoso (commit from dump-genenetwork-database repo)
=> https://git.genenetwork.org/arunisaac/dump-genenetwork-database/commit/8f60fde7f5499e5ffe352d7ae98a2de34a91b89f Former location of the same commit
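The linked commit implements the retry in the dump scripts themselves. For uploads driven from the shell, the same idea can be sketched as a small generic wrapper. The retry function below is our own illustration and is not taken from that commit.

```shell
# retry MAX DELAY COMMAND...
# Run COMMAND until it succeeds, trying at most MAX times and sleeping
# DELAY seconds between attempts. retry is a hypothetical helper.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "retry: giving up after $attempt attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Usage: try the upload up to twice, waiting 5 seconds in between.
# --fail makes curl exit non-zero on HTTP errors so that retry notices.
#   retry 2 5 curl --fail -X PUT --digest -u 'dba:password' -T data.ttl \
#     -G http://localhost:8892/sparql-graph-crud-auth \
#     --data-urlencode graph=http://genenetwork.org
```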

### Using load-rdf.scm script

You can use the following script to upload RDF data.

=> https://github.com/genenetwork/dump-genenetwork-database/blob/master/load-rdf.scm load-rdf.scm

This script first clears the database before uploading data. To run it:

```
guix shell -N virtuoso-ose -m manifest.scm -- ./pre-inst-env ./load-rdf.scm conn.scm dump.ttl
```

=> https://github.com/genenetwork/dump-genenetwork-database/blob/master/conn.scm Example conn.scm

### Bulk Loading Data

Virtuoso has access to the directory /export/data/genenetwork-virtuoso/. Place all the turtle files for bulk uploads there. To bulk load data:

First make sure that all the data is deleted:

```
$ isql
SQL> DELETE FROM rdf_quad WHERE g = iri_to_id('http://genenetwork.org');
```

Also, make sure that the load list is empty before registering your turtle files.

```
DELETE FROM DB.DBA.load_list;
```

Note that the directory may be mapped to a different location by the service. On tux02 it is `/export/data/genenetwork-virtuoso/`.

Use isql to register all the turtle files:

```
SQL> ld_dir('/var/lib/data', '*.ttl', 'http://genenetwork.org');
```

In the prior step, you can register a specific file instead of all the files matching the wildcard "*". For example:

```
SQL> ld_dir('/var/lib/data', 'species.ttl', 'http://genenetwork.org');
```

Check the table DB.DBA.load_list to see the list of registered files that will be loaded:

```
SQL> SELECT * FROM DB.DBA.load_list;
```

Complete the actual bulk load of all data by running:

```
SQL> rdf_loader_run();
```

Commit the bulk loaded data to the Virtuoso database file by running:

```
checkpoint;
```
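The steps above can be scripted by generating the SQL and piping it to isql, which reads commands from standard input. This is a sketch under assumptions: bulk_load_sql is a hypothetical helper, the directory must be given as the virtuoso server sees it, and the first statement deletes the whole graph, so check the graph name before running it.

```shell
# bulk_load_sql DIR PATTERN GRAPH
# Print the isql commands for one bulk load, in the order described above.
# bulk_load_sql is a hypothetical helper. DIR is the directory as seen by
# the virtuoso server. Arguments must not contain single quotes.
bulk_load_sql() {
    cat <<EOF
DELETE FROM rdf_quad WHERE g = iri_to_id('$3');
DELETE FROM DB.DBA.load_list;
ld_dir('$1', '$2', '$3');
rdf_loader_run();
checkpoint;
EOF
}

# Usage (port and credentials are placeholders):
#   bulk_load_sql /var/lib/data '*.ttl' http://genenetwork.org \
#     | isql -U dba -P password 8981
```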

Run a query to make sure that the data has indeed been loaded. For example:

```
SPARQL
PREFIX gn: <http://genenetwork.org/id/>

SELECT * FROM <http://genenetwork.org> WHERE {
gn:Mus_musculus ?p ?o.
};
```

In case you want to get a list of all graphs:

```
SPARQL
SELECT  DISTINCT ?g
   WHERE  { GRAPH ?g {?s ?p ?o} }
ORDER BY ?g;
```

Other resources:

=> https://vos.openlinksw.com/owiki/wiki/VOS/VirtBulkRDFLoader Bulk Loading RDF Source Files into one or more Graph IRIs

=> https://vos.openlinksw.com/owiki/wiki/VOS/VirtBulkRDFLoaderExampleSingle VOS.VirtBulkRDFLoaderExampleSingle

## Dumping to RDF from the GeneNetwork MySQL database

See also

=> ../RDF/genenetwork-sql-database-to-rdf

To dump data into a ttl file, first make sure that you are in the guix environment of the "dump-genenetwork-database" repository.

=> https://github.com/genenetwork/dump-genenetwork-database/ Dump Genenetwork Database

See the README for instructions.