summaryrefslogtreecommitdiff
path: root/topics/systems/virtuoso.gmi
blob: 1665f4342ea3b438e8d0a911aca5aa7984b849a5 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
# Virtuoso

We run instances of virtuoso for our graph databases. Virtuoso is remarkable software and runs some really large databases, including Uniprot. Virtuoso can sometimes feel old and clunky. But, we still prefer it to other shiny new ones because it is the only large one not written in Java. Java packages are almost impossible to package in Guix.

=> https://github.com/openlink/virtuoso-opensource
=> https://www.uniprot.org/sparql/

## Running virtuoso
### Running virtuoso in a guix system container

We have a Guix virtuoso service in the guix-bioinformatics channel. The easiest way to run virtuoso is to use the virtuoso service to run it in a guix system container. The only downside of this method is that, since guix system containers require root privileges to start up, you will need root priviliges on the machine you are running this on.

Here is a basic guix system configuration that runs virtuoso listening on port 8891, and with its HTTP server listening on port 8892. Among other things, the HTTP server provides a SPARQL endpoint to interact with.
```
(use-modules (gnu)
             (gn services databases))

(operating-system
  (host-name "virtuoso")
  (timezone "UTC")
  (locale "en_US.utf8")
  (bootloader (bootloader-configuration
               (bootloader grub-bootloader)
	       ;; It doesn't matter what "/dev/sdX" is
               (targets (list "/dev/sdX"))))
  (file-systems %base-file-systems)
  (users %base-user-accounts)
  (packages %base-packages)
  (services (cons (service virtuoso-service-type
                           (virtuoso-configuration
                            (server-port 8891)
                            (http-server-port 8892)))
                  %base-services)))
```

You can write the above configuration to a file, say virtuoso-os.scm, build a container with it, and run it with the command below. Everything inside the container is ephemeral and vanishes when the container is stopped. In order to persist the database, we mount a host directory /tmp/virtuoso-state at /var/lib/virtuoso in the container. /var/lib/virtuoso is the default state directory used by the Guix virtuoso service.
```
sudo $(guix system container --network --share=/tmp/virtuoso-state=/var/lib/virtuoso virtuoso-os.scm)
```

When running the above command, you will be given the container's PID. Should you want to inspect the container, you can run:
```
sudo nsenter -at PID /run/current-system/profile/bin/bash
```
If you have only one shepherd process running on your system, you may use the following quick hack to get the PID.
```
sudo nsenter -at $(pgrep shepherd) /run/current-system/profile/bin/bash
```

Also, in this set-up, note that the conductor web interface is not supported in the GUIX Service that's part of guix-bioinformatics. It isn't required for using virtuoso as a SPARQL server and only adds to the confusion.

### Running virtuoso by invoking it on the command line

You may also choose to run virtuoso the traditional way by invoking it on the command line. Managing long-running instances started from the command line is messy. So, this method works best for temporary instances.

First, we create a new directory for virtuoso and change into it. We will run virtuoso from this directory, and virtuoso will store all its state in this directory.
```
mkdir virtuoso
cd virtuoso
```
Then, we create a configuration file---virtuoso.ini. A basic configuration need only specify the ports to listen on. Here we specify port 8891 for the virtuoso server and port 8892 for the HTTP server that includes the SPARQL endpoint.
```
[Parameters]
ServerPort = localhost:8891

[HTTPServer]
ServerPort = localhost:8892
```
Finally, we start virtuoso.
```
virtuoso-t +foreground +configfile virtuoso.ini
```

Detailed documentation of the virtuoso configuration file format is at
=> http://docs.openlinksw.com/virtuoso/dbadm/#configsrvstupfiles Virtuoso configuration file
In particular, consider setting NumberOfBuffers and MaxDirtyBuffers as described at
=> http://vos.openlinksw.com/owiki/wiki/VOS/VirtRDFPerformanceTuning Performance tuning virtuoso

For a working configuration file, you can also look at /export/virtuoso/var/lib/virtuoso/db/virtuoso.ini in penguin2.

### Running SPARQL Queries using isql

The straight-forward way of running SPARQL queries
is using the web-interface:

=> http://localhost:<server-port>/sparql/

To use a CLI tool, you can utilise isql by running:

```
guix shell virtuoso-ose -- isql <server-port>
```

Queries within isql look like:

```
SQL> SPARQL SELECT * WHERE {?s ?p ?o};
```

## Set passwords for virtuoso users

After running virtuoso, you will want to change the default password of the `dba` user. The default password of the `dba` user is `dba`. You can change passwords using the isql command-line client. See
=> http://docs.openlinksw.com/virtuoso/defpasschange/ Virtuoso users and how to set their passwords

In a typical production virtuoso installation, you will want to change the password of the dba user and disable the dav user. Here are the commands to do so. Pay attention to the single versus double quoting.
```
SQL> set password "dba" "new-password";
SQL> UPDATE ws.ws.sys_dav_user SET u_account_disabled=1 WHERE u_name='dav';
SQL> CHECKPOINT;
```

## Loading data into virtuoso

Virtuoso supports at least three different ways to load RDF.

### Bulk loading using the isql command-line client

=> http://vos.openlinksw.com/owiki/wiki/VOS/VirtBulkRDFLoader Bulk loading using the isql command-line client
Bulk loading using the isql command-line client is usually the fastest. But, it requires correct handling of file system permissions, and cannot work on remote servers.

### SPARQL 1.1 Update

The standard SPARQL protocol allows update of RDF too.
=> https://www.w3.org/TR/sparql11-update/ SPARQL 1.1 Update

### SPARQL 1.1 Graph Store HTTP Protocol

For ease of implementation, SPARQL 1.1 also specifies an additional REST-like API to update data.
=> https://www.w3.org/TR/sparql11-http-rdf-update/ SPARQL 1.1 Graph Store HTTP Protocol
The virtuoso documentation shows examples of using this protocol with cURL.
=> http://vos.openlinksw.com/owiki/wiki/VOS/VirtGraphProtocolCURLExamples Virtuoso SPARQL 1.1 Graph Store HTTP Protocol examples using cURL
We recap the same here.

When uploading data, the virtuoso server often does not report errors properly. It simply freezes up. So, it is very helpful to validate your RDF before uploading. For this, use rapper from the raptor2 package. To validate data.ttl, a turtle file, run
```
rapper --input turtle --count data.ttl
rapper: Parsing URI file: data.ttl with parser turtle
rapper: Parsing returned 652395 triples
```
Then, upload it to a virtuoso SPARQL endpoint running at port 8892
```
curl -v -X PUT --digest -u dba:password -T data.ttl -G http://localhost:8892/sparql-graph-crud-auth --data-urlencode graph=http://example.org
```
where http://example.org is the name of the graph.

The PUT method deletes the existing data in the graph before loading the new one. So, there is no need to manually delete old data before loading new data. However, virtuoso is slow at deleting millions of triples, resulting in an apparent freeze-up. So, it is preferable to handle such deletes manually using a lower-level SQL statement issued via the isql client.
```
$ isql
SQL> DELETE FROM rdf_quad WHERE g = iri_to_id('http://example.org');
```
=> http://vos.openlinksw.com/owiki/wiki/VOS/VirtTipsAndTricksGuideDeleteLargeGraphs How can I delete graphs containing large numbers of triples from the Virtuoso Quad Store?

When virtuoso has just been started up with a clean state (that is, the virtuoso state directory was empty before virtuoso started), uploading large amounts of data using the SPARQL 1.1 Graph Store HTTP Protocol fails the first time. It succeeds only the second time. It is not clear why. I can only recommend retrying as in this commit:

=>https://github.com/genenetwork/dump-genenetwork-database/commit/8f60fde7f5499e5ffe352d7ae98a2de34a91b89f
 Retry uploading to virtuoso (commit from dump-genenetwork-database repo)
 formerly (https://git.genenetwork.org/arunisaac/dump-genenetwork-database/commit/8f60fde7f5499e5ffe352d7ae98a2de34a91b89f)

## Dumping data from a MySQL database

See also

=> ../RDF/genenetwork-sql-database-to-rdf.gmi

To dump data into a ttl file, first make sure that you are in the guix environment in the "dump-genenetwork-database" repository

=> https://github.com/genenetwork/dump-genenetwork-database/ Dump Genenetwork Database

See the README for instructions.