summaryrefslogtreecommitdiff
path: root/issues/genenetwork/virtuoso-shutdown-clears-data.gmi
blob: e727ca4d7341f974568fb25eaf4a42e88cbda2c4 (about) (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
# Virtuoso: Shutdown Clears Data

## Tags

* type: bug
* status: open
* assigned: fredm
* priority: critical
* interested: bonfacem, pjotrp, zsloan
* keywords: production, container, tux04, virtuoso

## Description

It seems that virtuoso has the bad habit of clearing data whenever it is stopped/restarted.

This issue will track the work necessary to get the service behaving correctly.

According to the documentation on
=> https://vos.openlinksw.com/owiki/wiki/VOS/VirtBulkRDFLoader the bulk loading process

```
The bulk loader also disables checkpointing and the scheduler, which also need to be re-enabled post bulk load
```

That needs to be handled.

### Notes

After having a look at
=> https://docs.openlinksw.com/virtuoso/ch-server/#databaseadmsrv the configuration documentation
it occurs to me that the reason virtuoso supposedly clears the data is that the `DatabaseFile` value is not set, so it defaults to a new database file every time the server is restarted (See also the `Striping` setting).

### Troubleshooting

Reproduce locally:

We begin by getting a look at the settings for the remote virtuoso
```
$ ssh tux04
fredm@tux04:~$ cat /gnu/store/bg6i4x96nm32gjp4qhphqmxqc5vggk3h-virtuoso.ini
[Parameters]
ServerPort = localhost:8981
DirsAllowed = /var/lib/data
NumberOfBuffers = 4000000
MaxDirtyBuffers = 3000000
[HTTPServer]
ServerPort = localhost:8982
```

Copy these into a file locally, and adjust the `NumberOfBuffers` and `MaxDirtyBuffers` for smaller local dev environment. Also update `DirsAllowed`.

We end up with our local configuration in `~/tmp/virtuoso/etc/virtuoso.ini` with the content:

```
[Parameters]
ServerPort = localhost:8981
DirsAllowed = /var/lib/data
NumberOfBuffers = 10000
MaxDirtyBuffers = 6000
[HTTPServer]
ServerPort = localhost:8982
```

Run virtuoso!
```
$ cd ~/tmp/virtuoso/var/lib/virtuoso/
$ ls
$ ~/opt/virtuoso/bin/virtuoso-t +foreground +configfile ~/tmp/virtuoso/etc/virtuoso.ini
```

Here we start by changing into the `~/tmp/virtuoso/var/lib/virtuoso/` directory which will be where virtuoso will put its state. Now in a different terminal list the files created int the state directory:

```
$ ls ~/tmp/virtuoso/var/lib/virtuoso
virtuoso.db  virtuoso.lck  virtuoso.log  virtuoso.pxa  virtuoso.tdb  virtuoso.trx
```

That creates the database file (and other files) with the documented default values, i.e. `virtuoso.*`.

We cannot quite reproduce the issue locally, since every reboot will have exactly the same value for the files locally.

Checking the state directory for virtuoso on tux04, however:

```
fredm@tux04:~$ sudo ls -al /export2/guix-containers/genenetwork/var/lib/virtuoso/ | grep '\.db$'
-rw-r--r-- 1  986  980 3787456512 Oct 28 14:16 js1b7qjpimdhfj870kg5b2dml640hryx-virtuoso.db
-rw-r--r-- 1  986  980 4152360960 Oct 28 17:11 rf8v0c6m6kn5yhf00zlrklhp5lmgpr4x-virtuoso.db
```

We see that there are multiple db files, each created when virtuoso was restarted. There is an extra (possibly) random string prepended to the `virtuoso.db` part. This happens for our service if we do not actually provide the `DatabaseFile` configuration.


## Fixes

=> https://github.com/genenetwork/gn-gemtext-threads/commit/8211c1e49498ba2f3b578ed5b11b15c52299aa08 Document how to restart checkpointing and the scheduler after bulk loading
=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=2dc335ca84ea7f26c6977e6b432f3420b113f0aa Add configs for scheduler and checkpointing
=> https://git.genenetwork.org/guix-bioinformatics/commit/?id=7d793603189f9d41c8ee87f8bb4c876440a1fce2 Set up virtuoso database configurations