# Virtuoso

We run instances of virtuoso for our graph databases. Virtuoso is remarkable software and runs some really large databases, including Uniprot. Virtuoso can sometimes feel old and clunky, but we still prefer it to shinier alternatives because it is the only large graph database not written in Java. Java packages are almost impossible to package in Guix.

=> https://github.com/openlink/virtuoso-opensource virtuoso-opensource on GitHub
=> https://www.uniprot.org/sparql/ Uniprot SPARQL endpoint

## Running virtuoso

### Running virtuoso in a guix system container

We have a Guix virtuoso service in the guix-bioinformatics channel. The easiest way to run virtuoso is to use this service to run it in a guix system container. The only downside of this method is that, since guix system containers require root privileges to start up, you will need root privileges on the machine you are running this on.

Here is a basic guix system configuration that runs virtuoso listening on port 8891, with its HTTP server listening on port 8892. Among other things, the HTTP server provides a SPARQL endpoint to interact with.
```
(use-modules (gnu) (gn services databases))

(operating-system
  (host-name "virtuoso")
  (timezone "UTC")
  (locale "en_US.utf8")
  (bootloader (bootloader-configuration
               (bootloader grub-bootloader)
               ;; It doesn't matter what "/dev/sdX" is
               (targets (list "/dev/sdX"))))
  (file-systems %base-file-systems)
  (users %base-user-accounts)
  (packages %base-packages)
  (services (cons (service virtuoso-service-type
                           (virtuoso-configuration
                            (server-port 8891)
                            (http-server-port 8892)))
                  %base-services)))
```
You can write the above configuration to a file, say virtuoso-os.scm, build a container with it, and run it with the command below. Everything inside the container is ephemeral and vanishes when the container is stopped. In order to persist the database, we mount a host directory /tmp/virtuoso-state at /var/lib/virtuoso in the container. /var/lib/virtuoso is the default state directory used by the Guix virtuoso service.
```
sudo $(guix system container --network --share=/tmp/virtuoso-state=/var/lib/virtuoso virtuoso-os.scm)
```
When running the above command, you will be given the container's PID. Should you want to inspect the container, you can run:
```
sudo nsenter -at PID /run/current-system/profile/bin/bash
```
If you have only one shepherd process running on your system, you may use the following quick hack to get the PID.
```
sudo nsenter -at $(pgrep shepherd) /run/current-system/profile/bin/bash
```
Also note that, in this set-up, the Conductor web interface is not supported by the virtuoso service in guix-bioinformatics. It is not required for using virtuoso as a SPARQL server and only adds to the confusion.

### Running virtuoso by invoking it on the command line

You may also choose to run virtuoso the traditional way by invoking it on the command line. Managing long-running instances started from the command line is messy, so this method works best for temporary instances.

First, we create a new directory for virtuoso and change into it. We will run virtuoso from this directory, and virtuoso will store all its state in it.
```
mkdir virtuoso
cd virtuoso
```
Then, we create a configuration file, virtuoso.ini. A basic configuration need only specify the ports to listen on. Here we specify port 8891 for the virtuoso server and port 8892 for the HTTP server that includes the SPARQL endpoint.
```
[Parameters]
ServerPort = localhost:8891

[HTTPServer]
ServerPort = localhost:8892
```
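If you later plan to bulk load turtle files with ld_dir (see the bulk loading sections below), the directory containing those files must also be listed in the DirsAllowed parameter. A minimal sketch, assuming the files live in /var/lib/data:
```
[Parameters]
ServerPort = localhost:8891
; the bulk loader can only read files from directories listed here (path is an assumption)
DirsAllowed = ., /var/lib/data

[HTTPServer]
ServerPort = localhost:8892
```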
Finally, we start virtuoso.
```
virtuoso-t +foreground +configfile virtuoso.ini
```
Detailed documentation of the virtuoso configuration file format is at
=> http://docs.openlinksw.com/virtuoso/dbadm/#configsrvstupfiles Virtuoso configuration file
In particular, consider setting NumberOfBuffers and MaxDirtyBuffers as described at
=> http://vos.openlinksw.com/owiki/wiki/VOS/VirtRDFPerformanceTuning Performance tuning virtuoso
For a working configuration file, you can also look at /export/virtuoso/var/lib/virtuoso/db/virtuoso.ini on penguin2.

### Running SPARQL Queries using isql

The straightforward way of running SPARQL queries is using the web interface provided by the HTTP server (port 8892 in the configuration above):
=> http://localhost:8892/sparql/ SPARQL web interface
To use a CLI tool instead, you can use isql by running:
```
guix shell virtuoso-ose -- isql -U dba -P password
```
Queries within isql look like:
```
SQL> SPARQL SELECT * WHERE {?s ?p ?o};
```

## Set passwords for virtuoso users

After running virtuoso, you will want to change the default password of the `dba` user, which is `dba`. You can change passwords using the isql command-line client. See
=> http://docs.openlinksw.com/virtuoso/defpasschange/ Virtuoso users and how to set their passwords
In a typical production virtuoso installation, you will want to change the password of the dba user and disable the dav user. Here are the commands to do so. Pay attention to the single versus double quoting.
```
SQL> set password "dba" "new-password";
SQL> UPDATE ws.ws.sys_dav_user SET u_account_disabled=1 WHERE u_name='dav';
SQL> CHECKPOINT;
```

## Loading data into virtuoso

Virtuoso supports at least three different ways to load RDF.

### Bulk loading using the isql command-line client

=> http://vos.openlinksw.com/owiki/wiki/VOS/VirtBulkRDFLoader Bulk loading using the isql command-line client
Bulk loading using the isql command-line client is usually the fastest. But, it requires correct handling of file system permissions, and cannot work on remote servers.

### SPARQL 1.1 Update

The standard SPARQL protocol allows updating RDF too.
=> https://www.w3.org/TR/sparql11-update/ SPARQL 1.1 Update

### SPARQL 1.1 Graph Store HTTP Protocol

For ease of implementation, SPARQL 1.1 also specifies an additional REST-like API to update data.
=> https://www.w3.org/TR/sparql11-http-rdf-update/ SPARQL 1.1 Graph Store HTTP Protocol
The virtuoso documentation shows examples of using this protocol with cURL.
=> http://vos.openlinksw.com/owiki/wiki/VOS/VirtGraphProtocolCURLExamples Virtuoso SPARQL 1.1 Graph Store HTTP Protocol examples using cURL
We recap the same here.

When uploading data, the virtuoso server often does not report errors properly; it simply freezes up. So, it is very helpful to validate your RDF before uploading. For this, use rapper from the raptor2 package. To validate data.ttl, a turtle file, run
```
rapper --input turtle --count data.ttl
rapper: Parsing URI file: data.ttl with parser turtle
rapper: Parsing returned 652395 triples
```
Then, upload it to a virtuoso SPARQL endpoint running at port 8892:
```
curl -v -X PUT --digest -u 'dba:password' -T data.ttl -G http://localhost:8892/sparql-graph-crud-auth --data-urlencode graph=http://genenetwork.org
```
where http://genenetwork.org is the name of the graph. Note that single quoting the credentials is a good idea, especially when the password contains special characters. The PUT method deletes the existing data in the graph before loading the new data, so there is no need to delete old data manually before loading new data.
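After the upload completes, it is worth checking that the graph now holds roughly the number of triples rapper reported. A quick sketch using the isql client described above, assuming the same graph name:
```
SQL> SPARQL SELECT (COUNT(*) AS ?count) FROM <http://genenetwork.org> WHERE { ?s ?p ?o };
```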
However, when the graph being replaced already contains millions of triples, virtuoso is slow at deleting them, resulting in an apparent freeze-up. So, it is preferable to handle such deletes manually using a lower-level SQL statement issued via the isql client. To delete a graph:
```
$ isql
SQL> DELETE FROM rdf_quad WHERE g = iri_to_id('http://genenetwork.org');
```
To add ttl files:
```
SQL> ld_dir('/dir', '*.ttl', 'http://genenetwork.org');
SQL> rdf_loader_run();
```
=> http://vos.openlinksw.com/owiki/wiki/VOS/VirtTipsAndTricksGuideDeleteLargeGraphs How can I delete graphs containing large numbers of triples from the Virtuoso Quad Store?

When virtuoso has just been started up with a clean state (that is, the virtuoso state directory was empty before virtuoso started), uploading large amounts of data using the SPARQL 1.1 Graph Store HTTP Protocol fails the first time. It succeeds only the second time. It is not clear why. I can only recommend retrying as in this commit:
=> https://github.com/genenetwork/dump-genenetwork-database/commit/8f60fde7f5499e5ffe352d7ae98a2de34a91b89f Retry uploading to virtuoso (commit from the dump-genenetwork-database repo, formerly at https://git.genenetwork.org/arunisaac/dump-genenetwork-database/commit/8f60fde7f5499e5ffe352d7ae98a2de34a91b89f)

### Using the load-rdf.scm script

You can use the following script to upload data in RDF.
=> https://github.com/genenetwork/dump-genenetwork-database/blob/master/load-rdf.scm load-rdf.scm
This script first clears the database before uploading data. To run it:
```
guix shell -N virtuoso-ose -m manifest.scm -- ./pre-inst-env ./load-rdf.scm conn.scm dump.ttl
```
=> https://github.com/genenetwork/dump-genenetwork-database/blob/master/conn.scm Example conn.scm

### Bulk Loading Data

Virtuoso has access to the folder /export/data/genenetwork-virtuoso/, so place all the turtle files for bulk uploads there. To bulk load data:

First make sure that all the old data is deleted:
```
$ isql
SQL> DELETE FROM rdf_quad WHERE g = iri_to_id('http://genenetwork.org');
```
Use isql to register all the turtle files:
```
SQL> ld_dir('/var/lib/data', '*.ttl', 'http://genenetwork.org');
```
You can query the DB.DBA.load_list table to see the list of files registered for loading:
```
SQL> SELECT * FROM DB.DBA.load_list;
```
In case you want to empty the list:
```
SQL> DELETE FROM DB.DBA.load_list;
```
Perform the bulk load of all data by running:
```
SQL> rdf_loader_run();
```
Commit the bulk loaded data to the Virtuoso database file by running:
```
SQL> checkpoint;
```
Run a query to make sure that you have indeed loaded data, for example (assuming the gn: prefix expands to http://genenetwork.org/):
```
SQL> SPARQL PREFIX gn: <http://genenetwork.org/> SELECT * FROM <http://genenetwork.org> WHERE { gn:Mus_musculus ?p ?o . };
```
In case you want to get a list of all graphs:
```
SQL> SPARQL SELECT DISTINCT ?g WHERE { GRAPH ?g {?s ?p ?o} } ORDER BY ?g;
```
Other resources:
=> https://vos.openlinksw.com/owiki/wiki/VOS/VirtBulkRDFLoader Bulk Loading RDF Source Files into one or more Graph IRIs
=> https://vos.openlinksw.com/owiki/wiki/VOS/VirtBulkRDFLoaderExampleSingle VOS.VirtBulkRDFLoaderExampleSingle

## Dumping to RDF from the GeneNetwork MySQL database

See also
=> ../RDF/genenetwork-sql-database-to-rdf
To dump data into a ttl file, first make sure that you are in the guix environment of the "dump-genenetwork-database" repository.
=> https://github.com/genenetwork/dump-genenetwork-database/ Dump Genenetwork Database
See the README for instructions.
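Before loading a freshly dumped turtle file into virtuoso, it is worth validating it with rapper as described above. A sketch, assuming the dump ended up in a file named dump.ttl:
```
# validate the dump before loading it into virtuoso (file name is an assumption)
guix shell raptor2 -- rapper --input turtle --count dump.ttl
```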