author    John Nduli 2024-06-19 09:19:01 +0300
committer BonfaceKilz 2024-06-24 17:58:07 +0300
commit    91e43a0bf17ec012c4386c44b9832832255f3a6c (patch)
tree      109ccd4bef348939c954bbacf65597fafb7dfd95 /topics/xapian
parent    3f02dcfa33c6c50a094a5194c0dc4dd9b8eaa594 (diff)
docs: include local development setup for indexing
Diffstat (limited to 'topics/xapian')
-rw-r--r-- topics/xapian/xapian-indexing.gmi 52
1 file changed, 46 insertions, 6 deletions
diff --git a/topics/xapian/xapian-indexing.gmi b/topics/xapian/xapian-indexing.gmi
index 1c82018..be0edc9 100644
--- a/topics/xapian/xapian-indexing.gmi
+++ b/topics/xapian/xapian-indexing.gmi
@@ -2,18 +2,58 @@
Due to the enormous size of the GeneNetwork database, indexing it in a reasonable amount of time is a tricky process that calls for careful identification and optimization of the performance bottlenecks. This document is a description of how we achieve it.
-Indexing happens in the following three phases.
+Indexing happens in these phases.
* Phase 1: retrieve data from SQL
-* Phase 2: index text
-* Phase 3: write Xapian index to disk
+* Phase 2: retrieve metadata from RDF
+* Phase 3: index text
+* Phase 4: write Xapian index to disk
-Phases 1 and 3 (that is, the retrieval of data from SQL and writing of the Xapian index to disk) are I/O bound processes. Phase 2 (the actual indexing of text) is CPU bound. So, we parallelize phase 2 while keeping phases 1 and 3 sequential.
+Phases 1, 2 and 4 are I/O bound processes. Phase 3 (the actual indexing of text) is CPU bound. So, we parallelize phase 3 while keeping phases 1, 2 and 4 sequential.
-There is a long delay in retrieving data from SQL and loading it into memory. In this time, the CPU is waiting on I/O and idling away. In order to avoid this, we retrieve SQL data chunk by chunk and spawn off phase 2 worker processes. Thus, we interleave phase 1 and 2 so that they don't block each other. Despite this, on tux02, the indexing script is only able to keep around 10 of the 128 CPUs busy. As phase 1 is dishing out jobs to phase 2 worker processes, before it can finish dishing out jobs to all 128 CPUs, the earliest worker processes finish and exit. The only way to avoid this and improve CPU utilization would be to further optimize the I/O of phase 1.
+There is a long delay in retrieving data from SQL and loading it into memory. During this time, the CPU is waiting on I/O and idling away. To avoid this, we retrieve SQL data chunk by chunk and spawn off phase 3 worker processes. The RDF data is fetched in one large call before any processing is done. Thus, we interleave phases 1 and 3 so that they don't block each other. Despite this, on tux02, the indexing script is only able to keep around 10 of the 128 CPUs busy. As phase 1 dishes out jobs to phase 3 worker processes, the earliest workers finish and exit before it can dish out jobs to all 128 CPUs. The only way to avoid this and improve CPU utilization would be to further optimize the I/O of phase 1.
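+
+To make the interleaving concrete, here is a minimal sketch of the producer/worker pattern described above, assuming a DB-API driver such as MySQLdb; the chunk size, the query, and the helper names are illustrative, not the actual code in scripts/index-genenetwork.
+
+```
+# Hypothetical sketch: interleave phase 1 (SQL fetch) with phase 3
+# (CPU-bound indexing) using a process pool.
+from multiprocessing import Pool
+
+import MySQLdb  # assumption: any DB-API driver behaves the same here
+
+CHUNK_SIZE = 100_000  # illustrative; tune per machine
+
+def fetch_chunks():
+    """Phase 1: yield rows chunk by chunk instead of loading everything."""
+    conn = MySQLdb.connect(db="gn2", user="gn2", passwd="password")
+    cursor = conn.cursor()
+    cursor.execute("SELECT ...")  # the real query is elided
+    while True:
+        rows = cursor.fetchmany(CHUNK_SIZE)
+        if not rows:
+            break
+        yield rows
+    conn.close()
+
+def index_chunk(rows):
+    """Phase 3: CPU-bound text indexing of one chunk; runs in a worker."""
+    ...
+
+if __name__ == "__main__":
+    with Pool() as pool:
+        # imap_unordered pulls chunks lazily, so SQL fetching and text
+        # indexing overlap instead of blocking each other.
+        for _ in pool.imap_unordered(index_chunk, fetch_chunks()):
+            pass
+```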
Building a single large Xapian index is not scalable. See detailed report on Xapian scalability.
=> xapian-scalability
-So, we let each process of phase 2 build its own separate Xapian index. Finally, we compact and combine them into one large index. When writing smaller indexes in parallel, we take care to lock access to the disk so that only one process is writing to the disk at any given time. If many processes try to simultaneously write to the disk, the write speed is slowed down, often considerably, due to I/O contention.
+So, we let each process of phase 3 build its own separate Xapian index. Finally, we compact and combine them into one large index. When writing smaller indexes in parallel, we take care to lock access to the disk so that only one process is writing to the disk at any given time. If many processes try to write to the disk simultaneously, the write speed slows down, often considerably, due to I/O contention.
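+
+A hedged sketch of this write-and-combine step, assuming the python3-xapian bindings; the lock, paths and helper names are illustrative, and in the real script the lock would be shared across all worker processes.
+
+```
+# Hypothetical sketch: each worker writes its own small index while holding
+# a shared lock, so only one process touches the disk at a time; afterwards
+# the shards are compacted and merged into a single index.
+import subprocess
+from multiprocessing import Lock
+
+import xapian
+
+write_lock = Lock()  # inherited by forked worker processes on Linux
+
+def write_shard(shard_path, documents):
+    """Write one worker's private index under the disk lock."""
+    with write_lock:
+        db = xapian.WritableDatabase(shard_path, xapian.DB_CREATE_OR_OPEN)
+        for doc in documents:
+            db.add_document(doc)
+        db.commit()
+        db.close()
+
+def combine(shard_paths, combined_path):
+    """Compact and merge the per-worker indexes into one large index."""
+    subprocess.run(["xapian-compact", *shard_paths, combined_path],
+                   check=True)
+```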
-It is important to note that the performance bottlenecks identified in this document are machine-specific. For example, on my laptop with only 2 cores, CPU performance in phase 2 is the bottleneck. Phase 1 I/O waits on the CPU to finish instead of the other way around.
+It is important to note that the performance bottlenecks identified in this document are machine-specific. For example, on my laptop with only 2 cores, CPU performance in phase 3 is the bottleneck. Phase 1 I/O waits on the CPU to finish instead of the other way around.
+
+## Local Development
+
+You'll need to set up Virtuoso locally. Here are Guix-specific instructions:
+
+> guix install virtuoso-ose # or any other means to install virtuoso
+> cd /path/to/virtuoso/database/folder
+> cp $HOME/.guix-profile/var/lib/virtuoso/db/virtuoso.ini ./virtuoso.ini
+> # modify the virtuoso.ini file to save files to the folder you'd prefer
+> virtuoso-t +foreground +wait +debug
+
+and load the data by:
+
+> isql
+> -- subsequent commands run in the isql prompt
+> ld_dir('/path/to/folder/with/ttls', '*.ttl', 'http://genenetwork.org'); -- preferably have this folder inside the virtuoso database folder
+> rdf_loader_run();
+> checkpoint; -- persist the loaded data
+
+Ping @bmunyoki for the ttl folder backups.
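+
+Once the loader finishes, you can sanity-check that triples actually landed in the store. A minimal sketch using only the Python standard library, assuming Virtuoso's default SPARQL endpoint on port 8890:
+
+```
+# Hypothetical check: count triples via the local SPARQL endpoint.
+import json
+import urllib.parse
+import urllib.request
+
+query = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"
+url = "http://localhost:8890/sparql?" + urllib.parse.urlencode(
+    {"query": query, "format": "application/sparql-results+json"})
+with urllib.request.urlopen(url) as response:
+    results = json.load(response)
+print(results["results"]["bindings"][0]["n"]["value"])  # expect > 0
+```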
+
+Set up MySQL with the instructions from:
+
+=> https://issues.genenetwork.org/topics/database/setting-up-local-development-database
+
+and load the backup file using:
+
+> mariadb gn2 < /path/to/backup/file.sql
+
+A backup file can be generated using:
+
+> mysqldump --opt --where="1 limit 1000000" database # the where clause caps each table at 1M rows
+
+Then run the indexing script using:
+
+> python3 scripts/index-genenetwork create-xapian-index /tmp/xapian "mysql://gn2:password@localhost/gn2" "http://localhost:8890/sparql"
+
+Verify the index with:
+
+> xapian-delve /tmp/xapian
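+
+Beyond xapian-delve, you can run an actual search against the new index. A small sketch assuming the python3-xapian bindings and an English stemmer; the term "shh" is just an example of something likely to be in the data:
+
+```
+# Hypothetical smoke test: parse a query and fetch the top matches.
+import xapian
+
+db = xapian.Database("/tmp/xapian")
+parser = xapian.QueryParser()
+parser.set_stemmer(xapian.Stem("en"))
+parser.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
+
+enquire = xapian.Enquire(db)
+enquire.set_query(parser.parse_query("shh"))
+for match in enquire.get_mset(0, 5):  # top five matches
+    print(match.docid, match.document.get_data())
+```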