# Orchestration and fallbacks

After the Penguin2 crash in August 2022 it became increasingly clear how hard it is to deploy GeneNetwork. GNU Guix helps a great deal with dependencies, but it does not handle orchestration between machines and services well. We also need to plan for the future.

What GN consists of today, in terms of services:

* [X] Main GN2 server (Python, 20+ processes, 3+ instances: depends on all below)
* [X] Matching GN3 server and REST endpoint (Python: fewer dependencies)
* [X] MariaDB
* [X] redis
* [X] virtuoso (@aruni)
* [X] GN-proxy (Racket, authentication handler: redis, mariadb)
* [X] Alias proxy (Racket, gene aliases wikidata)
* [X] opar server
* [+] Jupyter, R-shiny and Julia notebooks, nb-hub server
* [X] BNW server (@efraimf)
* [+] UCSC browser (@efraimf)
* [X] GN1 instances (older Python, 12 instances in principle, 2 running today)
* [ ] Access to HPC for GEMMA (coming)
* [+] Backup services (sheepdog, rsync, borg)
* [+] monitoring services (incl. systemd, gunicorn, shepherd, sheepdog)
* [ ] mail server
* [X] https certificates
* [X] http(s) proxy (nginx)
* [X] CI/CD services (with github webhooks)
* [+] git server (gitea or cgit)
* [X] file server (formerly IPFS)
* [ ] SPARQL endpoint

Somewhat decoupled services:

* [X] genecup
* [X] R/shiny power service Dave
* [ ] biohackrxiv
* [ ] hegp
* [ ] covid19
* [ ] guix publish server (runs on penguin2, needs tux02 @efraimf)

I am still missing a few! All run by a man and his diligent dog.

For the future the orchestration needs to be more robust and resilient. This means:

* A fallback for every service on a separate machine
* Improved privacy protection for (future) human data
* Separate servers serving different data sources
* Partial synchronization between data sources
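
The first point, a fallback for every service on a separate machine, can be illustrated with a minimal sketch. This is not how GN is deployed today; it only shows the idea of trying a primary endpoint first and falling back to a second machine when a simple TCP probe fails. The names `tcp_probe` and `first_healthy`, and any host names passed in, are hypothetical.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

# (host, port) pair; the hosts used by callers are placeholders, not real GN machines.
Endpoint = Tuple[str, int]


def tcp_probe(endpoint: Endpoint, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the endpoint can be opened."""
    host, port = endpoint
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def first_healthy(
    endpoints: Sequence[Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_probe,
) -> Optional[Endpoint]:
    """Return the first endpoint that passes the probe, in priority order."""
    for endpoint in endpoints:
        if probe(endpoint):
            return endpoint
    return None
```

A caller would list the primary first and the fallback second, for example `first_healthy([("db-primary.example.org", 3306), ("db-fallback.example.org", 3306)])`. A real setup would also need to keep the fallback's data current, which a port probe says nothing about.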

The only way we *can* scale is by adding machines, but the system is not yet ready for that. Replacing monolithic primary databases with files also makes synchronization easier.
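
A rough sketch of why files lend themselves to partial synchronization: each machine can publish a manifest of content hashes, and only files whose hashes differ need to be copied. This is an assumption about how such a scheme could look, not a description of an existing GN tool; `manifest` and `files_to_sync` are hypothetical names.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    result: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            result[path.relative_to(root).as_posix()] = digest
    return result


def files_to_sync(source: Dict[str, str], target: Dict[str, str]) -> List[str]:
    """List files that are missing on the target or differ from the source."""
    return sorted(
        name for name, digest in source.items() if target.get(name) != digest
    )
```

Restricting the source manifest to a subdirectory or a list of datasets gives a partial sync. Tools such as rsync already do this kind of comparison; the sketch only makes the idea explicit.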