# Fallbacks and backups

A revisit of previous work on backups and fallbacks. The sheepdog hosts are no longer responding, and we should really run sheepdog on a machine that is not physically located with the other machines. In time sheepdog should also move away from redis and run in a system container, but that is for later. I did most of the work in late 2021, when I wrote:

> As a hurricane is barreling towards our machine room in Memphis we are checking our fallbacks and backups for GeneNetwork. For years we have been making backups on Amazon - both S3 and a running virtual machine. The latter was expensive, so I replaced it with a bare metal server which earns itself (if it hadn't been down for months, but that is a different story).

As we are introducing an external sheepdog server, we may give it a DNS entry: sheepdog.genenetwork.org.
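
Once that record is in place, a quick lookup should confirm that the name resolves:

```
dig +short sheepdog.genenetwork.org
```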

See also

=> /topics/systems/restore-backups Restore Backups

## Tags

* type: enhancement
* assigned: pjotrp
* keywords: systems, fallback, backup, deploy
* status: in progress
* priority: critical

## Tasks

* [X] fix redis queue and sheepdog server
* [ ] check backups on tux01
* [ ] backup ratspub, r/shiny, bnw, covid19, hegp, pluto services
* [ ] /etc /home/shepherd backups for Octopus
* [ ] /etc /home/shepherd /home/git CI-CD GN-QA backups on Tux02
* [ ] Get backups running again on fallback
* [ ] fix bacchus large backups
* [ ] mount bacchus on HPC

## Backup and restore

We are using borg for backing up data. Borg is excellent at deduplicating and compressing data and is pretty fast too. Incremental copies to other hosts are made with rsync, so propagation is fast as well. Restoring the full MariaDB database from a local borg repo takes under twenty minutes:

```
wrk@epysode:/export/restore_tux01$ time borg extract -v /export2/backup/tux01/borg-tux01::BORG-TUX01-MARIADB-20210829-04:20-Sun
real    17m32.498s
user    8m49.877s
sys     4m25.934s
```
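
To pick an archive to restore, the repo can be listed first:

```
borg list /export2/backup/tux01/borg-tux01
```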

That contrasts sharply with restoring 300GB from Amazon S3.

Next, restore the GN2 home directory:

```
root@epysode:/# borg extract /export2/backup/tux01/borg-genenetwork::TUX01_BORG_GN2_HOME-20210830-04:00-Mon
```
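
Note that borg extract restores into the current working directory, which is why the restores above are run from /export/restore_tux01 and / respectively.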

## Get backups running on fallback

Recently epysode was reinstated after a hardware failure, and I took the opportunity to reinstall the machine. The backups are described in the following repo (genenetwork org members have access):

=> https://github.com/genenetwork/gn-services/blob/master/services/backups.org BACKUPS

As epysode was one of the main sheepdog messaging servers, I need to reinstate:

* [X] scripts for sheepdog
* [ ] Check tunnel on tux01 is reinstated
* [ ] enable trim (see the sketch after this list)
* [ ] reinstate monitoring web services
* [ ] reinstate daily backups
* [ ] CRON
* [ ] make sure messaging works through redis (see the sketch after this list)
* [ ] fix and propagate GN1 backup
* [ ] fix and propagate fileserver and git backups
* [ ] add GN1 backup
* [ ] other backups
* [ ] email on fail
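
Two of the items above can be exercised directly from the shell. A minimal sketch, assuming a Debian-style system with systemd, and using the sheepdog.genenetwork.org name proposed above (that host is an assumption until the DNS entry exists):

```
# enable periodic TRIM for the SSDs (Debian ships an fstrim.timer unit)
systemctl enable --now fstrim.timer

# verify the redis instance that sheepdog messages through is reachable
redis-cli -h sheepdog.genenetwork.org ping
```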

Tux01 is backed up now. We need to make sure the backups propagate to:

* [ ] rabbit
* [ ] Tux02
* [ ] balg01
* [ ] bacchus
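
A minimal sketch of what that propagation could look like, assuming nightly cron entries that push the borg repos onward with rsync (hosts, paths, user, and times are illustrative):

```
# /etc/cron.d/borg-propagate (illustrative)
# push the tux01 borg repos to the fallback hosts overnight
0 5 * * * backup rsync -a --delete /export2/backup/tux01/ rabbit:/export/backup/tux01/
0 6 * * * backup rsync -a --delete /export2/backup/tux01/ tux02:/export/backup/tux01/
```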