author    | Arun Isaac | 2022-07-01 18:44:01 +0530
committer | Arun Isaac | 2022-07-01 18:44:01 +0530
commit    | 2f84a722e5944bd4458abbd45fd637a9286bab04 (patch)
tree      | 33e02614a9eb5e457c6162e8b2f67278ba07ddfe /topics/systems
parent    | c049be0d57f87151ad8db733150d9a49fb30ea31 (diff)
download  | gn-gemtext-2f84a722e5944bd4458abbd45fd637a9286bab04.tar.gz
Move issues lost inside the topics directory.
Diffstat (limited to 'topics/systems')
-rw-r--r-- | topics/systems/ci-cd.gmi                 | 110
-rw-r--r-- | topics/systems/decommission-machines.gmi |  63
-rw-r--r-- | topics/systems/fallbacks-and-backups.gmi |  69
-rw-r--r-- | topics/systems/gn1-time-machines.gmi     |  49
-rw-r--r-- | topics/systems/letsencrypt.gmi           |  26
-rw-r--r-- | topics/systems/machine-room.gmi          |  19
-rw-r--r-- | topics/systems/octopus.gmi               |  28
-rw-r--r-- | topics/systems/reboot-tux01-tux02.gmi    |  54
-rw-r--r-- | topics/systems/sheepdog.gmi              |  34
-rw-r--r-- | topics/systems/tux02-production.gmi      |  65
10 files changed, 0 insertions, 517 deletions
diff --git a/topics/systems/ci-cd.gmi b/topics/systems/ci-cd.gmi
deleted file mode 100644
index a5e00b6..0000000
--- a/topics/systems/ci-cd.gmi
+++ /dev/null
@@ -1,110 +0,0 @@
-# CI/CD for GeneNetwork projects
-
-We need to figure out, discuss and document how to go about doing the
-whole automated testing and deployment, from pushing code to
-deployment to production.
-
-To start with, we need various levels of tests to be run, from unit
-tests to the more complicated ones such as integration, performance
-and regression tests. Of course, they cannot all be run for each and
-every commit, and will thus need to be staggered across the entire
-deployment cycle to help with quick iteration of the code.
-
-## Tags
-
-* assigned: bonfacem, fredm, efraimf, aruni
-* keywords: deployment, CI, CD, testing
-* status: in progress
-* priority: high
-* type: enhancement
-
-## Tasks
-
-As part of the CI/CD effort, it is necessary that there is
-=> ../testing/automated-testing.gmi automated testing.
-
-#### Ideas
-
-GeneNetwork is interested in doing two things on every commit (or
-periodically, say, once an hour/day):
-
-- CI: run unit tests
-- CD: rebuild and redeploy a container running GN3
-
-Arun has figured out the CI part. It runs a suitably configured
-laminar CI service in a Guix container created with `guix system
-container'. A cron job periodically triggers the laminar CI job.
-
-=> https://git.systemreboot.net/guix-forge/about/
-
-CD hasn't been figured out. Normally, Guix VMs and containers created
-by `guix system` can only access the store read-only. Since containers
-don't have write access to the store, you cannot `guix build' from
-within a container or deploy new containers from within a
-container. This is a problem for CD. How do you make Guix containers
-have write access to the store?
-
-Another alternative for CI/CD would be to have the quick-running
-tests, e.g. unit tests, run on each commit to branch "main". Once
-those are successful, the CI/CD system we choose should automatically
-pick the latest commit that passed the quick-running tests for further
-testing and deployment, maybe once an hour or so. Once the next
-battery of tests is passed, the CI/CD system will create a
-build/artifact to be deployed to staging and have the next battery of
-tests run against it. If that passes, then that artifact could be
-deployed to production.
-
-#### Possible Steps
-
-Below are some possible steps (and tasks) to undertake for automated deployment.
-
-##### STEP 01: Build package
-
-- Triggered by a commit to the "main" branch (for now)
-- Trigger a build of the package
-- Run unit tests as part of the build:
-  - This has been done with the laminar scripts under `scripts/laminar` in genenetwork3
-  - Maybe just change the command to ensure only specific tests are run,
-    especially when we add in non-functional tests and the like
-- If the build fails (tests fail, other failures): abort and send notifications to the development team
-- If the build succeeds, go to STEP 02
-
-##### STEP 02: Deploy to Staging
-
-- Triggered by a successful build
-- Run in intervals of maybe one hour or so
-- Build the container/VM for deployment: here's the first time `guix system container ...` is run
-- Deploy the container/VM to staging: the details are fuzzy here
-- Run configuration tests
-- Run performance tests
-- Run integration tests
-- Run UI tests
-- Run ... tests
-- On failure, abort and send out a notification to the development team
-- On success, go to STEP 03
-
-##### STEP 03: Deploy to Release Candidate
-
-- Triggered by a successful deploy to Staging
-- Run in intervals of maybe 6 hours
-- Pick the latest successful commit that passed the staging tests
-- Build the container/VM for deployment: run `guix system container ...` or reuse the container from staging
-- Update configurations for production
-- Run configuration tests
-- Run acceptance tests
-- On failure, abort and send out a notification to the development team
-- On success, go to STEP 04
-
-##### STEP 04: Deploy to Production
-
-- Triggered by a successful Release Candidate
-- Tag the commit as a release
-  - Maybe include the commit hash and date - e.g. gn3-v0.0.12-794db6e2-20220113
-- Build the container/VM for deployment
-  - run `guix system container ...` or reuse the container from staging
-  - tag the container as a release container
-- Deploy the container to production
-- Generate documentation for the tagged commit
-- Generate a guix declaration for re-generating the release
-- Archive the container image, documentation and guix declaration for possible rollback
diff --git a/topics/systems/decommission-machines.gmi b/topics/systems/decommission-machines.gmi
deleted file mode 100644
index 496f23f..0000000
--- a/topics/systems/decommission-machines.gmi
+++ /dev/null
@@ -1,63 +0,0 @@
-# Decommission machines
-
-# Tags
-
-* assigned: pjotrp, arthurc, dana
-* priority: high
-* keywords: systems
-* type: system administration
-* status: unclear
-
-# Tasks
-
-## Running (OK)
-
-* rabbit is used for backups (R815 - 2010 model - 24 cores, AMD, 64G, 5TB: pjotr, root)
-
-## Still running old Linux
-
-* [ ] xeon (1950 - 2006 model - pjotr root) BNW, pivotcollections.org
-  - made a full backup on rabbit
-  - need to move DNS for pivotcollections to Tux02 or P2
-* [ ] lily (1950 - still in use by gn1: pjotr root)
-  - runs gn1-lily and Arthur uses it
-* [ ] rhodes (860) - runs wiki?
-  - login fixed by @acenteno
-  - services are
-    ServerName wiki.genenetwork.org
-    DocumentRoot /var/www/mediawiki
-    mysql -u mediawikiuser -pmonroe815link201 mediawikidb
-    35 pages - save pages as HTML
-    ServerName lookseq.genenetwork.org
-    DocumentRoot /var/www/lookseq/html
-    ServerName lookseq2.genenetwork.org
-    DocumentRoot /var/www/lookseq2/html
-    ServerName galaxy.genenetwork.org
-    DocumentRoot /var/www/galaxy
-  - Dave is doing the final backup
-
-* [ ] NB (860) unused? - was mailman trial
-  - no login
-* [ ] tyche (2950 - 2006 model - login as arthur, no root, hacked?) - reset passwords
-  - tyche hard disk array is broken, failed to recover
-
-## Switched off/down
-
-* [X] summer211 - ran UTHSC browser (R610) - needs backup to fetch annotations, but no access
-  - need to access when in machine room
-* [X] alexandria (off)
-* [X] proust (off)
-* [X] artemis (Poweredge 860 - 2006 model - 2 core XEON, 2GB RAM: pjotr, arthur, root - runs time machines) see also Artemis runs time machines
-  - dead
-* [X] zeus (860 pjotr/root) - Genome browser?
-* [X] plum (860 2006 model)
-* [X] bamboo (860)
-* [X] pine (860)
-* [X] winter211 (R610)
-* [X] spring211 (Poweredge R610 2010 model - no access)
-* [X] autumn211 (R610)
-
-see also
-
-=> https://trello.com/c/usOYPBG9/72-decommissioning-older-machines
diff --git a/topics/systems/fallbacks-and-backups.gmi b/topics/systems/fallbacks-and-backups.gmi
deleted file mode 100644
index 1a22db9..0000000
--- a/topics/systems/fallbacks-and-backups.gmi
+++ /dev/null
@@ -1,69 +0,0 @@
-# Fallbacks and backups
-
-As a hurricane is barreling towards our machine room in Memphis, we are checking our fallbacks and backups for GeneNetwork. For years we have been making backups on Amazon - both S3 and a running virtual machine. The latter was expensive, so I replaced it with a bare metal server which earns its keep (if it hadn't been down for months, but that is a different story).
-
-## Tags
-
-* type: enhancement
-* assigned: pjotrp
-* keywords: systems, fallback, backup, deploy
-* status: in progress
-* priority: critical
-
-## Tasks
-
-* [.] backup ratspub, r/shiny, bnw, covid19, hegp, pluto services
-* [X] /etc /home/shepherd backups for Octopus
-* [X] /etc /home/shepherd backups for P2
-* [X] Get backups running again on fallback
-* [ ] fix redis queue for P2 - needs to be on rabbit
-* [ ] fix bacchus large backups
-* [ ] backup octopus01:/lizardfs/backup-pangenome on bacchus
-
-## Backup and restore
-
-We are using borg for backing up data. Borg is excellent at deduplication and compression of data and is pretty fast too. Incremental copies work with rsync - so that is fast. Restoring the full MariaDB database from a local borg repo takes under twenty minutes:
-
-```
-wrk@epysode:/export/restore_tux01$ time borg extract -v /export2/backup/tux01/borg-tux01::BORG-TUX01-MARIADB-20210829-04:20-Sun
-real    17m32.498s
-user    8m49.877s
-sys     4m25.934s
-```
-
-This all contrasts heavily with restoring 300GB from Amazon S3.
-
-Next, restore the GN2 home dir:
-
-```
-root@epysode:/# borg extract export2/backup/tux01/borg-genenetwork::TUX01_BORG_GN2_HOME-20210830-04:00-Mon
-```
-
-## Get backups running on fallback
-
-Recently epysode was reinstated after a hardware failure. I took the opportunity to reinstall the machine. The backups are described in the repo (genenetwork org members have access):
-
-=> https://github.com/genenetwork/gn-services/blob/master/services/backups.org
-
-As epysode was one of the main sheepdog messaging servers I need to reinstate:
-
-* [X] scripts for sheepdog
-* [X] enable trim
-* [X] reinstate monitoring web services
-* [X] reinstate daily backup from penguin2
-* [X] CRON
-* [X] make sure messaging works through redis
-* [X] fix and propagate GN1 backup
-* [X] fix and propagate IPFS and gitea backups
-* [X] add GN1 backup
-* [X] add IPFS backup
-* [X] other backups
-* [ ] email on fail
-
-Tux01 is backed up now. Need to make sure it propagates to:
-
-* [X] P2
-* [X] epysode
-* [X] rabbit
-* [X] Tux02
-* [ ] bacchus
diff --git a/topics/systems/gn1-time-machines.gmi b/topics/systems/gn1-time-machines.gmi
deleted file mode 100644
index 0029016..0000000
--- a/topics/systems/gn1-time-machines.gmi
+++ /dev/null
@@ -1,49 +0,0 @@
-# GN1 Time machines
-
-We want to reinstate the time machines and run them in containers. Databases and source code as they were running before are located in
-
-=> penguin2:/export/backup/artemis/borg
-
-Use the borg backup tool to extract them; the passphrase is in my ~/.borg-pass:
-
-```
-TM                    Fri, 2019-12-27 07:48:12
-TM_etc_httpd          Fri, 2019-12-27 07:49:18
-TM_gn_web.checkpoint  Fri, 2019-12-27 07:52:21
-TM_gn_web             Fri, 2019-12-27 09:45:09
-TM_artemis            Sat, 2019-12-28 01:55:40
-TM_artemis2           Sun, 2019-12-29 02:20:38
-```
-
-Note that borg contents can be listed with `borg list dir::TM_etc_httpd`.
-
-Essentially it contains the full MariaDB databases, source code and /etc files to set up the containers. Start with the most recent one and see if you can get that to run on Penguin2. After that we'll do the others. The databases are named, for example,
-
-```
-rw-rw-r-- wrk sudo 37907660 Wed, 2009-02-18 16:40:47 export/backup/artemis/mysql/gn_db_20090304/Geno.MYD
--rw-rw-r-- wrk sudo 37940692 Fri, 2009-12-25 00:42:24 export/backup/artemis/mysql/gn_db_20091225/Geno.MYD
--rw-rw-r-- wrk sudo 37940764 Thu, 2010-08-05 14:13:30 export/backup/artemis/mysql/gn_db_20100810/Geno.MYD
--rw-rw-r-- wrk sudo 4451132 Sun, 2011-04-24 03:38:14 export/backup/artemis/mysql/gn_db_20110424/Geno.MYD
--rw-rw-r-- wrk sudo 4451132 Mon, 2011-08-08 03:25:18 export/backup/artemis/mysql/gn_db_20110808/Geno.MYD
--rw-rw-r-- wrk sudo 4424608 Mon, 2012-09-24 11:43:46 export/backup/artemis/mysql/gn_db_20120928/Geno.MYD
--rw-rw-r-- wrk sudo 23838240 Tue, 2014-01-14 14:40:47 export/backup/artemis/mysql/gn_db_20140123/Geno.MYD
--rw-rw-r-- wrk sudo 24596912 Sun, 2014-11-23 23:22:30 export/backup/artemis/mysql/gn_db_20150224/Geno.MYD
--rw-rw-r-- wrk sudo 24597500 Wed, 2015-07-22 12:48:33 export/backup/artemis/mysql/gn_db_20160316/Geno.MYD
--rw-rw-r-- wrk sudo 24596908 Tue, 2016-06-07 17:39:37 export/backup/artemis/mysql/gn_db_20160822/Geno.MYD
--rw-rw-r-- wrk sudo 24596908 Tue, 2016-06-07 17:39:37 export/backup/artemis/mysql/gn_db_20161212/Geno.MYD
--rw-rw-r-- wrk sudo 33730812 Wed, 2017-04-12 17:06:40 export/backup/artemis/mysql/gn_db_20180228/Geno.MYD
-```
-
-so you can see all the different versions. The matching code bases should be there too.
-
-## Tags
-
-* assigned: pjotrp, efraimf
-* priority: high
-* status: unclear
-* type: system administration
-* keywords: systems
-
-## Tasks
-
-## Info
diff --git a/topics/systems/letsencrypt.gmi b/topics/systems/letsencrypt.gmi
deleted file mode 100644
index bafae55..0000000
--- a/topics/systems/letsencrypt.gmi
+++ /dev/null
@@ -1,26 +0,0 @@
-# Letsencrypt
-
-## Tags
-
-* assigned: pjotr
-* type: bug
-* priority: critical
-* status: completed, done, closed
-
-## Tasks
-
-* [X] letsencrypt is failing on P2 and Tux01 (expiry Nov 12)
-  - letsencrypt was down
-* [X] ucscbrowser needs a certificate (now forwards http -> https)
-
-## Notes
-
-```
-certbot renew --dry-run
-```
-
-Add a certificate:
-
-```
-certbot certonly --nginx --agree-tos --preferred-challenges http -d ucscbrowser.genenetwork.org --register-unsafely-without-email
-```
diff --git a/topics/systems/machine-room.gmi b/topics/systems/machine-room.gmi
deleted file mode 100644
index 28d9921..0000000
--- a/topics/systems/machine-room.gmi
+++ /dev/null
@@ -1,19 +0,0 @@
-# Machine room
-
-## Tags
-
-* assigned: pjotrp, dana
-* type: system administration
-* priority: high
-* keywords: systems
-* status: unclear
-
-## Tasks
-
-* [X] Make tux02e visible from outside
-* [ ] Network switch 10Gbps - add hosts
-* [ ] Add disks to tux01 and tux02 - need to reboot
-* [ ] Set up E-mail relay for tux01 and tux02: smtp.uthsc.edu, port 25
-
-=> tux02-production.gmi Set up new production machine
-=> decommission-machines.gmi Decommission machines
diff --git a/topics/systems/octopus.gmi b/topics/systems/octopus.gmi
deleted file mode 100644
index 1420865..0000000
--- a/topics/systems/octopus.gmi
+++ /dev/null
@@ -1,28 +0,0 @@
-# Octopus system maintenance
-
-# Tags
-
-* assigned: pjotrp, efraimf, erikg
-* priority: high
-* status: completed, closed
-* type: system administration
-* keywords: systems, octopus
-
-# Tasks
-
-* [X] install sheepdog
-* [X] run borg backup
-* [X] propagate backup to rabbit
-* [X] fix redis updates - use rev tunnel
-* [X] check other dirs
-
-# Info
-
-Intermediate routing on Octopus08:
-
-```
-default via 172.23.16.1 dev ens1f0np0
-172.23.16.0/21 dev ens1f0np0 proto kernel scope link src 172.23.17.24
-172.23.16.0/21 dev eno1 proto kernel scope link src 172.23.18.68
-172.23.16.0/21 dev eno2 proto kernel scope link src 172.23.17.134
-```
diff --git a/topics/systems/reboot-tux01-tux02.gmi b/topics/systems/reboot-tux01-tux02.gmi
deleted file mode 100644
index 3186a0d..0000000
--- a/topics/systems/reboot-tux01-tux02.gmi
+++ /dev/null
@@ -1,54 +0,0 @@
-# Rebooting the GN production machine(s)
-
-I needed to add the hard disks in the BIOS to make them visible - one of the annoying aspects of these Dell machines. First, on Tux02, I checked the borg backups to see if we have a recent copy of MariaDB, GN2 etc. The DB is from 2 days ago and the genotypes of GN2 are a week old (because of a permission problem). I'll add a copy of both by hand - an opportunity to test the new 10Gbps router.
-
-Something funny is going on. When eno4 goes down the external webserver interface stops working. It appears, somehow, that 128.169.4.67 is covering for 128.169.5.59. I need to check that!
-
-# Tasks
-
-Before rebooting:
-
-On Tux02:
-
-* [X] Check backups of DB and services
-* [X] Copy trees between machines
-
-On Tux01:
-
-* [ ] Network confused. See above.
-
-On both:
-
-* [X] Check network interface definitions (what happens on reboot)
-* [X] Check IPMI access - should get serial login
-
-# Info
-
-## Routing
-
-On tux02, eno2d1 is the 10Gbps network interface. Unfortunately I can't get it to connect at 10Gbps with Tux01 because the latter is using that port for the outside world.
-
-Playing with 10Gbps on Tux01 sent the hardware into a tailspin; what to think of this solution:
-
-```
-bnxt_en 0000:01:00.1 (unnamed net_device) (uninitialized): Error (timeout: 500015) msg {0x0 0x0} len:0
-
-Solution was to power down the server(s) and *remove* power cords for 5 minutes.
-```
-
-=> https://www.dell.com/community/PowerEdge-Hardware-General/Critical-network-bnxt-en-module-crashes-on-14G-servers/td-p/6031769
-
-The Linux kernel shows some fixes that are not on Tux01 yet:
-
-=> https://lkml.org/lkml/2021/2/17/970
-
-In our case a simple reboot worked, fortunately.
-
-## Tags
-
-* assigned: pjotrp
-* status: unclear
-* priority: medium
-* type: system administration
-* keywords: systems, tux01, tux02, production
diff --git a/topics/systems/sheepdog.gmi b/topics/systems/sheepdog.gmi
deleted file mode 100644
index 5285553..0000000
--- a/topics/systems/sheepdog.gmi
+++ /dev/null
@@ -1,34 +0,0 @@
-# Sheepdog
-
-I have written sheepdog to keep track of backups etc. Here are some issues
-that need resolving at some point.
-
-=> https://github.com/pjotrp/deploy
-
-## Tags
-
-* assigned: pjotrp
-* type: enhancement
-* status: in progress, halted
-* priority: medium
-* keywords: system, sheepdog
-
-## Tasks
-
-* [X] add locking functionality for tags - added borg with-lock (test)
-* [X] chgrp functionality in sheepdog_borg
-* [ ] check whether rsync dir exists, repo valid and/or no lock before proceeding
-* [ ] send digest E-mails
-* [ ] smart state E-mails on services going down
-* [ ] block on root user if not running from protected dir
-* [ ] borg/rsync should check validity of repo before propagating
-* [ ] borg/rsync ignore files that have wrong permissions
-* [ ] package in GNU Guix for root scripts
-* [ ] list current state - it means parsing the state list (some exists)
-* [ ] synchronize between queues using a dump
-* [ ] sheepdog_expect.rb - expect PINGs
-* [ ] sheepdog_rsync.rb - test for 'total size is 0'
-* [ ] sheepdog_list tag and filter switches improve behaviour
-* [ ] add sheepdog_web_monitor - currently using plain curl
-* [X] borg: set user/group after backup
-* [ ] add remote borg backup with sshfs
diff --git a/topics/systems/tux02-production.gmi b/topics/systems/tux02-production.gmi
deleted file mode 100644
index 0b698ab..0000000
--- a/topics/systems/tux02-production.gmi
+++ /dev/null
@@ -1,65 +0,0 @@
-# Tux02 Production
-
-We are going to move production to tux02 - tux01 will be the staging machine. This machine is aimed to be rock solid. The idea is to have upgrades 4-6 times a year. Also, we should be able to roll back an upgrade and be able to create time machines.
-
-## Tags
-
-* assigned: pjotrp
-* status: in progress
-* priority: medium
-* type: system administration
-* keywords: systems, tux02, production
-
-## Tasks
-
-* [X] update guix guix-1.3.0-9.f743f20
-* [X] set up nginx (Debian)
-* [X] test ipmi console (172.23.30.40)
-* [X] test ports (nginx)
-* [?] set up network for external tux02e.uthsc.edu (128.169.4.52)
-* [X] set up deployment environment
-* [X] sheepdog copy database backup from tux01 on a daily basis using ibackup user
-* [X] same for GN2 production environment
-* [X] sheepdog borg the backups
-* [X] start GN2 production services
-* [X] add GN3 aliases server
-* [X] add Genenetwork3 service - env FLASK_APP="main.py" flask run --port=8087
-* [X] add proxy
-* [ ] set up databases
-* [ ] set up https and letsencrypt
-* [X] set up firewalling
-* [ ] set up systemd
-* [ ] set up logrotate for production log files
-* [ ] run git automatically on /etc and backup without passwords
-* [ ] add borg backups
-* [ ] create a check list for manual testing
-* [ ] look at performance
-
-## Info
-
-We have a protocol for updating GN2 on Tux02.
-
-### Restore database from backup
-
-Databases no longer get copied. We only restore from backup: first, because these are reproducible [installs]; second, because the backup should be in a sane state(!).
-
-Restoring a database from backup takes about an hour:
-
-```
-root@tux02:/export3/backup/tux01/borg# borg extract borg-tux01::borg-backup-mariadb-20211024-03:09-Sun --progress
-```
-
-Next, move the dir to fast storage.
-
-#### Symlink /var/lib/mysql
-
-The database directory is symlinked, so you can point /var/lib/mysql at the recovered backup. Restart the DB and run mysql_upgrade, followed by our tests. E.g.
-
-```
-systemctl stop mariadb
-ln -s /export2/mysql/borg-backup-mariadb-20211024-03\:09-Sun /var/lib/mysql
-systemctl start mariadb
-/usr/local/guix-profiles/gn-latest-20211021/bin/mysql_upgrade -u webqtlout -pwebqtlout
-/export/backup/scripts/tux02/system_check.sh
-```
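
For reference, the restore steps in the deleted tux02-production.gmi above consolidate into roughly the following sketch. This is not a definitive procedure: the archive name, the guix profile version and the check script path are the example values from that file, and EXTRACTED_DIR is a hypothetical placeholder, since the path borg recreates depends on how the backup was made.

```
# Sketch only: consolidates the Tux02 MariaDB restore described above.
# Adjust the archive name and paths for the backup actually being restored.
set -e

cd /export3/backup/tux01/borg
borg extract borg-tux01::borg-backup-mariadb-20211024-03:09-Sun --progress

# Move the extracted data directory to fast storage (EXTRACTED_DIR is an
# assumption; borg recreates the archived paths under the current directory).
EXTRACTED_DIR=./var/lib/mysql
mv "$EXTRACTED_DIR" /export2/mysql/borg-backup-mariadb-20211024-03:09-Sun

# Re-point the /var/lib/mysql symlink at the recovered copy and restart.
systemctl stop mariadb
ln -sfn /export2/mysql/borg-backup-mariadb-20211024-03:09-Sun /var/lib/mysql
systemctl start mariadb

# Upgrade the system tables and run the GeneNetwork sanity checks.
/usr/local/guix-profiles/gn-latest-20211021/bin/mysql_upgrade -u webqtlout -pwebqtlout
/export/backup/scripts/tux02/system_check.sh
```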