Diffstat (limited to 'topics/systems/linux')
| -rw-r--r-- | topics/systems/linux/GPU-on-balg01.gmi | 201 |
| -rw-r--r-- | topics/systems/linux/add-boot-partition.gmi | 52 |
| -rw-r--r-- | topics/systems/linux/adding-nvidia-drivers-penguin2.gmi | 74 |
3 files changed, 327 insertions, 0 deletions
diff --git a/topics/systems/linux/GPU-on-balg01.gmi b/topics/systems/linux/GPU-on-balg01.gmi
new file mode 100644
index 0000000..d0cb3fc
--- /dev/null
+++ b/topics/systems/linux/GPU-on-balg01.gmi
@@ -0,0 +1,201 @@
+# Installing GPU on Balg01 server
+
+lspci shows the card, an L4
+
+=> https://www.techpowerup.com/gpu-specs/l4.c4091
+
+```
+lspci|grep NVIDIA
+NVIDIA Corporation AD104GL
+```
+
+The machine had raspi and Tesla support installed (?!), so I removed that:
+
+```
+apt-get remove firmware-nvidia-tesla-gsp
+```
+
+Disabled nouveau drivers
+
+```/etc/modprobe.d/blacklist-nouveau.conf
+blacklist nouveau
+options nouveau modeset=0
+```
+
+```
+dpkg --purge raspi-firmware
+update-initramfs -u
+reboot (can skip for a bit)
+```
+
+## Create fallback boot partition
+
+Well, before rebooting I should have created another fallback boot partition with a more recent Debian.
+Unfortunately I had not prepared space on one of the disks (something I normally do). It turned out /dev/sdc on /export3 was not really used lately, so I could move that data and reuse that partition.
+
+```
+/dev/sdc1 1.8T 552G 1.2T 33% /export3
+```
+
+It is a very slow drive (btw), not sure why. I ran badblocks but it did not make a difference. The logs show:
+
+```
+Oct 04 09:34:37 balg01 kernel: I/O error, dev sdc, sector 23392285 op 0x9:(WRITE_ZEROES) flags 0x8000000 >
+O
+```
+
+but it looks more like a driver problem than an actual disk error. Well, maybe on the new Debian install it will be fine.
+At this point it only needs to hold a fallback boot partition, so no real worries.
+
+After using debootstrap, grub etc. the old partition came back fine, and I tested that I can also boot into the new Debian install. Especially with remote servers this is a great comfort.
+
+## CUDA continued
+
+Now that we have a fallback boot partition it is a bit easier to mess with CUDA drivers.
+
+To install the CUDA drivers you may need to disable 'secure boot' in the BIOS.
+
+```
+apt install build-essential gcc make cmake dkms
+apt install linux-headers-$(uname -r)
+```
+
+The Debian selector: choose data center and L series. Driver Version: 580.95.05, CUDA Toolkit: 13.0, Release Date: Wed Oct 01, 2025, File Size: 844.44 MB
+
+Note I installed the nvidia-open drivers. If things are not working we should look at the proprietary stuff. I used the 'local repository installation' instructions at
+
+=> https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/index.html#debian-installation
+
+
+```
+apt-get install nvidia-libopencl1 nvidia-open nvidia-driver-cuda
+```
+
+The first package is there to prevent
+
+```
+libnppc11 : Conflicts: nvidia-libopencl1
+```
+
+Now this should run
+
+```
+balg01:~# nvidia-smi
+Sat Oct  4 11:56:19 2025
++-----------------------------------------------------------------------------------------+
+| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
++-----------------------------------------+------------------------+----------------------+
+| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
+| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
+|                                         |                        |               MIG M. |
+|=========================================+========================+======================|
+|   0  NVIDIA L4                      Off |   00000000:81:00.0 Off |                    0 |
+| N/A   57C    P0             29W /   72W |       0MiB /  23034MiB |      2%      Default |
+|                                         |                        |                  N/A |
++-----------------------------------------+------------------------+----------------------+
+
++-----------------------------------------------------------------------------------------+
+| Processes:                                                                              |
+|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
+|        ID   ID                                                               Usage      |
+|=========================================================================================|
+|  No running processes found                                                             |
++-----------------------------------------------------------------------------------------+
+```
+
+## Testing GPU
+
+
+Using Guix python I ran:
+
+```
+pip install "gpu-benchmark-tool[nvidia]"
+```
+
+Of course it downloads a ridiculous number of binaries... But then we can run
+
+```
+export PATH=/home/wrk/.local/bin:$PATH
+gpu-benchmark benchmark --duration=30
+```
+
+That did not work. The CUDA samples are packaged in Debian and require building the samples:
+
+```
+apt-get install nvidia-cuda-samples nvidia-cuda-toolkit-gcc
+cd /usr/share/doc/nvidia-cuda-toolkit/examples/Samples/6_Performance/transpose
+export CUDA_PATH=/usr
+make
+./transpose
+> [NVIDIA L4] has 58 MP(s) x 128 (Cores/MP) = 7424 (Cores)
+> Compute performance scaling factor = 1.00
+...
+Test passed
+```
+
+Note that this removed nvidia-smi. Let's look at versions:
+
+```
+pool/non-free/n/nvidia-graphics-drivers/nvidia-libopencl1_535.247.01-1~deb12u1_amd64.deb
+pool/contrib/n/nvidia-cuda-samples/nvidia-cuda-samples_11.8~dfsg-2_all.deb
+pool/non-free/n/nvidia-cuda-toolkit/nvidia-cuda-toolkit-gcc_11.8.0-5~deb12u1_amd64.deb
+pool/non-free/n/nvidia-graphics-drivers/nvidia-libopencl1_535.247.01-1~deb12u1_amd64.deb
+```
+
+while
+
+```
+Filename: ./nvidia-open_580.95.05-1_amd64.deb
+Package: nvidia-driver-cuda
+Version: 580.95.05-1
+Section: NVIDIA
+Source: nvidia-graphics-drivers
+Provides: nvidia-cuda-mps, nvidia-smi
+```
+
+and it turns out to be a mixture. I have to take real care not to mix in Debian packages! For example this package is a Debian original:
+
+```
+ii nvidia-cuda-gdb 11.8.86~11.8.0-5~deb12u1 amd64 NVIDIA CUDA Debugger (GDB)
+```
+
+```
+apt remove --purge 'nvidia-*' 'cuda-*' 'libnvidia-*'
+```
+
+says
+
+```
+Note, selecting 'libnvidia-gpucomp' instead of 'libnvidia-gpucomp-580.95.05'
+```
+
+To view installed packages belonging to Debian itself:
+
+```
+dpkg -l|grep nvid|grep deb12
+dpkg -l|grep cuda|grep deb12
+```
+
+Let's reinstall and make sure only NVIDIA packages are used:
+
+```
+wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
+dpkg -i cuda-keyring_1.1-1_all.deb
+apt-get update
+apt-get install cuda-toolkit cuda-compiler-12-2
+```
+
+Now we have:
+
+```
+/usr/local/cuda-12.3/bin/nvcc --version
+nvcc: NVIDIA (R) Cuda compiler driver
+Copyright (c) 2005-2023 NVIDIA Corporation
+Built on Wed_Nov_22_10:17:15_PST_2023
+```
+
+# PyTorch
+
+The CUDA environment variables for PyTorch are probably useful:
+
+=> https://docs.pytorch.org/docs/stable/cuda_environment_variables.html
diff --git a/topics/systems/linux/add-boot-partition.gmi b/topics/systems/linux/add-boot-partition.gmi
new file mode 100644
index 0000000..564e044
--- /dev/null
+++ b/topics/systems/linux/add-boot-partition.gmi
@@ -0,0 +1,52 @@
+# Add (2nd) boot and other partitions
+
+As we handle machines remotely it is often useful to have a secondary boot partition that can be used from grub.
+
+Basically, create a similarly sized boot partition on a different disk and copy the running one over with:
+
+```
+parted -a optimal /dev/sdb
+(parted) p
+Model: NVMe CT4000P3SSD8 (scsi)
+Disk /dev/sdb: 4001GB
+Sector size (logical/physical): 512B/512B
+Partition Table: gpt
+Disk Flags:
+
+Number Start End Size File system Name Flags
+ 1 32.0GB 4001GB 3969GB ext4 bulk
+
+(parted) rm 1
+mklabel gpt
+mkpart fat23 1 1GB
+set 1 esp on
+align-check optimal 1
+mkpart ext4 1GB 32GB
+mkpart swap 32GB 48GB
+set 2 boot on # other flags are raid, swap, lvm
+set 3 swap on
+mkpart scratch 48GB 512GB
+mkpart ceph 512GB -1
+```
+
+We also took the opportunity to create a new scratch partition (for moving things around) and a ceph partition (for testing).
+Resulting in
+
+```
+Number Start End Size File system Name Flags
+ 1 1049kB 1000MB 999MB fat23 boot, esp
+ 2 1000MB 24.0GB 23.0GB ext4 boot, esp
+ 3 24.0GB 32.0GB 8001MB swap swap
+ 4 32.0GB 512GB 480GB ext4 scratch
+ 5 512GB 4001GB 3489GB ceph
+```
+
+Now that the drive is ready we can copy the existing boot partitions. Make sure you don't get the direction wrong and that the target partition is larger.
+Here the original boot disk is /dev/sda (894GB). We copy that to the new disk /dev/sdb (3.64TB).
+
+```
+root@tux05:/home/wrk# dd if=/dev/sda1 of=/dev/sdb1
+root@tux05:/home/wrk# dd if=/dev/sda2 of=/dev/sdb2
+```
+
+Next, test mount the dirs and reboot. You may want to run e2fsck and resize2fs on the new partitions (or their equivalent if you use xfs or something).
diff --git a/topics/systems/linux/adding-nvidia-drivers-penguin2.gmi b/topics/systems/linux/adding-nvidia-drivers-penguin2.gmi
new file mode 100644
index 0000000..81e721f
--- /dev/null
+++ b/topics/systems/linux/adding-nvidia-drivers-penguin2.gmi
@@ -0,0 +1,74 @@
+# GPU Graphics Driver Set-Up
+
+Tux02 has the Tesla K80 (GK210GL) GPU. For machine learning, we want the official proprietary NVIDIA drivers.
+
+## Installation
+
+* Debian 12 moved the NVIDIA driver into the non-free-firmware repo. Add the following to "/etc/apt/sources.list" and run "sudo apt update":
+
+```
+deb http://deb.debian.org/debian/ bookworm main contrib non-free non-free-firmware
+```
+
+* Make sure the correct kernel headers are installed:
+
+```
+sudo apt install linux-headers-$(uname -r)
+```
+
+* Install "nvidia-tesla-470-driver"⁰ (The NVIDIA line-up of programmable "Tesla" devices, used primarily for simulations and large-scale calculations, also require separate driver packages to function correctly compared to the consumer-grade GeForce GPUs that are instead targeted for desktop and gaming usage)¹:
+
+```
+sudo apt purge 'nvidia-*'
+sudo apt install nvidia-tesla-470-driver
+```
+
+* Blacklist nouveau since it conflicts with NVIDIA's driver, and regenerate the initramfs with "sudo update-initramfs -u":
+
+```
+echo "blacklist nouveau" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
+echo "options nouveau modeset=0" | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf
+```
+
+* Reboot and test the NVIDIA drivers:
+
+```
+sudo reboot
+nvidia-smi
+
+# optional if you want to use nvidia-cuda-toolkit
+sudo apt install nvidia-cuda-dev nvidia-cuda-toolkit
+```
+
+## Issues
+
+Holding off on the reboot until I check in with the rest of the team regarding some initd raspi hook:
+
+```
+update-initramfs: Generating /boot/initrd.img-6.1.0-9-amd64
+raspi-firmware: missing /boot/firmware, did you forget to mount it?
+run-parts: /etc/initramfs/post-update.d//z50-raspi-firmware exited with return code 1
+dpkg: error processing package initramfs-tools (--configure):
+ installed initramfs-tools package post-installation script subprocess returned error exit status 1
+Processing triggers for libgdk-pixbuf-2.0-0:amd64 (2.42.10+dfsg-1+deb12u1) ...
+Errors were encountered while processing:
+ initramfs-tools
+```
+
+Removed the firmware by running:
+
+```
+sudo apt purge raspi-firmware
+
+# Configure all packages that are installed but not yet fully configured
+sudo dpkg --configure -a
+
+# Update initramfs since we updated our drivers
+sudo update-initramfs -u
+```
+
+## References
+
+=> https://us.download.nvidia.com/XFree86/Linux-x86_64/470.129.06/README/supportedchips.html ⁰ Nvidia 470.129.06 Supported Chipsets.
+=> https://wiki.debian.org/NvidiaGraphicsDrivers#Tesla_Drivers ¹ Debian Tesla Drivers.
+=> https://wiki.debian.org/NvidiaGraphicsDrivers/Configuration ² NVIDIA Proprietary Driver: Configuration.
