| author | Pjotr Prins | 2025-10-04 19:58:03 +0200 |
|---|---|---|
| committer | Pjotr Prins | 2026-01-05 11:12:10 +0100 |
| commit | 9afb63ed04f792a1f27cfc1cfed87a7513ec40c7 (patch) | |
| tree | f6c3729e9ea087beffb961f6f8181a45bf625dfb /topics/systems/linux | |
| parent | e829491a214fc17557aec04b623766f870210321 (diff) | |
Installing L4 GPU
Diffstat (limited to 'topics/systems/linux')
| -rw-r--r-- | topics/systems/linux/GPU-on-balg01.gmi | 119 |
1 files changed, 119 insertions, 0 deletions
diff --git a/topics/systems/linux/GPU-on-balg01.gmi b/topics/systems/linux/GPU-on-balg01.gmi
new file mode 100644
index 0000000..d52ec26
--- /dev/null
+++ b/topics/systems/linux/GPU-on-balg01.gmi
@@ -0,0 +1,119 @@
+# Installing GPU on Balg01 server
+
+lspci shows the card, an L4
+
+=> https://www.techpowerup.com/gpu-specs/l4.c4091
+
+```
+lspci|grep NVIDIA
+NVIDIA Corporation AD104GL
+```
+
+The machine had raspi and Tesla support installed (?!), so I removed that:
+
+```
+apt-get remove firmware-nvidia-tesla-gsp
+```
+
+Disabled nouveau drivers
+
+```/etc/modprobe.d/blacklist-nouveau.conf
+blacklist nouveau
+options nouveau modeset=0
+```
+
+```
+dpkg --purge raspi-firmware
+update-initramfs -u
+reboot (can skip for a bit)
+```
+
+## Create fallback boot partition
+
+Well, before rebooting I should have created another fallback boot partition with a more recent Debian.
+Unfortunately I had not prepared space on one of the disks (something I normally do). Turned out /dev/sdc on /export3 was not really used lately, so I could move that data and reuse that partition.
+
+```
+/dev/sdc1 1.8T 552G 1.2T 33% /export3
+```
+
+it is a very slow drive (btw), not sure why. I ran badblocks but it does not make a difference. The logs show:
+
+```
+Oct 04 09:34:37 balg01 kernel: I/O error, dev sdc, sector 23392285 op 0x9:(WRITE_ZEROES) flags 0x8000000 >
+O
+```
+
+but it looks more like a driver problem than an actual disk error. Well, maybe on the new Debian install it will be fine.
+At this point it is just to install a fallback boot partition, so no real worries.
+
+On using debootstrap, grub etc. the old partition came back fine and I tested I can also boot into the new Debian install. Especially with remote servers this is a great comfort.
+
+## CUDA continued
+
+Now that we have a fallback boot partition it is a bit easier to mess with CUDA drivers.
+
+To install the CUDA drivers you may need to disable 'secure boot' in the BIOS.
+
+```
+apt install build-essential gcc make cmake dkms
+apt install linux-headers-$(uname -r)
+```
+
+In the Debian driver selector, choose data center and L series: Driver Version:580.95.05 CUDA Toolkit:13.0 Release Date:Wed Oct 01, 2025 File Size:844.44 MB
+
+Note I installed the nvidia-open drivers. If things are not working we should look at the proprietary stuff.
+
+```
+balg01:~# nvidia-smi
+Sat Oct 4 11:56:19 2025
++-----------------------------------------------------------------------------------------+
+| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
++-----------------------------------------+------------------------+----------------------+
+| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
+| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
+|                                         |                        |               MIG M. |
+|=========================================+========================+======================|
+|   0  NVIDIA L4                      Off |   00000000:81:00.0 Off |                    0 |
+| N/A   57C    P0             29W /   72W |       0MiB /  23034MiB |      2%      Default |
+|                                         |                        |                  N/A |
++-----------------------------------------+------------------------+----------------------+
+
++-----------------------------------------------------------------------------------------+
+| Processes:                                                                              |
+|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
+|        ID   ID                                                               Usage      |
+|=========================================================================================|
+|  No running processes found                                                             |
++-----------------------------------------------------------------------------------------+
+```
+
+## Testing GPU
+
+
+Using Guix python I ran:
+
+```
+pip install "gpu-benchmark-tool[nvidia]"
+```
+
+of course it downloads a ridiculous amount of binaries... But then we can run
+
+```
+export PATH=/home/wrk/.local/bin:$PATH
+gpu-benchmark benchmark --duration=30
+```
+
+that did not work. CUDA samples are packaged in Debian and require building the scripts:
+
+```
+apt-get install nvidia-cuda-samples nvidia-cuda-toolkit-gcc
+cd /usr/share/doc/nvidia-cuda-toolkit/examples/Samples/6_Performance/transpose
+export CUDA_PATH=/usr
+make
+./transpose
+> [NVIDIA L4] has 58 MP(s) x 128 (Cores/MP) = 7424 (Cores)
+> Compute performance scaling factor = 1.00
+...
+Test passed
+```
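
The checks in this note (nouveau gone, driver answering) could be bundled into one small script to rerun after a reboot or a kernel upgrade. What follows is only a sketch and not part of the commit: `check_gpu_setup` is a hypothetical helper name, and the chosen `--query-gpu` fields are just one reasonable selection.

```shell
#!/bin/sh
# Sketch of a post-reboot check for the steps in this note.
# check_gpu_setup is a hypothetical helper name, not part of any package.
check_gpu_setup() {
    # nouveau should no longer be loaded once it is blacklisted.
    # If lsmod itself is missing this also reports "not loaded".
    if lsmod 2>/dev/null | grep -q '^nouveau '; then
        echo "nouveau: still loaded"
    else
        echo "nouveau: not loaded"
    fi

    # nvidia-smi is only on the PATH once the driver packages are installed.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
    else
        echo "nvidia-smi: not found"
    fi
}

check_gpu_setup
```

On a machine without the driver the second line reads `nvidia-smi: not found`, which makes the script safe to run from the fallback boot partition as well.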
