A template repository to set up automated AI training machines using fluence and walrus
This template repository provides a baseline on which to build virtual machine images that can be used to train AI models. The images target fluence.dev VMs and use wal.app to host training datasets and output models. By reusing a single image, multiple compute instances can be set up to scale horizontally and achieve high training throughput. The project itself is not a VM image, but a template to customize and build upon. A demo application is provided, along with extra utilities such as a notification bot and the walrus nix package.
This started as a repository with a nix configuration (a nix flake) that can be used to configure, run (in qemu), test and generate a VM image using nixos-generators. This makes it easy to produce qcow2, raw or kubevirt images with just a flag change.
Once that was set up, I had to make walrus work on nixos. No nix package existed for walrus or for its dependency sui, so I had to create both. Instead of rebuilding the packages from scratch (which takes quite some time), I reused the ubuntu binaries and fixed their linking for nixos. With those packages, I could push and pull data from the command line. The HTTP API was also used for this, and users now have the ability to use their tool of preference. I also created some scripts that ease the process of uploading and downloading assets from walrus.
I then configured the system to pull the training data upon boot, start a simple machine-learning process with that data, upload the resulting model to walrus storage and finally notify the user of completion. The repository also contains several related utilities, such as a telegram bot to aid with notifications, secrets management with agenix, an example caddy configuration to set up a reverse proxy, and so on.
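The boot-time flow described above (pull the dataset, train, upload the model, notify) can be sketched roughly as follows. This is an illustrative sketch, not the repository's actual scripts: the endpoint hosts are placeholders, the `/v1/blobs` paths and the `newlyCreated` / `alreadyCertified` response shape are assumptions based on the public Walrus HTTP API and should be checked against the current documentation, and all function names are hypothetical. The training step is passed in as a plain callable.

```python
import json
import urllib.request
from typing import Callable

# Placeholder hosts (assumption): point these at the Walrus aggregator and
# publisher you actually use.
AGGREGATOR = "https://aggregator.example.com"
PUBLISHER = "https://publisher.example.com"


def blob_url(aggregator: str, blob_id: str) -> str:
    """Build the download URL for a blob (path layout is an assumption)."""
    return f"{aggregator.rstrip('/')}/v1/blobs/{blob_id}"


def extract_blob_id(store_response: dict) -> str:
    """Read the blob id from a store response (assumed response shape)."""
    if "newlyCreated" in store_response:
        return store_response["newlyCreated"]["blobObject"]["blobId"]
    if "alreadyCertified" in store_response:
        return store_response["alreadyCertified"]["blobId"]
    raise ValueError("unrecognized store response")


def download_blob(blob_id: str, aggregator: str = AGGREGATOR) -> bytes:
    """Fetch the raw bytes of a blob over HTTP."""
    with urllib.request.urlopen(blob_url(aggregator, blob_id)) as resp:
        return resp.read()


def upload_blob(data: bytes, publisher: str = PUBLISHER) -> str:
    """Store bytes through the publisher and return the resulting blob id."""
    req = urllib.request.Request(
        f"{publisher.rstrip('/')}/v1/blobs", data=data, method="PUT"
    )
    with urllib.request.urlopen(req) as resp:
        return extract_blob_id(json.load(resp))


def notify_telegram(token: str, chat_id: str, text: str) -> None:
    """Send a message through the Telegram Bot API sendMessage method."""
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode()
    req = urllib.request.Request(
        f"https://api.telegram.org/bot{token}/sendMessage",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()


def run_pipeline(
    dataset_blob_id: str,
    download: Callable[[str], bytes],
    train: Callable[[bytes], bytes],
    upload: Callable[[bytes], str],
    notify: Callable[[str], None],
) -> str:
    """Run the four steps in order and return the id of the stored model."""
    dataset = download(dataset_blob_id)
    model = train(dataset)
    model_blob_id = upload(model)
    notify(f"Training finished, model stored as blob {model_blob_id}")
    return model_blob_id
```

On a real machine this could be wired up as `run_pipeline(dataset_id, download_blob, my_train, upload_blob, lambda text: notify_telegram(token, chat_id, text))`, where `dataset_id`, `my_train`, `token` and `chat_id` are supplied by the user. Passing the steps as callables keeps the orchestration testable without network access.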