Many small GPUs, one big model. Production LLM inference, sharded across consumer cards.
ComputePool is a sharded inference network that turns idle consumer GPUs into production AI infrastructure. A Llama-3 70B model needs ~140 GB of VRAM in fp16; an RTX 4090 has 24 GB. So inference today funnels to three hyperscalers while hundreds of millions of capable consumer cards sit idle; each one alone is too small to host a real model.
We split the model layer-wise across two or more cards. The entry shard holds embeddings plus the first half of transformer blocks; the exit shard holds the second half plus the lm_head. Hidden-state activations stream peer-to-peer over Gensyn's AXL transport; sampled tokens come back the same way. The orchestrator never touches activations. Throughput stays within noise of a single GPU: the per-token network hop costs single-digit milliseconds against a forward pass that takes tens.
Operators get paid per second of compute. A user opens a session with one x402 voucher; the orchestrator opens a Superfluid USDCx stream that flows to every operator in the coalition while inference runs, and stops the moment the request closes. KeeperHub workflows drive every state transition — propose, activate, stream start/stop, slash — so the full payment lifecycle is auditable and retry-safe.
To make it work we shipped infrastructure back to the sponsor stacks: CREATE2-deployed verified Superfluid contracts on 0G Galileo (the chain's first per-second money primitive), upstream PRs to KeeperHub for a Superfluid plugin and a Coalition plugin (multi-workflow consensus), and a turnkey AXL deployment pattern with prebuilt NVIDIA + CPU Docker images plus Tailscale-native networking (zero exposed ports). Each pool is also minted as an ERC-7857 INFT with encrypted intelligence on 0G Storage. The orchestrator runs inside a TEE so 0G Compute's signing and attestation flow stays intact.
Live on 0G Galileo testnet. Run `make build && make up` to bring up the cluster locally; `python scripts/e2e_demo.py` runs the end-to-end payment + sharded-inference demo.
The stack. The control plane is Python (FastAPI + Motor for async MongoDB) running an orchestrator that exposes a REST API plus an OpenAI-compatible streaming router (`orchestrator/api/openai_compat.py`). Each worker is a separate Python container running FastAPI with HuggingFace Transformers / PyTorch (bfloat16 on CPU by default, CUDA-ready) plus a Go binary, Gensyn's AXL daemon, running side-by-side as the P2P data plane. The frontend is Next.js 16 / React 19 / Tailwind v4. Smart contracts (`PoolINFT.sol`, `Coalition.sol`) are Foundry, deployed to 0G Galileo testnet. Everything is containerized in a multi-stage Dockerfile (the Go stage builds AXL; the Python stage layers the worker + orchestrator on top).
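For orientation, here is roughly what talking to a pool looks like from the client side. A minimal sketch using the `openai` SDK; the base URL, port, and model id are placeholders, with the real values coming from your orchestrator config.

```python
# Minimal client sketch against the OpenAI-compatible router.
# Base URL, port, and model id are assumptions, not the shipped defaults.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

stream = client.chat.completions.create(
    model="llama-3-70b-pool",  # hypothetical pool model id
    messages=[{"role": "user", "content": "Hello from a sharded pool"}],
    stream=True,
)
for chunk in stream:
    # each chunk carries a token sampled on the exit shard
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```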
The token loop. The orchestrator never touches activations. On `/pools/{n}/infer`, the entry worker tokenizes the prompt, runs `forward_entry` over its layer slice (embed + first half of transformer blocks), packs the resulting hidden-state tensor with our own binary framing (4-byte LE header length, JSON header, raw bytes), and `axl.send`s it to the exit peer. The exit worker, in a background `exit_loop` async task, polls `axl.recv`, runs `forward_exit` over its slice (second half + lm_head), samples the next token, and `axl.send`s it back. Both sides keep a HuggingFace `DynamicCache` keyed by `request_id`, dropped on `unload` or on receipt of a `control/end` frame.
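The frame layout is simple enough to show inline. A minimal pack/unpack sketch, assuming a header that carries `request_id`, dtype, and shape (the exact `worker/framing.py` schema may differ); shipping the raw storage bytes instead of round-tripping through numpy is what keeps bf16 tensors safe, since numpy has no bfloat16 dtype.

```python
# Sketch of the wire format: 4-byte LE header length, JSON header, raw bytes.
# Header fields are illustrative, not the exact worker/framing.py schema.
import json
import struct

import torch

def pack_frame(tensor: torch.Tensor, request_id: str) -> bytes:
    # view(torch.uint8) reinterprets the storage, so bf16 never has to
    # pass through numpy (which has no bfloat16 dtype)
    payload = tensor.detach().cpu().contiguous().view(torch.uint8).numpy().tobytes()
    header = json.dumps({
        "request_id": request_id,
        "dtype": str(tensor.dtype),   # e.g. "torch.bfloat16"
        "shape": list(tensor.shape),
    }).encode()
    return struct.pack("<I", len(header)) + header + payload

def unpack_frame(frame: bytes) -> tuple[torch.Tensor, dict]:
    (hlen,) = struct.unpack("<I", frame[:4])
    header = json.loads(frame[4:4 + hlen])
    raw = torch.frombuffer(bytearray(frame[4 + hlen:]), dtype=torch.uint8)
    dtype = getattr(torch, header["dtype"].removeprefix("torch."))
    return raw.view(dtype).reshape(header["shape"]), header
```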
Gensyn — AXL. AXL is the entire data plane between shards. Without it we'd need a TURN server or a centralized relay; with it, every operator gets an encrypted, decentralized comms layer via a single binary on localhost. We wrote a thin async HTTP wrapper (`worker/axl_client.py`) around AXL's `/send`, `/recv`, `/topology` endpoints, plus a custom binary frame format (`worker/framing.py`) that's bf16-safe. For deployment we bundled AXL with prebuilt NVIDIA + CPU Docker images and added a Tailscale Compose overlay (`docker-compose.tailscale.yml`) so operators expose zero public ports; AXL still gets a routable peer address via the tailnet (100.x.x.x) and its handshake completes peer-to-peer over that. Multi-node by construction: `make up` brings up two AXL daemons in separate containers, and `scripts/run-remote-worker.sh` joins a third worker on a remote host with one command.
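The wrapper itself is thin. A sketch of its shape, assuming the daemon serves its HTTP API on a localhost port and frames travel base64-encoded in JSON bodies (both are assumptions; `worker/axl_client.py` is the real thing):

```python
# Sketch of a thin async wrapper over the AXL daemon's local HTTP API.
# Port and request/response schemas are assumptions for illustration.
import base64

import httpx

class AxlClient:
    def __init__(self, base_url: str = "http://127.0.0.1:5000"):
        self._http = httpx.AsyncClient(base_url=base_url, timeout=30.0)

    async def send(self, peer: str, frame: bytes) -> None:
        # frames are opaque bytes to AXL; base64 keeps the JSON body clean
        r = await self._http.post(
            "/send", json={"to": peer, "data": base64.b64encode(frame).decode()}
        )
        r.raise_for_status()

    async def recv(self, timeout_s: float = 1.0) -> bytes | None:
        # poll-based receive, matching the exit worker's recv loop
        r = await self._http.get("/recv", params={"timeout": timeout_s})
        r.raise_for_status()
        body = r.json()
        return base64.b64decode(body["data"]) if body.get("data") else None

    async def topology(self) -> dict:
        r = await self._http.get("/topology")
        r.raise_for_status()
        return r.json()
```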
0G — chain, storage, compute, INFT. ComputePool runs on 0G Galileo testnet (chainId 16602). Three concrete usages of 0G:
- 0G Storage holds encrypted INFT intelligence blobs (`orchestrator/inft/storage_0g.py` + `crypto.py`). Each pool is minted as an ERC-7857 PoolINFT (`contracts/src/PoolINFT.sol`) whose intelligence lives encrypted on 0G Storage; the on-chain INFT references the content hash and the canonical metadata.
- 0G Compute — we register the coalition as an attested provider via `scripts/register_0g_provider.py` against the 0G router ABI (`scripts/0g_router.abi.json`). The orchestrator runs inside a TEE (`orchestrator/tee/attestation.py`, `tee/signer.py`); 0G Compute verifies the attestation quote, so a coalition of N consumer GPUs presents to 0G as one attested provider — no protocol downgrade.
- 0G Chain — we CREATE2-deployed and source-verified the full Superfluid stack (Host, agreements, factories, GDAv1Forwarder, CFAv1Forwarder, USDCx via SuperTokenFactory). Per-second money streams are now a public primitive on 0G; anyone in the ecosystem can call them. (The address derivation is sketched after this list.)
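The CREATE2 part is what makes the deploy reproducible: the address is a pure function of deployer, salt, and init-code hash, so the same inputs yield the same verified addresses on any chain. The standard derivation, with placeholder salt and bytecode rather than the real Superfluid inputs:

```python
# CREATE2 address derivation: keccak256(0xff ++ deployer ++ salt ++
# keccak256(init_code))[12:]. Inputs below are placeholders.
from eth_utils import keccak, to_checksum_address

def create2_address(deployer: str, salt: bytes, init_code: bytes) -> str:
    assert len(salt) == 32
    preimage = (
        b"\xff"
        + bytes.fromhex(deployer.removeprefix("0x"))
        + salt
        + keccak(init_code)
    )
    return to_checksum_address(keccak(preimage)[12:])

# Example with the widely used deterministic-deployment proxy as deployer
# and stand-in salt/bytecode (not the real Superfluid values):
print(create2_address(
    "0x4e59b44847b379578588920cA78FbF26c0B4956C",
    salt=bytes(32),
    init_code=bytes.fromhex("6000"),
))
```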
KeeperHub. Every payment-side state transition runs through a KeeperHub workflow. Five JSON exports live in `keeperhub/`: `propose`, `activate-and-pool`, `stream-start`, `stream-stop`, `handle-breach`. The orchestrator drives them via KH's MCP/JSON-RPC endpoint (`orchestrator/keeperhub.py`), and KH calls back via webhooks (`orchestrator/webhooks.py` + `webhook_verifier.py` for HMAC). On top of using KH we contributed two upstream PRs — a Superfluid plugin so any KH workflow can speak `createPool`, `updateMemberUnits`, `distributeFlow` natively, and a Coalition plugin for multi-workflow consensus where N independent workflow runs vote on a result before the keeper acts. Specs at `PRD-2-superfluid-plugin.md` and `PRD-1-coalition-plugin.md`. Agents pay autonomously via x402: the OpenAI-compat router returns HTTP 402 with the challenge, the agent signs and replays.
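The webhook check is the usual HMAC-over-raw-body pattern. A sketch assuming KH signs the raw body with a shared secret and sends a hex SHA-256 digest in a signature header (header name and scheme are assumptions; `webhook_verifier.py` defines the real contract):

```python
# Sketch of HMAC webhook verification. Header name, signing scheme, and
# route are assumptions; webhook_verifier.py is the source of truth.
import hashlib
import hmac
import os

from fastapi import FastAPI, Header, HTTPException, Request

app = FastAPI()
SECRET = os.environ["KH_WEBHOOK_SECRET"].encode()  # hypothetical env var

@app.post("/webhooks/keeperhub")
async def keeperhub_webhook(request: Request, x_kh_signature: str = Header(...)):
    body = await request.body()  # verify the raw bytes, not the parsed JSON
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    # compare_digest avoids timing side channels on the signature check
    if not hmac.compare_digest(expected, x_kh_signature):
        raise HTTPException(status_code=401, detail="bad signature")
    # ...dispatch on the workflow event payload here
    return {"ok": True}
```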
x402 + Superfluid bond. A single signed x402 voucher (EIP-3009 `transferWithAuthorization` on a USDC mock) opens the session via our self-hosted facilitator (`facilitator/`); once settlement clears, the orchestrator fires the KH `stream-start` workflow, which calls GDA `distributeFlow` at the negotiated rate, paying every operator in the coalition by the second. On EOS or client disconnect, the orchestrator fires `stream-stop` and the meter halts to the second.
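From the agent's side the handshake is a two-step: request, receive a 402 challenge, sign, replay. A client-side sketch; the payment header name and challenge fields are assumptions, and `sign_voucher` stands in for wallet code that produces the signed EIP-3009 authorization:

```python
# Client-side shape of the x402 handshake. Header name and challenge
# fields are assumptions; the x402 spec and our facilitator define the
# real wire format.
import httpx

def paid_request(url: str, payload: dict, sign_voucher) -> httpx.Response:
    r = httpx.post(url, json=payload)
    if r.status_code != 402:
        return r  # already paid, or the route is free
    challenge = r.json()  # amount, asset, pay-to address, nonce, ...
    # sign_voucher wraps the wallet: it returns a serialized, signed
    # EIP-3009 transferWithAuthorization voucher for the challenged amount
    voucher = sign_voucher(challenge)
    return httpx.post(url, json=payload, headers={"X-Payment": voucher})
```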
Notable hacks worth mentioning:
- Tied-weight detach on exit shard. Llama / Qwen tie `lm_head.weight` to `embed_tokens.weight`. The exit shard only loads the back half of the model — but the embed lives on the entry shard. We explicitly clone-and-detach the lm_head weight before `del`-ing the original full model so the exit survives (a sketch follows this list).
- TEE-in-Docker. Running an SGX/TDX enclave under Docker Compose for the orchestrator — getting the device passthrough right and the attestation quote round-tripped to 0G Compute's verifier — was the longest single yak-shave of the build.
- CREATE2 Superfluid deploy on 0G. Producing deterministic verified addresses for the entire Superfluid stack on a fresh chain meant generating salts, replaying the canonical bytecode, and re-verifying every upgradeable proxy on the 0G explorer. Worth it: the contracts are now public infrastructure.
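For the first hack above, the detach itself is a few lines. A sketch assuming HuggingFace Llama module names; treat it as illustrative rather than the exact worker code:

```python
# Sketch of the tied-weight detach on the exit shard. Llama/Qwen tie
# lm_head.weight to embed_tokens.weight, so a naive back-half slice leaves
# lm_head aliasing embedding storage the exit shard never keeps.
import torch

def detach_exit_lm_head(full_model) -> torch.nn.Linear:
    # clone-and-detach before `del full_model`: the clone owns fresh storage,
    # so the projection survives the shared embed table being freed
    w = full_model.lm_head.weight.detach().clone()
    head = torch.nn.Linear(w.shape[1], w.shape[0], bias=False, dtype=w.dtype)
    head.weight = torch.nn.Parameter(w, requires_grad=False)
    return head
```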

