ComputePool

Many small GPUs, one big model. Production LLM inference, sharded across consumer cards.

Open Agents

Winner of the KeeperHub - Builder Feedback Bounty
Project Description

ComputePool is a sharded inference network that turns idle consumer GPUs into production AI infrastructure. A Llama-3 70B model needs ~140 GB of VRAM in fp16; an RTX 4090 has 24 GB. So inference today funnels to a handful of hyperscalers while hundreds of millions of capable consumer cards sit idle, each one alone too small to host a real model.

We split the model layer-wise across two or more cards. The entry shard holds embeddings plus the first half of transformer blocks; the exit shard holds the second half plus the lm_head. Hidden-state activations stream peer-to-peer over Gensyn's AXL transport; sampled tokens come back the same way. The orchestrator never touches activations. Throughput tracks single-GPU throughput within noise — the per-token network hop is single-digit milliseconds against a forward pass that takes tens.
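The layer-wise split can be sketched as a simple partition plan. This is a hypothetical helper for illustration (the repo's actual module layout may differ): the entry shard owns the embedding table plus the first half of blocks, the exit shard owns the rest plus the lm_head.

```python
def shard_plan(num_layers: int, num_shards: int = 2) -> list[dict]:
    """Partition transformer blocks across shards, layer-wise.

    Illustrative sketch, not the repo's API. The entry shard (index 0)
    also owns the embedding table; the last shard also owns the lm_head.
    """
    base, extra = divmod(num_layers, num_shards)
    plan, start = [], 0
    for i in range(num_shards):
        end = start + base + (1 if i < extra else 0)
        plan.append({
            "shard": i,
            "layers": range(start, end),
            "embed": i == 0,                  # entry shard holds embeddings
            "lm_head": i == num_shards - 1,   # exit shard holds lm_head
        })
        start = end
    return plan

# An 80-block model split across two cards: entry gets layers 0-39
# plus embeddings, exit gets layers 40-79 plus the lm_head.
plan = shard_plan(80, 2)
```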

Operators get paid per second of compute. A user opens a session with one x402 voucher; the orchestrator opens a Superfluid USDCx stream that flows to every operator in the coalition while inference runs, and stops the moment the request closes. KeeperHub workflows drive every state transition — propose, activate, stream start/stop, slash — so the full payment lifecycle is auditable and retry-safe.
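The payment lifecycle above (propose, activate, stream start/stop, slash) amounts to a small state machine. A minimal sketch with illustrative state names, not the actual KeeperHub workflow code:

```python
from enum import Enum, auto

class SessionState(Enum):
    PROPOSED = auto()
    ACTIVE = auto()
    STREAMING = auto()   # Superfluid stream flowing to the coalition
    CLOSED = auto()      # stream stopped, meter halted
    SLASHED = auto()     # breach path

# Legal transitions mirroring the workflow steps named in the text.
TRANSITIONS = {
    SessionState.PROPOSED: {SessionState.ACTIVE, SessionState.SLASHED},
    SessionState.ACTIVE: {SessionState.STREAMING, SessionState.SLASHED},
    SessionState.STREAMING: {SessionState.CLOSED, SessionState.SLASHED},
    SessionState.CLOSED: set(),
    SessionState.SLASHED: set(),
}

def advance(state: SessionState, nxt: SessionState) -> SessionState:
    """Reject illegal transitions so every step stays auditable and retry-safe."""
    if nxt not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state.name} -> {nxt.name}")
    return nxt
```

Encoding the transitions as data rather than scattered if-statements is what makes "every state transition is auditable" cheap to enforce.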

To make it work we shipped infrastructure back to the sponsor stacks: CREATE2-deployed verified Superfluid contracts on 0G Galileo (the chain's first per-second money primitive), upstream PRs to KeeperHub for a Superfluid plugin and a Coalition plugin (multi-workflow consensus), and a turnkey AXL deployment pattern with prebuilt NVIDIA + CPU Docker images plus Tailscale-native networking (zero exposed ports). Each pool is also minted as an ERC-7857 INFT with encrypted intelligence on 0G Storage. The orchestrator runs inside a TEE so 0G Compute's signing and attestation flow stays intact.

Live on 0G Galileo testnet. Run make build && make up to bring up the cluster locally; python scripts/e2e_demo.py runs the end-to-end payment + sharded-inference demo.

How it's Made

The stack. The control plane is Python (FastAPI + Motor for async MongoDB) running an orchestrator that exposes a REST API plus an OpenAI-compatible streaming router (orchestrator/api/openai_compat.py). Each worker is a separate Python container running FastAPI with HuggingFace Transformers / PyTorch (bfloat16 on CPU by default, CUDA-ready) plus a Go binary — Gensyn's AXL daemon — running side-by-side as the P2P data plane. The frontend is Next.js 16 / React 19 / Tailwind v4. Smart contracts (PoolINFT.sol, Coalition.sol) are Foundry, deployed to 0G Galileo testnet. Everything is containerized in a multi-stage Dockerfile (Go stage builds AXL; Python stage layers the worker + orchestrator on top).

The token loop. The orchestrator never touches activations. On /pools/{n}/infer, the entry worker tokenizes the prompt, runs forward_entry over its layer slice (embed + first half of transformer blocks), packs the resulting hidden-state tensor with our own binary framing (4-byte LE header length, JSON header, raw bytes), and axl.sends it to the exit peer. The exit worker, in a background exit_loop async task, polls axl.recv, runs forward_exit over its slice (second half + lm_head), samples the next token, and axl.sends it back. Both sides keep a HuggingFace DynamicCache keyed by request_id, dropped on unload or on receipt of a control/end frame. Throughput tracks single-GPU throughput within noise — the per-token AXL hop is single-digit milliseconds against a forward pass that takes tens.
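The frame layout named above (4-byte LE header length, JSON header, raw bytes) is simple enough to reproduce in a few lines. A sketch of the idea, not worker/framing.py verbatim:

```python
import json
import struct

def pack_frame(header: dict, payload: bytes) -> bytes:
    """4-byte little-endian header length, then JSON header, then raw bytes."""
    h = json.dumps(header).encode()
    return struct.pack("<I", len(h)) + h + payload

def unpack_frame(buf: bytes) -> tuple[dict, bytes]:
    """Inverse of pack_frame; payload bytes pass through untouched (bf16-safe)."""
    (hlen,) = struct.unpack_from("<I", buf, 0)
    header = json.loads(buf[4:4 + hlen])
    return header, buf[4 + hlen:]

# Round-trip a fake activation blob; field names are illustrative.
hdr = {"request_id": "r1", "dtype": "bfloat16", "shape": [1, 7, 8192]}
blob = b"\x00\x01" * 4
assert unpack_frame(pack_frame(hdr, blob)) == (hdr, blob)
```

Keeping the payload as opaque bytes is what makes the framing bf16-safe: the tensor never passes through JSON, only the metadata does.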

Gensyn — AXL. AXL is the entire data plane between shards. Without it we'd need a TURN server or a centralized relay; with it, every operator gets an encrypted, decentralized comms layer via a single binary on localhost. We wrote a thin async HTTP wrapper (worker/axl_client.py) around AXL's /send, /recv, /topology endpoints, plus a custom binary frame format (worker/framing.py) that's bf16-safe. For deployment we bundled AXL with prebuilt NVIDIA + CPU Docker images and added a Tailscale Compose overlay (docker-compose.tailscale.yml) so operators expose zero public ports — AXL still gets a routable peer address via the tailnet (100.x.x.x) and the AXL handshake completes peer-to-peer over that. Multi-node by construction: make up brings up two AXL daemons in separate containers, and scripts/run-remote-worker.sh joins a third worker on a remote host with one command.
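A thin wrapper like the one described can be sketched as follows. The /send and /recv endpoint names come from the text; the request/response body shapes (a peer field, a base64 data field) are assumptions for illustration, not AXL's documented wire format, and this sync version stands in for the repo's async client:

```python
import base64
import json
import urllib.request

class AxlClient:
    """Minimal sync sketch of a wrapper around a local AXL daemon."""

    def __init__(self, base_url: str = "http://127.0.0.1:8080"):
        self.base_url = base_url

    @staticmethod
    def encode_payload(peer: str, frame: bytes) -> bytes:
        # Frames are raw bytes; base64 them for JSON transport (assumed shape).
        return json.dumps({
            "peer": peer,
            "data": base64.b64encode(frame).decode(),
        }).encode()

    def send(self, peer: str, frame: bytes) -> None:
        req = urllib.request.Request(
            f"{self.base_url}/send",
            data=self.encode_payload(peer, frame),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req).read()

    def recv(self) -> bytes:
        with urllib.request.urlopen(f"{self.base_url}/recv") as resp:
            body = json.loads(resp.read())
        return base64.b64decode(body["data"])
```

Because the daemon listens on localhost and handles the P2P handshake itself, the worker code never needs to know whether the peer address is a tailnet 100.x.x.x or anything else.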

0G — chain, storage, compute, INFT. ComputePool runs on 0G Galileo testnet (chainId 16602). Three concrete usages of 0G:

  1. 0G Storage holds encrypted INFT intelligence blobs (orchestrator/inft/storage_0g.py + crypto.py). Each pool is minted as an ERC-7857 PoolINFT (contracts/src/PoolINFT.sol) whose intelligence lives encrypted on 0G Storage; the on-chain INFT references the content hash and the canonical metadata.
  2. 0G Compute — we register the coalition as an attested provider via scripts/register_0g_provider.py against the 0G router ABI (scripts/0g_router.abi.json). The orchestrator runs inside a TEE (orchestrator/tee/attestation.py, tee/signer.py); 0G Compute verifies the attestation quote, so a coalition of N consumer GPUs presents to 0G as one attested provider — no protocol downgrade.
  3. 0G Chain — we CREATE2-deployed and source-verified the full Superfluid stack (Host, agreements, factories, GDAv1Forwarder, CFAv1Forwarder, USDCx via SuperTokenFactory). Per-second money streams are now a public primitive on 0G; anyone in the ecosystem can call them.

KeeperHub. Every payment-side state transition runs through a KeeperHub workflow. Five JSON exports live in keeperhub/: propose, activate-and-pool, stream-start, stream-stop, handle-breach. The orchestrator drives them via KH's MCP/JSON-RPC endpoint (orchestrator/keeperhub.py), and KH calls back via webhooks (orchestrator/webhooks.py + webhook_verifier.py for HMAC). On top of using KH we contributed two upstream PRs — a Superfluid plugin so any KH workflow can speak createPool, updateMemberUnits, distributeFlow natively, and a Coalition plugin for multi-workflow consensus where N independent workflow runs vote on a result before the keeper acts. Specs at PRD-2-superfluid-plugin.md and PRD-1-coalition-plugin.md. Agents pay autonomously via x402: the OpenAI-compat router returns HTTP 402 with the challenge, the agent signs and replays.
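The HMAC verification on the webhook path reduces to a constant-time digest comparison. A minimal sketch, assuming hex-encoded HMAC-SHA256 signatures (the repo's webhook_verifier.py may differ in header name and encoding):

```python
import hashlib
import hmac

def verify_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Constant-time HMAC check for a KeeperHub-style callback.

    compare_digest avoids timing side channels that a naive ==
    comparison would leak.
    """
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```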

x402 + Superfluid bond. A single signed x402 voucher (EIP-3009 transferWithAuthorization on a USDC mock) opens the session via our self-hosted facilitator (facilitator/); once settlement clears, the orchestrator fires the KH stream-start workflow which calls GDA distributeFlow at the negotiated rate, paying every operator in the coalition by the second. On EOS or client disconnect, the orchestrator fires stream-stop and the meter halts to the second.
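The 402-challenge retry on the agent side can be sketched as a header builder: parse the challenge from the 402 body, sign, replay with a payment header attached. Field names below are illustrative placeholders, not the x402 spec's exact schema; the real payload carries an EIP-3009 transferWithAuthorization signature:

```python
import base64
import json

def build_payment_header(challenge: dict, signature: str) -> str:
    """Assemble the retry header for an x402-style HTTP 402 challenge.

    Sketch under assumptions: x402 replays carry a base64-encoded JSON
    payment payload, but the exact field names here are illustrative.
    """
    payload = {
        "scheme": challenge.get("scheme", "exact"),
        "network": challenge.get("network"),
        "payTo": challenge.get("payTo"),
        "amount": challenge.get("maxAmountRequired"),
        "signature": signature,  # EIP-3009 transferWithAuthorization sig
    }
    return base64.b64encode(json.dumps(payload).encode()).decode()

# On HTTP 402: parse the challenge from the response body, sign it with
# the agent's key, then replay the original request with this header.
```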

Notable hacks:

  • Tied-weight detach on exit shard. Llama / Qwen tie lm_head.weight to embed_tokens.weight. The exit shard only loads the back half of the model — but the embed lives on the entry shard. We explicitly clone-and-detach the lm_head weight before del-ing the original full model so the exit survives.
  • TEE-in-Docker. Running an SGX/TDX enclave under Docker Compose for the orchestrator — getting the device passthrough right and the attestation quote round-tripped to 0G Compute's verifier — was the longest single yak-shave of the build.
  • CREATE2 Superfluid deploy on 0G. Producing deterministic verified addresses for the entire Superfluid stack on a fresh chain meant generating salts, replaying the canonical bytecode, and re-verifying every upgradeable proxy on the 0G explorer. Worth it: the contracts are now public infrastructure.