GoldenMCP

An autonomous marketplace to safely discover, verify, and pay AI agents for complex Web3 tasks.

Created At

ETHGlobal New York 2026

Winner of

ENS

ENS - Integrate ENS

Prize Pool

Project Description

GoldenMCP is an autonomous, verifiable AI agent clearinghouse. It ensures that when an AI agent executes financial transactions on a blockchain, its performance is benchmarked, the evaluation is cryptographically signed inside a TEE, its scores and token-efficiency metrics are recorded permanently on-chain, and it can be dynamically discovered via ENS.

Instead of presenting users with anonymous, unreadable hexadecimal addresses, agent vendors register their services via the Ethereum Name Service using the ENSIP-25/26 standards. An agent’s domain (e.g., alpha-swaps.eth) can then securely store and publish metadata pointers, endpoint configuration, and a pointer to its latest attested evaluation directly in its on-chain identity record.

To ensure users route their workflows to high-performing agents, the infrastructure continuously stress-tests and benchmarks registered vendor MCP servers. The engine scores every agent across three weighted dimensions (k = 3):

  • Output correctness (45%) — Behavioral accuracy of the agent’s results against a golden expected output.
  • Tool-path correctness (35%) — Whether the agent calls the right tools in the right sequence rather than thrashing or looping.
  • Token efficiency (20%) — Compute overhead and microtransaction cost relative to a baseline.

A binary security gate runs first: prompt injection, out-of-allowlist tool calls, or policy violations immediately zero out the score.
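A minimal sketch of how the weighted composite and the security gate could combine. The weights come from the list above; the names `DimensionScores` and `composite_score`, the 0.0 to 1.0 scale, and the rounding are assumptions made for illustration, not the project's actual scorer.

```python
from dataclasses import dataclass

# Weights from the description above; everything else here is illustrative.
WEIGHTS = {"output": 0.45, "tool_path": 0.35, "token_efficiency": 0.20}


@dataclass(frozen=True)
class DimensionScores:
    """Per-dimension scores, each assumed to be normalized to 0.0-1.0."""
    output: float
    tool_path: float
    token_efficiency: float


def composite_score(scores: DimensionScores, security_passed: bool) -> float:
    """Return the weighted composite, or 0.0 if the security gate failed."""
    if not security_passed:
        # The binary gate runs first and zeroes the score outright.
        return 0.0
    total = (
        WEIGHTS["output"] * scores.output
        + WEIGHTS["tool_path"] * scores.tool_path
        + WEIGHTS["token_efficiency"] * scores.token_efficiency
    )
    return round(total, 4)
```

Under these assumptions, an agent with perfect output but a half-correct tool path and no token savings would score 0.45 + 0.175 = 0.625, and any gate failure yields 0.0 regardless of the other dimensions.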

To protect sensitive user data (financial instructions and temporary execution contexts) during evaluation, the scoring judge runs inside an AI Trusted Execution Environment (TEE). The judge is a lightweight open-weight model, Google’s Gemma, and the TEE ensures the host machine cannot view or manipulate processing memory. Once inference completes, the TEE produces a hardware-signed attestation document proving that the specified model ran the evaluation inside the enclave, free from host interference. The attestation is the inference: its inference_id serves as the handle, while the transcript hash (the enclave’s response_digest) becomes the verifiable fingerprint anchored on-chain.

Once verified, the agent’s evaluation score and attestation are committed to an ERC-8004-inspired registry on Circle’s Arc blockchain as an immutable event log, creating a transparent historical record of vendor integrity. The full evaluation manifest and raw logs are stored on Walrus, while the registry maintains the corresponding walrus:// pointer.

Concurrently, the user’s workflow triggers seamless, programmatic microtransactions in USDC through Circle’s x402 infrastructure, compensating agent vendors for tool usage on a granular, per-execution basis.

How it's Made

Orchestration

We use Anthropic’s Model Context Protocol (MCP) as the backbone for agent communication. By adopting the open standard instead of proprietary tool-calling formats, our orchestrator can communicate with any MCP-compliant server.

The pipeline itself is implemented as a Chainlink CRE workflow (workflows/eval-pipeline, TypeScript CRE SDK), orchestrating scoring, attestation, storage, and on-chain writes across both off-chain and on-chain steps. Execution occurs on a Chainlink DON rather than infrastructure we directly control.

Identity & Discovery

We integrate with ENS using the ENSIP-25/26 standards, giving agents human-readable identities (e.g., alpha-swaps.eth) instead of hexadecimal addresses. Capability metadata and pointers to the latest attested evaluation are stored directly in ENS text records on-chain.

Discovery queries performance scores stored in our on-chain registry, filters for agents that satisfy the required capability and score thresholds, and then resolves the selected agent’s endpoint through ENS.
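The filter-then-resolve flow can be sketched as follows. This is a hedged illustration over in-memory data: `RegistryEntry`, `select_agent`, `resolve_endpoint`, and the `mcp.endpoint` text-record key are hypothetical names, and the dictionaries stand in for the on-chain registry and ENS lookups.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class RegistryEntry:
    """One agent as it might appear in the registry (illustrative shape)."""
    ens_name: str
    capabilities: FrozenSet[str]
    score: float


def select_agent(
    entries: Iterable[RegistryEntry], capability: str, min_score: float
) -> Optional[RegistryEntry]:
    """Pick the highest-scoring agent that has the capability and meets the threshold."""
    eligible = [
        e for e in entries
        if capability in e.capabilities and e.score >= min_score
    ]
    return max(eligible, key=lambda e: e.score, default=None)


def resolve_endpoint(
    ens_records: Dict[str, Dict[str, str]],
    ens_name: str,
    key: str = "mcp.endpoint",  # hypothetical text-record key
) -> Optional[str]:
    """Look up the agent's endpoint in a dict standing in for ENS text records."""
    return ens_records.get(ens_name, {}).get(key)
```

Choosing the single highest score is only one possible policy; a caller could equally return every eligible agent and let the user decide.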

Trusted Execution

For trusted execution, we integrate Chainlink Confidential AI (CAI), which runs our Gemma-based evaluation judge inside a hardware Trusted Execution Environment (TEE) backed by AWS Nitro.

The host system cannot view or modify enclave memory during execution. Upon completion, CAI returns a hardware-signed attestation proving that a known model executed the evaluation without host interference. We anchor the resulting inference_id and transcript hash on-chain.

The attestation is the inference—there is no synthetic transaction hash.

Payments

We use Circle’s USDC stack (x402 and Circle Gateway on Arc) for programmable micropayments.

Because USDC serves as Arc’s native gas token, the entire lookup-and-pay workflow settles using a single asset without requiring a separate gas token. Arc’s sub-second finality makes per-use micropayments feel effectively instantaneous.

Storage

Evaluation manifests and raw Inspect logs are stored on Walrus. The Arc registry and ENS records maintain a walrus:// pointer, allowing both on-chain reputation data and frontend applications to resolve tamper-evident, content-addressed artifacts.

Workflow

  1. Discovery

Query the on-chain registry for an agent supporting the required capability and meeting the minimum performance threshold. Resolve the selected agent’s identity and endpoint through ENS.

  2. Evaluation

The eval-runner (FastAPI) evaluates the agent using Inspect AI against golden benchmarks across three weighted dimensions (k = 3):

  • Output correctness (45%)
  • Tool-path correctness (35%)
  • Token efficiency (20%)

A binary security gate executes first and can immediately fail the evaluation.

  3. Attestation

The score manifest is judged inside the Chainlink CAI TEE, which emits a hardware-signed attestation proving that a known model evaluated the results without host interference.

  4. On-Chain Proof

The score and attestation (inference_id, transcript hash) are committed to our ERC-8004-inspired registry on Arc as an immutable event log, which serves as the system’s source of truth.

The complete evaluation manifest and raw logs are stored on Walrus.

  5. Settlement

After the agent completes the requested task, a USDC micropayment via the Circle stack (x402 on Arc) compensates the vendor.

How Partner Technologies Benefited Us

Chainlink CRE

Provides a DON-executed orchestration layer, eliminating the need to trust our own backend for the evaluation → attestation → publication pipeline.

Chainlink CAI

Provides hardware-attested verdicts from an open-weight evaluation model without requiring us to operate our own Nitro enclave. We obtain TEE guarantees through an API.

Circle / Arc

USDC-as-native-gas reduces the payment flow to a single asset. Arc’s sub-second finality makes real-time micropayments practical.

ENS (ENSIP-25/26)

ENS text records and TTLs support decentralized discovery: reputation self-invalidates when a subname expires, which removes the need for a custom indexer.

Walrus

Content-addressed blobs provide tamper-evident evaluation logs that can be referenced by both on-chain contracts and frontend applications at a fraction of the cost of on-chain storage.

Notable Engineering Challenges

CRE Cron Trigger Limitation

The CRE cron trigger currently hangs in cre simulate, including for minimal test workflows, while HTTP triggers operate normally.

Rather than block on the issue, we registered three triggers:

  • HTTP Run
  • HTTP CAI Callback
  • Production Cron

All executions currently flow through the HTTP trigger. This unexpectedly became a feature, allowing external webhooks (GitHub updates, API change notifications, etc.) to trigger re-evaluations.

Two-Handler Async Callback Architecture

A synchronous flow (score → poll CAI → publish → write) exceeded CRE’s per-workflow HTTP-call limit.

To work around this:

  • Handler A performs scoring and submits work to CAI with a cre_callback.
  • CAI completion triggers a new execution.
  • Handler B publishes artifacts to Walrus and writes results to Arc.

We also rotate one benchmark per execution to remain within call limits.
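The split into two handlers, together with the per-execution benchmark rotation, might look roughly like this. It is a sketch under stated assumptions: the handler signatures, the callback payload fields, and the injected callables (`score_fn`, `submit_to_cai`, `publish_blob`, `write_registry`) are hypothetical stand-ins rather than CRE or CAI APIs, and the `pending` dict stands in for whatever state carries over between the two executions.

```python
from typing import Any, Callable, Dict, List

Manifest = Dict[str, Any]


def handler_a(
    run_index: int,
    benchmarks: List[str],
    score_fn: Callable[[str], float],
    submit_to_cai: Callable[[Manifest], str],
    pending: Dict[str, Manifest],
) -> str:
    """Score one benchmark, hand the manifest to the judge, and return at once."""
    # Rotate one benchmark per execution to keep the HTTP-call count small.
    benchmark = benchmarks[run_index % len(benchmarks)]
    manifest = {"benchmark": benchmark, "score": score_fn(benchmark)}
    inference_id = submit_to_cai(manifest)  # no polling: completion calls back
    pending[inference_id] = manifest
    return inference_id


def handler_b(
    callback: Dict[str, str],
    pending: Dict[str, Manifest],
    publish_blob: Callable[[Manifest], str],
    write_registry: Callable[[Dict[str, Any]], None],
) -> Dict[str, Any]:
    """Run in a fresh execution when the judge finishes: publish, then write."""
    manifest = pending.pop(callback["inference_id"])
    record = {
        "score": manifest["score"],
        "inference_id": callback["inference_id"],
        "transcript_hash": callback["response_digest"],
        "pointer": publish_blob(manifest),  # e.g. a walrus:// pointer
    }
    write_registry(record)
    return record
```

Because neither handler waits on the other, each execution makes only its own small set of outbound calls, which is the property the workaround depends on.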

Walrus as an Inspect AI Log Backend

Inspect View expects S3-style log directories.

We implemented a custom walrus:// fsspec adapter that exposes content-addressed Walrus blobs through an S3-compatible interface backed by an index blob.

As a result:

inspect view --log-dir walrus://evals/goldenmcp

works directly against decentralized storage.
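To illustrate the role of the index blob, here is a simplified sketch of the path-to-blob mapping such an adapter needs. It is not the project's fsspec adapter: the `WalrusIndex` class, the JSON index layout, and the blob IDs are assumptions, and fetching blob contents from Walrus is left out entirely.

```python
import json
from typing import Dict, List


class WalrusIndex:
    """Maps S3-style log paths to content-addressed blob IDs via a JSON index."""

    def __init__(self, index_json: str) -> None:
        # Assumed layout: {"evals/goldenmcp/run.eval": "<blob id>", ...}
        self._entries: Dict[str, str] = json.loads(index_json)

    def ls(self, prefix: str) -> List[str]:
        """List the immediate children of a directory-like prefix."""
        prefix = prefix.rstrip("/") + "/"
        children = set()
        for path in self._entries:
            if path.startswith(prefix):
                # Keep only the first path segment below the prefix.
                first_segment = path[len(prefix):].split("/", 1)[0]
                children.add(prefix + first_segment)
        return sorted(children)

    def blob_id(self, path: str) -> str:
        """Return the blob ID for an exact path; raises KeyError if absent."""
        return self._entries[path]
```

Since blobs are content-addressed and have no directory structure of their own, some index of this kind is what lets a log viewer browse them as if they were files in a bucket.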

Per-Model Optimization in the K=3 Ensemble

Our ensemble currently includes Claude Haiku, Qwen, and Gemma.

Model-specific optimizations include:

  • Disabling Qwen’s reasoning preamble via chat_template_kwargs.
  • Running gpt-oss with reasoning_effort=low.

Reducing token usage lowers Inspect polling frequency and helps stay within CRE HTTP-call limits.

Adversarial Security Gate

security_scorer executes before all other scoring logic.

The scorer immediately zeroes the composite score when it detects:

  • Prompt injection attempts in MCP responses
  • Out-of-allowlist tool invocations
  • Policy violations (e.g., executing a swap during a quote-only benchmark)

Agents are evaluated adversarially, not merely for accuracy.
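The three checks listed above could be sketched as a single gate function. This is an illustrative stand-in rather than the project's security_scorer: `security_gate`, `GateResult`, the regular expressions, and the tool names in the usage note are all hypothetical, and a real injection detector would need far more than two patterns.

```python
import re
from dataclasses import dataclass
from typing import AbstractSet, Iterable, List

# Hypothetical patterns for illustration only.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"disregard (the )?system prompt", re.IGNORECASE),
]


@dataclass(frozen=True)
class GateResult:
    passed: bool
    violations: List[str]


def security_gate(
    tool_calls: Iterable[str],
    mcp_responses: Iterable[str],
    allowlist: AbstractSet[str],
    forbidden_tools: AbstractSet[str] = frozenset(),
) -> GateResult:
    """Collect violations; any violation at all fails the gate."""
    violations: List[str] = []
    for name in tool_calls:
        if name not in allowlist:
            violations.append(f"out-of-allowlist tool call: {name}")
        elif name in forbidden_tools:
            # Allowed in general, but not under this benchmark's policy.
            violations.append(f"policy violation: {name} not permitted here")
    for text in mcp_responses:
        if any(p.search(text) for p in INJECTION_PATTERNS):
            violations.append("prompt injection pattern in MCP response")
    return GateResult(passed=not violations, violations=violations)
```

For a quote-only benchmark, a caller might pass `forbidden_tools={"execute_swap"}` so that an otherwise allowlisted swap tool still fails the gate.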

Transcript Hash Fallback

We record CAI’s enclave response_digest as the canonical transcript hash.

When that digest is unavailable, we fall back to sha256(output) to maintain robustness against occasional attestation API inconsistencies.
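A minimal sketch of that fallback, assuming the digest arrives as an optional string and the output as text. The function name `transcript_hash`, the UTF-8 encoding, and the hex-string return format are assumptions.

```python
import hashlib
from typing import Optional


def transcript_hash(response_digest: Optional[str], output: str) -> str:
    """Prefer the enclave's response_digest; otherwise hash the output locally."""
    if response_digest:
        return response_digest
    # Fallback: SHA-256 over the UTF-8 bytes of the judge's output.
    return hashlib.sha256(output.encode("utf-8")).hexdigest()
```

Note that a locally computed hash is not covered by the hardware attestation, so a consumer of the registry may want to know which path produced a given value.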

Technology Stack

  • Next.js (dashboard, sandbox, flight tracker, vendor table)
  • Chainlink CRE workflows
  • FastAPI / Uvicorn (eval-runner)
  • marketplace-mcp-ts (Express + Bun)
  • MCPRegistry (Solidity + Foundry) on Arc
  • Inspect AI scorers with a K=3 open-weight ensemble
  • Walrus decentralized blob storage
  • ENSv2 identity layer
  • Chainlink CAI TEE attestation
  • USDC on Arc testnet via x402 and Circle Gateway
  • viem
  • Custom CSS score meters (no charting libraries)
  • bun, pytest, and forge testing
  • Python via uv
  • Vercel deployment

Benchmarked MCP Providers

  • LI.FI
  • 1inch
  • Odos
  • Jupiter
  • KyberSwap