web-of-trust reviewers blacklisting computer-controlling AI agents
Naughty Agents is a decentralized, human-in-the-loop (HITL) security protocol designed to mitigate the emerging threat of AI agent hijacking, specifically targeting on-chain financial actions.
As AI agents gain autonomy over digital wallets (e.g., Coinbase Server Wallets) to perform tasks like trading, swapping, or paying for services, the risk of manipulation by malicious actors increases. A significant emerging threat vector is the "Malicious Image Patch" (MIP) or adversarial attacks via visual inputs (e.g., an agent viewing a compromised social media feed) (Source: Anthropic: Sleeper Agents). These attacks can hijack the agent's objective function, leading to unauthorized transactions and loss of funds.
Our solution creates a robust, on-chain "firewall" that verifies every transaction an agent proposes.
The protocol operates on a "Trust but Verify" principle, enforced at the Smart Contract Account (SCA) level. It uses an on-chain registry to instantly block known-malicious transactions (Blacklist). Unknown transactions are automatically reverted by the SCA and escalated to a decentralized network of human reviewers (The Review Oracle).
The system is powered by a crypto-economic model. Users pay a subscription fee for protection, which funds rewards for the reviewers. The integrity of the reviewer network is secured by a "Web of Trust" with a delegated slashing mechanism (simplified for MVP), ensuring all participants are financially incentivized to act honestly. Naughty Agents makes on-chain AI safety a public good, secured by the community, for the community.
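The incentive loop described above can be illustrated with a toy settlement model. This is only a sketch under stated assumptions: the `Reviewer` class, the `settle_review` function, the equal split of the reward pool, and the `slash_bps` penalty rate are all hypothetical and are not taken from the project's contracts. How the correct verdict is established, and how delegation affects slashing, is outside this sketch.

```python
from dataclasses import dataclass


@dataclass
class Reviewer:
    name: str
    stake: int  # staked amount in smallest token units (hypothetical)


def settle_review(reviewers, votes, correct_verdict, reward_pool, slash_bps=1000):
    """Reward reviewers who matched the correct verdict and slash the rest.

    reward_pool stands in for the share of subscription fees set aside for
    this review. slash_bps is an assumed penalty in basis points of stake.
    Returns a dict of name -> payout (negative values are slashed amounts).
    """
    payouts = {}
    honest = [r for r in reviewers if votes[r.name] == correct_verdict]
    dishonest = [r for r in reviewers if votes[r.name] != correct_verdict]

    # Dishonest reviewers lose a fraction of their stake.
    for r in dishonest:
        penalty = r.stake * slash_bps // 10_000
        r.stake -= penalty
        payouts[r.name] = -penalty

    # Honest reviewers split the reward pool equally (integer division).
    if honest:
        share = reward_pool // len(honest)
        for r in honest:
            payouts[r.name] = share

    return payouts
```

Integer arithmetic is used throughout to mirror how token amounts are usually handled on-chain.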
Naughty Agents is a decentralized security protocol that acts as an on-chain firewall to prevent hijacked AI agents from draining user funds. We built a full-stack solution with on-chain enforcement at the Smart Contract Account (SCA) level.
Our system combines a React frontend, Solidity smart contracts, and a Python agent simulator.
Frontend: We bootstrapped a React/Vite app using @coinbase/create-cdp-app. User interactions are powered by Viem via the pre-configured CDP hooks, connecting directly to our smart contracts.
On-Chain Logic (Solidity & Hardhat 3): The protocol's core is built on-chain. The UserSCA (Smart Contract Account) has a mandatory SecurityModule hook. This module intercepts every transaction, reverting malicious or unknown ones before they can execute. WebOfTrust manages reviewer staking, an ActionRegistry stores the on-chain blacklist, and a ReviewOracle queues unknown transactions for human review.
Agent Simulator: A Python script using web3.py simulates a hijacked agent proposing a malicious transaction, allowing us to demonstrate the protocol's real-time defense.
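The interaction between the SecurityModule, ActionRegistry, and ReviewOracle can be modelled off-chain in a few lines of Python. This is a minimal sketch, not the project's Solidity code: the class and method names mirror the contract names but their fields are assumptions, and the `approved` set (actions already reviewed as safe) is a hypothetical addition, since the write-up only mentions the blacklist explicitly.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    ALLOWED = "allowed"
    BLOCKED = "blocked"      # action is on the blacklist
    ESCALATED = "escalated"  # unknown action: reverted and queued for review


@dataclass
class ActionRegistry:
    blacklist: set = field(default_factory=set)
    approved: set = field(default_factory=set)  # assumption: reviewed-safe actions


@dataclass
class ReviewOracle:
    queue: list = field(default_factory=list)

    def enqueue(self, action_id: str) -> None:
        # Avoid queueing the same unknown action twice.
        if action_id not in self.queue:
            self.queue.append(action_id)


@dataclass
class SecurityModule:
    registry: ActionRegistry
    oracle: ReviewOracle

    def check(self, action_id: str) -> Verdict:
        """Decide what happens to a proposed transaction."""
        if action_id in self.registry.blacklist:
            return Verdict.BLOCKED
        if action_id in self.registry.approved:
            return Verdict.ALLOWED
        # Unknown actions do not execute; they go to human reviewers.
        self.oracle.enqueue(action_id)
        return Verdict.ESCALATED
```

One detail this model glosses over: on-chain, a revert also rolls back any state written in the same call, so the real contracts presumably record the escalation through a separate step or transaction.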
We built our project on the Coinbase CDP stack, which was crucial for rapid development.
To deliver a working MVP, we made two key simplifications:
Simplified slashing: Slashing is handled in a simplified form inside the WebOfTrust contract. This demonstrates the economic incentive model (skin in the game) without building a complex, multi-stage arbitration system.
Action hashing: Each action is identified by the hash keccak256(abi.encode(dest, value, data)). This simple but effective method allowed our ActionRegistry to easily identify and block known-malicious actions.
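To show what keccak256(abi.encode(dest, value, data)) hashes, here is a sketch that reproduces the ABI encoding in plain Python. It assumes the Solidity types are (address, uint256, bytes), which the write-up does not state. The function names `encode_action` and `action_id` are hypothetical. The hash function is passed in by the caller, because Python's standard library has no Keccak-256 (`hashlib.sha3_256` uses different padding and gives a different digest); a library function such as `eth_utils.keccak` would be one option.

```python
def encode_action(dest: str, value: int, data: bytes) -> bytes:
    """ABI-encode (address dest, uint256 value, bytes data).

    Assumed types; layout follows the standard Solidity ABI encoding.
    """
    dest_hex = dest[2:] if dest.startswith("0x") else dest
    dest_bytes = bytes.fromhex(dest_hex)
    if len(dest_bytes) != 20:
        raise ValueError("dest must be a 20-byte address")
    if not 0 <= value < 2**256:
        raise ValueError("value must fit in uint256")

    # Head: three 32-byte words. The dynamic `bytes` argument is replaced
    # by the offset of its tail, which starts right after the head (96).
    head = (
        dest_bytes.rjust(32, b"\x00")   # address, left-padded
        + value.to_bytes(32, "big")     # uint256, big-endian
        + (96).to_bytes(32, "big")      # offset to the bytes payload
    )

    # Tail: length word, then the payload right-padded to 32-byte words.
    padded_len = (len(data) + 31) // 32 * 32
    tail = len(data).to_bytes(32, "big") + data.ljust(padded_len, b"\x00")
    return head + tail


def action_id(dest: str, value: int, data: bytes, keccak256) -> bytes:
    """Hash the encoded action with a caller-supplied Keccak-256 function."""
    return keccak256(encode_action(dest, value, data))
```

Because the identifier covers the exact destination, value, and calldata, any change to one of them (for example a different amount) yields a different hash, so a blacklist keyed this way matches only exact repeats of a known-malicious action.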