Unvibe Hackathons

Uses LLM-as-a-judge to filter projects before they reach human judges

Project Description

Hackathons should be awesome. So why do so many suck?

The spirit of hackathons is collaboration, innovation, and community building. The implementation often fails everyone — friction and bottlenecks for builders, exhausted judges who wish they had more time per project, sponsors whose stack gets evaluated by people who've never used it before.

While I was a DevRel for the Protocol Labs ecosystem, I judged over 1,000 hackathon projects in a single year. Since then I've entered countless blockchain and AI hackathons, mainly as a solo dev. Both sides of the table have the same problem: human judges are trying to make fair calls under time pressure they can't actually meet, and builders get penalized for things that aren't really about their project.

This submission is a working prototype of one piece of that puzzle: a pipeline that does the structured evidence-gathering a fresh judge with sixty seconds per repo can't realistically do, so the humans further down the chain get to spend their attention on the harder questions.


What this submission demonstrates

I initially set out to build a sponsor judge for every track this hackathon offers. Two days in, surveying the ETHGlobal showcase for calibration repos, I discovered that several sponsors had no public showcase projects to calibrate against, and others had only generic ones. Building five thin judges against thin evidence would have produced five unconvincing judges. So I narrowed.

In scope for this submission:

  • A working pipeline orchestrator that walks repos through an ordered sequence of judges, with a contract-defined finding file on disk between each stage.
  • Two repo-screening judges (safe-repo, hackathon-qualified) that run before any sponsor evaluation.
  • One end-to-end sponsor judge for 0G that classifies projects against 0G's two prize tracks and evaluates them against per-track rubrics.
  • A calibration discipline borrowed from ML evaluation practice — expected verdicts written before the judge runs, and a changelog tracking every change to those expected verdicts with a reason.
  • Conventions designed to scale to other sponsors: per-sponsor SKILL.md, calibration/validation repo sets, deterministic prefilter + LLM evaluation as separate phases, on-disk findings as the inter-judge contract.

Intentionally out of scope, roadmapped in : builder-feedback synthesis, prize-pool coordinator, the LLM council that does final stack-ranking. Those depend on having real sponsor judges to feed them — one good sponsor judge is more useful than five bad ones.


Architecture

repo URL
   │
   ▼
┌─────────────┐    ┌─────────────────────┐    ┌─────────────┐
│  safe-repo  │───▶│ hackathon-qualified │───▶│ sponsor-0g  │───▶ (council, future)
└─────────────┘    └─────────────────────┘    └─────────────┘
   public?            within event window?      track fit?
   safe deps?         new-repo rule honored?    rubric fit?
   gitingest          (event.yaml)              components used?
   snapshot

Each judge is a script. The orchestrator (run_pipeline.py) shells out to each one in order, reads the contract-defined finding file from disk, and short-circuits if a judge returns passed: false. Judges are deliberately ignorant of each other; per-judge logic lives inside each judge. The orchestrator is a script, not a DAG framework — run_pipeline.py's docstring spells out the specific conditions that would justify upgrading.
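
The loop is small enough to sketch. This is illustrative only; the per-judge script path and finding filename below are assumptions rather than the actual layout, and the real logic lives in run_pipeline.py:

# pipeline_sketch.py: the shape of the orchestrator loop (illustrative, not run_pipeline.py itself)
import json
import subprocess
import sys
from pathlib import Path

JUDGES = ["safe-repo", "hackathon-qualified", "sponsor-0g"]  # ordered stages

def run_pipeline(repo_url: str) -> None:
    for judge in JUDGES:
        # Each judge is an ordinary script; the orchestrator just shells out to it.
        subprocess.run([sys.executable, f"judges/{judge}/scripts/judge.py", repo_url])
        # The contract between stages is a finding file on disk, not an in-memory object.
        finding = json.loads(Path(f"judges/{judge}/findings/latest.json").read_text())
        if not finding.get("passed", False):
            print(f"{judge} returned passed: false; short-circuiting")
            return
    print("all judges passed")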

The safe-repo judge captures a gitingest snapshot of the repo and stores it under judges/safe-repo/artifacts/repo-ingest-dump/<slug>.txt. Every downstream judge reads from this snapshot rather than re-cloning, which means evaluations are reproducible against a single point-in-time view of the repo even if the repo is later changed or deleted.


The 0G sponsor judge

The 0G judge is the load-bearing piece. Its full specification — input schema, output schema, failure modes, design rationale — is in . The short version follows.

Two-phase evaluation

The judge separates what can be checked deterministically from what requires judgment.

Phase 1 — Deterministic prefilter. Pure file parsing of the gitingest snapshot. No LLM, no network. Looks for SDK imports, 0G component mentions, contract addresses, demo links, declared tracks, team contact info, and example-agent directories. Produces a structured evidence JSON.
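
A minimal sketch of what Phase 1 amounts to, with illustrative checks and field names rather than the shipped prefilter's:

# prefilter_sketch.py: a minimal illustration of the Phase 1 idea, not the shipped prefilter
import json
import re
from pathlib import Path

def prefilter(slug: str) -> dict:
    # Read the point-in-time snapshot captured by the safe-repo judge.
    text = Path(f"judges/safe-repo/artifacts/repo-ingest-dump/{slug}.txt").read_text(errors="ignore")
    return {
        "sdk_imports": sorted(set(re.findall(r"@0glabs/[\w.-]+", text))),
        "component_mentions": [c for c in ("Storage", "DA", "Compute", "Chain")
                               if re.search(rf"\b0G {c}\b", text)],
        "contract_addresses": sorted(set(re.findall(r"0x[a-fA-F0-9]{40}", text))),
        "has_demo_link": bool(re.search(r"youtu\.be|youtube\.com|loom\.com", text)),
        "declared_tracks": re.findall(r"(?i)track:\s*([\w-]+)", text),
    }

if __name__ == "__main__":
    print(json.dumps(prefilter("petprotect"), indent=2))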

Phase 2 — LLM evaluation. Two calls, both fed the prefilter's findings as established ground truth so the model doesn't redo deterministic work.

  • Call A — track classification. Given the README, prefilter findings, package manifests, and source excerpts around any 0G SDK imports, classify whether the project fits the framework track, the agents track, both, or neither. Also identify which of 0G's four named components (Storage, DA, Compute, Chain) are meaningfully integrated — invoked at runtime in the project's primary flow, not merely listed in a manifest.
  • Call B — per-track rubric. Runs once per applicable track (zero, one, or two times). Scores each track-specific rubric item from the prize page as pass, fail, needs-human-verification, or n-a.
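
A rough sketch of that two-call flow follows; the prompt wording, the JSON keys, and the call_llm helper are assumptions, not the actual prompts/ templates or the lib/llm.py interface:

# phase2_sketch.py: rough shape of the two LLM calls (prompt text and helper signature assumed)
import json

from lib.llm import call_llm  # assumed interface: takes a prompt string, returns the model's text

def evaluate(readme: str, evidence: dict) -> dict:
    ground_truth = json.dumps(evidence, indent=2)

    # Call A: track classification, with the prefilter's findings passed in as established
    # facts so the model never re-derives what was already checked deterministically.
    call_a = json.loads(call_llm(
        "Deterministic evidence (treat as ground truth):\n" + ground_truth +
        "\n\nREADME:\n" + readme +
        "\n\nClassify the project against the framework and agents tracks, and list which 0G "
        "components (Storage, DA, Compute, Chain) are invoked at runtime in the primary flow. "
        "Respond as JSON with keys applicable_tracks and components_used."
    ))

    # Call B: per-track rubric, run once per applicable track (zero, one, or two times).
    rubric = {}
    for track in call_a.get("applicable_tracks", []):
        rubric[track] = json.loads(call_llm(
            f"Score each {track}-track rubric item as pass, fail, needs-human-verification, or n-a."
            "\n\nDeterministic evidence:\n" + ground_truth + "\n\nREADME:\n" + readme +
            "\n\nRespond as JSON mapping rubric item ids to verdicts."
        ))
    return {"classification": call_a, "tracks": rubric}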

A vocabulary that distinguishes failure modes

Per-track verdicts use a four-value vocabulary:

| Verdict | Meaning |
| ------- | ------- |
| pass | Track shape matches and rubric is satisfied. |
| fail-rubric | Track shape matches but one or more required rubric items failed. Builder feedback should focus on the missing items. |
| fail-no-integration | Track shape matches but the named components aren't actually used in the way the track requires. More fundamental than a polish issue. |
| not-classified | Project doesn't fit this track's shape. The rubric was not applied. |

not-classified is not a failure — it's the right answer when a project's shape doesn't map onto a given track. A project can have deep 0G integration and still be not-classified for both tracks; that's a feature, not a bug, and the calibration set has an example.
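
Both vocabularies are closed sets, which is easy to see as types. A sketch for illustration; the authoritative schema is the judge's spec:

from typing import Literal

# Per-track verdicts are a closed set; not-classified is a valid outcome, not an error.
TrackVerdict = Literal["pass", "fail-rubric", "fail-no-integration", "not-classified"]

# Rubric items inside an applicable track use the separate four-value vocabulary from Call B.
RubricVerdict = Literal["pass", "fail", "needs-human-verification", "n-a"]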


Calibration as evidence

The 0G judge has a calibration set of three repos drawn from the public ETHGlobal showcase. For each, I wrote my expected verdict before running the judge — what tracks I thought it qualified for, what 0G components I thought it used, what the rubric verdict should be. This file (expected-verdicts.md) is the validation harness: the judge's job is to reproduce these intuitions, and any disagreement gets investigated.

There's a strict rule attached: the expected verdicts cannot be updated to match the judge's output after the fact. Updating a verdict because the judge disagreed and its reasoning sounded convincing is the judge training me, not me validating the judge. Every change to an expected verdict is logged in the changelog with a reason that has to be a fact about the repo, not about the judge.
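
The check itself is mechanically simple. A sketch, with the expected values and finding path standing in for the real expected-verdicts.md contents and on-disk findings:

# calibration_check_sketch.py: compare judge output to pre-committed expectations (illustrative)
import json
from pathlib import Path

# Expected verdicts are written down BEFORE the judge runs. This dict stands in for the parsed
# contents of judges/sponsor-0g/test-repos/expected-verdicts.md; the values here are illustrative.
EXPECTED = {
    "petprotect": {"inferred_tracks": ["agents"], "components_used": ["Compute"]},
}

def check(slug: str) -> list[str]:
    # The finding-file location is an assumption about where the judge writes its summary.
    finding = json.loads(Path(f"judges/sponsor-0g/findings/{slug}.json").read_text())
    mismatches = []
    for key, expected in EXPECTED[slug].items():
        actual = finding.get(key, [])
        if sorted(actual) != sorted(expected):
            # Disagreements get investigated. The expected verdict is never edited to match the
            # judge; any change has to cite a fact about the repo, logged in the changelog.
            mismatches.append(f"{key}: expected {expected}, judge said {actual}")
    return mismatches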

Something interesting happened

When I wrote my initial expected verdicts, I did what a judge with sixty seconds per repo does: I skimmed each README, formed an impression, and wrote it down.

I was wrong on all three.

The deterministic Phase 1 prefilter — running before any LLM call — surfaced evidence I had missed in every single case. Two of the three projects had substantially deeper 0G integration than I'd given them credit for; one had production 0G Storage in a code path I'd dismissed as scaffolding. The changelog captures every revision:

| Date | Repo | Field | Change | Reason |
| ---- | ---- | ----- | ------ | ------ |
| 2026-05-01 | petprotect | Components used | none → Compute | Project uses @0glabs/0g-serving-broker, 0G's Compute SDK, imported and invoked in source code with hardcoded 0G testnet RPC. Manual review missed this entirely. |
| 2026-05-01 | zenagent | Components used | none → Storage, Compute | Closer reading found real 0G Storage integration in the web app's check-in flow (lib/0g.ts uses Indexer for production uploads). Both 0G SDKs are imported and used in real code paths. |
| 2026-05-01 | beepm | Components used | Chain/Compute → Compute, Chain | Closer reading of gateway/server.js confirmed Compute is via OpenAI-compatible HTTP proxy to a 0G endpoint, not via the SDK. HTTP-proxy use still counts toward components_used per the rubric. |

(Full changelog in .)

This is the thesis of the project. Without deep product knowledge, the quick-sweep evaluation that a fresh judge can realistically do penalizes builders for the kind of repo legibility they don't have time to optimize for during a hackathon. A deterministic prefilter run before human review can refocus the question from "is this a good project for this track at all" — which a quick-sweeping judge will get wrong — to "how well does this fit the track" — which is the question that actually matters.

What petprotect's clean end-to-end run looks like

{
  "judge": "sponsor-0g",
  "repo": "https://github.com/AtlasVIA/petprotect",
  "passed": true,
  "verdict": "passed",
  "inferred_tracks": ["agents"],
  "components_used": ["Chain"],
  "tracks": {
    "framework": {
      "verdict": "not-classified",
      "reasoning": "Project does not fit the framework track shape; rubric was not applied."
    },
    "agents": {
      "verdict": "pass",
      "reasoning": "The project uses @0glabs/0g-serving-broker; the README provides a clear description of the tech stack and use of 0G for document analysis. Items like 'agent-communication-explanation' and 'inft-link-and-proof' are not applicable to a single end-user agent.",
      "rubric_items": {
        "project-name-and-description": "pass",
        "contract-deployment-addresses": "pass",
        "public-repo-with-setup": "pass",
        "demo-video-link": "needs-human-verification",
        "live-demo-link": "needs-human-verification",
        "protocol-features-explained": "pass",
        "team-contact-info": "needs-human-verification",
        "agent-communication-explanation": "n-a",
        "inft-link-and-proof": "n-a"
      }
    }
  },
  "confidence": "high"
}

A few things worth noting in this output:

  • The judge correctly identifies that the project fits the agents track but not the framework track, and applies the rubric only to the applicable track.
  • needs-human-verification is used for items that aren't repo-derivable (demo videos, live demos, team contact). The judge doesn't penalize these — it flags them for the human-review stage where someone with more context can resolve them.
  • n-a is used for items that genuinely don't apply (a single agent has no agent-to-agent communication to explain).
  • confidence: high is the model's own assessment of its classification certainty, which is surfaced for the human reviewer rather than collapsed into the pass/fail.

A note on the other two calibration runs

Calibration runs on beepm and zenagent exposed an instruction-following limitation in the local model used during development (qwen2.5:14b via Ollama): on those two repos the model returned valid JSON of the wrong shape. The judge handled this gracefully — wrote a finding file, marked tracks as not-classified with a reasoning string explaining the parse failure, and the pipeline continued without crashing. The failure mode is documented in the judge's SKILL.md, which specifies this as the intended behavior. The expected verdicts for both repos remain in expected-verdicts.md; running these on a stronger model is a calibration follow-up, not a structural issue.
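
The fallback that behavior describes is roughly this; a sketch of the intent rather than the shipped code:

# parse_guard_sketch.py: fallback when the model returns valid JSON of the wrong shape (illustrative)
import json

def parse_track_verdicts(raw: str, applicable_tracks: list[str]) -> dict:
    try:
        data = json.loads(raw)
        if not isinstance(data, dict) or "tracks" not in data:
            raise ValueError("valid JSON, but not the expected schema")
        return data["tracks"]
    except (json.JSONDecodeError, ValueError) as exc:
        # Fail soft: still produce per-track entries so a finding file gets written
        # and the pipeline keeps moving instead of crashing.
        return {
            track: {
                "verdict": "not-classified",
                "reasoning": f"LLM response could not be parsed into the expected schema: {exc}",
            }
            for track in applicable_tracks
        }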


Running it

Full command reference is in . The minimum:

# Run the full pipeline on a single repo:
uv run python run_pipeline.py https://github.com/AtlasVIA/petprotect

# Run the calibration set:
uv run python run_pipeline.py --repos-file judges/sponsor-0g/test-repos/calibration.txt

Phase 2 LLM evaluation defaults to a local Ollama model (qwen2.5:14b). Set MODEL in your environment to use something else; see for the supported backends.


Repository layout

.
├── run_pipeline.py              # orchestrator
├── event.yaml                   # event policy: dates, new-repo rule
├── judges/
│   ├── safe-repo/               # public + safe-deps + gitingest snapshot
│   ├── hackathon-qualified/     # event-policy compliance check
│   ├── sponsor-0g/              # the load-bearing sponsor judge
│   │   ├── SKILL.md             # full spec — read this for depth
│   │   ├── prompts/             # Call A and Call B prompt templates
│   │   ├── scripts/             # prefilter, llm_evaluation, judge wrapper
│   │   ├── test-repos/          # calibration.txt, validation.txt,
│   │   │                        # expected-verdicts.md
│   │   └── tests/
│   ├── sponsor-ens/             # SKILL.md scaffolds for future judges
│   ├── sponsor-gensyn/
│   ├── sponsor-keeperhub/
│   └── sponsor-uniswap/
├── lib/llm.py                   # model-backend dispatch + call recording
└── planning-artifacts/          # roadmap docs (judges-needed, research notes)

Builder feedback

The judge's outputs are designed to feed a builder-feedback layer — per-project markdown reports written to the team whose project was evaluated, not about them. The reports synthesize the judge's verdict, the rubric items that passed, items flagged for human review, and forward-looking suggestions, all drawn from artifacts the judge already produces (the summary finding, per-call LLM records, and deterministic prefilter output).

A worked example for the petprotect calibration repo is at . It demonstrates the tone and structure: warm, specific, honest about the judge's own uncertainty (the petprotect report flags a component-classification mistake the model made, even though the verdict was pass), and grounded in evidence the builder can verify against their own repo.

Persistence on 0G Storage

The petprotect feedback report is persisted on 0G Storage on the Galileo testnet as a working demonstration of the encrypted-feedback-delivery roadmap item from :

  • Root hash: 0xdeceb19965cde26ebae7ef67be6c6574e0d1c6f09d172ffdc482b10982a75bd9
  • Storage explorer:
  • Chain explorer:
  • Flow contract: 0x22E03a6A89B950F1c82ec5e74F8eCa321a105296
  • Uploader: 0xCD8Bcd9A793a7381b3C66C763c3f463f70De4e12

The upload tool lives at and is built on @0gfoundation/0g-storage-ts-sdk. It is sponsor-agnostic: any future feedback layer can persist its outputs to 0G Storage by shelling out to it the same way the orchestrator shells out to judges. See for usage.

What's still ahead

The exemplar report and its persistence to 0G Storage are what's shipping in this submission. Two pieces are roadmapped:

  • A generator that produces these reports automatically from on-disk artifacts. The hybrid design — deterministic structure for the factual sections, LLM synthesis for the forward-looking prose — is straightforward to implement against the existing call records and prefilter JSON.
  • An encrypted-delivery flow. Reports are persisted on 0G Storage today, but they're not encrypted, and there's no redemption-token flow to deliver them privately to builders. Per the roadmap, the next step is to encrypt each report and issue a redemption token to the team via email or another out-of-band channel — which gives 0G Storage a use case (encrypted, content-addressed feedback artifacts) that benefits directly from the network's properties.

The handwritten exemplar sets the bar. The generator should produce reports at least this thoughtful, or it shouldn't ship.

What's next

The pieces I'd build next, with the intentionally-out-of-scope parts spelled out in :

  • More sponsor judges. The conventions are in place — SKILL.md template, calibration/validation repo sets, the two-phase pattern. Adding a sponsor is mostly writing the rubric and the prompts.
  • LLM Council. Synthesizes sponsor-judge findings into a stack rank for human reviewers. This is the layer that turns "did this pass each individual judge" into "where should the limited human attention go first."
  • Prize-pool coordinator. Distributes a sponsor-defined prize pool across qualifying projects that didn't win a named track prize.

A note on the underlying observation

The work in this repo points at something I think is true and worth saying out loud:

A lot of what looks like project quality during hackathon judging is actually project legibility — how easy the project is for a fresh evaluator to read in sixty seconds. Those are different things. Builders shipping real integration get penalized when their READMEs don't optimize for skimmability. Builders shipping shallow integration with polished READMEs get over-credited. Both are unfair, in opposite directions.

A pipeline that does structured evidence-gathering before human review doesn't replace human judgment. It changes what humans get asked. Instead of "is this a good project at all," a question that's hard to answer well in sixty seconds, the human gets "the prefilter says this project uses 0G Compute via the SDK at runtime — does that meet the track's bar?" That's a question a busy human can answer well.

That's the bet. This submission is the smallest working version of it I could build end-to-end in the time available.

How it's Made

I tried to avoid infrastructure just for the sake of having it, so what you're seeing here is mostly Python scripts that run the deterministic checks and LLM calls, with a dash of TypeScript on top to integrate 0G Storage for the feedback report outputs.

Each judge is packaged like an agent skill to include:

  • SKILL.md file
  • individual directories for artifacts, findings, scripts, prompts, and tests
  • deterministic scripts held out from the SKILL.md file
  • agent outputs currently written to findings for ease of observability

A simple llm.py shim handles model calls and is configured by adding or removing models in .env. At the time of submission, Claude API keys were misbehaving, so calibration was limited to qwen2.5:14b, pulled and run via Ollama.
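
A sketch of the shape that shim takes, assuming an Ollama-only backend; the real lib/llm.py also records calls and dispatches to other backends configured in .env:

# llm_shim_sketch.py: model dispatch keyed off environment config (one plausible shape, not lib/llm.py)
import json
import os
import urllib.request

def call_llm(prompt: str) -> str:
    model = os.environ.get("MODEL", "qwen2.5:14b")  # the default used during calibration
    # Local Ollama backend; assumes `ollama serve` is running on its default port.
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]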

I would like to add additional 0G integrations here, like:

  • calling the agents over 0G Compute
  • agentic identity inside the pipeline, gated by 0G

but I'm out of the time I have to finish them.