Developer copilot

Bytebell: Developer copilot that unifies scattered code/docs with verified citations.

Created At: ETHOnline 2025

Project Description

What We're Building

Bytebell is a developer copilot that unifies scattered technical knowledge into one searchable place with verified answers.

The Problem

Engineering teams lose massive productivity because critical information is scattered:

  • Code lives in GitHub/GitLab
  • Documentation in Notion, Google Docs, internal wikis
  • Discussions in Slack, Discord, Telegram
  • Technical content in blogs, forums, PDFs, research papers

Developers spend 23% of their time searching for information that already exists somewhere. That's $47,000 lost per developer annually.

New engineers take 3-6 months to become productive because context is everywhere and nowhere.

Our Solution

Unified Knowledge Graph: We connect all your technical sources and build a living knowledge graph that captures relationships between code, documentation, discussions, and research materials in real-time.

Answers with Receipts: Every answer includes exact citations:

  • File path and line numbers
  • Branch and commit hash
  • Release version
  • Author and timestamp
  • Related discussions and PRs

If we can't verify an answer with real sources, we refuse to answer. This cuts the hallucination rate from the roughly 45% typical of generic chatbots to under 1%.

Works Everywhere:

  • IDE (via MCP protocol): Cursor, Claude Desktop, Windsurf
  • Slack: Query directly from channels
  • CLI: Terminal-based access
  • Web: Full conversational interface
  • Widget: Embeddable in documentation sites

How It Works

  1. Ingest: Continuously sync from GitHub/GitLab, documentation, blogs, forums, Slack, PDFs, and research papers

  2. Build Graph: Create relationships—commits link to discussions, bugs connect to fixes, decisions tie to research papers

  3. Multi-Agent Retrieval: Specialized agents search different sources in parallel (2-4 second response time)

  4. Deliver with Proof: Every answer cites exact sources with file, line, branch, release

What Makes Us Different

Receipt Moat: Verifiable answers, not hallucinations. Answering a repeated question costs nothing.

Graph Moat: Learns your patterns, not generic facts. Gets smarter with every query.

Workflow Moat: Embedded everywhere you work—high switching cost.

Trust Moat: Permission-aware, audit-ready, enterprise-secure from day one.

Key Features

  • Version-aware context (query by branch, release, or commit)
  • Permission inheritance from existing systems
  • Admin analytics showing knowledge gaps and usage patterns
  • Full audit trails for compliance
  • Flexible deployment (cloud, private cloud, or on-premises)
  • Real-time updates (5-minute sync cycles)

Traction

  • $6K MRR within the first 3 months (starting from $0)
  • 5 paying customers at $1,200-1,500/month
  • 99% answer accuracy with source verification
  • <1 week time to value
  • Customer impact: 5+ hours saved per developer per week

Technical Architecture

  • Multi-agent framework (VoltAgent) for intelligent retrieval
  • Knowledge graph database for relationships
  • Vector embeddings for semantic search
  • Version control integration for Git-native understanding
  • Supports multiple LLMs (Claude, GPT-4, Llama)

Market Opportunity

$96.8B TAM in context-aware computing, growing 34% annually.

15,000+ companies with 500+ developers need this solution.

The Vision

Build the context backend for AI. As models commoditize, trusted context becomes the permanent infrastructure for organizational coherence.

Core principle: No matter how smart AI becomes, it will always need trusted context—and context will always come from graphs that connect real sources.

How it's Made

Architecture Overview

We built Bytebell as a multi-layered system with three core components: the ingestion pipeline that brings data in, the knowledge graph engine that understands relationships, and the multi-agent retrieval system that delivers accurate answers.

1. Ingestion Pipeline - The Data Foundation

Source Connectors

For GitHub and GitLab integration, we use the PyGithub and python-gitlab libraries to access their APIs, with webhook listeners built as FastAPI endpoints to catch updates in real time. The tricky part was parsing code from different programming languages, so we used Tree-sitter, which gives us a universal way to extract structure from any codebase. We pull everything: files, commits, pull requests, issues, code reviews, and release tags. The system runs incremental syncs every 5 minutes, processing only what changed since the last sync.

The biggest challenge here was GitHub's rate limits. If you're not careful, you'll hit the limit and have to wait. We solved this with smart caching and conditional requests using ETags, which lets GitHub tell us "nothing changed" without counting against our rate limit.

For documentation, we connect to Notion using their official API and recursively traverse pages. Google Docs required implementing an OAuth2 flow with the Drive API v3. Markdown files get parsed with a custom parser that preserves the document structure: headers, code blocks, and links all stay intact. PDFs are trickier; we use PyMuPDF to extract text while maintaining layout analysis so we don't lose semantic structure. For blogs and forums, we use Beautiful Soup combined with Trafilatura to extract clean content from messy HTML.

On the communication side, we use the Slack Bolt SDK with Events API subscriptions for Slack, discord.py for Discord message history, and python-telegram-bot for Telegram groups. We don't ingest everything, though: only technical channels, selected with configurable regex patterns to filter out noise.

The First Hacky Part: Intelligent Chunking

Standard retrieval systems chunk documents naively, just cutting every 512 tokens regardless of what's in there. We built something smarter: a semantic chunking system that understands what it's chunking. For code, we use the AST (Abstract Syntax Tree) to chunk by function and class boundaries. You never want to cut a function in half; it loses all context. For documentation, we respect the markdown header hierarchy and preserve context around each chunk. For discussions, we chunk by conversation thread so you get complete exchanges, not random messages. This approach improved our retrieval accuracy by about 35%. The difference is huge: if you chunk code mid-function, the retrieval system has no idea what that code fragment is doing. (A minimal chunking sketch follows below.)
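
To make the AST-boundary chunking concrete, here is a minimal sketch. It uses Python's built-in ast module instead of Tree-sitter (which the real pipeline uses to cover many languages), so it only handles Python source; the CodeChunk structure and function names are illustrative, not Bytebell's actual data model.

```python
import ast
from dataclasses import dataclass

@dataclass
class CodeChunk:
    name: str        # function or class name
    start_line: int  # first line of the definition
    end_line: int    # last line of the definition
    source: str      # the exact source slice for this chunk

def chunk_python_source(source: str) -> list[CodeChunk]:
    """Chunk a Python file along function/class boundaries instead of fixed token windows."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start, end = node.lineno, node.end_lineno
            chunks.append(CodeChunk(
                name=node.name,
                start_line=start,
                end_line=end,
                source="\n".join(lines[start - 1:end]),
            ))
    return chunks

# Example: every chunk is a whole function or class, never a fragment cut mid-body.
if __name__ == "__main__":
    sample = "def add(a, b):\n    return a + b\n\nclass Greeter:\n    def hello(self):\n        return 'hi'\n"
    for chunk in chunk_python_source(sample):
        print(chunk.name, chunk.start_line, chunk.end_line)
```
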
2. Knowledge Graph Engine

Storage Architecture

We use three databases working together. Neo4j handles the knowledge graph because relationships are first-class citizens there, not afterthoughts. PostgreSQL stores metadata, user data, and audit logs. Redis caches frequently accessed nodes and query results to keep things fast.

The graph schema has nodes representing different types of entities:

  • Code files: path, language, repo, branch, commit hash
  • Functions: name, signature, line numbers
  • Documentation: title, source, version
  • Discussions: platform, channel, timestamp, author
  • Commits: hash, message, author
  • Releases: version tags and dates

The edges are where the magic happens. They represent relationships like "this commit implements this issue," "this discussion is about this code file," "this documentation explains this function," "this commit fixes this bug," "this discussion references this documentation," and "this version evolved from that version."

The Second Hacky Part: Version-Aware Graph Snapshots

Here's a problem we had to solve: how do you answer "How did authentication work in version 2.3?" when the graph is constantly updating with new code and documentation? We implemented temporal graph snapshots. Every node and edge gets tagged with when it was valid: a valid_from timestamp or release when it was added, a valid_until timestamp when it was modified or deleted, and which branch it belongs to. This lets us query the graph at any point in time without storing complete snapshots of everything, which would explode our storage costs. When someone asks about an old version, we just query for nodes and edges that were valid during that time period on that specific branch. It's like time-traveling through your codebase history. (A point-in-time query sketch appears at the end of this section.)

Vector Embeddings

We started with OpenAI's text-embedding-3-large and now also support Cohere's embed-v3. For the vector store, we tried Pinecone first but switched to Qdrant because their filtering capabilities worked better for our specific use case. Our embedding strategy is more sophisticated than just embedding raw chunks: we embed the chunk content itself, we embed the chunk plus surrounding context from the previous and next chunks, and we embed metadata such as file paths and function names separately. We store all three versions because different query types benefit from different embeddings.
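
To illustrate the temporal snapshot queries, here is a rough sketch using the official neo4j Python driver. The node labels, relationship type, and property names (Documentation, EXPLAINS, valid_from, valid_until, branch) follow the schema described above but are naming assumptions, not the exact production schema.

```python
from neo4j import GraphDatabase

# Point-in-time query: nodes are only returned if they were valid at the
# requested timestamp on the requested branch. Labels, relationship type,
# and property names are illustrative assumptions.
POINT_IN_TIME_QUERY = """
MATCH (d:Documentation)-[:EXPLAINS]->(f:Function {name: $function_name})
WHERE d.branch = $branch
  AND d.valid_from <= $as_of
  AND (d.valid_until IS NULL OR d.valid_until > $as_of)
RETURN d.title AS title, d.source AS source
"""

def docs_as_of(uri: str, auth: tuple, function_name: str, branch: str, as_of: str):
    """Return documentation nodes that explained `function_name` at time `as_of` on `branch`."""
    with GraphDatabase.driver(uri, auth=auth) as driver:
        with driver.session() as session:
            result = session.run(
                POINT_IN_TIME_QUERY,
                function_name=function_name,
                branch=branch,
                as_of=as_of,  # e.g. "2024-06-01T00:00:00Z"
            )
            return [record.data() for record in result]
```
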
3. Multi-Agent Retrieval System (VoltAgent)

This is our secret sauce. Instead of having one large language model try to do everything, we orchestrate specialized agents that are each good at specific tasks.

Agent Architecture

The Supervisor Agent is the orchestrator. When a query comes in, it uses GPT-4o to break down complex questions. For example, "How does authentication work?" gets decomposed into finding auth code files, finding auth documentation, and finding auth discussions. It then routes these subtasks to specialized agents and runs them in parallel using asyncio.

We have specialized sub-agents for different sources. The Code Agent searches our vector store filtered specifically for code, uses graph traversal to find related functions, and understands code structure through the AST. The Documentation Agent searches docs with semantic similarity and retrieves hierarchical context, pulling in parent sections when relevant. The Discussion Agent searches Slack and Discord with temporal weighting, so recent discussions get ranked higher. The PDF Agent does OCR plus layout analysis for technical papers and extracts diagrams separately.

The Third Hacky Part: Confidence Scoring and Refusal

Generic retrieval systems return the top-k results regardless of whether they're actually good answers. We built a confidence threshold system that actually refuses to answer when it's not sure. We calculate confidence from multiple factors: the semantic similarity scores from the vector search, whether multiple chunks agree with each other (cross-reference verification), and the quality of the source (code is more reliable than random discussions). We weight these factors and combine them into a final confidence score. If that score is below 0.7, we refuse to answer instead of making something up. (A minimal scoring sketch follows this section.) This is why our hallucination rate is less than 1%, compared to around 45% for generic chatbots. It's better to say "I don't know" than to confidently state something wrong.

Iterative Refinement Loop

Our agents don't just search once and call it done. They iterate. The first pass retrieves the top twenty chunks. The LLM reads those chunks and identifies gaps in the information. Then we re-query with refined search terms based on what's missing and cross-reference the new results with the old ones. If confidence is still low, we ask the user for clarification instead of guessing.
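
Referring back to the confidence gate above, here is a minimal sketch of weighted confidence scoring with refusal below 0.7. The specific weights, source-quality priors, and helper names are illustrative assumptions, not the production scoring.

```python
from dataclasses import dataclass

# Source quality priors: code is treated as more reliable than chat threads.
# These weights and priors are illustrative assumptions.
SOURCE_QUALITY = {"code": 1.0, "documentation": 0.9, "discussion": 0.6}
WEIGHTS = {"similarity": 0.5, "agreement": 0.3, "source_quality": 0.2}
CONFIDENCE_THRESHOLD = 0.7

@dataclass
class RetrievedChunk:
    text: str
    similarity: float   # cosine similarity from the vector search, 0..1
    source_type: str    # "code", "documentation", or "discussion"

def confidence(chunks: list[RetrievedChunk], agreement: float) -> float:
    """Combine vector similarity, cross-reference agreement, and source quality."""
    if not chunks:
        return 0.0
    avg_similarity = sum(c.similarity for c in chunks) / len(chunks)
    avg_quality = sum(SOURCE_QUALITY.get(c.source_type, 0.5) for c in chunks) / len(chunks)
    return (WEIGHTS["similarity"] * avg_similarity
            + WEIGHTS["agreement"] * agreement
            + WEIGHTS["source_quality"] * avg_quality)

def answer_or_refuse(chunks: list[RetrievedChunk], agreement: float, generate_answer):
    """Refuse rather than hallucinate when the confidence score is below threshold."""
    score = confidence(chunks, agreement)
    if score < CONFIDENCE_THRESHOLD:
        return {"answer": None, "refused": True, "confidence": score}
    return {"answer": generate_answer(chunks), "refused": False, "confidence": score}
```
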
4. MCP (Model Context Protocol) Integration

Anthropic released the Model Context Protocol in late 2024, and we were early adopters. We built a FastAPI server that implements the MCP protocol, exposing tools that IDEs can call.

MCP Server Implementation

Our MCP server exposes tools for searching the knowledge graph and getting code context around specific file locations. When Cursor, Claude Desktop, or Windsurf connect via MCP, they can call these tools during conversations. We return formatted results with citations, and the IDE renders those citations as clickable links.

The Fourth Hacky Part: MCP History Persistence

There's a problem with MCP: sessions are stateless. Users would lose all context between IDE restarts, which feels terrible for user experience. We built what we call "shadow history," a server-side history of all MCP interactions. We store every query and response, tag them with user_id and session_id, and on a new session we preload the last ten interactions. This gives users continuity even after restarting their IDE. This wasn't in the MCP specification; we built it ourselves because it was necessary for good UX.
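
As a rough sketch of the shadow-history idea, the snippet below records every MCP interaction tagged with user_id and session_id and preloads the last ten on a new session. It uses an in-memory store for brevity (production persists this server-side), and the class and method names are illustrative.

```python
from collections import defaultdict
from datetime import datetime, timezone

class ShadowHistory:
    """Server-side record of MCP interactions, since MCP sessions themselves are stateless."""

    def __init__(self, preload_limit: int = 10):
        self.preload_limit = preload_limit
        self._by_user = defaultdict(list)  # user_id -> list of interaction dicts

    def record(self, user_id: str, session_id: str, query: str, response: str) -> None:
        """Store every query/response pair, tagged with user and session."""
        self._by_user[user_id].append({
            "session_id": session_id,
            "query": query,
            "response": response,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def preload(self, user_id: str) -> list[dict]:
        """On a new session, return the user's last N interactions for continuity."""
        return self._by_user[user_id][-self.preload_limit:]

# Usage: when a new MCP session starts, inject history.preload(user_id)
# into the model context before handling the first tool call.
```
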
5. Permission and Security Layer

Permission Inheritance

The challenge here is simple but critical: don't show users content they shouldn't see. Our permission layer filters results based on the source. For GitHub content, we check whether the user has access to that repo through GitHub's API. For Notion content, we check Notion page permissions. Same for Google Drive, Slack channels, everything. We cache permission checks in Redis with a five-minute TTL to avoid hitting APIs on every single query, which would be expensive and slow. (A minimal caching sketch appears at the end of this section.)

Audit Logging

Every query gets logged: user ID, the query text, which chunks were retrieved with their sources, the final answer we gave, the timestamp, and user feedback such as thumbs up or down. Everything goes into PostgreSQL with configurable retention policies per customer. This is essential for enterprise compliance requirements.
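
Here is a minimal sketch of the Redis-cached permission check with a five-minute TTL, assuming redis-py and a hypothetical check_source_permission() that calls the upstream provider API (GitHub, Notion, Google Drive, Slack).

```python
import json
import redis

PERMISSION_TTL_SECONDS = 300  # five-minute cache, matching the TTL described above
cache = redis.Redis(host="localhost", port=6379, db=0)

def check_source_permission(user_id: str, source: str, resource_id: str) -> bool:
    """Hypothetical upstream check: GitHub repo access, Notion page permission, etc."""
    raise NotImplementedError  # calls the relevant provider API in production

def user_can_see(user_id: str, source: str, resource_id: str) -> bool:
    """Permission check with a Redis cache so every query doesn't hit provider APIs."""
    key = f"perm:{user_id}:{source}:{resource_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    allowed = check_source_permission(user_id, source, resource_id)
    cache.setex(key, PERMISSION_TTL_SECONDS, json.dumps(allowed))
    return allowed

def filter_results(user_id: str, chunks: list[dict]) -> list[dict]:
    """Drop retrieved chunks the user is not allowed to see, before answering."""
    return [c for c in chunks if user_can_see(user_id, c["source"], c["resource_id"])]
```
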
6. Cost Optimization Hacks

Aggressive Caching

Embedding costs add up fast. If you have a thousand documents with five hundred chunks each, that's five hundred thousand embeddings to compute. We implemented smart deduplication: before embedding anything, we hash the chunk content and check whether that hash already exists in our cache. If yes, we reuse the embedding; if no, we compute it and cache it. This cut our embedding costs by about 60%.

Model Routing

Not all queries need the most expensive model. For simple questions like "Where is function X?" we route to Claude Haiku, which is cheap and fast. For medium complexity like "How does authentication work?" we use Claude Sonnet. For really complex tasks like "Design a new architecture for this system," we use GPT-4o. This routing saves a ton on API costs. (A sketch of the dedup cache and routing heuristic follows this section.)

Narrow Retrieval

Instead of dumping fifty chunks into the context window, we retrieve the top twenty chunks, use a small model to rank them, and only send the top five to the final LLM. This reduces token usage by 70% without hurting answer quality.
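
A rough sketch of the two ideas above: hash-based embedding deduplication and a simple model-routing heuristic. The embed_with_api() helper, the in-memory cache, and the routing thresholds are illustrative assumptions, not the production logic.

```python
import hashlib

embedding_cache: dict[str, list[float]] = {}  # content hash -> embedding vector

def embed_with_api(text: str) -> list[float]:
    """Hypothetical call to the embedding provider (e.g. text-embedding-3-large)."""
    raise NotImplementedError

def embed_chunk(text: str) -> list[float]:
    """Hash-based deduplication: identical chunk content is embedded exactly once."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in embedding_cache:
        embedding_cache[key] = embed_with_api(text)
    return embedding_cache[key]

def route_model(query: str) -> str:
    """Toy complexity heuristic for model routing; production routing is richer than this."""
    words = len(query.split())
    if words <= 8:                      # "Where is function X?"
        return "claude-haiku"
    if words <= 30:                     # "How does authentication work?"
        return "claude-sonnet"
    return "gpt-4o"                     # long, open-ended design questions
```
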
7. Real-Time Updates

Webhook Architecture

GitHub and Slack send webhooks when events happen: new commits, new messages, whatever. When a webhook hits our FastAPI endpoint, we validate the signature, queue the event in Kafka, and workers process it asynchronously. They update the graph, re-embed content if needed, and invalidate relevant caches. We have separate Kafka topics for different event types (GitHub commits, Slack messages, docs updates) and run ten workers per topic consuming events in parallel. The result is that new commits are searchable within five minutes.
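
A minimal sketch of the webhook path for GitHub events, assuming FastAPI and the kafka-python client; the endpoint path, topic name, and secret handling are illustrative.

```python
import hashlib
import hmac
import os

from fastapi import FastAPI, Header, HTTPException, Request
from kafka import KafkaProducer

app = FastAPI()
producer = KafkaProducer(bootstrap_servers=os.environ.get("KAFKA_BROKERS", "localhost:9092"))
WEBHOOK_SECRET = os.environ["GITHUB_WEBHOOK_SECRET"].encode()

def valid_signature(body: bytes, signature_header: str) -> bool:
    """Verify GitHub's X-Hub-Signature-256 header (HMAC-SHA256 over the raw body)."""
    expected = "sha256=" + hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header or "")

@app.post("/webhooks/github")
async def github_webhook(request: Request,
                         x_hub_signature_256: str = Header(default="")):
    body = await request.body()
    if not valid_signature(body, x_hub_signature_256):
        raise HTTPException(status_code=401, detail="invalid signature")
    # Queue the raw event; async workers update the graph, re-embed, and invalidate caches.
    producer.send("github-commits", value=body)  # topic name is illustrative
    return {"queued": True}
```
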
8. Deployment Architecture

Infrastructure

We run on AWS, primarily in us-east-1 with eu-west-1 as backup. Kubernetes on EKS handles container orchestration, an Application Load Balancer handles SSL termination, and CloudFront serves static assets.

Microservices

We split functionality into separate services: API Gateway, Ingestion Workers, MCP Server, Graph Query Service, Vector Search Service, and Auth Service. They all communicate internally via gRPC because it's faster than REST.

Monitoring

We use Prometheus and Grafana for metrics, the ELK stack for logs, Jaeger for distributed tracing, and PagerDuty for alerts when things break.

Notable Technologies Used

Core Stack: The backend is Python 3.11 with FastAPI and asyncio for handling concurrent operations. The frontend is Next.js 14 with React, TypeScript, and Tailwind CSS. The database stack includes Neo4j for the graph, PostgreSQL for relational data, Redis for caching, and Qdrant for vector search. Kafka handles message queuing, and we use Elasticsearch as a fallback for full-text search.

AI and ML: For large language models, we support OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Claude 3 Haiku. Embeddings come from OpenAI's text-embedding-3-large and Cohere's embed-v3. Tree-sitter handles universal code parsing across languages. PyMuPDF and Tesseract do OCR for scanned PDFs.

Developer Tools: PyGithub and python-gitlab interface with Git platforms. GitHub Actions and ArgoCD handle CI/CD. Pytest and Playwright run our tests. Sentry tracks errors in production.

The Hackiest Hack: Function Call Chaining

Large language models can't naturally make multiple tool calls and reason between them in one go. We built a tool call chain executor that does multi-hop reasoning. When a query comes in, we execute a chain of steps: first, we find relevant code files; then, for each file, we get the associated documentation; next, we search for related discussions, using the files we found to narrow the search; finally, we synthesize an answer using all that context together. This multi-hop approach gets us better answers than single-shot retrieval. (A minimal chain-executor sketch appears at the end of this section.)

What We Learned

First, chunking matters way more than which embedding model you use. Stupid chunks equal bad retrieval no matter how good your model is. Second, confidence scoring isn't optional; it's mandatory. Better to refuse to answer than to hallucinate. Third, caching is your best friend; it saved us fifteen thousand dollars a month in API costs. Fourth, graphs are better than vectors for understanding relationships: vector search finds similar text, but graphs find connected concepts. Fifth, MCP is genuinely a game-changer for IDE integration, much better than browser extensions.

Current Challenges

We still have some issues to work through. Cold start is one: the first query takes about ten seconds while caches warm up, and we're working on predictive caching to solve this. Large repos with over a hundred thousand files slow down graph traversal, so we're implementing graph sharding. Our Python parser works great, but the Rust parser needs improvement for multilingual codebases. And diagram understanding is tough: OCR misses relationships in architecture diagrams, so we're exploring multimodal models to fix this.
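
Referring back to "The Hackiest Hack" above, here is a minimal sketch of a multi-hop chain executor. The tool stubs stand in for the real Code, Documentation, and Discussion agents; their names and signatures are illustrative.

```python
import asyncio

# Hypothetical tool stubs; in production these call the Code, Documentation,
# and Discussion agents described earlier.
async def find_code_files(query: str) -> list[str]:
    raise NotImplementedError

async def get_docs_for_files(files: list[str]) -> list[dict]:
    raise NotImplementedError

async def find_discussions(query: str, files: list[str]) -> list[dict]:
    raise NotImplementedError

async def synthesize_answer(query: str, context: dict) -> str:
    raise NotImplementedError

async def answer_with_chain(query: str) -> str:
    """Multi-hop tool chain: each hop's output narrows the next, instead of one-shot retrieval."""
    files = await find_code_files(query)                 # hop 1: relevant code files
    docs, discussions = await asyncio.gather(            # hops 2-3 can run in parallel
        get_docs_for_files(files),
        find_discussions(query, files),
    )
    context = {"files": files, "docs": docs, "discussions": discussions}
    return await synthesize_answer(query, context)       # final synthesis over all hops
```
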