Welcome to Gosu Evals
Determining which models and AI agents work best together through rigorous testing
Testing Methodology
The goal is to measure how well each model works with each agent. I built an evaluation set specifically designed to test real-world AI coding assistant performance through rigorous, multi-layered assessment.
Goal & Reasoning
My approach mirrors the rigorous testing methodologies I admire from GPU reviewers like GamersNexus and Level1Tech. Just as they test hardware under controlled, repeatable conditions to give you reliable performance data, I test AI coding assistants with standardized scenarios to measure real-world capability.
The reasoning: eliminate variables, control for consistency, and provide you with objective data to make informed decisions about which AI coding tools will actually work best for your specific needs.
Task Design & Examples
I created 20 tasks of varying complexity, generating high- and low-quality examples for each. These examples train the judge to score within my desired ranges and provide clear benchmarks for what constitutes quality output.
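To make that concrete, here is a minimal sketch of how a task and its calibration examples could be organized; the class names, fields, and the 0-10 score range are illustrative assumptions, not the actual suite.

```python
# Hypothetical structure for a task and its calibration examples.
# Field names and the 0-10 score range are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class CalibrationExample:
    output_summary: str   # description of a produced solution
    target_score: float   # where the judge should land for output like this

@dataclass
class EvalTask:
    name: str
    complexity: str       # e.g. "low", "medium", "high"
    high_quality: list[CalibrationExample] = field(default_factory=list)
    low_quality: list[CalibrationExample] = field(default_factory=list)
```

The judge prompt for a task can then interpolate both sets, so the model anchors its scores to concrete good and bad reference outputs.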
Unit Testing & Specifications
For each high-quality example, I wrote comprehensive unit tests and created detailed prompts specifying exactly what should be created - each file, its functionality, and the documentation requirements. Agents receive only the specifications, never the unit tests, ensuring authentic problem-solving.
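A rough sketch of that separation, assuming specs and hidden tests are bundled per task (names and the agent interface here are hypothetical):

```python
# Illustrative split between what the agent sees and what the grader keeps.
from dataclasses import dataclass

@dataclass
class GradingBundle:
    spec_prompt: str        # files to create, functionality, documentation requirements
    unit_tests: list[str]   # paths to hidden test files, never shown to the agent

def run_agent(agent, bundle: GradingBundle) -> str:
    # The agent only ever receives the specification text; the unit tests
    # stay on the grading side so it cannot code directly against them.
    return agent.complete(bundle.spec_prompt)  # 'complete' is a placeholder agent API
```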
LLM Judge Calibration
I developed specialized prompts for Claude 3.7 Sonnet Thinking using good and bad examples to achieve consistent grading. Each prompt is meticulously crafted to minimize variance - a time-intensive process critical for reliable scoring.
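As a rough illustration of the idea (the real prompts are hand-tuned per task and not published), a calibrated judge prompt might interleave the specification with one good and one bad reference output:

```python
# Sketch of a calibrated judge prompt; wording and score anchors are placeholders.
def build_judge_prompt(spec: str, good_example: str, bad_example: str, submission: str) -> str:
    return (
        "You are grading an AI coding agent's submission against a specification.\n\n"
        f"Specification:\n{spec}\n\n"
        f"Reference HIGH-quality output (should score 8-10):\n{good_example}\n\n"
        f"Reference LOW-quality output (should score 0-3):\n{bad_example}\n\n"
        f"Submission to grade:\n{submission}\n\n"
        "Return an integer score from 0 to 10 and a one-paragraph justification."
    )
```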
Scoring Formula
My weighted formula combines judge scores, unit test results, and deterministic checks (file existence, tool failures, app launches, crash detection). On average, each evaluation includes 8 unit tests and 30 deterministic checks, with tool failures counted from logs or manual observation.
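The exact weights and check lists are not published, but a weighted score in this spirit could look like the following sketch (the percentages are placeholder assumptions):

```python
# Hypothetical weighting: judge 50%, unit tests 30%, deterministic checks 20%.
def combined_score(judge_score: float,                   # 0-10 from the LLM judge
                   tests_passed: int, tests_total: int,  # unit test results
                   checks_passed: int, checks_total: int) -> float:
    judge_part = judge_score / 10.0
    test_part = tests_passed / tests_total if tests_total else 0.0
    check_part = checks_passed / checks_total if checks_total else 0.0
    return 100.0 * (0.5 * judge_part + 0.3 * test_part + 0.2 * check_part)
```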
Full Automation Testing
I test all agents in their built-in coding modes on full auto, letting them run to completion (hitting continue when needed). This simulates real-world usage where you provide detailed specifications and let the AI complete the task independently.
Statistical Rigor
Each model-agent combination undergoes 5 runs. Results falling more than one standard deviation from the mean are discarded to account for AI inconsistency. This approach tests how well agents execute detailed, thorough plans - closely matching day-to-day AI coding tool usage.
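The outlier handling can be sketched in a few lines; everything beyond "five runs, drop results more than one standard deviation from the mean, average the rest" is an assumption:

```python
import statistics

def filtered_mean(run_scores: list[float]) -> float:
    # Drop runs more than one standard deviation from the mean, then average.
    mean = statistics.mean(run_scores)
    stdev = statistics.stdev(run_scores)
    kept = [s for s in run_scores if abs(s - mean) <= stdev]
    return statistics.mean(kept) if kept else mean

# Example: five runs of one model-agent combination, with one low outlier removed.
print(filtered_mean([72.0, 75.5, 74.0, 61.0, 73.5]))  # -> 73.75
```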
Evaluation Security
I do not release my evaluation suite to prevent LLMs and agents from optimizing specifically against my test cases, ensuring genuine performance measurement that reflects real-world capabilities.
Community & Support
Running these evaluations is expensive, and managing them manually takes significant time, which makes it challenging to keep everything fully up to date.
Request Re-runs
Want to see a specific model-agent combination tested? Request re-runs or suggest new evaluations on Discord:
Join Discord Community
Help Fund Testing
Consider subscribing to help cover AI expenses for testing. My goal is to have ad revenue offset the costs of running these comprehensive evaluations:
Subscribe on YouTube
Explore Our Evaluations
Overall LLM Leaderboard
Complete performance rankings across all language models and AI agents tested
View Overall Results →
Best AI Coding Agents
Focused evaluation of top AI coding assistants with subset of models and specialized tests
Compare Agents →
Top 3 Recommendations
My personal top picks: Claude Code, Augment Code, and Roo Code - detailed analysis and recommendations
See My Picks →