Welcome to Gosu Evals
Determining which models and AI agents work best together through rigorous testing
Testing Methodology
The goal is to measure how well each model works with each agent. I built an evaluation set specifically designed to test real-world AI coding assistant performance through rigorous, multi-layered assessment.
Goal & Reasoning
My approach mirrors the rigorous testing methodologies I admire from GPU reviewers like GamersNexus and Level1Tech. Just as they test hardware under controlled, repeatable conditions to give you reliable performance data, I test AI coding assistants with standardized scenarios to measure real-world capability.
The reasoning: eliminate variables, control for consistency, and provide you with objective data to make informed decisions about which AI coding tools will actually work best for your specific needs.
Task Design & Examples
I created 20 tasks of varying complexity, generating high- and low-quality examples for each. These examples train the judge to score within my desired ranges and provide clear benchmarks for what constitutes quality output.
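To make that concrete, here is a minimal sketch of how a task and its calibration examples could be organized; the class names, fields, and the 0-10 score range are illustrative assumptions, not the actual suite.

```python
# Hypothetical structure for a task and its calibration examples.
# Field names and the 0-10 score range are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class CalibrationExample:
    output_summary: str   # description of a produced solution
    target_score: float   # where the judge should land for output like this

@dataclass
class EvalTask:
    name: str
    complexity: str       # e.g. "low", "medium", "high"
    high_quality: list[CalibrationExample] = field(default_factory=list)
    low_quality: list[CalibrationExample] = field(default_factory=list)
```

The judge prompt for a task can then interpolate both sets, so the model anchors its scores to concrete good and bad reference outputs.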
Unit Testing & Specifications
For each high-quality example, I wrote comprehensive unit tests and created detailed prompts specifying exactly what should be created - each file, its functionality, and the documentation requirements. Agents receive only the specifications, never the unit tests, ensuring authentic problem-solving.
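A rough sketch of that separation, assuming specs and hidden tests are bundled per task (names and the agent interface here are hypothetical):

```python
# Illustrative split between what the agent sees and what the grader keeps.
from dataclasses import dataclass

@dataclass
class GradingBundle:
    spec_prompt: str        # files to create, functionality, documentation requirements
    unit_tests: list[str]   # paths to hidden test files, never shown to the agent

def run_agent(agent, bundle: GradingBundle) -> str:
    # The agent only ever receives the specification text; the unit tests
    # stay on the grading side so it cannot code directly against them.
    return agent.complete(bundle.spec_prompt)  # 'complete' is a placeholder agent API
```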
LLM Judge Calibration
I developed specialized prompts for Claude 3.7 Sonnet Thinking using good and bad examples to achieve consistent grading. Each prompt is meticulously crafted to minimize variance - a time-intensive process critical for reliable scoring.
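As a rough illustration of the idea (the real prompts are hand-tuned per task and not published), a calibrated judge prompt might interleave the specification with one good and one bad reference output:

```python
# Sketch of a calibrated judge prompt; wording and score anchors are placeholders.
def build_judge_prompt(spec: str, good_example: str, bad_example: str, submission: str) -> str:
    return (
        "You are grading an AI coding agent's submission against a specification.\n\n"
        f"Specification:\n{spec}\n\n"
        f"Reference HIGH-quality output (should score 8-10):\n{good_example}\n\n"
        f"Reference LOW-quality output (should score 0-3):\n{bad_example}\n\n"
        f"Submission to grade:\n{submission}\n\n"
        "Return an integer score from 0 to 10 and a one-paragraph justification."
    )
```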
Scoring Formula
My weighted formula combines judge scores, unit test results, and deterministic checks (file existence, tool failures, app launches, crash detection). On average, each evaluation includes 8 unit tests and 30 deterministic checks, with tool failures counted from logs or manual observation.
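The exact weights and check lists are not published, but a weighted score in this spirit could look like the following sketch (the percentages are placeholder assumptions):

```python
# Hypothetical weighting: judge 50%, unit tests 30%, deterministic checks 20%.
def combined_score(judge_score: float,                   # 0-10 from the LLM judge
                   tests_passed: int, tests_total: int,  # unit test results
                   checks_passed: int, checks_total: int) -> float:
    judge_part = judge_score / 10.0
    test_part = tests_passed / tests_total if tests_total else 0.0
    check_part = checks_passed / checks_total if checks_total else 0.0
    return 100.0 * (0.5 * judge_part + 0.3 * test_part + 0.2 * check_part)
```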
Full Automation Testing
I test all agents in their built-in coding modes on full auto, letting them run to completion (hitting continue when needed). This simulates real-world usage where you provide detailed specifications and let the AI complete the task independently.
Statistical Rigor
Each model-agent combination undergoes 5 runs. Results falling more than one standard deviation from the mean are discarded to account for AI inconsistency. This approach tests how well agents execute detailed, thorough plans - closely matching day-to-day AI coding tool usage.
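The outlier handling can be sketched in a few lines; everything beyond "five runs, drop results more than one standard deviation from the mean, average the rest" is an assumption:

```python
import statistics

def filtered_mean(run_scores: list[float]) -> float:
    # Drop runs more than one standard deviation from the mean, then average.
    mean = statistics.mean(run_scores)
    stdev = statistics.stdev(run_scores)
    kept = [s for s in run_scores if abs(s - mean) <= stdev]
    return statistics.mean(kept) if kept else mean

# Example: five runs of one model-agent combination, with one low outlier removed.
print(filtered_mean([72.0, 75.5, 74.0, 61.0, 73.5]))  # -> 73.75
```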
Evaluation Security
I do not release my evaluation suite to prevent LLMs and agents from optimizing specifically against my test cases, ensuring genuine performance measurement that reflects real-world capabilities.
Community & Support
Running these evaluations is expensive, and managing them manually takes significant time, which makes it challenging to keep everything fully up to date.
Request Re-runs
Want to see a specific model-agent combination tested? Request re-runs or suggest new evaluations on Discord:
Join Discord Community
Help Fund Testing
Consider subscribing to help cover AI expenses for testing. My goal is to have ad revenue offset the costs of running these comprehensive evaluations:
Subscribe on YouTube
Explore Our Evaluations
Overall LLM Leaderboard
Complete performance rankings across all language models and AI agents tested
View Overall Results →
Best AI Coding Agents
Focused evaluation of top AI coding assistants with subset of models and specialized tests
Compare Agents →
Top 3 Recommendations
My personal top picks: Claude Code, Augment Code, and Roo Code - detailed analysis and recommendations
See My Picks →