Scoring System

How Model Kombat evaluates and ranks AI model outputs using a transparent, rubric-based system.

How Scoring Works

Model Kombat uses a rubric-based scoring system where AI judges evaluate each model's output against specific criteria you define. All outputs are anonymized before judging to ensure completely unbiased evaluation.

Key Principle

Judges see outputs labeled only as "Model A", "Model B", etc. They never know which AI model (GPT-4, Claude, Gemini, etc.) produced each response until you reveal identities.

The 0-100 Scoring Scale

Each criterion is scored on a 0-100 scale for maximum granularity

0-29
PoorFails to meet the criterion or has critical flaws
30-49
Below AveragePartially meets with significant issues
50-64
AverageMeets basic expectations adequately
65-79
GoodExceeds expectations with minor issues
80-94
ExcellentStrong performance with comprehensive coverage
95-100
ExceptionalOutstanding, near-perfect execution

Rubric Criteria & Weights

Each rubric has multiple criteria with customizable weights

A rubric consists of multiple criteria, each with a name, description, and weight. Weights must sum to 100%.

Example: Code Review Rubric
35%
weight
Accuracy

Correctly identifies all bugs, security issues, and anti-patterns in the code

25%
weight
Completeness

Covers security vulnerabilities, logic errors, and best practice violations

25%
weight
Actionability

Provides clear, specific, implementable fixes for each issue found

15%
weight
Clarity

Well-organized, easy to understand, with clear explanations

How Final Scores Are Calculated
Final Score = Σ (Criterion Score × Criterion Weight)

Example: If Model A scores 85 on Accuracy (35%), 70 on Completeness (25%), 90 on Actionability (25%), and 80 on Clarity (15%):
(85 × 0.35) + (70 × 0.25) + (90 × 0.25) + (80 × 0.15) = 29.75 + 17.5 + 22.5 + 12 = 81.75/100

Judge Reasoning

AI judges don't just output a number—they provide detailed reasoning for each score:

  • Per-criterion scores with specific justification
  • Strengths identified - what the model did well
  • Weaknesses noted - areas for improvement
  • Overall assessment summarizing the evaluation

This transparency lets you understand why each model received its score, not just the final number.

Fairness Controls

Multiple safeguards ensure unbiased evaluation

Anonymous Labels

Models are assigned random labels (A, B, C...) - judges never see model names until you reveal them

Blind Judging

AI judges evaluate outputs without knowing which model produced them, eliminating brand bias

Consistent Prompts

All models receive the exact same prompt and system instructions

Locked Parameters

Temperature, max tokens, and other settings are identical across all participants

Rubric-Based Scoring

Judges score against specific criteria, not subjective preferences

Full Score Range

Judges are instructed to use the entire 0-100 range to differentiate performance

Tournament Flow

1
Generate

Each model produces its initial response to your prompt

2
Critique

Models anonymously critique each other's outputs, identifying strengths and weaknesses

3
Refine

Models improve their responses based on critiques received (tests adaptability)

4
Judge

AI judges score each output (0-100) against your rubric criteria

5
Reveal

Finalize results and reveal which model (GPT-4, Claude, etc.) produced each output

Ready to run a tournament?

Put the scoring system to the test with your own prompts and custom rubrics.