Scoring System

How Model Kombat evaluates and ranks AI model outputs using a transparent, rubric-based system.

How Scoring Works

Model Kombat uses a rubric-based scoring system where AI judges evaluate each model's output against specific criteria you define. All outputs are anonymized before judging to ensure completely unbiased evaluation.

Key Principle

Judges see outputs labeled only as "Model A", "Model B", etc. They never know which AI model (GPT-4, Claude, Gemini, etc.) produced each response until you reveal identities.

The 0-100 Scoring Scale

Each criterion is scored on a 0-100 scale for maximum granularity

0-29

PoorFails to meet the criterion or has critical flaws

30-49

Below AveragePartially meets with significant issues

50-64

AverageMeets basic expectations adequately

65-79

GoodExceeds expectations with minor issues

80-94

ExcellentStrong performance with comprehensive coverage

95-100

ExceptionalOutstanding, near-perfect execution

Rubric Criteria & Weights

Each rubric has multiple criteria with customizable weights

A rubric consists of multiple criteria, each with a name, description, and weight. Weights must sum to 100%.

Example: Code Review Rubric

35%

weight

Accuracy

Correctly identifies all bugs, security issues, and anti-patterns in the code

25%

weight

Completeness

Covers security vulnerabilities, logic errors, and best practice violations

25%

weight

Actionability

Provides clear, specific, implementable fixes for each issue found

15%

weight

Clarity

Well-organized, easy to understand, with clear explanations

How Final Scores Are Calculated

Final Score = Σ (Criterion Score × Criterion Weight)

Example: If Model A scores 85 on Accuracy (35%), 70 on Completeness (25%), 90 on Actionability (25%), and 80 on Clarity (15%):
(85 × 0.35) + (70 × 0.25) + (90 × 0.25) + (80 × 0.15) = 29.75 + 17.5 + 22.5 + 12 = 81.75/100

Judge Reasoning

AI judges don't just output a number—they provide detailed reasoning for each score:

Per-criterion scores with specific justification
Strengths identified - what the model did well
Weaknesses noted - areas for improvement
Overall assessment summarizing the evaluation

This transparency lets you understand why each model received its score, not just the final number.

Fairness Controls

Multiple safeguards ensure unbiased evaluation

Anonymous Labels

Models are assigned random labels (A, B, C...) - judges never see model names until you reveal them

Blind Judging

AI judges evaluate outputs without knowing which model produced them, eliminating brand bias

Consistent Prompts

All models receive the exact same prompt and system instructions

Locked Parameters

Temperature, max tokens, and other settings are identical across all participants

Rubric-Based Scoring

Judges score against specific criteria, not subjective preferences

Full Score Range

Judges are instructed to use the entire 0-100 range to differentiate performance

Tournament Flow

Generate

Each model produces its initial response to your prompt

Critique

Models anonymously critique each other's outputs, identifying strengths and weaknesses

Refine

Models improve their responses based on critiques received (tests adaptability)

Judge

AI judges score each output (0-100) against your rubric criteria

Reveal

Finalize results and reveal which model (GPT-4, Claude, etc.) produced each output

Ready to run a tournament?

Put the scoring system to the test with your own prompts and custom rubrics.