LLM Scaling Laws and Benchmarks¶

Understanding how model size relates to performance and how to compare models objectively. Essential for choosing the right model for a task and predicting whether bigger = better for your use case.

Chinchilla Scaling Law¶

The relationship between model parameters and training data is roughly linear:

Core principle: parameters and training tokens should scale proportionally
Practical rule: doubling parameters requires doubling training data to fully utilize the extra capacity
Reverse: if doubling training data, need double the parameters to absorb it effectively

Parameters	Optimal Training Tokens	Example
1B	~20B tokens	Small research models
7B	~140B tokens	Llama-class models
70B	~1.4T tokens	Large open-source models
175B	~3.5T tokens	GPT-3 class

Implication for model selection: a well-trained 7B model can outperform a poorly-trained 13B model. Training data quality and quantity matter as much as parameter count.

Standard Benchmarks¶

Seven widely-used benchmarks for comparing LLMs:

Benchmark	Measures	Format
ARC	Scientific reasoning	Multiple choice questions (grade school + challenge)
DROP	Language comprehension	Reading + extraction (counting, sorting, arithmetic)
HellaSwag	Common sense reasoning	Sentence completion (harder contexts)
MMLU	Multi-domain knowledge	57 subjects, multiple choice. Somewhat superseded by MMLU-Pro
TruthfulQA	Accuracy under adversarial conditions	Model resists giving popular but false answers
WinoGrande	Ambiguity resolution	Pronoun disambiguation in confusing contexts
GSM8K	Mathematical reasoning	Grade school + middle school math word problems

Reading Benchmark Numbers¶

Benchmarks are typically reported as percentage accuracy
No single benchmark captures overall capability - look at the profile
MMLU-Pro has largely replaced MMLU due to concerns about question quality
Arena-style evaluations (Chatbot Arena / LMSYS) use human preference votes and are considered more reliable for conversational quality

Benchmark Limitations¶

Models can be trained to game specific benchmarks (teaching to the test)
Benchmark contamination: test questions may appear in training data
Multiple choice format tests recognition, not generation ability
Real-world task performance often diverges from benchmark rankings
Small models can beat larger ones on specific benchmarks while being worse overall

Practical Model Selection¶

Instead of chasing benchmark numbers, evaluate on YOUR task:

# Simple evaluation framework
def evaluate_model(model, test_cases):
    results = []
    for case in test_cases:
        response = model.generate(case['prompt'])
        score = assess_quality(response, case['expected'])
        results.append({
            'input': case['prompt'],
            'output': response,
            'score': score,
            'cost': calculate_cost(case['prompt'], response)
        })

    return {
        'accuracy': np.mean([r['score'] for r in results]),
        'avg_cost': np.mean([r['cost'] for r in results]),
        'cost_per_correct': sum(r['cost'] for r in results) / max(1, sum(r['score'] for r in results))
    }

Decision framework: 1. Define your specific task evaluation (not general benchmarks) 2. Test 2-3 model tiers (small/medium/large) on your task 3. Calculate cost-per-correct-answer, not just accuracy 4. Choose the smallest model that meets your quality threshold

Gotchas¶

A 7B model against GPT-4 is not a fair comparison. Frontier models have 10-100x more parameters. Small models excel at narrow, well-defined tasks but struggle with general reasoning. Set expectations accordingly.
Benchmark numbers are snapshots. Model rankings change with every release. Check the date on any benchmark comparison - results from 6 months ago may be irrelevant.
"State of the art" on one benchmark does not mean best overall. A model optimized for coding (HumanEval) may underperform on reasoning (ARC). Always check the benchmark relevant to your use case.

Cross-References¶

model optimization - quantization, pruning, distillation
frontier models - GPT-4, Claude, Gemini capabilities
fine tuning - when benchmarks suggest fine-tuning can help
agent evaluation - evaluating agent systems vs raw models