
Episode 223
Elo Ratings Questions
Elo ratings work for chess (κ=0.92) but fail catastrophically for AI agents (κ=0.42, test-retest r=0.31). Random users aren't chess arbiters. Code quality isn't win/loss. We explore psychometric failures, cognitive biases destroying data validity, and why quantitative metrics (McCabe complexity, test coverage) achieve far better reliability than human preferences (r=0.91 vs r=0.42, effect size d=2.18).
September 18, 2025
3m 39s
Show Notes
Key Argument
- Thesis: Using Elo for AI agent evaluation = measuring noise
- Problem: Wrong evaluators, wrong metrics, wrong assumptions
- Solution: Quantitative assessment frameworks
The Comparison (00:00-02:00)
Chess Elo
- FIDE arbiters: 120hr training
- Binary outcome: win/loss
- Test-retest: r=0.95
- Cohen's κ=0.92
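For reference, here is a minimal sketch of the standard Elo update that both settings rely on. The K-factor of 20 and the function names are assumptions chosen for illustration, not values from the episode. The score is 1 for a win, 0 for a loss, and 0.5 for a draw.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, score_a: float, k: float = 20.0):
    """Return (new_a, new_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k is the K-factor; 20 is an illustrative assumption.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two equally rated players: the winner gains half the K-factor.
new_a, new_b = update_ratings(1500.0, 1500.0, score_a=1.0)
```

The update only makes sense when `score_a` is a well-defined outcome, which is the point of the comparison: a chess result is unambiguous, a preference vote is not.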
AI Agent Elo
- Random users: Google engineer? CS student? 10-year-old?
- Undefined dimensions: accuracy? style? speed?
- Test-retest: r=0.31 (coin flip)
- Cohen's κ=0.42
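The κ figures above are Cohen's kappa, which measures agreement between two raters after subtracting the agreement expected by chance. A minimal sketch follows; the function name and the sample labels are hypothetical and are not data from the episode.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label for everything.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical votes on which agent response was better.
votes_a = ["A", "B", "A", "B"]
votes_b = ["A", "A", "B", "B"]
kappa = cohens_kappa(votes_a, votes_b)  # agreement no better than chance
```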
Cognitive Bias Cascade (02:00-03:30)
- Anchoring: 34% rating variance in first 3 seconds
- Confirmation: 78% selective attention to preferred features
- Dunning-Kruger: d=1.24 effect size
- Result: Circular preferences (A>B>C>A)
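Circular preferences can be detected mechanically by treating each "X beat Y" result as a directed edge and searching for a cycle. The sketch below uses depth-first search; the function name and input shape (a list of winner/loser pairs with string labels) are assumptions for illustration.

```python
def find_preference_cycle(pairs):
    """Return one cycle such as ['A', 'B', 'C', 'A'], or None if preferences are consistent.

    pairs is an iterable of (winner, loser) tuples.
    """
    graph = {}
    for winner, loser in pairs:
        graph.setdefault(winner, set()).add(loser)
        graph.setdefault(loser, set())

    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNVISITED for node in graph}
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for nxt in sorted(graph[node]):
            if state[nxt] == IN_PROGRESS:
                # Back edge: nxt is already on the current path, so we found a cycle.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == UNVISITED:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in sorted(graph):
        if state[node] == UNVISITED:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


cycle = find_preference_cycle([("A", "B"), ("B", "C"), ("C", "A")])
```

A non-empty result means no single ranking can satisfy every vote, which is exactly the situation a scalar rating cannot represent.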
The Quantitative Alternative (03:30-05:00)
Objective Metrics
- McCabe complexity ≤20
- Test coverage ≥80%
- Big O notation comparison
- Self-admitted technical debt
- Reliability: r=0.91 (objective metrics) vs r=0.42 (human preferences)
- Effect size: d=2.18
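As a rough illustration of how the first two thresholds could be checked automatically, the sketch below approximates McCabe complexity for Python source by counting decision points with the standard `ast` module. It is a simplified approximation and not the method used by any tool listed under Resources. The function names, the sample code, and the coverage figure passed in are all assumptions.

```python
import ast

# Node types that each add one decision point in this simplified count.
BRANCH_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' adds two extra paths.
            complexity += len(node.values) - 1
        elif isinstance(node, ast.comprehension):
            # One for the loop, plus one per 'if' filter.
            complexity += 1 + len(node.ifs)
    return complexity


def passes_gate(source: str, coverage_pct: float,
                max_complexity: int = 20, min_coverage: float = 80.0) -> bool:
    """Apply the thresholds from the notes: complexity <= 20 and coverage >= 80%."""
    return cyclomatic_complexity(source) <= max_complexity and coverage_pct >= min_coverage


SAMPLE = '''
def classify(n):
    if n < 0 and n % 2 == 0:
        return "neg-even"
    for i in range(n):
        if i == 3:
            return "three"
    return "other"
'''

sample_complexity = cyclomatic_complexity(SAMPLE)
```

Unlike a preference vote, running this twice on the same input always gives the same number, which is the reliability argument in miniature.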
Dream Scenario vs Reality (05:00-06:00)
Dream
- World's best engineers
- Annotated metrics
- Standardized criteria
Reality
- Random internet users
- No expertise verification
- Subjective preferences
Key Statistics
| Metric | Chess | AI Agents |
|---|---|---|
| Inter-rater reliability | κ=0.92 | κ=0.42 |
| Test-retest | r=0.95 | r=0.31 |
| Temporal drift | ±10 pts | ±150 pts |
| Hurst exponent | 0.89 | 0.31 |
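The test-retest row is a Pearson correlation between scores from two rating sessions on the same items. A minimal sketch, assuming two equal-length lists of numeric scores; the function name and sample numbers are hypothetical.

```python
from math import sqrt


def pearson_r(first_session, second_session):
    """Pearson correlation between two sets of scores for the same items."""
    if len(first_session) != len(second_session) or len(first_session) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(first_session)
    mean_x = sum(first_session) / n
    mean_y = sum(second_session) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(first_session, second_session))
    var_x = sum((x - mean_x) ** 2 for x in first_session)
    var_y = sum((y - mean_y) ** 2 for y in second_session)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one session has no variance")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same five items, rated a week apart.
week_1 = [1510.0, 1480.0, 1525.0, 1495.0, 1500.0]
week_2 = [1490.0, 1515.0, 1505.0, 1470.0, 1530.0]
retest_r = pearson_r(week_1, week_2)
```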
Takeaways
- Stop: Using preference votes as quality metrics
- Start: Automated complexity analysis
- ROI: 4.7 months to break even
Citations Mentioned
- Kapoor et al. (2025): "AI agents that matter" - κ=0.42 finding
- Santos et al. (2022): Technical Debt Grading validation
- Regan & Haworth (2011): Chess arbiter reliability κ=0.92
- Chapman & Johnson (2002): 34% anchoring effect
Quotable Moments
"You can't rate chess with basketball fans"
"0.31 reliability? That's a coin flip with extra steps"
"Every preference vote is a data crime"
"The psychometrics are screaming"
Resources
- Technical Debt Grading (TDG) Framework
- PMAT (Pragmatic AI Labs MCP Agent Toolkit)
- McCabe Complexity Calculator
- Cohen's Kappa Calculator