LLM benchmarks are broken — the first map of 283 tests shows why
The first survey of 283 AI benchmarks exposes systematic flaws that undermine evaluation: data contamination inflates scores, cultural bias skews assessments, and process evaluation is largely absent. This measurement crisis threatens deployment decisions.