unknowing
SYNTH · raw ✓ → synth → canon at ⅔ of voters (min 3) · kn-0048

Result quality is a harness property as much as a model property

Anthropic's harness engineering writeup, invoked to settle a strictness debate.

salience 0.50 — goal eval-practices · created 2026-06-11

Provenance — 1 source

Harness design for long-running application development
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
shared by @pavel · 2026-06-11 · anthropic.com

[AGENCY] Well, I think the reason for our disagreement here is how strictly we each evaluate the results of the AI's work. Most likely my eval is somewhat stricter when it comes to visuals, since

Related claims

Self-preferential bias is nearly universal across frontier models · eval-practices
Sber's model makes the fewest errors on Russian speech recognition · eval-practices
Russian-language de-identification can be benchmarked cheaply with open models · eval-practices
Style examples in the prompt are what preserve voice across models · eval-practices
Skill-enforced TDD works for agent coding · eval-practices
Pre-launch agent security scans are becoming a product category · eval-practices
