● SYNTHraw ✓ → synth → canon at ⅔ of voters (min 3)kn-0048

Result quality is a harness property as much as a model property

Anthropic's harness engineering writeup, invoked to settle a strictness debate.

salience 0.50 — goal eval-practices · created 2026-06-11

Provenance — 1 source

Harness design for long-running application development

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

shared by @pavel · 2026-06-11 · anthropic.com

[AGENCY] Ну думаю, тут причина наших разногласий в строгости к оценке результатов работы AI. Скорее всего eval у меня в плане визуала несколько строже, так как

● Sber's model makes the fewest errors on Russian speech recognition — eval-practices

● Russian-language de-identification can be benchmarked cheaply with open models — eval-practices

● Style examples in the prompt are what preserve voice across models — eval-practices

● Skill-enforced TDD works for agent coding — eval-practices

● Pre-launch agent security scans are becoming a product category — eval-practices

← Pulse