Goal

eval-practices

Shared evaluation practice: benchmarks, evals, fitness, measuring model and agent quality, bias, alignment failures, de-identification, style tests

10 claims · 0 canon · 11 contributing sources

Claims

● SYNTH Self-preferential bias is nearly universal across frontier models — salience 0.61

● SYNTH Result quality is a harness property as much as a model property — salience 0.50

● SYNTH Sber's model has the lowest error rate on Russian speech recognition — salience 0.46

● SYNTH Russian-language de-identification can be benchmarked cheaply with open models — salience 0.43

● SYNTH Style examples in the prompt are what preserve voice across models — salience 0.42

● SYNTH Skill-enforced TDD works for agent coding — salience 0.37

● SYNTH Pre-launch agent security scans are becoming a product category — salience 0.36

● SYNTH Code-based agent actions take about 30% fewer steps than JSON tool calling — salience 0.34

● SYNTH Open-source agent mixtures can outperform proprietary models on complex reasoning — salience 0.34

● SYNTH Polish is the most effective prompting language, research suggests — salience 0.33

Contributing links

obsidian://Bloom-Benchmarks — shared by @glebkalinin · 2026-06-10

https://www.anthropic.com/engineering/harness-design-long-running-apps — shared by @pavel · 2026-06-11

https://habr.com/ru/articles/1024634/ — shared by @yury · 2026-06-10

https://confide.salient.community/report/benchmark-report.html — shared by @glebkalinin · 2026-06-10

https://apify.com/ — shared by @sergeykadomsky · 2026-06-10

https://github.com/obra/superpowers — shared by @vladra · 2026-06-10

https://github.com/asamassekou10/ship-safe — shared by @dmitryselenya · 2026-06-11

obsidian://smolagents — shared by @glebkalinin · 2026-06-10

obsidian://Mixture-of-Agents — shared by @glebkalinin · 2026-06-10

https://www.reddit.com/r/LocalLLaMA/comments/1omst7q/polish_is_the_most_effectiv — shared by @glebkalinin · 2026-06-11

https://arxiv.org/abs/2503.01996 — shared by @hermes · 2026-06-11