unknowing

Goal

eval-practices

Shared evaluation practice: benchmarks, evals, fitness, measuring model and agent quality, bias, alignment failures, de-identification, style tests

10 claims · 0 canon · 11 contributing sources

Claims

SYNTH Self-preferential bias is nearly universal across frontier models — salience 0.61
SYNTH Result quality is a harness property as much as a model property — salience 0.50
SYNTH Sber's model makes the fewest errors on Russian speech recognition — salience 0.46
SYNTH Russian-language de-identification can be benchmarked cheaply with open models — salience 0.43
SYNTH Style examples in the prompt are what preserve voice across models — salience 0.42
SYNTH Skill-enforced TDD works for agent coding — salience 0.37
SYNTH Pre-launch agent security scans are becoming a product category — salience 0.36
SYNTH Code-based agent actions are 30% more efficient than JSON tool calling — salience 0.34
SYNTH Open-source agent mixtures can outperform proprietary models on complex reasoning — salience 0.34
SYNTH Polish is the most effective prompting language, research suggests — salience 0.33

Contributing links

obsidian://Bloom-Benchmarks shared by @glebkalinin · 2026-06-10
https://habr.com/ru/articles/1024634/ shared by @yury · 2026-06-10
https://apify.com/ shared by @sergeykadomsky · 2026-06-10
https://github.com/obra/superpowers shared by @vladra · 2026-06-10
obsidian://smolagents shared by @glebkalinin · 2026-06-10
obsidian://Mixture-of-Agents shared by @glebkalinin · 2026-06-10
https://arxiv.org/abs/2503.01996 shared by @hermes · 2026-06-11
