We shipped prompt improvements against a broken scoreboard

Harmonica runs asynchronous group conversations: participants answer questions in their own time, and an AI facilitator follows up, asks clarifying questions, and tries to surface what’s actually worth exploring further. We knew from host feedback that the facilitation wasn’t consistently working well — summaries that read as boilerplate, closing turns that wrapped up rather than checked whether anything was left unresolved, AI-flavored phrases that no human facilitator would write.

Identifying the problems was straightforward; hosts were reporting them directly. Knowing whether a prompt edit actually fixed one was harder. A change might look better in the conversations you read by hand, but you can’t read all of them, and the ones you pick to check aren’t random. We set out to build a measurement system that could run at scale.

The system runs each of Harmonica’s five system prompts against a rubric — criteria like “does the facilitator ask one question per turn?” and “does the closing turn check whether the participant is satisfied, rather than just summarizing?” — and uses a second LLM to read each output and score it against those criteria. We call this scorer the judge. We ran it across two types of test cases: synthetic scenarios we wrote ourselves, and real production transcripts with personal information removed. Experiment tracking ran through Braintrust.

The baseline results, from late May, were unflattering. The session design prompt scored 31% on “no AI tropes” — roughly two-thirds of generated designs still contained language a human facilitator would never write. The session recap prompt scored 31% on both “explains the experience rather than the mechanics” and “specific rather than generic,” matching what hosts had been reporting. The end-of-session summary prompt scored 53% on “no hallucinated stances,” meaning roughly half of generated summaries attributed a position to a participant who hadn’t taken one.

The rubrics drew on Hamel’s eval methodology writing and Braintrust’s judges documentation as the primary sources. Some axes drew on facilitation literature — Schwarz’s Skilled Facilitator, Kaner’s Diamond of Participation — where we had framework familiarity but not first-source access; those were flagged for review by expert facilitators before being used as hard gates.

Two prompt edits followed. The session design prompt got a voice-rules section naming specific forbidden phrases — “Welcome to” and “As an AI” as explicit trip-wires, with the instruction “if you find yourself writing either, stop.” The facilitation prompt got a “One question per turn” rule and sections grounding the Schwarz diagnostic cycle and the Kaner Diamond of Participation in the prompt’s own language.

The early scores looked good. “Participation zone fit” went from 45% to 87.5%. “No AI tropes” went from 31% to 68.8%.

Then we fixed the judge.

The judge was failing silently on long inputs. When the model’s response included a markdown code fence or a prose preamble before the JSON score, the parse step failed and returned a default score of 0.5, with no error and no log entry. On session-design-length outputs, which run 5,000–10,000 characters, this was happening about half the time, meaning every axis had a noise floor near 0.5 regardless of what the prompt was actually doing.
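A back-of-the-envelope model shows why a silent 0.5 default acts as a floor. The sketch below assumes, as a simplification that is not taken from the measured data, that parse failures are independent of whether an output would truly pass, so the reported score is a mixture of the true pass rate and the default. The function name and the rates passed in are illustrative.

```typescript
// Expected reported score when a fraction of judge calls silently fall back
// to a default instead of returning a real verdict.
// Assumption: failures are independent of the true outcome.
function expectedReportedScore(
  truePassRate: number,
  parseFailureRate: number,
  defaultScore = 0.5,
): number {
  return (1 - parseFailureRate) * truePassRate + parseFailureRate * defaultScore;
}

// With half of all judge calls failing, every axis is squeezed into 0.25..0.75.
const worstCase = expectedReportedScore(0, 0.5); // 0.25
const bestCase = expectedReportedScore(1, 0.5); // 0.75
console.log({ worstCase, bestCase });
```

Under this model, a prompt that truly fails every case and one that truly passes every case differ by 50 points instead of 100, so real improvements and real regressions both look muted.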

Switching from chat() + JSON.parse() to generateObject(), which enforces the JSON structure before it reaches the parse step, made parse failures genuinely rare. When we re-ran the sweep without changing any prompts, the scores shifted. “Prompt includes template instructions” jumped from 62.5% to 87.5%: that target had already been met, but the old judge hadn’t been able to see it. “No AI tropes,” which had read 68.8% after the quality pass, corrected down to 50%: the improvement from the prompt edit was real, but smaller than it had looked. Three of eight test cases were still producing AI-flavored output.
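The failure mode is easy to reproduce in isolation. The sketch below is not the judge code and does not reproduce the generateObject() switch; it only contrasts a parser that swallows errors and returns 0.5 with one that reports the failure to the caller. Both function names and the JudgeResult type are hypothetical.

```typescript
type JudgeResult =
  | { ok: true; score: number }
  | { ok: false; error: string; raw: string };

// The silent shape: any parse error becomes a plausible-looking 0.5.
function parseScoreSilently(raw: string): number {
  try {
    return JSON.parse(raw).score;
  } catch {
    return 0.5; // no error, no log entry
  }
}

// A louder shape: the caller has to handle the failure explicitly.
function parseScoreStrict(raw: string): JudgeResult {
  try {
    const parsed: unknown = JSON.parse(raw);
    const score = (parsed as { score?: unknown } | null)?.score;
    if (typeof score === "number") return { ok: true, score };
    return { ok: false, error: "missing numeric score", raw };
  } catch (e) {
    return { ok: false, error: e instanceof Error ? e.message : String(e), raw };
  }
}

// A response wrapped in a markdown fence, built without writing the fence literally.
const fence = "`".repeat(3);
const fencedResponse = `${fence}json\n{"score": 1}\n${fence}`;

console.log(parseScoreSilently(fencedResponse)); // 0.5, although the verdict was 1
console.log(parseScoreStrict(fencedResponse).ok); // false
```

Counting the `ok: false` results per sweep would have exposed the roughly 50% failure rate directly, instead of leaving it to show up as compressed scores.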

Some axes that looked improved weren’t. Some that looked like they were failing were fine. The honest baseline was the one after the judge fix.

The re-sweep also surfaced a problem with the summary prompt. After the judge fix, its axes all read near 100%, which looked like good news until we read the judge’s reasoning on individual sessions. The criteria — “grounds claims in the transcript,” “doesn’t attribute positions to participants who didn’t take them” — pass whenever a well-structured summary cites the transcript and doesn’t invent, including summaries from sessions participants rated 1 out of 5. The rubric wasn’t distinguishing good facilitation from merely honest reporting. The 53% reading before the judge fix had been the parse-failure floor. The honest signal is near 100%, which means the rubric needs sharpening before any further iteration on the summary prompt will tell us anything useful.

While the eval infrastructure was being built, a different quality problem got a different kind of fix.

When a participant gives two consecutive answers with no concrete anchor — no quoted phrase, no proper noun, no specific detail — the conversation has stalled. They’re engaging, but nothing is landing. Drilling further on the same question usually makes this worse; the right move is to advance to the next question and come back if there’s a natural opening.

This can’t be a rubric-and-judge approach. The fix has to happen during the conversation, not after it, and a scoring call per message would add too much latency. Instead we built a deterministic detector: it checks for the presence of quoted phrases, proper nouns, or checkable details. One practical wrinkle — mobile keyboards (iOS Smart Punctuation, Gboard autocorrect) replace typed quotation marks with curly Unicode variants, so the check had to cover both ASCII and Unicode forms, or a smart-quoted response would falsely register as no-anchor and the stall counter would increment incorrectly. When the detector fires on two consecutive no-anchor turns, it appends a specific directive to that turn’s system prompt telling the facilitator to advance.
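A minimal sketch of such a detector follows. The heuristics are assumptions chosen for illustration, not the production rules: a quoted phrase is text between ASCII or curly double quotes, a proper noun is a capitalized word that does not start a sentence, and a checkable detail is any digit. The directive wording and all function names are hypothetical.

```typescript
// ASCII double quote plus the curly variants that mobile keyboards substitute.
const QUOTED_PHRASE = /["\u201C\u201D][^"\u201C\u201D]+["\u201C\u201D]/;

function hasQuotedPhrase(text: string): boolean {
  return QUOTED_PHRASE.test(text);
}

// Heuristic: a capitalized word that is not the first word of a sentence.
function hasProperNoun(text: string): boolean {
  return text
    .split(/[.!?]+\s*/)
    .some((sentence) =>
      sentence.trim().split(/\s+/).slice(1).some((word) => /^[A-Z][a-z]{2,}/.test(word)),
    );
}

// Heuristic: any digit counts as a checkable detail (a date, a count, a price).
function hasCheckableDetail(text: string): boolean {
  return /\d/.test(text);
}

function hasAnchor(text: string): boolean {
  return hasQuotedPhrase(text) || hasProperNoun(text) || hasCheckableDetail(text);
}

// Fires when the two most recent participant turns both lack an anchor.
function shouldAdvance(participantTurns: string[]): boolean {
  const lastTwo = participantTurns.slice(-2);
  return lastTwo.length === 2 && lastTwo.every((turn) => !hasAnchor(turn));
}

// Hypothetical directive text appended to that turn's system prompt.
const ADVANCE_DIRECTIVE =
  "The last two answers had no concrete anchor. Advance to the next question.";

function systemPromptForTurn(basePrompt: string, participantTurns: string[]): string {
  return shouldAdvance(participantTurns)
    ? `${basePrompt}\n\n${ADVANCE_DIRECTIVE}`
    : basePrompt;
}
```

Only double quotes are matched here, since treating apostrophes as quotation marks would count every contraction as an anchor.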

The facilitation prompt already had a general instruction to advance when a conversation stalls. The detector doesn’t replace that instruction; it fires the specific decision when the signal is present. The two compose deliberately: the prompt sets the philosophy, the runtime check triggers the action.

A separate post-conversation layer added four soft signals — digression handling, a story-worthiness score per core question, instances of the facilitator requesting links or artifacts, instances of the facilitator praising participant content instead of facilitating — running as a single small LLM call after each conversation ends. These surface in the “Review my session” tab.

The methodology correction came at execution time.

The facilitation prompt’s close-turn behavior was the next edit target. The current prompt consistently summarized and closed rather than checking whether the participant was satisfied. On real production fixtures, “diagnostic before intervention” scored 20%. Twelve of 15 test cases failed with the same pattern in the judge’s reasoning: “summarizes and closes the conversation without checking whether the user is satisfied with stopping or whether the summary accurately captures their intent.”

The plan was to edit the prompt and re-run the eval to confirm the improvement. Before shipping, we read line 43 of the eval:

task: async (input: FixtureRow) => input.expected ?? ''

input.expected is the historical production turn saved in the fixture — the actual close turn the old prompt had generated, preserved in the test. The eval was scoring that saved output, not a new turn generated from the edited prompt. Any change to the prompt would produce the same fixture, the same saved turn, the same score. It was like grading a revision by checking the original draft.

A new eval takes the real fixtures, drops the final assistant turn, generates a fresh close turn from the current prompt, and scores that. Unlike the first design, this actually detects whether a prompt changed. When the close-turn edit was run through the new eval, the target score didn’t move — and the rubric, not the prompt, turned out to be why: “diagnostic before intervention” rewards mid-conversation hypothesis-testing, which a closing turn shouldn’t be doing. The edit was pulled rather than shipped, and the closing behavior is waiting on a criterion written for it.
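The difference between the two task functions can be sketched as follows. Only `input.expected` comes from the eval line quoted above; the `messages` field, the `Generate` signature, and the function names are assumptions made for illustration, and the model call is injected so the sketch runs without a provider.

```typescript
type Message = { role: "user" | "assistant"; content: string };

// `expected` is from the original eval; `messages` is an assumed field.
type FixtureRow = { messages: Message[]; expected?: string };

// Hypothetical model call: system prompt plus history in, one turn out.
type Generate = (systemPrompt: string, history: Message[]) => Promise<string>;

// Identity-replay: returns the saved turn, so the prompt never affects the output.
const identityReplayTask = async (input: FixtureRow): Promise<string> =>
  input.expected ?? "";

// Drop the final assistant turn if the transcript ends with one.
function dropFinalAssistantTurn(messages: Message[]): Message[] {
  const last = messages[messages.length - 1];
  return last?.role === "assistant" ? messages.slice(0, -1) : messages;
}

// Regenerative: a fresh closing turn generated from the current prompt.
function makeRegenerativeTask(systemPrompt: string, generate: Generate) {
  return (input: FixtureRow): Promise<string> =>
    generate(systemPrompt, dropFinalAssistantTurn(input.messages));
}
```

Because the regenerative task closes over `systemPrompt`, two prompt versions produce two different outputs for the same fixture, which is the property the identity-replay task lacks.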

The lesson, now written down: before prescribing an eval gate, read the task function. An eval that replays saved outputs and one that regenerates from the current prompt look identical in the test runner UI and produce identical experiment structures. Only one of them can measure what a prompt edit does.

The three eval shapes in the system look identical in the test runner; only the task function differs:

| Eval shape | What its `task` returns | What it can gate |
| --- | --- | --- |
| Identity-replay | the saved historical turn, unchanged | nothing; rubric calibration only |
| Synthetic | a fresh turn generated from the edited prompt | opening-turn behavior |
| Real-replay regenerative | a fresh final turn, regenerated over a truncated real transcript | closing and late-turn behavior |

The eval-shapes confusion was one version of a more basic question that kept coming up: what kind of check does a given behavior need? Three kinds showed up in this work, and reaching for the wrong one was a recurring mistake in itself. The no-anchor stall wanted a deterministic detector: the signal is mechanical and the fix has to land mid-conversation, so a scoring call was both unnecessary and too slow. A behavior you can write a rubric for and want measured across hundreds of transcripts wants the judge. A rule that only fires deep in a conversation (probe a vague answer for a concrete instance, acknowledge frustration before drilling further, answer “what happens to my responses?”) wants a real-LLM smoke test, because no eval reaches it: the synthetic cases stop after the opening turn, and the replay cases score one fixed historical turn. A rule that triggers on the third participant message is invisible to both.

A smoke test sounds manual: a person in a browser, typing answers and reading the replies. For those rules, that’s what we assumed it needed. But the participant can be scripted. A short harness feeds a fixed sequence of participant messages through the edited prompt against the real model and prints what the facilitator does at each trigger, with no browser and no database. The three rules above were checked that way in a single headless run, and the script stayed in the repo, so the next mid-conversation edit has a gate ready instead of a session to run by hand.
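A harness of that kind can be very small. The sketch below is an illustration under assumptions rather than the script in the repo: the `Complete` signature stands in for whatever real model call is used, and the type and function names are hypothetical.

```typescript
type Turn = { role: "participant" | "facilitator"; content: string };

// Hypothetical model call; in a real run this would hit the live model.
type Complete = (systemPrompt: string, history: Turn[]) => Promise<string>;

// Feeds a fixed participant script through the prompt, one message at a time,
// printing each facilitator reply so a person can read what happens at each trigger.
async function runScriptedSmoke(
  systemPrompt: string,
  script: string[],
  complete: Complete,
): Promise<Turn[]> {
  const history: Turn[] = [];
  for (const message of script) {
    history.push({ role: "participant", content: message });
    const reply = await complete(systemPrompt, [...history]);
    history.push({ role: "facilitator", content: reply });
    console.log(`participant: ${message}\nfacilitator: ${reply}\n`);
  }
  return history;
}
```

The script array is where the triggers live: a vague answer in the second slot, a frustrated one in the third, and so on, so each rule under test gets a message that should fire it.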

A smoke test is weaker than the judge, since you read the turns yourself instead of getting a number, but it’s the honest gate when no number is trustworthy yet. That was the situation with the summary prompt, whose rubric scores near 100% and can’t yet separate a good summary from a merely accurate one. The smoke test sits above the fixture tests, which only prove the code runs, and below a scored eval, which proves the rule holds at scale. When no eval can measure a behavior yet, watching a real model do it is the honest substitute.

Where things stand: six prompts have honest baselines under a working judge. Two have quality-pass iterations shipped. The summary prompt’s rubric needs sharpening before further iteration adds anything. The 20-point noise discipline is now standard — run-to-run variance on middle-distribution axes sits at ±15–25 points on the same prompt with the same judge, so moves smaller than 20 points aren’t worth claiming.
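The noise rule reduces to a one-line check. This is a sketch of the arithmetic only; the function name is hypothetical, the 20-point threshold is the one stated above, and the inputs in the usage lines are illustrative numbers rather than claims about any particular sweep.

```typescript
const NOISE_THRESHOLD_POINTS = 20;

// A move counts only if it is at least as large as the noise threshold.
function isClaimableMove(
  beforePct: number,
  afterPct: number,
  thresholdPoints = NOISE_THRESHOLD_POINTS,
): boolean {
  return Math.abs(afterPct - beforePct) >= thresholdPoints;
}

console.log(isClaimableMove(62.5, 87.5)); // true: a 25-point move
console.log(isClaimableMove(50, 68.8)); // false: an 18.8-point move
```

The check is symmetric, so an apparent regression smaller than 20 points is treated with the same skepticism as an apparent improvement.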

The rubric, the test fixtures, the judge transport, and the eval design are four separate layers that each need to be right before a score means what it appears to mean. We spent more time on those layers than on the prompts themselves. That was probably the right call.