Comment by energy123
8 days ago
Suggestion: run the identical prompt N times (e.g. 2 identical calls to Gemini 3.0 Pro plus 2 identical calls to GPT 5.2 Thinking), then run some basic text post-processing to see where the 4 responses agree and where they disagree. The disagreements (substrings that aren't exact matches) are where scrutiny is needed; if all 4 agree on some substring, it's almost certainly a correct transcription. Wouldn't be too hard to get Codex to vibe-code all this.
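A minimal sketch of the post-processing step, assuming the N responses have already been collected from the API calls (the example strings below are made up; any word-level alignment would do, `difflib` is just the stdlib option):

```python
from difflib import SequenceMatcher

def agreed_spans(responses):
    """Mark which words of the first response every other response
    also contains as an aligned match. Words that fail to align in
    any of the other responses are the spots needing human scrutiny."""
    base = responses[0].split()
    agreed = [True] * len(base)
    for other in responses[1:]:
        matched = [False] * len(base)
        sm = SequenceMatcher(a=base, b=other.split(), autojunk=False)
        for block in sm.get_matching_blocks():
            for i in range(block.a, block.a + block.size):
                matched[i] = True
        agreed = [a and m for a, m in zip(agreed, matched)]
    return list(zip(base, agreed))

# Hypothetical transcriptions: two calls each to two models,
# disagreeing on one token.
responses = [
    "Invoice total 27 dollars",
    "Invoice total 27 dollars",
    "Invoice total 22 dollars",
    "Invoice total 27 dollars",
]
for word, ok in agreed_spans(responses):
    print(("  " if ok else "? ") + word)
```

Anything flagged with `?` gets a human look; everything else is the high-confidence majority text.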
Look what they need to mimic a fraction of [the power of having the logit probabilities exposed so you can actually see where the model is uncertain]
All the LLM logprob outputs I've seen aren't very well calibrated, at least for transcription tasks; I'm guessing it's similar for OCR-type tasks.
"I already decided in my private reasoning trace to resolve this ambiguity by emitting the string '27' instead of '22' right here, thus '27' has 100% probability"