Comment by mistercow

9 days ago

I wonder if this is a case where you want an encoder-decoder model. It seems very much like a translation task, only one where training data is embarrassingly easy to synthesize by just grabbing sentences from a corpus and occasionally swapping, inserting, and deleting characters.
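
Something like this would do for that synthesis step (a minimal sketch; the corruption probabilities and alphabet are arbitrary placeholders, not tuned to any real OCR error model):

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def corrupt(text, p=0.05):
    """Simulate OCR-style noise: randomly substitute, delete, or insert characters."""
    out = []
    for ch in text:
        r = random.random()
        if r < p:                  # substitute with a random character
            out.append(random.choice(ALPHABET))
        elif r < 1.5 * p:          # delete the character
            continue
        elif r < 2.0 * p:          # insert a spurious character after it
            out.append(ch)
            out.append(random.choice(ALPHABET))
        else:                      # keep the character unchanged
            out.append(ch)
    return "".join(out)

clean = "the quick brown fox jumps over the lazy dog"
noisy = corrupt(clean)
# Each (noisy, clean) pair is one training example for the "translation".
print(noisy, "->", clean)
```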

In terms of attention masking, it seems like you want the input to be unmasked, since the input is fixed for a given “translation”, and then for the output tokens to use causally masked self attention plus cross attention with the input.
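
For what it's worth, that masking setup maps pretty directly onto PyTorch's nn.Transformer. A rough sketch, with placeholder sizes and random tensors standing in for real embeddings:

```python
import torch
import torch.nn as nn

# Placeholder sizes, not a real training configuration.
d_model, nhead, batch, src_len, tgt_len = 256, 4, 8, 40, 40

model = nn.Transformer(d_model=d_model, nhead=nhead,
                       num_encoder_layers=3, num_decoder_layers=3,
                       batch_first=True)

src = torch.randn(batch, src_len, d_model)  # embedded noisy input
tgt = torch.randn(batch, tgt_len, d_model)  # embedded clean output, shifted right

# Encoder self-attention: unmasked, the whole noisy input is visible.
# Decoder self-attention: causal mask. Cross-attention to the encoder
# output ("memory"): unmasked as well.
causal = model.generate_square_subsequent_mask(tgt_len)
out = model(src, tgt, src_mask=None, tgt_mask=causal, memory_mask=None)
print(out.shape)  # torch.Size([8, 40, 256])
```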

I wonder if you could get away with a much smaller network this way because you’re not pointlessly masking input attention for a performance benefit that doesn’t matter.

When I was reading your comment, I remembered an assignment from a Fuzzy Logic course:

"Number of different letters" is a great heuristic for a word guesser. In that method you just tell the number of letters and then do some educated guesses to start from a semi-converged point (I think word frequencies is an easy way), and brute force your way from there, and the whole process finds the words in mere milliseconds.

You can extend this method to a 2-3 word window, since we don't care about grammar, only about misread words, and brute-force from there.
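
A rough continuation of the same sketch for the windowed brute force (top_k and the scoring are my assumptions; it reuses guess(), letter_diff(), and vocab_by_freq from above):

```python
from itertools import product

def guess_window(misread_words, vocab_by_freq, max_diff=2, top_k=5):
    """Brute-force a small window: combine the top candidates for each
    misread word and rank the combinations by total letter differences."""
    per_word = [guess(w, vocab_by_freq, max_diff)[:top_k] for w in misread_words]
    return sorted(product(*per_word),
                  key=lambda combo: sum(letter_diff(g, m)
                                        for g, m in zip(combo, misread_words)))

# Prints the top-ranked word combinations for a two-word window.
print(guess_window(["cet", "hat"], vocab_by_freq)[:3])
```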

With this approach you may not even need a neural network to fix these kinds of misrecognitions. Add some SSE/AVX magic for faster processing and you have a potential winner on your hands.