Comment by andai

1 year ago

Fascinating. I think the process could be automated, though I don't know if it's been invented yet. You would want to use the existing autocomplete tech (probabilistic models based on Levenshtein distance and letter proximity on keyboard?) in combination with actually understanding the context of the article and using that to select the right correction. Actually, it sounds fairly trivial to slap those two together, and the 2nd half sounds like something a humble BERT could handle? (I've heard people getting great results with BERTs in current year, though they usually fine-tune them on their particular domain.)

I actually think even BERT could be overkill here -- I have a half-baked prototype of a keyword expansion system that should do the trick here. The idea is is to construct a data structure of keywords ahead of time (e.g. by data-mining some portion of Common Crawl), where each keyword has "neighbors" -- words that often appear together and (sometimes, but not always) signal relatedness. I didn't take the concept very far yet, but I give it better than even odds! (Especially if the resulting data structure is pruned by a half-decent LLM -- my initial attempts resulted in a lot of questionable "neighbors" -- though I had a fairly small dataset so it's likely I was largely looking at noise.)

1 comment

andai

sdesol 1 year ago

> I think the process could be automated

It can definitely be automated in my opinion, if you go with a supermajority workflow. Something that I've noticed with LLMs is it's very unlikely for all high-quality LLM models to be wrong at the same time. So if you go by a supermajority, the changes are almost certainly valid.

Having said all of that, I still believe we are not addressing the root cause of bad searches which is "garbage in, garbage out". I strongly believe the true calling for LLM will be to help us curate and manage data, at scale.