Comment by krick

2 years ago

Yeah, this really should be emphasized more. Reading the headline, I was absolutely amazed: this would be like stumbling upon evidence of a previously unknown, not yet described law of physics (or rather linguistics, in this case).

But given your quotation, this is actually quite intuitive IMO. What does classifying texts in a completely unknown language even mean? Suppose I asked you to classify texts in Kirundi. You have no idea what they mean, so the best you can do is count how often certain words (character sequences) occur and group together the texts with similar frequency fingerprints. You still would have no clue what these texts actually mean, but it might get you (and it turns out that it does get you) somewhere better than random. Well, good news: that's exactly what gzip+KNN do; it's their bread and butter, literally the only thing they live for.
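
To make the "frequency fingerprint" idea concrete, here is a minimal sketch of how I understand gzip+KNN. It assumes the usual normalized compression distance and a plain majority vote; the function names (clen, ncd, classify) and the choice to join texts with a space are mine, not taken from the paper's code.

```python
import gzip
from collections import Counter


def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance: small when a and b share byte sequences."""
    ca, cb = clen(a), clen(b)
    cab = clen(a + " " + b)  # joining with a space is an assumption of this sketch
    return (cab - min(ca, cb)) / max(ca, cb)


def classify(text: str, train: list[tuple[str, str]], k: int = 3) -> str:
    """Majority label among the k training texts closest to `text` under ncd."""
    nearest = sorted(train, key=lambda pair: ncd(text, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Here train is a list of (text, label) pairs. Nothing in this looks at meaning: two texts count as close only because the compressor can reuse character sequences from one while encoding the other.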

Reading these texts (trying to understand them, predicting the next character) gets you pretty much nowhere. As a sensible human being, you wouldn't even try, because it's hopeless: you don't speak the language, what more is there to say… Well, unfortunately, that's exactly what BERT does. It's the only thing it knows how to do. We can congratulate it on getting more use out of this than a typical (and, I suppose, a not-so-typical) human would.