Comment by marcodiego
1 day ago
How expensive is a "brute force" approach to decode it? I mean, how about mapping each unknown word by a known word in a known language and improve this mapping until a 'high score' is reached?
This seems to assume that a 1:1 mapping between words exists, but I don't think that's true for languages in general. Compound words, for example, won't map cleanly that way. Not to mention deeper semantic differences between languages due to differences in culture.
Correct
Mapping words 1:1 is not going to lead you anywhere (especially for a text that has stood undecoded for so long).
It kiiiinda works for very close languages (think Dutch<>German or French<>Spanish) and even then.
I don't think that's likely possible. How would you determine the score? Where would you get your corpus of medieval words? How would you deal with the insane computational complexity?
Peculiarities in Voynich also suggest that one-to-one word mappings are very unlikely to result in well-described languages. For instance, there are cases of repeated word sequences you don't really see in regular text. There's a lack of extremely common words that you would expect to be necessary for a word-based structured grammar, there are signs that there are at least two 'languages', character distributions within words don't match any known language, etc.
If there still is a real unencoded language in here, it's likely to be entirely different from any known language.
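For anyone wondering what "repeated word sequences" or "character distributions" mean concretely, here's a rough Python sketch of the kind of statistics people compute over a transcription. The file name and whitespace tokenization are just placeholders, not any particular transliteration standard:

```python
from collections import Counter

def repeated_adjacent_words(tokens):
    """Count positions where the same word appears twice in a row,
    a pattern reported to be unusually common in Voynich transcriptions."""
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)

def char_distribution(tokens):
    """Relative frequency of each character, for comparison against known languages."""
    counts = Counter(ch for tok in tokens for ch in tok)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

# Hypothetical usage with any whitespace-tokenized transcription file:
# tokens = open("transcription.txt").read().split()
# print(repeated_adjacent_words(tokens))
# print(sorted(char_distribution(tokens).items(), key=lambda kv: -kv[1])[:10])
```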
That’s a really interesting question — and one I’ve been circling in the back of my head, honestly. I’m not a cryptographer, so I can’t speak to how feasible a brute-force approach is at scale, but the idea of mapping each Voynich “word” to a real word in another language and optimizing for coherence definitely lines up with some of the more experimental approaches people have tried.
The challenge (as I understand it) is that the vocabulary size is pretty massive — thousands of unique words — and the structure might not be 1:1 with how real language maps. Like, is a “word” in Voynich really a word? Or is it a chunk, or a stem with affixes, or something else entirely? That makes brute-forcing a direct mapping tricky.
That said… using cluster IDs instead of individual words (tokens) and scoring the outputs with something like a language model seems like a pretty compelling idea. I hadn't thought of doing it that way. There's definitely some room there for optimization or even evolutionary techniques. If nothing else, it could tell us something about how "language-like" the structure really is.
Might be worth exploring — thanks for tossing that out, hopefully someone with more awareness or knowledge in the space sees it!
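To make the "improve the mapping until a high score is reached" idea concrete, here's a minimal hill-climbing sketch in Python. The `score_fn` and `candidate_words` are assumed inputs (e.g. log-probability under some language model and a target-language vocabulary), and all the objections above about 1:1 mappings and search-space size still apply; this is only meant to show the shape of the search, not a workable decipherment:

```python
import random

def hill_climb_mapping(text_tokens, candidate_words, score_fn, iters=10_000):
    """Naive hill climbing over a word-to-word substitution.

    text_tokens:     the transcription as a list of Voynich "words" (or cluster IDs)
    candidate_words: vocabulary of the target language to substitute in
    score_fn:        scores a decoded token list; higher = more language-like
    """
    vocab = sorted(set(text_tokens))
    mapping = {w: random.choice(candidate_words) for w in vocab}

    def decode():
        return [mapping[w] for w in text_tokens]

    best = score_fn(decode())
    for _ in range(iters):
        w = random.choice(vocab)
        old = mapping[w]
        mapping[w] = random.choice(candidate_words)  # propose a single-word change
        s = score_fn(decode())
        if s > best:
            best = s           # keep the improvement
        else:
            mapping[w] = old   # revert if the score got worse
    return mapping, best
```

With thousands of unique words and a large candidate vocabulary, single-swap hill climbing like this gets stuck almost immediately, which is roughly why people reach for evolutionary or annealing-style variants instead.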
It might be a good idea for a SETI@home like project.
Like I said in another post (sorry for repeating), since this was during the 1500s, the main thing people would've been encrypting back then was biblical text (or texts from other religions).
Maybe a version of scripture that had been "rejected" by some King, and was illegal to reproduce? Take the best radiocarbon dating, figure out who was King back then and whether they 'sanctioned' any biblical translations, then go to the version of the bible before that translation, and that may be what was illegal and needed to be encrypted. That's just one plausible story. Who knows, we might find out the phrase "young girl" was simplified to "virgin", and that would potentially be a big secret.
Is this grey because it talks about religion? That stuff was bigger in 1500 than in 2000, and from that lens, a religious text seems like a reasonable track to follow.