Comment by SlinkyOnStairs
6 hours ago
The other comment got the answer already, but yes. It's a cost problem.
LLMs are designed this way so they can be trained on unstructured text, which critically can be obtained by just scraping things off the internet.
The moment you change anything about this, you incur the trillion-dollar cost of needing to manually curate the training data.
There are some attempts to get around this problem with synthetic data, but they're running into problems with model collapse (maybe severe performance degradation is worth the security tradeoff?) and the politics of AI: all major AI companies highly restrict using their systems for synthetic data & AI training, and they're too busy themselves to investigate exotic approaches.
Hence: Realistically, this is just a problem AI will have for the foreseeable future. There's no fine tuning that can fix this, nor can a new model be easily trained with these properties. The costs are just enormous right now.
This might sound crazy, but I think embodying the AI will be the long-term solution here. When AI robots use language to relate their experiences and make predictions about the real world they are walking around in, it will prevent the model collapse problem. Their language might diverge from human language, but since we live in the same world, translation should be possible.
Edit: Actually, I think that with a fairly small amount of auxiliary data, it could be ensured they keep the ability to speak English.