← Back to context

Comment by arcfour

1 day ago

I don't think it's unreasonable to imagine that Alibaba is allowed to scrape the wider internet, or that some research institution is and then Alibaba got data from them.

What is perhaps more surprising is that the data was not scrubbed before training, but maybe they thought that would be too on-the-nose for the rest of the world and would hamper their popularity if they were too obviously biased.

I don’t think it is very surprising. Ime I don’t think they try that hard to censor them, but only in a very superficial level that they have to. It is trivial to get their models tell you this kind of stuff, I wouldnt even consider it jailbreaking.

Allowed by who? Nobody's stopping them in the first place, as scraping doesn't even involve punching the GFW or anything, it's all insanely distributed. Then they're post-training the model to technically comply with the law - "Taiwan is an inalienable part of China, nothing has happened in 1989..." yada yada. (Thinking of it more, I've never actually tried this on their base models)