← Back to context

Comment by SubiculumCode

7 hours ago

I doubt it. By not releasing it, Chinese companies will be unable to break TOS and use it to acquire high quality training data...which, I suspect, is how they've kept pace

Z.AI, Moonshot, DeepSeek all have a pipeline of data of their own now due to capturing a slice of the market through cheap tokens. It's not impossible to imagine that they might share the data too if the CCP thinks that will help their AI strategy.

  • No. Most data generated this way is poor quality. It's not the user responses and or queries. If the user does not know better than the LLM, you can generate bad responses. The value is in taking a superior model, submitting a query, and getting a higher quality output than you yourself could have generated, and using that to boost your model.

    • AI companies have been using synthetic data for ages now. The data doesn't need to yield new insights to be useful for training.

    • You identify users doing real work and implementing a project over a long period of time and train on their traces.