Comment by 1vuio0pswjnm7
4 hours ago
Why does OpenAI collect and retain for 30 days^1 chats that the user wants deleted?
It was doing this prior to being sued by the NYT and many others.
OpenAI was collecting chats even when the user asked for deletion, i.e., the user did not want them saved.
That's why a lawsuit could require OpenAI to issue a hold order, retain these chats for longer, and produce them to another party in discovery.
If OpenAI was not collecting these chats in the ordinary course of its business before being sued by the NYT and many others, then there would be no "deleted chats" for OpenAI to be compelled by court order to retain and produce to the plaintiffs.
1. Or whatever period OpenAI decides on. It could change at any time, for any reason. However, OpenAI cannot change its retention policy to some shortened period after being sued. Google tried this a few years ago: it began destroying chats between employees after Google was on notice that it was going to be sued by the US government and state AGs.
I'm not commenting on the core point of your comment, only the "why retain for 30 days" question.
In an age of automated backups and failovers, deleting can be really hard. Part of the answer could simply be that syncing a delete across all the redundancies (while ensuring those redundancies remain reliable when a disaster happens and they need to recover or maintain uptime) may take days to weeks. Also, the 30 days could be the upper limit, as opposed to the average or median time it takes.
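To make that concrete, here's a toy sketch of why a single delete can take a long time to land everywhere. Everything in it is invented for illustration (the replica names, the `send_delete` callback); it's just the general shape of the problem, not OpenAI's actual infrastructure:

```python
import time

# Hypothetical redundant copies; in practice these could be a primary
# database, read replicas, search indexes, caches, and offsite backups,
# each with very different availability and latency.
REPLICAS = ["primary-db", "read-replica", "search-index", "cold-backup"]

def delete_everywhere(chat_id, send_delete, max_days=30):
    """A delete isn't 'done' until every redundant copy acknowledges it.
    Slow or temporarily offline copies (cold backups especially) get
    retried, which is one reason the window is days, not seconds."""
    pending = set(REPLICAS)
    deadline = time.time() + max_days * 86400
    while pending and time.time() < deadline:
        for replica in list(pending):
            if send_delete(replica, chat_id):  # True once acknowledged
                pending.discard(replica)
        time.sleep(60)  # back off, then retry the stragglers
    return not pending  # True only if every copy confirmed the delete
```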
The most likely explanation is that whatever storage solution they're using has built-in "recycle bin" functionality, and deleted data stays there for 30 days before it's actually deleted. I see this a lot in very large databases; the recycle-bin functionality is built into the data store product.
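For illustration, that "recycle bin" pattern looks roughly like this (a minimal Python sketch; `ChatStore` and the 30-day constant are made up, not any particular product's API):

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # hypothetical recycle-bin window

class ChatStore:
    """Soft-delete sketch: a user-facing delete only stamps the record;
    a background purge job hard-deletes it after the retention window."""

    def __init__(self):
        self.rows = {}  # chat_id -> {"text": ..., "deleted_at": None}

    def delete(self, chat_id):
        # User-facing delete: mark the row, don't remove it yet.
        self.rows[chat_id]["deleted_at"] = datetime.now(timezone.utc)

    def purge(self):
        # Periodic job: hard-delete rows whose tombstone is older
        # than the retention window.
        cutoff = datetime.now(timezone.utc) - RETENTION
        expired = [cid for cid, row in self.rows.items()
                   if row["deleted_at"] and row["deleted_at"] < cutoff]
        for cid in expired:
            del self.rows[cid]
```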
That sounds very plausible.
What is the standard way to handle being forced to restore from backup while ensuring deleted data does not also get restored? Is every delete request stored so that it can be replayed against any restore?
I have only had to manage this in a startup context with relatively low stakes, and it was hard and messy. I don't know what best practice is at the scale OpenAI operates, but from my limited experience I have an intuition that the challenge is not trivial.
Also, I suspect there is a big gap between best practice and common practice. My guess is common practice is dysfunctional. I would also suspect there is no standard way, but there are established practices within different technology stacks that vary between performative, barely compliant, and effective at scale.
In one case I saw, there was a substantial manual effort to load snapshots into instances, run the deletes, and then save new snapshots. This was over 10 years ago, though, and it was more of a "we just need to get this done" than a "what's the most elegant way to do this at scale".
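In sketch form, that process (restore a snapshot, replay the recorded deletes against it, save a clean snapshot) might look like the following. The file formats here are invented for the example, not anyone's real tooling:

```python
import json

def restore_with_delete_replay(snapshot_path, delete_log_path):
    """Load a backup, then re-apply every delete request recorded
    since that backup was taken, so deleted data doesn't come back."""
    with open(snapshot_path) as f:
        data = json.load(f)  # {chat_id: chat, ...} as of backup time

    with open(delete_log_path) as f:
        for line in f:  # append-only log, one JSON record per delete
            entry = json.loads(line)
            data.pop(entry["chat_id"], None)  # re-apply the delete

    return data  # snapshot state minus everything deleted since
```

The hard parts in practice are everything around this: keeping the delete log durable and complete, and doing the replay without taking the system down.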
Maybe an append-only data store where actual hard deletes only happen as an async batch job? Still, 30 days seems really long for this.
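Something like this toy version, where a delete writes a tombstone that hides the data immediately and a later compaction pass drops it for real (invented class, not a real product):

```python
class AppendOnlyStore:
    """Append-only sketch: writes and deletes are both appended as
    records; a tombstone hides the data right away, and an async
    compaction job later removes it physically."""

    def __init__(self):
        self.log = []  # [(chat_id, payload_or_None)]; None = tombstone

    def put(self, chat_id, payload):
        self.log.append((chat_id, payload))

    def delete(self, chat_id):
        self.log.append((chat_id, None))  # tombstone; hard delete deferred

    def get(self, chat_id):
        for cid, payload in reversed(self.log):  # latest record wins
            if cid == chat_id:
                return payload  # None if the latest record is a tombstone
        return None

    def compact(self):
        # Async batch job: keep only the latest live record per key.
        # Tombstoned data physically disappears here, on whatever
        # schedule the operator picks (e.g. every 30 days).
        latest = dict(self.log)
        self.log = [(k, v) for k, v in latest.items() if v is not None]
```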