Comment by nine_k
11 hours ago
> DELETEs are likely fairly rare by volume for many use cases
All your other points make sense, given this assumption.
I've seen tables where 50%-70% were soft-deleted, and it did affect the performance noticeably.
> Undoing is really easy
Depends on whether undoing even happens, and whether the act of deletion and undeletion require audit records anyway.
In short, there are cases when soft-deletion works well, and is a good approach. In other cases it does not, and is not. Analysis is needed before adopting it.
If it takes only 50-70% of your data being dead to cause issues, you probably have an underlying indexing problem anyway, because scaling to 2x-3x the customers would cause the same issues by sheer magnitude.
That said, we've had soft deletes, and in discussions about keeping them one argument was that they're really only a half-assed measure (data lost through updates rather than deletes isn't saved at all).
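To make the indexing point concrete, here's a minimal sketch (Postgres syntax; the orders table, customer_id column and deleted_at marker are hypothetical): a partial index that covers only live rows keeps the soft-deleted bulk out of index scans, so index size and lookup cost track the live row count rather than the total.

    -- Hypothetical orders table with a deleted_at soft-delete marker.
    -- The partial index covers only live rows, so it doesn't grow with
    -- the volume of soft-deleted data.
    CREATE INDEX orders_live_customer_idx
        ON orders (customer_id)
        WHERE deleted_at IS NULL;

    -- Queries have to repeat the predicate for the planner to use it:
    SELECT *
    FROM orders
    WHERE customer_id = 42
      AND deleted_at IS NULL;

The catch is that every query (or a view wrapping the table) has to carry the deleted_at IS NULL predicate for the partial index to apply.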
> I've seen tables where 50%-70% were soft-deleted, and it did affect the performance noticeably.
I think we largely need support for "soft deletes" to be baked into SQL or its dialects directly and treated as something transparent: regular SELECTs skip soft-deleted rows, selecting them is the special case, and there's support for turning regular DELETE statements into soft deletes under the hood.
https://news.ycombinator.com/item?id=41272903
And then make dynamically sharding data by deleted/not deleted really easy to configure.
You soft-deleted a few rows? They get moved to another DB instance, an archive/bin of sorts. Normal queries wouldn't even consider it; it would only be reached when you explicitly select soft-deleted rows.
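You can approximate the "transparent" part today, at least in Postgres, with a view over the base table plus an INSTEAD OF trigger. A rough sketch, assuming a hypothetical orders_all base table with id and deleted_at columns:

    -- Regular reads and DELETEs go through the view; the base table
    -- keeps everything, including soft-deleted rows.
    CREATE VIEW orders AS
        SELECT * FROM orders_all WHERE deleted_at IS NULL;

    CREATE FUNCTION orders_soft_delete() RETURNS trigger AS $$
    BEGIN
        UPDATE orders_all SET deleted_at = now() WHERE id = OLD.id;
        RETURN OLD;
    END;
    $$ LANGUAGE plpgsql;

    -- Turn plain DELETEs against the view into soft deletes.
    CREATE TRIGGER orders_soft_delete_trg
        INSTEAD OF DELETE ON orders
        FOR EACH ROW EXECUTE FUNCTION orders_soft_delete();

    -- Application code stays unchanged:
    DELETE FROM orders WHERE id = 42;

The archive-on-another-instance part is the piece with no standard answer, though; that still tends to end up as bespoke ETL or foreign tables.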
> I've seen tables where 50%-70% were soft-deleted, and it did affect the performance noticeably.
Depending on your use case, having soft deletes doesn't mean you can't clean out old deleted data anyway. You may want a process that grabs everything soft-deleted more than X years ago and just hard-deletes it.
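For example, a periodic purge job might look roughly like this (Postgres syntax, hypothetical schema), deleting in batches so each pass keeps locks and WAL churn small:

    -- Hard-delete rows soft-deleted more than two years ago, 10k at a time.
    -- Run repeatedly (e.g. from cron) until it affects zero rows.
    DELETE FROM orders
    WHERE id IN (
        SELECT id
        FROM orders
        WHERE deleted_at < now() - interval '2 years'
        LIMIT 10000
    );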
> Depends on whether undoing even happens, and whether the act of deletion and undeletion require audit records anyway.
Yes, but this is no more complex than the current situation, where you always have to create the audit records anyway.
> I've seen tables where 50%-70% were soft-deleted, and it did affect the performance noticeably.
At that point you should probably investigate partitioning or data warehousing.
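A sketch of what partitioning on the soft-delete flag could look like (Postgres declarative partitioning, hypothetical schema), so queries on live data never touch the dead partition:

    CREATE TABLE orders (
        id          bigint  NOT NULL,
        customer_id bigint  NOT NULL,
        is_deleted  boolean NOT NULL DEFAULT false
    ) PARTITION BY LIST (is_deleted);

    CREATE TABLE orders_live    PARTITION OF orders FOR VALUES IN (false);
    CREATE TABLE orders_deleted PARTITION OF orders FOR VALUES IN (true);

    -- Soft deleting is an UPDATE that moves the row into orders_deleted;
    -- queries filtering on is_deleted = false are pruned to orders_live.
    SELECT * FROM orders WHERE customer_id = 42 AND is_deleted = false;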
What would be the benefit of data warehousing in this case?
Agreed. And if deletes are soft, you likely really wanted a complete audit history of all updates too (at least that was true in the cases I've been part of). And then performance would _definitely_ suffer if you don't have a separate audit/archive table for all of that.
I mean, yes, growth forever doesn't tend to work.
I've seen a number of apps that require audit histories work on the basis that records are archived at a particular time; that's when the deletes happen and the indexes are fully rebuilt. This is typically scheduled during the least busy time of the year, as it's rather IO-intensive.
The oldest I've worked with was a project started in ~1991. I don't recall when they started keeping history or for how long (they might have trimmed it after some shorter legal retention period), but I worked on it ~15 years after that. And that's, what, 15-20 years ago by now, and I doubt they've changed that part of the system. You've all likely bought products that were administered through this system.
FWIW, no "indexes fully rebuilt" upon "actual deletion" or anything like that. The regular tables were always just "current" tables. History was kept in archive tables that were always up-to-date via triggers. Essentially, current tables never suffered any performance issues and history was available whenever needed. If history access was needed for extensive querying, read replicas were able to provide this without any cost to the main database but if something required "up to the second" consistency, the historic tables were available on the main database of course with good performance (as you can tell from the timelines, this was pre-SSDs, so multi-path I/O over fibre was what they had at the time I worked with it with automatic hot-spare failover between database hosts - no clouds of any kind in sight). Replication was done through replicating the actual SQL queries modifying the data on each replica (multiple read replicas across the world) vs. replicating the data itself. Much speedier, so that the application itself was able to use read replicas around the globe, without requiring multi-master for consistency. Weekends used to "diff" in order to ensure there were no inconsistencies for whatever reason (as applying the modifying SQL queries to each replica does of course have the potential to have the data go out of sync - theoretically).
Gee, I'm old, lol!
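For anyone curious, the trigger-maintained archive approach described above can be sketched roughly like this in modern Postgres (hypothetical orders/orders_history schema; the original system obviously predates this exact syntax, but the shape is the same):

    -- Current table stays lean; every change is appended to a history table.
    CREATE TABLE orders_history (
        history_id bigserial PRIMARY KEY,
        op         text        NOT NULL,               -- INSERT / UPDATE / DELETE
        changed_at timestamptz NOT NULL DEFAULT now(),
        row_data   jsonb       NOT NULL                -- snapshot of the row
    );

    CREATE FUNCTION orders_archive() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            INSERT INTO orders_history (op, row_data) VALUES (TG_OP, to_jsonb(OLD));
        ELSE
            INSERT INTO orders_history (op, row_data) VALUES (TG_OP, to_jsonb(NEW));
        END IF;
        RETURN NULL;  -- return value is ignored for AFTER triggers
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_archive_trg
        AFTER INSERT OR UPDATE OR DELETE ON orders
        FOR EACH ROW EXECUTE FUNCTION orders_archive();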
50-70% as the worst case isn't even necessarily that bad.
(Again, a lot of it is O(log n), right?)