
Comment by nine_k

15 hours ago

> DELETEs are likely fairly rare by volume for many use cases

All your other points make sense, given this assumption.

I've seen tables where 50%-70% were soft-deleted, and it did affect the performance noticeably.

> Undoing is really easy

Depends on whether undoing even happens, and whether the act of deletion and undeletion require audit records anyway.

In short, there are cases when soft-deletion works well, and is a good approach. In other cases it does not, and is not. Analysis is needed before adopting it.

If only 50-70% of your data being dead is already causing issues, then you probably have an underlying indexing problem anyhow (scaling to 2x-3x the customers would cause the same issues purely by volume).

That said, we've had soft deletes, and during discussions about keeping them, one argument was that they're really only a half-assed measure (data lost through updates rather than deletes isn't saved at all).
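
On the indexing point: a partial index is one common way to keep soft-deleted rows from dragging down the hot path. A minimal PostgreSQL sketch, assuming a hypothetical orders table with a deleted_at column:

    -- Index only the live rows, so lookups stay fast even if most rows are soft-deleted.
    CREATE INDEX orders_live_customer_idx
        ON orders (customer_id)
        WHERE deleted_at IS NULL;

    -- Queries must repeat the predicate so the planner can use the partial index.
    SELECT * FROM orders
     WHERE customer_id = 42
       AND deleted_at IS NULL;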

> I've seen tables where 50%-70% were soft-deleted, and it did affect the performance noticeably.

I think we largely need support for "soft deletes" to be baked into SQL or its dialects directly and treated as something transparent: selecting soft-deleted rows becomes the special case, regular SELECTs skip those rows, and regular DELETE statements can be turned into soft deletes under the hood.

https://news.ycombinator.com/item?id=41272903

And then make it really easy to configure dynamic sharding of data by deleted/not deleted.

You soft-deleted a few rows? They get moved to another DB instance, an archive/bin of sorts. Normal queries wouldn't even consider it; only when you explicitly select soft-deleted rows would it be reached.
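
You can approximate the "transparent" part in stock PostgreSQL today with a view plus an INSTEAD OF trigger. A minimal sketch, with all table and column names hypothetical (the dynamic "archive instance" sharding would still have to be hand-rolled):

    -- The application talks to the view; the raw table keeps everything.
    CREATE TABLE orders_raw (
        id         bigint PRIMARY KEY,
        payload    jsonb,
        deleted_at timestamptz
    );

    CREATE VIEW orders AS
        SELECT id, payload FROM orders_raw WHERE deleted_at IS NULL;

    -- DELETE against the view becomes a soft delete underneath.
    CREATE FUNCTION orders_soft_delete() RETURNS trigger AS $$
    BEGIN
        UPDATE orders_raw SET deleted_at = now() WHERE id = OLD.id;
        RETURN OLD;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_delete_is_soft
        INSTEAD OF DELETE ON orders
        FOR EACH ROW EXECUTE FUNCTION orders_soft_delete();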

  • Well, Microsoft SQL Server has built-in Temporal Tables [1], which even take this one step further: they track all data changes, such that you can easily query them as if you were viewing them in the past. You can not only query deleted rows, but also the old versions of rows that have been updated.

    (In my opinion, replicating this via a `validity tstzrange` column is also often a sane approach in PostgreSQL, although OP's blog post doesn't mention it.)

    [1]: https://learn.microsoft.com/en-us/sql/relational-databases/t...
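
    A minimal sketch of that `validity tstzrange` approach in PostgreSQL, with hypothetical table and column names; each row version records the interval during which it was current, and a "delete" just closes the open interval:

      CREATE EXTENSION IF NOT EXISTS btree_gist;  -- needed for the exclusion constraint below

      CREATE TABLE products_history (
          id       bigint    NOT NULL,
          price    numeric   NOT NULL,
          validity tstzrange NOT NULL DEFAULT tstzrange(now(), NULL),
          EXCLUDE USING gist (id WITH =, validity WITH &&)  -- no overlapping versions per id
      );

      -- Current rows: the version whose validity has no upper bound.
      SELECT * FROM products_history WHERE upper_inf(validity);

      -- "As of" queries: which version was current at a given moment?
      SELECT * FROM products_history
       WHERE validity @> '2024-01-01 00:00+00'::timestamptz;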

Agreed. And if deletes are soft, you likely really wanted a complete audit history of all updates too (at least that's been true in the cases I've been part of). And then performance _definitely_ suffers if you don't have a separate audit/archive table for all of that.

  • I mean, yes, growth forever doesn't tend to work.

    I've seen a number of apps that require audit histories handle it by archiving at a particular time; that's when the actual deletes happen and the indexes get fully rebuilt. This is typically scheduled during the least busy time of the year, as it's rather IO intensive.

    • The oldest I've worked with was a project started around 1991. I don't recall when they started keeping history or for how long (they may have trimmed it after some shorter legal retention period), but I worked on it ~15 years after that. And that's what, 15-20 years ago by now, and I doubt they've changed that part of the system. You've all likely bought products that were administered through this system.

      FWIW, no "indexes fully rebuilt" upon "actual deletion" or anything like that. The regular tables were always just "current" tables. History was kept in archive tables that were always up-to-date via triggers. Essentially, current tables never suffered any performance issues and history was available whenever needed. If history access was needed for extensive querying, read replicas were able to provide this without any cost to the main database but if something required "up to the second" consistency, the historic tables were available on the main database of course with good performance (as you can tell from the timelines, this was pre-SSDs, so multi-path I/O over fibre was what they had at the time I worked with it with automatic hot-spare failover between database hosts - no clouds of any kind in sight). Replication was done through replicating the actual SQL queries modifying the data on each replica (multiple read replicas across the world) vs. replicating the data itself. Much speedier, so that the application itself was able to use read replicas around the globe, without requiring multi-master for consistency. Weekends used to "diff" in order to ensure there were no inconsistencies for whatever reason (as applying the modifying SQL queries to each replica does of course have the potential to have the data go out of sync - theoretically).

      Gee, I'm old, lol!
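
      A minimal sketch of that trigger-maintained archive-table pattern in today's PostgreSQL, with hypothetical names; the "current" table stays lean while every change lands in the archive:

        CREATE TABLE accounts (
            id      bigint  PRIMARY KEY,
            balance numeric NOT NULL
        );

        CREATE TABLE accounts_archive (
            id          bigint,
            balance     numeric,
            changed_at  timestamptz NOT NULL DEFAULT now(),
            change_kind text        NOT NULL  -- 'UPDATE' or 'DELETE'
        );

        -- Copy the pre-change row into the archive on every update or delete.
        CREATE FUNCTION accounts_archive_change() RETURNS trigger AS $$
        BEGIN
            INSERT INTO accounts_archive (id, balance, change_kind)
            VALUES (OLD.id, OLD.balance, TG_OP);
            RETURN OLD;
        END;
        $$ LANGUAGE plpgsql;

        CREATE TRIGGER accounts_keep_history
            AFTER UPDATE OR DELETE ON accounts
            FOR EACH ROW EXECUTE FUNCTION accounts_archive_change();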

> I've seen tables where 50%-70% were soft-deleted, and it did affect the performance noticeably.

Depending on your use case, having soft deletes doesn't mean you can't clean out old deleted data anyway. You may want a process that grabs everything soft-deleted more than X years ago and just hard-deletes it.
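
For instance, a scheduled job along these lines (a sketch with hypothetical names and an arbitrary retention window; deleting in batches keeps locks and WAL churn manageable):

    -- Hard-delete rows that were soft-deleted more than two years ago, in batches.
    DELETE FROM orders
     WHERE id IN (
         SELECT id FROM orders
          WHERE deleted_at < now() - interval '2 years'
          ORDER BY deleted_at
          LIMIT 10000
     );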

> Depends on whether undoing even happens, and whether the act of deletion and undeletion require audit records anyway.

Yes, but this is no more complex than the current situation, where you always have to create the audit records anyway.

> I've seen tables where 50%-70% were soft-deleted, and it did affect the performance noticeably.

At that point you should probably investigate partitioning or data warehousing.

  • What would be the benefit of data warehousing in this case?

    • The reason to soft delete is to preserve the deleted data for later use. If enough of the system's usage needs to skip that data that 75% soft-deleted rows becomes a performance problem, then you either need to move the soft-deleted data out of the way within the table (partitioning) or into another table entirely.

      The correct thing to do if your retention policy is causing a performance problem is to sit down and actually decide what the data is truly needed for, and whether some transformations/projections can move only the data you actually use to a different location, so you can discard the rest. That's just data warehousing.

      A data warehouse doesn't only mean "cube tables". It can also just mean "a different location for data we rarely need, stored in a way that only has to suit those old-data needs". It doesn't need to be a different RDBMS or even a different database.
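
      A minimal sketch of the "out of the way within the table" option, using PostgreSQL list partitioning with hypothetical names (since v11, an UPDATE moves the row between partitions automatically):

        -- Live and dead rows land in separate partitions, so queries that
        -- filter on is_deleted = false never touch the soft-deleted data.
        CREATE TABLE events (
            id         bigint  NOT NULL,
            payload    jsonb,
            is_deleted boolean NOT NULL DEFAULT false
        ) PARTITION BY LIST (is_deleted);

        CREATE TABLE events_live     PARTITION OF events FOR VALUES IN (false);
        CREATE TABLE events_archived PARTITION OF events FOR VALUES IN (true);

        -- A soft delete is just an UPDATE; the row migrates to events_archived.
        UPDATE events SET is_deleted = true WHERE id = 42;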