Comment by buremba
2 days ago
All you need is Postgres until you scale into TBs of data. We use PostgreSQL for durable workflows, vector search, time-series data, BM25 search, OLTP/OLAP, and queues. It's basically the only dependency we have for https://lobu.ai
The main benefit is centralizing all the data in one place so we don't need to worry about copying data between multiple systems. Once something becomes the bottleneck, you can eventually migrate to a purpose-specific tool to scale out. To be honest, LISTEN/NOTIFY is in my opinion the most fragile part of PG, but it's fine as a start until you scale out.
But when you hit that wall, it is hard to stop and convince people to use different patterns and systems. I've seen so many tables go from "it will only be a few thousand rows" to suddenly several TB and then people are looking confused when performance and db admin tasks get really difficult.
I'm working at a scale where almost every day I have to ask people "are you use you need to treat that as relational data? It doesn't seem relational"
> But when you hit that wall, it is hard to stop and convince people to use different patterns and systems. I've seen so many tables go from "it will only be a few thousand rows" to suddenly several TB and then people are looking confused when performance and db admin tasks get really difficult.
It's much, much worse in my experience to have to develop for the opposite -- working on a system that was designed for an imagined "infinite" scale but in reality holds something like 100GB and handles a few transactions a minute.
> are you use you need to treat that as relational data?
Is this intended to be "you sure you need..."?
Obviously, yes
Use different “databases” besides public at the very start. No joins between them. You will be in a good position to just split the postgres instance by those at a later date. They will have different usage patterns than the merged version you have now, and will be easier to optimize and will buy you some time. And time is all you need.
"public" is not a database, it is a schema within a database.
apropos bad naming, postgresql authors are not forgiven for naming all the databases on a single host a "cluster". I mean __really__.
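For what it's worth, the "no joins between them" rule is easy to check mechanically if you go the schema-per-domain route. Below is a minimal sketch, not anyone's actual tooling: the schema names (`billing`, `search`, `workflow`) and both function names are made up for illustration, and the check is a naive regex rather than a real SQL parser, so a table alias that happens to share a schema's name would be miscounted.

```python
from __future__ import annotations

import re

# Hypothetical schema names, one per domain; none of these come from the thread.
KNOWN_SCHEMAS = {"billing", "search", "workflow"}

# Matches dotted identifiers such as billing.invoices (and also alias.column).
QUALIFIED_NAME = re.compile(
    r"\b([a-z_][a-z0-9_]*)\.([a-z_][a-z0-9_]*)\b", re.IGNORECASE
)


def schemas_used(sql: str) -> set[str]:
    """Return the known schemas that appear as the left side of a dotted name."""
    return {
        schema.lower()
        for schema, _name in QUALIFIED_NAME.findall(sql)
        if schema.lower() in KNOWN_SCHEMAS
    }


def crosses_schemas(sql: str) -> bool:
    """True if the query references more than one known schema."""
    return len(schemas_used(sql)) > 1
```

Something like this could run in CI over the queries in a codebase, so that a later split of the instance by schema does not find hidden cross-schema joins.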
Listen/notify is poised to become much better in PG 18 and 19
Why’s that?
In pg19 https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit... will land, which significantly improves NOTIFY performance. Right now LISTEN/NOTIFY doesn't scale to very busy instances because a `NOTIFY` within a transaction takes a global lock.
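A common way to run a queue on Postgres without leaning on NOTIFY is to have workers poll and claim rows with `FOR UPDATE SKIP LOCKED`. This is a minimal sketch under assumptions, not how any particular project in this thread does it: the `jobs` table, its columns, and the `claim_next_job` helper are hypothetical, and `cursor` is assumed to be any DB-API 2.0 cursor connected to Postgres (for example one from psycopg).

```python
# Hypothetical table, assumed for illustration only:
#   jobs(id bigint primary key, payload text, status text, created_at timestamptz)
CLAIM_JOB_SQL = """
UPDATE jobs
SET status = 'running'
WHERE id = (
    SELECT id
    FROM jobs
    WHERE status = 'queued'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload
"""


def claim_next_job(cursor):
    """Claim one queued job, or return None when nothing is claimable.

    SKIP LOCKED makes concurrent workers skip rows another transaction has
    already locked instead of waiting on them. The caller is responsible
    for committing the transaction.
    """
    cursor.execute(CLAIM_JOB_SQL)
    row = cursor.fetchone()
    if row is None:
        return None
    job_id, payload = row
    return {"id": job_id, "payload": payload}
```

Polling trades some latency for not depending on NOTIFY; workers would typically sleep briefly when `claim_next_job` returns None.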
Just an fyi, when I try to sign in with google for your app I get the message: "The app is requesting access to sensitive info in your Google Account. Until the developer (*reka*kc*@gmail.com) verifies this app with Google, you shouldn't use it."
Ahh, sorry about that. It should be fixed in an hour, looks like we mixed up the permissions. I just tried and confirmed other login methods work if you would like to try it out.
I'm in the same camp. Do you use any specific extensions? Especially for OLAP and time series (partitioned tables + related extensions work fine, but curious if you use anything else)
The native extensions are fine, but I haven't had good experiences with third-party extensions. So far I've tried Timescale, pg_lake, Citus, and pgvectorscale. They look very appealing, but it's usually a trap, as you can't get the value without using the vendor's cloud offerings.
I think if you grow enough to look for these extensions, it's usually better to bet on purpose-specific tooling. For example, I use the DuckDB/Iceberg combination extensively for columnar data and connect DuckDB to PG when I need it.
Fair enough. How do you do BM25?
From experience, I'd suggest using ClickHouse beyond a few billion rows of timeseries data in Postgres.
Nice thing about our use case is that it's not strictly analytics, but looking at the most recent raw data. ClickHouse is definitely the powerhouse for analytics.
Conversely, startups that start out scaling for TBs of data never make it to needing TBs of data. They burn too much energy scaling when they don't yet have a product people want.
Yep. I've also seen systems that were slow with <10GB of data because of badly applied patterns that were supposedly "scalable" (pulling entire tables out of the database to implement joins in application code because "nosql is faster" is not actually fast).
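To make the app-side join point concrete, here is a small sketch of the two shapes. SQLite from Python's standard library is used purely as a stand-in so the snippet runs anywhere; it is not Postgres, and the `users`/`orders` tables and both function names are invented for illustration. The same SQL join works in Postgres.

```python
import sqlite3

# In-memory stand-in database with made-up tables and rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total INTEGER);
INSERT INTO users VALUES (1, 'ada'), (2, 'bob');
INSERT INTO orders VALUES (10, 1, 30), (11, 1, 12), (12, 2, 5);
""")


def totals_in_app(conn):
    """Anti-pattern: fetch both tables in full, then join and sum in Python."""
    users = dict(conn.execute("SELECT id, name FROM users"))
    totals = {}
    for _order_id, user_id, total in conn.execute(
        "SELECT id, user_id, total FROM orders"
    ):
        name = users[user_id]
        totals[name] = totals.get(name, 0) + total
    return totals


def totals_in_db(conn):
    """Let the database join and aggregate; only result rows are returned."""
    rows = conn.execute(
        "SELECT u.name, SUM(o.total) FROM users u "
        "JOIN orders o ON o.user_id = u.id GROUP BY u.name"
    )
    return dict(rows)
```

Both functions return the same totals, but the first one transfers every row of both tables to the application, and that cost grows with table size rather than with the size of the answer.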
I don't see logs mentioned. I agree with most of those applications but would keep my OLAP stuff (metrics, logs, traces) in a separate store like VictoriaMetrics, both for capacity and read activity.
pg_timescale can take you pretty far for metrics and would be Good Enough for almost all users. Totally agree on raw, high-volume logs though.
Yeah I have logs in Sentry, which also uses Postgresql.
Sentry stores logs in ClickHouse - https://blog.sentry.io/how-sentry-queries-unstructured-data-...