They could've just sharded it; their users are not interconnected, so it would be easy to have, say, 128 shards and assign each user to one by an org/user hash.
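For illustration, a minimal sketch of that kind of hash routing (the shard count and names are hypothetical, and in practice you'd want a hash that is stable across releases):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const NUM_SHARDS: u64 = 128;

/// Map an (org, user) pair to one of 128 shards. DefaultHasher is only for
/// the sketch; a real deployment would pin a stable hash function.
fn shard_for(org_id: u64, user_id: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    (org_id, user_id).hash(&mut hasher);
    hasher.finish() % NUM_SHARDS
}

fn main() {
    // Every query for this user would be routed to the same shard's pool.
    println!("user 7001 of org 42 lives on shard {}", shard_for(42, 7001));
}
```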
Nice write up! It is cool to see that PostgreSQL is still standing. Adyen has some nice blog posts about squeezing the max out of PostgreSQL https://medium.com/adyen/all?topic=postgres
How do they store all the other stuff related to operating the service? This must be a combination of several components (including some mass-data storage, I'd guess)?
This would be cool to understand, as I've absolutely no idea how this is done (or could be done :-)
> Acknowledgements Special thanks to Jon Lee, Sicheng Liu, Chaomin Yu, and Chenglong Hao, who contributed to this post, and to the entire team that helped scale PostgreSQL. We’d also like to thank the Azure PostgreSQL team for their strong partnership.
I was confused when I saw elsewhere a "goose with a knife" meme asking someone if they had 800M users, and then if they had donated to PostgreSQL. Now I understand.
They literally answer that in the post. They started with a single instance and realized sharding the existing tables would be too much work (they'll slowly migrate to new tables instead).
The article has so much fluff and only some very coarse information ("we sharded writes, yay!"). Almost no detail, just keywords for SEO, or whatever they're aiming for.
There's also a lot of repetition. Maybe it was AI generated…?
I remember coming across an article from NYCMesh which looked interesting ("Administrating the Mesh" - https://www.nycmesh.net/blog/datadog/) which made sense all the way until they put Datadog on top of everything, and I asked myself:
> What the hell, how is using a centralized service for managing a decentralized mesh a suitable solution? Did the author get employed by Datadog or what happened?
Then I got curious and lo and behold; the author was indeed hired by Datadog (and still works there AFAIK), effectively compromising the entire article and the project itself, because of their new employer.
The 'single primary with read replicas' pattern scaling to 800M users is the real insight here. Most startups reach for sharding or distributed databases way too early, adding complexity for scale they don't have. If OpenAI can serve hundreds of millions from one Postgres primary by offloading reads and pushing new write-heavy features elsewhere, that's a strong argument for simplicity.
I like your point, but it also says that this isn't easy:
> It may sound surprising that a single-primary architecture can meet the demands of OpenAI’s scale; however, making this work in practice isn’t simple.
And it also says that this approach has cornered them into a solution that isn't trivial to change. They now use different database deployments (the single primary one that is the focus of the post and *multiple* other systems, such as Azure CosmosDB, to which some of the write traffic is being directed).
> To mitigate these limitations and reduce write pressure, we’ve migrated, and continue to migrate, shardable (i.e. workloads that can be horizontally partitioned), write-heavy workloads to sharded systems such as Azure Cosmos DB, optimising application logic to minimise unnecessary writes. We also no longer allow adding new tables to the current PostgreSQL deployment. New workloads default to the sharded systems.
I wonder how easy it is for developers to maintain and evolve this mix of miscellaneous database systems.
So yes, you can go far with a single primary, but you can also potentially never easily get away from it.
Counterpoint: if they had reached for sharding early, they would have avoided the technical debt of having to refactor their existing database. I don't think sharding is necessarily that complex either, especially for a SaaS-style app like ChatGPT where users are mostly siloed.
When people spend their entire careers in AWS land, it's easy to forget just how much power a single beefy bare metal server brings to bear. You can scale far and wide simply by getting a bigger server.
True. But “big beef” is complicated and difficult to make reliable. With horizontal scaling across unreliable servers it's dirt simple to stay up through almost anything except sudden load spikes, and then it's largely a matter of configuring your auto-scaling and retries.
That said big beef is so simple to start with. And this story is a strong example that YAGNI is a practical reality for almost everybody wrt “distributed everything”.
If you need so many tricks to support the infra, it will eventually come back to bite you. I am pretty sure that Google in the year 2000 could have supported their workloads with existing technologies (Yahoo could, and it was a much larger company). But they did GFS and Bigtable, and the rest is history. Other companies struggled to catch up due to inferior infrastructure. A visionary company needs to be prepared and should not be hindered by infrastructure. Can you scale the single-primary system another 10x or more? Because their CEO said that they will scale their revenue by that much within just a couple of years.
But they didn't really stay single primary. They moved a lot of load off to alternate database systems. So they did effectively shard, but to different databases rather than postgres.
Quite possibly they would have been better off staying purely Postgres but with sharding. But impossible to know.
> It’s also common to find long-running idle queries in PostgreSQL. Configuring timeouts like idle_in_transaction_session_timeout is essential to prevent them from blocking autovacuum.
Idle transactions have been a huge footgun at $DAYJOB… our code base is full of “connect, start a transaction, do work, if successful, commit.” It means you’re consuming a connection slot for all work, even while you’re not using the database, and not releasing it until you’re done. We had to bump the Postgres connection limits by an order of magnitude, multiple times, and before you know it Postgres takes up more RAM than anything else just to support the number of connections we need.
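For anyone who hasn't hit this, here is a minimal sqlx-flavoured sketch of the shape being described (the table, queries, and timings are made up):

```rust
use sqlx::PgPool;
use std::time::Duration;

// The footgun: a pooled connection with an open transaction is held across
// work that never touches the database, so the slot sits idle-in-transaction.
async fn handle_job(pool: &PgPool) -> Result<(), Box<dyn std::error::Error>> {
    let mut tx = pool.begin().await?; // connection checked out, transaction open

    sqlx::query("UPDATE jobs SET state = 'running' WHERE id = $1")
        .bind(1_i64)
        .execute(&mut *tx)
        .await?;

    // Slow, unrelated work while the connection slot is pinned.
    tokio::time::sleep(Duration::from_secs(5)).await; // stand-in for an external call

    sqlx::query("UPDATE jobs SET state = 'done' WHERE id = $1")
        .bind(1_i64)
        .execute(&mut *tx)
        .await?;

    tx.commit().await?; // the connection slot is only released here
    Ok(())
}
```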
The problem permeated enough of our (rust) codebase that I had to come up with a compile time check that makes sure you’re not awaiting any async functions while a Postgres connection is in your scope. Using the .await keyword on an async function call, but not passing the pg connection to that function, ends up being a nearly perfect proxy for “doing unrelated work while not releasing a connection”. It worked extremely well, the compiler now just straight up tells us where we’re doing it wrong (in 100+ places in fact.)
Actually getting away from that pattern has been the hard part, but we’re almost rid of every place we’re doing it, and I can now run with a 32-connection pool in load testing instead of a 10,000 connection pool and there’s no real slowdowns. (Not that we’d go that low in production but it’s nice to know we can!)
Just decreasing the timeout for idle transactions would have probably been the backup option, but some of the code that holds long transactions is very rarely hit, and it would have taken a lot of testing to eliminate all of it if we didn’t have the static check.
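(For reference, the timeout mentioned in the quote can be set per database or role; a tiny sketch, where the database name and the 30s value are made up:)

```rust
use sqlx::PgPool;

// Illustrative only: have Postgres kill sessions that sit idle inside an
// open transaction for too long.
async fn cap_idle_transactions(pool: &PgPool) -> Result<(), sqlx::Error> {
    sqlx::query("ALTER DATABASE app SET idle_in_transaction_session_timeout = '30s'")
        .execute(pool)
        .await?;
    Ok(())
}
```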
Why don’t you change the order to “do work, if successful, grab a connection from the Postgres connection pool, start a transaction, commit, release the connection to the connection pool”?
That’s what we should do, yes. The problem is that we were just sorta careless with interleaving database calls in with the “work” we were doing. So that function that calls that slow external service, also takes a &PgConnection as an argument, because it wants to bump a timestamp in a table somewhere after the call is complete. Which means you need to already have a connection open to even call that function, etc etc.
If the codebase is large, and full of that kind of pattern (interleaving db writes with other work), the compiler plugin is nice for (a) giving you a TODO list of all the places you’re doing it wrong, and (b) preventing any new code from doing this while you’re fixing all the existing cases.
One idea was to bulk-replace everything so that we pass a reference to the pool itself around, instead of a checked-out connection/transaction, and then we would only use a connection for each query on-demand, but that’s dangerous… some of these functions are doing writes, and you may be relying on transaction rollback behavior if something fails. So if you were doing 3 pieces of “work” with a single db transaction before, and the third one failed, the transaction was getting rolled back for all 3. But if you split that into 3 different short-lived connections, now only the last of the 3 db operations is rolled back. So you can’t just find/replace, you need to go through and consider how to re-order the code so that the database calls happen “logically last”, but are still grouped together into a single transaction as before, to avoid subtle consistency bugs.
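Roughly, the target shape looks like this (hypothetical names): the slow work happens up front with no connection held, and the writes that must stay atomic are grouped into one short transaction at the end.

```rust
use sqlx::PgPool;

async fn process_job(pool: &PgPool) -> Result<(), Box<dyn std::error::Error>> {
    // 1. Do the "work" first, holding no connection at all.
    let report = call_slow_external_service().await?;

    // 2. Only then check out a connection and do the writes together, so a
    //    failure in any of them still rolls back the whole group.
    let mut tx = pool.begin().await?;
    sqlx::query("INSERT INTO reports (body) VALUES ($1)")
        .bind(&report)
        .execute(&mut *tx)
        .await?;
    sqlx::query("UPDATE jobs SET finished_at = now() WHERE id = $1")
        .bind(1_i64)
        .execute(&mut *tx)
        .await?;
    tx.commit().await?; // connection goes back to the pool immediately
    Ok(())
}

async fn call_slow_external_service() -> Result<String, Box<dyn std::error::Error>> {
    Ok("report".to_string()) // stand-in for an HTTP call
}
```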
"Work" might require a transaction which reads with lock, computes, (launch missiles), and then updates.
This is really tough in a large organization with features that cross across product domains.
We have a similar check in our Haskell codebase, after running into two issues:
1. Nested database transactions could exhaust the transaction pool and deadlock
2. Same as you described with doing eg HTTP during transactions
We now have a compile time guarantee that no IO can be done outside of whitelisted things, like logging or getting the current time. It’s worked great! Definitely a good amount of work though.
I figured it’d be Haskell that is able to do this sort of thing really well. :-D
I had this realization while writing the rustc plugin that this is basically another shade of “function coloring”, but done intentionally. Now I wish I could have a language that lets me intentionally “color” my functions such that certain functions can only be called from certain blessed contexts… not unlike how async functions can only be awaited by other async functions, but for arbitrary domain-specific abstractions, in particular database connections in this case. I want to make it so HTTP calls are “purple”, and any function that gets a database connection is “pink”, and make it so purple can call pink but not vice-versa.
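You can approximate a bit of that in plain Rust today with capability tokens, though it's far clumsier than real language support. Entirely illustrative, and not what the lint above does:

```rust
// "Purple" code holds an HttpCap; "pink" code only ever receives a DbCap.
// Purple can mint a DbCap to hand down, but there is no way back up.
pub struct DbCap(());
pub struct HttpCap(());

impl HttpCap {
    pub fn new() -> Self { HttpCap(()) }
    pub fn db(&self) -> DbCap { DbCap(()) } // purple may call pink...
}
// ...but there is deliberately no `DbCap::http()`, so pink can't call purple.

fn bump_timestamp(_db: &DbCap) {
    // imagine a query here
}

fn call_external_service(http: &HttpCap) {
    // imagine an HTTP request here, then a small DB write afterwards:
    bump_timestamp(&http.db());
}

fn main() {
    call_external_service(&HttpCap::new());
}
```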
The rule I ended up with in the lint, is basically “if you have a connection in scope, you can only .await a function if you’re passing said connection to that function” (either by reference or by moving it.) It works with rust’s knowledge of lifetimes and drop semantics, so that if you call txn.commit() (which moves the connection out of scope, marking the storage as dead) you’re now free to do unrelated async calls after that line of code. It’s not perfect though… if you wrap the connection in a struct and hold that in your scope, the lint can’t see that you’re holding a connection. Luckily we’re not really doing that anywhere: connections are always passed around explicitly. But even if we did, you can also configure the lint with a list of “connection types” that will trigger the lint.
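To make that concrete, this is roughly the shape of code the rule flags versus allows (hypothetical names, sqlx 0.7-style executors):

```rust
use sqlx::PgPool;

async fn example(pool: &PgPool) -> Result<(), Box<dyn std::error::Error>> {
    let mut tx = pool.begin().await?;

    // OK: the connection/transaction is passed into the awaited call.
    bump_timestamp(&mut tx).await?;

    // FLAGGED: awaiting something unrelated while `tx` is still live in scope.
    // fetch_from_external_service().await?;

    tx.commit().await?; // `tx` moved out; its storage is dead from here on

    // OK again: no live connection in scope after the commit.
    fetch_from_external_service().await?;
    Ok(())
}

async fn bump_timestamp(
    tx: &mut sqlx::Transaction<'_, sqlx::Postgres>,
) -> Result<(), sqlx::Error> {
    sqlx::query("UPDATE things SET touched_at = now()")
        .execute(&mut **tx)
        .await?;
    Ok(())
}

async fn fetch_from_external_service() -> Result<(), Box<dyn std::error::Error>> {
    Ok(()) // stand-in for an HTTP call
}
```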
How did you implement this runtime check? Is it a lint rule, or using the type system?
It’s a compile-time check, and yeah it’s a lint rule. In fact it goes a little deeper than a lint can go, because it uses data from earlier compiler phases (in order to get access to what the borrow checker knows.) The correct terminology is a “rustc driver” from what I’ve heard. Lints like clippy run as a “LateLintPass”, which doesn’t have access to certain mir data that is intentionally deleted in earlier phases to lower the memory requirements.
Hopefully it’s something I can open source soon (I may upstream it to the sqlx project, as that is what we’re using for db connections.)
Regarding schema changes and timeouts - while having timeouts in place is good advice, you can go further. While running the schema rollout, run a script alongside it that kills any workload conflicting with the aggressive locks the schema change is trying to take. This will greatly reduce the pain caused by lock contention, and prevent you from needing to repeatedly rerun statements on high-throughput tables.
This would be a particularly nice-to-have feature for Postgres - the option to have heavyweight locks just proactively cancel any conflicting workload. For any case where you have a high-throughput table, the damage of the heavyweight lock sitting there waiting (and blocking all new traffic) is generally much larger than just cancelling some running transactions.
Doesn't Postgres support transactional schema changes already? Why would you want to proactively kill work that's just going to complete after the schema change is done? Load balancing, throttling etc. is a different matter that has little to do with what you're proposing.
It supports transactional schema changes, but that's not what I'm talking about. Most schema changes require heavyweight locks on the tables they're altering. The locks might be short lived (for example, just a catalogue update to add a column), but they are nevertheless heavyweight and both block and are blocked by other work.
SELECT or DML operations take a weak lock on the table (ACCESS SHARE or ROW EXCLUSIVE) that doesn't conflict with most other work, but it does conflict with these schema changes. While the schema change is waiting to acquire its table lock, all new operations (like new SELECTs, for example) get blocked behind it until the schema change completes.
So the following scenario can be pretty disastrous:
* Start a long-running SELECT operation on table
* Attempt to apply schema change to the table
* All new work on the table is blocked until the SELECT completes and the schema change can apply.
* Production outage
What the ChatGPT folks do is set a lock timeout when applying the schema change to make it 'give up' after a few seconds. This works to avoid truly excessive impact (in their case, they may have micro-outages of up to 5s while trying to apply schema), but has problems - firstly, they then need to retry, which may lead to more micro-outages, and secondly there's no guarantee on a system with mixed workload that they will be able to force the change through, and the schema change just ends up getting starved out.
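A sketch of that approach (the table and column are hypothetical; 55P03 is Postgres's lock_not_available error code):

```rust
use sqlx::PgPool;
use std::time::Duration;

async fn add_column_with_retries(pool: &PgPool) -> Result<(), Box<dyn std::error::Error>> {
    for attempt in 1..=10 {
        let mut conn = pool.acquire().await?;

        // Bound how long the ALTER may sit in the lock queue blocking traffic.
        // (Session-scoped; a dedicated migration connection keeps it contained.)
        sqlx::query("SET lock_timeout = '2s'").execute(&mut *conn).await?;

        match sqlx::query("ALTER TABLE conversations ADD COLUMN archived_at timestamptz")
            .execute(&mut *conn)
            .await
        {
            Ok(_) => return Ok(()),
            // Timed out waiting for the table lock: back off and try again.
            Err(e) if e.as_database_error().and_then(|d| d.code()).as_deref() == Some("55P03") => {
                eprintln!("attempt {attempt}: lock timeout, retrying");
                tokio::time::sleep(Duration::from_secs(5)).await;
            }
            Err(e) => return Err(e.into()),
        }
    }
    Err("schema change kept getting starved out".into())
}
```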
A better alternative for most workloads is to build a system that detects what workload is blocking your schema change and kills it, allowing the schema change to go through quickly and unblock all the work behind it. You'd still use a lock timeout with this to be on the safe side, but it shouldn't be necessary in most cases.
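A minimal sketch of such a watcher, assuming the migration session identifies itself with application_name = 'schema-migration' (that name and the polling interval are made up):

```rust
use sqlx::{PgPool, Row};
use std::time::Duration;

// Runs alongside the migration and terminates whatever is blocking its locks.
async fn terminate_blockers(pool: &PgPool) -> Result<(), sqlx::Error> {
    loop {
        // Every backend currently blocking the migration's lock requests.
        let blockers = sqlx::query(
            "SELECT unnest(pg_blocking_pids(pid)) AS blocker
             FROM pg_stat_activity
             WHERE application_name = 'schema-migration'",
        )
        .fetch_all(pool)
        .await?;

        // Terminate them so the DDL can take its lock quickly.
        for row in blockers {
            let pid: i32 = row.get("blocker");
            sqlx::query("SELECT pg_terminate_backend($1)")
                .bind(pid)
                .execute(pool)
                .await?;
        }

        tokio::time::sleep(Duration::from_millis(500)).await;
    }
}
```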
Side note on transactional DDL - for Postgres systems with high throughput, most people just use autocommit. Table level locks that get taken to perform the schema change get held for the duration of the transaction, and you generally want to really minimize the amount of time you hold them for.
First OpenAI Engineering blog? I'm definitely interested in seeing more and how they handled the rapid growth.
There was a lot of downtime...
I think they handled the massive growth by a lot of 2am emergencies and editing config files directly in production in the hope of fixing fires.
Cool! I'd love to know a bit more about the replication setup. I'm guessing they are doing async replication.
> We added nearly 50 read replicas, while keeping replication lag near zero
I wonder what those replication lag numbers are exactly and how they deal with stragglers. It seems likely that at any given moment at least one of the 50 read replicas may be lagging because of a CPU/mem usage spike. Then presumably that would slow down the primary since it has to wait for the TCP acks before sending more of the WAL.
> would slow down the primary since it has to wait for the TCP acks
Other than keeping around more WAL segments, I'm not sure why it would slow down the primary?
If you use streaming replication (ie. WAL shipping over the replication connection), a single replica getting really far behind can eventually cause the primary to block writes. Some time back I commented on the behaviour: https://news.ycombinator.com/item?id=45758543
You could use asynchronous WAL shipping, where the WAL files are uploaded to an object store (S3 / Azure Blob) and the streaming connections are only used to signal the position of the WAL head to the replicas. The replicas will then fetch the WAL files from the object store and replay them independently. This is what wal-g does, for a real-life example.
The tradeoffs when using that mechanism are pretty funky, though. For one, the strategy imposes a hard lower bound on replication delay because even the happy path is now "primary writes WAL file; primary updates WAL head position; primary uploads WAL file to object store; replica downloads WAL file from object store; replica replays WAL file". In case of unhappy write bursts the delay can go up significantly. You are also subject to any object store and/or API rate limits. The setup makes replication delays slightly more complex to monitor for, but for a competent engineering team that shouldn't be an issue.
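For the plain streaming case, the straggler check from the primary is a one-liner against pg_stat_replication; a sketch:

```rust
use sqlx::{PgPool, Row};

// How many bytes behind is each replica's replay position? Run on the primary.
async fn report_replica_lag(primary: &PgPool) -> Result<(), sqlx::Error> {
    let rows = sqlx::query(
        "SELECT application_name,
                pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)::bigint AS lag_bytes
         FROM pg_stat_replication",
    )
    .fetch_all(primary)
    .await?;

    for row in rows {
        let name: Option<String> = row.get("application_name");
        let lag: Option<i64> = row.get("lag_bytes");
        println!("{}: {:?} bytes behind", name.unwrap_or_default(), lag);
    }
    Ok(())
}
```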
But it is rather hilarious (in retrospect only) when an object store performance degradation takes all your replicas effectively offline and the readers fail over to getting their up-to-date data from the single primary.
"... If a new feature requires additional tables, they must be in alternative sharded systems such as Azure CosmosDB rather than PostgreSQL...."
So it is not really scaling much further now, but rather maintaining the current state of things, with new features going to a different DB?
Azure CosmosDB is insanely expensive. I can't imagine anybody using it unless you have OpenAI money.
We don't know the profit margins on it... Might not be very expensive if you're an internal user, as OpenAI effectively is for Microsoft.
*Microsoft's money
I don't get it. This whole thing says a single writer does not scale, so we stopped writing as much and moved reads away from it, so it works OK and we decided that's enough. I guess that's great.
This article has very little useful information...
There's nothing novel about optimizing queries, sharding and using read replicas.
It has one piece of useful info: their main data store even for 800M users is a single instance of postgres (for writes) without sharding.
> scaled up by increasing the instance size
I always wondered what kind of instance companies at that level of scalability are using. Anyone here have some ideas? How much cpu/ram? Do they use the same instance types available to everyone, or does AWS and co offer custom hardware for these big customers?
The major hyperscalers all offer a plethora of virtual machines SKUs that are essentially one entire two-socket box with many-core CPUs.
For example, Azure Standard_E192ibds_v6 is 96 cores with 1.8 TB of memory and 10 TB of local SSD storage with 3 million IOPS.
Past those "general purpose" VMs you get the enormous machines with 8, 16, or even 32 sockets.[1] These are almost exclusively used for SAP HANA in-memory databases or similar ERP workloads.
Azure Standard_M896ixds_24_v3 provides 896 cores, 32 TB of memory, and 185 Gbps Ethernet networking. This is generally available, but you have to allocate the quota through a support ticket and you may have to wait and/or get your finances "approved" by Microsoft. Something like this will set you back [edited] $175K per month[/edited]. (I suspect OpenAI is getting a huge effective discount.)
Personally, I'm a fan of "off label" use of the High Performance Compute (HPC) sizes[2] for database servers.
The Standard_HX176rs HPC VM size gives you 176 cores and 1.4 TB of memory. That's similar to the E-series VM above, but with a higher compute-to-memory ratio. The memory throughput is also way better because it has some HBM chips for L3 (or L4?) cache. In my benchmarks it absolutely smoked the general-purpose VMs at a similar price point.
[1] https://learn.microsoft.com/en-us/azure/virtual-machines/siz...
[2] https://learn.microsoft.com/en-us/azure/virtual-machines/siz...
> Something like this will set you back $30K-$60K per year
lol, no, cloud is nowhere near that good value. It’s $3.5M annually.
> The Standard_HX176rs HPC VM size gives you 176 cores and 1.4 TB of memory
This one is $124k per year.
On the AWS side there are "HANA certified" instances that max out at 1920 cores and 32 TB RAM - u7inh-32tb.480xlarge
https://docs.aws.amazon.com/sap/latest/general/sap-hana-aws-...
Interestingly enough,
> For example, Azure Standard_E192ibds_v6 is 96 cores with 1.8 TB of memory and 10 TB of local SSD storage with 3 million IOPS.
That's a well-stocked Dell server going for ~$50-60K capex without storage, before the RAM prices exploded. I'm wondering a bit about the CPU in there, but the storage + RAM is fairly normal and nothing crazy. I'm pretty sure you could have that in a rack for $100K in hardware pricing.
Are there any pictures around of these 8, 16, 32 socket boards? Just curious what they look like.
This is why I love Postgres. It can get you to being one of the largest websites before you need to reconsider your architecture just by throwing CPU and disk at it. At that point you can well afford to hire people who are deep experts at sharding etc.
PostgreSQL actually supports sharding out of the box, it's just a matter of setting up the right table partitioning and using Foreign Data Wrapper (FDW) to forward queries to remote databases. I'm not sure what the post is referencing when they say that sharding requires leaving Postgres altogether.
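For what it's worth, a rough sketch of that "out of the box" setup, with hash partitioning where one partition lives on a remote node via postgres_fdw (hostnames, credentials, and the table shape are all made up). Whether it holds up operationally at this scale is a different question.

```rust
use sqlx::PgPool;

// Purely illustrative DDL, executed statement by statement.
async fn create_sharded_table(pool: &PgPool) -> Result<(), sqlx::Error> {
    for stmt in [
        "CREATE EXTENSION IF NOT EXISTS postgres_fdw",
        "CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
           OPTIONS (host 'shard1.internal', dbname 'app')",
        "CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
           OPTIONS (user 'app', password 'secret')",
        "CREATE TABLE messages (user_id bigint NOT NULL, body text)
           PARTITION BY HASH (user_id)",
        // Local partition:
        "CREATE TABLE messages_p0 PARTITION OF messages
           FOR VALUES WITH (MODULUS 2, REMAINDER 0)",
        // Remote partition, served by the other node:
        "CREATE FOREIGN TABLE messages_p1 PARTITION OF messages
           FOR VALUES WITH (MODULUS 2, REMAINDER 1)
           SERVER shard1 OPTIONS (table_name 'messages_p1')",
    ] {
        sqlx::query(stmt).execute(pool).await?;
    }
    Ok(())
}
```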
This is specifically what they said about sharding
> The primary rationale is that sharding existing application workloads would be highly complex and time-consuming, requiring changes to hundreds of application endpoints and potentially taking months or even years
Shameless plug: https://github.com/mkleczek/pgwrh automates it quite a bit.
> At that point you can well afford to hire people who are deep experts at sharding etc.
Can you, though? OpenAI is haemorrhaging money like it is going out of style and, according to the news cycle over the last couple of days, will likely be bankrupt by 2027.
And typically the bigger the company gets, the harder it is to migrate to a new data model.
You suddenly have literally thousands of internal users of a datastore, and "we want to shard by userId, so please don't do joins on userId anymore" becomes an impossible ask.
The article is basically "we use PostgreSQL, it works, but we had to do some optimization to make it scale".
I don't really get the point here. What is novel and great? It feels like they followed the first "how to scale pg" article.
Someone ask Microsoft what it feels like to be bested by an open source project on their very own cloud platform!!! Lol.
They don't care. Azure has higher revenue than GCP, trailing only AWS. It's Microsoft's new baby, and they love it, no matter what you want to run there. Also, they're still the 4th largest company by market cap.
Honestly, only us nerds in Hacker News care about this kind of stuff :) (and that's why I love it here).
edit: also, the article cites OpenAI did adopt Azure Cosmos DB for new stuff they want to shard. Still shows how far you can take PostgreSQL though.
That ship sailed a long time ago, as Microsoft has offered Linux VMs in Azure for 14 years, and today, about 2/3 of VMs running there are Linux. In the public cloud era, owning the infrastructure and customer base is far more important than licenses.
And the same for Linux boxes on Azure - they dominate Windows servers by a huge margin.
Are you saying this because OpenAI didn't choose SQL Server?
In 2026 is SQL Server ever the answer?
Azure offers Postgres “DBaaS”, so I’m pretty sure they are nowhere near that stage. It’s more likely that we should watch out for the Microsoft E-E-E strategy.
Amen to that.. those triple-E bastards are likely to use that playbook again. Best advice is to seek fertile grounds where freedom grows. I can't wait for Europe's cloud offering, I believe they're gonna serve as the middle ground between greedy tech-bros and China's fake free as in free beer products. Pack up your bags IT HOBBITS, we're moving to middle earth.
Running this on Azure PostgreSQL, even migrating to CosmosDB, cannot be cheap. I know that OpenAI has a deal/relationship with Microsoft, but still, this has to be expensive.
This is, however, the most down-to-earth "how we scale PostgreSQL" I've read in a long time. No weird hacks, no messing around with the source code or tweaking the Linux kernel. Running on Azure PostgreSQL it's not like OpenAI has those options anyway, but it still seems a lot more relatable than "we wrote our own driver/filesystem/database-hack in JavaScript".
I was surprised to see this:
> If joins are necessary, we learned to consider breaking down the query and move complex join logic to the application layer instead.
We often try to leverage the power of the DB to optimize joins on our behalf to avoid having to create them. At a certain point, I guess you wind up having to pull this back to your layer to optimize "the one job" of the database.
I jest, but only slightly. We don't just want to persist data, but link it for different purposes, the "relational" part of RDBMS. Good to know there's still room to grow here, for PostgreSQL and the DB industry.
From what I understand they basically couldn't scale writes in PostgreSQL to their needs and had to offload what they could to Azure's NoSQL database.
I wonder, is there another popular OLTP database solution that does this better?
> For write traffic, we’ve migrated shardable, write-heavy workloads to sharded systems such as Azure CosmosDB.
> Although PostgreSQL scales well for our read-heavy workloads, we still encounter challenges during periods of high write traffic. This is largely due to PostgreSQL’s multiversion concurrency control (MVCC) implementation, which makes it less efficient for write-heavy workloads. For example, when a query updates a tuple or even a single field, the entire row is copied to create a new version. Under heavy write loads, this results in significant write amplification. It also increases read amplification, since queries must scan through multiple tuple versions (dead tuples) to retrieve the latest one. MVCC introduces additional challenges such as table and index bloat, increased index maintenance overhead, and complex autovacuum tuning.
TiDB should handle it nicely. I've written 200k inserts/sec for an hour at peak. The underlying LSM works better for writes.
That would mean it improved somewhat. We always got better write performance from MySQL vs Postgres, however that was a while ago; we then tried TiDB to go further but it was basically rather slow. Again, a while ago.
When did you get your results? Might be time to re-evaluate.
I was thinking about the same paragraph, because write amplification is exactly the problem solved by LSM trees _and_ they already have a solution for that in-house - one of the first acquisitions that OpenAI made was Rockset, a company that actually built RocksDB at scale.
So this is the part that actually left me wondering why.
Postgres can really scale well vertically (and horizontally for read-only workloads) as the post shows.
However, I'm still surprised about the reasons for not sharding. They have been mentioned before, but I haven't seen a substantial rationale.
Sharding is almost only analyzed from the perspective of write scaling. But sharding may not only be about write scaling, but a path to reducing blast radius. And this is something that triggers much earlier than write scaling needs (especially given how well Postgres scales vertically and reads).
When you shard your database, you end up having N clusters (for HA purposes, each "shard" must be a primary-replica(s) cluster itself), each holding 1/Nth of the data.
There are certain scenarios which, while unlikely, may hit you: data corruption in the WAL replication stream, a problem during a major upgrade, a situation that requires a whole cluster restore from a backup, you name it. For those cases, the whole cluster may experience notable downtime.
If you have a single cluster, 100% of your users experience downtime. If you sharded into N clusters, only 1/Nth of your users experience downtime. For a company servicing 800M users, the difference between those two scenarios is dramatic. The same goes even for much, much smaller companies.
I'm puzzled why this is not perceived as a risk, and if it is not, how it is mitigated.
While I wouldn't advocate sharding "too early", given that it comes with notable caveats, I believe more and more in sharding your workloads, when possible, earlier rather than later. Way before truly needing it from a write-scaling perspective. Because apart from reducing the blast radius, it implicitly applies the principle of "divide and conquer", and your problems become much more manageable (your number of connections per cluster decreases at will, backup restore times can be a fraction of the time, logical replication can be considered a true option for replication/upgrades/etc if you keep shards relatively small, and many other operational procedures are greatly simplified if you now have much smaller databases, even if you have many more of them).
Microsoft originally bought CitusData and rebadged it as Azure Cosmos DB for PostgreSQL Cluster. Microsoft has been recommending that partners now avoid that product. It does not and will not support Entra federated workload identities (passwordless).
The replacement will be Azure Database for Postgres with Elastic Clusters. I think it is still in preview.
Again it’s Citus based, but without the CosmosDb badge and it will support federated workload identities.
https://techcommunity.microsoft.com/blog/adforpostgresql/pos...
https://learn.microsoft.com/en-us/azure/postgresql/elastic-c...
Did I miss it, or did they not say why they picked CosmosDB? Postgres also has sharding, so instead of moving to a different DB they could have added a new Postgres instance with sharding for the new requests.
At a guess, CosmosDB NoSQL was a good choice for dumping sharded user contexts rather than using Postgres schemas or JSONB. Citus is the obvious choice for this with Postgres, but Azure had poor support until recently: in 2024 they previewed Azure Database for PostgreSQL with Elastic Clusters, which is basically Azure Database for PostgreSQL Flexible Server with the Citus extension installed. In the end they aren't paying Microsoft what usual customers are, so even though CosmosDB NoSQL is expensive (RU-based) and the SDK is horrible, it probably served as a good stopgap until Elastic Clusters is fully out of preview. That's my guess anyway.
Might've had something to do with explaining behavior to their app teams. "Use this new DB product where it's sharded" might be easier than "here's a new postgres endpoint like the old one but now if you join on user ID it's inconsistent".
(Not that that's an excuse, but I've seen similar things before.)
I honestly don't understand such a negative tone in the comments. Yes, it does promote Azure, but that's to be expected from a company that is part-owned by Microsoft :).
The main point of the article is that it's actually not that hard to live with a single primary Postgres for your transactional workloads (emphasis on _transactional_), and if OpenAI with their 800M+ users can still survive on a single primary (with 50(!) read replicas), so could you, especially before you've reached your first 100M users.
Any non-distributed database or setup is orders of magnitude easier to design for, and it's typically much more cost-efficient too, in terms of both hardware and software.
There are some curious details, e.g.:
- you can ship WAL to 50 read replicas simultaneously from a single primary and be fine
- you can even be using an ORM and still get decent performance
- schema changes are possible, and you can just cancel a slow ALTER to prevent production impact (see the sketch below)
- pgbouncer is ok even for OpenAI scale
So much of this contradicts current "conventional wisdom" that is based on what was possible with the hardware of 10+ (or even 20+) years ago. Times have finally changed, and I really welcome articles like these that show how you can greatly simplify your production setup by leveraging modern hardware.
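On the "cancel a slow ALTER" point: a minimal sketch of one way to do that proactively, assuming Python + psycopg2 (the table and column names are hypothetical, and this is not necessarily how OpenAI does it). Setting lock_timeout makes the DDL give up quickly instead of queuing behind long transactions and blocking everything behind it:

```python
import psycopg2

def add_column_safely(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # DDL runs in its own implicit transaction
    try:
        with conn.cursor() as cur:
            # Abort the ALTER if its ACCESS EXCLUSIVE lock isn't acquired
            # within 2 seconds, rather than letting it block other traffic.
            cur.execute("SET lock_timeout = '2s'")
            cur.execute("SET statement_timeout = '15s'")
            cur.execute(
                "ALTER TABLE conversations ADD COLUMN archived_at timestamptz"
            )
    except (psycopg2.errors.LockNotAvailable, psycopg2.errors.QueryCanceled):
        # Lock or statement timed out: retry later (ideally with backoff)
        # instead of impacting production.
        pass
    finally:
        conn.close()
```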
"However, some read queries must remain on the primary because they’re part of write transactions. "
If there is a read replica that has reached the required snapshot (usually, depending on your task, the snapshot that existed at the start of your transaction), and the read query doesn't need to see your transaction's uncommitted data, then that replica can serve the read query.
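A minimal sketch of that kind of routing check, assuming Python + psycopg2 (the DSNs and helper name are made up; real setups usually bake this into a router or proxy, and connection pooling is omitted): capture the primary's WAL position around the time the snapshot is taken, then only send the read to a replica that has replayed past it.

```python
import psycopg2

def replica_caught_up(primary_dsn: str, replica_dsn: str) -> bool:
    # WAL position on the primary at (roughly) the moment our snapshot was taken.
    with psycopg2.connect(primary_dsn) as primary, primary.cursor() as cur:
        cur.execute("SELECT pg_current_wal_lsn()")
        target_lsn = cur.fetchone()[0]

    # Has the replica replayed WAL at least up to that point?
    with psycopg2.connect(replica_dsn) as replica, replica.cursor() as cur:
        cur.execute(
            "SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), %s) >= 0",
            (target_lsn,),
        )
        return cur.fetchone()[0]
```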
I like this way of thinking. Instead of migrating to another database, they kept that awesome one running and found smart workarounds to push its limits.
It is what mature engineering does. Migrations are not fun.
Out of pure boredom, and tired of all these chat websites selling my data and of ChatGPT's new update on ads, I decided enough was enough and created my own chat application for privacy. Like every other architect, I searched for a good database and eventually gave up on specialized ones for chat because they were either too expensive to host or too complex to deal with. So I simply used PostgreSQL. My chat app has basic RAG, nothing groundbreaking, but the most important feature I built was the ability to add different chat models into one group chat. So when you ask for opinions on something, you are not relying on just a single model and you get a multi-model view of all the possible answers. Each model can have its own unique prompt within the group chat. So basically, a join table.
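For what it's worth, the join-table shape described here could look something like this (guessed names, not the actual schema): one row per (group chat, model) pair, carrying that model's prompt for that group.

```python
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS group_chats (
    id    bigserial PRIMARY KEY,
    title text NOT NULL
);

CREATE TABLE IF NOT EXISTS models (
    id   bigserial PRIMARY KEY,
    name text NOT NULL UNIQUE
);

-- The join table: which models participate in which group chat,
-- each with its own prompt.
CREATE TABLE IF NOT EXISTS group_chat_models (
    group_chat_id bigint NOT NULL REFERENCES group_chats(id),
    model_id      bigint NOT NULL REFERENCES models(id),
    system_prompt text   NOT NULL DEFAULT '',
    PRIMARY KEY (group_chat_id, model_id)
);
"""

def create_schema(dsn: str) -> None:
    # Hypothetical setup helper; psycopg2 accepts multiple statements per execute.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(DDL)
```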
Months have passed since this application was developed (a simple Phoenix/Elixir backend), and yesterday I casually checked my database to see how many rows it had: roughly 500,000+. I haven't noticed a single hint of strain from the volume Postgres is handling. Granted, I'm the only user, but there's always a lot going on, mostly RAG, which requires searching the database for context before multiple agents send you a response (and respond amongst themselves). Absolutely zero performance degradation.
I'm convinced that Postgres is a killer database that doesn't get the attention it deserves over the others (for chat). I'm already managing some high-traffic websites (500M+ requests) with it with no issues, so I am extremely unsurprised that it works really well for chat apps at scale too.
"This effort demonstrates that with the right design and optimizations, Azure PostgreSQL can be scaled to handle the largest production workloads."
Sure, but choosing a DB that scales with ease from the start would have taken far less time and effort.
You can bend any software into doing anything, but is it worth it?
They could've just sharded it; their users are not interconnected, so it would be easy to have, say, 128 shards and assign each user to one by org/user hash.
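A minimal sketch of that routing, in Python (the 128 comes from the comment above; the hash choice and the DSN naming scheme are illustrative assumptions):

```python
import hashlib

NUM_SHARDS = 128

def shard_for(org_or_user_id: str) -> int:
    # Stable hash so a given org/user always maps to the same shard.
    digest = hashlib.sha256(org_or_user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def dsn_for(org_or_user_id: str) -> str:
    # Hypothetical naming: one primary/replica cluster per shard.
    return f"postgresql://app@pg-shard-{shard_for(org_or_user_id):03d}.db.internal/chat"
```

The hard part, as other comments note, is migrating the existing unsharded tables, which is why the article says new workloads default to sharded systems instead.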
Nice write up! It is cool to see that PostgreSQL is still standing. Adyen has some nice blog posts about squeezing the max out of PostgreSQL https://medium.com/adyen/all?topic=postgres
I would be super curious about:
How do they store all the other stuff related to operating the service? This must be a combination of several components (including some mass data storage, I'd guess)?
This would be cool to understand, as I've absolutely no idea how this is done (and could be done :-)
Why does the [Azure PostgreSQL flexible server instance] link point to Chinese Azure?
All names are Asian and mostly Chinese
> Author Bohan Zhang
> Acknowledgements Special thanks to Jon Lee, Sicheng Liu, Chaomin Yu, and Chenglong Hao, who contributed to this post, and to the entire team that helped scale PostgreSQL. We’d also like to thank the Azure PostgreSQL team for their strong partnership.
Bohan Zhang, the article's author, is likely Chinese.
e: and the link points to en-us at time of writing. I frankly don't see the value in your comment.
Why does everyone write a "how we scaled PostgreSQL" article?
Anyone have any idea of what their caching strategy may be? I think that may be the most interesting and impactful thing here.
ChatGPT is definitely the snappiest web UI of any LLM.
I was confused when I saw elsewhere a "goose with a knife" meme asking someone if they had 800M users, and then if they had donated to PostgreSQL. Now I understand.
Why a single postgres? Why not shard by users?
They literally answer that in the post. They started with a single instance and realized that sharding the existing tables would be too much work (they'll slowly migrate to new tables instead).
Right, should have read it
Weird, I'd imagine for the kind of usage they are getting it would be easy to shard the infrastructure into entirely separate instances.
Uh, they scaled PostgreSQL by offloading a lot of it to Azure CosmosDB.
I'm not sure that's the answer people are looking for.
AI-written blog; it's very generic and the same content is repeated many times.
Yeah that's an ad for Azure Cosmos DB
The article has so much fluff and only very coarse information ("we sharded writes, yay!"). Almost no detail, just keywords for SEO, or whatever they're aiming for.
There’s also a lot of repetition. Maybe it was AI generated…?
Could even be seen as a disguised ad for their infrastructure partner too.
Those sorts of articles suck big time.
I remember coming across an article from NYCMesh which looked interesting ("Administrating the Mesh" - https://www.nycmesh.net/blog/datadog/) which made sense all the way until they put Datadog on top of everything, and I asked myself:
> What the hell, how is using a centralized service for managing a decentralized mesh a suitable solution? Did the author get employed by Datadog or what happened?
Then I got curious and lo and behold; the author was indeed hired by Datadog (and still works there AFAIK), effectively compromising the entire article and the project itself, because of their new employer.
Yeah, right - when PostgreSQL starts to struggle, Microsoft Azure Cosmos DB[tm] comes to the rescue (mentioned 3x).
Lol, yes, it was very vague. Very generic paragraphs on Caching, Connection pooling, Query Optimization wrt Joins, etc.
I think it could even backfire as a piece of corporate promotion.
Ah yes, OpenAI is sharding now. Surprise surprise.
I mentioned that as the right solution to the problem last time they posted about Postgres performance issues:
https://news.ycombinator.com/item?id=44074702
For people not burning billions of VC dollars, sharding Postgres is not a bad option.
The 'single primary with read replicas' pattern scaling to 800M users is the real insight here. Most startups reach for sharding or distributed databases way too early, adding complexity for scale they don't have. If OpenAI can serve hundreds of millions from one Postgres primary by offloading reads and pushing new write-heavy features elsewhere, that's a strong argument for simplicity.
I like your point, but it also says that this isn't easy:
> It may sound surprising that a single-primary architecture can meet the demands of OpenAI’s scale; however, making this work in practice isn’t simple.
And it also says that this approach has cornered them into a solution that isn't trivial to change. They now use different database deployments (the single primary one that is the focus of the post and *multiple* other systems, such as Azure CosmosDB, to which some of the write traffic is being directed).
> To mitigate these limitations and reduce write pressure, we’ve migrated, and continue to migrate, shardable (i.e. workloads that can be horizontally partitioned), write-heavy workloads to sharded systems such as Azure Cosmos DB, optimising application logic to minimise unnecessary writes. We also no longer allow adding new tables to the current PostgreSQL deployment. New workloads default to the sharded systems.
I wonder how easy it is for developers to maintain and evolve this mix of miscellaneous database systems.
So yes, you can go far with a single primary, but you can also potentially never easily get away from it.
Counterpoint: if they had reached for sharding early, they would have avoided the technical debt of having to refactor their existing database. I don't think sharding is necessarily that complex either, especially for a SaaS-style app like ChatGPT where users are mostly siloed.
When people spend their entire careers in AWS land, it's easy to forget just how much power a single beefy bare metal server brings to bear. You can scale far and wide simply by getting a bigger server.
True. But “big beef” is complicated and difficult to make reliable. Horizontal scaling of unreliable servers makes it dirt simple to stay up through almost anything except sudden load spikes, and then it's largely a matter of configuring your auto-scaling and retries.
That said, big beef is so simple to start with. And this story is a strong example that YAGNI is a practical reality for almost everybody w.r.t. “distributed everything”.
If you need so many tricks to support the infra, it will eventually come back to bite you. I am pretty sure that Google in the year 2000 could have supported its workloads with existing technologies (Yahoo could, and it was a much larger company). But they built GFS and Bigtable, and the rest is history. Other companies struggled to catch up due to inferior infrastructure. A visionary company needs to be prepared and should not be hindered by its infrastructure. Can you scale the single-primary system another 10x or more? Because their CEO said that they will scale their revenue by that much within just a couple of years.
But they didn't really stay single-primary. They moved a lot of load off to alternate database systems. So they did effectively shard, just across different databases rather than within Postgres.
Quite possibly they would have been better off staying purely on Postgres but with sharding. But it's impossible to know.
This account's comment history is pure slop. 90% sure it's all AI-generated. The structure is too blatant.
TL;DR There is no secret sauce, it's the same set of techniques you’ve seen in most PostgreSQL scaling guides. Those techniques do work.
They could’ve used mongodb which is web scale NoSQL database because SQL is 1990’s era legacy technology.
/s