We were comfortably supporting millions of jobs per day as a Postgres queue (using select for update skip locked semantics) at a previous role.
Scaled much, much further than I would’ve guessed at the time when I called it a short-term solution :) — now I have much more confidence in Postgres ;)
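For anyone curious, the core of the pattern is just a claim-and-lock query. A minimal sketch, assuming a hypothetical jobs table (the schema here is illustrative, not the one we actually ran):

    -- Illustrative schema; not the actual table from the deployment above.
    CREATE TABLE jobs (
        id         bigserial PRIMARY KEY,
        payload    jsonb NOT NULL,
        status     text NOT NULL DEFAULT 'pending',
        created_at timestamptz NOT NULL DEFAULT now()
    );

    -- Each worker claims one pending job. SKIP LOCKED makes concurrent
    -- workers pass over rows another transaction has already locked,
    -- so they never block on each other.
    WITH next_job AS (
        SELECT id
        FROM jobs
        WHERE status = 'pending'
        ORDER BY created_at
        LIMIT 1
        FOR UPDATE SKIP LOCKED
    )
    UPDATE jobs
    SET status = 'running'
    FROM next_job
    WHERE jobs.id = next_job.id
    RETURNING jobs.id, jobs.payload;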
> We were comfortably supporting millions of jobs per day as a Postgres queue (using select for update skip locked semantics) at a previous role.
That's very refreshing to hear. In a previous role I was in a situation similar to yours, but I pushed for RabbitMQ instead of Postgres due to scaling concerns, with hypothetical ceilings smaller than the ones you faced. My team had to make a call without hard numbers to support any decision and no time to put together a proof of concept. The trade-off was the simplicity of Postgres versus paying in complexity for the assurance of a purpose-built message broker. In the end I pushed for the most conservative approach and we went with RabbitMQ, because I didn't want to be the one having to explain why we had problems getting an RDBMS to act as a message broker when we could get a real message broker for free with a docker pull.
I was always left wondering if that was the right call, and apparently it wasn't, because RabbitMQ also put up a fight.
If there were articles out there showcasing case studies of real-world applications built on a message broker over an RDBMS, then people like me would have an easier time pushing for saner choices.
Those don't have money to fund studies about industry best practices. So you don't get many.
Almost everything you see on how to use a DBMS is either an amateur blog or one of those studies. The former is usually dismissed in any organization with more than one layer of management.
> Those don't have money to fund studies about industry best practices. So you don't get many.
Your comment reads like a strawman. I didn't need "studies". It would have been enough if there was a guy with a blog saying "I used Postgres as a message broker like this and I got these numbers", with a GitLab project page giving the public the setup and benchmark code.
Just out of curiosity (as someone who hasn't done a lot of this kind of operational stuff) how does this approach to queueing with Postgres degrade as scale increases? Is it just that your job throughput starts to hit a ceiling?
Throughput is less of an issue than queue size: Postgres can handle a truly incredible amount of throughput as long as the jobs table is small enough that it can safely remain in memory for every operation. We can handle 800k jobs/hr with Postgres, but if you have more than 5k or 10k jobs in the table at any given time, you're in dangerous territory. It's a different way of thinking about queue design than with some other systems, but it's definitely worth it if you're interested in the benefits Postgres brings (atomicity, reliability, etc.).
With Postgres, you also need to worry a lot about dead-tuple bloat (tombstoning) and your ability to keep up with the vacuums necessary to deal with highly mutable data. This can depend a lot on what else is going on in the database and whether you have more than one index on the table.
One strategy for mitigating vacuum costs is to adopt an append-only design and partition the table; then you can just drop whole partitions and avoid the vacuum costs, as in the sketch below.
It really depends on your needs, but this can unlock some very impressive and sustainable throughput.
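A minimal sketch of that idea, assuming a hypothetical time-partitioned jobs table (names and boundaries are illustrative):

    -- Illustrative append-only, time-partitioned layout.
    CREATE TABLE jobs_archive (
        id         bigint NOT NULL,
        payload    jsonb NOT NULL,
        created_at timestamptz NOT NULL DEFAULT now()
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE jobs_archive_2024_01 PARTITION OF jobs_archive
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

    -- Once every job in an old partition has been processed, drop it.
    -- Dropping a partition is a cheap metadata operation; no dead tuples
    -- are left behind for vacuum to clean up.
    DROP TABLE jobs_archive_2024_01;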
That's the original problem, but then there are the secondary effects. Some of the people who made decisions on that basis write blog posts about what they did, then those blog posts end up on StackOverflow etc., and eventually it just becomes "this is how we do it by default" orthodoxy without much conscious reasoning involved; it's simply a safe bet to do what works for everybody else, even if it's not optimal.
My hobby project does ~1.5M jobs per day enqueued into Postgres, no sweat. I use https://github.com/bensheldon/good_job which uses PG's LISTEN/NOTIFY to lower worker poll latency.
Briefly, it spins up a background thread with a dedicated database connection and makes a blocking Postgres LISTEN query until results are returned, and then it forwards the result to other subscribing objects.
I can't speak for how they do it, but when your worker polls the table and finds no rows, it sleeps. While sleeping, it can also LISTEN on a channel (and if it gets a message, it aborts the sleep).
Then, whenever you write a new job to the queue, you also do a NOTIFY on the same channel.
This lets you keep latency low while still polling relatively infrequently.
NOTIFY is actually transactional, which makes this approach even better: the LISTENer won't be notified until the NOTIFY's transaction commits.
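A minimal sketch of the two halves, assuming a hypothetical job_queue channel and the illustrative jobs table from earlier in the thread:

    -- Worker session: subscribe once, then poll; between polls the worker
    -- sleeps and wakes early if a notification arrives on the channel.
    LISTEN job_queue;

    -- Producer: enqueue and notify in the same transaction. Because NOTIFY
    -- is transactional, listeners only hear about the job after COMMIT,
    -- so they never wake up for a row they can't see yet.
    BEGIN;
    INSERT INTO jobs (payload) VALUES ('{"task": "send_email"}');
    NOTIFY job_queue;
    COMMIT;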
A few million a day is only a few dozen per second (e.g. 3M/day ÷ 86,400 s ≈ 35 jobs/s); we currently have a service running this order of magnitude of jobs with a SELECT ... FOR UPDATE SKIP LOCKED pattern and no issues at all on a medium AWS box.
In other SQL databases an "in memory" table could be a candidate. It looks like Postgres only has session-specific temporary tables, but it does have an UNLOGGED table type (https://www.postgresql.org/docs/13/sql-createtable.html) which has desirable properties for temporary data that must be shared across sessions.
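A minimal sketch, assuming a hypothetical jobs table (illustrative only); the trade-off is spelled out in the comment:

    -- UNLOGGED skips write-ahead logging, so inserts/updates are cheaper,
    -- but the table is truncated after a crash and is not replicated;
    -- only suitable for jobs you can afford to lose or re-enqueue.
    CREATE UNLOGGED TABLE jobs_volatile (
        id      bigserial PRIMARY KEY,
        payload jsonb NOT NULL
    );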
A well-tuned bare-metal box in a master-replica config should easily handle (being conservative here) 10k jobs/s... I assume a purpose-built box could handle 100k/s without breaking a sweat.
I've used Postgres to handle 60M jobs per month (using FOR UPDATE SKIP LOCKED) in production, for two years, on a single dual-core 8GB GCE VM. Postgres goes far.