I run a bunch of mailing lists. Most of them are very active, not too many members, and very chatty. They are also very low-maintenance. One of my mailing lists is an announcement list for our neighborhood association, and has 500+ members, many of them elderly. It gets about 3-4 posts a month, and it’s very high maintenance (see my previous remark about elderly members). A large number of those members (over 50 of them) have rochester.rr.com email addresses. RoadRunner was the first broadband in our neighborhood, offered by Time Warner. Sometime after I went around the streets knocking on doors trying to convince people to sign up for an independent fiber internet company (Greenlight), Time Warner became Spectrum.
I guess some of their infrastructure was owned by Charter? I don’t know, it’s just that the MX for rochester.rr.com points to a charter.net server. Probably due to an acquisition or something. Not relevant.
For my sins, I upgraded my server a few months ago to Debian 13, while simultaneously upgrading my mailing lists from Mailman 2 to Mailman 3. And a few days ago I discovered that since the change, a large proportion of rochester.rr.com users were getting booted off the list because their mail was bouncing. I found this out because a sweet little old lady got the message saying her subscription was being removed because of these bounces, and responded to the bounce message saying “but I don’t want to be removed”.
Since that time, I’ve been trying to diagnose and fix the problem. What I saw in the logs was that RoadRunner gives you a code when it defers or bounces, which you can look up here:
https://www.spectrum.net/support/internet/understanding-email-error-codes
And most of the deferrals and bounces were getting a 1300 or 1370 code, both of which mean too many concurrent connections, or too many recipients in one connection.
The first thing I found was that Mailman was VERPing every message, which obviously makes it easier for Mailman to determine who bounced, but also means that Postfix is making 500+ simultaneous outgoing connections. I decided it would be better if Mailman just passed the whole shebang off to Postfix and let Postfix pick that apart. That took a bit of doing, including adding a configuration parameter to my mailman.cfg file that `mailman conf` told me was already the default. *Sigh*
Ok, once I had all 500 members coming as one block to Postfix, I set up a separate transport “slow_smtp” just for rochester.rr.com. For that one, I set
slow_smtp_destination_concurrency_limit = 1
slow_smtp_destination_recipient_limit = 5
which I thought would mean it would make one connection at a time, and send to 5 recipients each time. Turns out the concurrency_limit wasn’t doing what I thought it would do, i.e. making sure that there’s only one connection at a time. I don’t actually know what it does, because it looked like there were several connections at once. I showed my configuration on the Postfix Users mailing list, and Wietse Venema rather defensively said “The _destination_concurrency_limit and _smtp_destination_recipient_limit features are implemented by decades-old code that has not changed $forever.” And we all know that old code never has bugs in it. Besides, I was assuming I was configuring it wrong, not that the code was wrong.
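For reference, a per-domain transport like this is usually wired up across three files. What follows is a sketch of the common pattern, not a copy of the configuration described above: the file paths, the `hash:` table type, and the master.cf column values are assumptions that may differ on a given install. Only the transport name and the two limits come from this post.

```
# /etc/postfix/master.cf
# A second instance of the stock smtp client under its own name, so it can
# carry its own limits. Column values shown are typical; copy the ones from
# the existing "smtp" line on your system.
slow_smtp  unix  -       -       n       -       -       smtp

# /etc/postfix/transport
# Route only this one domain through the new transport.
rochester.rr.com    slow_smtp:

# /etc/postfix/main.cf
# Enable the transport map (table type is an assumption), then set the
# per-transport limits using the <transport>_ prefix.
transport_maps = hash:/etc/postfix/transport
slow_smtp_destination_concurrency_limit = 1
slow_smtp_destination_recipient_limit = 5
```

After editing the transport file, `postmap /etc/postfix/transport` rebuilds the lookup table and `postfix reload` picks up the changes. One thing worth knowing from the Postfix documentation: with a recipient limit greater than 1, the concurrency limit is counted per destination domain, while a recipient limit of exactly 1 switches it to counting per recipient.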
Anyway, after some back-and-forth with Venema and another guy who thought I didn’t know how to read a log file (turns out he was partially right), I added
slow_smtp_destination_rate_delay = 5
which did seem to mean that it would delay starting a new connection for 5 seconds after the last one, and if I’m understanding this correctly, means that as long as the previous connection is processed in 5 seconds, the next one won’t be simultaneous. In practice, what I seem to be seeing is that the first batch of 5 gets sent, the second batch of 5 gets sent, and the rest get “status=deferred”. Some time later (about 8 minutes?) it sent the third set of 5 and the fourth set of 5, and deferred the rest. After 4 sets of retries, the last 19 users got “status=bounced” instead of “status=deferred”. I have no idea why they suddenly decided to start bouncing. I suspect it’s just Time Warner because arseholes.
I’m still searching for the magic configuration which will allow the non-RoadRunner users to keep going as normal, and RoadRunner users to trickle through in whatever configuration it takes.