Fixing Federation
Just a general discussion. This isn't about switching protocols; instead, it asks an equally important question: "What can we fix in our own implementation right now?"
Jonne Haß Mon 5 Nov 2012 8:37PM
The first requirement before anything else is http://loom.io/discussions/612. Just sayin'
Sean Tilley Mon 5 Nov 2012 8:57PM
Definitely agreed; however, to do so we'll need an attack plan.
Lost in LA Mon 5 Nov 2012 9:07PM
I'm good with plans. How can I help?
Rich Sat 4 Oct 2014 8:14PM
(two year bump!)
I hope this isn't tl;dr so please be patient and stick with it :)
So, Sean's original question asked,
“What can we fix in our own implementation right now?”
One thing I see asked about and discussed countless times is the federation retry issue.
Our Wiki states:
Will a pod eventually receive federated posts that it misses while being offline/down?
Possibly. We retry the delivery three times at one hour intervals.
#WTF?!
We only try to resend a message/information/etc three times at one hour intervals? THREE TIMES? AT ONE HOUR INTERVALS?!??
No wonder there are so many questions and complaints about posts being missed :(
Now, here on Loomio there are countless discussions (literally countless, I didn't count them, there are that many) about huge changes which could be made to make federation more reliable, and they are indeed fantastic. But the reality is that they are very long-term goals in terms of implementation.
What I would like to suggest is a very simple change to the retry functionality.
Once an hour, for only three hours, is completely unrealistic in today's Diaspora ecosystem. So many pods, so many different connection types. You only need to look at podupti.me to see just how much downtime there actually is.
I would like to see the retry intervals not only made more frequent in the short term, but the longevity of the retries massively increased, to be something more like SMTP's approach to retrying message delivery. RFC 5321 states:
Retries continue until the message is transmitted or the sender gives up; the give-up time generally needs to be at least 4-5 days
Why on earth do we give up after just three hours?
Is there a technical reason why we couldn't (easily!) implement message delivery retries along the lines of:
1) Retry every 5 mins for six attempts (30 mins)
2) Then retry every 1 hour for six attempts (6 hrs)
3) Then retry every 3 hours for four attempts (12hrs)
4) Then retry every 6 hours for four attempts (24hrs)
5) Then retry every 12 hours for two attempts (24hrs)
6) Then retry every 24 hours for one attempt (24hrs)
I've just pulled these numbers out of the air, there is no science behind them and they are simply a starting point for discussion :)
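To make the proposal above concrete, here is one way the staged schedule could be sketched in Ruby. This is purely illustrative, matching Rich's numbers; the stage table and function name are made up for this example and exist nowhere in the Diaspora codebase:

```ruby
# Hypothetical sketch of the staged retry schedule proposed above.
# Each stage is [interval in minutes, number of attempts at that interval].
RETRY_STAGES = [
  [5,    6],  # every 5 min  for 6 attempts (30 min)
  [60,   6],  # every 1 hr   for 6 attempts (6 hrs)
  [180,  4],  # every 3 hrs  for 4 attempts (12 hrs)
  [360,  4],  # every 6 hrs  for 4 attempts (24 hrs)
  [720,  2],  # every 12 hrs for 2 attempts (24 hrs)
  [1440, 1]   # every 24 hrs for 1 attempt  (24 hrs)
]

# Returns the delay in minutes before the given retry attempt
# (1-indexed), or nil once all 23 attempts are exhausted.
def retry_delay_minutes(attempt)
  RETRY_STAGES.each do |interval, count|
    return interval if attempt <= count
    attempt -= count
  end
  nil
end
```

With this table the sender makes 23 attempts spread over roughly four days, in the same ballpark as the 4-5 day give-up time RFC 5321 suggests, while still giving up eventually.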
Jonne Haß Sat 4 Oct 2014 10:28PM
Note that the three-times-at-one-hour-intervals schedule is only for message delivery. All other potentially recoverable failed jobs are retried with an exponential back-off; the formula for that is (count ** 4) + 15 + (rand(30) * (count + 1)), with count being the number of attempts made so far[1]. We default to a maximum of 10 attempts, but this is configurable by the pod maintainer[2]. This results in the final retry happening approximately 4 hours after the first try.
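The back-off formula Jonne quotes can be written out as a small Ruby sketch (the function name here is made up for illustration; only the formula itself comes from the post):

```ruby
# Exponential back-off as quoted above: count is the number of
# attempts made so far (0 before the first retry). The rand(30)
# term adds jitter so retries from many jobs don't align.
def backoff_seconds(count)
  (count ** 4) + 15 + (rand(30) * (count + 1))
end

# Ignoring the random jitter, the cumulative wait across the
# default 10 attempts is:
total = (0...10).sum { |count| (count ** 4) + 15 }
# total == 15483 seconds, i.e. roughly 4.3 hours,
# which matches the "approximately 4 hours" figure above.
```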
This is already too much for joindiaspora.com to handle, Max lowered the number of retries for these comparably light jobs to just three[3].
Now, the job to deliver messages is a really heavy and long-running one. My pod knows about 575 pods; more than half of them are gone, and most of the gone ones simply time out. We have a relatively high timeout of 25 seconds to accommodate slowly responding setups, which improved federation stability significantly in the past[4]. Requests to other pods happen in parallel, but we have to limit the number of concurrent connections, since more parallel connections mean drastic spikes in memory usage; the default is currently 20[5]. Lowering memory usage is one of my personal focuses here, since it benefits both ends of the deployment spectrum, big pods as well as small ones.
So yes, the default retry strategies are very conservative, but this is to accommodate running costs. Look at the pricing for the Redis database you need to use on Heroku (which joindiaspora.com is deployed to) alone[6]. We can't just pile up jobs in it for weeks. Note also that with a longer delivery window we also need to keep retrying to process successfully received comments, likes, etc. for which we never got the parent. In sum this increases the number of jobs to process a lot, which means bigger deployments need to scale up more and thus significantly increase their running costs.
This issue might seem simple at first sight, but there are many variables and stakeholders involved. And nobody actually running one of the big deployments is actively contributing. I'm rather happy with how well it currently works and that we have defaults that seem to work for most people.
Rich Sun 5 Oct 2014 9:38PM
Hey Jonne.
Note that the three times in one hour interval is only for message delivery
Is message delivery not the most important part of the federation concept though?
I think every task is too much for joindiaspora.com to handle. The server is on its knees :(
Again, longer term, it would be good to be able to cleanly remove a pod from the ecosystem and let all other pods know to stop trying it. It would also be good if pods could expire other pods, so that if no connection can be made to a pod for, say, one week, it is never tried again (or something like that). But again, that's long-term stuff.
Short term, three attempts/hours to deliver a message is still completely unacceptable.
Can you imagine if SMTP gave up after three hours?
So how about putting a setting in diaspora.yml to allow podmins to scale the attempts, according to their infrastructure?
I am sure that many (many) podmins would increase this retry value.
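A setting along these lines might look like the following in diaspora.yml. To be clear, these keys are hypothetical: they don't exist in the current configuration file, and the names are invented here purely to illustrate the suggestion:

```yaml
# Hypothetical sketch only; these keys do not exist in diaspora.yml
# today, and the names are made up for illustration.
configuration:
  settings:
    federation:
      delivery_retries: 10    # number of delivery attempts (current behaviour: 3)
      retry_interval: 3600    # seconds between attempts (current behaviour: 1 hour)
```

Podmins on beefy infrastructure could then raise the values, while resource-constrained pods keep the conservative defaults.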
Rich Sun 5 Oct 2014 9:38PM
Ps. Thanks for the detailed response, my friend! Very grateful :)
Jonne Haß Sun 5 Oct 2014 9:42PM
Well, I guess a config option can't hurt.
Maciek Łoziński Thu 9 Oct 2014 9:52AM
Or maybe, instead of 1h/1h/1h intervals, make it something like 0.5h/2h/6h/24h, or even 1h/4h/24h? That should not put too much strain on servers.
Sean Tilley · Mon 5 Nov 2012 8:34PM
Talking to Mike Macgirvin of the Friendica project, I came away with these notes of things we need as far as federation is concerned:
From Mike:
"Decentralised communications have a tendency to cause "fanout", or a hundred/thousand deliveries for one message injected into the system. This is the nature of the beast. Ilya created a nice batching protocol for public posts. That helped.
The decentralised social web needs every trick in the book to fight fanout. Batching and prioritisation are the keys - and are the only things you can change. Sure you can try and work on performance, but that just reduces fanout linearly. One needs to find clever ways to reduce it logarithmically, because you're dealing with an exponentially expanding input."