Hi,
In that case, I do not think that this change realistically changes anything for anyone.
In Paul’s case, where the name servers are further away than the timeout, he would send another packet, but then receive the first reply (disregarding any actual packet loss here), and after that unbound will have learned that the name server is further away.
He would have sent one extra packet. Re-probing could potentially have the same effect, but usually unbound should be busy enough to have a rolling mean that is up to date at any time.
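To put a number on the "one extra packet" argument, here is a rough sketch. It assumes no packet loss and that each retry doubles the previous timeout; that backoff rule and the function name are illustrative assumptions, not taken from unbound's source.

```python
def packets_sent(rtt_ms, first_timeout_ms, max_tries=5):
    """Count packets sent before the reply to the first packet arrives.

    Assumes no packet loss and a timeout that doubles on every retry
    (an illustrative backoff rule, not unbound's exact behaviour).
    """
    sent_at = 0.0               # time the most recent packet was sent
    timeout = first_timeout_ms  # timeout applied to the most recent packet
    packets = 1
    # Keep retrying while the current timeout expires before the first reply arrives.
    while packets < max_tries and sent_at + timeout < rtt_ms:
        sent_at += timeout
        timeout *= 2
        packets += 1
    return packets


# A 600 ms link with the 376 ms default: one extra packet.
print(packets_sent(600, 376))
# The same link with 1128 ms: no extra packet.
print(packets_sent(600, 1128))
```

Under these assumptions the cost of the lower default on a slow link is a single duplicate query per unknown server, not a failed lookup.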
Therefore this only matters in recursor mode where there are many servers being contacted instead of only a few forwarders. Again, there would be more overhead here, but there should not be any effect where names cannot be resolved.
We can now increase the timeout, which will cause slower resolution for many users that are running in recursor mode, or we can just leave it and nothing would change.
-Michael
On 8 Jan 2021, at 17:33, Jonatan Schlag jonatan.schlag@ipfire.org wrote:
Hi,
I will try to provide some explanations to the questions.
On 06.01.2021 at 19:01, Michael Tremer michael.tremer@ipfire.org wrote:
Hello,
On 6 Jan 2021, at 16:19, Tapani Tarvainen ipfire@tapanitarvainen.fi wrote:
On Wed, Jan 06, 2021 at 03:14:52PM +0000, Michael Tremer (michael.tremer@ipfire.org) wrote:
On 6 Jan 2021, at 12:02, Paul Simmons mbatranch@gmail.com wrote:
On 1/6/21 4:17 AM, Jonatan Schlag wrote:
When unbound has no information about a DNS-server a timeout of 376 msec is assumed. This works well in a lot of situations, but they mention in their documentation that this could be way too low. They recommend a timeout of 1126 msec for satellite connections (https://nlnetlabs.nl/documentation/unbound/unbound.conf).
A small nit, they actually suggest 1128 ... and that's indeed what the patch has:
- unknown-server-time-limit: 1128
But that's trivial. The point:
I am not entirely sure what this is supposed to fix.
It is possible that a DNS response takes longer than 376ms, indeed. Does it harm us if we send another packet? No.
If you are behind a slow satellite link, it can take more than that *every time*.
This should actually not be the case. There is no fixed timeout which can be set in unbound. They do something much more sophisticated here.
https://nlnetlabs.nl/documentation/unbound/info-timeout/
If I understand this document correctly, they keep something like a rolling mean. So if everybody ran 'unbound-control dump_infra', we would all get different timeout limits for every server and every site. The actual calculation seems to be much more complex (or their explanation of simple things is very complex and lacks formulas); this is only a simplified explanation, which I need for my next paragraph.
So the question is: when we have no information about a server (for example right after unbound starts, or once the entry in the infra cache has expired after the 15 min time limit), which timeout should we assume? We currently assume a timeout of 376 msec. They state in their documentation that on slow links 1128 msec is more suitable.
When we have information about a server (that is, the RTT of previous requests), this value should not matter, if I get this right.
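The idea described above can be sketched roughly as follows. This is only an illustration: it assumes a TCP-style smoothed estimator (smoothed RTT plus four times the deviation), and the class and field names are made up for the example, not unbound's actual code.

```python
from dataclasses import dataclass

UNKNOWN_SERVER_TIME_LIMIT_MS = 376  # fallback discussed in this thread
INFRA_TTL_S = 15 * 60               # infra cache lifetime mentioned above


@dataclass
class InfraEntry:
    srtt_ms: float     # smoothed round-trip time
    rttvar_ms: float   # smoothed deviation of the round-trip time
    updated_at: float  # time of the last update, in seconds


class InfraCache:
    """Hypothetical per-server RTT cache; names are illustrative only."""

    def __init__(self, unknown_limit_ms=UNKNOWN_SERVER_TIME_LIMIT_MS):
        self.unknown_limit_ms = unknown_limit_ms
        self.entries = {}

    def _fresh(self, server, now):
        entry = self.entries.get(server)
        if entry is None or now - entry.updated_at > INFRA_TTL_S:
            return None
        return entry

    def timeout_ms(self, server, now):
        entry = self._fresh(server, now)
        if entry is None:
            # No (or expired) information: only here does the fallback matter.
            return self.unknown_limit_ms
        return entry.srtt_ms + 4 * entry.rttvar_ms

    def record_rtt(self, server, rtt_ms, now):
        entry = self._fresh(server, now)
        if entry is None:
            # First sample after startup or expiry seeds the estimator.
            self.entries[server] = InfraEntry(rtt_ms, rtt_ms / 2, now)
            return
        err = rtt_ms - entry.srtt_ms
        entry.srtt_ms += err / 8
        entry.rttvar_ms += (abs(err) - entry.rttvar_ms) / 4
        entry.updated_at = now


cache = InfraCache()
print(cache.timeout_ms("ns1.example", now=0))    # fallback, nothing known yet
cache.record_rtt("ns1.example", rtt_ms=600, now=0)
print(cache.timeout_ms("ns1.example", now=1))    # derived from the measured RTT
print(cache.timeout_ms("ns1.example", now=901))  # entry expired, fallback again
```

In this model the configured limit is consulted only when the cache has nothing usable for a server, which is why it matters mostly right after startup and for rarely contacted servers.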
So you would always have sent another query before getting a response to the previous one.
True, but aren’t these extraordinary circumstances?
On a regular network we want to keep eyeballs happy and when packets get lost or get sent to a slow server, we want to try again - sooner rather than later.
If we would set this to a worst case setting (let’s say 10 seconds), then even for average users DNS resolution will become slower.
With TCP that would mean never getting a response, because you'd always terminate the connection too soon. With UDP, I'm not sure; it depends on how unbound handles incoming responses to queries it has already deemed lost and sent again. Adjusting delay-close might help. But it may be that it would not work at all when the limit is too small.
That would mean that someone installing IPFire in some remote location with a slow link would conclude that it just doesn't work.
The downside of increasing the limit is that sometimes replies will take longer when a packet is lost on the way because we'd wait longer before re-sending. So it should not be increased too much either.
This should only happen at first, while our own rolling mean has not yet adjusted to the needs of this site.
I don't have data to judge what the limit should be, but I'd tend to trust NLnet Labs' recommendation here and go with the suggested 1128 ms.
Did anyone actually experience some problems here that this needs changing?
@Jonatan: What is your motivation for this patch?
Just opening the discussion. It seems that their handling of timeouts and the infra cache could have caused a lot of problems for some users, so I thought about bringing this up. Maybe it is a good idea for people like Paul to test this before we think further about how this could be implemented. Adding a note to the wiki that this might be a tweak to improve DNS resolution could also be a solution. But people should first check their current infra cache, as these values determine whether this setting would help.
I hope I could make some things a little bit clearer.
Greetings Jonatan
-- Tapani Tarvainen