Hi,

I will try to answer some of the questions raised here.

On 06.01.2021 at 19:01, Michael Tremer <michael.tremer@ipfire.org> wrote:

Hello,

On 6 Jan 2021, at 16:19, Tapani Tarvainen <ipfire@tapanitarvainen.fi> wrote:

On Wed, Jan 06, 2021 at 03:14:52PM +0000, Michael Tremer (michael.tremer@ipfire.org) wrote:

On 6 Jan 2021, at 12:02, Paul Simmons <mbatranch@gmail.com> wrote:

On 1/6/21 4:17 AM, Jonatan Schlag wrote:
When unbound has no information about a DNS-server
a timeout of 376 msec is assumed. This works well in a lot of situations,
but they mention in their documentation that this could be way too low.
They recommend a timeout of 1126 msec for satellite connections
(https://nlnetlabs.nl/documentation/unbound/unbound.conf).

A small nit, they actually suggest 1128 ... and that's indeed what
the patch has:

+    unknown-server-time-limit: 1128

But that's trivial. The point:

I am not entirely sure what this is supposed to fix.

It is possible that a DNS response takes longer than 376ms, indeed.
Does it harm us if we send another packet? No.

If you are behind a slow satellite link, it can take more than that
*every time*.
This should actually not be the case. There is no fixed timeout that can be set in unbound; they do something much more sophisticated here.

https://nlnetlabs.nl/documentation/unbound/info-timeout/

If I understand this document correctly, they keep something like a rolling mean. So if everybody ran 'unbound-control dump_infra', we would all get different timeout limits for every server and every site.
The actual calculation seems to be much more complex (or their explanation of simple things is very complex, without any formulas); this is only a simplified explanation, which I need for my next paragraph.

So the question is: when we have no information about a server (for example right after unbound starts, or when the entry in the infra cache has expired (time limit 15 min)), which timeout should we assume? We currently assume a timeout of 376 msec. They state in their documentation that 1128 msec is more suitable on slow links.

When we have information about a server (that is, the RTT of previous requests), this value should not matter, if I understand this correctly.
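
To make the point above concrete, here is a minimal sketch in Python of a per-server timeout estimator. It is an assumption, not unbound's code: it uses the classic TCP-style smoothing (smoothed RTT plus four times the variance, doubling the timeout after a lost reply), and the names RttEstimator and simulate, as well as the 700 ms link RTT, are made up for illustration. The only values taken from this thread are the two initial limits, 376 and 1128 msec.

```python
class RttEstimator:
    """Hypothetical per-server timeout estimator (TCP-style smoothing)."""

    def __init__(self, unknown_limit_ms):
        # No history yet: the timeout equals the "unknown server" limit.
        self.srtt = 0.0
        self.rttvar = unknown_limit_ms / 4
        self.rto = float(unknown_limit_ms)

    def update(self, sample_ms):
        """Fold one measured round-trip time into the estimate."""
        delta = sample_ms - self.srtt
        self.srtt += delta / 8
        self.rttvar += (abs(delta) - self.rttvar) / 4
        self.rto = self.srtt + 4 * self.rttvar

    def lost(self):
        """No reply arrived within the timeout: back off by doubling it."""
        self.rto *= 2


def simulate(unknown_limit_ms, link_rtt_ms, queries):
    """Send `queries` queries over a link with a constant RTT.

    Returns (number of resends caused by timeouts, final timeout in ms).
    """
    est = RttEstimator(unknown_limit_ms)
    resends = 0
    for _ in range(queries):
        # The timeout fires before the reply can arrive: resend and back off.
        while est.rto < link_rtt_ms:
            est.lost()
            resends += 1
        est.update(link_rtt_ms)
    return resends, est.rto


if __name__ == "__main__":
    for limit in (376, 1128):
        resends, rto = simulate(limit, link_rtt_ms=700, queries=50)
        print(f"initial limit {limit} ms: {resends} resend(s), "
              f"timeout now {rto:.0f} ms")
```

Under these assumptions the initial limit only affects the very first query to an unknown server: starting at 376 msec costs one resend on the assumed 700 ms link, starting at 1128 msec costs none, and after a few dozen replies both estimators end up at practically the same timeout.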

So you would always have sent another query before
getting a response to the previous one.

True, but aren’t these extraordinary circumstances?

On a regular network we want to keep eyeballs happy and when packets get lost or get sent to a slow server, we want to try again - sooner rather than later.

If we set this to a worst-case value (let’s say 10 seconds), then DNS resolution would become slower even for average users.

With TCP that would mean never getting a response, because you'd
always terminate the connection too soon. With UDP, I'm not sure,
depends on how unbound handles incoming responses to queries it's
already deemed lost and sent again. Adjusting delay-close might help.
But it may be it would not work at all when the limit is too small.

That would mean that someone installing IPFire in some remote location
with a slow link would conclude that it just doesn't work.

The downside of increasing the limit is that sometimes replies will
take longer when a packet is lost on the way because we'd wait longer
before re-sending. So it should not be increased too much either.
This should only happen at the beginning, while our own rolling mean has not yet adjusted to the needs of this site.

I don't have data to judge what the limit should be, but I'd tend to
trust NLnet Labs' recommendation here and go with the suggested 1128 ms.

Did anyone actually experience some problems here that this needs changing?

@Jonatan: What is your motivation for this patch?

Just opening the discussion. It seems that their handling of timeouts and the infra cache could have caused a lot of problems for some users, so I thought I would bring this up. Maybe it is a good idea for people like Paul to test this before we think further about how it could be implemented. Adding a note to the wiki that this might be a tweak to improve DNS resolution could also be a solution.
But people should first check their current infra cache, as these values determine whether this setting would help.

I hope I could make some things a little clearer.

Greetings, Jonatan


--
Tapani Tarvainen