From mboxrd@z Thu Jan  1 00:00:00 1970
From: Paul Simmons <mbatranch@gmail.com>
To: development@lists.ipfire.org
Subject: Re: [RFC] unbound: Increase timeout value for unknown dns-server
Date: Sat, 09 Jan 2021 12:57:44 -0600
Message-ID: <096e8184-7dd0-e081-8b5a-c1f7c8dff476@gmail.com>
In-Reply-To: <4EEEF91B-540A-406B-B9C7-C3C8606026A0@ipfire.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============6396944829732134700=="
List-Id: <development.lists.ipfire.org>

--===============6396944829732134700==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

On 1/9/21 9:04 AM, Michael Tremer wrote:
> Hi,
>
> In that case, I do not think that this change realistically changes anythin=
g for anyone.
>
> In Paul=E2=80=99s case, where the name servers are further away than the ti=
meout, he would send another packet, but then receive the first reply (not re=
garding any actual packet loss here), and after that unbound will have learne=
d that the name server is further away.
>
> He would have sent one extra packet. Potentially re-probing will cause the =
same effect, but usually unbound should be busy enough to have a rolling mean=
 that is up to date at any time.
>
> Therefore this only matters in recursor mode where there are many servers b=
eing contacted instead of only a few forwarders. Again, there would be more o=
verhead here, but there should not be any effect where names cannot be resolv=
ed.
>
> We can now increase the timeout, which will cause slower resolution for man=
y users that are running in recursor mode, or we can just leave it and nothin=
g would change.
>
> -Michael
>
>> On 8 Jan 2021, at 17:33, Jonatan Schlag <jonatan.schlag(a)ipfire.org> wrot=
e:
>>
>> Hi,
>>
>> I will try to provide some explanations to the questions.
>>
>>> Am 06.01.2021 um 19:01 schrieb Michael Tremer <michael.tremer(a)ipfire.or=
g>:
>>>
>>> Hello,
>>>
>>>> On 6 Jan 2021, at 16:19, Tapani Tarvainen <ipfire(a)tapanitarvainen.fi> =
wrote:
>>>>
>>>> On Wed, Jan 06, 2021 at 03:14:52PM +0000, Michael Tremer (michael.tremer=
(a)ipfire.org) wrote:
>>>>
>>>>>> On 6 Jan 2021, at 12:02, Paul Simmons <mbatranch(a)gmail.com> wrote:
>>>>>>
>>>>>> On 1/6/21 4:17 AM, Jonatan Schlag wrote:
>>>>>>> When unbound has no information about a DNS-server
>>>>>>> a timeout of 376 msec is assumed. This works well in a lot of situati=
ons,
>>>>>>> but they mention in their documentation that this could be way too lo=
w.
>>>>>>> They recommend a timeout of 1126 msec for satellite connections
>>>>>>> (https://nlnetlabs.nl/documentation/unbound/unbound.conf).
>>>> A small nit, they actually suggest 1128 ... and that's indeed what
>>>> the patch has:
>>>>
>>>>>>> +    unknown-server-time-limit: 1128
>>>> But that's trivial. The point:
>>>>
>>>>> I am not entirely sure what this is supposed to fix.
>>>>> It is possible that a DNS response takes longer than 376ms, indeed.
>>>>> Does it harm us if we send another packet? No.
>>>> If you are behind a slow satellite link, it can take more than that
>>>> *every time*.
>> This should actually not be the case. There is no fixed timeout which =
can be set in unbound. They do something much more sophisticated here.
>>
>> https://nlnetlabs.nl/documentation/unbound/info-timeout/
>>
>> If I understand this document correctly, they keep something like a =
rolling mean. So if everybody executed 'unbound-control dump_infra', we =
would all get different timeout limits for every server and every site.
>> The actual calculation seems to be much more complex (or their =
explanation of simple things is very complex, without any formulas); this =
is only a simplified explanation, which I need for my next paragraph.
>>
>> So the question is which timeout we should assume when we have no =
information about a server (for example, right after unbound starts, or =
when the entry in the infra cache has expired after its 15-minute time =
limit). We currently assume a timeout of 376 msec. They state in their =
documentation that 1128 msec is more suitable on slow links.
>>
>> When we have information about a server (that is, the RTT of previous =
requests), this value should not matter, if I understand this correctly.
>>
>>>> So you would always have sent another query before
>>>> getting a response to the previous one.
>>> True, but aren=E2=80=99t these extraordinary circumstances?
>>>
>>> On a regular network we want to keep eyeballs happy and when packets get =
lost or get sent to a slow server, we want to try again - sooner rather than =
later.
>>>
>>> If we would set this to a worst case setting (let=E2=80=99s say 10 second=
s), then even for average users DNS resolution will become slower.
>>>
>>>> With TCP that would mean never getting a response, because you'd
>>>> always terminate the connection too soon. With UDP, I'm not sure,
>>>> depends on how unbound handles incoming responses to queries it's
>>>> already deemed lost and sent again. Adjusting delay-close might help.
>>>> But it may be it would not work at all when the limit is too small.
>>>>
>>>> That would mean that someone installing IPFire in some remote location
>>>> with a slow link would conclude that it just doesn't work.
>>>>
>>>> The downside of increasing the limit is that sometimes replies will
>>>> take longer when a packet is lost on the way because we'd wait longer
>>>> before re-sending. So it should not be increased too much either.
>> This should only happen at first, while our own rolling mean has not =
yet adjusted to the needs of this site.
>>>> I don't have data to judge what the limit should be, but I'd tend to
>>>> trust NLnet Labs' recommendation here and go with the suggested 1128 ms.
>>> Did anyone actually experience some problems here that this needs changin=
g?
>>>
>>> @Jonatan: What is your motivation for this patch?
>> Just opening the discussion. It seems that their handling of timeouts =
and the infra cache could have caused a lot of problems for some users, =
so I thought about bringing this up. Maybe it is a good idea for people =
like Paul to test this before we think further about how it could be =
implemented. Adding this to the wiki as a tweak that might improve DNS =
resolution could also be a solution.
>> But people should first check their current infra cache, as these =
values would determine whether this setting would help.
>>
>> I hope I could make some things a little clearer.
>>
>> Greetings Jonatan
>>>> --=20
>>>> Tapani Tarvainen

Greetings, Michael and @list.

I tested the ping (-c1) times for the first 27 IPv4 addresses in the DNS=20
server list from the wiki.=C2=A0 I can test more, if desired.

The fastest return was 596ms, and the slowest was 857ms.=C2=A0 At present,=20
I'm using 9.9.9.10 (631ms ping) and 81.3.27.54 (752ms ping).

My DNS protocol is "TLS", and QNAME Minimisation is "Standard". Prior to=20
the release with TLS support, I was unable to resolve hosts at all.=C2=A0=20
(Did I mention that I dislike HughesNot?=C2=A0 I have no other option for=20
'net connectivity - boonie life is great for the nerves, but hell on=20
talking to anyone.)

I'm willing to test Tapani's "/etc/unbound/local.d" proposal(s), if it=20
will clarify the situation. Also, I'm prepared to back up and edit any=20
other files that might assist testing.
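
For reference, here is roughly what I would expect such a test file to
contain. This is only a sketch: the file name is my own invention, I am
assuming that files in local.d need their own "server:" clause, and the
value is simply the 1128 msec from Jonatan's patch.

    # /etc/unbound/local.d/timeout-test.conf (hypothetical file name)
    server:
        # Initial timeout, in msec, for servers with no infra cache entry
        unknown-server-time-limit: 1128

As Jonatan suggested, running 'unbound-control dump_infra' before and
after the change should show which timeouts unbound has actually learned
for my forwarders.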

I've noticed (from NTP logs) that name resolution usually stalls/fails=20
after ~3 hours when my LAN is quiet.=C2=A0 Could changes to cache timeout=20
settings be beneficial?

Please advise...

Thank you (and, GREAT EFFORT, ALL!),

Paul

--=20
It is better to have loved a short man than never to have loved a tall.


--===============6396944829732134700==--