From mboxrd@z Thu Jan  1 00:00:00 1970
From: Michael Tremer <michael.tremer@ipfire.org>
To: development@lists.ipfire.org
Subject: Re: [RFC] unbound: Increase timeout value for unknown dns-server
Date: Mon, 11 Jan 2021 11:10:39 +0000
Message-ID: <1468B7A9-ECA3-4B77-A4A1-30FBB114C6CB@ipfire.org>
In-Reply-To: <096e8184-7dd0-e081-8b5a-c1f7c8dff476@gmail.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============3893513123639227440=="
List-Id: <development.lists.ipfire.org>

--===============3893513123639227440==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable


> On 9 Jan 2021, at 18:57, Paul Simmons <mbatranch(a)gmail.com> wrote:
>=20
> On 1/9/21 9:04 AM, Michael Tremer wrote:
>> Hi,
>>=20
>> In that case, I do not think that this change realistically changes anythi=
ng for anyone.
>>=20
>> In Paul=E2=80=99s case, where the name servers are further away than =
the timeout, he would send another packet, but then receive the first =
reply (disregarding any actual packet loss here), and after that =
unbound will have learned that the name server is further away.
>>=20
>> He would have sent one extra packet. Potentially re-probing will cause the=
 same effect, but usually unbound should be busy enough to have a rolling mea=
n that is up to date at any time.
>>=20
>> Therefore this only matters in recursor mode where there are many servers =
being contacted instead of only a few forwarders. Again, there would be more =
overhead here, but there should not be any effect where names cannot be resol=
ved.
>>=20
>> We can now increase the timeout, which will cause slower resolution for ma=
ny users that are running in recursor mode, or we can just leave it and nothi=
ng would change.
>>=20
>> -Michael
>>=20
>>> On 8 Jan 2021, at 17:33, Jonatan Schlag <jonatan.schlag(a)ipfire.org> wro=
te:
>>>=20
>>> Hi,
>>>=20
>>> I will try to provide some explanations to the questions.
>>>=20
>>>> On 06.01.2021 at 19:01, Michael Tremer <michael.tremer(a)ipfire.org> =
wrote:
>>>>=20
>>>> Hello,
>>>>=20
>>>>> On 6 Jan 2021, at 16:19, Tapani Tarvainen <ipfire(a)tapanitarvainen.fi>=
 wrote:
>>>>>=20
>>>>> On Wed, Jan 06, 2021 at 03:14:52PM +0000, Michael Tremer (michael.treme=
r(a)ipfire.org) wrote:
>>>>>=20
>>>>>>> On 6 Jan 2021, at 12:02, Paul Simmons <mbatranch(a)gmail.com> wrote:
>>>>>>>=20
>>>>>>> On 1/6/21 4:17 AM, Jonatan Schlag wrote:
>>>>>>>> When unbound has no information about a DNS-server
>>>>>>>> a timeout of 376 msec is assumed. This works well in a lot of situat=
ions,
>>>>>>>> but they mention in their documentation that this could be way too l=
ow.
>>>>>>>> They recommend a timeout of 1126 msec for satellite connections
>>>>>>>> (https://nlnetlabs.nl/documentation/unbound/unbound.conf).
>>>>> A small nit, they actually suggest 1128 ... and that's indeed what
>>>>> the patch has:
>>>>>=20
>>>>>>>> +    unknown-server-time-limit: 1128
>>>>> But that's trivial. The point:
>>>>>=20
>>>>>> I am not entirely sure what this is supposed to fix.
>>>>>> It is possible that a DNS response takes longer than 376ms, indeed.
>>>>>> Does it harm us if we send another packet? No.
>>>>> If you are behind a slow satellite link, it can take more than that
>>>>> *every time*.
>>> This should actually not be the case. There is no fixed timeout which =
can be set in unbound. They do something much more sophisticated here.
>>>=20
>>> https://nlnetlabs.nl/documentation/unbound/info-timeout/
>>>=20
>>> If I understand this document correctly, they keep something like a =
rolling mean. So if everybody executed 'unbound-control dump_infra', we =
would all get different timeout limits for every server and every site.
>>> The actual calculation seems to be much more complex (or their =
explanation of simple things is very complex without any formulas); this =
is only a simplified explanation, which seems to be necessary for my next =
paragraph.
>>>=20
>>> So the question is, when we have no information about a server (for examp=
le right after startup of unbound or if the entry in the infra cache has expi=
red (time limit 15 min)), which timeout should we assume. We currently assume=
 a timeout of 376 msec. They state in their documentation that on slow links =
1128 msec is more suitable.
>>>=20
>>> When we have information about a server (i.e. the RTT of previous =
requests), this value should not matter, if I understand this correctly.
>>>=20
>>>>> So you would always have sent another query before
>>>>> getting a response to the previous one.
>>>> True, but aren=E2=80=99t these extraordinary circumstances?
>>>>=20
>>>> On a regular network we want to keep eyeballs happy and when packets get=
 lost or get sent to a slow server, we want to try again - sooner rather than=
 later.
>>>>=20
>>>> If we would set this to a worst case setting (let=E2=80=99s say 10 secon=
ds), then even for average users DNS resolution will become slower.
>>>>=20
>>>>> With TCP that would mean never getting a response, because you'd
>>>>> always terminate the connection too soon. With UDP, I'm not sure,
>>>>> depends on how unbound handles incoming responses to queries it's
>>>>> already deemed lost and sent again. Adjusting delay-close might help.
>>>>> But it may be it would not work at all when the limit is too small.
>>>>>=20
>>>>> That would mean that someone installing IPFire in some remote location
>>>>> with a slow link would conclude that it just doesn't work.
>>>>>=20
>>>>> The downside of increasing the limit is that sometimes replies will
>>>>> take longer when a packet is lost on the way because we'd wait longer
>>>>> before re-sending. So it should not be increased too much either.
>>> This should only happen the first time, when our own rolling mean has =
not yet adjusted to the needs of this site.
>>>>> I don't have data to judge what the limit should be, but I'd tend to
>>>>> trust NLnet Labs' recommendation here and go with the suggested 1128 ms.
>>>> Did anyone actually experience some problems here that this needs changi=
ng?
>>>>=20
>>>> @Jonatan: What is your motivation for this patch?
>>> Just opening the discussion. It seems that their handling of timeouts =
and the infra cache could have caused a lot of problems for some users, =
so I thought about bringing this up. Maybe it is a good idea that people =
like Paul test this before we think further about how this could be =
implemented. Adding a note to the wiki that this might be a tweak to =
improve DNS resolution could also be a solution.
>>> But people should first check the current infra cache as these values wou=
ld determine if this setting would help.
>>>=20
>>> I hope I could make some things a little bit clearer.
>>>=20
>>> Greetings Jonatan
>>>>> --=20
>>>>> Tapani Tarvainen
>=20
> Greetings, Michael and @list.
>=20
> I tested the ping (-c1) times for the first 27 IPv4 addresses in the DNS se=
rver list from the wiki.  I can test more, if desired.
>=20
> The fastest return was 596ms, and the slowest was 857ms.  At present, I'm u=
sing 9.9.9.10 (631ms ping) and 81.3.27.54 (752ms ping).
>=20
> My DNS protocol is "TLS", and QNAME Minimisation is "Standard". Prior to th=
e release with TLS support, I was unable to resolve hosts at all.  (Did I men=
tion that I dislike HughesNot?  I have no other option for 'net connectivity =
- boonie life is great for the nerves, but hell on talking to anyone.)

The good thing, though, is that we have a good test bed for this kind of =
connection :)

I know of some more people who use a satellite connection, but they are not v=
ery keen on testing things with it.

> I'm willing to test Tapani's "/etc/unbound/local.d" proposal(s), if it =
will clarify the situation.  Also, I'm prepared to back up and edit any =
other files that might assist testing.
>=20
> I've noticed (from NTP logs) that name resolution usually stalls/fails afte=
r ~3 hours when my LAN is quiet.  Could changes to cache timeout settings be =
beneficial?
>=20
> Please advise...
>=20
> Thank you (and, GREAT EFFORT, ALL!),
>=20
> Paul
>=20
> --=20
> It is better to have loved a short man than never to have loved a tall.
>=20


--===============3893513123639227440==--