From mboxrd@z Thu Jan  1 00:00:00 1970
From: Paul Simmons <mbatranch@gmail.com>
To: development@lists.ipfire.org
Subject: Re: [RFC] unbound: Increase timeout value for unknown dns-server
Date: Mon, 11 Jan 2021 22:37:47 -0600
Message-ID: <1788b289-f1b8-cef0-9560-370bad641e04@gmail.com>
In-Reply-To: <1468B7A9-ECA3-4B77-A4A1-30FBB114C6CB@ipfire.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============6311527017268222945=="
List-Id: <development.lists.ipfire.org>

--===============6311527017268222945==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

On 1/11/21 5:10 AM, Michael Tremer wrote:
>
>> On 9 Jan 2021, at 18:57, Paul Simmons <mbatranch(a)gmail.com> wrote:
>>
>> On 1/9/21 9:04 AM, Michael Tremer wrote:
>>> Hi,
>>>
>>> In that case, I do not think that this change realistically changes anyth=
ing for anyone.
>>>
>>> In Paul=E2=80=99s case, where the name servers are further away than the =
timeout, he would send another packet, but then receive the first reply =
(disregarding any actual packet loss here), and after that unbound will =
have learned that the name server is further away.
>>>
>>> He would have sent one extra packet. Potentially re-probing will cause th=
e same effect, but usually unbound should be busy enough to have a rolling me=
an that is up to date at any time.
>>>
>>> Therefore this only matters in recursor mode where there are many servers=
 being contacted instead of only a few forwarders. Again, there would be more=
 overhead here, but there should not be any effect where names cannot be reso=
lved.
>>>
>>> We can now increase the timeout, which will cause slower resolution for m=
any users that are running in recursor mode, or we can just leave it and noth=
ing would change.
>>>
>>> -Michael
>>>
>>>> On 8 Jan 2021, at 17:33, Jonatan Schlag <jonatan.schlag(a)ipfire.org> wr=
ote:
>>>>
>>>> Hi,
>>>>
>>>> I will try to provide some explanations to the questions.
>>>>
>>>>> Am 06.01.2021 um 19:01 schrieb Michael Tremer <michael.tremer(a)ipfire.=
org>:
>>>>>
>>>>> Hello,
>>>>>
>>>>>> On 6 Jan 2021, at 16:19, Tapani Tarvainen <ipfire(a)tapanitarvainen.fi=
> wrote:
>>>>>>
>>>>>> On Wed, Jan 06, 2021 at 03:14:52PM +0000, Michael Tremer (michael.trem=
er(a)ipfire.org) wrote:
>>>>>>
>>>>>>>> On 6 Jan 2021, at 12:02, Paul Simmons <mbatranch(a)gmail.com> wrote:
>>>>>>>>
>>>>>>>> On 1/6/21 4:17 AM, Jonatan Schlag wrote:
>>>>>>>>> When unbound has no information about a DNS-server
>>>>>>>>> a timeout of 376 msec is assumed. This works well in a lot of situa=
tions,
>>>>>>>>> but they mention in their documentation that this could be way too =
low.
>>>>>>>>> They recommend a timeout of 1126 msec for satellite connections
>>>>>>>>> (https://nlnetlabs.nl/documentation/unbound/unbound.conf).
>>>>>> A small nit, they actually suggest 1128 ... and that's indeed what
>>>>>> the patch has:
>>>>>>
>>>>>>>>> +    unknown-server-time-limit: 1128
>>>>>> But that's trivial. The point:
>>>>>>
>>>>>>> I am not entirely sure what this is supposed to fix.
>>>>>>> It is possible that a DNS response takes longer than 376ms, indeed.
>>>>>>> Does it harm us if we send another packet? No.
>>>>>> If you are behind a slow satellite link, it can take more than that
>>>>>> *every time*.
>>>> This should actually not be the case. There is no fixed timeout =
which can be set in unbound. They do something much more =
sophisticated here.
>>>>
>>>> https://nlnetlabs.nl/documentation/unbound/info-timeout/
>>>>
>>>> If I understand this document correctly, they keep something like a =
rolling mean. So if everybody executed 'unbound-control dump_infra', we =
would all get different timeout limits for every server and every site.
>>>> The actual calculation seems to be much more complex (or their =
explanation of simple things is very complex without any formulas); this =
is only a simplified explanation, which I need for my next paragraph.
>>>>
>>>> So the question is: when we have no information about a server (for =
example right after startup of unbound, or if the entry in the infra =
cache has expired (time limit 15 min)), which timeout should we assume? =
We currently assume a timeout of 376 msec. They state in their =
documentation that on slow links 1128 msec is more suitable.
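>>>>
>>>> As a rough sketch of what testing this could look like (not the
>>>> actual patch): the file name below is hypothetical, and I am assuming
>>>> that files in /etc/unbound/local.d are included by the main
>>>> configuration and need their own "server:" clause.
>>>>
>>>> ```
>>>> # /etc/unbound/local.d/slow-link.conf (hypothetical file name)
>>>> server:
>>>>     # Timeout in msec assumed for servers with no infra cache entry.
>>>>     # Unbound's default is 376; 1128 is the value the documentation
>>>>     # suggests for slow links.
>>>>     unknown-server-time-limit: 1128
>>>> ```
>>>>
>>>> Once replies have been received from a server, the measured values in
>>>> the infra cache should take over from this setting.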
>>>>
>>>> When we have information about a server (i.e. the RTT of previous =
requests), this value should not matter, if I understand this correctly.
>>>>
>>>>>> So you would always have sent another query before
>>>>>> getting a response to the previous one.
>>>>> True, but aren=E2=80=99t these extra-ordinary circumstances?
>>>>>
>>>>> On a regular network we want to keep eyeballs happy and when packets ge=
t lost or get sent to a slow server, we want to try again - sooner rather tha=
n later.
>>>>>
>>>>> If we would set this to a worst case setting (let=E2=80=99s say 10 seco=
nds), then even for average users DNS resolution will become slower.
>>>>>
>>>>>> With TCP that would mean never getting a response, because you'd
>>>>>> always terminate the connection too soon. With UDP, I'm not sure,
>>>>>> depends on how unbound handles incoming responses to queries it's
>>>>>> already deemed lost and sent again. Adjusting delay-close might help.
>>>>>> But it may be it would not work at all when the limit is too small.
>>>>>>
>>>>>> That would mean that someone installing IPFire in some remote location
>>>>>> with a slow link would conclude that it just doesn't work.
>>>>>>
>>>>>> The downside of increasing the limit is that sometimes replies will
>>>>>> take longer when a packet is lost on the way because we'd wait longer
>>>>>> before re-sending. So it should not be increased too much either.
>>>> This should only happen the first time, while our own rolling mean =
has not yet adjusted to the needs of this site.
>>>>>> I don't have data to judge what the limit should be, but I'd tend
>>>>>> to trust NLnet Labs' recommendation here and go with the suggested
>>>>>> 1128 ms.
>>>>> Did anyone actually experience some problems here that this needs chang=
ing?
>>>>>
>>>>> @Jonatan: What is your motivation for this patch?
>>>> Just opening the discussion. It seems that their handling of =
timeouts and the infra cache could have caused a lot of problems for =
some users, so I thought about bringing this up. Maybe it is a good =
idea that people like Paul test this before we think further about how =
this could be implemented. Adding this to the wiki, as a tweak that =
might improve DNS resolution, could also be a solution.
>>>> But people should first check the current infra cache, as these =
values would determine whether this setting would help.
>>>>
>>>> I hope I could make some things a little clearer.
>>>>
>>>> Greetings Jonatan
>>>>>> --=20
>>>>>> Tapani Tarvainen
>> Greetings, Michael and @list.
>>
>> I tested the ping (-c1) times for the first 27 IPv4 addresses in the DNS s=
erver list from the wiki.  I can test more, if desired.
>>
>> The fastest return was 596ms, and the slowest was 857ms.  At present, I'm =
using 9.9.9.10 (631ms ping) and 81.3.27.54 (752ms ping).
>>
>> My DNS protocol is "TLS", and QNAME Minimisation is "Standard". Prior to t=
he release with TLS support, I was unable to resolve hosts at all.  (Did I me=
ntion that I dislike HughesNot?  I have no other option for 'net connectivity=
 - boonie life is great for the nerves, but hell on talking to anyone.)
> The good thing is though, that we have a good test-bed for this kind of con=
nection :)
>
> I know of some more people who use a satellite connection, but they are not=
 very keen on testing things with it.
>
>> I'm willing to test Tapani's "/etc/unbound/local.d" proposal(s), if it wil=
l clarify the situation.  Also, I'm prepared to backup and edit any other fil=
es that might assist testing.
>>
>> I've noticed (from NTP logs) that name resolution usually stalls/fails aft=
er ~3 hours when my LAN is quiet.  Could changes to cache timeout settings be=
 beneficial?
>>
>> Please advise...
>>
>> Thank you (and, GREAT EFFORT, ALL!),
>>
>> Paul
>>
>> --=20
>> It is better to have loved a short man than never to have loved a tall.
>>
I'm pleased to be able to help, and grateful for the attention and=20
assistance.=C2=A0 See my next msg for testing update.

p.

--=20
I have a madness to my method.


--===============6311527017268222945==--