From mboxrd@z Thu Jan 1 00:00:00 1970
From: Paul Simmons
To: development@lists.ipfire.org
Subject: Re: [RFC] unbound: Increase timeout value for unknown dns-server
Date: Mon, 11 Jan 2021 22:37:47 -0600
Message-ID: <1788b289-f1b8-cef0-9560-370bad641e04@gmail.com>
In-Reply-To: <1468B7A9-ECA3-4B77-A4A1-30FBB114C6CB@ipfire.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============6311527017268222945=="
List-Id:

--===============6311527017268222945==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

On 1/11/21 5:10 AM, Michael Tremer wrote:
>
>> On 9 Jan 2021, at 18:57, Paul Simmons wrote:
>>
>> On 1/9/21 9:04 AM, Michael Tremer wrote:
>>> Hi,
>>>
>>> In that case, I do not think that this change realistically changes
>>> anything for anyone.
>>>
>>> In Paul's case, where the name servers are further away than the
>>> timeout, he would send another packet, but then receive the first
>>> reply (not regarding any actual packet loss here), and after that
>>> unbound will have learned that the name server is further away.
>>>
>>> He would have sent one extra packet. Potentially re-probing will
>>> cause the same effect, but usually unbound should be busy enough to
>>> have a rolling mean that is up to date at any time.
>>>
>>> Therefore this only matters in recursor mode where there are many
>>> servers being contacted instead of only a few forwarders. Again,
>>> there would be more overhead here, but there should not be any effect
>>> where names cannot be resolved.
>>>
>>> We can now increase the timeout, which will cause slower resolution
>>> for many users that are running in recursor mode, or we can just
>>> leave it and nothing would change.
>>>
>>> -Michael
>>>
>>>> On 8 Jan 2021, at 17:33, Jonatan Schlag wrote:
>>>>
>>>> Hi,
>>>>
>>>> I will try to provide some explanations to the questions.
>>>>
>>>>> On 06.01.2021 at 19:01, Michael Tremer wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>>> On 6 Jan 2021, at 16:19, Tapani Tarvainen wrote:
>>>>>>
>>>>>> On Wed, Jan 06, 2021 at 03:14:52PM +0000, Michael Tremer
>>>>>> (michael.tremer(a)ipfire.org) wrote:
>>>>>>
>>>>>>>> On 6 Jan 2021, at 12:02, Paul Simmons wrote:
>>>>>>>>
>>>>>>>> On 1/6/21 4:17 AM, Jonatan Schlag wrote:
>>>>>>>>> When unbound has no information about a DNS-server
>>>>>>>>> a timeout of 376 msec is assumed. This works well in a lot of situations,
>>>>>>>>> but they mention in their documentation that this could be way too low.
>>>>>>>>> They recommend a timeout of 1126 msec for satellite connections
>>>>>>>>> (https://nlnetlabs.nl/documentation/unbound/unbound.conf).
>>>>>> A small nit, they actually suggest 1128 ... and that's indeed what
>>>>>> the patch has:
>>>>>>
>>>>>>>>> + unknown-server-time-limit: 1128
>>>>>> But that's trivial. The point:
>>>>>>
>>>>>>> I am not entirely sure what this is supposed to fix.
>>>>>>> It is possible that a DNS response takes longer than 376ms, indeed.
>>>>>>> Does it harm us if we send another packet? No.
>>>>>> If you are behind a slow satellite link, it can take more than that
>>>>>> *every time*.
>>>> This should actually not be the case. There is no fixed timeout
>>>> which can be set in unbound. They do something much more
>>>> sophisticated here.
>>>>
>>>> https://nlnetlabs.nl/documentation/unbound/info-timeout/
>>>>
>>>> If I understand this document correctly, they keep something like a
>>>> rolling mean. So if everybody would execute
>>>> 'unbound-control dump_infra', we all would get different timeout
>>>> limits for every server and every site.
>>>> The actual calculation seems to be much more complex (or their
>>>> explanation of simple things is very complex without any formulas);
>>>> this is only a simple explanation which seems to be necessary for my
>>>> next paragraph.
>>>>
>>>> So the question is, when we have no information about a server (for
>>>> example right after startup of unbound or if the entry in the infra
>>>> cache has expired (time limit 15 min)), which timeout should we
>>>> assume. We currently assume a timeout of 376 msec. They state in
>>>> their documentation that on slow links 1128 msec is more suitable.
>>>>
>>>> When we have information about a server (so the rtt of previous
>>>> requests), this value should not matter, if I get this right.
>>>>
>>>>>> So you would always have sent another query before
>>>>>> getting a response to the previous one.
>>>>> True, but aren't these extra-ordinary circumstances?
>>>>>
>>>>> On a regular network we want to keep eyeballs happy and when
>>>>> packets get lost or get sent to a slow server, we want to try
>>>>> again - sooner rather than later.
>>>>>
>>>>> If we would set this to a worst case setting (let's say 10
>>>>> seconds), then even for average users DNS resolution will become
>>>>> slower.
>>>>>
>>>>>> With TCP that would mean never getting a response, because you'd
>>>>>> always terminate the connection too soon. With UDP, I'm not sure,
>>>>>> depends on how unbound handles incoming responses to queries it's
>>>>>> already deemed lost and sent again. Adjusting delay-close might help.
>>>>>> But it may be it would not work at all when the limit is too small.
>>>>>>
>>>>>> That would mean that someone installing IPFire in some remote location
>>>>>> with a slow link would conclude that it just doesn't work.
>>>>>>
>>>>>> The downside of increasing the limit is that sometimes replies will
>>>>>> take longer when a packet is lost on the way because we'd wait longer
>>>>>> before re-sending. So it should not be increased too much either.
>>>> This should only happen the first time, when our own rolling mean is
>>>> not yet adjusted to the needs of this site.
>>>>>> I don't have data to judge what the limit should be, but I'd tend to
>>>>>> trust nllabs recommendation here and go with the suggested 1128 ms.
>>>>> Did anyone actually experience some problems here that this needs
>>>>> changing?
>>>>>
>>>>> @Jonatan: What is your motivation for this patch?
>>>> Just opening the discussion. It seems that their handling of
>>>> timeouts and the infra cache could have caused a lot of problems for
>>>> some users, so I thought about bringing this up. Maybe it is a good
>>>> idea that people like Paul test this before we further think about
>>>> how this could be implemented. Also adding this to the wiki, that
>>>> this might be a tweak to improve dns resolution, could be a
>>>> solution.
>>>> But people should first check the current infra cache as these
>>>> values would determine if this setting would help.
>>>>
>>>> I hope I could make some things a little bit more clear.
>>>>
>>>> Greetings Jonatan
>>>>>> --
>>>>>> Tapani Tarvainen
>> Greetings, Michael and @list.
>>
>> I tested the ping (-c1) times for the first 27 IPv4 addresses in the
>> DNS server list from the wiki. I can test more, if desired.
>>
>> The fastest return was 596ms, and the slowest was 857ms. At present,
>> I'm using 9.9.9.10 (631ms ping) and 81.3.27.54 (752ms ping).
>>
>> My DNS protocol is "TLS", and QNAME Minimisation is "Standard". Prior
>> to the release with TLS support, I was unable to resolve hosts at
>> all. (Did I mention that I dislike HughesNot? I have no other option
>> for 'net connectivity - boonie life is great for the nerves, but hell
>> on talking to anyone.)
> The good thing is though, that we have a good test-bed for this kind
> of connection :)
>
> I know of some more people who use a satellite connection, but they
> are not very keen on testing things with it.
>
>> I'm willing to test Tapani's "/etc/unbound/local.d" proposal(s), if
>> it will clarify the situation. Also, I'm prepared to back up and edit
>> any other files that might assist testing.
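
For reference while testing, here is a minimal sketch of what such a
local.d fragment might look like. It assumes that files matching
/etc/unbound/local.d/*.conf are pulled in by the main unbound.conf; the
file name is hypothetical, and only the option name and the value 1128
come from the patch quoted earlier in this thread.

```
# Hypothetical file: /etc/unbound/local.d/slow-link.conf
# (the file name is an assumption, not part of the patch)
server:
    # Wait time in msec for a server that has no infra cache entry yet.
    # The thread gives 376 as the value assumed today; 1128 is the value
    # used in the patch under discussion.
    unknown-server-time-limit: 1128
```

Comparing the output of 'unbound-control dump_infra' before and after
the change should show whether the setting makes any difference on a
given link.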
>>
>> I've noticed (from NTP logs) that name resolution usually
>> stalls/fails after ~3 hours when my LAN is quiet. Could changes to
>> cache timeout settings be beneficial?
>>
>> Please advise...
>>
>> Thank you (and, GREAT EFFORT, ALL!),
>>
>> Paul
>>
>> --
>> It is better to have loved a short man than never to have loved a tall.
>>

I'm pleased to be able to help, and grateful for the attention and
assistance. See my next msg for testing update.

p.

--
I have a madness to my method.

--===============6311527017268222945==--