From mboxrd@z Thu Jan 1 00:00:00 1970
From: Paul Simmons
To: development@lists.ipfire.org
Subject: Re: [RFC] unbound: Increase timeout value for unknown dns-server
Date: Sat, 09 Jan 2021 12:57:44 -0600
Message-ID: <096e8184-7dd0-e081-8b5a-c1f7c8dff476@gmail.com>
In-Reply-To: <4EEEF91B-540A-406B-B9C7-C3C8606026A0@ipfire.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============6396944829732134700=="
List-Id:

--===============6396944829732134700==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

On 1/9/21 9:04 AM, Michael Tremer wrote:
> Hi,
>
> In that case, I do not think that this change realistically changes anything for anyone.
>
> In Paul’s case, where the name servers are further away than the timeout, he would send another packet, but then receive the first reply (not regarding any actual packet loss here), and after that unbound will have learned that the name server is further away.
>
> He would have sent one extra packet. Potentially re-probing will cause the same effect, but usually unbound should be busy enough to have a rolling mean that is up to date at any time.
>
> Therefore this only matters in recursor mode where there are many servers being contacted instead of only a few forwarders. Again, there would be more overhead here, but there should not be any effect where names cannot be resolved.
>
> We can now increase the timeout, which will cause slower resolution for many users that are running in recursor mode, or we can just leave it and nothing would change.
>
> -Michael
>
>> On 8 Jan 2021, at 17:33, Jonatan Schlag wrote:
>>
>> Hi,
>>
>> I will try to provide some explanations to the questions.
>>
>>> On 06.01.2021 at 19:01, Michael Tremer wrote:
>>>
>>> Hello,
>>>
>>>> On 6 Jan 2021, at 16:19, Tapani Tarvainen wrote:
>>>>
>>>> On Wed, Jan 06, 2021 at 03:14:52PM +0000, Michael Tremer (michael.tremer(a)ipfire.org) wrote:
>>>>
>>>>>> On 6 Jan 2021, at 12:02, Paul Simmons wrote:
>>>>>>
>>>>>> On 1/6/21 4:17 AM, Jonatan Schlag wrote:
>>>>>>> When unbound has no information about a DNS-server
>>>>>>> a timeout of 376 msec is assumed. This works well in a lot of situations,
>>>>>>> but they mention in their documentation that this could be way too low.
>>>>>>> They recommend a timeout of 1126 msec for satellite connections
>>>>>>> (https://nlnetlabs.nl/documentation/unbound/unbound.conf).
>>>> A small nit, they actually suggest 1128 ... and that's indeed what
>>>> the patch has:
>>>>
>>>>>>> + unknown-server-time-limit: 1128
>>>> But that's trivial. The point:
>>>>
>>>>> I am not entirely sure what this is supposed to fix.
>>>>> It is possible that a DNS response takes longer than 376ms, indeed.
>>>>> Does it harm us if we send another packet? No.
>>>> If you are behind a slow satellite link, it can take more than that
>>>> *every time*.
>> This should actually not be the case. There is no fixed timeout which can be set in unbound. They do something much more sophisticated here.
>>
>> https://nlnetlabs.nl/documentation/unbound/info-timeout/
>>
>> If I understand this document correctly, they keep something like a rolling mean. So if everybody would execute 'unbound-control dump_infra', we all would get different timeout limits for every server and every site.
>> The actual calculation seems to be much more complex (or their explanation of simple things is very complex without any formulas); this is only a simple explanation which seems to be necessary for my next paragraph.
>>
>> So the question is, when we have no information about a server (for example right after startup of unbound or if the entry in the infra cache has expired (time limit 15 min)), which timeout should we assume? We currently assume a timeout of 376 msec. They state in their documentation that on slow links 1128 msec is more suitable.
>>
>> When we have information about a server (so the RTT of previous requests), this value should not matter, if I get this right.
>>
>>>> So you would always have sent another query before
>>>> getting a response to the previous one.
>>> True, but aren’t these extra-ordinary circumstances?
>>>
>>> On a regular network we want to keep eyeballs happy and when packets get lost or get sent to a slow server, we want to try again - sooner rather than later.
>>>
>>> If we would set this to a worst case setting (let’s say 10 seconds), then even for average users DNS resolution will become slower.
>>>
>>>> With TCP that would mean never getting a response, because you'd
>>>> always terminate the connection too soon. With UDP, I'm not sure,
>>>> depends on how unbound handles incoming responses to queries it's
>>>> already deemed lost and sent again. Adjusting delay-close might help.
>>>> But it may be it would not work at all when the limit is too small.
>>>>
>>>> That would mean that someone installing IPFire in some remote location
>>>> with a slow link would conclude that it just doesn't work.
>>>>
>>>> The downside of increasing the limit is that sometimes replies will
>>>> take longer when a packet is lost on the way because we'd wait longer
>>>> before re-sending. So it should not be increased too much either.
>> This should only happen the first time, when our own rolling mean is not yet adjusted to the needs of this site.
>>>> I don't have data to judge what the limit should be, but I'd tend to
>>>> trust NLnet Labs' recommendation here and go with the suggested 1128 ms.
>>> Did anyone actually experience some problems here that this needs changing?
>>>
>>> @Jonatan: What is your motivation for this patch?
>> Just opening the discussion. It seems that their handling of timeouts and the infra cache could have caused a lot of problems for some users, so I thought about bringing this up. Maybe it is a good idea that people like Paul test this before we further think about how this could be implemented. Also adding this to the wiki, that this might be a tweak to improve DNS resolution, could be a solution.
>> But people should first check the current infra cache as these values would determine if this setting would help.
>>
>> I hope I could make some things a little bit more clear.
>>
>> Greetings Jonatan
>>>> -- 
>>>> Tapani Tarvainen

Greetings, Michael and @list.

I tested the ping (-c1) times for the first 27 IPv4 addresses in the DNS
server list from the wiki. I can test more, if desired.

The fastest return was 596ms, and the slowest was 857ms. At present,
I'm using 9.9.9.10 (631ms ping) and 81.3.27.54 (752ms ping).

My DNS protocol is "TLS", and QNAME Minimisation is "Standard". Prior to
the release with TLS support, I was unable to resolve hosts at all.
(Did I mention that I dislike HughesNot? I have no other option for
'net connectivity - boonie life is great for the nerves, but hell on
talking to anyone.)

I'm willing to test Tapani's "/etc/unbound/local.d" proposal(s), if it
will clarify the situation. Also, I'm prepared to back up and edit any
other files that might assist testing.

I've noticed (from NTP logs) that name resolution usually stalls/fails
after ~3 hours when my LAN is quiet. Could changes to cache timeout
settings be beneficial?

Please advise...

Thank you (and, GREAT EFFORT, ALL!),
Paul

-- 
It is better to have loved a short man than never to have loved a tall.

--===============6396944829732134700==--