When unbound has no information about a DNS server, a timeout of 376 msec is assumed. This works well in a lot of situations, but they mention in their documentation that this could be way too low. They recommend a timeout of 1126 msec for satellite connections (https://nlnetlabs.nl/documentation/unbound/unbound.conf). Setting this value to 1126 msec should make the first queries to an unknown server more useful: they do not time out, and so they do not need to be sent again.
On a stable link, this behaviour should not have negative implications. As the first query results arrive, the timeout value gets updated, and the high value of 1126 msec gets replaced with something more useful.
Signed-off-by: Jonatan Schlag jonatan.schlag@ipfire.org
---
 config/unbound/unbound.conf | 1 +
 1 file changed, 1 insertion(+)
diff --git a/config/unbound/unbound.conf b/config/unbound/unbound.conf
index f78aaae8c..02f093015 100644
--- a/config/unbound/unbound.conf
+++ b/config/unbound/unbound.conf
@@ -62,6 +62,7 @@ server:
 	# Timeout behaviour
 	infra-keep-probing: yes
+	unknown-server-time-limit: 1128
 	# Bootstrap root servers
 	root-hints: "/etc/unbound/root.hints"
On 1/6/21 4:17 AM, Jonatan Schlag wrote:
When unbound has no information about a DNS server, a timeout of 376 msec is assumed. This works well in a lot of situations, but they mention in their documentation that this could be way too low. They recommend a timeout of 1126 msec for satellite connections (https://nlnetlabs.nl/documentation/unbound/unbound.conf). Setting this value to 1126 msec should make the first queries to an unknown server more useful: they do not time out, and so they do not need to be sent again.
On a stable link, this behaviour should not have negative implications. As the first query results arrive, the timeout value gets updated, and the high value of 1126 msec gets replaced with something more useful.
Signed-off-by: Jonatan Schlag jonatan.schlag@ipfire.org
config/unbound/unbound.conf | 1 +
1 file changed, 1 insertion(+)
diff --git a/config/unbound/unbound.conf b/config/unbound/unbound.conf
index f78aaae8c..02f093015 100644
--- a/config/unbound/unbound.conf
+++ b/config/unbound/unbound.conf
@@ -62,6 +62,7 @@ server:
# Timeout behaviour
infra-keep-probing: yes
unknown-server-time-limit: 1128
# Bootstrap root servers
root-hints: "/etc/unbound/root.hints"
This sounds promising to me, as I have many DNS lookup timeouts (ISP is HughesNot, er, HughesNet).
+1
Paul
Hello,
On 6 Jan 2021, at 12:02, Paul Simmons mbatranch@gmail.com wrote:
On 1/6/21 4:17 AM, Jonatan Schlag wrote:
When unbound has no information about a DNS server, a timeout of 376 msec is assumed. This works well in a lot of situations, but they mention in their documentation that this could be way too low. They recommend a timeout of 1126 msec for satellite connections (https://nlnetlabs.nl/documentation/unbound/unbound.conf). Setting this value to 1126 msec should make the first queries to an unknown server more useful: they do not time out, and so they do not need to be sent again.
On a stable link, this behaviour should not have negative implications. As the first query results arrive, the timeout value gets updated, and the high value of 1126 msec gets replaced with something more useful.
Signed-off-by: Jonatan Schlag jonatan.schlag@ipfire.org
config/unbound/unbound.conf | 1 +
1 file changed, 1 insertion(+)
diff --git a/config/unbound/unbound.conf b/config/unbound/unbound.conf
index f78aaae8c..02f093015 100644
--- a/config/unbound/unbound.conf
+++ b/config/unbound/unbound.conf
@@ -62,6 +62,7 @@ server:
 	# Timeout behaviour
 	infra-keep-probing: yes
+	unknown-server-time-limit: 1128
 	# Bootstrap root servers
 	root-hints: "/etc/unbound/root.hints"
I am not entirely sure what this is supposed to fix.
It is possible that a DNS response takes longer than 376ms, indeed. Does it harm us if we send another packet? No.
So what is this changing in real life?
This sounds promising to me, as I have many DNS lookup timeouts (ISP is HughesNot, er, HughesNet).
@Paul: I am not sure if the solution is to increase timeouts. From my point of view, you should change the name servers.
+1
Paul
On Wed, Jan 06, 2021 at 03:14:52PM +0000, Michael Tremer (michael.tremer@ipfire.org) wrote:
On 6 Jan 2021, at 12:02, Paul Simmons mbatranch@gmail.com wrote:
On 1/6/21 4:17 AM, Jonatan Schlag wrote:
When unbound has no information about a DNS server, a timeout of 376 msec is assumed. This works well in a lot of situations, but they mention in their documentation that this could be way too low. They recommend a timeout of 1126 msec for satellite connections (https://nlnetlabs.nl/documentation/unbound/unbound.conf).
A small nit, they actually suggest 1128 ... and that's indeed what the patch has:
+ unknown-server-time-limit: 1128
But that's trivial. The point:
I am not entirely sure what this is supposed to fix.
It is possible that a DNS response takes longer than 376ms, indeed. Does it harm us if we send another packet? No.
If you are behind a slow satellite link, it can take more than that *every time*. So you would always have sent another query before getting a response to the previous one.
With TCP that would mean never getting a response, because you'd always terminate the connection too soon. With UDP, I'm not sure; it depends on how unbound handles incoming responses to queries it's already deemed lost and sent again. Adjusting delay-close might help. But it may be that it would not work at all when the limit is too small.
That would mean that someone installing IPFire in some remote location with a slow link would conclude that it just doesn't work.
The downside of increasing the limit is that sometimes replies will take longer when a packet is lost on the way because we'd wait longer before re-sending. So it should not be increased too much either.
I don't have data to judge what the limit should be, but I'd tend to trust NLnet Labs' recommendation here and go with the suggested 1128 ms.
Hello,
On 6 Jan 2021, at 16:19, Tapani Tarvainen ipfire@tapanitarvainen.fi wrote:
On Wed, Jan 06, 2021 at 03:14:52PM +0000, Michael Tremer (michael.tremer@ipfire.org) wrote:
On 6 Jan 2021, at 12:02, Paul Simmons mbatranch@gmail.com wrote:
On 1/6/21 4:17 AM, Jonatan Schlag wrote:
When unbound has no information about a DNS server, a timeout of 376 msec is assumed. This works well in a lot of situations, but they mention in their documentation that this could be way too low. They recommend a timeout of 1126 msec for satellite connections (https://nlnetlabs.nl/documentation/unbound/unbound.conf).
A small nit, they actually suggest 1128 ... and that's indeed what the patch has:
+ unknown-server-time-limit: 1128
But that's trivial. The point:
I am not entirely sure what this is supposed to fix.
It is possible that a DNS response takes longer than 376ms, indeed. Does it harm us if we send another packet? No.
If you are behind a slow satellite link, it can take more than that *every time*. So you would always have sent another query before getting a response to the previous one.
True, but aren't these extraordinary circumstances?
On a regular network we want to keep eyeballs happy, and when packets get lost or get sent to a slow server, we want to try again, sooner rather than later.
If we set this to a worst-case value (let's say 10 seconds), then DNS resolution will become slower even for average users.
With TCP that would mean never getting a response, because you'd always terminate the connection too soon. With UDP, I'm not sure; it depends on how unbound handles incoming responses to queries it's already deemed lost and sent again. Adjusting delay-close might help. But it may be that it would not work at all when the limit is too small.
That would mean that someone installing IPFire in some remote location with a slow link would conclude that it just doesn't work.
The downside of increasing the limit is that sometimes replies will take longer when a packet is lost on the way because we'd wait longer before re-sending. So it should not be increased too much either.
I don't have data to judge what the limit should be, but I'd tend to trust NLnet Labs' recommendation here and go with the suggested 1128 ms.
Did anyone actually experience some problems here that this needs changing?
@Jonatan: What is your motivation for this patch?
-- Tapani Tarvainen
On Jan 6, 2021, at 12:01 PM, Michael Tremer michael.tremer@ipfire.org wrote:
Did anyone actually experience some problems here that this needs changing?
Maybe here?
https://community.ipfire.org/t/override-disable-dnssec-system/2717 https://community.ipfire.org/t/override-disable-dnssec-system/2717
Hello Jon,
Yes, that could be true.
Can someone reach out to that user and see if they can apply the change and confirm that this works?
-Michael
On 6 Jan 2021, at 18:59, Jon Murphy jcmurphy26@gmail.com wrote:
On Jan 6, 2021, at 12:01 PM, Michael Tremer michael.tremer@ipfire.org wrote:
Did anyone actually experience some problems here that this needs changing?
Maybe here?
https://community.ipfire.org/t/override-disable-dnssec-system/2717
Inasmuch as the need for this is likely to be rare and potentially at least slightly harmful to normal users, perhaps it would be sufficient to suggest in the documentation that people who need it simply add their preferred unknown-server-time-limit setting to a file in /etc/unbound/local.d?
It would be an easy way to test it, too.
Tapani
On Thu, Jan 07, 2021 at 11:27:43AM +0000, Michael Tremer (michael.tremer@ipfire.org) wrote:
Hello Jon,
Yes, that could be true.
Can someone reach out to that user and see if they can apply the change and confirm that this works?
-Michael
On 6 Jan 2021, at 18:59, Jon Murphy jcmurphy26@gmail.com wrote:
On Jan 6, 2021, at 12:01 PM, Michael Tremer michael.tremer@ipfire.org wrote:
Did anyone actually experience some problems here that this needs changing?
Maybe here?
https://community.ipfire.org/t/override-disable-dnssec-system/2717
Hello,
Yes, that would be the easiest way to test this.
But in general I do not recommend having local changes like this permanently, because they might break things.
-Michael
On 7 Jan 2021, at 14:35, Tapani Tarvainen ipfire@tapanitarvainen.fi wrote:
Inasmuch as the need for this is likely to be rare and potentially at least slightly harmful to normal users, perhaps it would be sufficient to suggest in the documentation that people who need it simply add their preferred unknown-server-time-limit setting to a file in /etc/unbound/local.d?
It would be an easy way to test it, too.
Tapani
On Thu, Jan 07, 2021 at 11:27:43AM +0000, Michael Tremer (michael.tremer@ipfire.org) wrote:
Hello Jon,
Yes, that could be true.
Can someone reach out to that user and see if they can apply the change and confirm that this works?
-Michael
On 6 Jan 2021, at 18:59, Jon Murphy jcmurphy26@gmail.com wrote:
On Jan 6, 2021, at 12:01 PM, Michael Tremer michael.tremer@ipfire.org wrote:
Did anyone actually experience some problems here that this needs changing?
Maybe here?
https://community.ipfire.org/t/override-disable-dnssec-system/2717
Hi,
I will try to provide some explanations for the questions.
Am 06.01.2021 um 19:01 schrieb Michael Tremer michael.tremer@ipfire.org:
Hello,
On 6 Jan 2021, at 16:19, Tapani Tarvainen ipfire@tapanitarvainen.fi wrote:
On Wed, Jan 06, 2021 at 03:14:52PM +0000, Michael Tremer (michael.tremer@ipfire.org) wrote:
On 6 Jan 2021, at 12:02, Paul Simmons mbatranch@gmail.com wrote:
On 1/6/21 4:17 AM, Jonatan Schlag wrote:
When unbound has no information about a DNS server, a timeout of 376 msec is assumed. This works well in a lot of situations, but they mention in their documentation that this could be way too low. They recommend a timeout of 1126 msec for satellite connections (https://nlnetlabs.nl/documentation/unbound/unbound.conf).
A small nit, they actually suggest 1128 ... and that's indeed what the patch has:
+ unknown-server-time-limit: 1128
But that's trivial. The point:
I am not entirely sure what this is supposed to fix.
It is possible that a DNS response takes longer than 376ms, indeed. Does it harm us if we send another packet? No.
If you are behind a slow satellite link, it can take more than that *every time*.
This should actually not be the case. There is no fixed timeout which can be set in unbound. They do something much more sophisticated here:
https://nlnetlabs.nl/documentation/unbound/info-timeout/
If I understand this document correctly, they keep something like a rolling mean. So if everybody executed 'unbound-control dump_infra', we would all get different timeout limits for every server and every site. The actual calculation seems to be much more complex (or their explanation of simple things is very complex, without any formulas); this is only a simple explanation, which seems necessary for my next paragraph.
So the question is: when we have no information about a server (for example right after startup of unbound, or when the entry in the infra cache has expired (time limit 15 min)), which timeout should we assume? We currently assume a timeout of 376 msec. They state in their documentation that on slow links 1128 msec is more suitable.
When we have information about a server (i.e. the RTT of previous requests), this value should not matter, if I get this right.
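To make the idea concrete (my own simplification, not their exact formula): such estimators usually behave like TCP's retransmission timer from RFC 6298, roughly

    srtt    = 7/8 * srtt + 1/8 * rtt
    rttvar  = 3/4 * rttvar + 1/4 * |srtt - rtt|
    timeout = srtt + 4 * rttvar

So after a few answers from a server that is 600 msec away, the computed timeout settles well above 600 msec, and the 376 (or 1128) msec starting value only matters for the very first contacts.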
So you would always have sent another query before getting a response to the previous one.
True, but aren't these extraordinary circumstances?
On a regular network we want to keep eyeballs happy, and when packets get lost or get sent to a slow server, we want to try again, sooner rather than later.
If we set this to a worst-case value (let's say 10 seconds), then DNS resolution will become slower even for average users.
With TCP that would mean never getting a response, because you'd always terminate the connection too soon. With UDP, I'm not sure; it depends on how unbound handles incoming responses to queries it's already deemed lost and sent again. Adjusting delay-close might help. But it may be that it would not work at all when the limit is too small.
That would mean that someone installing IPFire in some remote location with a slow link would conclude that it just doesn't work.
The downside of increasing the limit is that sometimes replies will take longer when a packet is lost on the way because we'd wait longer before re-sending. So it should not be increased too much either.
This should only happen in the initial period, while our own rolling mean is not yet adjusted to the conditions of this site.
I don't have data to judge what the limit should be, but I'd tend to trust NLnet Labs' recommendation here and go with the suggested 1128 ms.
Did anyone actually experience some problems here that this needs changing?
@Jonatan: What is your motivation for this patch?
Just opening the discussion. It seems that their handling of timeouts and the infra cache could have caused a lot of problems for some users, so I thought about bringing this up. Maybe it is a good idea that people like Paul test this before we think further about how this could be implemented. Also, adding this to the wiki as a tweak that might improve DNS resolution could be a solution. But people should first check the current infra cache, as these values would determine whether this setting would help.
I hope I could make some things a little bit clearer.
Greetings Jonatan
-- Tapani Tarvainen
Hi,
In that case, I do not think that this change realistically changes anything for anyone.
In Paul's case, where the name servers are further away than the timeout, he would send another packet, but then receive the first reply (disregarding any actual packet loss here), and after that unbound will have learned that the name server is further away.
He would have sent one extra packet. Potentially re-probing will cause the same effect, but usually unbound should be busy enough to have a rolling mean that is up to date at any time.
Therefore this only matters in recursor mode where there are many servers being contacted instead of only a few forwarders. Again, there would be more overhead here, but there should not be any effect where names cannot be resolved.
We can now increase the timeout, which will cause slower resolution for many users that are running in recursor mode, or we can just leave it and nothing would change.
-Michael
On 8 Jan 2021, at 17:33, Jonatan Schlag jonatan.schlag@ipfire.org wrote:
Hi,
I will try to provide some explanations for the questions.
Am 06.01.2021 um 19:01 schrieb Michael Tremer michael.tremer@ipfire.org:
Hello,
On 6 Jan 2021, at 16:19, Tapani Tarvainen ipfire@tapanitarvainen.fi wrote:
On Wed, Jan 06, 2021 at 03:14:52PM +0000, Michael Tremer (michael.tremer@ipfire.org) wrote:
On 6 Jan 2021, at 12:02, Paul Simmons mbatranch@gmail.com wrote:
On 1/6/21 4:17 AM, Jonatan Schlag wrote:
When unbound has no information about a DNS server, a timeout of 376 msec is assumed. This works well in a lot of situations, but they mention in their documentation that this could be way too low. They recommend a timeout of 1126 msec for satellite connections (https://nlnetlabs.nl/documentation/unbound/unbound.conf).
A small nit, they actually suggest 1128 ... and that's indeed what the patch has:
+ unknown-server-time-limit: 1128
But that's trivial. The point:
I am not entirely sure what this is supposed to fix.
It is possible that a DNS response takes longer than 376ms, indeed. Does it harm us if we send another packet? No.
If you are behind a slow satellite link, it can take more than that *every time*.
This should actually not be the case. There is no fixed timeout which can be set in unbound. They do something much more sophisticated here:
https://nlnetlabs.nl/documentation/unbound/info-timeout/
If I understand this document correctly, they keep something like a rolling mean. So if everybody executed 'unbound-control dump_infra', we would all get different timeout limits for every server and every site. The actual calculation seems to be much more complex (or their explanation of simple things is very complex, without any formulas); this is only a simple explanation, which seems necessary for my next paragraph.
So the question is: when we have no information about a server (for example right after startup of unbound, or when the entry in the infra cache has expired (time limit 15 min)), which timeout should we assume? We currently assume a timeout of 376 msec. They state in their documentation that on slow links 1128 msec is more suitable.
When we have information about a server (i.e. the RTT of previous requests), this value should not matter, if I get this right.
So you would always have sent another query before getting a response to the previous one.
True, but aren't these extraordinary circumstances?
On a regular network we want to keep eyeballs happy, and when packets get lost or get sent to a slow server, we want to try again, sooner rather than later.
If we set this to a worst-case value (let's say 10 seconds), then DNS resolution will become slower even for average users.
With TCP that would mean never getting a response, because you'd always terminate the connection too soon. With UDP, I'm not sure; it depends on how unbound handles incoming responses to queries it's already deemed lost and sent again. Adjusting delay-close might help. But it may be that it would not work at all when the limit is too small.
That would mean that someone installing IPFire in some remote location with a slow link would conclude that it just doesn't work.
The downside of increasing the limit is that sometimes replies will take longer when a packet is lost on the way because we'd wait longer before re-sending. So it should not be increased too much either.
This should only happen in the initial period, while our own rolling mean is not yet adjusted to the conditions of this site.
I don't have data to judge what the limit should be, but I'd tend to trust NLnet Labs' recommendation here and go with the suggested 1128 ms.
Did anyone actually experience some problems here that this needs changing?
@Jonatan: What is your motivation for this patch?
Just opening the discussion. It seems that their handling of timeouts and the infra cache could have caused a lot of problems for some users, so I thought about bringing this up. Maybe it is a good idea that people like Paul test this before we think further about how this could be implemented. Also, adding this to the wiki as a tweak that might improve DNS resolution could be a solution. But people should first check the current infra cache, as these values would determine whether this setting would help.
I hope I could make some things a little bit clearer.
Greetings Jonatan
-- Tapani Tarvainen
On 1/9/21 9:04 AM, Michael Tremer wrote:
Hi,
In that case, I do not think that this change realistically changes anything for anyone.
In Paul's case, where the name servers are further away than the timeout, he would send another packet, but then receive the first reply (disregarding any actual packet loss here), and after that unbound will have learned that the name server is further away.
He would have sent one extra packet. Potentially re-probing will cause the same effect, but usually unbound should be busy enough to have a rolling mean that is up to date at any time.
Therefore this only matters in recursor mode where there are many servers being contacted instead of only a few forwarders. Again, there would be more overhead here, but there should not be any effect where names cannot be resolved.
We can now increase the timeout, which will cause slower resolution for many users that are running in recursor mode, or we can just leave it and nothing would change.
-Michael
On 8 Jan 2021, at 17:33, Jonatan Schlag jonatan.schlag@ipfire.org wrote:
Hi,
I will try to provide some explanations for the questions.
Am 06.01.2021 um 19:01 schrieb Michael Tremer michael.tremer@ipfire.org:
Hello,
On 6 Jan 2021, at 16:19, Tapani Tarvainen ipfire@tapanitarvainen.fi wrote:
On Wed, Jan 06, 2021 at 03:14:52PM +0000, Michael Tremer (michael.tremer@ipfire.org) wrote:
On 6 Jan 2021, at 12:02, Paul Simmons mbatranch@gmail.com wrote:
On 1/6/21 4:17 AM, Jonatan Schlag wrote:
When unbound has no information about a DNS server, a timeout of 376 msec is assumed. This works well in a lot of situations, but they mention in their documentation that this could be way too low. They recommend a timeout of 1126 msec for satellite connections (https://nlnetlabs.nl/documentation/unbound/unbound.conf).
A small nit, they actually suggest 1128 ... and that's indeed what the patch has:
+ unknown-server-time-limit: 1128
But that's trivial. The point:
I am not entirely sure what this is supposed to fix. It is possible that a DNS response takes longer than 376ms, indeed. Does it harm us if we send another packet? No.
If you are behind a slow satellite link, it can take more than that *every time*.
This should actually not be the case. There is no fixed timeout which can be set in unbound. They do something much more sophisticated here:
https://nlnetlabs.nl/documentation/unbound/info-timeout/
If I understand this document correctly, they keep something like a rolling mean. So if everybody executed 'unbound-control dump_infra', we would all get different timeout limits for every server and every site. The actual calculation seems to be much more complex (or their explanation of simple things is very complex, without any formulas); this is only a simple explanation, which seems necessary for my next paragraph.
So the question is: when we have no information about a server (for example right after startup of unbound, or when the entry in the infra cache has expired (time limit 15 min)), which timeout should we assume? We currently assume a timeout of 376 msec. They state in their documentation that on slow links 1128 msec is more suitable.
When we have information about a server (i.e. the RTT of previous requests), this value should not matter, if I get this right.
So you would always have sent another query before getting a response to the previous one.
True, but aren't these extraordinary circumstances?
On a regular network we want to keep eyeballs happy, and when packets get lost or get sent to a slow server, we want to try again, sooner rather than later.
If we set this to a worst-case value (let's say 10 seconds), then DNS resolution will become slower even for average users.
With TCP that would mean never getting a response, because you'd always terminate the connection too soon. With UDP, I'm not sure; it depends on how unbound handles incoming responses to queries it's already deemed lost and sent again. Adjusting delay-close might help. But it may be that it would not work at all when the limit is too small.
That would mean that someone installing IPFire in some remote location with a slow link would conclude that it just doesn't work.
The downside of increasing the limit is that sometimes replies will take longer when a packet is lost on the way because we'd wait longer before re-sending. So it should not be increased too much either.
This should only happen in the initial period, while our own rolling mean is not yet adjusted to the conditions of this site.
I don't have data to judge what the limit should be, but I'd tend to trust NLnet Labs' recommendation here and go with the suggested 1128 ms.
Did anyone actually experience some problems here that this needs changing?
@Jonatan: What is your motivation for this patch?
Just opening the discussion. It seems that their handling of timeouts and the infra cache could have caused a lot of problems for some users, so I thought about bringing this up. Maybe it is a good idea that people like Paul test this before we think further about how this could be implemented. Also, adding this to the wiki as a tweak that might improve DNS resolution could be a solution. But people should first check the current infra cache, as these values would determine whether this setting would help.
I hope I could make some things a little bit clearer.
Greetings Jonatan
-- Tapani Tarvainen
Greetings, Michael and @list.
I tested the ping (-c1) times for the first 27 IPv4 addresses in the DNS server list from the wiki. I can test more, if desired.
The fastest return was 596ms, and the slowest was 857ms. At present, I'm using 9.9.9.10 (631ms ping) and 81.3.27.54 (752ms ping).
My DNS protocol is "TLS", and QNAME Minimisation is "Standard". Prior to the release with TLS support, I was unable to resolve hosts at all. (Did I mention that I dislike HughesNot? I have no other option for 'net connectivity - boonie life is great for the nerves, but hell on talking to anyone.)
I'm willing to test Tapani's "/etc/unbound/local.d" proposal(s), if it will clarify the situation. Also, I'm prepared to backup and edit any other files that might assist testing.
I've noticed (from NTP logs) that name resolution usually stalls/fails after ~3 hours when my LAN is quiet. Could changes to cache timeout settings be beneficial?
Please advise...
Thank you (and, GREAT EFFORT, ALL!),
Paul
On Sat, Jan 09, 2021 at 12:57:44PM -0600, Paul Simmons (mbatranch@gmail.com) wrote:
I tested the ping (-c1) times for the first 27 IPv4 addresses in the DNS server list from the wiki. I can test more, if desired.
The fastest return was 596ms, and the slowest was 857ms. At present, I'm using 9.9.9.10 (631ms ping) and 81.3.27.54 (752ms ping).
Wow. That *is* slow.
I'm willing to test Tapani's "/etc/unbound/local.d" proposal(s), if it will clarify the situation.
I think it would be very useful if you could test if changing the limits actually helps in your situation.
It's easy enough to do: e.g.,
echo 'unknown-server-time-limit: 1128' >/etc/unbound/local.d/timeouts
and restart unbound and see if it makes a difference for you.
You might also try if non-TLS settings (TCP or UDP) work after that.
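If unbound-control is enabled (a sketch; I haven't checked the exact restart command on an IPFire box), the effect can be inspected directly:

    /etc/init.d/unbound restart
    unbound-control dump_infra    # learned per-server rtt/timeout values

Comparing the dump_infra output before and after the change should show whether the higher starting value is actually used until real measurements replace it.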
On 1/10/21 8:07 AM, Tapani Tarvainen wrote:
On Sat, Jan 09, 2021 at 12:57:44PM -0600, Paul Simmons (mbatranch@gmail.com) wrote:
I tested the ping (-c1) times for the first 27 IPv4 addresses in the DNS server list from the wiki. I can test more, if desired.
The fastest return was 596ms, and the slowest was 857ms. At present, I'm using 9.9.9.10 (631ms ping) and 81.3.27.54 (752ms ping).
Wow. That *is* slow.
I'm willing to test Tapani's "/etc/unbound/local.d" proposal(s), if it will clarify the situation.
I think it would be very useful if you could test if changing the limits actually helps in your situation.
It's easy enough to do: e.g.,
echo 'unknown-server-time-limit: 1128' >/etc/unbound/local.d/timeouts
and restart unbound and see if it makes a difference for you.
You might also try if non-TLS settings (TCP or UDP) work after that.
Hello, I have some results.
The /etc/unbound/local.d/timeouts file (plus an unbound restart) did not completely resolve NTP-related lookup failures. It "seemed" to prevent complete failure, but the first of two lookups, to different pool aliases, did fail.
I retained the "timeouts" and changed from TLS to TCP, and haven't seen any lookup failures.
Tomorrow, I will experiment using "timeouts" and UDP. After a day or so, I'll try removing the "timeouts" and repeat the TCP and UDP tests.
Thank you!
p.
On 1/11/21 11:07 PM, Paul Simmons wrote:
On 1/10/21 8:07 AM, Tapani Tarvainen wrote:
On Sat, Jan 09, 2021 at 12:57:44PM -0600, Paul Simmons (mbatranch@gmail.com) wrote:
I tested the ping (-c1) times for the first 27 IPv4 addresses in the DNS server list from the wiki. I can test more, if desired.
The fastest return was 596ms, and the slowest was 857ms. At present, I'm using 9.9.9.10 (631ms ping) and 81.3.27.54 (752ms ping).
Wow. That *is* slow.
I'm willing to test Tapani's "/etc/unbound/local.d" proposal(s), if it will clarify the situation.
I think it would be very useful if you could test if changing the limits actually helps in your situation.
It's easy enough to do: e.g.,
echo 'unknown-server-time-limit: 1128' >/etc/unbound/local.d/timeouts
and restart unbound and see if it makes a difference for you.
You might also try if non-TLS settings (TCP or UDP) work after that.
Hello, I have some results.
The /etc/unbound/local.d/timeouts file (plus an unbound restart) did not completely resolve NTP-related lookup failures. It "seemed" to prevent complete failure, but the first of two lookups, to different pool aliases, did fail.
I retained the "timeouts" and changed from TLS to TCP, and haven't seen any lookup failures.
Tomorrow, I will experiment using "timeouts" and UDP. After a day or so, I'll try removing the "timeouts" and repeat the TCP and UDP tests.
Thank you!
p.
I've found that UDP doesn't work at all. TCP with "timeout" mod never fails.
Will now test TCP without "timeout" mod.
Paul
On Fri, Jan 15, 2021 at 09:02:08PM -0600, Paul Simmons (mbatranch@gmail.com) wrote:
echo 'unknown-server-time-limit: 1128' >/etc/unbound/local.d/timeouts
I've found that UDP doesn't work at all. TCP with "timeout" mod never fails.
You might also try if UDP works with
delay-close: 1500
instead of or in addition to the unknown-server-time-limit.
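E.g., appending to the same file as before:

    echo 'delay-close: 1500' >> /etc/unbound/local.d/timeouts

(The value is in milliseconds and just a guess on my part: a bit above your worst-case ping, so that late UDP answers still find the port open.)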
On 1/16/21 2:13 AM, Tapani Tarvainen wrote:
On Fri, Jan 15, 2021 at 09:02:08PM -0600, Paul Simmons (mbatranch@gmail.com) wrote:
echo 'unknown-server-time-limit: 1128' >/etc/unbound/local.d/timeouts
I've found that UDP doesn't work at all. TCP with "timeout" mod never fails.
You might also try if UDP works with
delay-close: 1500
instead of or in addition to the unknown-server-time-limit.
Howdy!
I tried UDP with both mods ('unknown-server-time-limit: 1128' && 'delay-close: 1500'). Unfortunately, I experienced intermittent resolution errors.
Am now using TCP... no apparent errors, but resolution is SssLllOooWww, just as before. (total.recursion.time.avg=4.433958 total.recursion.time.median=3.65429 total.num.recursivereplies=1515)
Thank you for your efforts. Latency on "HughesNot" is insurmountable, but (barely) beats no connectivity. I hope to try Starlink, if/when it becomes available for my latitude (30.9 North).
Paul
Hello everyone,
So what does that leave us with?
Should we drop the patch because it does not change anything and the correct solution would be using TCP as underlying protocol?
-Michael
On 19 Jan 2021, at 06:22, Paul Simmons mbatranch@gmail.com wrote:
On 1/16/21 2:13 AM, Tapani Tarvainen wrote:
On Fri, Jan 15, 2021 at 09:02:08PM -0600, Paul Simmons (mbatranch@gmail.com) wrote:
echo 'unknown-server-time-limit: 1128' >/etc/unbound/local.d/timeouts
I've found that UDP doesn't work at all. TCP with "timeout" mod never fails.
You might also try if UDP works with
delay-close: 1500
instead of or in addition to the unknown-server-time-limit.
Howdy!
I tried UDP with both mods ('unknown-server-time-limit: 1128' && 'delay-close: 1500'). Unfortunately, I experienced intermittent resolution errors.
Am now using TCP... no apparent errors, but resolution is SssLllOooWww, just as before. (total.recursion.time.avg=4.433958 total.recursion.time.median=3.65429 total.num.recursivereplies=1515)
Thank you for your efforts. Latency on "HughesNot" is insurmountable, but (barely) beats no connectivity. I hope to try Starlink, if/when it becomes available for my latitude (30.9 North).
Paul
-- It is hard for an empty bag to stand upright. -- Benjamin Franklin, 1757
On 1/25/21 1:23 PM, Michael Tremer wrote:
Hello everyone,
So what does that leave us with?
Should we drop the patch because it does not change anything and the correct solution would be using TCP as underlying protocol?
-Michael
On 19 Jan 2021, at 06:22, Paul Simmons mbatranch@gmail.com wrote:
On 1/16/21 2:13 AM, Tapani Tarvainen wrote:
On Fri, Jan 15, 2021 at 09:02:08PM -0600, Paul Simmons (mbatranch@gmail.com) wrote:
echo 'unknown-server-time-limit: 1128' >/etc/unbound/local.d/timeouts
I've found that UDP doesn't work at all. TCP with "timeout" mod never fails.
You might also try if UDP works with
delay-close: 1500
instead of or in addition to the unknown-server-time-limit.
Howdy!
I tried UDP with both mods ('unknown-server-time-limit: 1128' && 'delay-close: 1500'). Unfortunately, I experienced intermittent resolution errors.
Am now using TCP... no apparent errors, but resolution is SssLllOooWww, just as before. (total.recursion.time.avg=4.433958 total.recursion.time.median=3.65429 total.num.recursivereplies=1515)
Thank you for your efforts. Latency on "HughesNot" is insurmountable, but (barely) beats no connectivity. I hope to try Starlink, if/when it becomes available for my latitude (30.9 North).
Paul
-- It is hard for an empty bag to stand upright. -- Benjamin Franklin, 1757
I haven't studied the metrics from unbound, so can't say if the modified timeouts help to avoid retransmissions.
As of this moment, TCP works, albeit slowly. If you'd rather drop the patch, I'm okay with that.
Thanks for all the effort!
Paul
Hi,
On 25 Jan 2021, at 20:29, Paul Simmons mbatranch@gmail.com wrote:
On 1/25/21 1:23 PM, Michael Tremer wrote:
Hello everyone,
So what does that leave us with?
Should we drop the patch because it does not change anything and the correct solution would be using TCP as underlying protocol?
-Michael
On 19 Jan 2021, at 06:22, Paul Simmons mbatranch@gmail.com wrote:
On 1/16/21 2:13 AM, Tapani Tarvainen wrote:
On Fri, Jan 15, 2021 at 09:02:08PM -0600, Paul Simmons (mbatranch@gmail.com) wrote:
echo 'unknown-server-time-limit: 1128' >/etc/unbound/local.d/timeouts
I've found that UDP doesn't work at all. TCP with "timeout" mod never fails.
You might also try if UDP works with
delay-close: 1500
instead of or in addition to the unknown-server-time-limit.
Howdy!
I tried UDP with both mods ('unknown-server-time-limit: 1128' && 'delay-close: 1500'). Unfortunately, I experienced intermittent resolution errors.
Am now using TCP... no apparent errors, but resolution is SssLllOooWww, just as before. (total.recursion.time.avg=4.433958 total.recursion.time.median=3.65429 total.num.recursivereplies=1515)
Thank you for your efforts. Latency on "HughesNot" is insurmountable, but (barely) beats no connectivity. I hope to try Starlink, if/when it becomes available for my latitude (30.9 North).
Paul
-- It is hard for an empty bag to stand upright. -- Benjamin Franklin, 1757
I haven't studied the metrics from unbound, so can't say if the modified timeouts help to avoid retransmissions.
As of this moment, TCP works, albeit slowly. If you'd rather drop the patch, I'm okay with that.
Yes, TCP should always work and it will be much faster with Core Update 154 since the connections remain open.
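For reference, a hand-written forwarder along those lines would look roughly like this in unbound syntax (a sketch only; IPFire generates this configuration itself, and the addresses and bundle path are just examples):

    server:
        tls-cert-bundle: "/etc/ssl/certs/ca-bundle.crt"

    forward-zone:
        name: "."
        forward-addr: 9.9.9.10@853
        forward-tls-upstream: yes    # TLS over TCP; for plain TCP, set tcp-upstream: yes in the server section instead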
We can always come back to this thread if there is any reason in the future.
Thanks for all the effort!
Thank you very much for your testing, too!
Best, -Michael
Paul
On 9 Jan 2021, at 18:57, Paul Simmons mbatranch@gmail.com wrote:
On 1/9/21 9:04 AM, Michael Tremer wrote:
Hi,
In that case, I do not think that this change realistically changes anything for anyone.
In Paul's case, where the name servers are further away than the timeout, he would send another packet, but then receive the first reply (disregarding any actual packet loss here), and after that unbound will have learned that the name server is further away.
He would have sent one extra packet. Potentially re-probing will cause the same effect, but usually unbound should be busy enough to have a rolling mean that is up to date at any time.
Therefore this only matters in recursor mode where there are many servers being contacted instead of only a few forwarders. Again, there would be more overhead here, but there should not be any effect where names cannot be resolved.
We can now increase the timeout, which will cause slower resolution for many users that are running in recursor mode, or we can just leave it and nothing would change.
-Michael
On 8 Jan 2021, at 17:33, Jonatan Schlag jonatan.schlag@ipfire.org wrote:
Hi,
I will try to provide some explanations for the questions.
Am 06.01.2021 um 19:01 schrieb Michael Tremer michael.tremer@ipfire.org:
Hello,
On 6 Jan 2021, at 16:19, Tapani Tarvainen ipfire@tapanitarvainen.fi wrote:
On Wed, Jan 06, 2021 at 03:14:52PM +0000, Michael Tremer (michael.tremer@ipfire.org) wrote:
On 6 Jan 2021, at 12:02, Paul Simmons mbatranch@gmail.com wrote:
On 1/6/21 4:17 AM, Jonatan Schlag wrote:
When unbound has no information about a DNS server, a timeout of 376 msec is assumed. This works well in a lot of situations, but they mention in their documentation that this could be way too low. They recommend a timeout of 1126 msec for satellite connections (https://nlnetlabs.nl/documentation/unbound/unbound.conf).
A small nit, they actually suggest 1128 ... and that's indeed what the patch has:
+ unknown-server-time-limit: 1128
But that's trivial. The point:
I am not entirely sure what this is supposed to fix. It is possible that a DNS response takes longer than 376ms, indeed. Does it harm us if we send another packet? No.
If you are behind a slow satellite link, it can take more than that *every time*.
This should actually not be the case. There is no fixed timeout which can be set in unbound. They do something much more sophisticated here:
https://nlnetlabs.nl/documentation/unbound/info-timeout/
If I understand this document correctly, they keep something like a rolling mean. So if everybody executed 'unbound-control dump_infra', we would all get different timeout limits for every server and every site. The actual calculation seems to be much more complex (or their explanation of simple things is very complex, without any formulas); this is only a simple explanation, which seems necessary for my next paragraph.
So the question is: when we have no information about a server (for example right after startup of unbound, or when the entry in the infra cache has expired (time limit 15 min)), which timeout should we assume? We currently assume a timeout of 376 msec. They state in their documentation that on slow links 1128 msec is more suitable.
When we have information about a server (i.e. the RTT of previous requests), this value should not matter, if I get this right.
So you would always have sent another query before getting a response to the previous one.
True, but aren't these extraordinary circumstances?
On a regular network we want to keep eyeballs happy, and when packets get lost or get sent to a slow server, we want to try again, sooner rather than later.
If we set this to a worst-case value (let's say 10 seconds), then DNS resolution will become slower even for average users.
With TCP that would mean never getting a response, because you'd always terminate the connection too soon. With UDP, I'm not sure; it depends on how unbound handles incoming responses to queries it's already deemed lost and sent again. Adjusting delay-close might help. But it may be that it would not work at all when the limit is too small.
That would mean that someone installing IPFire in some remote location with a slow link would conclude that it just doesn't work.
The downside of increasing the limit is that sometimes replies will take longer when a packet is lost on the way because we'd wait longer before re-sending. So it should not be increased too much either.
This should only happen in the initial period, while our own rolling mean is not yet adjusted to the conditions of this site.
I don't have data to judge what the limit should be, but I'd tend to trust NLnet Labs' recommendation here and go with the suggested 1128 ms.
Did anyone actually experience some problems here that this needs changing?
@Jonatan: What is your motivation for this patch?
Just opening the discussion. It seems that their handling of timeouts and the infra cache could have caused a lot of problems for some users, so I thought about bringing this up. Maybe it is a good idea that people like Paul test this before we think further about how this could be implemented. Also, adding this to the wiki as a tweak that might improve DNS resolution could be a solution. But people should first check the current infra cache, as these values would determine whether this setting would help.
I hope I could make some things a little bit clearer.
Greetings Jonatan
-- Tapani Tarvainen
Greetings, Michael and @list.
I tested the ping (-c1) times for the first 27 IPv4 addresses in the DNS server list from the wiki. I can test more, if desired.
The fastest return was 596ms, and the slowest was 857ms. At present, I'm using 9.9.9.10 (631ms ping) and 81.3.27.54 (752ms ping).
My DNS protocol is "TLS", and QNAME Minimisation is "Standard". Prior to the release with TLS support, I was unable to resolve hosts at all. (Did I mention that I dislike HughesNot? I have no other option for 'net connectivity - boonie life is great for the nerves, but hell on talking to anyone.)
The good thing, though, is that we have a good test-bed for this kind of connection :)
I know of some more people who use a satellite connection, but they are not very keen on testing things with it.
I'm willing to test Tapani's "/etc/unbound/local.d" proposal(s), if it will clarify the situation. Also, I'm prepared to backup and edit any other files that might assist testing.
I've noticed (from NTP logs) that name resolution usually stalls/fails after ~3 hours when my LAN is quiet. Could changes to cache timeout settings be beneficial?
Please advise...
Thank you (and, GREAT EFFORT, ALL!),
Paul
-- It is better to have loved a short man than never to have loved a tall.
On 1/11/21 5:10 AM, Michael Tremer wrote:
On 9 Jan 2021, at 18:57, Paul Simmons mbatranch@gmail.com wrote:
On 1/9/21 9:04 AM, Michael Tremer wrote:
Hi,
In that case, I do not think that this change realistically changes anything for anyone.
In Paul's case, where the name servers are further away than the timeout, he would send another packet, but then receive the first reply (disregarding any actual packet loss here), and after that unbound will have learned that the name server is further away.
He would have sent one extra packet. Potentially re-probing will cause the same effect, but usually unbound should be busy enough to have a rolling mean that is up to date at any time.
Therefore this only matters in recursor mode where there are many servers being contacted instead of only a few forwarders. Again, there would be more overhead here, but there should not be any effect where names cannot be resolved.
We can now increase the timeout, which will cause slower resolution for many users that are running in recursor mode, or we can just leave it and nothing would change.
-Michael
On 8 Jan 2021, at 17:33, Jonatan Schlag jonatan.schlag@ipfire.org wrote:
Hi,
I will try to provide some explanations for the questions.
Am 06.01.2021 um 19:01 schrieb Michael Tremer michael.tremer@ipfire.org:
Hello,
On 6 Jan 2021, at 16:19, Tapani Tarvainen ipfire@tapanitarvainen.fi wrote:
On Wed, Jan 06, 2021 at 03:14:52PM +0000, Michael Tremer (michael.tremer@ipfire.org) wrote:
On 6 Jan 2021, at 12:02, Paul Simmons mbatranch@gmail.com wrote:
On 1/6/21 4:17 AM, Jonatan Schlag wrote:
When unbound has no information about a DNS server, a timeout of 376 msec is assumed. This works well in a lot of situations, but they mention in their documentation that this could be way too low. They recommend a timeout of 1126 msec for satellite connections (https://nlnetlabs.nl/documentation/unbound/unbound.conf).
A small nit, they actually suggest 1128 ... and that's indeed what the patch has:
+ unknown-server-time-limit: 1128
But that's trivial. The point:
I am not entirely sure what this is supposed to fix. It is possible that a DNS response takes longer than 376ms, indeed. Does it harm us if we send another packet? No.
If you are behind a slow satellite link, it can take more than that *every time*.
This should actually not be the case. There is no fixed timeout which can be set in unbound. They do something much more sophisticated here:
https://nlnetlabs.nl/documentation/unbound/info-timeout/
If I understand this document correctly, they keep something like a rolling mean. So if everybody executed 'unbound-control dump_infra', we would all get different timeout limits for every server and every site. The actual calculation seems to be much more complex (or their explanation of simple things is very complex, without any formulas); this is only a simple explanation, which seems necessary for my next paragraph.
So the question is: when we have no information about a server (for example right after startup of unbound, or when the entry in the infra cache has expired (time limit 15 min)), which timeout should we assume? We currently assume a timeout of 376 msec. They state in their documentation that on slow links 1128 msec is more suitable.
When we have information about a server (i.e. the RTT of previous requests), this value should not matter, if I get this right.
So you would always have sent another query before getting a response to the previous one.
True, but aren't these extraordinary circumstances?
On a regular network we want to keep eyeballs happy, and when packets get lost or get sent to a slow server, we want to try again, sooner rather than later.
If we set this to a worst-case value (let's say 10 seconds), then DNS resolution will become slower even for average users.
With TCP that would mean never getting a response, because you'd always terminate the connection too soon. With UDP, I'm not sure; it depends on how unbound handles incoming responses to queries it's already deemed lost and sent again. Adjusting delay-close might help. But it may be that it would not work at all when the limit is too small.
That would mean that someone installing IPFire in some remote location with a slow link would conclude that it just doesn't work.
The downside of increasing the limit is that sometimes replies will take longer when a packet is lost on the way because we'd wait longer before re-sending. So it should not be increased too much either.
This should only happen in the initial period, while our own rolling mean is not yet adjusted to the conditions of this site.
I don't have data to judge what the limit should be, but I'd tend to trust NLnet Labs' recommendation here and go with the suggested 1128 ms.
Did anyone actually experience some problems here that this needs changing?
@Jonatan: What is your motivation for this patch?
Just opening the discussion. It seems that their handling of timeouts and the infra cache could have caused a lot of problems for some users, so I thought about bringing this up. Maybe it is a good idea that people like Paul test this before we think further about how this could be implemented. Also, adding this to the wiki as a tweak that might improve DNS resolution could be a solution. But people should first check the current infra cache, as these values would determine whether this setting would help.
I hope I could make some things a little bit clearer.
Greetings Jonatan
-- Tapani Tarvainen
Greetings, Michael and @list.
I tested the ping (-c1) times for the first 27 IPv4 addresses in the DNS server list from the wiki. I can test more, if desired.
The fastest return was 596ms, and the slowest was 857ms. At present, I'm using 9.9.9.10 (631ms ping) and 81.3.27.54 (752ms ping).
My DNS protocol is "TLS", and QNAME Minimisation is "Standard". Prior to the release with TLS support, I was unable to resolve hosts at all. (Did I mention that I dislike HughesNot? I have no other option for 'net connectivity - boonie life is great for the nerves, but hell on talking to anyone.)
The good thing, though, is that we have a good test-bed for this kind of connection :)
I know of some more people who use a satellite connection, but they are not very keen on testing things with it.
I'm willing to test Tapani's "/etc/unbound/local.d" proposal(s), if it will clarify the situation. Also, I'm prepared to backup and edit any other files that might assist testing.
I've noticed (from NTP logs) that name resolution usually stalls/fails after ~3 hours when my LAN is quiet. Could changes to cache timeout settings be beneficial?
Please advise...
Thank you (and, GREAT EFFORT, ALL!),
Paul
-- It is better to have loved a short man than never to have loved a tall.
I'm pleased to be able to help, and grateful for the attention and assistance. See my next msg for testing update.
p.
On 1/6/21 9:14 AM, Michael Tremer wrote:
Hello,
On 6 Jan 2021, at 12:02, Paul Simmons mbatranch@gmail.com wrote:
On 1/6/21 4:17 AM, Jonatan Schlag wrote:
When unbound has no information about a DNS server, a timeout of 376 msec is assumed. This works well in a lot of situations, but they mention in their documentation that this could be way too low. They recommend a timeout of 1126 msec for satellite connections (https://nlnetlabs.nl/documentation/unbound/unbound.conf). Setting this value to 1126 msec should make the first queries to an unknown server more useful: they do not time out, and so they do not need to be sent again.
On a stable link, this behaviour should not have negative implications. As the first query results arrive, the timeout value gets updated, and the high value of 1126 msec gets replaced with something more useful.
Signed-off-by: Jonatan Schlag jonatan.schlag@ipfire.org
config/unbound/unbound.conf | 1 +
1 file changed, 1 insertion(+)
diff --git a/config/unbound/unbound.conf b/config/unbound/unbound.conf
index f78aaae8c..02f093015 100644
--- a/config/unbound/unbound.conf
+++ b/config/unbound/unbound.conf
@@ -62,6 +62,7 @@ server:
 	# Timeout behaviour
 	infra-keep-probing: yes
+	unknown-server-time-limit: 1128
 	# Bootstrap root servers
 	root-hints: "/etc/unbound/root.hints"
I am not entirely sure what this is supposed to fix.
It is possible that a DNS response takes longer than 376ms, indeed. Does it harm us if we send another packet? No.
So what is this changing in real life?
This sounds promising to me, as I have many DNS lookup timeouts (ISP is HughesNot, er, HughesNet).
@Paul: I am not sure if the solution is to increase timeouts. From my point of view, you should change the name servers.
+1
Paul
Greetings, Michael. The two DNS servers I use have ping times of 631ms (addr 9.9.9.10) and 742ms (addr 81.3.27.54).
I tested the ping times of the first 27 IPv4 addresses of servers listed in the wiki.
The times ranged from 596ms to 857ms, so I question if changing servers will afford any measurable relief.
Thank you,
Paul