Hello Peter,
On 28 Oct 2022, at 21:29, Peter Müller <peter.mueller@ipfire.org> wrote:
Hello Michael,
above all, thank you very much for the patchset and all the work behind it.
Unfortunately, as briefly discussed via the phone already, I have some general concerns regarding geofeeds:
(a) In contrast to RIRs, I do not see geofeed providers as a trustworthy source. While the former are not trustworthy in terms of the data they provide (since no vetting or QA of database changes is usually conducted, and it does not look to me like this is going to change soon), at least their infrastructure is: It seems reasonable to me to trust, for example, RIPE's FTP server to serve the same database files regardless of the client requesting them. For some of them, we could even verify that through file signature validation, assuming that it is too costly to do live GPG-signing at scale.
Geofeed URLs, in contrast, can lead to anywhere, and I would not be surprised at all to see dubious ISPs serving different geofeeds to different clients. Given that our IP address ranges are public and static, and libloc reveals itself through the User-Agent HTTP header, it would be quite easy to serve us a geofeed that tampers with data, while playing innocent to other clients.
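To illustrate the concern, here is a minimal Python sketch of how one might check whether a geofeed URL serves identical content to different User-Agent strings. The function names and the idea of comparing SHA-256 digests are assumptions for illustration only; nothing like this is part of the patchset. Note that it cannot detect tampering keyed on the source IP address, which would need requests from separate networks.

```python
import hashlib
import urllib.request


def fetch(url, user_agent, timeout=10):
    """Download url while presenting the given User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def served_consistently(url, user_agents, fetcher=fetch):
    """Return True if every User-Agent receives byte-identical content.

    fetcher is injectable so the comparison can be exercised offline.
    """
    digests = {
        hashlib.sha256(fetcher(url, agent)).hexdigest()
        for agent in user_agents
    }
    return len(digests) == 1
```

A feed that is regenerated on every request (timestamps, shuffled order) would also fail this check, so a mismatch is only a hint, not proof of tampering.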
In addition, many of the 215 geofeed URLs that are currently live (attached) point to services such as Google Docs or GitHub - neither strikes me as a reliable source in terms of persistence. Generally, we have the full problem of URL/domain rot again. :-(
One could argue that these points (to a certain extent) hold true for RIRs as well. However, if we cannot trust them, it's curtains for libloc either way. :-) Some random ISPs trying to get us to consume geolocation data from random URLs, on the other hand, poses a greater risk than benefit to the quality of the location database.
I see your point, but I disagree.
The RIR databases are self-assessment, too. People can put whatever they want in there and it is not being checked by anyone.
The only thing that you might have in favour of your argument is that the RIR databases have a better paper trail of changes than the geofeeds. Those can be changed at any time - even randomly generated. But I believe that in both cases we have no chance to verify any data.
Malicious players will fake their location even in the RIR databases.
What I would suggest as a minimum is to select at least a couple of “trusted” or very large sources that we maintain manually. There are a couple of cloud providers which use geofeeds and we would quite likely improve the quality of the data for them.
Which brings me directly to the next point...
(b) Presuming we still agree on not being more precise than /24 or /48, all the information geofeeds provide could (should?) have been in the RIR databases as well.
The only exception is ARIN, but since we do not get their raw database, we won't be able to consume any geofeed URLs in it. So, for the area where we lack accuracy of geolocation information most, geofeeds won't help us. And for all the other RIRs (LACNIC included, for which we process an additional geolocation database feed already), the geofeeds ideally should not contain any new information to us.
Why should we not process anything smaller than those prefixes? It wouldn’t hurt us at all.
Earlier today, I created a location database text dump on location02 with and without the geofeed patchset applied. The diff can be retrieved from https://people.ipfire.org/~pmueller/location-database-geofeed-diff.tar.gz, and is rather massive, partly because CIDRs smaller (i.e. more specific) than /24 for IPv4 or /48 for IPv6 are yet to be ignored by the geofeed processing routines.
I have yet to assess the diff closely, but from a superficial analysis, it appears that geofeeds introduce a lot of changes that could have been in the respective RIR databases as well. The fact that they are not there does not inspire confidence.
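For reference, ignoring prefixes more specific than /24 or /48 could look roughly like the following Python sketch. It assumes the RFC 8805 CSV layout (prefix, country code, region, city, postal code); the names `MAX_PREFIX_LENGTH`, `parse_geofeed` and `is_coarse_enough` are hypothetical and not taken from the patchset.

```python
import ipaddress

# Most specific prefix length to accept per address family
# (the /24 and /48 limits discussed in this thread).
MAX_PREFIX_LENGTH = {4: 24, 6: 48}


def parse_geofeed(lines):
    """Yield (network, country) pairs from RFC 8805-style CSV lines."""
    for line in lines:
        line = line.strip()
        # Skip blank lines and comments
        if not line or line.startswith("#"):
            continue
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 2:
            continue
        try:
            network = ipaddress.ip_network(fields[0], strict=False)
        except ValueError:
            # Not a valid prefix; ignore the entry
            continue
        yield network, fields[1].upper()


def is_coarse_enough(network):
    """Return True if the prefix is no more specific than the limit."""
    return network.prefixlen <= MAX_PREFIX_LENGTH[network.version]


feed = [
    "# example geofeed",
    "192.0.2.0/24,DE,,,",
    "192.0.2.128/25,FR,,,",
    "2001:db8::/32,NL,,,",
    "2001:db8::/64,US,,,",
]
kept = [(str(net), cc) for net, cc in parse_geofeed(feed) if is_coarse_enough(net)]
```

With the sample feed above, only the /24 and the /32 entries survive; the /25 and /64 are dropped.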
Apologies for this rather disappointing feedback, and best regards, Peter Müller
<20221028_live_geofeeds.txt>
Well, I don’t think this is disappointing. Technically I suspect that you are happy with the code.
We now just need to figure out where to use it and where to not use it.
Best, -Michael