From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michael Tremer To: location@lists.ipfire.org Subject: Re: [PATCH] location-importer.in: Import (technical) AS names from ARIN Date: Tue, 08 Jun 2021 16:15:56 +0100 Message-ID: <6369E1EA-BBD2-4765-A9DA-5405545F01A6@ipfire.org> In-Reply-To: <4b3a87ec-c160-2f02-bbab-81163b5ed2f8@ipfire.org> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="===============5110279672200974525==" List-Id: --===============5110279672200974525== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Hello, > On 8 Jun 2021, at 16:10, Peter M=C3=BCller wro= te: >=20 > Hello Michael, >=20 > thanks for your reply. >=20 >> Hello, >>=20 >> Good code. I am sure it works. >=20 > Well, I tested it on location02, and would be a bit puzzled if it does not = work somewhere else... >=20 Exactly what I expected :) >> I just want to use this a little bit of a Python exercise because I am qui= te particular about my Python :) >>=20 >>> On 8 Jun 2021, at 13:10, Peter M=C3=BCller w= rote: >>>=20 >>> ARIN and LACNIC, unfortunately, do not seem to publish data containing >>> human readable AS names. For the former, we at least have a list of >>> tecnical names, which this patch fetches and inserts into the autnums >>> table. >>>=20 >>> While some of them do not seem to be suitable for human consumption (i. >>> e. being very cryptic), providing these data might be helpful >>> neverthelesss. >>>=20 >>> Signed-off-by: Peter M=C3=BCller >>> --- >>> src/python/location-importer.in | 61 +++++++++++++++++++++++++++++++++ >>> 1 file changed, 61 insertions(+) >>>=20 >>> diff --git a/src/python/location-importer.in b/src/python/location-import= er.in >>> index aa3b8f7..2a9bf33 100644 >>> --- a/src/python/location-importer.in >>> +++ b/src/python/location-importer.in >>> @@ -505,6 +505,9 @@ class CLI(object): >>> for line in f: >>> self._parse_line(line, source_key, validcountries) >>>=20 >>> + # Download and import (technical) AS names from ARIN >>> + self._import_as_names_from_arin() >>> + >>> def _check_parsed_network(self, network): >>> """ >>> Assistive function to detect and subsequently sort out parsed >>> @@ -775,6 +778,64 @@ class CLI(object): >>> "%s" % network, country, [country], source_key, >>> ) >>>=20 >>> + def _import_as_names_from_arin(self): >>> + downloader =3D location.importer.Downloader() >>> + >>> + # XXX: Download AS names file from ARIN (note that these names appear = to be quite >>> + # technical, not intended for human consumption, as description fields= in >>> + # organisation handles for other RIRs are - however, this is what we h= ave got, >>> + # and in some cases, it might be still better than nothing) >>> + try: >>> + with downloader.request("https://ftp.arin.net/info/asn.txt", return_b= locks=3DFalse) as f: >>> + arin_as_names_file =3D f.body >>=20 >> You copy the entire content into memory. That is something you can do, but= the whole point of the download handler was so that a large file can be read= line by line=E2=80=A6 In this case memory is probably not a concern, but gen= erally I do not like aliasing something. >=20 > ACK. >=20 >>=20 >>> + except Exception as e: >>> + log.error("failed to download and preprocess AS name file from ARIN: = %s" % e) >>> + return >>=20 >> Do we just want to continue if this fails? >=20 > What for? If we cannot download the asn.txt file from ARIN, there is nothin= g left to do in this function. True, but should we not raise an error instead of silently logging this? >>=20 >>> + >>> + # Split downloaded body into lines and parse each of them... >>> + for sline in arin_as_names_file.readlines(): >>=20 >> Iterating over f (above) would make the readlines() call unnecessary. It c= reates a list with the lines of the file. If memory was a concern, this is qu= ite inefficient. Reading one line at a time would be better. >=20 > ACK. >=20 >>=20 >>> + >>> + # ... valid lines start with a space, followed by the number of the A= utonomous System ... >>> + if not sline.startswith(b" "): >>> + continue >>=20 >> I usually convert bytes into strings straight away (unless I can perform s= ome cheap checks before). >=20 > This is just a personal habit of mine: Unless I can reasonably assume somet= hing binary looks like valid input, > I never peek too much into it. However, str() should be safe to use in Pyth= on, so I guess your approach is better. Hmm, I suppose we can call Python memory-safe. Since it is using glibc=E2=80= =99s string functions underneath, it would have less overhead and take advant= age of AVX, SSE, etc. and therefore have less overhead to only call it once p= er line. Even better would be to only call it once per file, but then you might run in= to memory allocation issues again. >=20 >>=20 >>> + >>> + # Split line and check if there is a valid ASN in it... >>> + scontents =3D sline.split() >>> + try: >>> + asn =3D int(scontents[0]) >>> + except ValueError: >>> + log.debug("Skipping ARIN AS names line not containing an integer for= ASN") >>> + continue >>=20 >> Here is where it is getting interesting. You are using lots and lots of li= nes to do something very simple: Split a string in two parts. >>=20 >> You can either do this: >>=20 >> asn, space, name =3D line.partition(=E2=80=9C =E2=80=9C) >>=20 >> Or you can use split with a maximum: >>=20 >> asn, name =3D line.split(=E2=80=9C =E2=80=9C, 1) >>=20 >> That avoids the whole aliasing thing with scontent[0] later. The variable = has a good name straight away and you will never need to change any array ind= ices, because that is messy and a waste of time. >=20 > ACK. >=20 >>=20 >> I also like brief variable names. as_name should be name because context m= akes it clear that it is the name of the AS, doesn=E2=80=99t it? >=20 > ACK. >=20 >>=20 >>> + >>> + if not ((1 <=3D asn and asn <=3D 23455) or (23457 <=3D asn and asn <= =3D 64495) or (131072 <=3D asn and asn <=3D 4199999999)): >>> + log.debug("Skipping ARIN AS names line not containing a valid ASN: %= s" % asn) >>> + continue >>=20 >> I believe we are using this in other places, so it could become a function= that can be reused. >=20 > Yes, we have that for updating the BGP feed, but placed the checks inside a= SQL statement there, so I did not came up > with a good idea on how to recycle that. Is it fine to you if I just leave = this as it is for the time being? I am okay with this. >=20 >>=20 >>> + >>> + # Skip any AS name that appears to be a placeholder for a different R= IR or entity... >>> + as_name =3D scontents[1].decode("ascii") >>=20 >> This could fail and you would throw an exception. >=20 > Yes, that slipped my mind... >=20 >>=20 >>> + if re.match(r"^(ASN-BLK|)(AFCONC|AFRINIC|APNIC|ASNBLK|DNIC|LACNIC|RIP= E|IANA)\d{0,1}-*", as_name): >>> + continue >>=20 >> This is a good solution. To make the regular expression faster, you could = end after the \d+. You are checking if there is no or many dashes after it. I= am sure if that is what you intended or if that is an attempt to use globbin= g. >=20 > Not exactly. The hyphen after one or no digit is crucial. But it could be s= omething like \d? (PCRE thinking)... You are checking for none to many. So it technically does not matter what com= es after \d+ You probably want this: ^(ASN-BLK|)(AFCONC|AFRINIC|APNIC|ASNBLK|DNIC|LACNIC|RIPE|IANA)\d\-+ >=20 >>=20 >>> + # Bail out in case the AS name contains anything we do not expect her= e... >>> + if re.search(r"[^a-zA-Z0-9-_]", as_name): >>> + log.debug("Skipping ARIN AS name for %s containing invalid character= s: %s" % \ >>> + (asn, as_name)) >>=20 >> Hmm. What are you trying to do here? Find non-printable characters? What a= bout special characters like =C3=A9, =C3=A4, =C3=B1 and so on? >=20 > They should not appear in this file. AS names consist of ASCII letters, dig= its, and hyphens only. Okay. Let=E2=80=99s hope that people will follow this. >>=20 >>> + >>> + # Things look good here, run INSERT statement and skip this one if we= already have >>> + # a (better?) name for this Autonomous System... >>> + self.db.execute(""" >>> + INSERT INTO autnums( >>> + number, >>> + name, >>> + source >>> + ) VALUES (%s, %s, %s) >>> + ON CONFLICT (number) DO NOTHING""", >>> + asn, >>> + as_name, >>> + "ARIN", >>> + ) >>> + >>> def handle_update_announcements(self, ns): >>> server =3D ns.server[0] >>>=20 >>> --=20 >>> 2.20.1 >>>=20 >>=20 >> This is just my attempt to make you more aware about some things in the co= de that might not be very efficient (when it might not matter that much), and= what my personal =E2=80=9Cstyle=E2=80=9D is. Others might disagree, but in g= eneral my code performs very well and is very readable. I hope :) >=20 > Thank you. I will hand in a second version of this patch... Up to you if you want to :) -Michael >=20 > Thanks, and best regards, > Peter M=C3=BCller --===============5110279672200974525==--