From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michael Tremer To: location@lists.ipfire.org Subject: Re: [PATCH] location-importer.in: Import (technical) AS names from ARIN Date: Tue, 08 Jun 2021 15:40:06 +0100 Message-ID: <534008A1-0A82-4ACF-A82D-51513E1DB7C7@ipfire.org> In-Reply-To: <20210608121036.16242-1-peter.mueller@ipfire.org> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="===============7327112351318348626==" List-Id: --===============7327112351318348626== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Hello, Good code. I am sure it works. I just want to use this a little bit of a Python exercise because I am quite = particular about my Python :) > On 8 Jun 2021, at 13:10, Peter M=C3=BCller wro= te: >=20 > ARIN and LACNIC, unfortunately, do not seem to publish data containing > human readable AS names. For the former, we at least have a list of > tecnical names, which this patch fetches and inserts into the autnums > table. >=20 > While some of them do not seem to be suitable for human consumption (i. > e. being very cryptic), providing these data might be helpful > neverthelesss. >=20 > Signed-off-by: Peter M=C3=BCller > --- > src/python/location-importer.in | 61 +++++++++++++++++++++++++++++++++ > 1 file changed, 61 insertions(+) >=20 > diff --git a/src/python/location-importer.in b/src/python/location-importer= .in > index aa3b8f7..2a9bf33 100644 > --- a/src/python/location-importer.in > +++ b/src/python/location-importer.in > @@ -505,6 +505,9 @@ class CLI(object): > for line in f: > self._parse_line(line, source_key, validcountries) >=20 > + # Download and import (technical) AS names from ARIN > + self._import_as_names_from_arin() > + > def _check_parsed_network(self, network): > """ > Assistive function to detect and subsequently sort out parsed > @@ -775,6 +778,64 @@ class CLI(object): > "%s" % network, country, [country], source_key, > ) >=20 > + def _import_as_names_from_arin(self): > + downloader =3D location.importer.Downloader() > + > + # XXX: Download AS names file from ARIN (note that these names appear to= be quite > + # technical, not intended for human consumption, as description fields in > + # organisation handles for other RIRs are - however, this is what we hav= e got, > + # and in some cases, it might be still better than nothing) > + try: > + with downloader.request("https://ftp.arin.net/info/asn.txt", return_blo= cks=3DFalse) as f: > + arin_as_names_file =3D f.body You copy the entire content into memory. That is something you can do, but th= e whole point of the download handler was so that a large file can be read li= ne by line=E2=80=A6 In this case memory is probably not a concern, but genera= lly I do not like aliasing something. > + except Exception as e: > + log.error("failed to download and preprocess AS name file from ARIN: %s= " % e) > + return Do we just want to continue if this fails? > + > + # Split downloaded body into lines and parse each of them... > + for sline in arin_as_names_file.readlines(): Iterating over f (above) would make the readlines() call unnecessary. It crea= tes a list with the lines of the file. If memory was a concern, this is quite= inefficient. Reading one line at a time would be better. > + > + # ... valid lines start with a space, followed by the number of the Aut= onomous System ... > + if not sline.startswith(b" "): > + continue I usually convert bytes into strings straight away (unless I can perform some= cheap checks before). > + > + # Split line and check if there is a valid ASN in it... > + scontents =3D sline.split() > + try: > + asn =3D int(scontents[0]) > + except ValueError: > + log.debug("Skipping ARIN AS names line not containing an integer for A= SN") > + continue Here is where it is getting interesting. You are using lots and lots of lines= to do something very simple: Split a string in two parts. You can either do this: asn, space, name =3D line.partition(=E2=80=9C =E2=80=9C) Or you can use split with a maximum: asn, name =3D line.split(=E2=80=9C =E2=80=9C, 1) That avoids the whole aliasing thing with scontent[0] later. The variable has= a good name straight away and you will never need to change any array indice= s, because that is messy and a waste of time. I also like brief variable names. as_name should be name because context make= s it clear that it is the name of the AS, doesn=E2=80=99t it? > + > + if not ((1 <=3D asn and asn <=3D 23455) or (23457 <=3D asn and asn <=3D= 64495) or (131072 <=3D asn and asn <=3D 4199999999)): > + log.debug("Skipping ARIN AS names line not containing a valid ASN: %s"= % asn) > + continue I believe we are using this in other places, so it could become a function th= at can be reused. > + > + # Skip any AS name that appears to be a placeholder for a different RIR= or entity... > + as_name =3D scontents[1].decode("ascii") This could fail and you would throw an exception. > + if re.match(r"^(ASN-BLK|)(AFCONC|AFRINIC|APNIC|ASNBLK|DNIC|LACNIC|RIPE|= IANA)\d{0,1}-*", as_name): > + continue This is a good solution. To make the regular expression faster, you could end= after the \d+. You are checking if there is no or many dashes after it. I am= sure if that is what you intended or if that is an attempt to use globbing. > + # Bail out in case the AS name contains anything we do not expect here.= .. > + if re.search(r"[^a-zA-Z0-9-_]", as_name): > + log.debug("Skipping ARIN AS name for %s containing invalid characters:= %s" % \ > + (asn, as_name)) Hmm. What are you trying to do here? Find non-printable characters? What abou= t special characters like =C3=A9, =C3=A4, =C3=B1 and so on? > + > + # Things look good here, run INSERT statement and skip this one if we a= lready have > + # a (better?) name for this Autonomous System... > + self.db.execute(""" > + INSERT INTO autnums( > + number, > + name, > + source > + ) VALUES (%s, %s, %s) > + ON CONFLICT (number) DO NOTHING""", > + asn, > + as_name, > + "ARIN", > + ) > + > def handle_update_announcements(self, ns): > server =3D ns.server[0] >=20 > --=20 > 2.20.1 >=20 This is just my attempt to make you more aware about some things in the code = that might not be very efficient (when it might not matter that much), and wh= at my personal =E2=80=9Cstyle=E2=80=9D is. Others might disagree, but in gene= ral my code performs very well and is very readable. I hope :) -Michael --===============7327112351318348626==--