From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michael Tremer To: development@lists.ipfire.org Subject: Re: [PATCH 4/5] ids-functions.pl: Use If-Modified-Since header to reduce file downloads. Date: Mon, 28 Mar 2022 16:15:49 +0100 Message-ID: <026F49A8-7D3B-4374-855A-740B75209BD1@ipfire.org> In-Reply-To: <1576548dfa8d2c950e321c766dc85b8967eeb20f.camel@ipfire.org> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="===============7736114491599181470==" List-Id: --===============7736114491599181470== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Hello, > On 24 Mar 2022, at 18:50, Stefan Schantl wrot= e: >=20 > Hello Michael, >=20 > thanks for your thoughts and feedback. >=20 >> Hello, >>=20 >> I like this mechanism. It is quite powerful since it will save a lot >> of unnecessary downloads and CPU cycles. However, I am not sure if >> all providers will support this because some use crappy CDNs which do >> not give anything about conforming to any RFC. >=20 > Currently I'm only aware that content which is hosted on github.com > cannot be checked with this header flag. They are using Etags which > also can be used for cache management, but works in a completely > different way. Yes, but we can still use them. You can just store the Etag and send it with your next request. > There are plenty more providers, some which requires a subscription > which I do not own - so they currently remains untested.=20 >=20 >>=20 >>> On 23 Mar 2022, at 04:04, Stefan Schantl >>> wrote: >>>=20 >>> When using the "If-Modified-Since" header, the server can be >>> requested >>> if a modified version of the file can be served. >>>=20 >>> In case that is true, the file will be sent and stored by the >>> downloader >>> function. If the file has not been touched since the last time, the >>> server will respond with the code "304" (Not modified). >>>=20 >>> This tells us, that the current stored file is the latest one >>> (still up-to-date) >>> and we safely can skip the download attempt for this provider. >>>=20 >>> Signed-off-by: Stefan Schantl >>> --- >>> config/cfgroot/ids-functions.pl | 38 >>> ++++++++++++++++++++++++++++++--- >>> 1 file changed, 35 insertions(+), 3 deletions(-) >>>=20 >>> diff --git a/config/cfgroot/ids-functions.pl b/config/cfgroot/ids- >>> functions.pl >>> index dfbeb1a7d..d7df41dd1 100644 >>> --- a/config/cfgroot/ids-functions.pl >>> +++ b/config/cfgroot/ids-functions.pl >>> @@ -365,9 +365,25 @@ sub downloadruleset ($) { >>> my $tmp =3D File::Temp->new( SUFFIX =3D> ".tmp", DIR =3D> >>> "/var/tmp/", UNLINK =3D> 0 ); >>> my $tmpfile =3D $tmp->filename(); >>>=20 >>> + # Genarate and assign file name and path to store >>> the downloaded rules file. >>> + my $dl_rulesfile =3D &_get_dl_rulesfile($provider); >>> + >>> + # Load perl module to deal with file atributes. >>> + use File::stat; >>=20 >> I don=E2=80=99t like modules in the middle of a script. This makes testing= a >> new perl update more difficult. If I can load the script, I expect it >> to have all dependencies it needs. This does not work when modules >> are being loaded later on. I would say that we need to cover those >> things by a module test then. >>=20 >> I would prefer reliability over performance (I assume that this your >> argument why to load it only when you need it). >=20 > I'll move the load of the modules to the top of the script. This is > also the way all other (library) scripts are doing this. Thank you. > This may will lead to some more resource consumption, but will help us > to execute the script in a safe way - even when anything got updated. >=20 > Module tests generally are a good way - I'm thinking about implementing > some at a later time (after some important changes are finished) for > the "ids-functions.pl". >=20 >>=20 >>> + >>> + # Get the mtime of the rulesfile if it exists. >>> + my $mtime =3D (stat($dl_rulesfile)->mtime) if (-f >>> $dl_rulesfile); >>=20 >> I suggest making this a little bit simpler: >>=20 >> Create the request first, then stat the file (you don=E2=80=99t need to ch= eck >> if it exists first because stat will just tell you). >>=20 >> Then add the header if you have a timestamp; if not, then don=E2=80=99t ad= d a >> useless header. This is a lot more straight forward :) >>=20 >>> + >>> + # Convert the mtime into gmtime format. >>> + my $gmtime =3D gmtime($mtime || 0); >>> + >>> # Pass the requested url to the downloader. >>> my $request =3D HTTP::Request->new(GET =3D> $url); >>>=20 >>> + # Add the If-Modified-Since header to the request, >>> containing the omited and converted >>> + # mtime of the downloaded rules file, if one is >>> present. >>> + $request->header( 'If-Modified-Since' =3D> "$gmtime" >>> ); >>=20 >> Will this survive any redirects? I know that some redirect to AWS S3 >> and you will need to make sure that the header is being sent when you >> follow the redirect - assuming S3 supports this which I think they >> would. >=20 > If the servers in between are not entirely broken or do silly stuff, > they should. This is in the control of the client. It should not matter what the server re= sponds with. >>=20 >>> + >>> my $dl_attempt =3D 1; >>> my $response; >>>=20 >>> @@ -381,6 +397,14 @@ sub downloadruleset ($) { >>> # Break loop. >>> last; >>>=20 >>> + # Check if the server responds with 304 >>> (Not Modified). >>> + } elsif ($response->code =3D=3D 304) { >>> + # Log to syslog. >>> + &_log_to_syslog("Ruleset is up-to- >>> date, no update required."); >>> + >>> + # Nothing to be done, the ruleset >>> is up-to-date. >>> + return; >>> + >>> # Check if we ran out of download re-tries. >>> } elsif ($dl_attempt eq $max_dl_attempts) { >>> # Obtain error. >>> @@ -406,6 +430,10 @@ sub downloadruleset ($) { >>> # Get the remote size of the downloaded file. >>> my $remote_filesize =3D $headers->content_length; >>>=20 >>> + # Get the timestamp from header, when the file has >>> been modified the >>> + # last time. >>> + my $last_modified =3D $headers->last_modified; >>> + >>> # Load perl stat module. >>> use File::stat; >>=20 >> You are loading the module again. >=20 > Thanks, this is an additional point why, loading all required modules > on top are a good idea. You do not have to think about which modules > are already loaded or may be reloaded etc. >>=20 >>>=20 >>> @@ -428,9 +456,6 @@ sub downloadruleset ($) { >>> return 1; >>> } >>>=20 >>> - # Genarate and assign file name and path to store >>> the downloaded rules file. >>> - my $dl_rulesfile =3D &_get_dl_rulesfile($provider); >>> - >>> # Check if a file name could be obtained. >>> unless ($dl_rulesfile) { >>> # Log error message. >>> @@ -449,6 +474,13 @@ sub downloadruleset ($) { >>> # Overwrite the may existing rulefile or tarball >>> with the downloaded one. >>> move("$tmpfile", "$dl_rulesfile"); >>>=20 >>> + # Check if the server respond contained a >>> last_modified value. >>> + if ($last_modified) { >>> + # Assign the last modified timestamp from >>> server as mtime to the >>> + # rules file. >>> + utime(time(), "$last_modified", >>> "$dl_rulesfile"); >>> + } >>> + >>> # Delete temporary file. >>> unlink("$tmpfile"); >>>=20 >>> --=20 >>> 2.30.2 >>>=20 >>=20 >> -Michael >=20 > -Stefan --===============7736114491599181470==--