From: Michael Tremer <michael.tremer@ipfire.org>
To: Stefan Schantl <stefan.schantl@ipfire.org>
Cc: development@lists.ipfire.org
Subject: Re: [PATCH 1/3] general-functions.pl: Add LWP-based flexible downloader function.
Date: Mon, 24 Mar 2025 10:02:05 +0000
Message-ID: <ABB4812C-0897-45D6-A97E-05E98CEFF9DF@ipfire.org>
In-Reply-To: <ad75eafb1db4f9e976f2119113248d5e3a0742b7.camel@ipfire.org>
Hello,
> On 23 Mar 2025, at 19:00, Stefan Schantl <stefan.schantl@ipfire.org> wrote:
>
> Hello Michael,
>
> thanks for the review and your detailed feedback.
>
> You are right, this patch mainly moves the downloader logic from the
> ipblocklist feature into its own, easy-to-use general function.
>
>> Morning Stefan,
>>
>> Thank you for taking the time and getting this out of your git
>> repository and onto this list :)
>>
>> As we are already shipping this code and this patch only moves it
>> around, there is probably not much standing in the way of seeing this
>> being merged. I have however a couple of questions that we might want
>> to address in a follow-up patch.
>>
>>> On 22 Mar 2025, at 14:57, Stefan Schantl
>>> <stefan.schantl@ipfire.org> wrote:
>>>
>>> This function can be used to grab content and/or store it into
>>> files.
>>>
>>> Signed-off-by: Stefan Schantl <stefan.schantl@ipfire.org>
>>> ---
>>> config/cfgroot/general-functions.pl | 262
>>> ++++++++++++++++++++++++++++
>>> 1 file changed, 262 insertions(+)
>>>
>>> diff --git a/config/cfgroot/general-functions.pl
>>> b/config/cfgroot/general-functions.pl
>>> index 8ba6e3f79..cb8df69c6 100644
>>> --- a/config/cfgroot/general-functions.pl
>>> +++ b/config/cfgroot/general-functions.pl
>>> @@ -20,6 +20,23 @@ use IO::Socket;
>>> use Net::SSLeay;
>>> use Net::IPv4Addr qw(:all);
>>>
>>> +# Load module to move files.
>>> +use File::Copy;
>>> +
>>> +# Load module to get file stats.
>>> +use File::stat;
>>> +
>>> +# Load module to deal with temporary files.
>>> +use File::Temp;
>>> +
>>> +# Load module to deal with the date formats used by the HTTP
>>> protocol.
>>> +use HTTP::Date;
>>> +
>>> +# Load the libwwwperl User Agent module.
>>> +use LWP::UserAgent;
>>
>> Recently (let’s say a year ago) I spent some time on making the
>> web UI faster. Especially on some hardware, it is not very snappy.
>> The reason for the slowness however was not that we are doing much
>> stuff in our code; it was rather that loading modules was taking a
>> long time in Perl. So I removed what was no longer needed and moved
>> some other stuff around which slightly helped (saved around 100ms on
>> the older IPFire Mini Appliance).
>>
>> Since the downloading functionality is not used *that* much here,
>> should this maybe not live in a separate place so that we only
>> include it when it is needed?
>
> Sounds good, so I'm totally fine with this suggestion. This can be part
> of a following patch(set).
Well, moving it should not be a separate patch set, because we should not move it twice.
But any of the other fixes should be implemented in a separate patch each.
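As an illustration of the lazy-loading idea, the heavy modules could be pulled in with require at call time instead of use at compile time. This is only a sketch: File::Temp stands in for LWP::UserAgent, and the flag name is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Load the heavy module only when the downloader is actually used,
# so plain CGI page loads do not pay the compile-time cost.
# File::Temp stands in here for LWP::UserAgent and friends.
my $modules_loaded = 0;

sub downloader {
	unless ($modules_loaded) {
		# require happens at run time, unlike use.
		require File::Temp;
		$modules_loaded = 1;
	}

	# ... the actual download logic would go here ...
	return File::Temp->new(SUFFIX => ".tmp");
}
```

Moving the whole function into its own file and require-ing that file on demand would have the same effect with less bookkeeping.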
>>
>>> +
>>> +$|=1; # line buffering
>>
>> Why is this needed?
>
> This line is not part of this patch and already exists in the
> general-functions.pl file.
>
> The following can be found in the perl special variables manual about
> $|:
>
> $| - If set to nonzero, forces an fflush(3) after every write or print
> on the currently selected output channel.
>
> I'm not sure, if this line is required in general.
I know what it does. I just wanted to know why this has been enabled here.
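For anyone else following the thread, the behaviour in question is a two-liner:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Handle;

# $| = 1 forces a flush after every print on the currently selected
# output handle (STDOUT by default). Without it, output written to a
# pipe or file is block-buffered and may appear late or out of order.
$| = 1;
print "flushed immediately\n";

# The equivalent, more explicit spelling:
STDOUT->autoflush(1);
```

For a CGI script it mainly matters when output is interleaved with forked helpers or long-running work.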
>
>>
>>> +
>>> $General::version = 'VERSION';
>>> $General::swroot = 'CONFIG_ROOT';
>>> $General::noipprefix = 'noipg-';
>>> @@ -1346,6 +1363,251 @@ sub generateProtoTransHash () {
>>> return %protocols;
>>> }
>>>
>>> +# Function to grab a given URL content or to download and store it
>>> on disk.
>>> +#
>>> +# The function requires a configuration hash to be passed.
>>> +#
>>> +# The following options (hash keys) are supported:
>>> +#
>>> +# URL -> The URL to the content or file. REQUIRED!
>>> +# FILE -> The filename as fullpath where the content/file should
>>> be stored on disk.
>>> +# ETAGSFILE -> A filename again as fullpath where Etags should be
>>> stored and read.
>>> +# ETAGPREFIX -> In case a custom etag name should be used,
>>> otherwise it defaults to the given URL.
>>> +# MAXSIZE -> In bytes until the downloader will abort downloading.
>>> (example: 10_485_760 for 10MB)
>>> +#
>>> +# If a file is given an If-Modified-Since header will be generated
>>> from the last modified timestamp
>>> +# of an already stored file. In case an Etag file is specified an
>>> If-None-Match header will be added to
>>> +# the request - Both can be used at the same time.
>>> +#
>>> +# In case no FILE option has been passed to the function, the
>>> content of the requested URL will be returned.
>>> +#
>>> +# Return codes (if FILE is used):
>>> +#
>>> +# nothing - On success
>>> +# no url - If no URL has been specified.
>>> +# not_modified - In case the servers responds with "Not modified"
>>> (304)
>>> +# dl_error - If the requested URL cannot be accessed.
>>> +# incomplete download - In case the size of the local file does
>>> not match the remote content_length.
>>> +#
>>> +sub downloader (%) {
>>> + my (%args) = @_;
>>> +
>>> + # Remap args hash and convert all keys into upper case format.
>>> + %args = map { uc $_ => $args{$_} } keys %args;
>>> +
>>> + # The amount of download attempts before giving up and
>>> + # logging an error.
>>> + my $max_dl_attempts = 3;
>>> +
>>> + # Temporary directory to download the files.
>>> + my $tmp_dl_directory = "/var/tmp";
>>> +
>>> + # Assign hash values.
>>> + my $url = $args{"URL"} if (exists($args{"URL"}));
>>> + my $file = $args{"FILE"} if (exists($args{"FILE"}));
>>> + my $etags_file = $args{"ETAGSFILE"} if
>>> (exists($args{"ETAGSFILE"}));
>>> + my $etagprefix = $url;
>>> + $etagprefix = $args{"ETAGPREFIX"} if
>>> (exists($args{"ETAGPREFIX"}));
>>> + my $max_size = $args{"MAXSIZE"} if (exists($args{"MAXSIZE"}));
>>
>> I consider this really ugly, but I suppose this is what Perl wants.
>>
>>> + # Abort with error "no url", if no URL has been given.
>>> + return "no url" unless ($url);
>>
>> Could we rather die here with an error? Just returning a string here
>> seems to be too easy to ignore in case something above went wrong.
>
> Technically you are right, but let me explain the reason for this kind
> of error handling a bit further down.
>
>>
>>> +
>>> + my %etags = ();
>>> + my $tmpfile;
>>> +
>>> + # Read-in proxysettings.
>>> + my %proxysettings=();
>>> + &readhash("${General::swroot}/proxy/settings", \%proxysettings);
>>> +
>>> + # Create a user agent instance.
>>> + #
>>> + # Request SSL hostname verification and specify path
>>> + # to the CA file.
>>> + my $ua = LWP::UserAgent->new(
>>> + ssl_opts => {
>>> + SSL_ca_file => '/etc/ssl/cert.pem',
>>> + verify_hostname => 1,
>>> + },
>>> + );
>>> +
>>> + # Set timeout to 10 seconds.
>>> + $ua->timeout(10);
>>
>> A 10 second timeout also sounds rather optimistic considering some
>> slower, overloaded mirrors or those that are quite far away. For
>> tasks like downloading data in the background we could afford a
>> longer timeout.
>
> Accepted, what would be a good default value here?
60 seconds? I think it makes sense to make this optionally configurable so that for the IP check that has been changed here, we will only wait around 10 seconds as it is a simple API request.
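A possible shape for that, sketched with an assumed TIMEOUT hash key and a 60-second default (both names and values are suggestions, not part of the patch):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: resolve an optional per-call timeout for the downloader.
# The TIMEOUT key name and the 60 second default are assumptions.
sub resolve_timeout {
	my (%args) = @_;

	# Remap args hash and convert all keys into upper case format,
	# matching the style of the downloader function.
	%args = map { uc $_ => $args{$_} } keys %args;

	# Use the caller-provided value, or a default that tolerates
	# slow or far-away mirrors.
	return $args{"TIMEOUT"} // 60;
}
```

The downloader would then call something like $ua->timeout(&resolve_timeout(%args)), and the public-IP check could pass TIMEOUT => 10 for its simple API request.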
>
>>
>>> +
>>> + # Assign maximum download size if set.
>>> + $ua->max_size($max_size) if ($max_size);
>>> +
>>> + # Generate UserAgent.
>>> + my $agent = &MakeUserAgent();
>>> +
>>> + # Set the generated UserAgent.
>>> + $ua->agent($agent);
>>> +
>>> + # Check if an upstream proxy is configured.
>>> + if ($proxysettings{'UPSTREAM_PROXY'}) {
>>> + my $proxy_url;
>>> +
>>> + $proxy_url = "http://";
>>> +
>>> + # Check if the proxy requires authentication.
>>> + if (($proxysettings{'UPSTREAM_USER'}) &&
>>> ($proxysettings{'UPSTREAM_PASSWORD'})) {
>>> + $proxy_url .=
>>> "$proxysettings{'UPSTREAM_USER'}\:$proxysettings{'UPSTREAM_PASSWORD
>>> '}\@";
>>> + }
>>> +
>>> + # Add proxy server address and port.
>>> + $proxy_url .= $proxysettings{'UPSTREAM_PROXY'};
>>> +
>>> + # Append proxy settings.
>>> + $ua->proxy(['http', 'https'], $proxy_url);
>>> + }
>>> +
>>> + # Create a HTTP request element and pass the given URL to it.
>>> + my $request = HTTP::Request->new(GET => $url);
>>> +
>>> + # Check if a file to store the output has been provided.
>>> + if ($file) {
>>> + # Check if the given file already exits, because it has been
>>> downloaded in the past.
>>> + #
>>> + # In this case we are requesting the server if the remote file
>>> has been changed or not.
>>> + # This will be done by sending the modification time in a special
>>> HTTP header.
>>> + if (-f $file) {
>>> + # Call stat on the file.
>>> + my $stat = stat($file);
>>> +
>>> + # Omit the mtime of the existing file.
>>> + my $mtime = $stat->mtime;
>>> +
>>> + # Convert the timestamp into right format.
>>> + my $http_date = time2str($mtime);
>>> +
>>> + # Add the If-Modified-Since header to the request to ask the
>>> server if the
>>> + # file has been modified.
>>> + $request->header( 'If-Modified-Since' => "$http_date" );
>>> + }
>>> +
>>> + # Generate temporary file name, located in the tempoary download
>>> directory and with a suffix of ".tmp".
>>> + # The downloaded file will be stored there until some sanity
>>> checks are performed.
>>> + my $tmp = File::Temp->new( SUFFIX => ".tmp", DIR =>
>>> "$tmp_dl_directory/", UNLINK => 0 );
>>> + $tmpfile = $tmp->filename();
>>> + }
>>> +
>>> + # Check if an file for etags has been given.
>>> + if ($etags_file) {
>>> + # Read-in Etags file for known Etags if the file is present.
>>> + &readhash("$etags_file", \%etags) if (-f $etags_file);
>>> +
>>> + # Check if an Etag for the current provider is stored.
>>> + if ($etags{$etagprefix}) {
>>> + # Grab the stored tag.
>>> + my $etag = $etags{$etagprefix};
>>> +
>>> + # Add an "If-None-Match header to the request to ask the server
>>> if the
>>> + # file has been modified.
>>> + $request->header( 'If-None-Match' => $etag );
>>> + }
>>> + }
>>> +
>>> + my $dl_attempt = 1;
>>> + my $response;
>>> +
>>> + # Download and retry on failure.
>>> + while ($dl_attempt <= $max_dl_attempts) {
>>> + # Perform the request and save the output into the tmpfile if
>>> requested.
>>> + $response = $ua->request($request, $tmpfile);
>>> +
>>> + # Check if the download was successful.
>>> + if($response->is_success) {
>>> + # Break loop.
>>> + last;
>>> +
>>> + # Check if the server responds with 304 (Not Modified).
>>> + } elsif ($response->code == 304) {
>>> + # Remove temporary file, if one exists.
>>> + unlink("$tmpfile") if (-e "$tmpfile");
>>> +
>>> + # Return "not modified".
>>> + return "not modified";
>>
>> Same here. It feels like strings are not the most ideal return code.
>
> As promised above, I'm going to explain why I decided to return
> various strings in case of different errors. Calling die()
> immediately terminates the Perl (main) script.
This is correct, and I have never really liked this behaviour. However, Perl has no native exception mechanism, so we cannot gracefully throw an error.
But not passing a URL is a programming error, and that should get some attention.
The “not modified” case should of course not call die.
> So, for example, when such an event/exception/error appears while
> processing a list of downloads, in my opinion the main script which
> called the function should decide how to proceed.
>
> This is very similar to C where you call the function and take care of
> the return code and decide what to do next.
Usually in C we return an integer or even better an enum.
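The closest Perl equivalent would be the constant pragma, so that a typo in a caller fails to compile under strict instead of silently comparing unequal strings. The constant names below are invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: symbolic return codes instead of free-form strings.
# The names and the mapping are illustrative only.
use constant {
	DL_OK           => 0,
	DL_NOT_MODIFIED => 1,
	DL_ERROR        => 2,
};

# Map an HTTP status code to one of the symbolic results.
sub classify_response {
	my ($code) = @_;

	return DL_NOT_MODIFIED if ($code == 304);
	return DL_OK if ($code >= 200 && $code < 300);
	return DL_ERROR;
}
```

Callers could then write if ($rc == DL_NOT_MODIFIED) { ... } and a misspelled constant would be caught at compile time.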
> So in this case I have to disagree with you and strongly recommend
> keeping the error processing as it is.
>
> Best regards,
>
> -Stefan
>
>>
>>> +
>>> + # Check if we ran out of download re-tries.
>>> + } elsif ($dl_attempt eq $max_dl_attempts) {
>>> + # Obtain error.
>>> + my $error = $response->content;
>>> +
>>> + # Remove temporary file, if one exists.
>>> + unlink("$tmpfile") if (-e "$tmpfile");
>>> +
>>> + # Return the error message from response..
>>> + return "$error";
>>> + }
>>> +
>>> + # Remove temporary file, if one exists.
>>> + unlink("$tmpfile") if (-e "$tmpfile");
>>> +
>>> + # Increase download attempt counter.
>>> + $dl_attempt++;
>>> + }
>>> +
>>> + # Obtain the connection headers.
>>> + my $headers = $response->headers;
>>> +
>>> + # Check if an Etag file has been provided.
>>> + if ($etags_file) {
>>> + # Grab the Etag from the response if the server provides one.
>>> + if ($response->header('Etag')) {
>>> + # Add the provided Etag to the hash of tags.
>>> + $etags{$etagprefix} = $response->header('Etag');
>>> +
>>> + # Write the etags file.
>>> + &writehash($etags_file, \%etags);
>>> + }
>>> + }
>>> +
>>> + # Check if the response should be stored on disk.
>>> + if ($file) {
>>> + # Get the remote size of the content.
>>> + my $remote_size = $response->header('Content-Length');
>>> +
>>> + # Perform a stat on the temporary file.
>>> + my $stat = stat($tmpfile);
>>> +
>>> + # Grab the size of the stored temporary file.
>>> + my $local_size = $stat->size;
>>> +
>>> + # Check if both sizes are equal.
>>> + if(($remote_size) && ($remote_size ne $local_size)) {
>>> + # Delete the temporary file.
>>> + unlink("$tmpfile");
>>> +
>>> + # Abort and return "incomplete download" as error.
>>> + return "incomplete download";
>>> + }
>>> +
>>> + # Move the temporaray file to the desired file by overwriting a
>>> may
>>> + # existing one.
>>> + move("$tmpfile", "$file");
>>> +
>>> + # Omit the timestamp from response header, when the file has been
>>> modified the
>>> + # last time.
>>> + my $last_modified = $headers->last_modified;
>>> +
>>> + # Check if we got a last-modified value from the server.
>>> + if ($last_modified) {
>>> + # Assign the last-modified timestamp as mtime to the
>>> + # stored file.
>>> + utime(time(), "$last_modified", "$file");
>>> + }
>>> +
>>> + # Delete temporary file.
>>> + unlink("$tmpfile");
>>> +
>>> + # If we got here, everything worked fine. Return nothing.
>>> + return;
>>> + } else {
>>> + # Decode the response content and return it.
>>> + return $response->decoded_content;
>>> + }
>>> +}
>>> +
>>> # Cloud Stuff
>>>
>>> sub running_in_cloud() {
>>> --
>>> 2.47.2
Thread overview:
2025-03-22 14:57 Stefan Schantl
2025-03-22 14:57 ` [PATCH 2/3] general-functions.pl: Use new downloader for FetchPublicIp function Stefan Schantl
2025-03-23 12:08 ` Michael Tremer
2025-03-22 14:57 ` [PATCH 3/3] ids-functions.pl: Use new general downloader function Stefan Schantl
2025-03-23 12:08 ` [PATCH 1/3] general-functions.pl: Add LWP-based flexible " Michael Tremer
2025-03-23 19:00 ` Stefan Schantl
2025-03-24 10:02 ` Michael Tremer [this message]
2025-03-26 12:47 ` Stefan Schantl
2025-03-26 16:18 ` Michael Tremer