From: Michael Tremer <michael.tremer@ipfire.org>
To: Stefan Schantl <stefan.schantl@ipfire.org>
Cc: development@lists.ipfire.org
Subject: Re: [PATCH 1/3] general-functions.pl: Add LWP-based flexible downloader function.
Date: Wed, 26 Mar 2025 16:18:15 +0000 [thread overview]
Message-ID: <B99F2B13-AB79-4C6F-911E-2228108535E9@ipfire.org> (raw)
In-Reply-To: <195d27e4348.27e7.cac9d3ffac9e24d09d20af05166fd73b@ipfire.org>
Hello Stefan,
> On 26 Mar 2025, at 12:47, Stefan Schantl <stefan.schantl@ipfire.org> wrote:
>
>
>
> Am 24. März 2025 11:02:08 schrieb Michael Tremer <michael.tremer@ipfire.org>:
>
>> Hello,
>>
>>> On 23 Mar 2025, at 19:00, Stefan Schantl <stefan.schantl@ipfire.org> wrote:
>>>
>>> Hello Michael,
>>>
>>> thanks for the review and your detailed feedback.
>>>
>>> You are right, this patch mainly moves the downloader logic from the
>>> ipblocklist feature into its own easy-to-use general function.
>>>
>>>> Morning Stefan,
>>>>
>>>> Thank you for taking the time and getting this out of your git
>>>> repository and onto this list :)
>>>>
>>>> As we are already shipping this code and this patch only moves it
>>>> around, there is probably not much standing in the way of seeing this
>>>> being merged. I have however a couple of questions that we might want
>>>> to address in a follow-up patch.
>>>>
>>>>> On 22 Mar 2025, at 14:57, Stefan Schantl
>>>>> <stefan.schantl@ipfire.org> wrote:
>>>>>
>>>>> This function can be used to grab content and/or store it into
>>>>> files.
>>>>>
>>>>> Signed-off-by: Stefan Schantl <stefan.schantl@ipfire.org>
>>>>> ---
>>>>> config/cfgroot/general-functions.pl | 262 ++++++++++++++++++++++++++++
>>>>> 1 file changed, 262 insertions(+)
>>>>>
>>>>> diff --git a/config/cfgroot/general-functions.pl b/config/cfgroot/general-functions.pl
>>>>> index 8ba6e3f79..cb8df69c6 100644
>>>>> --- a/config/cfgroot/general-functions.pl
>>>>> +++ b/config/cfgroot/general-functions.pl
>>>>> @@ -20,6 +20,23 @@ use IO::Socket;
>>>>> use Net::SSLeay;
>>>>> use Net::IPv4Addr qw(:all);
>>>>>
>>>>> +# Load module to move files.
>>>>> +use File::Copy;
>>>>> +
>>>>> +# Load module to get file stats.
>>>>> +use File::stat;
>>>>> +
>>>>> +# Load module to deal with temporary files.
>>>>> +use File::Temp;
>>>>> +
>>>>> +# Load module to deal with the date formats used by the HTTP protocol.
>>>>> +use HTTP::Date;
>>>>> +
>>>>> +# Load the libwwwperl User Agent module.
>>>>> +use LWP::UserAgent;
>>>>
>>>> Recently (let’s say a year ago) I have spent some time on making the
>>>> web UI faster. Especially on some hardware, it is not very snappy.
>>>> The reason for the slowness however was not that we are doing much
>>>> stuff in our code; it was rather that loading modules was taking a
>>>> long time in Perl. So I removed what was no longer needed and moved
>>>> some other stuff around which slightly helped (saved around 100ms on
>>>> the older IPFire Mini Appliance).
>>>>
>>>> Since the downloading functionality is not used *that* much here,
>>>> should this maybe live in a separate place that we only include
>>>> when it is needed?
>>>
>>> Sounds good, so I'm totally fine with this suggestion. This can be part
>>> of a follow-up patch(set).
>>
>> Well, moving it should not be a separate patch set, because we should not move it twice.
>
>
> Hello Michael,
>
> this absolutely makes sense. Any ideas for the filename?
>
> I think downloader.pl could be misinterpreted by certain users - maybe web-function.pl?
download-functions.pl?
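Something roughly like this is what I have in mind - please treat it as a sketch only, the file name and the Downloads package are just placeholders:

# Hypothetical sketch: keep the downloader in its own file and load it
# lazily, so the web UI does not pay the module-loading cost on every
# request.
sub fetch_blocklist {
    my ($url, $destination) = @_;

    # Pull in LWP::UserAgent and friends only when a download is
    # actually performed (require caches the file in %INC, so repeated
    # calls stay cheap).
    require "${General::swroot}/download-functions.pl";

    # Hand over to the downloader living in the separate file.
    return &Downloads::downloader(URL => $url, FILE => $destination);
}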
>
>>
>> But any of the other fixes should be implemented in a separate patch each.
>
>
> Let's do it this way.
>
>>
>>>>
>>>>> +
>>>>> +$|=1; # line buffering
>>>>
>>>> Why is this needed?
>>>
>>> This line does not belong to the patch and is already part of the
>>> general-functions.pl file.
>>>
>>> The following can be found in the perl special variables manual about
>>> $|:
>>>
>>> $| - If set to nonzero, forces an fflush(3) after every write or print
>>> on the currently selected output channel.
>>>
>>> I'm not sure if this line is required in general.
>>
>> I know what it does. I just wanted to know why this has been enabled here.
>
>
> Short answer: No clue.
Ooookay.
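If it turns out that nothing actually relies on the global flag, a more targeted alternative would be something like this (just a sketch, assuming only STDOUT needs flushing):

# Minimal sketch: enable autoflush for STDOUT only instead of forcing
# a flush after every write on the currently selected output channel.
use IO::Handle;

STDOUT->autoflush(1);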
>
>>
>>>
>>>>
>>>>> +
>>>>> $General::version = 'VERSION';
>>>>> $General::swroot = 'CONFIG_ROOT';
>>>>> $General::noipprefix = 'noipg-';
>>>>> @@ -1346,6 +1363,251 @@ sub generateProtoTransHash () {
>>>>> return %protocols;
>>>>> }
>>>>>
>>>>> +# Function to grab the content of a given URL or to download and store it on disk.
>>>>> +#
>>>>> +# The function requires a configuration hash to be passed.
>>>>> +#
>>>>> +# The following options (hash keys) are supported:
>>>>> +#
>>>>> +# URL -> The URL to the content or file. REQUIRED!
>>>>> +# FILE -> The filename, as full path, where the content/file should be stored on disk.
>>>>> +# ETAGSFILE -> A filename, again as full path, where Etags should be stored and read.
>>>>> +# ETAGPREFIX -> In case a custom Etag name should be used, otherwise it defaults to the given URL.
>>>>> +# MAXSIZE -> Size in bytes after which the downloader will abort downloading. (example: 10_485_760 for 10MB)
>>>>> +#
>>>>> +# If a file is given, an If-Modified-Since header will be generated from the last modified timestamp
>>>>> +# of an already stored file. In case an Etag file is specified, an If-None-Match header will be added to
>>>>> +# the request - both can be used at the same time.
>>>>> +#
>>>>> +# In case no FILE option has been passed to the function, the content of the requested URL will be returned.
>>>>> +#
>>>>> +# Return codes (if FILE is used):
>>>>> +#
>>>>> +# nothing - On success
>>>>> +# no url - If no URL has been specified.
>>>>> +# not_modified - In case the server responds with "Not Modified" (304)
>>>>> +# dl_error - If the requested URL cannot be accessed.
>>>>> +# incomplete download - In case the size of the local file does not match the remote Content-Length.
>>>>> +#
>>>>> +sub downloader (%) {
>>>>> + my (%args) = @_;
>>>>> +
>>>>> + # Remap args hash and convert all keys into upper case format.
>>>>> + %args = map { uc $_ => $args{$_} } keys %args;
>>>>> +
>>>>> + # The amount of download attempts before giving up and
>>>>> + # logging an error.
>>>>> + my $max_dl_attempts = 3;
>>>>> +
>>>>> + # Temporary directory to download the files.
>>>>> + my $tmp_dl_directory = "/var/tmp";
>>>>> +
>>>>> + # Assign hash values.
>>>>> + my $url = $args{"URL"} if (exists($args{"URL"}));
>>>>> + my $file = $args{"FILE"} if (exists($args{"FILE"}));
>>>>> + my $etags_file = $args{"ETAGSFILE"} if (exists($args{"ETAGSFILE"}));
>>>>> + my $etagprefix = $url;
>>>>> + $etagprefix = $args{"ETAGPREFIX"} if (exists($args{"ETAGPREFIX"}));
>>>>> + my $max_size = $args{"MAXSIZE"} if (exists($args{"MAXSIZE"}));
>>>>
>>>> I consider this really ugly, but I suppose this is what Perl wants.
>>>>
>>>>> + # Abort with error "no url", if no URL has been given.
>>>>> + return "no url" unless ($url);
>>>>
>>>> Could we rather die here with an error? Just returning a string here
>>>> seems to be too easy to ignore in case something above went wrong.
>>>
>>> Technically you are right, but let me explain the reason for this kind
>>> of error handling a bit further down.
>>>
>>>>
>>>>> +
>>>>> + my %etags = ();
>>>>> + my $tmpfile;
>>>>> +
>>>>> + # Read-in proxysettings.
>>>>> + my %proxysettings=();
>>>>> + &readhash("${General::swroot}/proxy/settings", \%proxysettings);
>>>>> +
>>>>> + # Create a user agent instance.
>>>>> + #
>>>>> + # Request SSL hostname verification and specify path
>>>>> + # to the CA file.
>>>>> + my $ua = LWP::UserAgent->new(
>>>>> + ssl_opts => {
>>>>> + SSL_ca_file => '/etc/ssl/cert.pem',
>>>>> + verify_hostname => 1,
>>>>> + },
>>>>> + );
>>>>> +
>>>>> + # Set timeout to 10 seconds.
>>>>> + $ua->timeout(10);
>>>>
>>>> A 10 second timeout also sounds rather optimistic considering some
>>>> slower, overloaded mirrors or those that are quite far away. For any
>>>> tasks like downloading data in the background we could afford longer.
>>>
>>> Accepted, what would be a good default value here?
>>
>> 60 seconds? I think it makes sense to make this optionally configurable so that for the IP check that has been changed here, we will only wait around 10 seconds as it is a simple API request.
>
>
> I'll implement it as an optional and configurable feature. The default, if not set, would be 60 seconds.
Agreed.
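Roughly something like this, just as a sketch - the TIMEOUT key name is only a suggestion and not part of the posted patch:

# Sketch: accept an optional timeout from the caller and fall back to
# 60 seconds for background downloads.
use LWP::UserAgent;

my %args = (URL => "https://example.org/", TIMEOUT => 10);

# Use the caller-supplied value if present, otherwise the default.
my $timeout = exists($args{"TIMEOUT"}) ? $args{"TIMEOUT"} : 60;

# Create the user agent and apply the timeout.
my $ua = LWP::UserAgent->new();
$ua->timeout($timeout);

That way the public IP check can pass TIMEOUT => 10 while the blocklist downloads simply omit the key.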
>>
>>>
>>>>
>>>>> +
>>>>> + # Assign maximum download size if set.
>>>>> + $ua->max_size($max_size) if ($max_size);
>>>>> +
>>>>> + # Generate UserAgent.
>>>>> + my $agent = &MakeUserAgent();
>>>>> +
>>>>> + # Set the generated UserAgent.
>>>>> + $ua->agent($agent);
>>>>> +
>>>>> + # Check if an upstream proxy is configured.
>>>>> + if ($proxysettings{'UPSTREAM_PROXY'}) {
>>>>> + my $proxy_url;
>>>>> +
>>>>> + $proxy_url = "http://";
>>>>> +
>>>>> + # Check if the proxy requires authentication.
>>>>> + if (($proxysettings{'UPSTREAM_USER'}) && ($proxysettings{'UPSTREAM_PASSWORD'})) {
>>>>> + $proxy_url .= "$proxysettings{'UPSTREAM_USER'}\:$proxysettings{'UPSTREAM_PASSWORD'}\@";
>>>>> + }
>>>>> +
>>>>> + # Add proxy server address and port.
>>>>> + $proxy_url .= $proxysettings{'UPSTREAM_PROXY'};
>>>>> +
>>>>> + # Append proxy settings.
>>>>> + $ua->proxy(['http', 'https'], $proxy_url);
>>>>> + }
>>>>> +
>>>>> + # Create a HTTP request element and pass the given URL to it.
>>>>> + my $request = HTTP::Request->new(GET => $url);
>>>>> +
>>>>> + # Check if a file to store the output has been provided.
>>>>> + if ($file) {
>>>>> + # Check if the given file already exists, because it has been downloaded in the past.
>>>>> + #
>>>>> + # In this case we ask the server whether the remote file has been changed or not.
>>>>> + # This will be done by sending the modification time in a special HTTP header.
>>>>> + if (-f $file) {
>>>>> + # Call stat on the file.
>>>>> + my $stat = stat($file);
>>>>> +
>>>>> + # Obtain the mtime of the existing file.
>>>>> + my $mtime = $stat->mtime;
>>>>> +
>>>>> + # Convert the timestamp into the right format.
>>>>> + my $http_date = time2str($mtime);
>>>>> +
>>>>> + # Add the If-Modified-Since header to the request to ask the server if the
>>>>> + # file has been modified.
>>>>> + $request->header( 'If-Modified-Since' => "$http_date" );
>>>>> + }
>>>>> +
>>>>> + # Generate a temporary file name, located in the temporary download directory and with a suffix of ".tmp".
>>>>> + # The downloaded file will be stored there until some sanity checks are performed.
>>>>> + my $tmp = File::Temp->new( SUFFIX => ".tmp", DIR => "$tmp_dl_directory/", UNLINK => 0 );
>>>>> + $tmpfile = $tmp->filename();
>>>>> + }
>>>>> +
>>>>> + # Check if a file for Etags has been given.
>>>>> + if ($etags_file) {
>>>>> + # Read-in Etags file for known Etags if the file is present.
>>>>> + &readhash("$etags_file", \%etags) if (-f $etags_file);
>>>>> +
>>>>> + # Check if an Etag for the current provider is stored.
>>>>> + if ($etags{$etagprefix}) {
>>>>> + # Grab the stored tag.
>>>>> + my $etag = $etags{$etagprefix};
>>>>> +
>>>>> + # Add an "If-None-Match header to the request to ask the server
>>>>> if the
>>>>> + # file has been modified.
>>>>> + $request->header( 'If-None-Match' => $etag );
>>>>> + }
>>>>> + }
>>>>> +
>>>>> + my $dl_attempt = 1;
>>>>> + my $response;
>>>>> +
>>>>> + # Download and retry on failure.
>>>>> + while ($dl_attempt <= $max_dl_attempts) {
>>>>> + # Perform the request and save the output into the tmpfile if requested.
>>>>> + $response = $ua->request($request, $tmpfile);
>>>>> +
>>>>> + # Check if the download was successful.
>>>>> + if($response->is_success) {
>>>>> + # Break loop.
>>>>> + last;
>>>>> +
>>>>> + # Check if the server responds with 304 (Not Modified).
>>>>> + } elsif ($response->code == 304) {
>>>>> + # Remove temporary file, if one exists.
>>>>> + unlink("$tmpfile") if (-e "$tmpfile");
>>>>> +
>>>>> + # Return "not modified".
>>>>> + return "not modified";
>>>>
>>>> Same here. It feels like strings are not the most ideal return code.
>>>
>>> As promised above, I'm going to explain why I decided to return various
>>> strings in case of different errors. Calling die() immediately
>>> terminates the Perl (main) script.
>>
>> This is correct, and I never really liked this behaviour. However, Perl does not have exceptions and so we cannot gracefully throw an error.
>>
>> But, not passing a URL is a programming error and that should get some attention.
>
>
> Yeah, you have to take care of this kind of Perl behaviour and I'm also not in favour of it. That was the reason why I decided to do it this way.
>
> I see the reason why you are recommending to die in case no URL has been passed - I'll adjust the code.
>
> Best regards,
>
> -Stefan
>
>>
>> The “not modified” case should of course not call die.
>>
>>> So, for example, when processing a list of downloads and such an
>>> event/exception/error appears, in my opinion the main script which
>>> called the function should decide how to proceed.
>>>
>>> This is very similar to C where you call the function and take care of
>>> the return code and decide what to do next.
>>
>> Usually in C we return an integer or even better an enum.
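In Perl the same idea could look roughly like this - the constant names and the separate downloader_sketch() function are only illustrative, not what the patch currently does:

# Sketch: symbolic return codes instead of free-form strings.
use constant {
    DL_OK           => 0,
    DL_NOT_MODIFIED => 1,
    DL_ERROR        => 2,
    DL_INCOMPLETE   => 3,
};

sub downloader_sketch {
    my (%args) = @_;

    # Not passing a URL is a programming error in the caller, so abort
    # loudly instead of returning a status code.
    die "downloader_sketch() called without a URL" unless ($args{"URL"});

    # ... perform the request as in the patch ...

    # Report the outcome to the caller as a named code.
    return DL_OK;
}

# The caller still decides how to go on:
my $status = &downloader_sketch(URL => "https://example.org/list.txt");
print "Stored copy is still current.\n" if ($status == DL_NOT_MODIFIED);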
>>
>>> So in this case I have to disagree with you and strongly recommend
>>> keeping the error processing as it is.
>>>
>>> Best regards,
>>>
>>> -Stefan
>>>
>>>>
>>>>> +
>>>>> + # Check if we ran out of download re-tries.
>>>>> + } elsif ($dl_attempt eq $max_dl_attempts) {
>>>>> + # Obtain error.
>>>>> + my $error = $response->content;
>>>>> +
>>>>> + # Remove temporary file, if one exists.
>>>>> + unlink("$tmpfile") if (-e "$tmpfile");
>>>>> +
>>>>> + # Return the error message from the response.
>>>>> + return "$error";
>>>>> + }
>>>>> +
>>>>> + # Remove temporary file, if one exists.
>>>>> + unlink("$tmpfile") if (-e "$tmpfile");
>>>>> +
>>>>> + # Increase download attempt counter.
>>>>> + $dl_attempt++;
>>>>> + }
>>>>> +
>>>>> + # Obtain the connection headers.
>>>>> + my $headers = $response->headers;
>>>>> +
>>>>> + # Check if an Etag file has been provided.
>>>>> + if ($etags_file) {
>>>>> + # Grab the Etag from the response if the server provides one.
>>>>> + if ($response->header('Etag')) {
>>>>> + # Add the provided Etag to the hash of tags.
>>>>> + $etags{$etagprefix} = $response->header('Etag');
>>>>> +
>>>>> + # Write the etags file.
>>>>> + &writehash($etags_file, \%etags);
>>>>> + }
>>>>> + }
>>>>> +
>>>>> + # Check if the response should be stored on disk.
>>>>> + if ($file) {
>>>>> + # Get the remote size of the content.
>>>>> + my $remote_size = $response->header('Content-Length');
>>>>> +
>>>>> + # Perform a stat on the temporary file.
>>>>> + my $stat = stat($tmpfile);
>>>>> +
>>>>> + # Grab the size of the stored temporary file.
>>>>> + my $local_size = $stat->size;
>>>>> +
>>>>> + # Check if both sizes are equal.
>>>>> + if(($remote_size) && ($remote_size ne $local_size)) {
>>>>> + # Delete the temporary file.
>>>>> + unlink("$tmpfile");
>>>>> +
>>>>> + # Abort and return "incomplete download" as error.
>>>>> + return "incomplete download";
>>>>> + }
>>>>> +
>>>>> + # Move the temporary file to the desired file, overwriting any
>>>>> + # existing one.
>>>>> + move("$tmpfile", "$file");
>>>>> +
>>>>> + # Obtain the timestamp from the response header indicating when
>>>>> + # the file has been modified the last time.
>>>>> + my $last_modified = $headers->last_modified;
>>>>> +
>>>>> + # Check if we got a last-modified value from the server.
>>>>> + if ($last_modified) {
>>>>> + # Assign the last-modified timestamp as mtime to the
>>>>> + # stored file.
>>>>> + utime(time(), "$last_modified", "$file");
>>>>> + }
>>>>> +
>>>>> + # Delete temporary file.
>>>>> + unlink("$tmpfile");
>>>>> +
>>>>> + # If we got here, everything worked fine. Return nothing.
>>>>> + return;
>>>>> + } else {
>>>>> + # Decode the response content and return it.
>>>>> + return $response->decoded_content;
>>>>> + }
>>>>> +}
>>>>> +
>>>>> # Cloud Stuff
>>>>>
>>>>> sub running_in_cloud() {
>>>>> --
>>>>> 2.47.2
>
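For reference, calling the function as posted from a CGI script would look roughly like this - the URL, paths and size are made-up values, and I am assuming it is invoked as &General::downloader like the other helpers in general-functions.pl:

# Load the general functions, as the CGI scripts already do.
require '/var/ipfire/general-functions.pl';

# Example invocation with illustrative values only.
my $error = &General::downloader(
    URL       => "https://lists.example.org/blocklist.txt",
    FILE      => "/var/tmp/blocklist.txt",
    ETAGSFILE => "/var/tmp/blocklist-etags",
    MAXSIZE   => 10_485_760,   # abort anything larger than 10 MB
);

# An empty return value means the file has been downloaded and stored;
# note that a 304 ("not modified") response is also reported through
# the return value.
if ($error) {
    print "Download skipped or failed: $error\n";
}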
Thread overview: 9+ messages
2025-03-22 14:57 Stefan Schantl
2025-03-22 14:57 ` [PATCH 2/3] general-functions.pl: Use new downloader for FetchPublicIp function Stefan Schantl
2025-03-23 12:08 ` Michael Tremer
2025-03-22 14:57 ` [PATCH 3/3] ids-functions.pl: Use new general downloader function Stefan Schantl
2025-03-23 12:08 ` [PATCH 1/3] general-functions.pl: Add LWP-based flexible " Michael Tremer
2025-03-23 19:00 ` Stefan Schantl
2025-03-24 10:02 ` Michael Tremer
2025-03-26 12:47 ` Stefan Schantl
2025-03-26 16:18 ` Michael Tremer [this message]