Fwd: Stale pakfire lock-file causing pakfire to no longer work - Development

22 Sep 2022


Hi all,
just FYI, here is the conversation between Robin and myself.
I am optimistic that I can present a tested patch in the next few days.
Main idea behind the modification:
Move the locking mechanism from a test of file existence to file locking
(which is an atomic, uninterruptible OS function).
Regards,
Bernhard
-------- Forwarded Message --------
Subject: Re: Stale pakfire lock-file causing pakfire to no longer work
Date: Thu, 22 Sep 2022 14:54:24 +0200
From: Robin Roevens robin.roevens@disroot.org
To: Bernhard Bitsch bbitsch@ipfire.org
Hi Bernhard
Bernhard Bitsch wrote on Thu 22-09-2022 at 14:47 [+0200]:
...
Hi Robin,
the .cgi file uses pakfire routines, /opt/pakfire/lib/functions.pl to be
exact. So it is guaranteed that the state is read consistently. This
should have been true until now.
The problem is the timing of processing. The flock() function delegates
the locking/unlocking to the OS. That is the single instance you are
speaking of.
Calling pakfire ( via system()? ) and analysing the output is
more error-prone and time-consuming.
My approach just adds the lock() / unlock() functions to the pakfire
library. All other functions keep their interface unchanged.
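
The difference between an existence test and an OS-level file lock can be sketched as follows. This is a minimal illustration in Python, not pakfire's Perl code: the names `lock`, `unlock` and `LOCK_PATH` are hypothetical and only mirror the idea described above. The point is that the lock file may stay on disk; only the flock() state held by a live process counts.

```python
import fcntl

# Hypothetical path for illustration only, not pakfire's real lock file.
LOCK_PATH = "/tmp/pakfire-example.lock"


def lock(path=LOCK_PATH):
    """Try to take an exclusive, non-blocking lock.

    Returns the open file object on success, or None if another
    holder already has the lock.
    """
    handle = open(path, "a+")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def unlock(handle):
    """Release the lock. The file itself is left in place."""
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()
```

If the process holding the lock dies unexpectedly, the OS closes its file descriptors and the lock disappears with them, so a leftover file can no longer block later runs.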
Even better indeed. I did more or less the same with all the metadata
parsing I recently added to pakfire, where the webui no longer
tries to do the same on its own, but uses the pakfire functions.
...
Two processes working on the pakfire database only need this one
semaphore to synchronise. The lock information ( PID, locking time ) is
an extension, usable by the web interface for example.
My test tasks, which have implemented the mechanism, are now running
without problems. I'm going to implement it in the pakfire package
today. So I think I'll be able to send you a test version soon.
Thanks. Looking forward to it.
Robin
...
Bernhard
On 22.09.2022 at 14:24, Robin Roevens wrote:
...
Hi Bernhard
Bernhard Bitsch wrote on Thu 22-09-2022 at 08:48 [+0200]:
...
Hi Robin,
On 22.09.2022 at 00:16, Robin Roevens wrote:
...
Hi Bernhard
Yes, I noticed that too in the meantime.
Can we let the webgui just call pakfire without checking for a
lockfile? But when pakfire exits with the error that it is locked, just
retry (after a delay?), showing the current waiting-for-lock screen in
the webgui.
The WebGUI already repeats the lock check.
But at the moment the only problem is that the locking mechanism
changed from file creation/deletion to file lock/unlock. The file
isn't deleted.
My idea: the locking writes the PID and time to the file. So the WebGUI
can get the information (locked, locking time) for display in the
message. Moreover, the locking process can be shown.
This adds some complexity, so the solution will take a bit longer.
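
The idea of storing the PID and lock time in the locked file could look roughly like this. It is a hedged Python sketch, not the pakfire implementation: the function names and the one-line "PID timestamp" file format are assumptions made for illustration. A reader decides whether the lock is held by probing it with a non-blocking shared lock, not by checking whether the file exists.

```python
import fcntl
import os
import time


def lock_with_info(path):
    """Take an exclusive lock and record the holder's PID and lock time.

    Returns the open file object, or None if the lock is already held.
    """
    handle = open(path, "a+")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    # Replace any stale content left by a previous holder.
    handle.seek(0)
    handle.truncate()
    handle.write(f"{os.getpid()} {int(time.time())}\n")
    handle.flush()
    return handle


def read_lock_info(path):
    """Return (pid, locked_since) while the lock is held, else None.

    If the lock is held but the info cannot be parsed (for example
    because it has not been written yet), return (None, None).
    """
    try:
        handle = open(path, "r")
    except FileNotFoundError:
        return None
    with handle:
        try:
            fcntl.flock(handle, fcntl.LOCK_SH | fcntl.LOCK_NB)
        except BlockingIOError:
            fields = handle.read().split()
            if len(fields) != 2:
                return (None, None)
            return (int(fields[0]), int(fields[1]))
        # We obtained the shared lock, so nobody holds the exclusive one.
        fcntl.flock(handle, fcntl.LOCK_UN)
        return None
```

A web interface could call something like `read_lock_info()` to show "locked by PID ... since ..." while waiting. Content left in the file after the lock is released is ignored.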
If I understand your intention, the webgui would still try to
independently figure out whether pakfire is locked or not?
But I don't think the webgui should try to figure that out by itself
(as it did before by independently checking the existence of the file)
but leave that part to pakfire itself.
When pakfire is started, it checks for the lock and exits if locked,
possibly indeed with a message naming the process that has the
lock. And the webgui should pick that up to display to the user, and
retry launching pakfire until it actually does some work instead of
exiting because of a lock.
This way there is a single source of truth about the locking state of
pakfire, and that is pakfire itself. If the locking mechanism were to
change again some time in the future for reasons we don't know yet :-)
then the webgui would not need any modifications to comply
with the new mechanism.
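
The "call, and retry while locked" behaviour proposed here could be sketched like this in Python. Everything specific is an assumption: the thread does not say how pakfire signals "locked", so `LOCKED_EXIT_CODE` is a made-up value, and `run_when_unlocked` and `on_wait` are hypothetical names. The caller never inspects the lock file; it only reacts to what the command reports.

```python
import subprocess
import time

# Hypothetical value: the thread does not specify how pakfire reports a lock.
LOCKED_EXIT_CODE = 75


def run_when_unlocked(command, retries=5, delay=1.0, on_wait=None):
    """Run command, retrying while it exits with LOCKED_EXIT_CODE.

    on_wait(attempt, message) is called before each retry so a UI can
    show the "waiting for lock" state together with the tool's message.
    Returns the last subprocess.CompletedProcess.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    result = None
    for attempt in range(1, retries + 1):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != LOCKED_EXIT_CODE:
            return result
        if on_wait is not None:
            on_wait(attempt, result.stderr.strip())
        time.sleep(delay)
    return result
```

With this shape, a change in the locking mechanism only affects the command being called, not the caller.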
...
Further, I want to check for more possible problems in the pakfire part
of IPFire.
Maybe we can find some more possible race conditions explaining
mysterious errors.
Did you find any indications of the missing file deletion in the
'normal' pakfire SW yet?
Not yet. To be honest, I didn't look for it anymore due to work on
another bug. But I hope to investigate this further in the days to
come.
Robin
...
Bernhard
...
Thanks for the good work!
Robin
Bernhard Bitsch wrote on Mon 19-09-2022 at 15:32 [+0200]:
...
Hi Robin,
we missed something. The WebGUI doesn't know about our file locking. :(
So it waits for deletion of the lock file.
I'll work on that asap.
Bernhard
On 19.09.2022 at 00:58, Robin Roevens wrote:
...
Hi Bernhard
I have good news. For now my pakfire status check in Zabbix is still
running correctly and pakfire is still functioning.
I do see a few missing results now and then.
Every 10 minutes I should get the output of pakfire status in Zabbix.
But once in a while I'm missing a result. So I assume those are the
points where the previous pakfire version would leave the lockfile and
stop functioning.
Now a subsequent call will succeed again and both the check and pakfire
keep on working.
One time I'm missing 2 consecutive data points (02:50 and 03:00) but at
03:10 Zabbix again got new data from pakfire.
So I think the flock method is a success!
The pakfire process sometimes terminating unexpectedly still happens,
but this now no longer breaks pakfire.
As I'm still suspecting sudo, I think the most elegant solution to
that would be to allow 'pakfire status' to be run as a non-privileged
user without using sudo. But I will try to implement that myself soon
and publish it to the devel list (if that indeed solves this problem).
The most important thing is that pakfire no longer breaks, which seems
to be fixed with the flock method.
Anyway, I will leave this running now for about a week and see if
pakfire still won't break. But I'm already quite sure that is solved
now.
Thanks for this implementation
Robin
Bernhard Bitsch wrote on Sun 18-09-2022 at 02:00 [+0200]:
> Hi Robin,
> 
> On 18.09.2022 at 00:23, Robin Roevens wrote:
> > Hi Bernhard
> > 
> > Thanks. I installed it on my IPFire mini appliance. If tomorrow I still
> > have a working check in Zabbix, it will already be a success, as the
> > original version always leaves a stale lockfile within 24h.
> > But for safety I think it is best to have it run for at least a week or
> > so without problems.
> > 
> > The quick tests I performed manually indeed showed pakfire refusing to
> > start when another instance is running.
> > 
> > One thing I noticed is that I now get
> > ---
> > Use of uninitialized value $ARGV[1] in string at /opt/pakfire/pakfire line 344.
> > Use of uninitialized value $ARGV[1] in string at /opt/pakfire/pakfire line 346.
> > ---
> > when performing 'pakfire list', but as far as I can see this was
> > already there, and is only visible now due to 'use warnings'. So that is
> > not in scope here.
> > 
> 
> Sorry, I didn't scroll up the output while testing.
> Yes, these issues were there already. I use warnings generally,
> because of the complexity of Perl; you are never aware of all possible
> interpretations of your written code.
> I'm just making a little change. As far as I can see, all options are
> handled at the start of the program. If these options are deleted from
> the arguments ( @ARGV ), the rest of the program works on an array
> containing only the command ($ARGV[0]) and parameters. So the checks for
> options can be eliminated.
> 
> > And in a final version, of course, a message telling the user why it
> > exited prematurely should also be displayed. But that is a detail for
> > later.
> > 
> 
> Agreed.
> 
> Regards
> Bernhard
> 
> > Regards
> > Robin
> > 
> > Bernhard Bitsch wrote on Sat 17-09-2022 at 23:11 [+0200]:
> > > 
> > > Hi Robin,
> > > 
> > > as announced, I've made a version with the flock() function.
> > > Just unpack the attached archive and copy it to the /opt/pakfire
> > > directory.
> > > My tests with the locking simulation look good. I will look into other
> > > IPC functions for handling the race condition.
> > > 
> > > BTW, do you know the reason for the waits in the pakfire-update and
> > > pakfire-upgrade functionality?
> > > 
> > > Regards,
> > > Bernhard
> > > 
> > 
>
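
The change to the argument handling described in the quoted message of 18 Sep (removing options from the argument list so the rest of the program only sees the command and its parameters) can be illustrated with a small Python sketch. It is not the pakfire code, which is Perl and works on @ARGV; `split_options` and the option names in the usage comment are hypothetical placeholders.

```python
def split_options(argv, known_options):
    """Separate known option flags from the command and its parameters.

    Returns (options, remaining), where remaining[0] is the command and
    remaining[1:] are its parameters, in their original order.
    """
    options = [arg for arg in argv if arg in known_options]
    remaining = [arg for arg in argv if arg not in known_options]
    return options, remaining


# Usage with placeholder option names:
#   options, remaining = split_options(["install", "-y", "nano"], {"-y"})
#   command = remaining[0]
#   parameter = remaining[1] if len(remaining) > 1 else ""
```

Checking the length before reading a parameter avoids the kind of undefined-value access that the warnings quoted above point at when a command such as 'list' is given without a parameter.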