Hi all,
just FYI, here is the conversation between Robin and me. I am optimistic that I can present a tested patch in the next few days. The main idea behind the modification: move the locking mechanism from a test for file existence to file locking (which is an atomic, uninterruptible OS function).
Regards, Bernhard
-------- Forwarded Message --------
Subject: Re: Stale pakfire lock-file causing pakfire to no longer work
Date: Thu, 22 Sep 2022 14:54:24 +0200
From: Robin Roevens <robin.roevens@disroot.org>
To: Bernhard Bitsch <bbitsch@ipfire.org>
Hi Bernhard
Bernhard Bitsch wrote on Thu 22-09-2022 at 14:47 [+0200]:
Hi Robin,
the .cgi file uses pakfire routines, /opt/pakfire/lib/functions.pl to be exact. So it is guaranteed that the state is read consistently. This should have been true until now. The problem is the timing of processing. The flock() function delegates the locking/unlocking to the OS. That is the single instance you are speaking of.
Calling pakfire ( via system()? ) and analysing its output is more error-prone and time-consuming. My approach just adds the lock() / unlock() functions to the pakfire library. The interface of all other functions remains untouched.
Even better indeed. I did more or less the same with all the metadata parsing I recently added to pakfire, where the webui no longer tries to do the same on its own, but uses the pakfire functions.
Two processes working on the pakfire database only need this one semaphore to synchronise. The lock information ( PID, locking time ) is an extension, usable by the web interface for example.
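A minimal sketch of the idea described above: one flock()-based semaphore, with the PID and locking time written into the lock file as extra information. It is written in Python purely for illustration; pakfire itself is Perl, and the function names `acquire_lock` and `release_lock` as well as the record format are assumptions, not the actual patch.

```python
import fcntl
import os
import time


def acquire_lock(path):
    """Try to take an exclusive, non-blocking flock on `path`.

    Returns the open file object on success (keep it open to hold the
    lock), or None if someone else already holds the lock.
    Hypothetical helper, not part of pakfire.
    """
    # "a+" creates the file if needed and never truncates on open, so a
    # failed attempt cannot wipe the current holder's PID/time record.
    handle = open(path, "a+")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    # We own the lock now: replace any old record with our PID and time.
    handle.truncate(0)
    handle.write(f"{os.getpid()} {int(time.time())}\n")
    handle.flush()
    return handle


def release_lock(handle):
    """Release the lock. The lock file itself is deliberately not deleted."""
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()
```

Because the kernel drops the lock when the holding process exits or is killed, a crashed process cannot leave a stale lock behind, even though the file stays in place.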
My test tasks, which implement the mechanism, are now running without problems. I'm going to implement it in the pakfire package today. So I think I will be able to send you a test version soon.
Thanks. Looking forward to it.
Robin
Bernhard
On 22.09.2022 at 14:24, Robin Roevens wrote:
Hi Bernhard
Bernhard Bitsch wrote on Thu 22-09-2022 at 08:48 [+0200]:
Hi Robin,
On 22.09.2022 at 00:16, Robin Roevens wrote:
Hi Bernhard
Yes, I noticed that too in the meantime. Could we let the webgui just call pakfire without checking for a lockfile? When pakfire exits with the error that it is locked, the webgui would just retry (after a delay?) while showing the current "waiting for lock" screen.
The WebGUI still repeats the lock check. But at the moment the only problem is that the locking mechanism changed from file creation/deletion to file lock/unlock, so the file isn't deleted. My idea: the locking process writes its PID and the time to the file, so the WebGUI can get the information (locked, locking time) to display in the message. Moreover, the locking process can be reported. This adds some complexity, so the solution will take a bit longer.
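How a reader such as the WebGUI could obtain that information might look roughly like the sketch below. It is Python for illustration only (pakfire and its CGI are Perl), and it assumes a hypothetical record format of "PID timestamp" on one line; the function name `lock_status` is likewise made up.

```python
import fcntl
import time


def lock_status(path):
    """Report who holds the lock on `path`, without waiting for it.

    Returns None if nobody holds the lock (or the file does not exist),
    otherwise a tuple (pid, seconds_held). Both values are None if the
    holder has not written its record yet. Hypothetical helper.
    """
    try:
        handle = open(path, "r")
    except FileNotFoundError:
        return None
    with handle:
        try:
            # A shared, non-blocking probe fails only while another
            # process holds the exclusive lock.
            fcntl.flock(handle, fcntl.LOCK_SH | fcntl.LOCK_NB)
        except BlockingIOError:
            fields = handle.read().split()
            if len(fields) != 2:
                return (None, None)
            pid, locked_at = int(fields[0]), int(fields[1])
            return (pid, max(0, int(time.time()) - locked_at))
        # Probe succeeded, so the file is not locked: give it back at once.
        fcntl.flock(handle, fcntl.LOCK_UN)
        return None
```

Note that the probe itself briefly takes a shared lock, so a process trying to take the exclusive lock at exactly that moment would see the file as busy. This is one reason to prefer a single component deciding about the lock state.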
If I understand your intention, the webgui would still try to independently figure out whether pakfire is locked or not? I don't think the webgui should try to figure that out by itself (as it did before by independently checking the existence of the file); it should leave that part to pakfire itself. When pakfire is started, it checks for the lock and exits if locked, possibly with a message naming the process that holds the lock. The webgui should pick that up to display to the user, and retry launching pakfire until it actually does some work instead of exiting because of a lock. This way there is a single source of truth about the locking state of pakfire, and that is pakfire itself. If the locking mechanism changes again some time in the future for reasons we don't know yet :-) then the webgui would not need any modifications to comply with the new mechanism.
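The retry approach described above could be sketched as follows. This is Python for illustration only; the exit code `LOCKED_EXIT_CODE`, the function name and the callback are assumptions, since the thread does not say how pakfire would signal "locked" to its caller.

```python
import subprocess
import time

# Hypothetical: the exit status the tool would use to say "I am locked".
LOCKED_EXIT_CODE = 75


def run_when_unlocked(command, retries=5, delay=1.0, on_wait=None):
    """Run `command`, retrying while it reports that it is locked.

    `on_wait(attempt, message)` is called after each locked attempt so a
    frontend can show a waiting screen with the tool's own message.
    Returns the CompletedProcess of the last attempt.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != LOCKED_EXIT_CODE:
            return result  # the tool did real work (or failed otherwise)
        if on_wait is not None:
            on_wait(attempt, result.stderr.strip())
        if attempt < retries:
            time.sleep(delay)
    return result
```

The caller never inspects the lock file, so the locking mechanism can change without touching the caller, as long as the "locked" signal stays the same.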
Further I want to check for more possible problems in the pakfire part of IPFire. Maybe we can find some more possible race conditions explaining mysterious errors.
Did you find any indications for the missing file deletion in the 'normal' pakfire SW, yet?
Not yet. To be honest, I didn't look for it anymore due to work on another bug. But I hope to investigate this further in the days to come.
Robin
Bernhard
Thanks for the good work!
Robin
Bernhard Bitsch wrote on Mon 19-09-2022 at 15:32 [+0200]:
Hi Robin,
we missed something. The WebGUI doesn't know about our file locking. :( So it waits for deletion of the lock file. I'll work on that asap.
Bernhard
On 19.09.2022 at 00:58, Robin Roevens wrote:
Hi Bernhard
I have good news. For now my pakfire status check in Zabbix is still correctly running and pakfire is still functioning.
I do see a few missing results now and then. Every 10 minutes I should get the output of pakfire status in Zabbix, but once in a while I'm missing a result. I assume those are the points where the previous pakfire version would leave the lockfile behind and stop functioning. Now a subsequent call succeeds again, and both the check and pakfire keep on working. One time I was missing 2 subsequent data points (02:50 and 03:00), but at 03:10 Zabbix again got new data from pakfire.
So I think the flock method is a success!
The pakfire process still sometimes terminates unexpectedly, but this no longer breaks pakfire. As I'm still suspecting sudo, I think the most elegant solution would be to allow 'pakfire status' to be run as a non-privileged user without using sudo. I will try to implement that myself soon and publish it to the devel list (if that indeed solves this problem).
The most important thing is that pakfire no longer breaks, which seems to be fixed with the flock method.
Anyway, I will leave this running now for about a week and see if pakfire still won't break. But I'm already quite sure that is solved now.
Thanks for this implementation
Robin
Bernhard Bitsch wrote on Sun 18-09-2022 at 02:00 [+0200]:
> Hi Robin,
>
> On 18.09.2022 at 00:23, Robin Roevens wrote:
> > Hi Bernhard
> >
> > Thanks. I installed it on my IPFire mini appliance. If tomorrow I still
> > have a working check in Zabbix, it will already be a success, as the
> > original version always leaves a stale lockfile within 24h.
> > But for safety I think it is best to have it run for at least a week or
> > so without problem.
> >
> > The quick tests I performed manually indeed showed pakfire refusing to
> > start when another instance is running.
> >
> > One thing I noticed is that I now get
> > ---
> > Use of uninitialized value $ARGV[1] in string at /opt/pakfire/pakfire line 344.
> > Use of uninitialized value $ARGV[1] in string at /opt/pakfire/pakfire line 346.
> > ---
> > when performing 'pakfire list', but as far as I can see this was
> > already there and is only now visible due to 'use warnings'. So that is
> > not in scope here.
> >
>
> Sorry, didn't scroll up the output while testing.
> Yes, these issues were there already. I use the warnings generally,
> because of the complexity of Perl; you are never aware of all possible
> interpretations of your written code.
> I'm just making a little change. As far as I see, all options are handled
> at the start of the program. If these options are deleted from the
> arguments ( @ARGV ), the rest of the program works on an array
> containing only the command ($ARGV[0]) and parameters. So the checks for
> options can be eliminated.
>
> > And in a final version, of course, a message telling the user why it
> > exited prematurely should also be displayed. But that is a detail for
> > later.
> >
>
> Agreed.
>
> Regards
> Bernhard
>
> > Regards
> > Robin
> >
> > Bernhard Bitsch wrote on Sat 17-09-2022 at 23:11 [+0200]:
> > >
> > > Hi Robin,
> > >
> > > as announced I've made a version with the flock() function.
> > > Just unpack the attached archive and copy it to the /opt/pakfire
> > > directory.
> > > My tests with the locking simulation look good. I will look into other
> > > IPC functions for handling the race condition.
> > >
> > > BTW, do you know the reason for the waits in the pakfire-update and
> > > pakfire-upgrade functionality?
> > >
> > > Regards,
> > > Bernhard
> > >
> > >
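
The option handling described in the quoted message of 18-09-2022 (handle all options at the start, remove them from the argument list, and let the rest of the program see only the command and its parameters) can be sketched as below. This is Python for illustration only; pakfire does this in Perl on @ARGV, and the option names used here are hypothetical.

```python
def split_options(argv, known_options):
    """Separate recognised options from the command and its parameters.

    Returns (options, rest), where `rest` keeps the original order and
    starts with the command, followed by its parameters.
    """
    options = [arg for arg in argv if arg in known_options]
    rest = [arg for arg in argv if arg not in known_options]
    return options, rest


# Hypothetical option names, chosen only for this example.
KNOWN = {"--example-flag", "-x"}
options, rest = split_options(["list", "--example-flag", "installed"], KNOWN)
command = rest[0] if rest else None   # the command, as with $ARGV[0]
parameters = rest[1:]                 # an empty list instead of undefined values
```

Working on `parameters` as a possibly empty list avoids reading a missing second argument, which is the kind of access that triggers the "uninitialized value $ARGV[1]" warnings quoted above.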