* Fwd: Stale pakfire lock-file causing pakfire to no longer work
From: Bernhard Bitsch @ 2022-09-22 13:17 UTC
To: development
Hi all,
just FYI, here is the conversation between Robin and me.
I am optimistic that I can present a tested patch within the next few days.
Main idea behind the modification:
Move the locking mechanism from a test of file existence to file locking
(which is an atomic, uninterruptible OS function).
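
A minimal sketch of that idea, not the actual patch: it is written in Python rather than pakfire's Perl, and the path and the helper names (`try_lock`, `unlock`, `read_holder`) are hypothetical. It assumes a POSIX system where `flock()` is available. The lock file is never deleted; a leftover file does not block anyone, because the OS drops the lock as soon as the holding process closes the file or exits.

```python
import fcntl
import os
import time

# Hypothetical path for illustration only; not the path pakfire uses.
LOCK_PATH = "/tmp/pakfire-example.lock"


def try_lock(path=LOCK_PATH):
    """Return an open file holding the lock, or None if another holder exists."""
    # "a+" creates the file if it is missing and does not truncate on open,
    # so the current holder's information survives a failed attempt.
    handle = open(path, "a+")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    # The lock is ours: record PID and locking time for other readers.
    handle.seek(0)
    handle.truncate()
    handle.write(f"{os.getpid()} {int(time.time())}\n")
    handle.flush()
    return handle


def unlock(handle):
    """Release the lock; the file itself stays in place."""
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()


def read_holder(path=LOCK_PATH):
    """Return (pid, lock_time) as written by the last holder, or None."""
    with open(path) as lock_file:
        fields = lock_file.read().split()
    return (int(fields[0]), int(fields[1])) if len(fields) == 2 else None
```

A caller would run `handle = try_lock()`, exit or retry later if it gets `None`, and call `unlock(handle)` when done. The existence of the file says nothing about the lock state here; only the `flock()` call does.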
Regards,
Bernhard
-------- Forwarded Message --------
Subject: Re: Stale pakfire lock-file causing pakfire to no longer work
Date: Thu, 22 Sep 2022 14:54:24 +0200
From: Robin Roevens <robin.roevens(a)disroot.org>
To: Bernhard Bitsch <bbitsch(a)ipfire.org>
Hi Bernhard
Bernhard Bitsch wrote on Thu 22-09-2022 at 14:47 [+0200]:
> Hi Robin,
>
> the .cgi file uses pakfire routines, /opt/pakfire/lib/functions.pl to be
> exact. So it is guaranteed that the state is read consistently. This
> should have been true until now.
> The problem is the timing of processing. The flock() function
> delegates
> the locking/unlocking to the OS. That is the unique instance you are
> speaking of.
>
> The calling of pakfire ( via system()? ) and analysing the output is
> more error-prone and time-consuming.
> My approach just adds the lock() / unlock() functions to the pakfire
> library. All other functions remain untouched in the interface.
Even better indeed. I did more or less the same with all the metadata
parsing I recently added to pakfire, where the webui no longer
tries to do the same on its own, but uses the pakfire functions.
>
> Two processes working on the pakfire database only need this one
> semaphore to synchronise. The lock information (PID, locking time) is
> an extension, usable by the web interface for example.
>
> My test tasks, which have implemented the mechanism, are now running
> without problems. I'm going to implement it in the pakfire package
> today. So I think I'll be able to send you a test version soon.
Thanks. Looking forward to it.
Robin
>
> Bernhard
>
>
> On 22.09.2022 at 14:24, Robin Roevens wrote:
> > Hi Bernhard
> >
> > Bernhard Bitsch wrote on Thu 22-09-2022 at 08:48 [+0200]:
> > > Hi Robin,
> > >
> > >
> > > On 22.09.2022 at 00:16, Robin Roevens wrote:
> > > > Hi Bernhard
> > > >
> > > > Yes, I noticed that too in the meantime.
> > > > Can we let the webgui just call pakfire without checking for a
> > > > lockfile? When pakfire exits with the error that it is locked, the
> > > > webgui would just retry (after a delay?), showing the current
> > > > waiting-for-lock screen.
> > > >
> > >
> > > The WebGUI already repeats the lock check.
> > > But at the moment the problem is just that the locking mechanism
> > > changed from file creation/deletion to file lock/unlock. The file
> > > isn't deleted.
> > > My idea: the locking writes the PID and time to the file. So the
> > > WebGUI can get the information (locked, locking time) for display in
> > > the message. Moreover, the locking process can be reported.
> > > This adds some more complexity, so the solution will take a bit
> > > longer.
> >
> > If I understand your intention the webgui would still try to
> > independently figure out if pakfire is locked or not?
> > But I don't think the webgui should try to figure that out by
> > itself
> > (as it did before by independently checking the existence of the
> > file)
> > but leave that part to pakfire itself.
> > When pakfire is started, it checks for the lock and exits if locked,
> > possibly with a message naming the process that holds the lock. The
> > webgui should pick that up to display to the user, and retry
> > launching pakfire until it actually does some work instead of
> > exiting because of a lock.
> > This way there is a single source of truth about the locking state
> > of
> > pakfire and that is pakfire itself; If the locking mechanism would
> > change again some time in the future for reasons we don't know yet
> > :-)
> > then the webgui should not be needing modifications anymore to
> > comply
> > with the new mechanism.
> >
> > > Further I want to check for more possible problems in the pakfire
> > > part
> > > of IPFire.
> > > Maybe we can find some more possible race conditions explaining
> > > mysterious errors.
> > >
> > > Did you find any indications for the missing file deletion in the
> > > 'normal' pakfire SW, yet?
> > Not yet. To be honest, I didn't look for it anymore due to work on
> > another bug. But I hope to investigate this further in the days to
> > come.
> >
> > Robin
> >
> > >
> > > Bernhard
> > >
> > > > Thanks for the good work!
> > > >
> > > > Robin
> > > >
> > > > Bernhard Bitsch wrote on Mon 19-09-2022 at 15:32 [+0200]:
> > > > > Hi Robin,
> > > > >
> > > > > we missed something. The WebGUI doesn't know about our file
> > > > > locking.
> > > > > :(
> > > > > So it waits for deletion of the lock file.
> > > > > I'll work on that asap.
> > > > >
> > > > > Bernhard
> > > > >
> > > > > On 19.09.2022 at 00:58, Robin Roevens wrote:
> > > > > > Hi Bernhard
> > > > > >
> > > > > > I have good news. For now my pakfire status check in Zabbix
> > > > > > is
> > > > > > still
> > > > > > correctly running and pakfire is still functioning.
> > > > > >
> > > > > > I do see a few missing results now and then.
> > > > > > Every 10 minutes I should get the output of pakfire status
> > > > > > in
> > > > > > Zabbix.
> > > > > > But once in a while I'm missing a result. So I assume those
> > > > > > are the points where the previous pakfire version would have
> > > > > > left the lockfile behind and stopped functioning.
> > > > > > Now a subsequent call succeeds again, and both the check and
> > > > > > pakfire keep on working.
> > > > > > One time I'm missing 2 subsequent data points (02:50 and
> > > > > > 03:00)
> > > > > > but
> > > > > > at
> > > > > > 03:10 Zabbix again got new data from pakfire.
> > > > > >
> > > > > > So I think the flock method is a success!
> > > > > >
> > > > > > The pakfire process sometimes terminating unexpectedly
> > > > > > still
> > > > > > happens,
> > > > > > but this now no longer breaks pakfire.
> > > > > > As I'm still suspecting sudo, I think the most elegant
> > > > > > solution to that would be to allow 'pakfire status' to be run
> > > > > > as a non-privileged user without using sudo. But I will try to
> > > > > > implement that myself soon and publish it to the devel list
> > > > > > (if that indeed solves this problem).
> > > > > >
> > > > > > The most important thing is that pakfire no longer breaks,
> > > > > > which
> > > > > > seems
> > > > > > to be fixed with the flock method.
> > > > > >
> > > > > > Anyway, I will leave this running now for about a week and
> > > > > > see
> > > > > > if
> > > > > > pakfire still won't break. But I'm already quite sure that
> > > > > > is
> > > > > > solved
> > > > > > now.
> > > > > >
> > > > > > Thanks for this implementation
> > > > > >
> > > > > > Robin
> > > > > >
> > > > > > Bernhard Bitsch wrote on Sun 18-09-2022 at 02:00 [+0200]:
> > > > > > > Hi Robin,
> > > > > > >
> > > > > > > On 18.09.2022 at 00:23, Robin Roevens wrote:
> > > > > > > > Hi Bernhard
> > > > > > > >
> > > > > > > > Thanks. I installed it on my IPFire mini appliance. If
> > > > > > > > tomorrow
> > > > > > > > I
> > > > > > > > still
> > > > > > > > have a working check in Zabbix, it will already be a
> > > > > > > > success as
> > > > > > > > the
> > > > > > > > original version always leaves a stale lockfile within
> > > > > > > > 24h.
> > > > > > > > But for safety I think it is best to have it run for at
> > > > > > > > least a week or so without problems.
> > > > > > > >
> > > > > > > > The quick tests I performed manually indeed showed
> > > > > > > > pakfire refusing to start when another instance is
> > > > > > > > running.
> > > > > > > >
> > > > > > > > One thing I noticed is that I now get
> > > > > > > > ---
> > > > > > > > Use of uninitialized value $ARGV[1] in string at /opt/pakfire/pakfire line 344.
> > > > > > > > Use of uninitialized value $ARGV[1] in string at /opt/pakfire/pakfire line 346.
> > > > > > > > ---
> > > > > > > > when performing 'pakfire list', but as far as I can see
> > > > > > > > this was already there and is only visible now due to
> > > > > > > > 'use warnings'. So that is not in scope here.
> > > > > > > >
> > > > > > >
> > > > > > > Sorry, I didn't scroll up the output while testing.
> > > > > > > Yes, these issues were there already. I generally use
> > > > > > > warnings because of the complexity of Perl; you are never
> > > > > > > aware of all possible interpretations of your written code.
> > > > > > > I'm just doing a little change. As far as I see, all
> > > > > > > options
> > > > > > > are
> > > > > > > handled
> > > > > > > at the start of the program. If these options are deleted
> > > > > > > from
> > > > > > > the
> > > > > > > arguments ( @ARGV ), the rest of the program works on an
> > > > > > > array
> > > > > > > containing only command ($ARGV[0]) and parameters. So the
> > > > > > > checks
> > > > > > > for
> > > > > > > options can be eliminated.
> > > > > > >
> > > > > > > > And in a final version of course a message telling the
> > > > > > > > user
> > > > > > > > why
> > > > > > > > it
> > > > > > > > exited prematurely, should also be displayed. But that
> > > > > > > > is a
> > > > > > > > detail
> > > > > > > > for
> > > > > > > > later.
> > > > > > > >
> > > > > > >
> > > > > > > Agreed.
> > > > > > >
> > > > > > > Regards
> > > > > > > Bernhard
> > > > > > >
> > > > > > > > Regards
> > > > > > > > Robin
> > > > > > > >
> > > > > > > > Bernhard Bitsch wrote on Sat 17-09-2022 at 23:11 [+0200]:
> > > > > > > > >
> > > > > > > > > Hi Robin,
> > > > > > > > >
> > > > > > > > > as announced, I've made a version with the flock()
> > > > > > > > > function.
> > > > > > > > > Just unpack the attached archive and copy it to the
> > > > > > > > > /opt/pakfire directory.
> > > > > > > > > My tests with the locking simulation look good. I
> > > > > > > > > will
> > > > > > > > > look
> > > > > > > > > into
> > > > > > > > > other
> > > > > > > > > IPC functions for handling the race condition.
> > > > > > > > >
> > > > > > > > > BTW, do you know the reason for the waits in the
> > > > > > > > > pakfire-update and pakfire-upgrade functionality?
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Bernhard
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>