public inbox for development@lists.ipfire.org
* Stale pakfire lock-file causing pakfire to no longer work
From: Robin Roevens @ 2022-09-14 19:48 UTC
  To: development

Hi all

Since the introduction of the /tmp/pakfire_lock-file in pakfire, I have
a problem with monitoring 'pakfire status' using Zabbix.

Every 10 minutes, I execute "sudo /opt/pakfire/pakfire status" using
the Zabbix Agent, which runs as user 'zabbix'. (This check was
actually implemented by Alex back when he maintained the zabbix_agent
addon.)
This works correctly for a while, until pakfire suddenly refuses to
start because /tmp/pakfire_lock is still present. But no (old) pakfire
process is active anymore and the lockfile is never cleared. I have to
delete it manually to make pakfire work again for a while.

The Zabbix agent has a built-in timeout of 30s when waiting for the
output of a called process; if the process has not exited by then, it
gets killed.
At first I thought that could be the problem, so I modified the check:
instead of the Zabbix agent calling pakfire, it calls a custom script,
which in turn spawns a background process for pakfire with the output
redirected to zabbix_sender (a utility that sends data directly to
Zabbix, bypassing the agent). This way the agent won't kill the
pakfire process, as the custom script finishes almost instantly and the
agent itself does not know about the spawned pakfire process.
When the background pakfire process finishes, zabbix_sender simply
sends the output to Zabbix, and this works without any timeout. So if
pakfire were to hang, it would stay hung.
But with this method I get the exact same result: it works correctly
for a while, until suddenly the lockfile is not cleared and pakfire
won't start anymore.

I have tried to reproduce this behaviour manually by killing pakfire
aggressively while it is busy, and by executing pakfire many times in
quick succession and in parallel, but I failed to reproduce it. So I
have no idea why it happens when pakfire is called unattended by
Zabbix.

The only possible clue I found is this line in the agent logfile (when
still using the 'normal' method of letting the agent call pakfire
directly):
failed to kill [sudo /opt/pakfire/pakfire status]: [1] Operation not permitted
which, according to some Chinese blogs I found, could be caused by sudo
bug 447:
https://blog.famzah.net/2010/11/01/sudo-hangs-and-leaves-the-executed-program-as-zombie/
https://bugzilla.sudo.ws/show_bug.cgi?id=447
However, that bug should no longer be present in sudo 1.9, which is
currently shipped with IPFire.
Despite that, I currently do suspect sudo to be the culprit.

So I would like to propose a change to pakfire and its permissions, to
allow for a non-root user to execute pakfire, and then within pakfire
itself, check if the current user is root or not, and allow
informational commands like 'status' to be executed by a non-root user
(all db files are world-readable anyway).
This way, sudo is no longer required for Zabbix to call 'pakfire
status', which I hope would fix the problem.

Alternatively we could record the pid of the current process during
lock-file creation, and have a new pakfire process check if that pid
still exists; if not, dump its own pid in the lockfile and continue
work instead of bailing out. But I'm not sure how to implement this
without again having a chance for some race conditions when multiple
pakfire executions are performed in parallel. 
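
A minimal shell sketch of this PID-in-lockfile idea, for illustration
only: pakfire itself is Perl and this is not its code. The path
/tmp/pakfire_lock_demo and the function names are hypothetical.
'kill -0' delivers no signal and only tests whether the PID can be
signalled, so the sketch assumes the checker is allowed to signal the
lock owner (for example, both run as root). The window between removing
a stale lock and creating the new one stays open, so the race mentioned
above is narrowed, not eliminated.

```shell
#!/bin/sh
# Hypothetical demo path; the lock discussed in the thread is /tmp/pakfire_lock.
LOCKFILE="${LOCKFILE:-/tmp/pakfire_lock_demo}"

# Return 0 if the lock file exists but its recorded owner is gone.
lock_is_stale() {
    [ -f "$LOCKFILE" ] || return 1
    pid=$(cat "$LOCKFILE" 2>/dev/null)
    case "$pid" in
        ''|*[!0-9]*) return 0 ;;   # empty or non-numeric content: treat as stale
    esac
    # kill -0 sends nothing; it succeeds only if the PID can be signalled.
    if kill -0 "$pid" 2>/dev/null; then
        return 1                   # owner still alive: lock is valid
    fi
    return 0
}

# Try to take the lock; return 0 on success, non-zero if it is held.
acquire_pid_lock() {
    if lock_is_stale; then
        rm -f "$LOCKFILE"          # race window: another process may do the same
    fi
    # With noclobber (set -C) the redirection fails if the file exists,
    # so the existence check and the creation are one step.
    ( set -C; echo "$$" > "$LOCKFILE" ) 2>/dev/null
}

release_pid_lock() {
    rm -f "$LOCKFILE"
}
```

A caller would run 'acquire_pid_lock || exit 1' before doing any work
and 'release_pid_lock' afterwards.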

Or does anyone have a better idea for (trying to) fix this?

Regards
Robin


* Re: Stale pakfire lock-file causing pakfire to no longer work
From: Peter Müller @ 2022-09-15  7:39 UTC
  To: development

Hello Robin,

thank you for your detailed e-mail.

Just to ensure I did not misunderstand or overlook anything: Is this bug
a show-stopper for the release of Core Update 170? I.e., does it prevent
(some) IPFire installations from conducting further Pakfire tasks?

Thanks, and best regards,
Peter Müller



* Re: Stale pakfire lock-file causing pakfire to no longer work
From: Bernhard Bitsch @ 2022-09-15 11:48 UTC
  To: development

Hi all,

as an 'old real time programmer', this strongly reminds me of
Dijkstra/Hoare's "Dining philosophers problem".

The check for the presence of the lockfile and its creation are not
'atomic'. This means two programs can run in parallel.

I'll investigate this further. But the deletion of the lock should
happen anyway, as far as I have seen so far.

Regards,
Bernhard


* Re: Stale pakfire lock-file causing pakfire to no longer work
From: Robin Roevens @ 2022-09-15 19:01 UTC
  To: development

Hi Peter

This is definitely _not_ a show-stopper for CU 170 as this is already
present in pakfire since the lock-file was introduced in commit
https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=d6c2e6715575c4d531f1302ab6c7368329da8bd4
(24/05/21)

I noticed this problem back then but didn't investigate it properly
until now. In the meantime nobody else seems to have noticed or
reported this problem here, in Bugzilla, on the forum, or on the GitHub
page for my Zabbix template.
So I can only assume it is quite obscure and possibly more easily
triggered on an IPFire mini appliance (which is where I see the
problem) than on higher-end HW.

So I see no reason to delay CU 170 for this, as it was already present
since CU 158.

Regards
Robin


* Re: Stale pakfire lock-file causing pakfire to no longer work
From: Bernhard Bitsch @ 2022-09-15 19:09 UTC
  To: development

Agreed (see my other post).


On 15.09.2022 at 21:01, Robin Roevens wrote:
> Hi Peter
> 
> This is definitely _not_ a show-stopper for CU 170 as this is already
> present in pakfire since the lock-file was introduced in commit
> https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=d6c2e6715575c4d531f1302ab6c7368329da8bd4
> (24/05/21)
> 
> I noticed this problem back then but didn't investigate it properly
> until now. And since in the meantime nobody else seems to have noticed
> or reported this problem here, in bugzilla, the forum nor on my github
> page for my zabbix template.
> So I can only assume it is quite obscure and possibly easier triggered
> on an IPFire mini appliance (which is where I see the problem) than on
> higher-end HW.
>

And yes, it depends on speed/performance, as all race conditions do.

> So I see no reason to delay CU 170 for this, as it was already present
> since CU 158.
> 
> Regards
> Robin
> 

Regards
Bernhard


* Re: Stale pakfire lock-file causing pakfire to no longer work
From: Robin Roevens @ 2022-09-15 19:43 UTC
  To: development

Hi Bernhard

Bernhard Bitsch wrote on Thu 15-09-2022 at 13:48 [+0200]:
> Hi all,
> 
> as an 'old real time programmer' this reminds me deeply at 
> Dijkstra/Hoare's "Dining philosophers problem".
> 
> The check for presence of the lockfile and the generation of it are
> not 
> 'atomic'. Means two programs can run in parallel.
Indeed..
In a shell script, a more atomic approach would be to use a
lock-directory instead of a lockfile:
'mkdir' creates a directory only if it does not already exist; if it
does exist, mkdir fails with a non-zero exit code. So here we have both
checking and creating in one atomic operation.
This is better explained here:
https://wiki.bash-hackers.org/howto/mutex
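
A minimal sketch of such a lock-directory, assuming a POSIX shell. The
directory name /tmp/pakfire_lock_demo.d and the function names are
hypothetical and not part of pakfire.

```shell
#!/bin/sh
# Hypothetical lock directory used only for this demonstration.
LOCKDIR="${LOCKDIR:-/tmp/pakfire_lock_demo.d}"

# mkdir fails if the directory already exists, so the existence test
# and the creation happen in a single call.
acquire_dir_lock() {
    mkdir "$LOCKDIR" 2>/dev/null
}

release_dir_lock() {
    rmdir "$LOCKDIR" 2>/dev/null
}

# Run a command while holding the lock; refuse if it is already held.
with_lock() {
    acquire_dir_lock || {
        echo "another instance holds $LOCKDIR" >&2
        return 1
    }
    "$@"
    status=$?
    release_dir_lock
    return "$status"
}
```

For example, 'with_lock sleep 60' in one terminal makes a second
'with_lock true' in another terminal fail until the first one finishes.
A process that is killed before release_dir_lock runs still leaves the
directory behind, so this removes the check/create race but not the
stale-lock problem.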

I am not sure if this can be translated to Perl in an atomic way..
I did find this Perl code snippet, however:
---
use strict;
use warnings;
use Fcntl ':flock';

flock(DATA, LOCK_EX|LOCK_NB) or die "There can be only one! [$0]";


# mandatory line, flocking depends on DATA file handle
__DATA__
---
Which could be a possible solution, I think.

I also found this, which seems quite promising:
https://metacpan.org/pod/Script::Singleton
to perform locking by using shared memory.

> 
> I'll investigate this further. But the deletion of the lock should 
> happen anyways, as far I've seen till now.
True, it should be deleted always and as said before, I could not
reproduce this manually .. but my Zabbix agent seems to be able to
trigger this problem at least once every 24h on my IPFire mini
appliance, only by executing pakfire every 10 minutes. That is why I'm
suspecting the abnormal termination of pakfire, leaving the lockfile in
place, is actually caused by sudo.

On the other hand.. this can also happen when pakfire is running and
suddenly the power is cut.. then the lockfile will still be present
when the machine is back up.. So I think, if we stay with the lockfile,
we at least need some check for a stale lockfile, like checking if the
process that created the lockfile still exists or not and removing it
if not. 

Regards
Robin

> 
> Regards,
> Bernhard
> 


* Re: Stale pakfire lock-file causing pakfire to no longer work
From: Bernhard Bitsch @ 2022-09-15 20:03 UTC
  To: development

Hi Robin,


On 15.09.2022 at 21:43, Robin Roevens wrote:
> Hi Bernhard
> 
> Bernhard Bitsch schreef op do 15-09-2022 om 13:48 [+0200]:
>> Hi all,
>>
>> as an 'old real time programmer' this reminds me deeply at
>> Dijkstra/Hoare's "Dining philosophers problem".
>>
>> The check for presence of the lockfile and the generation of it are
>> not
>> 'atomic'. Means two programs can run in parallel.
> Indeed..
> In a shell script, a more atomic approach would be instead of using a
> lockfile, a lock-directory:
> 'mkdir' creates a directory only if it not already exists and if it
> does already exist, it will return an exit code. So here we have both
> checking and generating in one atomic operation.
> This is better explained here:
> https://wiki.bash-hackers.org/howto/mutex
> 
> Not sure if this can be translated to Perl in an atomic way..
> I did find this perl code snippet however:
> ---
> use strict;
> use warnings;
> use Fcntl ':flock';
> 
> flock(DATA, LOCK_EX|LOCK_NB) or die "There can be only one! [$0]";
> 
> 
> # mandatory line, flocking depends on DATA file handle
> __DATA__
> ---
> Which could be a possible solution, I think.
> 

Looks promising. Will look into this.

> I also found this, which seems quiet promising:
> https://metacpan.org/pod/Script::Singleton
> to perform locking by using shared memory.
> 
>>
>> I'll investigate this further. But the deletion of the lock should
>> happen anyways, as far I've seen till now.
> True, it should be deleted always and as said before, I could not
> reproduce this manually .. but my Zabbix agent seems to be able to
> trigger this problem at least once every 24h on my IPFire mini
> appliance, only by executing pakfire every 10 minutes. That is why I'm
> suspecting the abnormal termination of pakfire, leaving the lockfile in
> place, is actually caused by sudo.
> 
> On the other hand.. this can also happen when pakfire is running and
> suddenly the power is cut.. then the lockfile will still be present
> when the machine is back up.. So I think, if we stay with the lockfile,
> we at least need some check for a stale lockfile, like checking if the
> process that created the lockfile still exists or not and removing it
> if not.
> 

Because the lockfile is located in /tmp, I don't think it survives a reboot.

Regards
Bernhard


* Re: Stale pakfire lock-file causing pakfire to no longer work
From: Robin Roevens @ 2022-09-15 20:30 UTC
  To: development

Hi Bernhard


Bernhard Bitsch wrote on Thu 15-09-2022 at 22:03 [+0200]:
> Hi Robin,
> 
> 
> Am 15.09.2022 um 21:43 schrieb Robin Roevens:
> > Hi Bernhard
> > 
> > Bernhard Bitsch schreef op do 15-09-2022 om 13:48 [+0200]:
> > > Hi all,
> > > 
> > > as an 'old real time programmer' this reminds me deeply at
> > > Dijkstra/Hoare's "Dining philosophers problem".
> > > 
> > > The check for presence of the lockfile and the generation of it
> > > are
> > > not
> > > 'atomic'. Means two programs can run in parallel.
> > Indeed..
> > In a shell script, a more atomic approach would be instead of using
> > a
> > lockfile, a lock-directory:
> > 'mkdir' creates a directory only if it not already exists and if it
> > does already exist, it will return an exit code. So here we have
> > both
> > checking and generating in one atomic operation.
> > This is better explained here:
> > https://wiki.bash-hackers.org/howto/mutex
> > 
> > Not sure if this can be translated to Perl in an atomic way..
> > I did find this perl code snippet however:
> > ---
> > use strict;
> > use warnings;
> > use Fcntl ':flock';
> > 
> > flock(DATA, LOCK_EX|LOCK_NB) or die "There can be only one! [$0]";
> > 
> > 
> > # mandatory line, flocking depends on DATA file handle
> > __DATA__
> > ---
> > Which could be a possible solution, I think.
> > 
> 
> Looks promising. Will look into this.
> 
> > I also found this, which seems quiet promising:
> > https://metacpan.org/pod/Script::Singleton
> > to perform locking by using shared memory.


Maybe yet another approach (idea from here:
https://unix.stackexchange.com/a/594126 ) could be to actually check if
another process named 'pakfire' is active (using Proc::ProcessTable ?)
instead of using a lock(file). As pakfire is single-threaded, I think
this may just do the job?

> > 
> > > 
> > > I'll investigate this further. But the deletion of the lock
> > > should
> > > happen anyways, as far I've seen till now.
> > True, it should be deleted always and as said before, I could not
> > reproduce this manually .. but my Zabbix agent seems to be able to
> > trigger this problem at least once every 24h on my IPFire mini
> > appliance, only by executing pakfire every 10 minutes. That is why
> > I'm
> > suspecting the abnormal termination of pakfire, leaving the
> > lockfile in
> > place, is actually caused by sudo.
> > 
> > On the other hand.. this can also happen when pakfire is running
> > and
> > suddenly the power is cut.. then the lockfile will still be present
> > when the machine is back up.. So I think, if we stay with the
> > lockfile,
> > we at least need some check for a stale lockfile, like checking if
> > the
> > process that created the lockfile still exists or not and removing
> > it
> > if not.
> > 
> 
> Because the lockfile is located in /tmp, I don't think it survives a
> reboot.

Right, I missed that for a moment :-). 

Regards
Robin


* Re: Stale pakfire lock-file causing pakfire to no longer work
From: Bernhard Bitsch @ 2022-09-15 23:27 UTC
  To: development

Hi Robin,

thanks for your suggestions.
I am just 'playing' with the flock() solution.
It looks good so far, with a little program that does the following:

try to get the lock ( open(), flock() )
if successful, do the job ( just sleep 60s ) and release the lock ( close() )

Starting a couple of instances of this program shows only one program 
active at the same time.
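
The same experiment can be sketched in shell, assuming the flock(1)
utility from util-linux is installed. This is not the Perl test program
described above; the lock path and the function name run_locked are
hypothetical.

```shell
#!/bin/sh
# Hypothetical lock path used only for this demonstration.
LOCKFILE="${LOCKFILE:-/tmp/pakfire_flock_demo}"

# Run a command under an exclusive, non-blocking lock on $LOCKFILE.
run_locked() {
    (
        # -n: fail immediately instead of waiting if the lock is held
        flock -n 9 || { echo "already running" >&2; exit 1; }
        "$@"
    ) 9>"$LOCKFILE"   # fd 9 is opened for the subshell and closed when it ends
}
```

Running 'run_locked sleep 60' in several terminals at once lets only
the first instance proceed. The kernel drops the lock when the last
descriptor referring to that open file is closed, including when the
holder is killed, so a leftover file does not block the next run.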

I would prefer this solution, because the flock() functionality is close
to the theoretical 'semaphore' of Dijkstra and Hoare.

I hope to be able to integrate this into the pakfire program tomorrow.
I'll send you a copy for testing. If it is really 'only' a race
condition problem, you should be able to confirm that the issue is gone.

The further steps will be to present a patch and integrate it into the 
system.

If we don't succeed, we should create a ticket in bugzilla to discuss it 
further.


On 15.09.2022 at 22:30, Robin Roevens wrote:
> Hi Bernhard
> 
> 
> Bernhard Bitsch schreef op do 15-09-2022 om 22:03 [+0200]:
>> Hi Robin,
>>
>>
>> Am 15.09.2022 um 21:43 schrieb Robin Roevens:
>>> Hi Bernhard
>>>
>>> Bernhard Bitsch schreef op do 15-09-2022 om 13:48 [+0200]:
>>>> Hi all,
>>>>
>>>> as an 'old real time programmer' this reminds me deeply at
>>>> Dijkstra/Hoare's "Dining philosophers problem".
>>>>
>>>> The check for presence of the lockfile and the generation of it
>>>> are
>>>> not
>>>> 'atomic'. Means two programs can run in parallel.
>>> Indeed..
>>> In a shell script, a more atomic approach would be instead of using
>>> a
>>> lockfile, a lock-directory:
>>> 'mkdir' creates a directory only if it not already exists and if it
>>> does already exist, it will return an exit code. So here we have
>>> both
>>> checking and generating in one atomic operation.
>>> This is better explained here:
>>> https://wiki.bash-hackers.org/howto/mutex
>>>
>>> Not sure if this can be translated to Perl in an atomic way..
>>> I did find this perl code snippet however:
>>> ---
>>> use strict;
>>> use warnings;
>>> use Fcntl ':flock';
>>>
>>> flock(DATA, LOCK_EX|LOCK_NB) or die "There can be only one! [$0]";
>>>
>>>
>>> # mandatory line, flocking depends on DATA file handle
>>> __DATA__
>>> ---
>>> Which could be a possible solution, I think.
>>>
>>
>> Looks promising. Will look into this.
>>
>>> I also found this, which seems quiet promising:
>>> https://metacpan.org/pod/Script::Singleton
>>> to perform locking by using shared memory.
> 
> 
> Maybe yet another approach (idea from here:
> https://unix.stackexchange.com/a/594126 ) could be to actually check if
> another process named 'pakfire' is active (using Proc::ProcessTable ?)
> instead of using a lock(file). As pakfire is single-threaded, I think
> this may just do the job?
> 

I suspect that only looking at the process table just introduces
another race condition.

Regards,
Bernhard
>>>
>>>>
>>>> I'll investigate this further. But the deletion of the lock
>>>> should
>>>> happen anyway, as far as I've seen so far.
>>> True, it should be deleted always and as said before, I could not
>>> reproduce this manually .. but my Zabbix agent seems to be able to
>>> trigger this problem at least once every 24h on my IPFire mini
>>> appliance, only by executing pakfire every 10 minutes. That is why
>>> I'm
>>> suspecting the abnormal termination of pakfire, leaving the
>>> lockfile in
>>> place, is actually caused by sudo.
>>>
>>> On the other hand.. this can also happen when pakfire is running
>>> and
>>> suddenly the power is cut.. then the lockfile will still be present
>>> when the machine is back up.. So I think, if we stay with the
>>> lockfile,
>>> we at least need some check for a stale lockfile, like checking if
>>> the
>>> process that created the lockfile still exists or not and removing
>>> it
>>> if not.
>>>
>>
>> Because the lockfile is located in /tmp, I don't think it survives a
>> reboot.
> 
> Right, I missed that for a moment :-).
> 
> Regards
> Robin
> 
>>
>> Regards
>> Bernhard
>>
>>> Regards
>>> Robin
>>>
>>>>
>>>> Regards,
>>>> Bernhard
>>>>
>>>
>>
> 

* Re: Stale pakfire lock-file causing pakfire to no longer work
  2022-09-15 23:27         ` Bernhard Bitsch
@ 2022-09-17 21:56           ` Robin Roevens
  0 siblings, 0 replies; 10+ messages in thread
From: Robin Roevens @ 2022-09-17 21:56 UTC (permalink / raw)
  To: development

Hi Bernhard

Bernhard Bitsch wrote on Fri 16-09-2022 at 01:27 [+0200]:
> Hi Robin,
> 
> thanks for your suggestions.
> I am just 'playing' with the flock() solution.
> Looks good so far. With a little program
> 
> try to get lock ( open(), flock() )
> if successful do the job ( just sleep 60s ) and release the lock 
> (close())
> 
> Starting a couple of instances of this program shows only one program
> active at the same time.
> 
> I would prefer this solution, because the flock() functionality is
> close to the theoretical 'semaphore' by Dijkstra and Hoare.
As far as I understand the method used by the Script::Singleton library,
it does the same, only using an identifier in shared memory instead of
a file.
But I think if flock does the job, it is indeed the preferred way, as it
only depends on the Fcntl library, already present by default in IPFire,
while Script::Singleton would require the Script::Singleton package and
also IPC::Shareable (which I don't think is currently shipped in
IPFire). I think we could easily skip Script::Singleton, as the code is
not that hard to duplicate and maintain directly in the pakfire code
(https://metacpan.org/dist/Script-Singleton/source/lib/Script/Singleton.pm),
but then IPC::Shareable would still have to be made available on IPFire.
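
To make the flock() idea concrete, here is a minimal sketch of how a
single-instance guard could look. It is not the actual pakfire patch:
the sub name acquire_lock and the lock path /tmp/pakfire_lock_demo are
made up for this example. The point it illustrates is that the kernel
drops the lock as soon as the process holding it ends, however it ends,
so a leftover file on disk no longer blocks the next run.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical lock path for this sketch; not necessarily what pakfire uses.
my $LOCKFILE = "/tmp/pakfire_lock_demo";

# Try to take an exclusive, non-blocking lock on $path.
# Returns the open filehandle on success, or undef if the lock is
# currently held through another open handle.
sub acquire_lock {
    my ($path) = @_;
    # '>>' creates the file if needed and never truncates it.
    open(my $fh, '>>', $path) or die "Cannot open $path: $!";
    unless (flock($fh, LOCK_EX | LOCK_NB)) {
        close($fh);
        return;
    }
    return $fh;
}

my $lock = acquire_lock($LOCKFILE);
if (defined $lock) {
    print "Lock acquired, doing the job...\n";
    # ... the actual work would go here ...
    close($lock);    # closing the handle releases the lock
}
else {
    # A real script would exit with an error at this point.
    print "Another instance is already running.\n";
}
```

One caveat with this approach: a child process that inherits the open
filehandle keeps the lock alive until it closes the handle or exits.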

> 
> I hope to be able to integrate this in the pakfire program tomorrow. 
> I'll send you a copy for testing. If it is really 'only' a race condition
> problem, you should be able to confirm that the issue is gone.
> 
> The further steps will be to present a patch and integrate it into
> the 
> system.
> 
> If we don't succeed, we should create a ticket in bugzilla to discuss
> it 
> further.

Sounds like a plan! I'm looking forward to testing your version of
pakfire.

> 
> 
> On 15.09.2022 at 22:30, Robin Roevens wrote:
> > Hi Bernhard
> > 
> > 
> > Bernhard Bitsch wrote on Thu 15-09-2022 at 22:03 [+0200]:
> > > Hi Robin,
> > > 
> > > 
> > > On 15.09.2022 at 21:43, Robin Roevens wrote:
> > > > Hi Bernhard
> > > > 
> > > > Bernhard Bitsch wrote on Thu 15-09-2022 at 13:48 [+0200]:
> > > > > Hi all,
> > > > > 
> > > > > as an 'old real time programmer' this reminds me strongly of
> > > > > Dijkstra/Hoare's "Dining philosophers problem".
> > > > > 
> > > > > The check for presence of the lockfile and the generation of
> > > > > it
> > > > > are
> > > > > not
> > > > > 'atomic'. Means two programs can run in parallel.
> > > > Indeed..
> > > > In a shell script, a more atomic approach would be instead of
> > > > using
> > > > a
> > > > lockfile, a lock-directory:
> > > > 'mkdir' creates a directory only if it does not already exist, and
> > > > if it
> > > > does already exist, it returns a non-zero exit code. So here we
> > > > have
> > > > both
> > > > checking and generating in one atomic operation.
> > > > This is better explained here:
> > > > https://wiki.bash-hackers.org/howto/mutex
> > > > 
> > > > Not sure if this can be translated to Perl in an atomic way..
> > > > I did find this perl code snippet however:
> > > > ---
> > > > use strict;
> > > > use warnings;
> > > > use Fcntl ':flock';
> > > > 
> > > > flock(DATA, LOCK_EX|LOCK_NB) or die "There can be only one!
> > > > [$0]";
> > > > 
> > > > 
> > > > # mandatory line, flocking depends on DATA file handle
> > > > __DATA__
> > > > ---
> > > > Which could be a possible solution, I think.
> > > > 
> > > 
> > > Looks promising. Will look into this.
> > > 
> > > > I also found this, which seems quite promising:
> > > > https://metacpan.org/pod/Script::Singleton
> > > > to perform locking by using shared memory.
> > 
> > 
> > Maybe yet another approach (idea from here:
> > https://unix.stackexchange.com/a/594126 ) could be to actually
> > check if
> > another process named 'pakfire' is active (using Proc::ProcessTable
> > ?)
> > instead of using a lock(file). As pakfire is single-threaded, I
> > think
> > this may just do the job?
> > 
> 
> I suspect, that only looking at the process table introduces just 
> another race condition.

I'm not certain about that. The process never has to actively set a
lock: as soon as it is started, it holds a 'lock' simply by being
listed in the process table, without running a single line of code.
Then, the first thing it actively does is check for another pakfire
process in the process table.
I can only see this going 'wrong' when 2 (or more) pakfire processes are
started simultaneously, where in the worst case all of them will decide
that there is already another process active and exit. But I don't think
that would really pose a problem, as a subsequent start of a single
pakfire instance should then just work again.
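
A rough sketch of that check, to show what I mean. Note that it reads
/proc directly instead of using Proc::ProcessTable, only so that the
example has no non-core dependencies; the sub name other_instances is
made up here, and it assumes a Linux /proc layout. Because pakfire is a
Perl script, its command line normally starts with the interpreter, so
the sketch looks at the script argument in that case. It deliberately
ignores wrappers such as sudo, whose first argument is 'sudo' itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the PIDs of all other processes that run a program or Perl
# script called $name. Hypothetical helper, Linux /proc only.
sub other_instances {
    my ($name) = @_;
    my @pids;
    opendir(my $dh, '/proc') or die "Cannot read /proc: $!";
    for my $pid (grep { /^\d+$/ } readdir($dh)) {
        next if $pid == $$;    # skip ourselves
        # The process may have exited in the meantime; just skip it then.
        open(my $fh, '<', "/proc/$pid/cmdline") or next;
        local $/;              # slurp the whole file
        my $cmdline = <$fh> // '';
        close($fh);
        # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
        my ($cmd, $arg1) = split /\0/, $cmdline;
        next unless defined $cmd;    # kernel threads have no cmdline
        # For "perl /path/to/script", compare against the script path.
        my $program = ($cmd =~ m{(^|/)perl[\d.]*$} && defined $arg1)
            ? $arg1
            : $cmd;
        push @pids, $pid if $program =~ m{(^|/)\Q$name\E$};
    }
    closedir($dh);
    return @pids;
}

my @others = other_instances('pakfire');
if (@others) {
    print "pakfire is already running (PID @others)\n";
}
else {
    print "No other pakfire instance found.\n";
}
```

Two instances started at the same moment would each see the other and
both back off, which is the 'worst case' described above.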

But as you said, let's try flock and if unsuccessful we can move this
discussion to bugzilla and try other methods.

Regards
Robin

> 
> Regards,
> Bernhard
> > > > 
> > > > > 
> > > > > I'll investigate this further. But the deletion of the lock
> > > > > should
> > > > > happen anyway, as far as I've seen so far.
> > > > True, it should be deleted always and as said before, I could
> > > > not
> > > > reproduce this manually .. but my Zabbix agent seems to be able
> > > > to
> > > > trigger this problem at least once every 24h on my IPFire mini
> > > > appliance, only by executing pakfire every 10 minutes. That is
> > > > why
> > > > I'm
> > > > suspecting the abnormal termination of pakfire, leaving the
> > > > lockfile in
> > > > place, is actually caused by sudo.
> > > > 
> > > > On the other hand.. this can also happen when pakfire is
> > > > running
> > > > and
> > > > suddenly the power is cut.. then the lockfile will still be
> > > > present
> > > > when the machine is back up.. So I think, if we stay with the
> > > > lockfile,
> > > > we at least need some check for a stale lockfile, like checking
> > > > if
> > > > the
> > > > process that created the lockfile still exists or not and
> > > > removing
> > > > it
> > > > if not.
> > > > 
> > > 
> > > Because the lockfile is located in /tmp, I don't think it
> > > survives a
> > > reboot.
> > 
> > Right, I missed that for a moment :-).
> > 
> > Regards
> > Robin
> > 
> > > 
> > > Regards
> > > Bernhard
> > > 
> > > > Regards
> > > > Robin
> > > > 
> > > > > 
> > > > > Regards,
> > > > > Bernhard
> > > > > 
> > > > 
> > > 
> > 
> 



Thread overview: 10+ messages
2022-09-14 19:48 Stale pakfire lock-file causing pakfire to no longer work Robin Roevens
2022-09-15  7:39 ` Peter Müller
2022-09-15 19:01   ` Robin Roevens
2022-09-15 19:09     ` Bernhard Bitsch
2022-09-15 11:48 ` Bernhard Bitsch
2022-09-15 19:43   ` Robin Roevens
2022-09-15 20:03     ` Bernhard Bitsch
2022-09-15 20:30       ` Robin Roevens
2022-09-15 23:27         ` Bernhard Bitsch
2022-09-17 21:56           ` Robin Roevens
