From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bernhard Bitsch
To: development@lists.ipfire.org
Subject: Re: Stale pakfire lock-file causing pakfire to no longer work
Date: Thu, 15 Sep 2022 21:09:30 +0200
Message-ID: <1fa562fd-03fb-458b-07c6-3a2558e5a310@ipfire.org>
In-Reply-To:
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============4777748847767972255=="
List-Id:

--===============4777748847767972255==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

Agreed (see my other post).

On 15.09.2022 at 21:01, Robin Roevens wrote:
> Hi Peter
>
> This is definitely _not_ a show-stopper for CU 170, as the problem has
> been present in pakfire since the lock-file was introduced in commit
> https://git.ipfire.org/?p=ipfire-2.x.git;a=commit;h=d6c2e6715575c4d531f1302ab6c7368329da8bd4
> (24/05/21).
>
> I noticed this problem back then but didn't investigate it properly
> until now. In the meantime nobody else seems to have noticed or
> reported it here, in Bugzilla, on the forum, or on my GitHub page for
> my Zabbix template, so I can only assume it is quite obscure and
> possibly more easily triggered on an IPFire mini appliance (which is
> where I see the problem) than on higher-end hardware.
> And yes, like all race conditions, it depends on speed/performance.
> So I see no reason to delay CU 170 for this, as it has been present
> since CU 158.
>
> Regards
> Robin
>

Regards
Bernhard

> Peter Müller wrote on Thu 15-09-2022 at 07:39 [+0000]:
>> Hello Robin,
>>
>> thank you for your detailed e-mail.
>>
>> Just to ensure I did not misunderstand/overlook anything: Is this bug
>> a show-stopper for the release of Core Update 170? I.e., does it
>> prevent (some) IPFire installations from conducting further Pakfire
>> tasks?
>>
>> Thanks, and best regards,
>> Peter Müller
>>
>>
>>> Hi all
>>>
>>> Since the introduction of the lock-file /tmp/pakfire_lock in pakfire,
>>> I have a problem with monitoring 'pakfire status' using Zabbix.
>>>
>>> Every 10 minutes, I execute "sudo /opt/pakfire/pakfire status" using
>>> the Zabbix Agent (which runs as user 'zabbix'). (This check was
>>> actually implemented by Alex back when he maintained the zabbix_agent
>>> addon.)
>>> This works correctly for a while, until pakfire suddenly refuses to
>>> start because /tmp/pakfire_lock is still present. But no (old)
>>> pakfire process is active anymore, and the lock-file is never cleared.
>>> I have to delete it manually to make pakfire work again for a while.
>>>
>>> Zabbix agent has a built-in timeout of 30s when waiting for the output
>>> of a called process; if the process has not exited by then, it gets
>>> killed.
>>> At first I thought that could be the problem, so I modified the
>>> check so that instead of Zabbix agent calling pakfire, it calls a
>>> custom script which in turn spawns a background process for pakfire,
>>> with the output redirected to zabbix_sender (a utility to send data
>>> directly to Zabbix, bypassing the agent). This way the agent won't
>>> kill the pakfire process, as the custom script finishes almost
>>> instantly and the agent itself does not know of the spawned pakfire
>>> process.
>>> Then, when the background pakfire process finishes, zabbix_sender just
>>> sends the output to Zabbix, and this works without any timeout. So if
>>> pakfire were to hang, it would stay hung.
>>> But with this method I get the exact same result: it works correctly
>>> for a while, until suddenly the lock-file is not cleared and pakfire
>>> won't start anymore.
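The detached-wrapper approach described above could look roughly like the following. This is only a minimal Python sketch and not Robin's actual script: the function names are made up for illustration, the server, host and item key are supplied by the caller, and it assumes nothing about zabbix_sender beyond its `-z`, `-s`, `-k` and `-o` options.

```python
import shlex
import subprocess


def build_pipeline(server: str, host: str, key: str) -> str:
    """Build a shell command that runs 'pakfire status' and hands the
    captured output to zabbix_sender as the item value."""
    status_cmd = "sudo /opt/pakfire/pakfire status"  # command from the thread
    return (
        f"zabbix_sender -z {shlex.quote(server)} -s {shlex.quote(host)} "
        f'-k {shlex.quote(key)} -o "$({status_cmd} 2>&1)"'
    )


def spawn_detached(shell_command: str) -> subprocess.Popen:
    """Start the command in its own session and return immediately,
    so the calling agent never waits for (or kills) it."""
    return subprocess.Popen(
        ["sh", "-c", shell_command],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # leave the caller's session and process group
    )
```

A wrapper would then call something like `spawn_detached(build_pipeline("zabbix.example.org", "ipfire", "pakfire.status"))` and exit; all three values there are placeholders.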
>>>
>>> I have tried to reproduce this behaviour manually by killing pakfire
>>> aggressively while it is busy, and by executing pakfire many times in
>>> quick succession and in parallel, but I failed to reproduce it. So I
>>> have no idea why this happens when pakfire is called unattended by
>>> Zabbix.
>>>
>>> The only possible clue I found is this line in the agent logfile (when
>>> still using the 'normal' method of letting the agent call pakfire
>>> directly):
>>> failed to kill [sudo /opt/pakfire/pakfire status]: [1] Operation not permitted
>>> which, according to some Chinese blogs I found, could be caused by
>>> sudo bug 447:
>>> https://blog.famzah.net/2010/11/01/sudo-hangs-and-leaves-the-executed-program-as-zombie/
>>> https://bugzilla.sudo.ws/show_bug.cgi?id=447
>>> However, that bug should no longer be present in sudo 1.9, which is
>>> currently shipped with IPFire.
>>> Despite that, I currently suspect sudo to be the culprit.
>>>
>>> So I would like to propose a change to pakfire and its permissions to
>>> allow a non-root user to execute pakfire, and then have pakfire itself
>>> check whether the current user is root, allowing informational
>>> commands like 'status' to be executed by a non-root user (all db files
>>> are world-readable anyway).
>>> This way, sudo is no longer required for Zabbix to call 'pakfire
>>> status', which will hopefully fix the problem.
>>>
>>> Alternatively, we could record the pid of the current process during
>>> lock-file creation and have a new pakfire process check whether that
>>> pid still exists; if not, it writes its own pid to the lock-file and
>>> continues work instead of bailing out. But I'm not sure how to
>>> implement this without again risking race conditions when multiple
>>> pakfire executions are performed in parallel.
>>>
>>> Or does anyone have better ideas to (try to) fix this?
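For the lock-file idea above, one way to avoid both stale locks and the check-then-write race might be to let the kernel hold the lock instead of relying on the file's existence. The following is a minimal Python sketch (pakfire itself is written in Perl) that swaps in an advisory flock(2) lock for the pid-liveness check; the pid is still written, but only for information. The function names are hypothetical and the lock path is passed in by the caller.

```python
import fcntl
import os


def try_lock(path: str):
    """Return a file descriptor holding the lock, or None if another
    live process currently holds it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Non-blocking exclusive lock. The kernel drops it when the holder
        # exits or is killed, so a leftover file never blocks a new run.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    # We own the lock: record our pid for anyone inspecting the file.
    os.ftruncate(fd, 0)
    os.write(fd, f"{os.getpid()}\n".encode())
    return fd


def release(fd: int) -> None:
    """Release the lock; the file itself is deliberately left in place."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

The file is not deleted on release because unlinking it would reopen a window in which two processes lock two different files with the same name. Whether this fits pakfire's existing handling of /tmp/pakfire_lock is an open question; it is meant as a starting point for discussion only.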
>>>
>>> Regards
>>> Robin
>>>
>>
>

--===============4777748847767972255==--