Hello Michael,
agreed. I will keep an eye on the mails emitted from the nightly builders and delete the ccache, if necessary.
If so, sending a short notice to this mailing list to inform fellow developers on this issue should be sufficient AFAIC, since we are only dealing with "next".
Thanks, and best regards, Peter Müller
Hello,
Okay. That leaves us with the question of whether we have corrupted the ccache on the nightly builders (or any others).
Since ccache might be unaware of the seed, we might have mixed files in the cache.
If builds still fail after reverting the RANDSTRUCT patch, we might need to wipe the cache.
-Michael
On 9 Aug 2022, at 10:28, Peter Müller peter.mueller@ipfire.org wrote:
Hello Arne,
thank you very much for reporting back.
Okay, then I will put the slab cache patch in again and leave randstruct disabled.
Thanks, and best regards, Peter Müller
A fresh build with an empty ccache also boots with the slab cache patch, so RANDSTRUCT should be the real problem.
Arne
On 2022-08-09 08:23, Arne Fitzenreiter wrote:
On 2022-08-08 17:47, Peter Müller wrote:
Hello Arne,
thanks for reporting back.
This means the slab cache patch is the problem.
I'm not sure. I fear it could be RANDSTRUCT: after a kernel version update, the first build does not use the ccache, and after a small config change it could break if some parts of the kernel are taken from the cache and others are not.
At the moment I am testing a clean build without ccache but with the slab cache patch enabled. If this works, it is the RANDSTRUCT change.
Arne
Unfortunately, my local ccache appears to be completely messed up now, so I will have to start with a clean cache. It will probably take me until tomorrow to have some test results ready.
Will keep you updated.
Thanks, and best regards, Peter Müller
With this nightly
https://nightly.ipfire.org/next/2022-08-06%2007:45:02%20+0000-43df4a03/
the kernel 5.15.59 boots on real hardware (x86_64 and aarch64).

After

commit 06b4164dfe269704976b52421edbbbdf3b345679
Author: Peter Müller peter.mueller@ipfire.org
Date: Mon Aug 1 17:39:59 2022 +0000

    linux: Do not allow slab caches to be merged

it doesn't boot anymore (also tested on x86_64 and aarch64).
Arne
On 2022-08-08 12:22, Michael Tremer wrote:
> Hello,
>
>> On 8 Aug 2022, at 11:16, Peter Müller peter.mueller@ipfire.org wrote:
>>
>> Hello Michael, hello Arne,
>>
>> just a quick reply: I think we are dealing with the combination of two issues here,
>> as kernel 5.15.59 without slab cache merging disabled won't even boot in a VM (the
>> screen stays blank indefinitely), and it crashes straight away with the slab cache
>> merging patch.
>>
>> Since kernel 5.15.57 is running perfectly fine here with randstruct enabled, and has
>> been for days, I just reverted both the update to 5.15.59 and the slab cache patch.
>> For the time being, I would leave randstruct enabled, since it does not seem to be a
>> root cause for whatever bug(s) we are dealing with at the moment.
>
> Is that from the first build or a consecutive one?
>
>> @Arne: Were you able to boot 5.15.59 successfully on hardware? If so, did it also
>> boot properly in a VirtualBox VM?
>>
>> Apologies for this coming up so unexpected.
>
> Well, things break. We should however be fast to have at least a
> booting kernel in the tree so that we won’t crash any more systems.
>
> And if that requires to revert both patches until we know for certain
> which one is the bad one, I find that the best option.
>
> -Michael
>
>>
>> Thanks, and best regards,
>> Peter Müller
>>
>>> Hello,
>>>
>>> You seem to have a very classic NULL pointer dereference.
>>>
>>> Something is trying to follow a NULL pointer. And that isn’t possible.
>>>
>>> Now it is interesting to know why that is. The cap_capable function hasn’t been touched in the 5.15 tree in a while. The same goes for ns_capable.
>>>
>>> I would therefore suspect that this is some issue from the RANDSTRUCT plugin which seems to be incompatible with ccache.
>>>
>>> If you have built a kernel with a random seed for the first time, that will be put into the cache. If the next build is unmodified, the kernel will come out of the cache and will be exactly the same as the previous build.
>>>
>>> If you however modify some parts of the kernel (a minor release for example) you will only compile the changed parts BUT with a different seed for the randstruct plugin.
>>>
>>> And I suspect that this has happened here where your code is now simply reading the wrong memory.
>>>
>>> I would recommend reverting the RANDSTRUCT patch and that should allow you to have a proper image again.
>>>
>>> If you want to keep that, the only option would be to disable the ccache for the kernel. The kernel is however one of the largest packages and ccache works really really well here. We can discuss this if we have identified RANDSTRUCT to be the culprit.
>>>
>>> -Michael
>>>
>>>> On 7 Aug 2022, at 19:08, Peter Müller peter.mueller@ipfire.org wrote:
>>>>
>>>> Hello *,
>>>>
>>>> enclosed is a screenshot of what booting the installer for Core Update 170 (dirty)
>>>> with kernel 5.15.57 and slab merging disabled looks like. With kernel 5.15.59, the
>>>> VM screen stays blank, so I had to revert this to get some results.
>>>>
>>>> Frankly, I don't see why the kernel suddenly does not know anything about efivarfs
>>>> anymore, and what's sunrpc got to do with it. For the latter,
>>>> /build/lib/modules/5.15.57-ipfire/kernel/net/sunrpc/auth_gss/rpcsec_gss_krb5.ko.xz
>>>> is still there, just as it has been in C169 before.
>>>>
>>>> Any ideas are appreciated. :-)
>>>>
>>>> Thanks, and best regards,
>>>> Peter Müller
>>>>
>>>>
>>>>> Hello all, especially Arne,
>>>>>
>>>>> today, I upgraded to "IPFire 2.27 - Core Update 170 Development Build: next/06b4164d",
>>>>> which primarily comes with Linux 5.15.59 and the slab cache merging disabled. On
>>>>> my physical testing hardware, the boot process stalled after several kernel trace
>>>>> message blocks being displayed.
>>>>>
>>>>> Unfortunately, I was unable to recover them in detail, but they occurred fairly
>>>>> early, roughly around the mounting of the root file system. Since the machine is
>>>>> semi-productive (we all test in production, don't we? ;-) ), I went back to C169
>>>>> and will now investigate further which change broke the update.
>>>>>
>>>>> An earlier version of Core Update 170 (commit 668cf4c0d0c2dbbc607716956daace413837a8da,
>>>>> I believe, but it was definitely after the randstruct changes) ran fine for days here,
>>>>> so it must be a pretty recent change. Will keep you updated.
>>>>>
>>>>> Thanks, and best regards,
>>>>> Peter Müller
>>>> <screenshot_c170_dirty_crash_on_boot_sunrpc_efivarfs.png>
>>>
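
To make the suspected failure mode concrete, here is a minimal Python sketch of the scenario discussed in this thread: a compile cache whose key does not cover the layout seed. Everything in it is hypothetical and for illustration only. `ToyCache`, `struct_layout`, the field names, the file names and the integer seeds are made up, and the rotation is merely a toy stand-in for what the RANDSTRUCT plugin does. Real ccache hashing is more involved and is not modelled here.

```python
import hashlib

FIELDS = ["uid", "gid", "caps", "flags"]  # hypothetical struct members


def struct_layout(seed):
    """Toy stand-in for layout randomisation: rotate the field order by the seed."""
    shift = seed % len(FIELDS)
    return tuple(FIELDS[shift:] + FIELDS[:shift])


class ToyCache:
    """Compile cache keyed on unit name and source text only (assumption: the seed is NOT in the key)."""

    def __init__(self):
        self.store = {}
        self.hits = 0

    def compile(self, name, source, seed):
        key = hashlib.sha256((name + "\0" + source).encode()).hexdigest()
        if key in self.store:
            self.hits += 1  # cached object keeps the layout of the seed it was built with
            return self.store[key]
        obj = {"unit": name, "layout": struct_layout(seed)}
        self.store[key] = obj
        return obj


def build(cache, sources, seed):
    """Compile every unit through the cache, as one build run would."""
    return {name: cache.compile(name, src, seed) for name, src in sources.items()}


cache = ToyCache()
# First build: everything is compiled with seed 1 and stored in the cache.
first = build(cache, {"cred.c": "v1", "capability.c": "v1"}, seed=1)
# Minor update: only capability.c changes, and the build uses a new seed.
second = build(cache, {"cred.c": "v1", "capability.c": "v2"}, seed=2)

for name, obj in second.items():
    print(name, obj["layout"])
print("units disagree on layout:",
      second["cred.c"]["layout"] != second["capability.c"]["layout"])
```

In the second build, cred.c is served from the cache with the old field order while capability.c is compiled against the new one. That is the kind of mismatch that would leave code reading the wrong memory, and it is why wiping the cache (or not caching the kernel at all) comes up as a remedy.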