With this
https://nightly.ipfire.org/next/2022-08-06%2007:45:02%20+0000-43df4a03/
nightly the kernel 5.15.59 boots on real hardware (x86_64 and aarch64)

After

commit 06b4164dfe269704976b52421edbbbdf3b345679
Author: Peter Müller
Date:   Mon Aug 1 17:39:59 2022 +0000

    linux: Do not allow slab caches to be merged

it doesn't boot anymore. (also tested on x86_64 and aarch64)

Arne

On 2022-08-08 12:22, Michael Tremer wrote:
> Hello,
>
>> On 8 Aug 2022, at 11:16, Peter Müller wrote:
>>
>> Hello Michael, hello Arne,
>>
>> just a quick reply: I think we are dealing with the combination of two
>> issues here, as kernel 5.15.59 without slab cache merging disabled won't
>> even boot in a VM (the screen stays blank indefinitely), and it crashes
>> straight away with the slab cache merging patch.
>>
>> Since kernel 5.15.57 is running perfectly fine here with randstruct
>> enabled, and has been for days, I just reverted both the update to
>> 5.15.59 and the slab cache patch. For the time being, I would leave
>> randstruct enabled, since it does not seem to be a root cause for
>> whatever bug(s) we are dealing with at the moment.
>
> Is that from the first build or a consecutive one?
>
>> @Arne: Were you able to boot 5.15.59 successfully on hardware? If so,
>> did it also boot properly in a VirtualBox VM?
>>
>> Apologies for this coming up so unexpectedly.
>
> Well, things break. We should however be fast to have at least a
> booting kernel in the tree so that we won’t crash any more systems.
>
> And if that requires reverting both patches until we know for certain
> which one is the bad one, I find that the best option.
>
> -Michael
>
>>
>> Thanks, and best regards,
>> Peter Müller
>>
>>> Hello,
>>>
>>> You seem to have a very classic NULL pointer dereference.
>>>
>>> Something is trying to follow a NULL pointer. And that isn’t
>>> possible.
>>>
>>> Now it is interesting to know why that is.
>>> The cap_capable function hasn’t been touched in the 5.15 tree in a
>>> while. The same goes for ns_capable.
>>>
>>> I would therefore suspect that this is some issue from the RANDSTRUCT
>>> plugin which seems to be incompatible with ccache.
>>>
>>> If you have built a kernel with a random seed for the first time,
>>> that will be put into the cache. If the next build is unmodified, the
>>> kernel will come out of the cache and will be exactly the same as the
>>> previous build.
>>>
>>> If you however modify some parts of the kernel (a minor release for
>>> example) you will only compile the changed parts BUT with a different
>>> seed for the randstruct plugin.
>>>
>>> And I suspect that this has happened here where your code is now
>>> simply reading the wrong memory.
>>>
>>> I would recommend reverting the RANDSTRUCT patch and that should
>>> allow you to have a proper image again.
>>>
>>> If you want to keep that, the only option would be to disable the
>>> ccache for the kernel. The kernel is however one of the largest
>>> packages and ccache works really really well here. We can discuss
>>> this if we have identified RANDSTRUCT to be the culprit.
>>>
>>> -Michael
>>>
>>>> On 7 Aug 2022, at 19:08, Peter Müller wrote:
>>>>
>>>> Hello *,
>>>>
>>>> enclosed is a screenshot of what booting the installer for Core
>>>> Update 170 (dirty) with kernel 5.15.57 and slab merging disabled
>>>> looks like. With kernel 5.15.59, the VM screen stays blank, so I had
>>>> to revert this to get some results.
>>>>
>>>> Frankly, I don't see why the kernel suddenly does not know anything
>>>> about efivarfs anymore, and what's sunrpc got to do with it. For the
>>>> latter,
>>>> /build/lib/modules/5.15.57-ipfire/kernel/net/sunrpc/auth_gss/rpcsec_gss_krb5.ko.xz
>>>> is still there, just as it has been in C169 before.
>>>>
>>>> Any ideas are appreciated.
>>>> :-)
>>>>
>>>> Thanks, and best regards,
>>>> Peter Müller
>>>>
>>>>
>>>>> Hello all, especially Arne,
>>>>>
>>>>> today, I upgraded to "IPFire 2.27 - Core Update 170 Development
>>>>> Build: next/06b4164d", which primarily comes with Linux 5.15.59 and
>>>>> the slab cache merging disabled. On my physical testing hardware,
>>>>> the boot process stalled after several kernel trace message blocks
>>>>> being displayed.
>>>>>
>>>>> Unfortunately, I was unable to recover them in detail, but they
>>>>> occurred fairly early, roughly around the mounting of the root file
>>>>> system. Since the machine is semi-productive (we all test in
>>>>> production, don't we? ;-) ), I went back to C169 and will now
>>>>> investigate further which change broke the update.
>>>>>
>>>>> An earlier version of Core Update 170 (commit
>>>>> 668cf4c0d0c2dbbc607716956daace413837a8da, I believe, but it was
>>>>> definitely after the randstruct changes) ran fine for days here, so
>>>>> it must be a pretty recent change. Will keep you updated.
>>>>>
>>>>> Thanks, and best regards,
>>>>> Peter Müller
>>>>
>>>
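
The suspicion raised in the thread, that cached objects built with one randstruct seed get mixed with freshly compiled objects built with another, can be illustrated with a toy model. This is only a sketch under stated assumptions: the field names, the `layout` helper and the seeded shuffle are hypothetical stand-ins and do not reproduce the real plugin's algorithm or ccache's behaviour. It only shows why two compilation units that disagree on the seed would end up reading the wrong memory.

```python
import random

# Hypothetical field names for a structure whose layout gets randomized.
FIELDS = ["uid", "gid", "cap_effective", "user_ns"]


def layout(seed):
    """Map each field to a slot, shuffled by the seed (toy model only)."""
    order = FIELDS[:]
    random.Random(seed).shuffle(order)
    return {name: slot for slot, name in enumerate(order)}


def write_struct(layout_map, values):
    """Store the values the way an object built with this layout would."""
    memory = [None] * len(FIELDS)
    for name, value in values.items():
        memory[layout_map[name]] = value
    return memory


def read_field(layout_map, memory, name):
    """Load a field the way an object built with this layout would."""
    return memory[layout_map[name]]


def mismatched_fields(seed_a, seed_b):
    """Fields that the two builds place in different slots."""
    a, b = layout(seed_a), layout(seed_b)
    return [name for name in FIELDS if a[name] != b[name]]


if __name__ == "__main__":
    values = {"uid": 0, "gid": 100, "cap_effective": "caps", "user_ns": "ns"}
    cached_seed, fresh_seed = 1, 2  # assumed: old cached objects vs. rebuilt ones
    memory = write_struct(layout(cached_seed), values)
    for name in mismatched_fields(cached_seed, fresh_seed):
        wrong = read_field(layout(fresh_seed), memory, name)
        print(f"{name}: expected {values[name]!r}, read {wrong!r}")
```

In this model, any field listed by `mismatched_fields` is read from a slot that holds a different field's value, which is the kind of mix-up that could surface as a NULL pointer dereference if a pointer field is affected.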