Hello,
On 8 Aug 2022, at 11:16, Peter Müller <peter.mueller@ipfire.org> wrote:
Hello Michael, hello Arne,
just a quick reply: I think we are dealing with a combination of two issues here, as kernel 5.15.59 won't even boot in a VM without the patch that disables slab cache merging (the screen stays blank indefinitely), and it crashes straight away with that patch applied.
Since kernel 5.15.57 is running perfectly fine here with randstruct enabled, and has been for days, I just reverted both the update to 5.15.59 and the slab cache patch. For the time being, I would leave randstruct enabled, since it does not seem to be a root cause for whatever bug(s) we are dealing with at the moment.
Is that from the first build or a consecutive one?
@Arne: Were you able to boot 5.15.59 successfully on hardware? If so, did it also boot properly in a VirtualBox VM?
Apologies for this coming up so unexpectedly.
Well, things break. We should however be quick to get at least a booting kernel into the tree so that we won’t crash any more systems.
And if that requires reverting both patches until we know for certain which one is the bad one, I think that is the best option.
-Michael
Thanks, and best regards, Peter Müller
Hello,
You seem to have a very classic NULL pointer dereference.
Something is trying to follow a NULL pointer. And that isn’t possible.
Now it is interesting to know why that is. The cap_capable function hasn’t been touched in the 5.15 tree in a while. The same goes for ns_capable.
I would therefore suspect that this is some issue from the RANDSTRUCT plugin which seems to be incompatible with ccache.
If you have built a kernel with a random seed for the first time, that will be put into the cache. If the next build is unmodified, the kernel will come out of the cache and will be exactly the same as the previous build.
If you however modify some parts of the kernel (a minor release for example) you will only compile the changed parts BUT with a different seed for the randstruct plugin.
And I suspect that this has happened here where your code is now simply reading the wrong memory.
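To illustrate the suspected failure mode, here is a minimal Python sketch. It is not kernel code and not how the RANDSTRUCT plugin is implemented: the field names, the seeds, and the helper functions are all hypothetical, and the "layout" is just a seeded shuffle of a field list. It only shows what happens when data written by code built with one seed is read by code built with another.

```python
import random
import struct

# Hypothetical fields of a structure whose member order gets randomised.
FIELDS = ["uid", "gid", "cap_effective", "flags"]


def layout_for_seed(seed):
    """Return a field order derived from the seed (stand-in for a randomised layout)."""
    order = list(FIELDS)
    random.Random(seed).shuffle(order)
    return order


def pack_record(values, layout):
    """Write the fields to memory in the order this 'object file' believes in."""
    return struct.pack("<4I", *(values[name] for name in layout))


def read_field(blob, layout, name):
    """Read one field using the offsets this 'object file' was compiled with."""
    return struct.unpack("<4I", blob)[layout.index(name)]


def seed_with_different_offset(first_seed, name):
    """Find another seed that places `name` at a different offset."""
    first = layout_for_seed(first_seed)
    seed = first_seed + 1
    while layout_for_seed(seed).index(name) == first.index(name):
        seed += 1
    return seed


values = {"uid": 1000, "gid": 100, "cap_effective": 0, "flags": 7}

seed_cached = 1  # seed used for the objects that came out of the cache
seed_rebuilt = seed_with_different_offset(seed_cached, "uid")  # seed of the rebuilt objects

layout_cached = layout_for_seed(seed_cached)
layout_rebuilt = layout_for_seed(seed_rebuilt)

blob = pack_record(values, layout_cached)

# Same seed on both sides: the reader finds the field where the writer put it.
uid_same_seed = read_field(blob, layout_cached, "uid")

# Mixed seeds: the reader looks at the wrong offset and gets another field's value.
uid_mixed_seed = read_field(blob, layout_rebuilt, "uid")

print("same seed: ", uid_same_seed)
print("mixed seed:", uid_mixed_seed)
```

In the mixed case the reader gets whatever happens to live at the offset it expects. If that field were a pointer, following it could well end in a crash like the one described above.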
I would recommend reverting the RANDSTRUCT patch and that should allow you to have a proper image again.
If you want to keep that, the only option would be to disable the ccache for the kernel. The kernel is however one of the largest packages and ccache works really, really well here. We can discuss this once we have identified RANDSTRUCT to be the culprit.
-Michael
On 7 Aug 2022, at 19:08, Peter Müller <peter.mueller@ipfire.org> wrote:
Hello *,
enclosed is a screenshot of what booting the installer for Core Update 170 (dirty) with kernel 5.15.57 and slab merging disabled looks like. With kernel 5.15.59, the VM screen stays blank, so I had to revert this to get some results.
Frankly, I don't see why the kernel suddenly does not know anything about efivarfs anymore, and what's sunrpc got to do with it. For the latter, /build/lib/modules/5.15.57-ipfire/kernel/net/sunrpc/auth_gss/rpcsec_gss_krb5.ko.xz is still there, just as it has been in C169 before.
Any ideas are appreciated. :-)
Thanks, and best regards, Peter Müller
Hello all, especially Arne,
today, I upgraded to "IPFire 2.27 - Core Update 170 Development Build: next/06b4164d", which primarily comes with Linux 5.15.59 and slab cache merging disabled. On my physical testing hardware, the boot process stalled after several blocks of kernel trace messages had been displayed.
Unfortunately, I was unable to recover them in detail, but they occurred fairly early, roughly around the mounting of the root file system. Since the machine is semi-productive (we all test in production, don't we? ;-) ), I went back to C169 and will now investigate further which change broke the update.
An earlier version of Core Update 170 (commit 668cf4c0d0c2dbbc607716956daace413837a8da, I believe, but it was definitely after the randstruct changes) ran fine for days here, so it must be a pretty recent change. Will keep you updated.
Thanks, and best regards, Peter Müller
<screenshot_c170_dirty_crash_on_boot_sunrpc_efivarfs.png>