From: Arne Fitzenreiter
To: development@lists.ipfire.org
Subject: Re: Core Update 170 testing report - "next/06b4164d" crashes on my x86_64 testing machine
Date: Mon, 08 Aug 2022 16:15:45 +0200
Message-ID: <7300c922548070c647e561cbbf7817f2@ipfire.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============4156465727310722721=="

--===============4156465727310722721==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

With this
https://nightly.ipfire.org/next/2022-08-06%2007:45:02%20+0000-43df4a03/
nightly, kernel 5.15.59 boots on real hardware (x86_64 and aarch64).

After

commit 06b4164dfe269704976b52421edbbbdf3b345679
Author: Peter Müller
Date:   Mon Aug 1 17:39:59 2022 +0000

    linux: Do not allow slab caches to be merged

it doesn't boot anymore (also tested on x86_64 and aarch64).

Arne

On 2022-08-08 12:22, Michael Tremer wrote:
> Hello,
>
>> On 8 Aug 2022, at 11:16, Peter Müller wrote:
>>
>> Hello Michael, hello Arne,
>>
>> just a quick reply: I think we are dealing with the combination of two
>> issues here, as kernel 5.15.59 without slab cache merging disabled won't
>> even boot in a VM (the screen stays blank indefinitely), and it crashes
>> straight away with the slab cache merging patch.
>>
>> Since kernel 5.15.57 is running perfectly fine here with randstruct
>> enabled, and has been for days, I just reverted both the update to
>> 5.15.59 and the slab cache patch. For the time being, I would leave
>> randstruct enabled, since it does not seem to be a root cause for
>> whatever bug(s) we are dealing with at the moment.
>
> Is that from the first build or a consecutive one?
>
>> @Arne: Were you able to boot 5.15.59 successfully on hardware? If so,
>> did it also boot properly in a VirtualBox VM?
>>
>> Apologies for this coming up so unexpectedly.
>
> Well, things break. We should however be fast to have at least a
> booting kernel in the tree so that we won’t crash any more systems.
>
> And if that requires reverting both patches until we know for certain
> which one is the bad one, I find that the best option.
>
> -Michael
>
>>
>> Thanks, and best regards,
>> Peter Müller
>>
>>> Hello,
>>>
>>> You seem to have a very classic NULL pointer dereference.
>>>
>>> Something is trying to follow a NULL pointer. And that isn’t
>>> possible.
>>>
>>> Now it is interesting to know why that is. The cap_capable function
>>> hasn’t been touched in the 5.15 tree in a while. The same goes for
>>> ns_capable.
>>>
>>> I would therefore suspect that this is some issue from the RANDSTRUCT
>>> plugin which seems to be incompatible with ccache.
>>>
>>> If you have built a kernel with a random seed for the first time,
>>> that will be put into the cache. If the next build is unmodified, the
>>> kernel will come out of the cache and will be exactly the same as the
>>> previous build.
>>>
>>> If you however modify some parts of the kernel (a minor release for
>>> example) you will only compile the changed parts BUT with a different
>>> seed for the randstruct plugin.
>>>
>>> And I suspect that this has happened here where your code is now
>>> simply reading the wrong memory.
>>>
>>> I would recommend reverting the RANDSTRUCT patch and that should
>>> allow you to have a proper image again.
>>>
>>> If you want to keep that, the only option would be to disable the
>>> ccache for the kernel. The kernel is however one of the largest
>>> packages and ccache works really really well here. We can discuss
>>> this if we have identified RANDSTRUCT to be the culprit.
>>>
>>> -Michael
>>>
>>>> On 7 Aug 2022, at 19:08, Peter Müller wrote:
>>>>
>>>> Hello *,
>>>>
>>>> enclosed is a screenshot of what booting the installer for Core
>>>> Update 170 (dirty) with kernel 5.15.57 and slab merging disabled
>>>> looks like. With kernel 5.15.59, the VM screen stays blank, so I had
>>>> to revert this to get some results.
>>>>
>>>> Frankly, I don't see why the kernel suddenly does not know anything
>>>> about efivarfs anymore, and what's sunrpc got to do with it. For the
>>>> latter,
>>>> /build/lib/modules/5.15.57-ipfire/kernel/net/sunrpc/auth_gss/rpcsec_gss_krb5.ko.xz
>>>> is still there, just as it has been in C169 before.
>>>>
>>>> Any ideas are appreciated. :-)
>>>>
>>>> Thanks, and best regards,
>>>> Peter Müller
>>>>
>>>>
>>>>> Hello all, especially Arne,
>>>>>
>>>>> today, I upgraded to "IPFire 2.27 - Core Update 170 Development
>>>>> Build: next/06b4164d", which primarily comes with Linux 5.15.59 and
>>>>> the slab cache merging disabled. On my physical testing hardware,
>>>>> the boot process stalled after several kernel trace message blocks
>>>>> being displayed.
>>>>>
>>>>> Unfortunately, I was unable to recover them in detail, but they
>>>>> occurred fairly early, roughly around the mounting of the root file
>>>>> system. Since the machine is semi-productive (we all test in
>>>>> production, don't we? ;-) ), I went back to C169 and will now
>>>>> investigate further which change broke the update.
>>>>>
>>>>> An earlier version of Core Update 170 (commit
>>>>> 668cf4c0d0c2dbbc607716956daace413837a8da, I believe, but it was
>>>>> definitely after the randstruct changes) ran fine for days here, so
>>>>> it must be a pretty recent change. Will keep you updated.
>>>>>
>>>>> Thanks, and best regards,
>>>>> Peter Müller
>>>>
>>>

--===============4156465727310722721==--
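
Michael's hypothesis earlier in this thread (objects pulled from ccache that
were compiled with one RANDSTRUCT seed, combined with freshly compiled objects
that used another seed) can be pictured with a toy model. The sketch below is
only an illustration in Python, not kernel code: the field names, the values
and the rotate-by-seed layout function are all hypothetical, and the real
plugin shuffles structure members pseudo-randomly rather than rotating them.
The thread itself does not confirm RANDSTRUCT as the cause.

```python
import struct

# Hypothetical structure with three 8-byte members (illustrative names only).
FIELDS = ["user_ns", "cap_effective", "uid"]


def layout(seed):
    """Return {field: byte offset} for a toy seed-dependent layout.

    Assumption: we rotate the member order by the seed so the result is easy
    to follow by hand. The real plugin uses a pseudo-random shuffle.
    """
    shift = seed % len(FIELDS)
    order = FIELDS[shift:] + FIELDS[:shift]
    return {name: 8 * index for index, name in enumerate(order)}


def write_object(seed, values):
    """Lay out `values` in memory the way code built with `seed` would."""
    buf = bytearray(8 * len(FIELDS))
    for name, offset in layout(seed).items():
        struct.pack_into("<Q", buf, offset, values[name])
    return bytes(buf)


def read_field(seed, buf, name):
    """Read member `name` the way code built with `seed` would."""
    return struct.unpack_from("<Q", buf, layout(seed)[name])[0]


values = {"user_ns": 0xFFFF888000001000, "cap_effective": 0x3F, "uid": 0}

# The object is created by code that was compiled with seed 1.
obj = write_object(seed=1, values=values)

# A reader compiled with the same seed finds the pointer it expects.
same = read_field(1, obj, "user_ns")

# A reader compiled with a different seed looks at another member's slot.
stale = read_field(2, obj, "user_ns")

print(hex(same), hex(stale))
```

In this toy run the mismatched reader finds 0 where it expects a pointer,
which is the kind of value that would show up as a NULL pointer dereference
once it is followed.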