From mboxrd@z Thu Jan 1 00:00:00 1970
From: Arne Fitzenreiter
To: development@lists.ipfire.org
Subject: Re: Core Update 170 testing report - "next/06b4164d" crashes on my x86_64 testing machine
Date: Tue, 09 Aug 2022 10:27:45 +0200
Message-ID: <93be3c121d5cd2287091924c3d92884a@ipfire.org>
In-Reply-To: <0b179153c471b0fe6b9d87486bdef423@ipfire.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============4059287123946593258=="
List-Id:

--===============4059287123946593258==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

A fresh build with an empty ccache also boots with the slab cache patch
applied, so RANDSTRUCT should be the real problem.

Arne

On 2022-08-09 08:23, Arne Fitzenreiter wrote:
> On 2022-08-08 17:47, Peter Müller wrote:
>> Hello Arne,
>>
>> thanks for reporting back.
>>
>> This means the slab cache patch is the problem.
>
> I'm not sure. I fear it could be RANDSTRUCT: after a kernel version
> update, the first build does not use the ccache, but after a small
> config change the build could break if some parts of the kernel come
> from the cache and others do not.
>
> At the moment I am testing a clean build without ccache but with the
> slab cache patch enabled. If this works, the RANDSTRUCT change is the
> cause.
>
> Arne
>
>>
>> Unfortunately, my local C-cache appears to be completely messed up
>> now, so I will have to start with a clean cache, hence it will
>> probably take me until tomorrow to have some testing results ready.
>>
>> Will keep you updated.
>>
>> Thanks, and best regards,
>> Peter Müller
>>
>>
>>> With this
>>> https://nightly.ipfire.org/next/2022-08-06%2007:45:02%20+0000-43df4a03/
>>> nightly the kernel 5.15.59 boots on real hardware (x86_64 and
>>> aarch64).
>>> After
>>> commit 06b4164dfe269704976b52421edbbbdf3b345679
>>> Author: Peter Müller
>>> Date:   Mon Aug 1 17:39:59 2022 +0000
>>>
>>>     linux: Do not allow slab caches to be merged
>>>
>>>
>>> it doesn't boot anymore. (also tested on x86_64 and aarch64)
>>>
>>> Arne
>>>
>>>
>>> On 2022-08-08 12:22, Michael Tremer wrote:
>>>> Hello,
>>>>
>>>>> On 8 Aug 2022, at 11:16, Peter Müller wrote:
>>>>>
>>>>> Hello Michael, hello Arne,
>>>>>
>>>>> just a quick reply: I think we are dealing with the combination of
>>>>> two issues here, as kernel 5.15.59 without slab cache merging
>>>>> disabled won't even boot in a VM (the screen stays blank
>>>>> indefinitely), and it crashes straight away with the slab cache
>>>>> merging patch.
>>>>>
>>>>> Since kernel 5.15.57 is running perfectly fine here with randstruct
>>>>> enabled, and has been for days, I just reverted both the update to
>>>>> 5.15.59 and the slab cache patch. For the time being, I would leave
>>>>> randstruct enabled, since it does not seem to be a root cause for
>>>>> whatever bug(s) we are dealing with at the moment.
>>>>
>>>> Is that from the first build or a consecutive one?
>>>>
>>>>> @Arne: Were you able to boot 5.15.59 successfully on hardware? If
>>>>> so, did it also boot properly in a VirtualBox VM?
>>>>>
>>>>> Apologies for this coming up so unexpectedly.
>>>>
>>>> Well, things break. We should however be fast to have at least a
>>>> booting kernel in the tree so that we won’t crash any more systems.
>>>>
>>>> And if that requires reverting both patches until we know for
>>>> certain which one is the bad one, I find that the best option.
>>>>
>>>> -Michael
>>>>
>>>>>
>>>>> Thanks, and best regards,
>>>>> Peter Müller
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> You seem to have a very classic NULL pointer dereference.
>>>>>>
>>>>>> Something is trying to follow a NULL pointer. And that isn’t
>>>>>> possible.
>>>>>>
>>>>>> Now it is interesting to know why that is. The cap_capable
>>>>>> function hasn’t been touched in the 5.15 tree in a while. The same
>>>>>> goes for ns_capable.
>>>>>>
>>>>>> I would therefore suspect that this is some issue from the
>>>>>> RANDSTRUCT plugin which seems to be incompatible with ccache.
>>>>>>
>>>>>> If you have built a kernel with a random seed for the first time,
>>>>>> that will be put into the cache. If the next build is unmodified,
>>>>>> the kernel will come out of the cache and will be exactly the same
>>>>>> as the previous build.
>>>>>>
>>>>>> If you however modify some parts of the kernel (a minor release
>>>>>> for example) you will only compile the changed parts BUT with a
>>>>>> different seed for the randstruct plugin.
>>>>>>
>>>>>> And I suspect that this has happened here where your code is now
>>>>>> simply reading the wrong memory.
>>>>>>
>>>>>> I would recommend reverting the RANDSTRUCT patch and that should
>>>>>> allow you to have a proper image again.
>>>>>>
>>>>>> If you want to keep that, the only option would be to disable the
>>>>>> ccache for the kernel. The kernel is however one of the largest
>>>>>> packages and ccache works really really well here. We can discuss
>>>>>> this if we have identified RANDSTRUCT to be the culprit.
>>>>>>
>>>>>> -Michael
>>>>>>
>>>>>>> On 7 Aug 2022, at 19:08, Peter Müller wrote:
>>>>>>>
>>>>>>> Hello *,
>>>>>>>
>>>>>>> enclosed is a screenshot of what booting the installer for Core
>>>>>>> Update 170 (dirty) with kernel 5.15.57 and slab merging disabled
>>>>>>> looks like. With kernel 5.15.59, the VM screen stays blank, so I
>>>>>>> had to revert this to get some results.
>>>>>>>
>>>>>>> Frankly, I don't see why the kernel suddenly does not know
>>>>>>> anything about efivarfs anymore, and what's sunrpc got to do with
>>>>>>> it. For the latter,
>>>>>>> /build/lib/modules/5.15.57-ipfire/kernel/net/sunrpc/auth_gss/rpcsec_gss_krb5.ko.xz
>>>>>>> is still there, just as it has been in C169 before.
>>>>>>>
>>>>>>> Any ideas are appreciated. :-)
>>>>>>>
>>>>>>> Thanks, and best regards,
>>>>>>> Peter Müller
>>>>>>>
>>>>>>>
>>>>>>>> Hello all, especially Arne,
>>>>>>>>
>>>>>>>> today, I upgraded to "IPFire 2.27 - Core Update 170 Development
>>>>>>>> Build: next/06b4164d", which primarily comes with Linux 5.15.59
>>>>>>>> and the slab cache merging disabled. On my physical testing
>>>>>>>> hardware, the boot process stalled after several kernel trace
>>>>>>>> message blocks being displayed.
>>>>>>>>
>>>>>>>> Unfortunately, I was unable to recover them in detail, but they
>>>>>>>> occurred fairly early, roughly around the mounting of the root
>>>>>>>> file system. Since the machine is semi-productive (we all test
>>>>>>>> in production, don't we? ;-) ), I went back to C169 and will now
>>>>>>>> investigate further which change broke the update.
>>>>>>>>
>>>>>>>> An earlier version of Core Update 170 (commit
>>>>>>>> 668cf4c0d0c2dbbc607716956daace413837a8da, I believe, but it was
>>>>>>>> definitely after the randstruct changes) ran fine for days here,
>>>>>>>> so it must be a pretty recent change.
>>>>>>>> Will keep you updated.
>>>>>>>>
>>>>>>>> Thanks, and best regards,
>>>>>>>> Peter Müller
>>>>>>>
>>>>>>

--===============4059287123946593258==--
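
The failure mode suspected in this thread (objects taken from ccache
that were built with one RANDSTRUCT seed, linked against freshly
compiled objects built with another) can be modeled with a small
Python sketch. This is only an illustration of the hypothesis, not
kernel code: the field names, the seed values and all helper functions
below are made up for the example and do not reflect the real layout of
any kernel structure or the actual behaviour of the plugin.

```python
import random

# Hypothetical field names, chosen for illustration only.
FIELDS = ["uid", "gid", "cap_effective", "user_ns", "security", "usage"]


def randomized_layout(seed):
    """Return a field -> slot mapping, shuffled deterministically by seed."""
    order = FIELDS[:]
    random.Random(seed).shuffle(order)
    return {name: slot for slot, name in enumerate(order)}


def store(layout, values):
    """Model a writer: place each value in the slot its layout dictates."""
    memory = [None] * len(layout)
    for name, value in values.items():
        memory[layout[name]] = value
    return memory


def load(layout, memory, name):
    """Model a reader: fetch a field from the slot its own layout expects."""
    return memory[layout[name]]


def mismatched_fields(seed_cached, seed_fresh):
    """List the fields a reader built with seed_fresh gets wrong when the
    data was written by code built with seed_cached."""
    values = {name: "<" + name + ">" for name in FIELDS}
    memory = store(randomized_layout(seed_cached), values)
    fresh = randomized_layout(seed_fresh)
    return [n for n in FIELDS if load(fresh, memory, n) != values[n]]


if __name__ == "__main__":
    print("same seed:      ", mismatched_fields(1, 1))
    print("different seeds:", mismatched_fields(1, 2))
```

With identical seeds the reader and the writer agree on every slot.
Once the seeds differ, a reader may pick up whatever value happens to
sit in the slot it expects, which is the "reading the wrong memory"
effect described above. In the model, using one seed for every object
(or building nothing from the cache) removes the mismatch.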