From mboxrd@z Thu Jan 1 00:00:00 1970
From: Arne Fitzenreiter
To: development@lists.ipfire.org
Subject: Re: Core Update 170 testing report - "next/06b4164d" crashes on my x86_64 testing machine
Date: Tue, 09 Aug 2022 08:23:33 +0200
Message-ID: <0b179153c471b0fe6b9d87486bdef423@ipfire.org>
In-Reply-To: <21efe16c-bad8-c9a0-dede-10762d269cd0@ipfire.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============8468523077630694322=="
List-Id:

--===============8468523077630694322==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

On 2022-08-08 17:47, Peter Müller wrote:
> Hello Arne,
>
> thanks for reporting back.
>
> This means the slab cache patch is the problem.

I'm not sure. I fear it could be RANDSTRUCT: after a kernel version
update, the first build does not use the ccache, but after a small
config change the kernel could break if some parts of it are taken from
the cache and others are not.

At the moment I am testing a clean build without ccache but with the
slab cache patch enabled. If this works, the RANDSTRUCT change is the
cause.

Arne

>
> Unfortunately, my local C-cache appears to be completely messed up now,
> so I will have to start with a clean cache, hence it will probably take
> me until tomorrow to have some testing results ready.
>
> Will keep you updated.
>
> Thanks, and best regards,
> Peter Müller
>
>
>> With this
>> https://nightly.ipfire.org/next/2022-08-06%2007:45:02%20+0000-43df4a03/
>> nightly, kernel 5.15.59 boots on real hardware (x86_64 and aarch64).
>> After
>> commit 06b4164dfe269704976b52421edbbbdf3b345679
>> Author: Peter Müller
>> Date:   Mon Aug 1 17:39:59 2022 +0000
>>
>>     linux: Do not allow slab caches to be merged
>>
>>
>> it doesn't boot anymore.
>> (also tested on x86_64 and aarch64)
>>
>> Arne
>>
>>
>> On 2022-08-08 12:22, Michael Tremer wrote:
>>> Hello,
>>>
>>>> On 8 Aug 2022, at 11:16, Peter Müller wrote:
>>>>
>>>> Hello Michael, hello Arne,
>>>>
>>>> just a quick reply: I think we are dealing with the combination of
>>>> two issues here, as kernel 5.15.59 without slab cache merging
>>>> disabled won't even boot in a VM (the screen stays blank
>>>> indefinitely), and it crashes straight away with the slab cache
>>>> merging patch.
>>>>
>>>> Since kernel 5.15.57 is running perfectly fine here with randstruct
>>>> enabled, and has been for days, I just reverted both the update to
>>>> 5.15.59 and the slab cache patch. For the time being, I would leave
>>>> randstruct enabled, since it does not seem to be a root cause for
>>>> whatever bug(s) we are dealing with at the moment.
>>>
>>> Is that from the first build or a consecutive one?
>>>
>>>> @Arne: Were you able to boot 5.15.59 successfully on hardware? If
>>>> so, did it also boot properly in a VirtualBox VM?
>>>>
>>>> Apologies for this coming up so unexpectedly.
>>>
>>> Well, things break. We should however be fast to have at least a
>>> booting kernel in the tree so that we won’t crash any more systems.
>>>
>>> And if that requires reverting both patches until we know for certain
>>> which one is the bad one, I find that the best option.
>>>
>>> -Michael
>>>
>>>>
>>>> Thanks, and best regards,
>>>> Peter Müller
>>>>
>>>>> Hello,
>>>>>
>>>>> You seem to have a very classic NULL pointer dereference.
>>>>>
>>>>> Something is trying to follow a NULL pointer. And that isn’t
>>>>> possible.
>>>>>
>>>>> Now it is interesting to know why that is. The cap_capable function
>>>>> hasn’t been touched in the 5.15 tree in a while. The same goes for
>>>>> ns_capable.
>>>>>
>>>>> I would therefore suspect that this is some issue from the
>>>>> RANDSTRUCT plugin which seems to be incompatible with ccache.
>>>>>
>>>>> If you have built a kernel with a random seed for the first time,
>>>>> that will be put into the cache. If the next build is unmodified,
>>>>> the kernel will come out of the cache and will be exactly the same
>>>>> as the previous build.
>>>>>
>>>>> If you however modify some parts of the kernel (a minor release for
>>>>> example) you will only compile the changed parts BUT with a
>>>>> different seed for the randstruct plugin.
>>>>>
>>>>> And I suspect that this has happened here where your code is now
>>>>> simply reading the wrong memory.
>>>>>
>>>>> I would recommend reverting the RANDSTRUCT patch and that should
>>>>> allow you to have a proper image again.
>>>>>
>>>>> If you want to keep that, the only option would be to disable the
>>>>> ccache for the kernel. The kernel is however one of the largest
>>>>> packages and ccache works really really well here. We can discuss
>>>>> this if we have identified RANDSTRUCT to be the culprit.
>>>>>
>>>>> -Michael
>>>>>
>>>>>> On 7 Aug 2022, at 19:08, Peter Müller wrote:
>>>>>>
>>>>>> Hello *,
>>>>>>
>>>>>> enclosed is a screenshot of what booting the installer for Core
>>>>>> Update 170 (dirty) with kernel 5.15.57 and slab merging disabled
>>>>>> looks like. With kernel 5.15.59, the VM screen stays blank, so I
>>>>>> had to revert this to get some results.
>>>>>>
>>>>>> Frankly, I don't see why the kernel suddenly does not know
>>>>>> anything about efivarfs anymore, and what's sunrpc got to do with
>>>>>> it. For the latter,
>>>>>> /build/lib/modules/5.15.57-ipfire/kernel/net/sunrpc/auth_gss/rpcsec_gss_krb5.ko.xz
>>>>>> is still there, just as it has been in C169 before.
>>>>>>
>>>>>> Any ideas are appreciated.
>>>>>> :-)
>>>>>>
>>>>>> Thanks, and best regards,
>>>>>> Peter Müller
>>>>>>
>>>>>>
>>>>>>> Hello all, especially Arne,
>>>>>>>
>>>>>>> today, I upgraded to "IPFire 2.27 - Core Update 170 Development
>>>>>>> Build: next/06b4164d", which primarily comes with Linux 5.15.59
>>>>>>> and the slab cache merging disabled. On my physical testing
>>>>>>> hardware, the boot process stalled after several kernel trace
>>>>>>> message blocks being displayed.
>>>>>>>
>>>>>>> Unfortunately, I was unable to recover them in detail, but they
>>>>>>> occurred fairly early, roughly around the mounting of the root
>>>>>>> file system. Since the machine is semi-productive (we all test in
>>>>>>> production, don't we? ;-) ), I went back to C169 and will now
>>>>>>> investigate further which change broke the update.
>>>>>>>
>>>>>>> An earlier version of Core Update 170 (commit
>>>>>>> 668cf4c0d0c2dbbc607716956daace413837a8da, I believe, but it was
>>>>>>> definitely after the randstruct changes) ran fine for days here,
>>>>>>> so it must be a pretty recent change. Will keep you updated.
>>>>>>>
>>>>>>> Thanks, and best regards,
>>>>>>> Peter Müller
>>>>>>
>>>>>

--===============8468523077630694322==--