From mboxrd@z Thu Jan 1 00:00:00 1970
From: Peter =?utf-8?q?M=C3=BCller?=
To: development@lists.ipfire.org
Subject: Re: Core Update 170 testing report - "next/06b4164d" crashes on my x86_64 testing machine
Date: Tue, 09 Aug 2022 09:28:27 +0000
Message-ID:
In-Reply-To: <93be3c121d5cd2287091924c3d92884a@ipfire.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============6865327977742302811=="
List-Id:

--===============6865327977742302811==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

Hello Arne,

thank you very much for reporting back.

Okay, then I will put the slab cache patch in again and leave randstruct disabled.

Thanks, and best regards,
Peter Müller

> A fresh build with an empty ccache also boots with the slab cache patch,
> so RANDSTRUCT should be the real problem.
>
> Arne
>
> On 2022-08-09 08:23, Arne Fitzenreiter wrote:
>> On 2022-08-08 17:47, Peter Müller wrote:
>>> Hello Arne,
>>>
>>> thanks for reporting back.
>>>
>>> This means the slab cache patch is the problem.
>>
>> I'm not sure. I fear it could be RANDSTRUCT, because after a kernel
>> version update the first build does not use the ccache, and after a
>> small config change it could break if some parts of the kernel are
>> taken from the cache and others are not.
>>
>> At the moment I am testing a clean build without ccache but with the
>> slab cache patch enabled. If this works, it is the RANDSTRUCT change.
>>
>> Arne
>>
>>>
>>> Unfortunately, my local ccache appears to be completely messed up now, so I
>>> will have to start with a clean cache, hence it will probably take me until
>>> tomorrow to have some testing results ready.
>>>
>>> Will keep you updated.
>>>
>>> Thanks, and best regards,
>>> Peter Müller
>>>
>>>
>>>> With this https://nightly.ipfire.org/next/2022-08-06%2007:45:02%20+0000-43df4a03/
>>>> nightly, kernel 5.15.59 boots on real hardware (x86_64 and aarch64).
>>>> After
>>>> commit 06b4164dfe269704976b52421edbbbdf3b345679
>>>> Author: Peter Müller
>>>> Date:   Mon Aug 1 17:39:59 2022 +0000
>>>>
>>>>     linux: Do not allow slab caches to be merged
>>>>
>>>>
>>>> it doesn't boot anymore (also tested on x86_64 and aarch64).
>>>>
>>>> Arne
>>>>
>>>>
>>>> On 2022-08-08 12:22, Michael Tremer wrote:
>>>>> Hello,
>>>>>
>>>>>> On 8 Aug 2022, at 11:16, Peter Müller wrote:
>>>>>>
>>>>>> Hello Michael, hello Arne,
>>>>>>
>>>>>> just a quick reply: I think we are dealing with the combination of two issues here,
>>>>>> as kernel 5.15.59 without slab cache merging disabled won't even boot in a VM (the
>>>>>> screen stays blank indefinitely), and it crashes straight away with the slab cache
>>>>>> merging patch.
>>>>>>
>>>>>> Since kernel 5.15.57 is running perfectly fine here with randstruct enabled, and has
>>>>>> been for days, I just reverted both the update to 5.15.59 and the slab cache patch.
>>>>>> For the time being, I would leave randstruct enabled, since it does not seem to be a
>>>>>> root cause for whatever bug(s) we are dealing with at the moment.
>>>>>
>>>>> Is that from the first build or a consecutive one?
>>>>>
>>>>>> @Arne: Were you able to boot 5.15.59 successfully on hardware? If so, did it also
>>>>>> boot properly in a VirtualBox VM?
>>>>>>
>>>>>> Apologies for this coming up so unexpectedly.
>>>>>
>>>>> Well, things break. We should however be fast to have at least a
>>>>> booting kernel in the tree so that we won’t crash any more systems.
>>>>>
>>>>> And if that requires reverting both patches until we know for certain
>>>>> which one is the bad one, I find that the best option.
>>>>>
>>>>> -Michael
>>>>>
>>>>>>
>>>>>> Thanks, and best regards,
>>>>>> Peter Müller
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> You seem to have a very classic NULL pointer dereference.
>>>>>>>
>>>>>>> Something is trying to follow a NULL pointer. And that isn’t possible.
>>>>>>>
>>>>>>> Now it is interesting to know why that is. The cap_capable function hasn’t been touched in the 5.15 tree in a while. The same goes for ns_capable.
>>>>>>>
>>>>>>> I would therefore suspect that this is some issue from the RANDSTRUCT plugin which seems to be incompatible with ccache.
>>>>>>>
>>>>>>> If you have built a kernel with a random seed for the first time, that will be put into the cache. If the next build is unmodified, the kernel will come out of the cache and will be exactly the same as the previous build.
>>>>>>>
>>>>>>> If you however modify some parts of the kernel (a minor release for example) you will only compile the changed parts BUT with a different seed for the randstruct plugin.
>>>>>>>
>>>>>>> And I suspect that this has happened here where your code is now simply reading the wrong memory.
>>>>>>>
>>>>>>> I would recommend reverting the RANDSTRUCT patch and that should allow you to have a proper image again.
>>>>>>>
>>>>>>> If you want to keep that, the only option would be to disable the ccache for the kernel. The kernel is however one of the largest packages and ccache works really really well here. We can discuss this if we have identified RANDSTRUCT to be the culprit.
>>>>>>>
>>>>>>> -Michael
>>>>>>>
>>>>>>>> On 7 Aug 2022, at 19:08, Peter Müller wrote:
>>>>>>>>
>>>>>>>> Hello *,
>>>>>>>>
>>>>>>>> enclosed is a screenshot of what booting the installer for Core Update 170 (dirty)
>>>>>>>> with kernel 5.15.57 and slab merging disabled looks like. With kernel 5.15.59, the
>>>>>>>> VM screen stays blank, so I had to revert this to get some results.
>>>>>>>>
>>>>>>>> Frankly, I don't see why the kernel suddenly does not know anything about efivarfs
>>>>>>>> anymore, and what's sunrpc got to do with it. For the latter,
>>>>>>>> /build/lib/modules/5.15.57-ipfire/kernel/net/sunrpc/auth_gss/rpcsec_gss_krb5.ko.xz
>>>>>>>> is still there, just as it has been in C169 before.
>>>>>>>>
>>>>>>>> Any ideas are appreciated. :-)
>>>>>>>>
>>>>>>>> Thanks, and best regards,
>>>>>>>> Peter Müller
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hello all, especially Arne,
>>>>>>>>>
>>>>>>>>> today, I upgraded to "IPFire 2.27 - Core Update 170 Development Build: next/06b4164d",
>>>>>>>>> which primarily comes with Linux 5.15.59 and the slab cache merging disabled. On
>>>>>>>>> my physical testing hardware, the boot process stalled after several kernel trace
>>>>>>>>> message blocks being displayed.
>>>>>>>>>
>>>>>>>>> Unfortunately, I was unable to recover them in detail, but they occurred fairly
>>>>>>>>> early, roughly around the mounting of the root file system. Since the machine is
>>>>>>>>> semi-productive (we all test in production, don't we? ;-) ), I went back to C169
>>>>>>>>> and will now investigate further which change broke the update.
>>>>>>>>>
>>>>>>>>> An earlier version of Core Update 170 (commit 668cf4c0d0c2dbbc607716956daace413837a8da,
>>>>>>>>> I believe, but it was definitely after the randstruct changes) ran fine for days here,
>>>>>>>>> so it must be a pretty recent change. Will keep you updated.
>>>>>>>>>
>>>>>>>>> Thanks, and best regards,
>>>>>>>>> Peter Müller
>>>>>>>>
>>>>>>>
--===============6865327977742302811==--
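
The failure mode suspected in this thread (objects built with one
randstruct seed being mixed, via ccache, with objects built with another)
can be illustrated with a small toy model. This is only a sketch under
stated assumptions: it does not reproduce how the GCC plugin or ccache
actually work, the field names and values are hypothetical, and the seeds
are arbitrary integers. It only shows why code compiled against a different
field layout would read the wrong memory.

```python
import random

# Hypothetical field names, loosely modelled on a credentials structure.
FIELDS = ["uid", "gid", "cap_effective", "user_ns"]


def layout(seed):
    """Map each field to a slot, shuffled deterministically by the seed."""
    order = FIELDS[:]
    random.Random(seed).shuffle(order)
    return {name: slot for slot, name in enumerate(order)}


def write_struct(values, seed):
    """Model an object file built with `seed` that fills in the structure."""
    slots = [None] * len(FIELDS)
    for name, slot in layout(seed).items():
        slots[slot] = values[name]
    return slots


def read_field(slots, name, seed):
    """Model an object file built with `seed` that reads one field."""
    return slots[layout(seed)[name]]


def first_mismatching_seed(base_seed, name):
    """Find a seed that places `name` in a different slot than base_seed."""
    seed = base_seed + 1
    while layout(seed)[name] == layout(base_seed)[name]:
        seed += 1
    return seed


# Distinct, made-up values so that a wrong read is easy to spot.
values = {"uid": 1000, "gid": 100, "cap_effective": 0x1FF, "user_ns": "init_user_ns"}

cached_seed = 1  # seed used by the objects that came out of the cache
memory = write_struct(values, cached_seed)

# Reader built with the same seed: sees the field it expects.
print("same seed:     ", read_field(memory, "user_ns", cached_seed))

# Reader rebuilt with a different seed: looks in the wrong slot.
new_seed = first_mismatching_seed(cached_seed, "user_ns")
print("different seed:", read_field(memory, "user_ns", new_seed))
```

In a real kernel the wrongly read slot could just as well contain zero
where a pointer is expected, which would be consistent with the NULL
pointer dereference described above, although the thread only establishes
this as a suspicion.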