From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michael Tremer
To: development@lists.ipfire.org
Subject: Re: Core Update 170 testing report - "next/06b4164d" crashes on my x86_64 testing machine
Date: Tue, 09 Aug 2022 10:31:33 +0100
Message-ID: <34648E8A-0D09-4B21-AE60-D76BC8ACEB5E@ipfire.org>
In-Reply-To:
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============3187419449648433700=="
List-Id:

--===============3187419449648433700==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

Hello,

Okay. That leaves us with the question of whether we have destroyed the ccache on the nightly builders (or any others).

Since ccache might be unaware of the seed, we might have mixed files in the cache.

If builds still fail after reverting the RANDSTRUCT patch, we might need to wipe the cache.

-Michael

> On 9 Aug 2022, at 10:28, Peter Müller wrote:
>
> Hello Arne,
>
> thank you very much for reporting back.
>
> Okay, then I will put the slab cache patch in again and leave randstruct disabled.
>
> Thanks, and best regards,
> Peter Müller
>
>
>> A fresh build with an empty ccache also boots with the slab cache patch,
>> so RANDSTRUCT should be the real problem.
>>
>> Arne
>>
>> On 2022-08-09 08:23, Arne Fitzenreiter wrote:
>>> On 2022-08-08 17:47, Peter Müller wrote:
>>>> Hello Arne,
>>>>
>>>> thanks for reporting back.
>>>>
>>>> This means the slab cache patch is the problem.
>>>
>>> I'm not sure. I fear it could be RANDSTRUCT, because after a version
>>> update of the kernel
>>> the first build does not use the ccache, and after a small config change
>>> it could break if some parts of
>>> the kernel are taken from the cache and some are not.
>>>
>>> At the moment I am testing a clean build without ccache but with the slab
>>> cache patch enabled. If this works,
>>> it is the RANDSTRUCT change.
>>>
>>> Arne
>>>
>>>>
>>>> Unfortunately, my local C-cache appears to be completely messed up now, so I
>>>> will have to start with a clean cache, hence it will probably take me until
>>>> tomorrow to have some testing results ready.
>>>>
>>>> Will keep you updated.
>>>>
>>>> Thanks, and best regards,
>>>> Peter Müller
>>>>
>>>>
>>>>> With this https://nightly.ipfire.org/next/2022-08-06%2007:45:02%20+0000-43df4a03/
>>>>> nightly, kernel 5.15.59 boots on real hardware (x86_64 and aarch64).
>>>>> After
>>>>> commit 06b4164dfe269704976b52421edbbbdf3b345679
>>>>> Author: Peter Müller
>>>>> Date: Mon Aug 1 17:39:59 2022 +0000
>>>>>
>>>>> linux: Do not allow slab caches to be merged
>>>>>
>>>>>
>>>>> it doesn't boot anymore (also tested on x86_64 and aarch64).
>>>>>
>>>>> Arne
>>>>>
>>>>>
>>>>> On 2022-08-08 12:22, Michael Tremer wrote:
>>>>>> Hello,
>>>>>>
>>>>>>> On 8 Aug 2022, at 11:16, Peter Müller wrote:
>>>>>>>
>>>>>>> Hello Michael, hello Arne,
>>>>>>>
>>>>>>> just a quick reply: I think we are dealing with the combination of two issues here,
>>>>>>> as kernel 5.15.59 without slab cache merging disabled won't even boot in a VM (the
>>>>>>> screen stays blank indefinitely), and it crashes straight away with the slab cache
>>>>>>> merging patch.
>>>>>>>
>>>>>>> Since kernel 5.15.57 is running perfectly fine here with randstruct enabled, and has
>>>>>>> been for days, I just reverted both the update to 5.15.59 and the slab cache patch.
>>>>>>> For the time being, I would leave randstruct enabled, since it does not seem to be a
>>>>>>> root cause for whatever bug(s) we are dealing with at the moment.
>>>>>>
>>>>>> Is that from the first build or a consecutive one?
>>>>>>
>>>>>>> @Arne: Were you able to boot 5.15.59 successfully on hardware? If so, did it also
>>>>>>> boot properly in a VirtualBox VM?
>>>>>>>
>>>>>>> Apologies for this coming up so unexpectedly.
>>>>>>
>>>>>> Well, things break. We should however be fast to have at least a
>>>>>> booting kernel in the tree so that we won’t crash any more systems.
>>>>>>
>>>>>> And if that requires reverting both patches until we know for certain
>>>>>> which one is the bad one, I find that the best option.
>>>>>>
>>>>>> -Michael
>>>>>>
>>>>>>>
>>>>>>> Thanks, and best regards,
>>>>>>> Peter Müller
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> You seem to have a very classic NULL pointer dereference.
>>>>>>>>
>>>>>>>> Something is trying to follow a NULL pointer. And that isn’t possible.
>>>>>>>>
>>>>>>>> Now it is interesting to know why that is. The cap_capable function hasn’t been touched in the 5.15 tree in a while. The same goes for ns_capable.
>>>>>>>>
>>>>>>>> I would therefore suspect that this is some issue from the RANDSTRUCT plugin which seems to be incompatible with ccache.
>>>>>>>>
>>>>>>>> If you have built a kernel with a random seed for the first time, that will be put into the cache. If the next build is unmodified, the kernel will come out of the cache and will be exactly the same as the previous build.
>>>>>>>>
>>>>>>>> If you however modify some parts of the kernel (a minor release for example) you will only compile the changed parts BUT with a different seed for the randstruct plugin.
>>>>>>>>
>>>>>>>> And I suspect that this has happened here where your code is now simply reading the wrong memory.
>>>>>>>>
>>>>>>>> I would recommend reverting the RANDSTRUCT patch and that should allow you to have a proper image again.
>>>>>>>>
>>>>>>>> If you want to keep that, the only option would be to disable the ccache for the kernel. The kernel is however one of the largest packages and ccache works really really well here. We can discuss this if we have identified RANDSTRUCT to be the culprit.
>>>>>>>>
>>>>>>>> -Michael
>>>>>>>>
>>>>>>>>> On 7 Aug 2022, at 19:08, Peter Müller wrote:
>>>>>>>>>
>>>>>>>>> Hello *,
>>>>>>>>>
>>>>>>>>> enclosed is a screenshot of what booting the installer for Core Update 170 (dirty)
>>>>>>>>> with kernel 5.15.57 and slab merging disabled looks like. With kernel 5.15.59, the
>>>>>>>>> VM screen stays blank, so I had to revert this to get some results.
>>>>>>>>>
>>>>>>>>> Frankly, I don't see why the kernel suddenly does not know anything about efivarfs
>>>>>>>>> anymore, and what's sunrpc got to do with it. For the latter,
>>>>>>>>> /build/lib/modules/5.15.57-ipfire/kernel/net/sunrpc/auth_gss/rpcsec_gss_krb5.ko.xz
>>>>>>>>> is still there, just as it has been in C169 before.
>>>>>>>>>
>>>>>>>>> Any ideas are appreciated. :-)
>>>>>>>>>
>>>>>>>>> Thanks, and best regards,
>>>>>>>>> Peter Müller
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Hello all, especially Arne,
>>>>>>>>>>
>>>>>>>>>> today, I upgraded to "IPFire 2.27 - Core Update 170 Development Build: next/06b4164d",
>>>>>>>>>> which primarily comes with Linux 5.15.59 and the slab cache merging disabled. On
>>>>>>>>>> my physical testing hardware, the boot process stalled after several kernel trace
>>>>>>>>>> message blocks being displayed.
>>>>>>>>>>
>>>>>>>>>> Unfortunately, I was unable to recover them in detail, but they occurred fairly
>>>>>>>>>> early, roughly around the mounting of the root file system. Since the machine is
>>>>>>>>>> semi-productive (we all test in production, don't we? ;-) ), I went back to C169
>>>>>>>>>> and will now investigate further which change broke the update.
>>>>>>>>>>
>>>>>>>>>> An earlier version of Core Update 170 (commit 668cf4c0d0c2dbbc607716956daace413837a8da,
>>>>>>>>>> I believe, but it was definitely after the randstruct changes) ran fine for days here,
>>>>>>>>>> so it must be a pretty recent change. Will keep you updated.
>>>>>>>>>>
>>>>>>>>>> Thanks, and best regards,
>>>>>>>>>> Peter Müller
>>>>>>>>>
>>>>>>>>

--===============3187419449648433700==--