From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michael Tremer
To: development@lists.ipfire.org
Subject: Re: Core Update 170 testing report - "next/06b4164d" crashes on my x86_64 testing machine
Date: Tue, 09 Aug 2022 11:37:57 +0100
Message-ID:
In-Reply-To:
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============2261272698792504834=="
List-Id:

--===============2261272698792504834==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

Hello,

> On 9 Aug 2022, at 11:26, Peter Müller wrote:
>
> Hello Michael,
>
> agreed. I will keep an eye on the mails emitted from the nightly builders and
> delete the ccache, if necessary.
>
> If so, sending a short notice to this mailing list to inform fellow developers
> on this issue should be sufficient AFAIC, since we are only dealing with "next".

Yes.

>
> Thanks, and best regards,
> Peter Müller
>
>
>> Hello,
>>
>> Okay. That will leave us with the question whether we have destroyed the ccache on the nightly builders (or any others).
>>
>> Since ccache might be unaware of the seed, we might have mixed files in the cache.
>>
>> If builds still fail after reverting the RANDSTRUCT patch, we might need to wipe the cache.
>>
>> -Michael
>>
>>> On 9 Aug 2022, at 10:28, Peter Müller wrote:
>>>
>>> Hello Arne,
>>>
>>> thank you very much for reporting back.
>>>
>>> Okay, then I will put the slab cache patch in again and leave randstruct disabled.
>>>
>>> Thanks, and best regards,
>>> Peter Müller
>>>
>>>
>>>> A fresh build with an empty ccache also boots with the slab cache patch,
>>>> so RANDSTRUCT should be the real problem.
>>>>
>>>> Arne
>>>>
>>>> On 2022-08-09 08:23, Arne Fitzenreiter wrote:
>>>>> On 2022-08-08 17:47, Peter Müller wrote:
>>>>>> Hello Arne,
>>>>>>
>>>>>> thanks for reporting back.
>>>>>>
>>>>>> This means the slab cache patch is the problem.
>>>>>
>>>>> I'm not sure.
>>>>> I fear it could be the RANDSTRUCT because after a version
>>>>> update of the kernel
>>>>> it does not use the ccache on the first build, and after a small config change
>>>>> it could break if parts of
>>>>> the kernel are taken from the cache and some are not.
>>>>>
>>>>> At the moment I am testing a clean build without ccache but with the slab
>>>>> cache patch enabled. If this works,
>>>>> it is the RANDSTRUCT change.
>>>>>
>>>>> Arne
>>>>>
>>>>>>
>>>>>> Unfortunately, my local ccache appears to be completely messed up now, so I
>>>>>> will have to start with a clean cache, hence it will probably take me until
>>>>>> tomorrow to have some testing results ready.
>>>>>>
>>>>>> Will keep you updated.
>>>>>>
>>>>>> Thanks, and best regards,
>>>>>> Peter Müller
>>>>>>
>>>>>>
>>>>>>> With this https://nightly.ipfire.org/next/2022-08-06%2007:45:02%20+0000-43df4a03/
>>>>>>> nightly the kernel 5.15.59 boots on real hardware (x86_64 and aarch64)
>>>>>>> After
>>>>>>> commit 06b4164dfe269704976b52421edbbbdf3b345679
>>>>>>> Author: Peter Müller
>>>>>>> Date: Mon Aug 1 17:39:59 2022 +0000
>>>>>>>
>>>>>>> linux: Do not allow slab caches to be merged
>>>>>>>
>>>>>>>
>>>>>>> it doesn't boot anymore. (also tested on x86_64 and aarch64)
>>>>>>>
>>>>>>> Arne
>>>>>>>
>>>>>>>
>>>>>>> On 2022-08-08 12:22, Michael Tremer wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>>> On 8 Aug 2022, at 11:16, Peter Müller wrote:
>>>>>>>>>
>>>>>>>>> Hello Michael, hello Arne,
>>>>>>>>>
>>>>>>>>> just a quick reply: I think we are dealing with the combination of two issues here,
>>>>>>>>> as kernel 5.15.59 without slab cache merging disabled won't even boot in a VM (the
>>>>>>>>> screen stays blank indefinitely), and it crashes straight away with the slab cache
>>>>>>>>> merging patch.
>>>>>>>>>
>>>>>>>>> Since kernel 5.15.57 is running perfectly fine here with randstruct enabled, and has
>>>>>>>>> been for days, I just reverted both the update to 5.15.59 and the slab cache patch.
>>>>>>>>> For the time being, I would leave randstruct enabled, since it does not seem to be a
>>>>>>>>> root cause for whatever bug(s) we are dealing with at the moment.
>>>>>>>>
>>>>>>>> Is that from the first build or a subsequent one?
>>>>>>>>
>>>>>>>>> @Arne: Were you able to boot 5.15.59 successfully on hardware? If so, did it also
>>>>>>>>> boot properly in a VirtualBox VM?
>>>>>>>>>
>>>>>>>>> Apologies for this coming up so unexpectedly.
>>>>>>>>
>>>>>>>> Well, things break. We should however be fast to have at least a
>>>>>>>> booting kernel in the tree so that we won’t crash any more systems.
>>>>>>>>
>>>>>>>> And if that requires reverting both patches until we know for certain
>>>>>>>> which one is the bad one, I find that the best option.
>>>>>>>>
>>>>>>>> -Michael
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks, and best regards,
>>>>>>>>> Peter Müller
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> You seem to have a very classic NULL pointer dereference.
>>>>>>>>>>
>>>>>>>>>> Something is trying to follow a NULL pointer. And that isn’t possible.
>>>>>>>>>>
>>>>>>>>>> Now it is interesting to know why that is. The cap_capable function hasn’t been touched in the 5.15 tree in a while. The same goes for ns_capable.
>>>>>>>>>>
>>>>>>>>>> I would therefore suspect that this is some issue from the RANDSTRUCT plugin which seems to be incompatible with ccache.
>>>>>>>>>>
>>>>>>>>>> If you have built a kernel with a random seed for the first time, that will be put into the cache. If the next build is unmodified, the kernel will come out of the cache and will be exactly the same as the previous build.
>>>>>>>>>>
>>>>>>>>>> If you however modify some parts of the kernel (a minor release for example) you will only compile the changed parts BUT with a different seed for the randstruct plugin.
>>>>>>>>>>
>>>>>>>>>> And I suspect that this has happened here where your code is now simply reading the wrong memory.
>>>>>>>>>>
>>>>>>>>>> I would recommend reverting the RANDSTRUCT patch and that should allow you to have a proper image again.
>>>>>>>>>>
>>>>>>>>>> If you want to keep that, the only option would be to disable the ccache for the kernel. The kernel is however one of the largest packages and ccache works really really well here. We can discuss this if we have identified RANDSTRUCT to be the culprit.
>>>>>>>>>>
>>>>>>>>>> -Michael
>>>>>>>>>>
>>>>>>>>>>> On 7 Aug 2022, at 19:08, Peter Müller wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hello *,
>>>>>>>>>>>
>>>>>>>>>>> enclosed is a screenshot of what booting the installer for Core Update 170 (dirty)
>>>>>>>>>>> with kernel 5.15.57 and slab merging disabled looks like. With kernel 5.15.59, the
>>>>>>>>>>> VM screen stays blank, so I had to revert this to get some results.
>>>>>>>>>>>
>>>>>>>>>>> Frankly, I don't see why the kernel suddenly does not know anything about efivarfs
>>>>>>>>>>> anymore, and what's sunrpc got to do with it. For the latter,
>>>>>>>>>>> /build/lib/modules/5.15.57-ipfire/kernel/net/sunrpc/auth_gss/rpcsec_gss_krb5.ko.xz
>>>>>>>>>>> is still there, just as it has been in C169 before.
>>>>>>>>>>>
>>>>>>>>>>> Any ideas are appreciated. :-)
>>>>>>>>>>>
>>>>>>>>>>> Thanks, and best regards,
>>>>>>>>>>> Peter Müller
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Hello all, especially Arne,
>>>>>>>>>>>>
>>>>>>>>>>>> today, I upgraded to "IPFire 2.27 - Core Update 170 Development Build: next/06b4164d",
>>>>>>>>>>>> which primarily comes with Linux 5.15.59 and the slab cache merging disabled.
>>>>>>>>>>>> On my physical testing hardware, the boot process stalled after several kernel trace
>>>>>>>>>>>> message blocks had been displayed.
>>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately, I was unable to recover them in detail, but they occurred fairly
>>>>>>>>>>>> early, roughly around the mounting of the root file system. Since the machine is
>>>>>>>>>>>> semi-productive (we all test in production, don't we? ;-) ), I went back to C169
>>>>>>>>>>>> and will now investigate further which change broke the update.
>>>>>>>>>>>>
>>>>>>>>>>>> An earlier version of Core Update 170 (commit 668cf4c0d0c2dbbc607716956daace413837a8da,
>>>>>>>>>>>> I believe, but it was definitely after the randstruct changes) ran fine for days here,
>>>>>>>>>>>> so it must be a pretty recent change. Will keep you updated.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, and best regards,
>>>>>>>>>>>> Peter Müller
>>>>>>>>>>>
>>>>>>>>>>
>>

--===============2261272698792504834==--