From mboxrd@z Thu Jan 1 00:00:00 1970
From: Peter Müller
To: development@lists.ipfire.org
Subject: Re: Core Update 170 testing report - "next/06b4164d" crashes on my x86_64 testing machine
Date: Tue, 09 Aug 2022 10:26:14 +0000
Message-ID:
In-Reply-To: <34648E8A-0D09-4B21-AE60-D76BC8ACEB5E@ipfire.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============5409505158339099440=="
List-Id:

--===============5409505158339099440==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

Hello Michael,

agreed. I will keep an eye on the mails emitted from the nightly builders and delete
the ccache, if necessary. If so, sending a short notice to this mailing list to inform
fellow developers about this issue should be sufficient AFAIC, since we are only dealing
with "next".

Thanks, and best regards,
Peter Müller

> Hello,
>
> Okay. That will leave us with the question of whether we have destroyed the ccache on the nightly builders (or any others).
>
> Since ccache might be unaware of the seed, we might have mixed files in the cache.
>
> If builds still fail after reverting the RANDSTRUCT patch, we might need to wipe the cache.
>
> -Michael
>
>> On 9 Aug 2022, at 10:28, Peter Müller wrote:
>>
>> Hello Arne,
>>
>> thank you very much for reporting back.
>>
>> Okay, then I will put the slab cache patch in again and leave randstruct disabled.
>>
>> Thanks, and best regards,
>> Peter Müller
>>
>>
>>> A fresh build with an empty ccache also boots with the slab cache patch,
>>> so RANDSTRUCT should be the real problem.
>>>
>>> Arne
>>>
>>> On 2022-08-09 08:23, Arne Fitzenreiter wrote:
>>>> On 2022-08-08 17:47, Peter Müller wrote:
>>>>> Hello Arne,
>>>>>
>>>>> thanks for reporting back.
>>>>>
>>>>> This means the slab cache patch is the problem.
>>>>
>>>> I'm not sure.
>>>> I fear it could be RANDSTRUCT, because after a version
>>>> update of the kernel it does not use the ccache for the first build,
>>>> and after a small config change it could break if some parts of
>>>> the kernel are taken from the cache and some are not.
>>>>
>>>> At the moment I am testing a clean build without ccache but with the slab
>>>> cache patch enabled. If this works,
>>>> it is the RANDSTRUCT change.
>>>>
>>>> Arne
>>>>
>>>>>
>>>>> Unfortunately, my local C-cache appears to be completely messed up now, so I
>>>>> will have to start with a clean cache, hence it will probably take me until
>>>>> tomorrow to have some testing results ready.
>>>>>
>>>>> Will keep you updated.
>>>>>
>>>>> Thanks, and best regards,
>>>>> Peter Müller
>>>>>
>>>>>
>>>>>> With this https://nightly.ipfire.org/next/2022-08-06%2007:45:02%20+0000-43df4a03/
>>>>>> nightly the kernel 5.15.59 boots on real hardware (x86_64 and aarch64).
>>>>>> After
>>>>>> commit 06b4164dfe269704976b52421edbbbdf3b345679
>>>>>> Author: Peter Müller
>>>>>> Date: Mon Aug 1 17:39:59 2022 +0000
>>>>>>
>>>>>> linux: Do not allow slab caches to be merged
>>>>>>
>>>>>>
>>>>>> it doesn't boot anymore. (also tested on x86_64 and aarch64)
>>>>>>
>>>>>> Arne
>>>>>>
>>>>>>
>>>>>> On 2022-08-08 12:22, Michael Tremer wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>>> On 8 Aug 2022, at 11:16, Peter Müller wrote:
>>>>>>>>
>>>>>>>> Hello Michael, hello Arne,
>>>>>>>>
>>>>>>>> just a quick reply: I think we are dealing with the combination of two issues here,
>>>>>>>> as kernel 5.15.59 without slab cache merging disabled won't even boot in a VM (the
>>>>>>>> screen stays blank indefinitely), and it crashes straight away with the slab cache
>>>>>>>> merging patch.
>>>>>>>>
>>>>>>>> Since kernel 5.15.57 is running perfectly fine here with randstruct enabled, and has
>>>>>>>> been for days, I just reverted both the update to 5.15.59 and the slab cache patch.
>>>>>>>> For the time being, I would leave randstruct enabled, since it does not seem to be a
>>>>>>>> root cause for whatever bug(s) we are dealing with at the moment.
>>>>>>>
>>>>>>> Is that from the first build or a consecutive one?
>>>>>>>
>>>>>>>> @Arne: Were you able to boot 5.15.59 successfully on hardware? If so, did it also
>>>>>>>> boot properly in a VirtualBox VM?
>>>>>>>>
>>>>>>>> Apologies for this coming up so unexpectedly.
>>>>>>>
>>>>>>> Well, things break. We should however be fast to have at least a
>>>>>>> booting kernel in the tree so that we won’t crash any more systems.
>>>>>>>
>>>>>>> And if that requires reverting both patches until we know for certain
>>>>>>> which one is the bad one, I find that the best option.
>>>>>>>
>>>>>>> -Michael
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks, and best regards,
>>>>>>>> Peter Müller
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> You seem to have a very classic NULL pointer dereference.
>>>>>>>>>
>>>>>>>>> Something is trying to follow a NULL pointer. And that isn’t possible.
>>>>>>>>>
>>>>>>>>> Now it is interesting to know why that is. The cap_capable function hasn’t been touched in the 5.15 tree in a while. The same goes for ns_capable.
>>>>>>>>>
>>>>>>>>> I would therefore suspect that this is some issue from the RANDSTRUCT plugin which seems to be incompatible with ccache.
>>>>>>>>>
>>>>>>>>> If you have built a kernel with a random seed for the first time, that will be put into the cache. If the next build is unmodified, the kernel will come out of the cache and will be exactly the same as the previous build.
>>>>>>>>>
>>>>>>>>> If you however modify some parts of the kernel (a minor release for example) you will only compile the changed parts BUT with a different seed for the randstruct plugin.
>>>>>>>>>
>>>>>>>>> And I suspect that this has happened here where your code is now simply reading the wrong memory.
>>>>>>>>>
>>>>>>>>> I would recommend reverting the RANDSTRUCT patch and that should allow you to have a proper image again.
>>>>>>>>>
>>>>>>>>> If you want to keep that, the only option would be to disable the ccache for the kernel. The kernel is however one of the largest packages and ccache works really really well here. We can discuss this if we have identified RANDSTRUCT to be the culprit.
>>>>>>>>>
>>>>>>>>> -Michael
>>>>>>>>>
>>>>>>>>>> On 7 Aug 2022, at 19:08, Peter Müller wrote:
>>>>>>>>>>
>>>>>>>>>> Hello *,
>>>>>>>>>>
>>>>>>>>>> enclosed is a screenshot of what booting the installer for Core Update 170 (dirty)
>>>>>>>>>> with kernel 5.15.57 and slab merging disabled looks like. With kernel 5.15.59, the
>>>>>>>>>> VM screen stays blank, so I had to revert this to get some results.
>>>>>>>>>>
>>>>>>>>>> Frankly, I don't see why the kernel suddenly does not know anything about efivarfs
>>>>>>>>>> anymore, and what's sunrpc got to do with it. For the latter,
>>>>>>>>>> /build/lib/modules/5.15.57-ipfire/kernel/net/sunrpc/auth_gss/rpcsec_gss_krb5.ko.xz
>>>>>>>>>> is still there, just as it has been in C169 before.
>>>>>>>>>>
>>>>>>>>>> Any ideas are appreciated. :-)
>>>>>>>>>>
>>>>>>>>>> Thanks, and best regards,
>>>>>>>>>> Peter Müller
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Hello all, especially Arne,
>>>>>>>>>>>
>>>>>>>>>>> today, I upgraded to "IPFire 2.27 - Core Update 170 Development Build: next/06b4164d",
>>>>>>>>>>> which primarily comes with Linux 5.15.59 and the slab cache merging disabled. On
>>>>>>>>>>> my physical testing hardware, the boot process stalled after several kernel trace
>>>>>>>>>>> message blocks being displayed.
>>>>>>>>>>>
>>>>>>>>>>> Unfortunately, I was unable to recover them in detail, but they occurred fairly
>>>>>>>>>>> early, roughly around the mounting of the root file system. Since the machine is
>>>>>>>>>>> semi-productive (we all test in production, don't we?
>>>>>>>>>>> ;-) ), I went back to C169
>>>>>>>>>>> and will now investigate further which change broke the update.
>>>>>>>>>>>
>>>>>>>>>>> An earlier version of Core Update 170 (commit 668cf4c0d0c2dbbc607716956daace413837a8da,
>>>>>>>>>>> I believe, but it was definitely after the randstruct changes) ran fine for days here,
>>>>>>>>>>> so it must be a pretty recent change. Will keep you updated.
>>>>>>>>>>>
>>>>>>>>>>> Thanks, and best regards,
>>>>>>>>>>> Peter Müller
>>>>>>>>>>
>>>>>>>>>
>

--===============5409505158339099440==--