From mboxrd@z Thu Jan 1 00:00:00 1970
From: Peter =?utf-8?q?M=C3=BCller?=
To: development@lists.ipfire.org
Subject: Re: Core Update 170 testing report - "next/06b4164d" crashes on my x86_64 testing machine
Date: Mon, 08 Aug 2022 15:47:51 +0000
Message-ID: <21efe16c-bad8-c9a0-dede-10762d269cd0@ipfire.org>
In-Reply-To: <7300c922548070c647e561cbbf7817f2@ipfire.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============8954580781679705562=="
List-Id:

--===============8954580781679705562==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

Hello Arne,

thanks for reporting back. This means the slab cache patch is the problem.

Unfortunately, my local ccache appears to be completely messed up now, so I will
have to start with a clean cache, hence it will probably take me until tomorrow to
have some testing results ready.

Will keep you updated.

Thanks, and best regards,
Peter Müller

> With this https://nightly.ipfire.org/next/2022-08-06%2007:45:02%20+0000-43df4a03/
> nightly the kernel 5.15.59 boots on real hardware (x86_64 and aarch64)
> After
> commit 06b4164dfe269704976b52421edbbbdf3b345679
> Author: Peter Müller
> Date:   Mon Aug 1 17:39:59 2022 +0000
>
>     linux: Do not allow slab caches to be merged
>
>
> it doesn't boot anymore. (also tested on x86_64 and aarch64)
>
> Arne
>
>
> On 2022-08-08 12:22, Michael Tremer wrote:
>> Hello,
>>
>>> On 8 Aug 2022, at 11:16, Peter Müller wrote:
>>>
>>> Hello Michael, hello Arne,
>>>
>>> just a quick reply: I think we are dealing with the combination of two issues here,
>>> as kernel 5.15.59 without slab cache merging disabled won't even boot in a VM (the
>>> screen stays blank indefinitely), and it crashes straight away with the slab cache
>>> merging patch.
>>>
>>> Since kernel 5.15.57 is running perfectly fine here with randstruct enabled, and has
>>> been for days, I just reverted both the update to 5.15.59 and the slab cache patch.
>>> For the time being, I would leave randstruct enabled, since it does not seem to be a
>>> root cause for whatever bug(s) we are dealing with at the moment.
>>
>> Is that from the first build or a consecutive one?
>>
>>> @Arne: Were you able to boot 5.15.59 successfully on hardware? If so, did it also
>>> boot properly in a VirtualBox VM?
>>>
>>> Apologies for this coming up so unexpectedly.
>>
>> Well, things break. We should however be fast to have at least a
>> booting kernel in the tree so that we won’t crash any more systems.
>>
>> And if that requires reverting both patches until we know for certain
>> which one is the bad one, I find that the best option.
>>
>> -Michael
>>
>>>
>>> Thanks, and best regards,
>>> Peter Müller
>>>
>>>> Hello,
>>>>
>>>> You seem to have a very classic NULL pointer dereference.
>>>>
>>>> Something is trying to follow a NULL pointer. And that isn’t possible.
>>>>
>>>> Now it is interesting to know why that is. The cap_capable function hasn’t been touched in the 5.15 tree in a while. The same goes for ns_capable.
>>>>
>>>> I would therefore suspect that this is some issue from the RANDSTRUCT plugin which seems to be incompatible with ccache.
>>>>
>>>> If you have built a kernel with a random seed for the first time, that will be put into the cache. If the next build is unmodified, the kernel will come out of the cache and will be exactly the same as the previous build.
>>>>
>>>> If you however modify some parts of the kernel (a minor release for example) you will only compile the changed parts BUT with a different seed for the randstruct plugin.
>>>>
>>>> And I suspect that this has happened here where your code is now simply reading the wrong memory.
>>>>
>>>> I would recommend reverting the RANDSTRUCT patch and that should allow you to have a proper image again.
>>>>
>>>> If you want to keep that, the only option would be to disable the ccache for the kernel. The kernel is however one of the largest packages and ccache works really really well here. We can discuss this if we have identified RANDSTRUCT to be the culprit.
>>>>
>>>> -Michael
>>>>
>>>>> On 7 Aug 2022, at 19:08, Peter Müller wrote:
>>>>>
>>>>> Hello *,
>>>>>
>>>>> enclosed is a screenshot of what booting the installer for Core Update 170 (dirty)
>>>>> with kernel 5.15.57 and slab merging disabled looks like. With kernel 5.15.59, the
>>>>> VM screen stays blank, so I had to revert this to get some results.
>>>>>
>>>>> Frankly, I don't see why the kernel suddenly does not know anything about efivarfs
>>>>> anymore, and what's sunrpc got to do with it. For the latter,
>>>>> /build/lib/modules/5.15.57-ipfire/kernel/net/sunrpc/auth_gss/rpcsec_gss_krb5.ko.xz
>>>>> is still there, just as it has been in C169 before.
>>>>>
>>>>> Any ideas are appreciated. :-)
>>>>>
>>>>> Thanks, and best regards,
>>>>> Peter Müller
>>>>>
>>>>>
>>>>>> Hello all, especially Arne,
>>>>>>
>>>>>> today, I upgraded to "IPFire 2.27 - Core Update 170 Development Build: next/06b4164d",
>>>>>> which primarily comes with Linux 5.15.59 and the slab cache merging disabled. On
>>>>>> my physical testing hardware, the boot process stalled after several kernel trace
>>>>>> message blocks being displayed.
>>>>>>
>>>>>> Unfortunately, I was unable to recover them in detail, but they occurred fairly
>>>>>> early, roughly around the mounting of the root file system. Since the machine is
>>>>>> semi-productive (we all test in production, don't we? ;-) ), I went back to C169
>>>>>> and will now investigate further which change broke the update.
>>>>>>
>>>>>> An earlier version of Core Update 170 (commit 668cf4c0d0c2dbbc607716956daace413837a8da,
>>>>>> I believe, but it was definitely after the randstruct changes) ran fine for days here,
>>>>>> so it must be a pretty recent change. Will keep you updated.
>>>>>>
>>>>>> Thanks, and best regards,
>>>>>> Peter Müller
>>>>>
>>>>

--===============8954580781679705562==--