From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michael Tremer
To: development@lists.ipfire.org
Subject: Re: Core Update 170 testing report - "next/06b4164d" crashes on my x86_64 testing machine
Date: Mon, 08 Aug 2022 11:22:36 +0100
Message-ID:
In-Reply-To: <83b41711-f866-a3b8-e401-f78e2ca01611@ipfire.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============4667349477246459786=="
List-Id:

--===============4667349477246459786==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

Hello,

> On 8 Aug 2022, at 11:16, Peter Müller wrote:
>
> Hello Michael, hello Arne,
>
> just a quick reply: I think we are dealing with the combination of two issues here,
> as kernel 5.15.59 without slab cache merging disabled won't even boot in a VM (the
> screen stays blank indefinitely), and it crashes straight away with the slab cache
> merging patch.
>
> Since kernel 5.15.57 is running perfectly fine here with randstruct enabled, and has
> been for days, I just reverted both the update to 5.15.59 and the slab cache patch.
> For the time being, I would leave randstruct enabled, since it does not seem to be a
> root cause for whatever bug(s) we are dealing with at the moment.

Is that from the first build or a consecutive one?

> @Arne: Were you able to boot 5.15.59 successfully on hardware? If so, did it also
> boot properly in a VirtualBox VM?
>
> Apologies for this coming up so unexpectedly.

Well, things break. We should however be fast to have at least a booting kernel in the tree so that we won’t crash any more systems.

And if that requires reverting both patches until we know for certain which one is the bad one, I find that the best option.

-Michael

>
> Thanks, and best regards,
> Peter Müller
>
>> Hello,
>>
>> You seem to have a very classic NULL pointer dereference.
>>
>> Something is trying to follow a NULL pointer. And that isn’t possible.
>>
>> Now it is interesting to know why that is. The cap_capable function hasn’t been touched in the 5.15 tree in a while. The same goes for ns_capable.
>>
>> I would therefore suspect that this is some issue from the RANDSTRUCT plugin which seems to be incompatible with ccache.
>>
>> If you have built a kernel with a random seed for the first time, that will be put into the cache. If the next build is unmodified, the kernel will come out of the cache and will be exactly the same as the previous build.
>>
>> If you however modify some parts of the kernel (a minor release for example) you will only compile the changed parts BUT with a different seed for the randstruct plugin.
>>
>> And I suspect that this has happened here where your code is now simply reading the wrong memory.
>>
>> I would recommend reverting the RANDSTRUCT patch and that should allow you to have a proper image again.
>>
>> If you want to keep that, the only option would be to disable the ccache for the kernel. The kernel is however one of the largest packages and ccache works really really well here. We can discuss this if we have identified RANDSTRUCT to be the culprit.
>>
>> -Michael
>>
>>> On 7 Aug 2022, at 19:08, Peter Müller wrote:
>>>
>>> Hello *,
>>>
>>> enclosed is a screenshot of what booting the installer for Core Update 170 (dirty)
>>> with kernel 5.15.57 and slab merging disabled looks like. With kernel 5.15.59, the
>>> VM screen stays blank, so I had to revert this to get some results.
>>>
>>> Frankly, I don't see why the kernel suddenly does not know anything about efivarfs
>>> anymore, and what's sunrpc got to do with it. For the latter,
>>> /build/lib/modules/5.15.57-ipfire/kernel/net/sunrpc/auth_gss/rpcsec_gss_krb5.ko.xz
>>> is still there, just as it has been in C169 before.
>>>
>>> Any ideas are appreciated.
>>> :-)
>>>
>>> Thanks, and best regards,
>>> Peter Müller
>>>
>>>
>>>> Hello all, especially Arne,
>>>>
>>>> today, I upgraded to "IPFire 2.27 - Core Update 170 Development Build: next/06b4164d",
>>>> which primarily comes with Linux 5.15.59 and the slab cache merging disabled. On
>>>> my physical testing hardware, the boot process stalled after several kernel trace
>>>> message blocks being displayed.
>>>>
>>>> Unfortunately, I was unable to recover them in detail, but they occurred fairly
>>>> early, roughly around the mounting of the root file system. Since the machine is
>>>> semi-productive (we all test in production, don't we? ;-) ), I went back to C169
>>>> and will now investigate further which change broke the update.
>>>>
>>>> An earlier version of Core Update 170 (commit 668cf4c0d0c2dbbc607716956daace413837a8da,
>>>> I believe, but it was definitely after the randstruct changes) ran fine for days here,
>>>> so it must be a pretty recent change. Will keep you updated.
>>>>
>>>> Thanks, and best regards,
>>>> Peter Müller
>>>
>>

--===============4667349477246459786==--