From mboxrd@z Thu Jan  1 00:00:00 1970
From: Peter =?utf-8?q?M=C3=BCller?= <peter.mueller@ipfire.org>
To: development@lists.ipfire.org
Subject: Re: Core Update 170 testing report - "next/06b4164d" crashes on my
 x86_64 testing machine
Date: Mon, 08 Aug 2022 15:47:51 +0000
Message-ID: <21efe16c-bad8-c9a0-dede-10762d269cd0@ipfire.org>
In-Reply-To: <7300c922548070c647e561cbbf7817f2@ipfire.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============8954580781679705562=="
List-Id: <development.lists.ipfire.org>

--===============8954580781679705562==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

Hello Arne,

thanks for reporting back.

This means the slab cache patch is the problem.

Unfortunately, my local ccache appears to be completely messed up now, so I
will have to start with a clean cache, hence it will probably take me until
tomorrow to have some testing results ready.

Will keep you updated.

Thanks, and best regards,
Peter M=C3=BCller


> With this https://nightly.ipfire.org/next/2022-08-06%2007:45:02%20+0000-43d=
f4a03/
> nightly the kernel 5.15.59 boots on real hardware (x86_64 and aarch64)
> After
> commit 06b4164dfe269704976b52421edbbbdf3b345679
> Author: Peter M=C3=BCller <peter.mueller(a)ipfire.org>
> Date:=C2=A0=C2=A0 Mon Aug 1 17:39:59 2022 +0000
>=20
> =C2=A0=C2=A0=C2=A0 linux: Do not allow slab caches to be merged
>=20
>=20
> it doesn't boot anymore. (also tested on x86_64 and aarch64)
>=20
> Arne
>=20
>=20
> Am 2022-08-08 12:22, schrieb Michael Tremer:
>> Hello,
>>
>>> On 8 Aug 2022, at 11:16, Peter M=C3=BCller <peter.mueller(a)ipfire.org> w=
rote:
>>>
>>> Hello Michael, hello Arne,
>>>
>>> just a quick reply: I think we are dealing with the combination of two is=
sues here,
>>> as kernel 5.15.59 without slab cache merging disabled won't even boot in =
a VM (the
>>> screen stays blank indefinitely), and it crashes straight away with the s=
lab cache
>>> merging patch.
>>>
>>> Since kernel 5.15.57 is running perfectly fine here with randstruct enabl=
ed, and has
>>> been for days, I just reverted both the update to 5.15.59 and the slab ca=
che patch.
>>> For the time being, I would leave randstruct enabled, since it does not s=
eem to be a
>>> root cause for whatever bug(s) we are dealing with at the moment.
>>
>> Is that from the first build or a consecutive one?
>>
>>> @Arne: Were you able to boot 5.15.59 successfully on hardware? If so, did=
 it also
>>> boot properly in a VirtualBox VM?
>>>
>>> Apologies for this coming up so unexpectedly.
>>
>> Well, things break. We should however be fast to have at least a
>> booting kernel in the tree so that we won=E2=80=99t crash any more systems.
>>
>> And if that requires reverting both patches until we know for certain
>> which one is the bad one, I find that the best option.
>>
>> -Michael
>>
>>>
>>> Thanks, and best regards,
>>> Peter M=C3=BCller
>>>
>>>> Hello,
>>>>
>>>> You seem to have a very classic NULL pointer dereference.
>>>>
>>>> Something is trying to follow a NULL pointer. And that isn=E2=80=99t pos=
sible.
>>>>
>>>> Now it is interesting to know why that is. The cap_capable function hasn=
=E2=80=99t been touched in the 5.15 tree in a while. The same goes for ns_cap=
able.
>>>>
>>>> I would therefore suspect that this is some issue from the RANDSTRUCT pl=
ugin which seems to be incompatible with ccache.
>>>>
>>>> If you have built a kernel with a random seed for the first time, that w=
ill be put into the cache. If the next build is unmodified, the kernel will c=
ome out of the cache and will be exactly the same as the previous build.
>>>>
>>>> If you however modify some parts of the kernel (a minor release for exam=
ple) you will only compile the changed parts BUT with a different seed for th=
e randstruct plugin.
>>>>
>>>> And I suspect that this has happened here where your code is now simply =
reading the wrong memory.
>>>>
>>>> I would recommend reverting the RANDSTRUCT patch and that should allow y=
ou to have a proper image again.
>>>>
>>>> If you want to keep that, the only option would be to disable the ccache=
 for the kernel. The kernel is however one of the largest packages and ccache=
 works really really well here. We can discuss this if we have identified RAN=
DSTRUCT to be the culprit.
>>>>
>>>> -Michael
>>>>
>>>>> On 7 Aug 2022, at 19:08, Peter M=C3=BCller <peter.mueller(a)ipfire.org>=
 wrote:
>>>>>
>>>>> Hello *,
>>>>>
>>>>> enclosed is a screenshot of what booting the installer for Core Update =
170 (dirty)
>>>>> with kernel 5.15.57 and slab merging disabled looks like. With kernel 5=
.15.59, the
>>>>> VM screen stays blank, so I had to revert this to get some results.
>>>>>
>>>>> Frankly, I don't see why the kernel suddenly does not know anything abo=
ut efivarfs
>>>>> anymore, and what's sunrpc got to do with it. For the latter,
>>>>> /build/lib/modules/5.15.57-ipfire/kernel/net/sunrpc/auth_gss/rpcsec_gss=
_krb5.ko.xz
>>>>> is still there, just as it has been in C169 before.
>>>>>
>>>>> Any ideas are appreciated. :-)
>>>>>
>>>>> Thanks, and best regards,
>>>>> Peter M=C3=BCller
>>>>>
>>>>>
>>>>>> Hello all, especially Arne,
>>>>>>
>>>>>> today, I upgraded to "IPFire 2.27 - Core Update 170 Development Build:=
 next/06b4164d",
>>>>>> which primarily comes with Linux 5.15.59 and the slab cache merging di=
sabled. On
>>>>>> my physical testing hardware, the boot process stalled after several k=
ernel trace
>>>>>> message blocks being displayed.
>>>>>>
>>>>>> Unfortunately, I was unable to recover them in detail, but they occurr=
ed fairly
>>>>>> early, roughly around the mounting of the root file system. Since the =
machine is
>>>>>> semi-productive (we all test in production, don't we? ;-) ), I went ba=
ck to C169
>>>>>> and will now investigate further which change broke the update.
>>>>>>
>>>>>> An earlier version of Core Update 170 (commit 668cf4c0d0c2dbbc60771695=
6daace413837a8da,
>>>>>> I believe, but it was definitely after the randstruct changes) ran fin=
e for days here,
>>>>>> so it must be a pretty recent change. Will keep you updated.
>>>>>>
>>>>>> Thanks, and best regards,
>>>>>> Peter M=C3=BCller
>>>>> <screenshot_c170_dirty_crash_on_boot_sunrpc_efivarfs.png>
>>>>

--===============8954580781679705562==--