Summary
New RHEL 9.2 based GPU drivers are available to provision Intel GPU Flex and Max Series. Good news: the new drivers are no longer incompatible with the ast driver. On RHEL 8.6 based OCP 4.12, the ast driver had to be unloaded or blacklisted (via a machine config, which triggers a reboot) before the out of tree GPU drivers could be loaded.
Challenges:
The in-tree i915 and intel_vsec drivers have to be unloaded before the out of tree drivers are loaded. KMM can currently unload only one in-tree driver, but we now have a use case that requires unloading more than one. Potential short-term solution: unload intel_vsec outside of KMM, most likely via a machine config.
Once the out of tree drivers are loaded, they are difficult to unload because they are always in use, apparently by a GUI subcomponent, i.e. the framebuffer. The exact root cause has not been determined, but some component in the system actively uses the GPU and prevents the drivers from being unloaded. Finding the root cause needs more exploration because of the complexity involved. The lsof command was used to try to determine what was using the driver, but it did not provide any additional information.
Details:
2 components have changed:
- New GPU drivers/FW for RHEL 9.2
- New kernel for RHEL 9.2
KMM has a feature, available in version 1.1.1, that can unload one in-tree driver.
We can use this feature to unload the in-tree i915, but we cannot unload more than one kmod with it. We now have a use case for unloading more than one in-tree driver: i915 and intel_vsec for now, and potentially cse in the future.
3 Main Drivers for GPU: i915, intel_vsec (this is a prerequisite for i915), CSE (MEI)
Out of tree drivers behavior: Loading i915 driver will load the intel_vsec driver. Unloading i915 will unload intel_vsec.
In-tree driver behavior: Loading i915 does not load intel_vsec. Unloading i915 does not unload intel_vsec.
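Because unloading the in-tree i915 does not unload intel_vsec, both modules have to be removed explicitly. The following is a minimal sketch of that ordering, not the KMM mechanism itself. The function name `unload_in_order` and the `DRY_RUN` variable are hypothetical; by default the sketch only prints the commands it would run.

```shell
# Sketch: unload modules in the order given (dependents first).
# DRY_RUN defaults to 1, so nothing is unloaded unless DRY_RUN=0 is set.
# Real unloading needs root on the node.
unload_in_order() {
    for mod in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "modprobe -r $mod"
        else
            modprobe -r "$mod" || return 1
        fi
    done
}

# i915 first, then its prerequisite intel_vsec.
unload_in_order i915 intel_vsec
```

Setting `DRY_RUN=0` would run the `modprobe -r` calls for real and stop at the first failure.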
RHEL 9.2 based OCP 4.13 has a new kernel based on the upstream 5.14.z kernel. This is a huge jump from RHEL 8.6 based OCP 4.12, which used the upstream 4.18.z kernel.
Initial smoke test analysis and Observed Impact:
RHEL 9.2 includes in-tree i915 and intel_vsec drivers (not loaded by default; the kernel loads them only when it detects the GPU card via its PCI device ID). These two in-tree drivers do not support Intel GPU Flex or Max Series. The in-tree i915 driver provides display support for Intel Client Arc GPUs. As a result, customers will see the following message in dmesg:
sh-5.1# dmesg | grep graphics
[ 12.385679] i915 0000:33:00.0: Your graphics device 56c0 is not properly supported by the driver in this
[ 478.732896] i915 0000:33:00.0: Your graphics device 56c0 is not properly supported by the driver in this
Intel® Data Center GPU Flex 170 -> PCI ID is 56c0.
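To confirm which PCI address carries the card, the vendor and device IDs can be read from sysfs. This is a rough sketch: the function name `find_pci_devices` and the `SYSFS_ROOT` override are hypothetical, and 8086 is Intel's PCI vendor ID (the report itself only gives the device ID 56c0).

```shell
# Sketch: list PCI addresses whose vendor:device ID matches.
# SYSFS_ROOT is a hypothetical override so the function can be tried
# against a fake directory tree; it defaults to /sys.
find_pci_devices() {
    want_vendor="0x$1"
    want_device="0x$2"
    for dev in "${SYSFS_ROOT:-/sys}"/bus/pci/devices/*; do
        [ -r "$dev/vendor" ] || continue
        if [ "$(cat "$dev/vendor")" = "$want_vendor" ] &&
           [ "$(cat "$dev/device")" = "$want_device" ]; then
            basename "$dev"
        fi
    done
}

# Intel (8086) Data Center GPU Flex 170 (56c0)
find_pci_devices 8086 56c0
```

On the node from the dmesg output above, this would be expected to list 0000:33:00.0; on a machine without the card it prints nothing.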
Observation 1:
If the in-tree intel_vsec is not unloaded before the out of tree i915 driver is loaded, unknown symbol errors are observed in dmesg.
[ 3238.466900] compat: loading out-of-tree module taints kernel.
[ 3238.466931] compat: module verification failed: signature and/or required key missing - tainting kernel
[ 3238.468361] COMPAT BACKPORTED INIT
[ 3238.468362] Loading modules backported from I915-23.6.37
[ 3238.468363] Backport generated by backports.git I915_23.6.37_PSB_230425.49
[ 3239.444973] i915: Unknown symbol intel_vsec_register (err -2)
[ 3271.091366] i915: Unknown symbol intel_vsec_register (err -2)
[ 3317.364301] i915: Unknown symbol intel_vsec_register (err -2)
[ 3376.362727] i915: Unknown symbol intel_vsec_register (err -2)
When we unload the in-tree intel_vsec driver first and change nothing else, the issue above is not observed.
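A pre-check along these lines could catch the situation before the out of tree i915 is loaded. It is only a sketch, assuming the kernel exposes `/sys/module/<name>/taint` (where an `O` flag marks an out of tree module). The function name `module_origin` and the `SYSFS_ROOT` override are hypothetical.

```shell
# Sketch: report whether a module is not loaded, in-tree, or out-of-tree.
# SYSFS_ROOT is a hypothetical override for trying this against a fake tree.
# Note: built-in code can also have a /sys/module/<name> directory; it is
# reported as "in-tree" here.
module_origin() {
    mod_dir="${SYSFS_ROOT:-/sys}/module/$1"
    if [ ! -d "$mod_dir" ]; then
        echo "not-loaded"
        return 0
    fi
    case "$(cat "$mod_dir/taint" 2>/dev/null)" in
        *O*) echo "out-of-tree" ;;
        *)   echo "in-tree" ;;
    esac
}

if [ "$(module_origin intel_vsec)" = "in-tree" ]; then
    echo "in-tree intel_vsec is loaded; unload it before loading the out of tree i915"
fi
```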
Observation 2:
When you delete the KMM Module CR, KMM attempts to unload the out of tree i915 driver via a PreStop hook; it does not reload the in-tree i915 driver. This is by KMM design. Essentially, the kernel is tainted. In our case, when KMM tries to clean up, it is unable to unload the out of tree i915 driver because the module is reported as in use.
We are also unable to manually unload the out of tree i915 or intel_vsec driver:
sh-5.1# modprobe -rv intel_vsec
modprobe: FATAL: Module intel_vsec is in use.
sh-5.1# modprobe -rv i915
modprobe: FATAL: Module i915 is in use.
lsmod output after the out of tree drivers are loaded; note the use count in the 3rd column:
sh-5.1# lsmod | grep i915
i915 3977216 4
intel_vsec 20480 1 i915
intel_gtt 24576 1 i915
compat 24576 2 intel_vsec,i915
video 61440 1 i915
drm_display_helper 172032 2 compat,i915
cec 61440 2 drm_display_helper,i915
i2c_algo_bit 16384 2 ast,i915
drm_kms_helper 192512 5 ast,drm_display_helper,i915
drm 581632 7 drm_kms_helper,compat,ast,drm_shmem_helper,drm_display_helper,i915
sh-5.1# lsmod | grep intel_vsec
intel_vsec 20480 1 i915
compat 24576 2 intel_vsec,i915
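The use count and the holding modules can be pulled out of this output with a small filter. This is a sketch only; the function name `module_refs` is hypothetical, and it reads lsmod-style text on stdin so it can be tried on the sample above as well as on live `lsmod` output.

```shell
# Sketch: print "<use count> <holders>" for one module from lsmod-style input.
# A "-" means lsmod lists no holding modules for it.
module_refs() {
    awk -v mod="$1" '$1 == mod { print $3, ($4 == "" ? "-" : $4) }'
}

# Sample lines taken from the output above.
module_refs i915 <<'EOF'
i915 3977216 4
intel_vsec 20480 1 i915
EOF
```

For the sample this prints `4 -`: i915 has four references but no holding modules listed, which suggests the references come from something other than module dependencies. On a node, `lsmod | module_refs i915` would give the live values.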
Documenting a dependency diagram for the out of tree GPU drivers has been noted as a future exercise.