drm/xe/vf: Workaround for race condition in GuC firmware during VF pause

A race condition exists where a paused VF's H2G request can be processed
and subsequently rejected. This rejection results in a FAST_REQ failure
being delivered to the KMD, which then terminates the CT via a dead
worker and triggers a GT reset, which is an undesirable outcome.

This workaround mitigates the issue by checking whether a VF
post-migration recovery is in progress and, if so, returning early so
that the CT is not declared dead and no GT reset is triggered.
The GuC firmware will address this bug in an upcoming release. Once that
version is available and VF migration depends on it, this workaround can
be safely removed.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-30-matthew.brost@intel.com
Matthew Brost 2025-10-08 14:45:27 -07:00
parent 1521fad9ad
commit 3b56911960


@@ -1398,6 +1398,10 @@ static int parse_g2h_response(struct xe_guc_ct *ct, u32 *msg, u32 len)
fast_req_report(ct, fence);
/* FIXME: W/A race in the GuC, will get in firmware soon */
if (xe_gt_recovery_pending(gt))
	return 0;
CT_DEAD(ct, NULL, PARSE_G2H_RESPONSE);
return -EPROTO;
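
The control flow of the hunk above can be modeled in isolation. The following is a minimal sketch, not driver code: `mock_gt`, `mock_ct`, and `handle_untracked_fast_req_failure()` are hypothetical stand-ins, with a plain flag assumed to play the role of `xe_gt_recovery_pending()` and another flag marking the point where the driver would invoke `CT_DEAD()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the GT; not the real xe structure. */
struct mock_gt {
	bool recovery_pending;	/* models what xe_gt_recovery_pending() reports */
};

/* Hypothetical stand-in for the CT channel state. */
struct mock_ct {
	struct mock_gt *gt;
	bool dead;		/* set where the driver would invoke CT_DEAD() */
};

/*
 * Models the tail of the FAST_REQ failure path: while a VF post-migration
 * recovery is pending, the failure is ignored; otherwise the CT is marked
 * dead and a protocol error is returned.
 */
static int handle_untracked_fast_req_failure(struct mock_ct *ct)
{
	assert(ct && ct->gt);

	/* Workaround: the rejection is assumed to be caused by the VF pause. */
	if (ct->gt->recovery_pending)
		return 0;

	ct->dead = true;
	return -EPROTO;
}
```

With `recovery_pending` set, the helper returns 0 and leaves `dead` untouched; with it cleared, the helper sets `dead` and returns `-EPROTO`, matching the behavior before the workaround.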