linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-23 22:52:19 +02:00

Author	SHA1	Message	Date
Christian König	b6fe4ff340	drm/amdgpu: fix handling in amdgpu_userq_create Well mostly the same issues the other code had as well: 1. Memory allocation while holding the userq_mutex lock is forbidden! 2. Things were created/started/published in the wrong order. 3. The reset lock was taken in the wrong order and seems to be unnecessary in the first place. 4. Error messages on invalid input parameters can spam the logs. 5. Error messages on memory allocation failures are usually superfluous as well. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 89e50de5654dbe7a137e03d78629542e17ba7202)	2026-05-19 12:25:32 -04:00
Sunil Khatri	b6a28b77b8	drm/amdgpu: userq_va_mapped should remain true once done Multiple queues needs these bo_va objects belonging to the same uq_mgr. So once they are mapped lets not unmap them as at any point of time any of the queues might be using it. Also userq_va_mapped should be a boolean than atomic. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 5c02889ea22575c3bcfdf212e65fac316cbc6c6a)	2026-05-19 12:15:49 -04:00
Sunil Khatri	d892a6eca7	drm/amdgpu: remove va cursors for all mappings va_cursor struct needs to be cleaned even if the mapping has been removed already. Also simplify it by make it a void function as return value check isn't needed as its called during tear down. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 4d35a45c9b4c1ac5b6e3219f83c3db706b675fa2)	2026-05-19 12:09:01 -04:00
Sunil Khatri	0be97436bf	drm/amdgpu/userq: update the vm task info during signal ioctl Pagefaults does not have process information correctly populated as vm->task is not set during vm_init but should be updated while real submission. So setting that up during signal_ioctl to get the correct submission process details. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit a9b14d88b4d83e21ab965f23d1fb7b07b87e0517)	2026-05-19 12:08:02 -04:00
Sunil Khatri	291df3dc7d	drm/amdgpu/userq: cancel reset work while tear down in progress While tear down of a userq_mgr is happening when all the queues are free we should cancel any reset work if pending before exiting. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 160164609f71f774c4f661227a9b7a370a86b112)	2026-05-19 12:07:52 -04:00
Christian König	c8ed2de0f2	drm/amdgpu: rework userq reset work handling It is illegal to schedule reset work from another reset work! Fix this by scheduling the userq reset work directly on the work queue of the reset domain. Not fully tested, I leave that to the IGT test cases. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit fd9200ccefab94f27877d1943761d6b0ccbd89c8)	2026-05-19 12:07:42 -04:00
Sunil Khatri	be045c5c83	drm/amdgpu/userq: pin mqd and fw object bo to avoid eviction mqd and fw objects are queue core objects which should remain valid and never be unmapped and evicted for user queues to work properly. During eviction if these buffers are evicted the hw continue to use the invalid addresses and caused page faults and system hung. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit a3bbf32a336939a1d21b9561f8e53333b684b7ef)	2026-05-19 12:07:36 -04:00
Christian König	0071e01c61	drm/amdgpu: fix userq hang detection and reset Fix lock inversions pointed out by Prike and Sunil. The hang detection timeout CAN'T grab locks under which we wait for fences, especially not the userq_mutex lock. Then instead of this completely broken handling with the hang_detect_fence just cancel the work when fences are processed and re-start if necessary. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1b62077f045ac6ffde7c97005c6659569ac5c1ec)	2026-05-11 17:47:11 -04:00
Christian König	d0053441ad	drm/amdgpu: remove almost all calls to amdgpu_userq_detect_and_reset_queues Well the reset handling seems broken on multiple levels. As first step of fixing this remove most calls to the hang detection. That function should only be called after we run into a timeout! And NOT as random check spread over the code in multiple places. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 71bea36b54ccfb14cbc90f94267af6369af4e702)	2026-05-11 17:47:04 -04:00
Christian König	44e5bc73bd	drm/amdgpu: rework amdgpu_userq_signal_ioctl v3 This one was fortunately not looking so bad as the wait ioctl path, but there were still a few things which could be fixed/improved: 1. Allocating with GFP_ATOMIC was quite unnecessary, we can do that before taking the userq_lock. 2. Use a new mutex as protection for the fence_drv_xa so that we can do memory allocations while holding it. 3. Starting the reset timer is unnecessary when the fence is already signaled when we create it. 4. Cleanup error handling, avoid trying to free the queue when we don't even got one. v2: fix incorrect usage of xa_find, destroy the new mutex on error v3: cleanup ref ordering Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1609eb0f81a609d350169839128cecf298c84e7a)	2026-05-11 17:46:43 -04:00
Christian König	d5971c5c34	drm/amdgpu: remove deadlocks from amdgpu_userq_pre_reset The purpose of a GPU reset is to make sure that fence can be signaled again and the signal and resume workers can make progress again. So waiting for the resume worker or any fence in the GPU reset path is just utterly nonsense. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit fcd5f065eab46993af43442fd77ee8d9eb9c5bdf)	2026-05-11 17:46:34 -04:00
Prike Liang	8f935acbc1	drm/amdgpu: clean up the userq unmap error handler amdgpu_userq_unmap_helper() already handles the unmap error case. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 66cb6579990b633ccc7300c27011d837b9a58da0)	2026-04-28 15:51:18 -04:00
Christian König	d2f272a36e	drm/amdgpu: rework userq fence signal processing Move more code into a common userq function. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 12f52fab11500d0dce7d23c71909eaf0cf9aa701)	2026-04-28 15:43:57 -04:00
Hongyan Xu	508babf310	drm/amdgpu: avoid double drm_exec_fini() in userq validate When new_addition is true, amdgpu_userq_vm_validate() calls drm_exec_fini(&exec) before iterating over the collected HMM ranges and calling amdgpu_ttm_tt_get_user_pages(). If amdgpu_ttm_tt_get_user_pages() fails in that path, the code jumps to unlock_all and calls drm_exec_fini(&exec) a second time on the same exec object. drm_exec_fini() is not idempotent: it frees exec->objects and may also drop exec->contended and finalize the ww acquire context. Route that error path directly to the range cleanup once exec has already been finalized. Fixes: `42f1487884` ("drm/amdgpu/userqueue: validate userptrs for userqueues") Issue found using a prototype static analysis tool and confirmed by code review. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Hongyan Xu <getshell@seu.edu.cn> Signed-off-by: Slavin Liu <220245772@seu.edu.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 2802952e4a07306da6ebe813ff1acacc5691851a)	2026-04-24 11:08:58 -04:00
Sunil Khatri	b250a43bf5	drm/amdgpu/userq: unpin and unref doorbell and wptr outside mutex In amdgpu_userq_destroy once unmap_helper is called within mutex there is no need to hold mutex. This helps in avoiding a deadlock between doorbell and wptr ww mutex and we could unpin and unref these bos outside mutex safely. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-17 15:41:12 -04:00
Sunil Khatri	d3a9fe4584	drm/amdgpu/userq: use pm_runtime_resume_and_get and fix err handling Use pm_runtime_resume_and_get instead of pm_runtime_get_sync as it return error but put the reference in the function itself. In goto statements we need to drop the pm reference too. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-17 15:41:12 -04:00
Sunil Khatri	1e8b7062d2	drm/amdgpu/userq: unmap_helper dont return the queue state We check for return value of amdgpu_userq_unmap_helper and compare it against the queue->state which is logically wrong and we should just check for failure and do the needful. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-17 15:41:12 -04:00
Sunil Khatri	810df8de2f	drm/amdgpu/userq: unmap is to be called before freeing doorbell/wptr bo Unmap the queue after freeing doorbell and wptr memory is completely wrong. Any operation on the queue needs the doorbell and wptr to be valid and hence fixing the ordering. Also since we are using amdgpu_bo_reserve in non-interruptible mode so there is no need to check for its return values. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-17 15:41:12 -04:00
Sunil Khatri	85653fe2e5	drm/amdgpu/userq: hold root bo lock in caller of input_va_validate Caller should hold the reservation lock for root.bo in func amdgpu_userq_input_va_validate. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-17 15:41:12 -04:00
Sunil Khatri	168178b0cb	drm/amdgpu/userq: caller to take reserv lock for vas_list_cleanup In function amdgpu_userq_buffer_vas_list_cleanup, remove the reservation lock for vm and caller should make sure it's taken before locking userq_mutex. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-17 15:41:12 -04:00
Sunil Khatri	469e6fea09	drm/amdgpu/userq: create_mqd does not need userq_mutex Reshuffle the code to run create_mqd outside the mutex. code here is mostly setting up software structure init before actually registering the userqueue in the xa and to the driver. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-17 15:41:12 -04:00
Sunil Khatri	51358444d1	drm/amdgpu/userq: dont lock root bo with userq_mutex held Do not hold reservation lock for root bo if userq_mutex is already held in the call flow this cause a lock issue with ttm_bo_delayed_delete. Its better to lock the vm->root.bo first and then go ahead with userq_mutex so userq_mutex threads dont get stuck until the reservation lock is held. In this case it helps in the function amdgpu_userq_buffer_vas_mapped for each queue during restore_all. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-17 15:41:12 -04:00
Sunil Khatri	1eb90c7403	drm/amdgpu/userq: fix kerneldoc for amdgpu_userq_ensure_ev_fence Move the comment for the caller to the definition for amdgpu_userq_ensure_ev_fence in kerneldoc format. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-17 15:41:11 -04:00
Sunil Khatri	dc87834e9a	drm/amdgpu/userq: clean the VA mapping list for failed queue creation If the queue creation failed during mapping of the important VA's like queue_va, rptr_va and wptr_va. These needs to be cleaned as queue destroy will not be called for such queues as user never get call to creation failure. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-17 15:41:11 -04:00
Sunil Khatri	a7fe8c1b6c	drm/amdgpu/userq: avoid unnecessary locking in amdgpu_userq_create Reorganise code to avoid holding mutex userq_mutex while also trying to grab exec lock ww_mutex where its not needed for function amdgpu_userq_input_va_validate Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-17 15:41:11 -04:00
Prike Liang	48c33af0b6	drm/amdgpu: make userq fence_drv drop explicit in queue destroy amdgpu_userq_fence_driver_free() is now responsible only for releasing per-queue ancillary state (last_fence, fence_drv_xa) and no longer touches the ownership reference, making each function's contract clear. v2: Get the userq fence driver from amdgpu_userq_fence_driver_alloc() directly and dropping the userq fence driver reference after removing userq_doorbell_xa entry.(Christian) Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-03 14:03:56 -04:00
Sunil Khatri	05ce444171	drm/amdgpu/userq: use dma_fence_wait_timeout without test for signalled In function amdgpu_userq_wait_for_last_fence use dma_fence_wait to wait infinitely. Also there is no need to print error as we wont be timing out anymore. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-03 13:59:22 -04:00
Sunil Khatri	38476bde59	drm/amdgpu/userq: call dma_resv_wait_timeout without test for signalled In function amdgpu_userq_gem_va_unmap_validate call dma_resv_wait_timeout directly. Also since we are waiting forever we should not be having any return value and hence no handling needed. Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-03 13:59:15 -04:00
Sunil Khatri	4c86e12ab1	drm/amdgpu/userq: add the return code too in error condition In function amdgpu_userq_restore a. amdgpu_userq_vm_validate: add return code in error condition b. amdgpu_userq_restore_all: It already prints the error log, just update the error log in the function and remove it from caller. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-03 13:59:10 -04:00
Sunil Khatri	1e57e72ae3	drm/amdgpu/userq: fence wait for max time in amdgpu_userq_wait_for_signal wait for infinite time for fences in function amdgpu_userq_wait_for_signal and for that use dma_fence_wait(f, false); Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-03 13:58:53 -04:00
Sunil Khatri	1d4ade3646	drm/amdgpu/userq: dont need check for return values in amdgpu_userq_evict Function of amdgpu_userq_evict function do not need to check for return values as we dont use them and no need to log errors as we are already logging in called functions. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-03 13:56:15 -04:00
Sunil Khatri	4eaf5d2c31	drm/amdgpu/userq: Fix the code alignment for readability Fix the code alignment for if condition and also provide a line space between multiline if condition and next statement. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-30 14:38:04 -04:00
Sunil Khatri	de95eda05f	drm/amdgpu/userq: amdgpu_userq_vm_validate does not need userq mutex amdgpu_userq_vm_validate function does not need userq_mutex and exec lock is good enough to locking all bos and updating the eviction fence. Also since we only need userq_mutex for amdgpu_userq_restore_all so move the locks in the function itself. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-30 14:31:55 -04:00
Junrui Luo	de1ef4ffd7	drm/amdgpu: validate doorbell_offset in user queue creation amdgpu_userq_get_doorbell_index() passes the user-provided doorbell_offset to amdgpu_doorbell_index_on_bar() without bounds checking. An arbitrarily large doorbell_offset can cause the calculated doorbell index to fall outside the allocated doorbell BO, potentially corrupting kernel doorbell space. Validate that doorbell_offset falls within the doorbell BO before computing the BAR index, using u64 arithmetic to prevent overflow. Fixes: `f09c1e6077` ("drm/amdgpu: generate doorbell index for userqueue") Reported-by: Yuhao Jiang <danisjiang@gmail.com> Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-30 14:30:55 -04:00
Sunil Khatri	a3057aa926	drm/amdgpu/userq: schedule_delayed_work should be after fence signalled Reorganise the amdgpu_eviction_fence_suspend_worker code so schedule_delayed_work is the last thing we do after amdgpu_userq_evict is complete and the eviction fence is signalled. Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-24 13:35:23 -04:00
Sunil Khatri	473527e70e	drm/amdgpu/userq: dont use goto to jump when at end of function In function amdgpu_userq_restore_worker we dont need to use goto as we already in the end of function and it will exit naturally. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-24 13:32:37 -04:00
Sunil Khatri	8f402ddd4f	drm/amdgpu/userq: cleanup amdgpu_userq_get/put where not needed amdgpu_userq_put/get are not needed in case we already holding the userq_mutex and reference is valid already from queue create time or from signal ioctl. These additional get/put could be a potential reason for deadlock in case the ref count reaches zero and destroy is called which again try to take the userq_mutex. Due to the above change we avoid deadlock between suspend/restore calling destroy queues trying to take userq_mutex again. Cc: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-23 14:17:31 -04:00
Srinivasan Shanmugam	9a62a097a7	drm/amdgpu: Drop redundant queue NULL check in hang detect worker amdgpu_userq_hang_detect_work() retrieves the queue pointer using container_of() from the embedded work item. Since the work structure is part of struct amdgpu_usermode_queue, the returned queue pointer cannot be NULL in normal execution. Remove the redundant !queue check and keep the validation for queue->userq_mgr. Fixes the below: drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c:159 amdgpu_userq_hang_detect_work() warn: can 'queue' even be NULL? Fixes: `290f46cf57` ("drm/amdgpu: Implement user queue reset functionality") Cc: Jesse Zhang <Jesse.Zhang@amd.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Acked-by: Jesse Zhang <jesse.zhang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-17 17:46:44 -04:00
Christian König	99f30a0607	drm/amdgpu: fix eviction fence and userq manager shutdown That is a really complicated dance and wasn't implemented fully correct. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-17 17:46:21 -04:00
Christian König	2cd7284ba5	drm/amdgpu: completely rework eviction fence handling v2 Well that was broken on multiple levels. First of all a lot of checks were placed at incorrect locations, especially if the resume worker should run or not. Then a bunch of code was just mid-layering because of incorrect assignment who should do what. And finally comments explaining what happens instead of why. Just re-write it from scratch, that should at least fix some of the hangs we are seeing. Use RCU for the eviction fence pointer in the manager, the spinlock usage was mostly incorrect as well. Then finally remove all the nonsense checks and actually add them in the correct locations. v2: some typo fixes and cleanups suggested by Sunil Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-17 17:46:13 -04:00
Sunil Khatri	f802f7b0bc	drm/amdgpu/userq: unlock cancel_delayed_work_sync for hang_detect_work cancel_delayed_work_sync for work hang_detect_work should not be locked since the amdgpu_userq_hang_detect_work also needs the same mutex and when they run together it could be a deadlock. We do not need to hold the mutex for cancel_delayed_work_sync(&queue->hang_detect_work). With this in place if cancel and worker thread run at same time they will not deadlock. Due to any failures, if there is a hang detect and reset, there is a deadlock scenario between cancel and running the main thread. [ 243.118276] task:kworker/9:0 state:D stack:0 pid:73 tgid:73 ppid:2 task_flags:0x4208060 flags:0x00080000 [ 243.118283] Workqueue: events amdgpu_userq_hang_detect_work [amdgpu] [ 243.118636] Call Trace: [ 243.118639] <TASK> [ 243.118644] __schedule+0x581/0x1810 [ 243.118649] ? srso_return_thunk+0x5/0x5f [ 243.118656] ? srso_return_thunk+0x5/0x5f [ 243.118659] ? wake_up_process+0x15/0x20 [ 243.118665] schedule+0x64/0xe0 [ 243.118668] schedule_preempt_disabled+0x15/0x30 [ 243.118671] __mutex_lock+0x346/0x950 [ 243.118677] __mutex_lock_slowpath+0x13/0x20 [ 243.118681] mutex_lock+0x2c/0x40 [ 243.118684] amdgpu_userq_hang_detect_work+0x63/0x90 [amdgpu] [ 243.118888] process_scheduled_works+0x1f0/0x450 [ 243.118894] worker_thread+0x27f/0x370 [ 243.118899] kthread+0x1ed/0x210 [ 243.118903] ? __pfx_worker_thread+0x10/0x10 [ 243.118906] ? srso_return_thunk+0x5/0x5f [ 243.118909] ? __pfx_kthread+0x10/0x10 [ 243.118913] ret_from_fork+0x10f/0x1b0 [ 243.118916] ? __pfx_kthread+0x10/0x10 [ 243.118920] ret_from_fork_asm+0x1a/0x30 Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-17 10:42:39 -04:00
Christian König	98dc529a27	drm/amdgpu: fix amdgpu_userq_evict Canceling the resume worker synchronously can deadlock because it can in turn wait for the eviction worker through the userq_mutex. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-17 10:33:39 -04:00
Sunil Khatri	3fd20c149e	Revert "drm/amdgpu: revert to old status lock handling v4" This reverts commit `7a9419ab42`. Reverting due to some of the probable issues caused by this change and CI is blocked. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-17 10:28:47 -04:00
Christian König	7a9419ab42	drm/amdgpu: revert to old status lock handling v4 It turned out that protecting the status of each bo_va with a spinlock was just hiding problems instead of solving them. Revert the whole approach, add a separate stats_lock and lockdep assertions that the correct reservation lock is held all over the place. This not only allows for better checks if a state transition is properly protected by a lock, but also switching back to using list macros to iterate over the state of lists protected by the dma_resv lock of the root PD. v2: re-add missing check v3: split into two patches v4: re-apply by fixing holding the VM lock at the right places. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-11 13:58:08 -04:00
Sunil Khatri	eb2e7f20c1	drm/amdgpu: push userq debugfs function in amdgpu_debugfs files Debugfs files for amdgpu are better to be handled in the dedicated amdgpu_debugfs.c/.h files. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-06 16:34:19 -05:00
Sunil Khatri	a07930e4db	drm/amdgpu/userq: declutter the code with goto Clean up the amdgpu_userq_create function clean up in failure condition using goto method. This avoid replication of cleanup for every failure condition. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-06 16:34:15 -05:00
Sunil Khatri	28cacaace5	drm/amdgpu/userq: defer queue publication until create completes The userq create path publishes queues to global xarrays such as userq_doorbell_xa and userq_xa before creation was fully complete. Later on if create queue fails, teardown could free an already visible queue, opening a UAF race with concurrent queue walkers. Also calling amdgpu_userq_put in such cases complicates the cleanup. Solution is to defer queue publication until create succeeds and no partially initialized queue is exposed. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-06 16:34:06 -05:00
Sunil Khatri	a978ed3d64	drm/amdgpu/userq: remove queue from doorbell xa during clean up If function amdgpu_userq_map_helper fails we do need to clean up and remove the queue from the userq_doorbell_xa. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-06 16:25:52 -05:00
Sunil Khatri	f0e46fd06c	drm/amdgpu/userq: remove queue from doorbell xarray In case of failure in xa_alloc, remove the queue during clean up from the userq_doorbell_xa. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-06 16:25:50 -05:00
Sunil Khatri	4952189b28	drm/amdgpu/userq: refcount userqueues to avoid any race conditions To avoid race condition and avoid UAF cases, implement kref based queues and protect the below operations using xa lock a. Getting a queue from xarray b. Increment/Decrement its refcount Every time someone wants to access a queue, always get via amdgpu_userq_get to make sure we have locks in place and get the object if active. A userqueue is destroyed when the last refcount is dropped which typically would be via IOCTL or during fini. v2: Add the missing drop in one of the conditions in the signal ioctl [Alex] v3: remove the queue from the xarray first in the free queue ioctl path [Christian] - Pass queue to the amdgpu_userq_put directly. - make amdgpu_userq_put xa_lock free since we are doing put for each get only and final put is done via destroy and we remove the queue from xa with lock. - use userq_put in fini too so cleanup is done fully. v4: Use xa_erase directly rather than doing load and erase in free ioctl. Also remove some of the error logs which could be exploited by the user to flood the logs [Christian] Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-04 11:50:56 -05:00


108 Commits