Well mostly the same issues the other code had as well:
1. Memory allocation while holding the userq_mutex lock is forbidden!
2. Things were created/started/published in the wrong order.
3. The reset lock was taken in the wrong order and seems to be
unecessary in the first place.
4. Error messages on invalid input parameters can spam the logs.
5. Error messages on memory allocation failures are usually superflous
as well.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 89e50de5654dbe7a137e03d78629542e17ba7202)
Multiple queues needs these bo_va objects belonging to
the same uq_mgr. So once they are mapped lets not unmap
them as at any point of time any of the queues might be
using it.
Also userq_va_mapped should be a boolean than atomic.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 5c02889ea22575c3bcfdf212e65fac316cbc6c6a)
va_cursor struct needs to be cleaned even if the mapping
has been removed already.
Also simplify it by make it a void function as return value
check isn't needed as its called during tear down.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 4d35a45c9b4c1ac5b6e3219f83c3db706b675fa2)
Pagefaults does not have process information correctly populated
as vm->task is not set during vm_init but should be updated while
real submission. So setting that up during signal_ioctl to get
the correct submission process details.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a9b14d88b4d83e21ab965f23d1fb7b07b87e0517)
While tear down of a userq_mgr is happening when all the queues
are free we should cancel any reset work if pending before exiting.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 160164609f71f774c4f661227a9b7a370a86b112)
It is illegal to schedule reset work from another reset work!
Fix this by scheduling the userq reset work directly on the work queue
of the reset domain.
Not fully tested, I leave that to the IGT test cases.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit fd9200ccefab94f27877d1943761d6b0ccbd89c8)
mqd and fw objects are queue core objects which should remain
valid and never be unmapped and evicted for user queues to work
properly.
During eviction if these buffers are evicted the hw continue to
use the invalid addresses and caused page faults and system hung.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a3bbf32a336939a1d21b9561f8e53333b684b7ef)
Fix lock inversions pointed out by Prike and Sunil. The hang detection
timeout *CAN'T* grab locks under which we wait for fences, especially
not the userq_mutex lock.
Then instead of this completely broken handling with the
hang_detect_fence just cancel the work when fences are processed and
re-start if necessary.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 1b62077f045ac6ffde7c97005c6659569ac5c1ec)
Well the reset handling seems broken on multiple levels.
As first step of fixing this remove most calls to the hang detection.
That function should only be called after we run into a timeout! And *NOT*
as random check spread over the code in multiple places.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 71bea36b54ccfb14cbc90f94267af6369af4e702)
This one was fortunately not looking so bad as the wait ioctl path, but
there were still a few things which could be fixed/improved:
1. Allocating with GFP_ATOMIC was quite unnecessary, we can do that
before taking the userq_lock.
2. Use a new mutex as protection for the fence_drv_xa so that we can do
memory allocations while holding it.
3. Starting the reset timer is unnecessary when the fence is already
signaled when we create it.
4. Cleanup error handling, avoid trying to free the queue when we don't
even got one.
v2: fix incorrect usage of xa_find, destroy the new mutex on error
v3: cleanup ref ordering
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 1609eb0f81a609d350169839128cecf298c84e7a)
The purpose of a GPU reset is to make sure that fence can be signaled
again and the signal and resume workers can make progress again.
So waiting for the resume worker or any fence in the GPU reset path is
just utterly nonsense.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit fcd5f065eab46993af43442fd77ee8d9eb9c5bdf)
amdgpu_userq_unmap_helper() already handles the unmap error case.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 66cb6579990b633ccc7300c27011d837b9a58da0)
Move more code into a common userq function.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 12f52fab11500d0dce7d23c71909eaf0cf9aa701)
When new_addition is true, amdgpu_userq_vm_validate() calls
drm_exec_fini(&exec) before iterating over the collected HMM ranges and
calling amdgpu_ttm_tt_get_user_pages().
If amdgpu_ttm_tt_get_user_pages() fails in that path, the code jumps to
unlock_all and calls drm_exec_fini(&exec) a second time on the same
exec object. drm_exec_fini() is not idempotent: it frees exec->objects
and may also drop exec->contended and finalize the ww acquire context.
Route that error path directly to the range cleanup once exec has
already been finalized.
Fixes: 42f1487884 ("drm/amdgpu/userqueue: validate userptrs for userqueues")
Issue found using a prototype static analysis tool
and confirmed by code review.
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Hongyan Xu <getshell@seu.edu.cn>
Signed-off-by: Slavin Liu <220245772@seu.edu.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 2802952e4a07306da6ebe813ff1acacc5691851a)
In amdgpu_userq_destroy once unmap_helpder is called within mutex
there is no need to hold mutex.
This helps in avoiding a deadlock between doorbell and wptr ww mutex
and we could unpin and unref these bos outside mutex safely.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Use pm_runtime_resume_and_get instead of pm_runtime_get_sync as it
return error but put the reference in the function itself.
In goto statements we need to drop the pm reference too.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
We check for return value of amdgpu_userq_unmap_helper and
compare it against the queue->state which is logically
wrong and we should just check for failure and do the needfull.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Unmap the queue after freeing doorbell and wptr memory is completely
wrong. Any operation on the queue needs the doorbell and wptr to be
valid and hence fixing the ordering.
Also since we are using amdgpu_bo_reserve in non interruptrable mode
so there is no need to check for its return values.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Caller should hold the reservation lock for root.bo in func
amdgpu_userq_input_va_validate.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In function amdgpu_userq_buffer_vas_list_cleanup, remove the
reservation lock for vm and caller should make sure it's taken
before locking userq_mutex.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Reshuffle the code to run create_mqd outside the mutex.
code here is mostly setting up software structure init
before actually registering the userqueue in the xa and
to the driver.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Do not hold reservation lock for root bo if userq_mutex
is already held in the call flow this cause a lock
issue with ttm_bo_delayed_delete.
Its better to lock the vm->root.bo first and then go ahead
with userq_mutex so userq_mutex threads dont get stuck until
the reservation lock is held.
In this case it helps in the function amdgpu_userq_buffer_vas_mapped
for each queue during restore_all.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Move the comment for the caller to the definition for
amdgpu_userq_ensure_ev_fence in kerneldoc format.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If the queue creation failed during mapping of the important VA's
like queue_va, rptr_va and wptr_va. These needs to be cleaned
as queue destroy will not be called for such queues as user never
get call to creation failure.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Reorganise code to avoid holding mutex userq_mutex while
also trying to grab exec lock ww_mutex where its not needed
for function amdgpu_userq_input_va_validate
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
amdgpu_userq_fence_driver_free() is now responsible only for releasing
per-queue ancillary state (last_fence, fence_drv_xa) and no longer
touches the ownership reference, making each function's contract clear.
v2: Get the userq fence driver from amdgpu_userq_fence_driver_alloc()
directly and dropping the userq fence driver reference after removing
userq_doorbell_xa entry.(Christian)
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In function amdgpu_userq_wait_for_last_fence use
dma_fence_wait to wait infinitely.
Also there is no need to print error as we wont be
timing out anymore.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In function amdgpu_userq_gem_va_unmap_validate call
dma_resv_wait_timeout directly. Also since we are waiting
forever we should not be having any return value and hence
no handling needed.
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In function amdgpu_userq_restore
a. amdgpu_userq_vm_validate: add return code in error condition
b. amdgpu_userq_restore_all: It already prints the error log, just
update the erorr log in the function and remove it from caller.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
wait for infinite time for fences in function amdgpu_userq_wait_for_signal
and for that use dma_fence_wait(f, false);
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Function of amdgpu_userq_evict function do not need to check
for return values as we dont use them and no need to log errors
as we are already logging in called functions.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fix the code alignment for if condition and also provide
a line space between multiline if condition and next
statement.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
amdgpu_userq_vm_validate function does not need userq_mutex and exec
lock is good enough to locking all bos and updating the eviction fence.
Also since we only need userq_mutex for amdgpu_userq_restore_all
so move the locks in the function itself.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
amdgpu_userq_get_doorbell_index() passes the user-provided
doorbell_offset to amdgpu_doorbell_index_on_bar() without bounds
checking. An arbitrarily large doorbell_offset can cause the
calculated doorbell index to fall outside the allocated doorbell BO,
potentially corrupting kernel doorbell space.
Validate that doorbell_offset falls within the doorbell BO before
computing the BAR index, using u64 arithmetic to prevent overflow.
Fixes: f09c1e6077 ("drm/amdgpu: generate doorbell index for userqueue")
Reported-by: Yuhao Jiang <danisjiang@gmail.com>
Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Reorganise the amdgpu_eviction_fence_suspend_worker code so
schedule_delayed_work is the last thing we do after amdgpu_userq_evict
is complete and the eviction fence is signalled.
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In function amdgpu_userq_restore_worker we dont need to use
goto as we already in the end of function and it will exit
naturally.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
amdgpu_userq_put/get are not needed in case we already holding
the userq_mutex and reference is valid already from queue create
time or from signal ioctl. These additional get/put could be a
potential reason for deadlock in case the ref count reaches zero
and destroy is called which again try to take the userq_mutex.
Due to the above change we avoid deadlock between suspend/restore
calling destroy queues trying to take userq_mutex again.
Cc: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
amdgpu_userq_hang_detect_work() retrieves the queue pointer using
container_of() from the embedded work item.
Since the work structure is part of struct amdgpu_usermode_queue,
the returned queue pointer cannot be NULL in normal execution.
Remove the redundant !queue check and keep the validation for
queue->userq_mgr.
Fixes the below:
drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c:159 amdgpu_userq_hang_detect_work() warn: can 'queue' even be NULL?
Fixes: 290f46cf57 ("drm/amdgpu: Implement user queue reset functionality")
Cc: Jesse Zhang <Jesse.Zhang@amd.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Acked-by: Jesse Zhang <jesse.zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
That is a really complicated dance and wasn't implemented fully correct.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Well that was broken on multiple levels.
First of all a lot of checks were placed at incorrect locations, especially if
the resume worker should run or not.
Then a bunch of code was just mid-layering because of incorrect assignment who
should do what.
And finally comments explaining what happens instead of why.
Just re-write it from scratch, that should at least fix some of the hangs we
are seeing.
Use RCU for the eviction fence pointer in the manager, the spinlock usage was
mostly incorrect as well. Then finally remove all the nonsense checks and
actually add them in the correct locations.
v2: some typo fixes and cleanups suggested by Sunil
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
cancel_delayed_work_sync for work hand_detect_work should not be
locked since the amdgpu_userq_hang_detect_work also need the same
mutex and when they run together it could be a deadlock.
we do not need to hold the mutex for
cancel_delayed_work_sync(&queue->hang_detect_work). With this in place
if cancel and worker thread run at same time they will not deadlock.
Due to any failures if there is a hand detect and reset that there a
deadlock scenarios between cancel and running the main thread.
[ 243.118276] task:kworker/9:0 state:D stack:0 pid:73 tgid:73 ppid:2 task_flags:0x4208060 flags:0x00080000
[ 243.118283] Workqueue: events amdgpu_userq_hang_detect_work [amdgpu]
[ 243.118636] Call Trace:
[ 243.118639] <TASK>
[ 243.118644] __schedule+0x581/0x1810
[ 243.118649] ? srso_return_thunk+0x5/0x5f
[ 243.118656] ? srso_return_thunk+0x5/0x5f
[ 243.118659] ? wake_up_process+0x15/0x20
[ 243.118665] schedule+0x64/0xe0
[ 243.118668] schedule_preempt_disabled+0x15/0x30
[ 243.118671] __mutex_lock+0x346/0x950
[ 243.118677] __mutex_lock_slowpath+0x13/0x20
[ 243.118681] mutex_lock+0x2c/0x40
[ 243.118684] amdgpu_userq_hang_detect_work+0x63/0x90 [amdgpu]
[ 243.118888] process_scheduled_works+0x1f0/0x450
[ 243.118894] worker_thread+0x27f/0x370
[ 243.118899] kthread+0x1ed/0x210
[ 243.118903] ? __pfx_worker_thread+0x10/0x10
[ 243.118906] ? srso_return_thunk+0x5/0x5f
[ 243.118909] ? __pfx_kthread+0x10/0x10
[ 243.118913] ret_from_fork+0x10f/0x1b0
[ 243.118916] ? __pfx_kthread+0x10/0x10
[ 243.118920] ret_from_fork_asm+0x1a/0x30
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Canceling the resume worker synchonized can deadlock because it can in
turn wait for the eviction worker through the userq_mutex.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This reverts commit 7a9419ab42.
Reverting due to some of the probable issues caused by this change
and CI is blocked.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
It turned out that protecting the status of each bo_va with a
spinlock was just hiding problems instead of solving them.
Revert the whole approach, add a separate stats_lock and lockdep
assertions that the correct reservation lock is held all over the place.
This not only allows for better checks if a state transition is properly
protected by a lock, but also switching back to using list macros to
iterate over the state of lists protected by the dma_resv lock of the
root PD.
v2: re-add missing check
v3: split into two patches
v4: re-apply by fixing holding the VM lock at the right places.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Debugfs files for amdgpu are better to be handled in the dedicated
amdgpu_debugfs.c/.h files.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Clean up the amdgpu_userq_create function clean up in
failure condition using goto method. This avoid replication
of cleanup for every failure condition.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The userq create path publishes queues to global xarrays such as
userq_doorbell_xa and userq_xa before creation was fully complete.
Later on if create queue fails, teardown could free an already
visible queue, opening a UAF race with concurrent queue walkers.
Also calling amdgpu_userq_put in such cases complicates the cleanup.
Solution is to defer queue publication until create succeeds and no
partially initialized queue is exposed.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If function amdgpu_userq_map_helper fails we do need to clean
up and remove the queue from the userq_doorbell_xa.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In case of failure in xa_alloc, remove the queue during
clean up from the userq_doorbell_xa.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
To avoid race condition and avoid UAF cases, implement kref
based queues and protect the below operations using xa lock
a. Getting a queue from xarray
b. Increment/Decrement it's refcount
Every time some one want to access a queue, always get via
amdgpu_userq_get to make sure we have locks in place and get
the object if active.
A userqueue is destroyed on the last refcount is dropped which
typically would be via IOCTL or during fini.
v2: Add the missing drop in one the condition in the signal ioclt [Alex]
v3: remove the queue from the xarray first in the free queue ioctl path
[Christian]
- Pass queue to the amdgpu_userq_put directly.
- make amdgpu_userq_put xa_lock free since we are doing put for each get
only and final put is done via destroy and we remove the queue from xa
with lock.
- use userq_put in fini too so cleanup is done fully.
v4: Use xa_erase directly rather than doing load and erase in free
ioctl. Also remove some of the error logs which could be exploited
by the user to flood the logs [Christian]
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>