linux/drivers/gpu/drm/amd/amdkfd
Dennis Li df9c8d1aa2 drm/amdgpu: fix system hang issue during GPU reset
when GPU hang, driver has multi-paths to enter amdgpu_device_gpu_recover,
the atomic adev->in_gpu_reset and hive->in_reset are used to avoid
re-entering GPU recovery.

During GPU reset and resume, it is unsafe that other threads access GPU,
which maybe cause GPU reset failed. Therefore the new rw_semaphore
adev->reset_sem is introduced, which protect GPU from being accessed by
external threads during recovery.

v2:
1. add rwlock for some ioctls, debugfs and file-close function.
2. change to use dqm->is_resetting and dqm_lock for protection in kfd
driver.
3. remove try_lock and change adev->in_gpu_reset as atomic, to avoid
re-enter GPU recovery for the same GPU hang.

v3:
1. change back to use adev->reset_sem to protect kfd callback
functions, because dqm_lock couldn't protect all codes, for example:
free_mqd must be called outside of dqm_lock;

[ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
[ 1230.177221] Call Trace:
[ 1230.178249]  dump_stack+0x98/0xd5
[ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
[ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
[ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
[ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
[ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
[ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
[ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
[ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
[ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
[ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
[ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
[ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
[ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
[ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
[ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
[ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
[ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
[ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
[ 1230.202831]  ksys_ioctl+0x98/0xb0
[ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
[ 1230.205174]  do_syscall_64+0x5f/0x250
[ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

2. remove try_lock and introduce atomic hive->in_reset, to avoid
re-enter GPU recovery.

v4:
1. remove an unnecessary whitespace change in kfd_chardev.c
2. remove comment codes in amdgpu_device.c
3. add more detailed comment in commit message
4. define a wrap function amdgpu_in_reset

v5:
1. Fix some style issues.

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com>
Suggested-by: Luben Tukov <luben.tuikov@amd.com>
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-27 16:21:37 -04:00
..
cik_event_interrupt.c drm/amdkfd: Provide SMI events watch 2020-07-15 13:27:34 -04:00
cik_int.h drm/amdkfd: Clean up reference of radeon 2018-07-11 22:33:08 -04:00
cik_regs.h drm/amdkfd: Delete a duplicate statement in set_pasid_vmid_mapping() 2018-11-05 14:21:13 -05:00
cwsr_trap_handler_gfx8.asm drm/amdkfd: Remove dead code from gfx8/gfx9 trap handlers 2019-07-30 23:22:18 -05:00
cwsr_trap_handler_gfx9.asm drm/amdkfd: Remove dead code from gfx8/gfx9 trap handlers 2019-07-30 23:22:18 -05:00
cwsr_trap_handler_gfx10.asm drm/amdkfd: Unify gfx9/gfx10 context save area layouts 2020-07-27 16:20:47 -04:00
cwsr_trap_handler.h drm/amdkfd: Unify gfx9/gfx10 context save area layouts 2020-07-27 16:20:47 -04:00
Kconfig drm/amdgpu: fix license on Kconfig and Makefiles 2019-12-11 15:22:08 -05:00
kfd_chardev.c drm/amdkfd: Provide SMI events watch 2020-07-15 13:27:34 -04:00
kfd_crat.c drm/amdkfd: Support navy_flounder KFD 2020-07-15 12:46:55 -04:00
kfd_crat.h drm/amdkfd: Adjust weight to represent num_hops info when report xgmi iolink 2019-05-24 12:20:48 -05:00
kfd_dbgdev.c drm/amdkfd: Eliminate unnecessary kernel queue function pointers 2019-12-05 16:24:36 -05:00
kfd_dbgdev.h drm/amdkfd: Clean up reference of radeon 2018-07-11 22:33:08 -04:00
kfd_dbgmgr.c drm/amdkfd: Use hex print format for pasid 2019-10-03 09:11:03 -05:00
kfd_dbgmgr.h drm/amdkfd: Clean up KFD style errors and warnings v2 2017-08-15 23:00:04 -04:00
kfd_debugfs.c drm/amdkfd: Fix permissions of hang_hws 2020-01-07 11:54:30 -05:00
kfd_device_queue_manager_cik.c drm/amdkfd: Introduce asic-specific mqd_manager_init function 2019-05-24 12:21:02 -05:00
kfd_device_queue_manager_v9.c drm/amdkfd: Consistently apply noretry setting 2019-07-16 13:02:55 -05:00
kfd_device_queue_manager_v10.c drm/amdkfd: Add navi10 support to amdkfd. (v3) 2019-06-21 18:59:24 -05:00
kfd_device_queue_manager_vi.c drm/amdkfd: Introduce asic-specific mqd_manager_init function 2019-05-24 12:21:02 -05:00
kfd_device_queue_manager.c drm/amdgpu: fix system hang issue during GPU reset 2020-07-27 16:21:37 -04:00
kfd_device_queue_manager.h drm/amdkfd: Fix circular locking dependency warning 2020-07-01 01:59:27 -04:00
kfd_device.c drm/amdkfd: Provide SMI events watch 2020-07-15 13:27:34 -04:00
kfd_doorbell.c drm/amdkfd: Use better name to indicate the offset is in dwords 2019-11-13 15:29:45 -05:00
kfd_events.c mmap locking API: use coccinelle to convert mmap_sem rwsem call sites 2020-06-09 09:39:14 -07:00
kfd_events.h drm/amdkfd: Implement GPU reset handlers in KFD 2018-07-11 22:32:56 -04:00
kfd_flat_memory.c drm/amdkfd: Support navy_flounder KFD 2020-07-15 12:46:55 -04:00
kfd_int_process_v9.c drm/amdkfd: Provide SMI events watch 2020-07-15 13:27:34 -04:00
kfd_interrupt.c drm/amdkfd: fix a potential NULL pointer dereference (v2) 2019-10-03 09:11:00 -05:00
kfd_iommu.c drm/amdkfd: report the real PCI bus number 2020-05-21 12:48:42 -04:00
kfd_iommu.h drm/amdkfd: Centralize IOMMUv2 code and make it conditional 2017-12-08 19:22:12 -05:00
kfd_kernel_queue.c drm/amdkfd: Enable over-subscription with >1 GWS queue 2020-04-28 16:20:30 -04:00
kfd_kernel_queue.h drm/amdkfd: Eliminate unnecessary kernel queue function pointers 2019-12-05 16:24:36 -05:00
kfd_module.c drm/amdkfd: add missing void argument to function kgd2kfd_init 2019-10-07 15:10:26 -05:00
kfd_mqd_manager_cik.c drm/amdkfd: DIQ should not use HIQ way to allocate memory 2019-11-22 14:27:11 -05:00
kfd_mqd_manager_v9.c drm/amdkfd: Update hardware scheduling time quanta 2020-07-02 12:02:55 -04:00
kfd_mqd_manager_v10.c drm/amdkfd: Update hardware scheduling time quanta 2020-07-02 12:02:55 -04:00
kfd_mqd_manager_vi.c drm/amdkfd: Update hardware scheduling time quanta 2020-07-02 12:02:55 -04:00
kfd_mqd_manager.c drm/amdkfd: Extend CU mask to 8 SEs (v3) 2019-08-02 10:19:11 -05:00
kfd_mqd_manager.h drm/amdkfd: Extend CU mask to 8 SEs (v3) 2019-08-02 10:19:11 -05:00
kfd_packet_manager_v9.c drm/amdkfd: Update hardware scheduling time quanta 2020-07-02 12:02:55 -04:00
kfd_packet_manager_vi.c drm/amdkfd: Update hardware scheduling time quanta 2020-07-02 12:02:55 -04:00
kfd_packet_manager.c drm/amdkfd: Support navy_flounder KFD 2020-07-15 12:46:55 -04:00
kfd_pasid.c drm/amdkfd: Remove redundant kfd2kgd interface lookup 2020-07-08 09:02:54 -04:00
kfd_pm4_headers_ai.h drm/amdkfd: Support bigger gds size 2019-07-18 14:18:03 -05:00
kfd_pm4_headers_diq.h
kfd_pm4_headers_vi.h drm/amdkfd: Delete alloc_format field from map_queue struct 2019-05-24 12:21:03 -05:00
kfd_pm4_headers.h drm/amdkfd: Update PM4 packet headers 2017-08-15 23:00:15 -04:00
kfd_pm4_opcodes.h
kfd_priv.h drm/amdkfd: Provide SMI events watch 2020-07-15 13:27:34 -04:00
kfd_process_queue_manager.c drm/amdkfd: New IOCTL to allocate queue GWS (v2) 2020-04-28 16:20:30 -04:00
kfd_process.c drm/amdgpu: fix system hang issue during GPU reset 2020-07-27 16:21:37 -04:00
kfd_queue.c drm/amdkfd: use %px to print user space address instead of %p 2018-05-01 17:56:04 -04:00
kfd_smi_events.c drm/amd/amdkfd: Fix large framesize for kfd_smi_ev_read() 2020-07-15 13:27:34 -04:00
kfd_smi_events.h drm/amdkfd: Provide SMI events watch 2020-07-15 13:27:34 -04:00
kfd_topology.c drm/amdkfd: Support navy_flounder KFD 2020-07-15 12:46:55 -04:00
kfd_topology.h drm/amdkfd: Report domain with topology 2020-05-01 09:59:51 -04:00
Makefile drm/amdkfd: Provide SMI events watch 2020-07-15 13:27:34 -04:00
soc15_int.h drm/amdkfd: Add SOC15 interrupt processing support 2018-04-10 17:33:10 -04:00