linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-30 18:13:41 +02:00

Author	SHA1	Message	Date
Kent Russell	b56922fc37	drm/amdgpu: Only send RMA CPER when threshold is exceeded According to our documentation, the RMA should only occur when the threshold has been exceeded, not met. Fixes: `5028a24aa8` ("drm/amdgpu: Send applicable RMA CPERs at end of RAS init") Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 8bc09a7d0e90ec45a0b4865661cf45cbbce1c3d7)	2026-04-24 11:10:31 -04:00
Gangliang Xie	5c36fd7fc6	drm/amdgpu: reset ras eeprom table when it is invalid reset ras eeprom table when it is invalid Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-30 14:37:27 -04:00
Tao Zhou	3d77ca68eb	drm/amdgpu: clear related counter after RAS eeprom reset Make eeprom data and its counter consistent. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-02 16:42:35 -05:00
Gangliang Xie	044f8d3b1f	drm/amdgpu: return when ras table checksum is error end the function flow when ras table checksum is error Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-12 15:21:56 -05:00
Kent Russell	5028a24aa8	drm/amdgpu: Send applicable RMA CPERs at end of RAS init Firmware and monitoring tools may not be ready to receive a CPER when we read the bad pages, so send the CPERs at the end of RAS initialization to ensure that the FW is ready to receive and process the CPER. This removes the previous CPER submission that was added during bad page load, and sends both in-band and out-of-band at the same time. Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-05 17:28:34 -05:00
Kent Russell	e0d11bdb29	drm/amdgpu: Send RMA CPER at bad page loading Some older builds weren't sending RMA CPERs when the bad page threshold was exceeded. Newer builds have resolved this, but there could be systems out there with bad page numbers higher than the threshold, that haven't sent out an RMA CPER. To be thorough and safe, send an RMA CPER when we load the table, if the threshold is met or exceeded, instead of waiting for the next UE to trigger the CPER. Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-27 18:13:24 -05:00
Tao Zhou	7fb41ab3c9	drm/amdgpu: optimize timeout implemention in ras_eeprom_update_record_num The busy status returned by ras_eeprom_update_record_num may not be an error, increase timeout to exclude false busy status. Also add more comments to make the code readable. v2: define a macro for the timeout value. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-11 21:54:14 -05:00
Tao Zhou	eed3015274	drm/amdgpu: add RAS bad page threshold handling for PMFW manages eeprom Check if bad page threshold is reached and take actions accordingly. v2: remove rma message sent to smu when pmfw manages eeprom. v3: add null pointer check for con. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-11 21:54:14 -05:00
Tao Zhou	334b27bf71	drm/amdgpu: try for more times if RAS bad page number is not updated RAS info update in PMFW is time cost, wait for it. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-11 21:54:14 -05:00
Tao Zhou	e84835940e	drm/amdgpu: get RAS bad page address from MCA address Instead of from physical address. v2: add comment to make the code more readable Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-11 21:54:14 -05:00
Tao Zhou	541414065c	drm/amdgpu: skip writing eeprom when PMFW manages RAS data Only update bad page number in legacy eeprom write path. v2: add null pointer check for con. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-06 10:02:15 -05:00
Tao Zhou	7f34ddf77d	drm/amdgpu: add ras_eeprom_read_idx interface PMFW will manage RAS eeprom data by itself, add new interface to read eeprom data via PMFW, we can read part of records by setting index. v2: use IPID parse interface. pa is not used and set it to a fixed value. v3: optimize the null pointer check for IPID parse interface. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-06 09:58:35 -05:00
Gangliang Xie	1349b31313	drm/amdgpu: initialize max record count after table reset initialize max record count and record offset after table reset Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-06 09:56:22 -05:00
Gangliang Xie	cd5b28a040	drm/amdgpu: add check function for pmfw eeprom add check function for pmfw eeprom Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-06 09:56:15 -05:00
Gangliang Xie	19c815d516	drm/amdgpu: add initialization function for pmfw eeprom add initialization function for pmfw eeprom Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-06 09:56:04 -05:00
Gangliang Xie	9ce015e5fd	drm/amdgpu: adapt reset function for pmfw eeprom adapt reset function for pmfw eeprom Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-06 09:55:58 -05:00
Gangliang Xie	d4432f16d3	drm/amdgpu: add wrapper functions for pmfw eeprom interface add wrapper functions for pmfw eeprom interface, for these interfaces to be easily and safely called Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-04 11:53:58 -05:00
Gangliang Xie	f6cdcbd2c0	drm/amdgpu: add function to check if pmfw eeprom is supported add function to check if pmfw is supported, skip eeprom check and recover when pmfw eeprom is supported Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-04 11:53:58 -05:00
Xiang Liu	527e3d4033	drm/amd/ras: Add CPER ring read for uniras Read CPER raw data from debugfs node "/sys/kernel/debug/dri/*/amdgpu_ring_cper". Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-11-04 11:33:54 -05:00
YiPeng Chai	04226ae1bc	drm/amdgpu: Add ras module eeprom safety watermark check Add ras module eeprom safety watermark check. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-10-20 18:19:02 -04:00
Xiang Liu	c8d6e90abe	drm/amdgpu: Notify pmfw bad page threshold exceeded Notify pmfw when bad page threshold is exceeded, no matter the module parameter 'bad_page_threshold' is set or not. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-29 10:07:19 -04:00
Xiang Liu	b3505c2c48	drm/amdgpu: Generate BP threshold exceed CPER once threshold exceeded The bad pages threshold exceed CPER should be generated once threshold exceeded, no matter the bad_page_threshold setted or not. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:21:01 -04:00
Tao Zhou	d45c5e6845	drm/amdgpu: adjust the update of RAS bad page number One eeprom record may not map to unit number of bad pages, the accurate bad page number is gotten after bad page address check. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-07-28 16:40:06 -04:00
ganglxie	660261df61	drm/amdgpu: refine eeprom data check add eeprom data checksum check before driver unload. reset eeprom and save correct data to eeprom when check failed Signed-off-by: ganglxie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-07-15 14:07:53 -04:00
Lijo Lazar	a3e510fd69	drm/amdgpu: Convert from DRM_* to dev_* Convert from generic DRM_* to dev_* calls to have device context info. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-06-30 11:54:55 -04:00
ganglxie	e2d1e96c53	drm/amdgpu: refine usage of amdgpu_bad_page_threshold when amdgpu_bad_page_threshold == -1 or -2, driver will issue a warning message when threshold is reached and continue runtime services. Signed-off-by: ganglxie <ganglxie@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-06-18 12:19:21 -04:00
ganglxie	d0cc8d2b7d	drm/amdgpu: clear pa and mca record counter when resetting eeprom clear pa and mca record counter when resetting eeprom, so that ras_num_bad_pages can be calculated correctly Signed-off-by: ganglxie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-06-18 12:19:15 -04:00
Candice Li	d6b22b1dff	drm/amdgpu: Set RAS EEPROM table version to v3 for umc v12_5 Set RAS EEPROM table version to v3 for umc v12_5. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-04-11 17:00:50 -04:00
Lijo Lazar	6ffc6e056f	drm/amdgpu: Reset RAS table if header is invalid If a valid header is not found during RAS eeprom init, consider it as new and reset RAS table info. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-04-08 16:48:14 -04:00
Lijo Lazar	5df0d6addb	drm/amdgpu: Add basic validation for RAS header If RAS header read from EEPROM is corrupted, it could result in trying to allocate huge memory for reading the records. Add some validation to header fields. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-04-07 15:18:59 -04:00
Candice Li	5762f9dcf7	drm/amdgpu: Add EEPROM I2C address support for smu v13_0_12 Add EEPROM I2C address support for smu v13_0_12. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-03-19 15:56:00 -04:00
Tao Zhou	05d50ea3ea	drm/amdgpu: format old RAS eeprom data into V3 version Clear old data and save it in V3 format. v2: only format eeprom data for new ASICs. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-03-18 14:03:38 -04:00
ganglxie	a4b6e990d7	drm/amdgpu: Save PA of bad pages for old asics for old asics that do not support mca translating, we just save PA for them Signed-off-by: ganglxie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-03-13 23:13:08 -04:00
Tao Zhou	334dc5fcc3	drm/amdgpu: increase RAS bad page threshold For default policy, driver will issue an RMA event when the number of bad pages is greater than 8 physical rows, rather than reaches 8 physical rows, don't rely on threshold configurable parameters in default mode. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-03-07 12:53:39 -05:00
ganglxie	a8f921a10a	drm/amdgpu: Change page/record number calculation based on nps save only one record to save eeprom space,and bad_page_num = pa_rec_num + mca_rec_num*16 Signed-off-by: ganglxie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-25 11:45:12 -05:00
ganglxie	f2510355fb	drm/amdgpu: Save nps to eeprom nps info saved together with bad page makes bad page parsing more efficient Signed-off-by: ganglxie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-25 11:45:12 -05:00
Lijo Lazar	6e8ca38ebc	drm/amdgpu: Add flag to make VBIOS read optional Certain SOCs may not need much data from VBIOS. Some data like VBIOS version used will be missed but it doesn't affect functionality. Add a flag to make VBIOS image optional. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:04:08 -05:00
Hawking Zhang	16b85a0942	drm/amdgpu: Update usage for bad page threshold The driver's behavior varies based on the configuration of amdgpu_bad_page_threshold setting Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-02-12 21:02:59 -05:00
Dheeraj Reddy Jonnalagadda	69b54d7c7c	drm/amdgpu: simplify return statement in amdgpu_ras_eeprom_init Remove the logically dead code in the last return statement of amdgpu_ras_eeprom_init. The condition res < 0 is redundant since res is already checked for a negative value earlier. Replace return res < 0 ? res : 0; with return 0 to improve clarity. Fixes: `63d4c081a5` ("drm/amdgpu: Optimize EEPROM RAS table I/O") Closes: https://scan7.scan.coverity.com/#/project-view/52337/11354?selectedIssue=1602413 Signed-off-by: Dheeraj Reddy Jonnalagadda <dheeraj.linuxdev@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-12-18 12:16:05 -05:00
Tao Zhou	ae756cd853	drm/amdgpu: correct the calculation of RAS bad page After the introduction of NPS RAS, one bad page record on eeprom may be related to 1 or 16 bad pages, so the bad page record and bad page are two different concepts, define a new variable to store bad page number. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-12-10 10:26:51 -05:00
Tao Zhou	1f06e7f344	drm/amdgpu: split ras_eeprom_init into init and check functions Init function is for ras table header read and check function is responsible for the validation of the header. Call them in different stages. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-12-10 10:26:51 -05:00
Jinzhou Su	9db3aed8ea	drm/amdgpu: return error when eeprom checksum failed Return eeprom table checksum error result, otherwise it might be overwritten by next call. V2: replace DRM_ERROR with dev_err Signed-off-by: Jinzhou Su <jinzhou.su@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-12-10 10:26:50 -05:00
Tao Zhou	2206daa1f9	drm/amdgpu: add a flag to indicate UMC channel index version v1 (legacy way): store channel index within a UMC instance in eeprom v2: store global channel index in eeprom V2: only save the flag on eeprom, clear it after saving. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-12-10 10:26:46 -05:00
Andrew Kreimer	c400ec6990	drm/amdgpu: Fix a typo Fix a typo in comments. Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Kreimer <algonell@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-09-18 16:14:26 -04:00
Stanley.Yang	1a8825259a	drm/amdgpu: Fix eeprom max record count The eeprom table is empty before initializing, set eeprom table version first before initializing. Changed from V1: Reuse amdgpu_ras_set_eeprom_table_version function Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `015b8a2fdf`)	2024-07-24 17:30:23 -04:00
Tao Zhou	b95fa494d6	drm/amdgpu: add RAS is_rma flag Set the flag to true if bad page number reaches threshold. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-06-05 11:25:14 -04:00
Hawking Zhang	a6bcffa596	drm/amdgpu: Add smu v13_0_14 ip block Add smu v13_0_14 ip block support Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Le Ma <Le.Ma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-05-02 15:49:11 -04:00
Candice Li	f26c4e3fc9	drm/amdgpu: Update setting EEPROM table version Use helper function instead of umc callback to set EEPROM table version. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-03-20 13:38:15 -04:00
Yang Wang	f579c06bdc	drm/amdgpu: send smu rma reason event in ras eeprom driver send smu rma reason event to smu in ras eeprom driver. Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2024-02-12 16:08:27 -05:00
Candice Li	ca0ad76089	drm/amdgpu: Update EEPROM I2C address for smu v13_0_0 Check smu v13_0_0 SKU type to select EEPROM I2C address. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2023-11-29 16:49:23 -05:00

131 Commits