mirror of
https://github.com/torvalds/linux.git
synced 2026-05-26 08:02:27 +02:00
v6.15
1261 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
c1d9dac0db |
vfio/pci: Align huge faults to order
The vfio-pci huge_fault handler doesn't make any attempt to insert a
mapping containing the faulting address, it only inserts mappings if the
faulting address and resulting pfn are aligned. This works in a lot of
cases, particularly in conjunction with QEMU where DMA mappings linearly
fault the mmap. However, there are configurations where we don't get
that linear faulting and pages are faulted on-demand.
The scenario reported in the bug below is such a case, where the physical
address width of the CPU is greater than that of the IOMMU, resulting in a
VM where guest firmware has mapped device MMIO beyond the address width of
the IOMMU. In this configuration, the MMIO is faulted on demand and
tracing indicates that occasionally the faults generate a VM_FAULT_OOM.
Given the use case, this results in a "error: kvm run failed Bad address",
killing the VM.
The host is not under memory pressure in this test, therefore it's
suspected that VM_FAULT_OOM is actually the result of a NULL return from
__pte_offset_map_lock() in the get_locked_pte() path from insert_pfn().
This suggests a potential race inserting a pte concurrent to a pmd, and
maybe indicates some deficiency in the mm layer properly handling such a
case.
Nevertheless, Peter noted the inconsistency of vfio-pci's huge_fault
handler where our mapping granularity depends on the alignment of the
faulting address relative to the order rather than aligning the faulting
address to the order to more consistently insert huge mappings. This
change not only uses the page tables more consistently and efficiently, but
as any fault to an aligned page results in the same mapping, the race
condition suspected in the VM_FAULT_OOM is avoided.
Reported-by: Adolfo <adolfotregosa@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220057
Fixes:
|
||
|
|
2bd42b03ab |
vfio/pci: Virtualize zero INTx PIN if no pdev->irq
Typically pdev->irq is consistent with whether the device itself supports INTx, where device support is reported via the PIN register. Therefore the PIN register is often already zero if pdev->irq is zero. Recently virtualization of the PIN register was expanded to include the case where the device supports INTx but the platform does not route the interrupt. This is reported by a value of IRQ_NOTCONNECTED on some architectures. Other architectures just report zero for pdev->irq. We already disallow INTx setup if pdev->irq is zero, therefore add this to the PIN register virtualization criteria so that a consistent view is provided to userspace through virtualized config space and ioctls. Reported-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Link: https://lore.kernel.org/all/174231895238.2295.12586708771396482526.stgit@linux.ibm.com/ Tested-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Link: https://lore.kernel.org/r/20250320194145.2816379-1-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
3491aa0478 |
VFIO updates for v6.15-rc1
- Relax IGD support code to match display class device rather than
specifically requiring a VGA device. (Tomita Moeko)
- Accelerate DMA mapping of device MMIO by iterating at PMD and PUD
levels to take advantage of huge pfnmap support added in v6.12.
(Alex Williamson)
- Extend virtio vfio-pci variant driver to include migration support
for block devices where enabled by the PF. (Yishai Hadas)
- Virtualize INTx PIN register for devices where the platform does
not route legacy PCI interrupts for the device and the interrupt
is reported as IRQ_NOTCONNECTED. (Alex Williamson)
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmfq5nEbHGFsZXgud2ls
bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsi3KAP/2MQcQaKTZ6/+dG6YdKT
ZFaY4+xJ14DnUN/z96UlIWLk8bWgSyDFxdoFMbtFGENKRslEWxZ7In9Caow7f6ux
7/usBjSvJa5Yx9YWRGsblrx7IyYfSW6R1V+jH3xPd+K8Ir4K7SUvb1CJLVPdfEYh
OWer8eRpZ5tw3R2X4o+QxZ+H4Fx1zVQourW35h4daqrjnn7kOQMJIzGYOwHSDlCy
lW0X0yD3sGgw9w7qAmEDmw9UbKGf245AVylIl5T1a7c3RaO+eKdKPZfNa18g0J/Q
5pRMK+2PvZ+S0OTYxotcF9GtEJ3iBxY8W4QnlLiyTs9XyZ7tLMzGvLEKmCDKA0U8
yAtoJ5T00PVXjMxkZx1+oMGja9Hx+b7gABTYpbf5wRtab6EdNWln++I1HCLKgZZ+
yStvQNsMYGbJsLfwiGouMwD24JT+xg3A+Dv2Cx+Ai4NVJebxTD8Lhc0lz2I6IpOh
wFBpBzBIPpcG53oQ1Syb9GLESQ0Acb4LUMjsSxIg7QFSrWgAAlq/PiLXv852S3xJ
pUEh7r/YByQytUsQajgE7ekKqyXw0gn99Z+UTk0LUIq/y7SxrIPeqzq2qRf490RV
wnkOrMxrAWj84lkIv8hLCiLsXMmvV4rsMJgV+s8KQZ+hPv38mPbkFYGdHj4h5RPA
5J5h32dDkHLK+X5u//gBY8xh
=fJPO
-----END PGP SIGNATURE-----
Merge tag 'vfio-v6.15-rc1' of https://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Relax IGD support code to match display class device rather than
specifically requiring a VGA device (Tomita Moeko)
- Accelerate DMA mapping of device MMIO by iterating at PMD and PUD
levels to take advantage of huge pfnmap support added in v6.12
(Alex Williamson)
- Extend virtio vfio-pci variant driver to include migration support
for block devices where enabled by the PF (Yishai Hadas)
- Virtualize INTx PIN register for devices where the platform does not
route legacy PCI interrupts for the device and the interrupt is
reported as IRQ_NOTCONNECTED (Alex Williamson)
* tag 'vfio-v6.15-rc1' of https://github.com/awilliam/linux-vfio:
vfio/pci: Handle INTx IRQ_NOTCONNECTED
vfio/virtio: Enable support for virtio-block live migration
vfio/type1: Use mapping page mask for pfnmaps
mm: Provide address mask in struct follow_pfnmap_args
vfio/type1: Use consistent types for page counts
vfio/type1: Use vfio_batch for vaddr_get_pfns()
vfio/type1: Convert all vaddr_get_pfns() callers to use vfio_batch
vfio/type1: Catch zero from pin_user_pages_remote()
vfio/pci: match IGD devices in display controller class
|
||
|
|
48552153cf |
iommufd 6.15 merge window pull
Two significant new items:
- Allow reporting IOMMU HW events to userspace when the events are clearly
linked to a device. This is linked to the VIOMMU object and is intended to
be used by a VMM to forward HW events to the virtual machine as part of
emulating a vIOMMU. ARM SMMUv3 is the first driver to use this
mechanism. Like the existing fault events the data is delivered through
a simple FD returning event records on read().
- PASID support in VFIO. "Process Address Space ID" is a PCI feature that
allows the device to tag all PCI DMA operations with an ID. The IOMMU
will then use the ID to select a unique translation for those DMAs. This
is part of Intel's vIOMMU support as VT-D HW requires the hypervisor to
manage each PASID entry. The support is generic so any VFIO user could
attach any translation to a PASID, and the support should work on ARM
SMMUv3 as well. AMD requires additional driver work.
Some minor updates, along with fixes:
- Prevent using nested parents with fault's, no driver support today
- Put a single "cookie_type" value in the iommu_domain to indicate what
owns the various opaque owner fields
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZ+q6NgAKCRCFwuHvBreF
YZ3zAQDbl4/Z0O+CLN2AXq4Zeiyq1HTSoF94hzqmm7lQ17zTIwD8CCdyLXHvupaq
tkBIv5IovpaxlrSk6M0kh2K8vPCk9Qk=
=CIM3
-----END PGP SIGNATURE-----
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"Two significant new items:
- Allow reporting IOMMU HW events to userspace when the events are
clearly linked to a device.
This is linked to the VIOMMU object and is intended to be used by a
VMM to forward HW events to the virtual machine as part of
emulating a vIOMMU. ARM SMMUv3 is the first driver to use this
mechanism. Like the existing fault events the data is delivered
through a simple FD returning event records on read().
- PASID support in VFIO.
The "Process Address Space ID" is a PCI feature that allows the
device to tag all PCI DMA operations with an ID. The IOMMU will
then use the ID to select a unique translation for those DMAs. This
is part of Intel's vIOMMU support as VT-D HW requires the
hypervisor to manage each PASID entry.
The support is generic so any VFIO user could attach any
translation to a PASID, and the support should work on ARM SMMUv3
as well. AMD requires additional driver work.
Some minor updates, along with fixes:
- Prevent using nested parents with fault's, no driver support today
- Put a single "cookie_type" value in the iommu_domain to indicate
what owns the various opaque owner fields"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (49 commits)
iommufd: Test attach before detaching pasid
iommufd: Fix iommu_vevent_header tables markup
iommu: Convert unreachable() to BUG()
iommufd: Balance veventq->num_events inc/dec
iommufd: Initialize the flags of vevent in iommufd_viommu_report_event()
iommufd/selftest: Add coverage for reporting max_pasid_log2 via IOMMU_HW_INFO
iommufd: Extend IOMMU_GET_HW_INFO to report PASID capability
vfio: VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT support pasid
vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
ida: Add ida_find_first_range()
iommufd/selftest: Add coverage for iommufd pasid attach/detach
iommufd/selftest: Add test ops to test pasid attach/detach
iommufd/selftest: Add a helper to get test device
iommufd/selftest: Add set_dev_pasid in mock iommu
iommufd: Allow allocating PASID-compatible domain
iommu/vt-d: Add IOMMU_HWPT_ALLOC_PASID support
iommufd: Enforce PASID-compatible domain for RID
iommufd: Support pasid attach/replace
iommufd: Enforce PASID-compatible domain in PASID path
iommufd/device: Add pasid_attach array to track per-PASID attach
...
|
||
|
|
7d06015d93 |
pci-v6.15-changes
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmfmt1AUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vxqNw//cC1BlVe4UUVR5r9nfpoFeGAZeDJz
32naGvZKjwL0tR6dStO/BEZx4QrBp+smVfJfuxtQxfzLHLgMigM2jVhfa7XUmaun
7yZlJZu4Jmydc57sPf56CFOYOFP6zyPzSaE8u1Eb4IIqpvuoYpvDayDt6PSsLmFS
PDzqmicT3nuNbbcfE4rYLyL6JsXooKCR1h+NNcDjy7Run9DvQbE6N2PPvXCu6O97
aC3+kYUydEpgn9DfjBDghe+GBQCkBPldwnWqXBxKDFmYj5bKFujNccS9/IDSEuuX
oWntDRAXgLWg048sBC1AuJQajF3UaqffRGJkzUBaZWbU/jB9t5N/Z3GpYlXzizRx
CAqnt/ciGUKVbaESKwoeeIqgK+wG1bnrmoEaJHXFGqjr6sjm2A2T5EzyBMJ1hFwE
wUq6SDnkp5igG7rWtsBPo/lGa5h/pNlaXng11570ikD2ZfHVfRgwy2MpXYxChrkt
X2q/lRYU7yyfNJQ8O5LQJ6bYztatjxT0TxXNFv+cxfVrdI7vnQMuaeBE352jn+Lo
aZ8fTJDFbRXtrZolcoetZjBdHwPIS42wYQtvxo/ylUl64xKEzZEzN09XODjwn74v
nuOzDtbM0TAjZWxi6bwRFPnemTEAQxPuv1i4VdPQ7yblvC0at2agNBsz6Fdq60GS
s80YMJVUU0UcVoc=
=gM1i
-----END PGP SIGNATURE-----
Merge tag 'pci-v6.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci updates from Bjorn Helgaas:
"Enumeration:
- Enable Configuration RRS SV, which makes device readiness visible,
early instead of during child bus scanning (Bjorn Helgaas)
- Log debug messages about reset methods being used (Bjorn Helgaas)
- Avoid reset when it has been disabled via sysfs (Nishanth
Aravamudan)
- Add common pci-ep-bus.yaml schema for exporting several peripherals
of a single PCI function via devicetree (Andrea della Porta)
- Create DT nodes for PCI host bridges to enable loading device tree
overlays to create platform devices for PCI devices that have
several features that require multiple drivers (Herve Codina)
Resource management:
- Enlarge devres table[] to accommodate bridge windows, ROM, IOV
BARs, etc., and validate BAR index in devres interfaces (Philipp
Stanner)
- Fix typo that repeatedly distributed resources to a bridge instead
of iterating over subordinate bridges, which resulted in too little
space to assign some BARs (Kai-Heng Feng)
- Relax bridge window tail sizing for optional resources, e.g., IOV
BARs, to avoid failures when removing and re-adding devices (Ilpo
Järvinen)
- Allow drivers to enable devices even if we haven't assigned
optional IOV resources to them (Ilpo Järvinen)
- Rework handling of optional resources (IOV BARs, ROMs) to reduce
failures if we can't allocate them (Ilpo Järvinen)
- Fix a NULL dereference in the SR-IOV VF creation error path (Shay
Drory)
- Fix s390 mmio_read/write syscalls, which didn't cause page faults
in some cases, which broke vfio-pci lazy mapping on first access
(Niklas Schnelle)
- Add pdev->non_mappable_bars to replace CONFIG_VFIO_PCI_MMAP, which
was disabled only for s390 (Niklas Schnelle)
- Support mmap of PCI resources on s390 except for ISM devices
(Niklas Schnelle)
ASPM:
- Delay pcie_link_state deallocation to avoid dangling pointers that
cause invalid references during hot-unplug (Daniel Stodden)
Power management:
- Allow PCI bridges to go to D3Hot when suspending on all non-x86
systems (Manivannan Sadhasivam)
Power control:
- Create pwrctrl devices in pci_scan_device() to make it more
symmetric with pci_pwrctrl_unregister() and make pwrctrl devices
for PCI bridges possible (Manivannan Sadhasivam)
- Unregister pwrctrl devices in pci_destroy_dev() so DOE, ASPM, etc.
can still access devices after pci_stop_dev() (Manivannan
Sadhasivam)
- If there's a pwrctrl device for a PCI device, skip scanning it
because the pwrctrl core will rescan the bus after the device is
powered on (Manivannan Sadhasivam)
- Add a pwrctrl driver for PCI slots based on voltage regulators
described via devicetree (Manivannan Sadhasivam)
Bandwidth control:
- Add set_pcie_speed.sh to TEST_PROGS to fix issue when executing the
set_pcie_cooling_state.sh test case (Yi Lai)
- Avoid a NULL pointer dereference when we run out of bus numbers to
assign for a bridge secondary bus (Lukas Wunner)
Hotplug:
- Drop superfluous pci_hotplug_slot_list, try_module_get() calls, and
NULL pointer checks (Lukas Wunner)
- Drop shpchp module init/exit logging, replace shpchp dbg() with
ctrl_dbg(), and remove unused dbg(), err(), info(), warn() wrappers
(Ilpo Järvinen)
- Drop 'shpchp_debug' module parameter in favor of standard dynamic
debugging (Ilpo Järvinen)
- Drop unused cpcihp .get_power(), .set_power() function pointers
(Guilherme Giacomo Simoes)
- Disable hotplug interrupts in portdrv only when pciehp is not
enabled to avoid issuing two hotplug commands too close together
(Feng Tang)
- Skip pciehp 'device replaced' check if the device has been removed
to address a deadlock when resuming after a device was removed
during system sleep (Lukas Wunner)
- Don't enable pciehp hotplug interupt when resuming in poll mode
(Ilpo Järvinen)
Virtualization:
- Fix bugs in 'pci=config_acs=' kernel command line parameter (Tushar
Dave)
DOE:
- Expose supported DOE features via sysfs (Alistair Francis)
- Allow DOE support to be enabled even if CXL isn't enabled (Alistair
Francis)
Endpoint framework:
- Convert PCI device data so pci-epf-test works correctly on
big-endian endpoint systems (Niklas Cassel)
- Add BAR_RESIZABLE type to endpoint framework and add DWC core
support for EPF drivers to set BAR_RESIZABLE type and size (Niklas
Cassel)
- Fix pci-epf-test double free that causes an oops if the host
reboots and PERST# deassertion restarts endpoint BAR allocation
(Christian Bruel)
- Fix endpoint BAR testing so tests can skip disabled BARs instead of
reporting them as failures (Niklas Cassel)
- Widen endpoint test BAR size variable to accommodate BARs larger
than INT_MAX (Niklas Cassel)
- Remove unused tools 'pci' build target left over after moving tests
to tools/testing/selftests/pci_endpoint (Jianfeng Liu)
Altera PCIe controller driver:
- Add DT binding and driver support for Agilex family (P-Tile,
F-Tile, R-Tile) (Matthew Gerlach and D M, Sharath Kumar)
AMD MDB PCIe controller driver:
- Add DT binding and driver for AMD MDB (Multimedia DMA Bridge)
(Thippeswamy Havalige)
Broadcom STB PCIe controller driver:
- Add BCM2712 MSI-X DT binding and interrupt controller drivers and
add softdep on irq_bcm2712_mip driver to ensure that it is loaded
first (Stanimir Varbanov)
- Expand inbound window map to 64GB so it can accommodate BCM2712
(Stanimir Varbanov)
- Add BCM2712 support and DT updates (Stanimir Varbanov)
- Apply link speed restriction before bringing link up, not after
(Jim Quinlan)
- Update Max Link Speed in Link Capabilities via the internal
writable register, not the read-only config register (Jim Quinlan)
- Handle regulator_bulk_get() error to avoid panic when we call
regulator_bulk_free() later (Jim Quinlan)
- Disable regulators only when removing the bus immediately below a
Root Port because we don't support regulators deeper in the
hierarchy (Jim Quinlan)
- Make const read-only arrays static (Colin Ian King)
Cadence PCIe endpoint driver:
- Correct MSG TLP generation so endpoints can generate INTx messages
(Hans Zhang)
Freescale i.MX6 PCIe controller driver:
- Identify the second controller on i.MX8MQ based on devicetree
'linux,pci-domain' instead of DBI 'reg' address (Richard Zhu)
- Remove imx_pcie_cpu_addr_fixup() since dwc core can now derive the
ATU input address (using parent_bus_offset) from devicetree (Frank
Li)
Freescale Layerscape PCIe controller driver:
- Drop deprecated 'num-ib-windows' and 'num-ob-windows' and
unnecessary 'status' from example (Krzysztof Kozlowski)
- Correct the syscon_regmap_lookup_by_phandle_args("fsl,pcie-scfg")
arg_count to fix probe failure on LS1043A (Ioana Ciornei)
HiSilicon STB PCIe controller driver:
- Call phy_exit() to clean up if histb_pcie_probe() fails (Christophe
JAILLET)
Intel Gateway PCIe controller driver:
- Remove intel_pcie_cpu_addr() since dwc core can now derive the ATU
input address (using parent_bus_offset) from devicetree (Frank Li)
Intel VMD host bridge driver:
- Convert vmd_dev.cfg_lock from spinlock_t to raw_spinlock_t so
pci_ops.read() will never sleep, even on PREEMPT_RT where
spinlock_t becomes a sleepable lock, to avoid calling a sleeping
function from invalid context (Ryo Takakura)
MediaTek PCIe Gen3 controller driver:
- Remove leftover mac_reset assert for Airoha EN7581 SoC (Lorenzo
Bianconi)
- Add EN7581 PBUS controller 'mediatek,pbus-csr' DT property and
program host bridge memory aperture to this syscon node (Lorenzo
Bianconi)
Qualcomm PCIe controller driver:
- Add qcom,pcie-ipq5332 binding (Varadarajan Narayanan)
- Add qcom i.MX8QM and i.MX8QXP/DXP optional DMA interrupt (Alexander
Stein)
- Add optional dma-coherent DT property for Qualcomm SA8775P (Dmitry
Baryshkov)
- Make DT iommu property required for SA8775P and prohibited for
SDX55 (Dmitry Baryshkov)
- Add DT IOMMU and DMA-related properties for Qualcomm SM8450 (Dmitry
Baryshkov)
- Add endpoint DT properties for SAR2130P and enable endpoint mode in
driver (Dmitry Baryshkov)
- Describe endpoint BAR0 and BAR2 as 64-bit only and BAR1 and BAR3 as
RESERVED (Manivannan Sadhasivam)
Rockchip DesignWare PCIe controller driver:
- Describe rk3568 and rk3588 BARs as Resizable, not Fixed (Niklas
Cassel)
Synopsys DesignWare PCIe controller driver:
- Add debugfs-based Silicon Debug, Error Injection, Statistical
Counter support for DWC (Shradha Todi)
- Add debugfs property to expose LTSSM status of DWC PCIe link (Hans
Zhang)
- Add Rockchip support for DWC debugfs features (Niklas Cassel)
- Add dw_pcie_parent_bus_offset() to look up the parent bus address
of a specified 'reg' property and return the offset from the CPU
physical address (Frank Li)
- Use dw_pcie_parent_bus_offset() to derive CPU -> ATU addr offset
via 'reg[config]' for host controllers and 'reg[addr_space]' for
endpoint controllers (Frank Li)
- Apply struct dw_pcie.parent_bus_offset in ATU users to remove use
of .cpu_addr_fixup() when programming ATU (Frank Li)
TI J721E PCIe driver:
- Correct the 'link down' interrupt bit for J784S4 (Siddharth
Vadapalli)
TI Keystone PCIe controller driver:
- Describe AM65x BARs 2 and 5 as Resizable (not Fixed) and reduce
alignment requirement from 1MB to 64KB (Niklas Cassel)
Xilinx Versal CPM PCIe controller driver:
- Free IRQ domain in probe error path to avoid leaking it
(Thippeswamy Havalige)
- Add DT .compatible "xlnx,versal-cpm5nc-host" and driver support for
Versal Net CPM5NC Root Port controller (Thippeswamy Havalige)
- Add driver support for CPM5_HOST1 (Thippeswamy Havalige)
Miscellaneous:
- Convert fsl,mpc83xx-pcie binding to YAML (J. Neuschäfer)
- Use for_each_available_child_of_node_scoped() to simplify apple,
kirin, mediatek, mt7621, tegra drivers (Zhang Zekun)"
* tag 'pci-v6.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (197 commits)
PCI: layerscape: Fix arg_count to syscon_regmap_lookup_by_phandle_args()
PCI: j721e: Fix the value of .linkdown_irq_regfield for J784S4
misc: pci_endpoint_test: Add support for PCITEST_IRQ_TYPE_AUTO
PCI: endpoint: pci-epf-test: Expose supported IRQ types in CAPS register
PCI: dw-rockchip: Endpoint mode cannot raise INTx interrupts
PCI: endpoint: Add intx_capable to epc_features struct
dt-bindings: PCI: Add common schema for devices accessible through PCI BARs
PCI: intel-gw: Remove intel_pcie_cpu_addr()
PCI: imx6: Remove imx_pcie_cpu_addr_fixup()
PCI: dwc: Use parent_bus_offset to remove need for .cpu_addr_fixup()
PCI: dwc: ep: Ensure proper iteration over outbound map windows
PCI: dwc: ep: Use devicetree 'reg[addr_space]' to derive CPU -> ATU addr offset
PCI: dwc: ep: Consolidate devicetree handling in dw_pcie_ep_get_resources()
PCI: dwc: ep: Call epc_create() early in dw_pcie_ep_init()
PCI: dwc: Use devicetree 'reg[config]' to derive CPU -> ATU addr offset
PCI: dwc: Add dw_pcie_parent_bus_offset() checking and debug
PCI: dwc: Add dw_pcie_parent_bus_offset()
PCI/bwctrl: Fix NULL pointer dereference on bus number exhaustion
PCI: xilinx-cpm: Add cpm_csr register mapping for CPM5_HOST1 variant
PCI: brcmstb: Make const read-only arrays static
...
|
||
|
|
ad744ed5dd |
vfio: VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT support pasid
This extends the VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT ioctls to attach/detach a given pasid of a vfio device to/from an IOAS/HWPT. Link: https://patch.msgid.link/r/20250321180143.8468-4-yi.l.liu@intel.com Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> |
||
|
|
290641346d |
vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
This adds pasid_at|de]tach_ioas ops for attaching hwpt to pasid of a device and the helpers for it. For now, only vfio-pci supports pasid attach/detach. Link: https://patch.msgid.link/r/20250321180143.8468-3-yi.l.liu@intel.com Signed-off-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> |
||
|
|
2fb69c602d |
iommufd: Support pasid attach/replace
This extends the below APIs to support PASID. Device drivers to manage pasid
attach/replace/detach.
int iommufd_device_attach(struct iommufd_device *idev,
ioasid_t pasid, u32 *pt_id);
int iommufd_device_replace(struct iommufd_device *idev,
ioasid_t pasid, u32 *pt_id);
void iommufd_device_detach(struct iommufd_device *idev,
ioasid_t pasid);
The pasid operations share underlying attach/replace/detach infrastructure
with the device operations, but still have some different implications:
- no reserved region per pasid otherwise SVA architecture is already
broken (CPU address space doesn't count device reserved regions);
- accordingly no sw_msi trick;
Cache coherency enforcement is still applied to pasid operations since
it is about memory accesses post page table walking (no matter the walk
is per RID or per PASID).
Link: https://patch.msgid.link/r/20250321171940.7213-12-yi.l.liu@intel.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
||
|
|
888bd8322d |
s390/pci: Introduce pdev->non_mappable_bars and replace VFIO_PCI_MMAP
The ability to map PCI resources to user-space is controlled by global defines. For vfio there is VFIO_PCI_MMAP which is only disabled on s390 and controls mapping of PCI resources using vfio-pci with a fallback option via the pread()/pwrite() interface. For the PCI core there is ARCH_GENERIC_PCI_MMAP_RESOURCE which enables a generic implementation for mapping PCI resources plus the newer sysfs interface. Then there is HAVE_PCI_MMAP which can be used with custom definitions of pci_mmap_resource_range() and the historical /proc/bus/pci interface. Both mechanisms are all or nothing. For s390 mapping PCI resources is possible and useful for testing and certain applications such as QEMU's vfio-pci based user-space NVMe driver. For certain devices, however access to PCI resources via mappings to user-space is not possible and these must be excluded from the general PCI resource mapping mechanisms. Introduce pdev->non_mappable_bars to indicate that a PCI device's BARs can not be accessed via mappings to user-space. In the future this enables per-device restrictions of PCI resource mapping. For now, set this flag for all PCI devices on s390 in line with the existing, general disable of PCI resource mapping. As s390 is the only user of the VFI_PCI_MMAP Kconfig options this can already be replaced with a check of this new flag. Also add similar checks in the other code protected by HAVE_PCI_MMAP respectively ARCH_GENERIC_PCI_MMAP in preparation for enabling these for supported devices. Link: https://lore.kernel.org/lkml/20250212132808.08dcf03c.alex.williamson@redhat.com/ Link: https://lore.kernel.org/r/20250226-vfio_pci_mmap-v7-2-c5c0f1d26efd@linux.ibm.com Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> |
||
|
|
860be250fc |
vfio/pci: Handle INTx IRQ_NOTCONNECTED
Some systems report INTx as not routed by setting pdev->irq to IRQ_NOTCONNECTED, resulting in a -ENOTCONN error when trying to setup eventfd signaling. Include this in the set of conditions for which the PIN register is virtualized to zero. Additionally consolidate vfio_pci_get_irq_count() to use this virtualized value in reporting INTx support via ioctl and sanity checking ioctl paths since pdev->irq is re-used when the device is in MSI mode. The combination of these results in both the config space of the device and the ioctl interface behaving as if the device does not support INTx. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20250311230623.1264283-1-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
384a530111 |
vfio/virtio: Enable support for virtio-block live migration
With a functional and tested backend for virtio-block live migration, add the virtio-block device ID to the pci_device_id table. Currently, the driver supports legacy IO functionality only for virtio-net, and it is accounted for in specific parts of the code. To enforce this limitation, an explicit check for virtio-net, has been added in virtiovf_support_legacy_io(). Once a backend implements legacy IO functionality for virtio-block, the necessary support will be added to the driver, and this additional check should be removed. The module description was updated accordingly. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20250302162723.82578-1-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
0fd06844de |
vfio/type1: Use mapping page mask for pfnmaps
vfio-pci supports huge_fault for PCI MMIO BARs and will insert pud and pmd mappings for well aligned mappings. follow_pfnmap_start() walks the page table and therefore knows the page mask of the level where the address is found and returns this through follow_pfnmap_args.addr_mask. Subsequent pfns from this address until the end of the mapping page are necessarily consecutive. Use this information to retrieve a range of pfnmap pfns in a single pass. With optimal mappings and alignment on systems with 1GB pud and 4KB page size, this reduces iterations for DMA mapping PCI BARs by a factor of 256K. In real world testing, the overhead of iterating pfns for a VM DMA mapping a 32GB PCI BAR is reduced from ~1s to sub-millisecond overhead. Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mitchell Augustin <mitchell.augustin@canonical.com> Tested-by: Mitchell Augustin <mitchell.augustin@canonical.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20250218222209.1382449-7-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
0635559233 |
vfio/type1: Use consistent types for page counts
Page count should more consistently be an unsigned long when passed as an argument while functions returning a number of pages should use a signed long to allow for -errno. vaddr_get_pfns() can therefore be upgraded to return long, though in practice it's currently limited by the batch capacity. In fact, the batch indexes are noted to never hold negative values, so while it doesn't make sense to bloat the structure with unsigned longs in this case, it does make sense to specify these as unsigned. No change in behavior expected. Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mitchell Augustin <mitchell.augustin@canonical.com> Tested-by: Mitchell Augustin <mitchell.augustin@canonical.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20250218222209.1382449-5-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
eb996eec78 |
vfio/type1: Use vfio_batch for vaddr_get_pfns()
Passing the vfio_batch to vaddr_get_pfns() allows for greater distinction between page backed pfns and pfnmaps. In the case of page backed pfns, vfio_batch.size is set to a positive value matching the number of pages filled in vfio_batch.pages. For a pfnmap, vfio_batch.size remains zero as vfio_batch.pages are not used. In both cases the return value continues to indicate the number of pfns and the provided pfn arg is set to the initial pfn value. This allows us to shortcut the pfnmap case, which is detected by the zero vfio_batch.size. pfnmaps do not contribute to locked memory accounting, therefore we can update counters and continue directly, which also enables a future where vaddr_get_pfns() can return a value greater than one for consecutive pfnmaps. NB. Now that we're not guessing whether the initial pfn is page backed or pfnmap, we no longer need to special case the put_pfn() and batch size reset. It's safe for vfio_batch_unpin() to handle this case. Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mitchell Augustin <mitchell.augustin@canonical.com> Tested-by: Mitchell Augustin <mitchell.augustin@canonical.com> Link: https://lore.kernel.org/r/20250218222209.1382449-4-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
7a701e90fc |
vfio/type1: Convert all vaddr_get_pfns() callers to use vfio_batch
This is a step towards passing the structure to vaddr_get_pfns() directly in order to provide greater distinction between page backed pfns and pfnmaps. Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mitchell Augustin <mitchell.augustin@canonical.com> Tested-by: Mitchell Augustin <mitchell.augustin@canonical.com> Link: https://lore.kernel.org/r/20250218222209.1382449-3-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
afe84f3b7a |
vfio/type1: Catch zero from pin_user_pages_remote()
pin_user_pages_remote() can currently return zero for invalid args or zero nr_pages, neither of which should ever happen. However vaddr_get_pfns() indicates it should only ever return a positive value or -errno and there's a theoretical case where this can slip through and be unhandled by callers. Therefore convert zero to -EFAULT. Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mitchell Augustin <mitchell.augustin@canonical.com> Tested-by: Mitchell Augustin <mitchell.augustin@canonical.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20250218222209.1382449-2-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
41112160ca |
vfio/pci: match IGD devices in display controller class
IGD device can either expose as a VGA controller or display controller depending on whether it is configured as the primary display device in BIOS. In both cases, the OpRegion may be present. A new helper function vfio_pci_is_intel_display() is introduced to check if the device might be an IGD device. Signed-off-by: Tomita Moeko <tomitamoeko@gmail.com> Link: https://lore.kernel.org/r/20250123163416.7653-1-tomitamoeko@gmail.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
f9835fa147
|
make use of anon_inode_getfile_fmode()
["fallen through the cracks" misc stuff] A bunch of anon_inode_getfile() callers follow it with adjusting ->f_mode; we have a helper doing that now, so let's make use of it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20250118014434.GT1977892@ZenIV Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
3673f5be0e |
VFIO updates for v6.14-rc1
- Extend vfio-pci 8-byte read/write support to include archs defining
CONFIG_GENERIC_IOMAP, such as x86, and remove now extraneous #ifdefs
around 64-bit accessors. (Ramesh Thomas)
- Update vfio-pci shadow ROM handling and allow cached ROM from setup
data to be exposed as a functional ROM BAR region when available.
(Yunxiang Li)
- Update nvgrace-gpu vfio-pci variant driver for new Grace Blackwell
hardware, conditionalizing the uncached BAR workaround for previous
generation hardware based on the presence of a flag in a new DVSEC
capability, and include a delay during probe for link training to
complete, a new requirement for GB devices. (Ankit Agrawal)
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmeZNf0bHGFsZXgud2ls
bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsiZDIP/16zZ6FwPxu52erigZCN
6GwniPU051kDEKMhlO9aGSJNXkzB3qGYUXH1rExIYazRrovc9QYRG1JXBEGIj+Dj
NkxhwJBvaj61dc1p+lfNH/jZYipE+mbuguz1gOZfwMEA/8uNmsFd2uhXBZBehI2X
WCQQdxwz9GWB34pabN2OuR3cFwYbYD06kg/WfxsAEisDhRsjrPn//fbwZNBM7mLy
4gmatoQ97uPexo+SACQSIIop7TlNeiA+Mo8i/XxTpmjry9Wl0tMNBKfNaxMf7o5p
4laRU6EyT0/Cimc1w8Mct96fvO1AqKIRnBqFYwxzmtYthllpKPqnoZlwOPSE24f7
zbB46NkhdE6JOsqJUMPj+hdW3bBhQgcpIMU3MkYgbzNVjcb5DcIDZk3b64DIJOqK
HzItxvUNXVo9HYnc1gdI88c2btDA1hDOzH5fFX85AmQcUqs24+i7ekdEyws65J0O
iVBJP/cC51vAJv0y4gtty+bq1OqcQ3jwnEvre52F9LPJVHFsKA8RheOyodlG0Ar8
m1zWJVZbQIFbs8gp+q/GHdltQ9w0XvECQOe1EE7zxAQX0noks+3S+Ea+wAYYZH5p
a1fbep0MoL3fZF+s4a7kc/avcm1WRpQTSY10HC3K/+0wQ5S0B8n/QG4S9mMsEEBn
G/9+7ELvYrFop/CfC4Mkj0MA
=NJZ8
-----END PGP SIGNATURE-----
Merge tag 'vfio-v6.14-rc1' of https://github.com/awilliam/linux-vfio
Pull vfio updates from Alex Williamson:
- Extend vfio-pci 8-byte read/write support to include archs defining
CONFIG_GENERIC_IOMAP, such as x86, and remove now extraneous #ifdefs
around 64-bit accessors (Ramesh Thomas)
- Update vfio-pci shadow ROM handling and allow cached ROM from setup
data to be exposed as a functional ROM BAR region when available
(Yunxiang Li)
- Update nvgrace-gpu vfio-pci variant driver for new Grace Blackwell
hardware, conditionalizing the uncached BAR workaround for previous
generation hardware based on the presence of a flag in a new DVSEC
capability, and include a delay during probe for link training to
complete, a new requirement for GB devices (Ankit Agrawal)
* tag 'vfio-v6.14-rc1' of https://github.com/awilliam/linux-vfio:
vfio/nvgrace-gpu: Add GB200 SKU to the devid table
vfio/nvgrace-gpu: Check the HBM training and C2C link status
vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem
vfio/platform: check the bounds of read/write syscalls
vfio/pci: Expose setup ROM at ROM bar when needed
vfio/pci: Remove shadow ROM specific code paths
vfio/pci: Remove #ifdef iowrite64 and #ifdef ioread64
vfio/pci: Enable iowrite64 and ioread64 for vfio pci
|
||
|
|
2ab002c755 |
Driver core and debugfs updates
Here is the big set of driver core and debugfs updates for 6.14-rc1.
It's coming late in the merge cycle as there are a number of merge
conflicts with your tree now, and I wanted to make sure they were
working properly. To resolve them, look in linux-next, and I will send
the "fixup" patch as a response to the pull request.
Included in here is a bunch of driver core, PCI, OF, and platform rust
bindings (all acked by the different subsystem maintainers), hence the
merge conflict with the rust tree, and some driver core api updates to
mark things as const, which will also require some fixups due to new
stuff coming in through other trees in this merge window.
There are also a bunch of debugfs updates from Al, and there is at least
one user that does have a regression with these, but Al is working on
tracking down the fix for it. In my use (and everyone else's linux-next
use), it does not seem like a big issue at the moment.
Here's a short list of the things in here:
- driver core bindings for PCI, platform, OF, and some i/o functions.
We are almost at the "write a real driver in rust" stage now,
depending on what you want to do.
- misc device rust bindings and a sample driver to show how to use
them
- debugfs cleanups in the fs as well as the users of the fs api for
places where drivers got it wrong or were unnecessarily doing things
in complex ways.
- driver core const work, making more of the api take const * for
different parameters to make the rust bindings easier overall.
- other small fixes and updates
All of these have been in linux-next with all of the aforementioned
merge conflicts, and the one debugfs issue, which looks to be resolved
"soon".
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZ5koPA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymFHACfT5acDKf2Bov2Lc/5u3vBW/R6ChsAnj+LmgVI
hcDSPodj4szR40RRnzBd
=u5Ey
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core and debugfs updates from Greg KH:
"Here is the big set of driver core and debugfs updates for 6.14-rc1.
Included in here is a bunch of driver core, PCI, OF, and platform rust
bindings (all acked by the different subsystem maintainers), hence the
merge conflict with the rust tree, and some driver core api updates to
mark things as const, which will also require some fixups due to new
stuff coming in through other trees in this merge window.
There are also a bunch of debugfs updates from Al, and there is at
least one user that does have a regression with these, but Al is
working on tracking down the fix for it. In my use (and everyone
else's linux-next use), it does not seem like a big issue at the
moment.
Here's a short list of the things in here:
- driver core rust bindings for PCI, platform, OF, and some i/o
functions.
We are almost at the "write a real driver in rust" stage now,
depending on what you want to do.
- misc device rust bindings and a sample driver to show how to use
them
- debugfs cleanups in the fs as well as the users of the fs api for
places where drivers got it wrong or were unnecessarily doing
things in complex ways.
- driver core const work, making more of the api take const * for
different parameters to make the rust bindings easier overall.
- other small fixes and updates
All of these have been in linux-next with all of the aforementioned
merge conflicts, and the one debugfs issue, which looks to be resolved
"soon""
* tag 'driver-core-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits)
rust: device: Use as_char_ptr() to avoid explicit cast
rust: device: Replace CString with CStr in property_present()
devcoredump: Constify 'struct bin_attribute'
devcoredump: Define 'struct bin_attribute' through macro
rust: device: Add property_present()
saner replacement for debugfs_rename()
orangefs-debugfs: don't mess with ->d_name
octeontx2: don't mess with ->d_parent or ->d_parent->d_name
arm_scmi: don't mess with ->d_parent->d_name
slub: don't mess with ->d_name
sof-client-ipc-flood-test: don't mess with ->d_name
qat: don't mess with ->d_name
xhci: don't mess with ->d_iname
mtu3: don't mess wiht ->d_iname
greybus/camera - stop messing with ->d_iname
mediatek: stop messing with ->d_iname
netdevsim: don't embed file_operations into your structs
b43legacy: make use of debugfs_get_aux()
b43: stop embedding struct file_operations into their objects
carl9170: stop embedding file_operations into their objects
...
|
||
|
|
2bb447540e |
vfio/nvgrace-gpu: Add GB200 SKU to the devid table
NVIDIA is productizing the new Grace Blackwell superchip SKU bearing device ID 0x2941. Add the SKU devid to nvgrace_gpu_vfio_pci_table. CC: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20250124183102.3976-5-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
d85f69d520 |
vfio/nvgrace-gpu: Check the HBM training and C2C link status
In contrast to Grace Hopper systems, the HBM training has been moved out of the UEFI on the Grace Blackwell systems. This reduces the system bootup time significantly. The onus of checking whether the HBM training has completed thus falls on the module. The HBM training status can be determined from a BAR0 register. Similarly, another BAR0 register exposes the status of the CPU-GPU chip-to-chip (C2C) cache coherent interconnect. Based on testing, 30s is determined to be sufficient to ensure initialization completion on all the Grace based systems. Thus poll these register and check for 30s. If the HBM training is not complete or if the C2C link is not ready, fail the probe. While the time is not required on Grace Hopper systems, it is beneficial to make the check to ensure the device is in an expected state. Hence keeping it generalized to both the generations. Ensure that the BAR0 is enabled before accessing the registers. CC: Alex Williamson <alex.williamson@redhat.com> CC: Kevin Tian <kevin.tian@intel.com> CC: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20250124183102.3976-4-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
6a9eb2d125 |
vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
There is a HW defect on Grace Hopper (GH) to support the Multi-Instance GPU (MIG) feature [1] that necessiated the presence of a 1G region carved out from the device memory and mapped as uncached. The 1G region is shown as a fake BAR (comprising region 2 and 3) to workaround the issue. The Grace Blackwell systems (GB) differ from GH systems in the following aspects: 1. The aforementioned HW defect is fixed on GB systems. 2. There is a usable BAR1 (region 2 and 3) on GB systems for the GPUdirect RDMA feature [2]. This patch accommodate those GB changes by showing the 64b physical device BAR1 (region2 and 3) to the VM instead of the fake one. This takes care of both the differences. Moreover, the entire device memory is exposed on GB as cacheable to the VM as there is no carveout required. Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1] Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2] Cc: Kevin Tian <kevin.tian@intel.com> CC: Jason Gunthorpe <jgg@nvidia.com> Suggested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20250124183102.3976-3-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
bd53764a60 |
vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem
NVIDIA's recently introduced Grace Blackwell (GB) Superchip is a continuation with the Grace Hopper (GH) superchip that provides a cache coherent access to CPU and GPU to each other's memory with an internal proprietary chip-to-chip cache coherent interconnect. There is a HW defect on GH systems to support the Multi-Instance GPU (MIG) feature [1] that necessiated the presence of a 1G region with uncached mapping carved out from the device memory. The 1G region is shown as a fake BAR (comprising region 2 and 3) to workaround the issue. This is fixed on the GB systems. The presence of the fix for the HW defect is communicated by the device firmware through the DVSEC PCI config register with ID 3. The module reads this to take a different codepath on GB vs GH. Scan through the DVSEC registers to identify the correct one and use it to determine the presence of the fix. Save the value in the device's nvgrace_gpu_pci_core_device structure. Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1] CC: Jason Gunthorpe <jgg@nvidia.com> CC: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20250124183102.3976-2-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
9c5968db9e |
The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec. - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones. - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest. - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code. - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups. - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code. - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c. - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator. - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading. - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated (https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/). Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED). - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled. - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL. - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests. - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size. - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic. - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated. - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated. - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed. - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic. - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy. - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic. - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions. - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed. - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting. - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface. - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior. - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi "introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors." - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram. - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal. - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance. - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation. - "Cleanup for memfd_create()" from Isaac Manjarres does that thing. - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration. - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices. - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5a+cwAKCRDdBJ7gKXxA jtoyAP9R58oaOKPJuTizEKKXvh/RpMyD6sYcz/uPpnf+cKTZxQEAqfVznfWlw/Lz uC3KRZYhmd5YrxU4o+qjbzp9XWX/xAE= =Ib2s -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "The various patchsets are summarized below. Plus of course many indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated: https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED) - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation - "Cleanup for memfd_create()" from Isaac Manjarres does that thing - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests" * tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm/compaction: fix UBSAN shift-out-of-bounds warning s390/mm: add missing ctor/dtor on page table upgrade kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags() tools: add VM_WARN_ON_VMG definition mm/damon/core: use str_high_low() helper in damos_wmark_wait_us() seqlock: add missing parameter documentation for raw_seqcount_try_begin() mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh mm/page_alloc: remove the incorrect and misleading comment zram: remove zcomp_stream_put() from write_incompressible_page() mm: separate move/undo parts from migrate_pages_batch() mm/kfence: use str_write_read() helper in get_access_type() selftests/mm/mkdirty: fix memory leak in test_uffdio_copy() kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags() selftests/mm: virtual_address_range: avoid reading from VM_IO mappings selftests/mm: vm_util: split up /proc/self/smaps parsing selftests/mm: virtual_address_range: unmap chunks after validation selftests/mm: virtual_address_range: mmap() without PROT_WRITE selftests/memfd/memfd_test: fix possible NULL pointer dereference mm: add FGP_DONTCACHE folio creation flag mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue ... |
||
|
|
6bf9b5b40a |
mm: alloc_pages_bulk: rename API
The previous commit removed the page_list argument from alloc_pages_bulk_noprof() along with the alloc_pages_bulk_list() function. Now that only the *_array() flavour of the API remains, we can do the following renaming (along with the _noprof() ones): alloc_pages_bulk_array -> alloc_pages_bulk alloc_pages_bulk_array_mempolicy -> alloc_pages_bulk_mempolicy alloc_pages_bulk_array_node -> alloc_pages_bulk_node Link: https://lkml.kernel.org/r/275a3bbc0be20fbe9002297d60045e67ab3d4ada.1734991165.git.luizcap@redhat.com Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
647d69605c |
pci-v6.14-changes
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmeTr8wUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vxrMw//TJXH+U6x5LhYvBPD/KZ20ecGHqaA
eGXrbHAasYbU1CfW7HM0onR8NffOIGoYvQrtefjQAln0w6rTvyFO0xJKLP15vMfN
hnj+y1WWtKwAkSpu10Cl9nTj8uYRNNSQeoy5kS+1diwuXdby/DlgQONO2APSe9zd
KMPXJcqSfDJlM5zHrcqqtlxauO9KHInLCc/iutd85AKjvcjOoNHNeZE0pTC0C3gE
sXYHDqJiS3zdEG6X6mWFo3OzI/Q/7NGlHJ2j0CQaObsgQ9yA7eWkez25ifwZcugc
TPtjm8DhaDo9/zx0NV9c2dPauHRC6NYUjAflMPK7Aye/41BE1Ag5Ka+tMDgC2i/N
TbfBxSeArhjnjY+eZwRhrJNNC58TtHTUs69TO7Dbmuwr7cp99MIEDAYI5V6LFAdk
plKqn1h8FztW5QKRPCgmzy6KTE+WPytiGAGAQFxzIGYkV/QqyvFaVs8FIyOJUIFM
aDSa6Xy5WLGxmPZ9hPapzEm4ws/HTRpFjNgi/d4rRG5RWMwAxZZa44s9eldhN1D/
ZwEmF2rJ+U8S7Q+mXPHDlwcsHe5APCbiaTEp4X+e3LNe0i9oxhhaUWG6LrDDmTlQ
tU5j5daHiBa0nTDL1lfaayJlYX/oJ+IYQrIYzGbnivZv4ZVdPnuWSsOsMOEiLhEt
4QqCoanqf0mCn2A=
=O62O
-----END PGP SIGNATURE-----
Merge tag 'pci-v6.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci updates from Bjorn Helgaas:
"Enumeration:
- Batch sizing of multiple BARs while memory decoding is disabled
instead of disabling/enabling decoding for each BAR individually;
this optimizes virtualized environments where toggling decoding
enable is expensive (Alex Williamson)
- Add host bridge .enable_device() and .disable_device() hooks for
bridges that need to configure things like Requester ID to StreamID
mapping when enabling devices (Frank Li)
- Extend struct pci_ecam_ops with .enable_device() and
.disable_device() hooks so drivers that use pci_host_common_probe()
instead of their own .probe() have a way to set the
.enable_device() callbacks (Marc Zyngier)
- Drop 'No bus range found' message so we don't complain when DTs
don't specify the default 'bus-range = <0x00 0xff>' (Bjorn Helgaas)
- Rename the drivers/pci/of_property.c struct of_pci_range to
of_pci_range_entry to avoid confusion with the global of_pci_range
in include/linux/of_address.h (Bjorn Helgaas)
Driver binding:
- Update resource request API documentation to encourage callers to
supply a driver name when requesting resources (Philipp Stanner)
- Export pci_intx_unmanaged() and pcim_intx() (always managed) so
callers of pci_intx() (which is sometimes managed) can explicitly
choose the one they need (Philipp Stanner)
- Convert drivers from pci_intx() to always-managed pcim_intx() or
never-managed pci_intx_unmanaged(): amd_sfh, ata (ahci, ata_piix,
pata_rdc, sata_sil24, sata_sis, sata_uli, sata_vsc), bnx2x, bna,
ntb, qtnfmac, rtsx, tifm_7xx1, vfio, xen-pciback (Philipp Stanner)
- Remove pci_intx_unmanaged() since pci_intx() is now always
unmanaged and pcim_intx() is always managed (Philipp Stanner)
Error handling:
- Unexport pcie_read_tlp_log() to encourage drivers to use PCI core
logging rather than building their own (Ilpo Järvinen)
- Move TLP Log handling to its own file (Ilpo Järvinen)
- Store number of supported End-End TLP Prefixes always so we can
read the correct number of DWORDs from the TLP Prefix Log (Ilpo
Järvinen)
- Read TLP Prefixes in addition to the Header Log in
pcie_read_tlp_log() (Ilpo Järvinen)
- Add pcie_print_tlp_log() to consolidate printing of TLP Header and
Prefix Log (Ilpo Järvinen)
- Quirk the Intel Raptor Lake-P PIO log size to accommodate vendor
BIOSes that don't configure it correctly (Takashi Iwai)
ASPM:
- Save parent L1 PM Substates config so when we restore it along with
an endpoint's config, the parent info isn't junk (Jian-Hong Pan)
Power management:
- Avoid D3 for Root Ports on TUXEDO Sirius Gen1 with old BIOS because
the system can't wake up from suspend (Werner Sembach)
Endpoint framework:
- Destroy the EPC device in devm_pci_epc_destroy(), which previously
didn't call devres_release() (Zijun Hu)
- Finish virtual EP removal in pci_epf_remove_vepf(), which
previously caused a subsequent pci_epf_add_vepf() to fail with
-EBUSY (Zijun Hu)
- Write BAR_MASK before iATU registers in pci_epc_set_bar() so we
don't depend on the BAR_MASK reset value being larger than the
requested BAR size (Niklas Cassel)
- Prevent changing BAR size/flags in pci_epc_set_bar() to prevent
reads from bypassing the iATU if we reduced the BAR size (Niklas
Cassel)
- Verify address alignment when programming iATU so we don't attempt
to write bits that are read-only because of the BAR size, which
could lead to directing accesses to the wrong address (Niklas
Cassel)
- Implement artpec6 pci_epc_features so we can rely on all drivers
supporting it so we can use it in EPC core code (Niklas Cassel)
- Check for BARs of fixed size to prevent endpoint drivers from
trying to change their size (Niklas Cassel)
- Verify that requested BAR size is a power of two when endpoint
driver sets the BAR (Niklas Cassel)
Endpoint framework tests:
- Clear pci-epf-test dma_chan_rx, not dma_chan_tx, after freeing
dma_chan_rx (Mohamed Khalfella)
- Correct the DMA MEMCPY test so it doesn't fail if the Endpoint
supports both DMA_PRIVATE and DMA_MEMCPY (Manivannan Sadhasivam)
- Add pci-epf-test and pci_endpoint_test support for capabilities
(Niklas Cassel)
- Add Endpoint test for consecutive BARs (Niklas Cassel)
- Remove redundant comparison from Endpoint BAR test because a > 1MB
BAR can always be exactly covered by iterating with a 1MB buffer
(Hans Zhang)
- Move and convert PCI Endpoint tests from tools/pci to Kselftests
(Manivannan Sadhasivam)
Apple PCIe controller driver:
- Convert StreamID mapping configuration from a bus notifier to the
.enable_device() and .disable_device() callbacks (Marc Zyngier)
Freescale i.MX6 PCIe controller driver:
- Add Requester ID to StreamID mapping configuration when enabling
devices (Frank Li)
- Use DWC core suspend/resume functions for imx6 (Frank Li)
- Add suspend/resume support for i.MX8MQ, i.MX8Q, and i.MX95 (Richard
Zhu)
- Add DT compatible string 'fsl,imx8q-pcie-ep' and driver support for
i.MX8Q series (i.MX8QM, i.MX8QXP, and i.MX8DXL) Endpoints (Frank
Li)
- Add DT binding for optional i.MX95 Refclk and driver support to
enable it if the platform hasn't enabled it (Richard Zhu)
- Configure PHY based on controller being in Root Complex or Endpoint
mode (Frank Li)
- Rely on dbi2 and iATU base addresses from DT via
dw_pcie_get_resources() instead of hardcoding them (Richard Zhu)
- Deassert apps_reset in imx_pcie_deassert_core_reset() since it is
asserted in imx_pcie_assert_core_reset() (Richard Zhu)
- Add missing reference clock enable or disable logic for IMX6SX,
IMX7D, IMX8MM (Richard Zhu)
- Remove redundant imx7d_pcie_init_phy() since
imx7d_pcie_enable_ref_clk() does the same thing (Richard Zhu)
Freescale Layerscape PCIe controller driver:
- Simplify by using syscon_regmap_lookup_by_phandle_args() instead
of syscon_regmap_lookup_by_phandle() followed by
of_property_read_u32_array() (Krzysztof Kozlowski)
Marvell MVEBU PCIe controller driver:
- Add MODULE_DEVICE_TABLE() to enable module autoloading (Liao Chen)
MediaTek PCIe Gen3 controller driver:
- Use clk_bulk_prepare_enable() instead of separate
clk_bulk_prepare() and clk_bulk_enable() (Lorenzo Bianconi)
- Rearrange reset assert/deassert so they're both done in the
*_power_up() callbacks (Lorenzo Bianconi)
- Document that Airoha EN7581 requires PHY init and power-on before
PHY reset deassert, unlike other MediaTek Gen3 controllers (Lorenzo
Bianconi)
- Move Airoha EN7581 post-reset delay from the en7581 clock .enable()
method to mtk_pcie_en7581_power_up() (Lorenzo Bianconi)
- Sleep instead of delay during Airoha EN7581 power-up, since this is
a non-atomic context (Lorenzo Bianconi)
- Skip PERST# assertion on Airoha EN7581 during probe and
suspend/resume to avoid a hardware defect (Lorenzo Bianconi)
- Enable async probe to reduce system startup time (Douglas Anderson)
Microchip PolarFlare PCIe controller driver:
- Set up the inbound address translation based on whether the
platform allows coherent or non-coherent DMA (Daire McNamara)
- Update DT binding such that platforms are DMA-coherent by default
and must specify 'dma-noncoherent' if needed (Conor Dooley)
Mobiveil PCIe controller driver:
- Convert mobiveil-pcie.txt to YAML and update 'interrupt-names'
and 'reg-names' (Frank Li)
Qualcomm PCIe controller driver:
- Add DT SM8550 and SM8650 optional 'global' interrupt for link
events (Neil Armstrong)
- Add DT 'compatible' strings for IPQ5424 PCIe controller (Manikanta
Mylavarapu)
- If 'global' IRQ is supported for detection of Link Up events, tell
DWC core not to wait for link up (Krishna chaitanya chundru)
Renesas R-Car PCIe controller driver:
- Avoid passing stack buffer as resource name (King Dix)
Rockchip PCIe controller driver:
- Simplify clock and reset handling by using bulk interfaces (Anand
Moon)
- Pass typed rockchip_pcie (not void) pointer to
rockchip_pcie_disable_clocks() (Anand Moon)
- Return -ENOMEM, not success, when pci_epc_mem_alloc_addr() fails
(Dan Carpenter)
Rockchip DesignWare PCIe controller driver:
- Use dll_link_up IRQ to detect Link Up and enumerate devices so
users don't have to manually rescan (Niklas Cassel)
- Tell DWC core not to wait for link up since the 'sys' interrupt is
required and detects Link Up events (Niklas Cassel)
Synopsys DesignWare PCIe controller driver:
- Don't wait for link up in DWC core if driver can detect Link Up
event (Krishna chaitanya chundru)
- Update ICC and OPP votes after Link Up events (Krishna chaitanya
chundru)
- Always stop link in dw_pcie_suspend_noirq(), which is required at
least for i.MX8QM to re-establish link on resume (Richard Zhu)
- Drop racy and unnecessary LTSSM state check before sending
PME_TURN_OFF message in dw_pcie_suspend_noirq() (Richard Zhu)
- Add struct of_pci_range.parent_bus_addr for devices that need their
immediate parent bus address, not the CPU address, e.g., to program
an internal Address Translation Unit (iATU) (Frank Li)
TI DRA7xx PCIe controller driver:
- Simplify by using syscon_regmap_lookup_by_phandle_args() instead of
syscon_regmap_lookup_by_phandle() followed by
of_parse_phandle_with_fixed_args() or of_property_read_u32_index()
(Krzysztof Kozlowski)
Xilinx Versal CPM PCIe controller driver:
- Add DT binding and driver support for Xilinx Versal CPM5
(Thippeswamy Havalige)
MicroSemi Switchtec management driver:
- Add Microchip PCI100X device IDs (Rakesh Babu Saladi)
Miscellaneous:
- Move reset related sysfs code from pci.c to pci-sysfs.c where other
similar code lives (Ilpo Järvinen)
- Simplify reset_method_store() memory management by using __free()
instead of explicit kfree() cleanup (Ilpo Järvinen)
- Constify struct bin_attribute for sysfs, VPD, P2PDMA, and the IBM
ACPI hotplug driver (Thomas Weißschuh)
- Remove redundant PCI_VSEC_HDR and PCI_VSEC_HDR_LEN_SHIFT (Dongdong
Zhang)
- Correct documentation of the 'config_acs=' kernel parameter
(Akihiko Odaki)"
* tag 'pci-v6.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (111 commits)
PCI: Batch BAR sizing operations
dt-bindings: PCI: microchip,pcie-host: Allow dma-noncoherent
PCI: microchip: Set inbound address translation for coherent or non-coherent mode
Documentation: Fix pci=config_acs= example
PCI: Remove redundant PCI_VSEC_HDR and PCI_VSEC_HDR_LEN_SHIFT
PCI: Don't include 'pm_wakeup.h' directly
selftests: pci_endpoint: Migrate to Kselftest framework
selftests: Move PCI Endpoint tests from tools/pci to Kselftests
misc: pci_endpoint_test: Fix IOCTL return value
dt-bindings: PCI: qcom: Document the IPQ5424 PCIe controller
dt-bindings: PCI: qcom,pcie-sm8550: Document 'global' interrupt
dt-bindings: PCI: mobiveil: Convert mobiveil-pcie.txt to YAML
PCI: switchtec: Add Microchip PCI100X device IDs
misc: pci_endpoint_test: Remove redundant 'remainder' test
misc: pci_endpoint_test: Add consecutive BAR test
misc: pci_endpoint_test: Add support for capabilities
PCI: endpoint: pci-epf-test: Add support for capabilities
PCI: endpoint: pci-epf-test: Fix check for DMA MEMCPY test
PCI: endpoint: pci-epf-test: Set dma_chan_rx pointer to NULL on error
PCI: dwc: Simplify config resource lookup
...
|
||
|
|
ce9ff21ea8 |
vfio/platform: check the bounds of read/write syscalls
count and offset are passed from user space and not checked, only
offset is capped to 40 bits, which can be used to read/write out of
bounds of the device.
Fixes:
|
||
|
|
188973e353 |
PCI: Remove redundant PCI_VSEC_HDR and PCI_VSEC_HDR_LEN_SHIFT
Remove duplicate macro PCI_VSEC_HDR and its related macro PCI_VSEC_HDR_LEN_SHIFT from pci_regs.h to avoid redundancy and inconsistencies. Update VFIO PCI code to use PCI_VNDR_HEADER and PCI_VNDR_HEADER_LEN() for consistent naming and functionality. These changes aim to streamline header handling while minimizing impact, given the niche usage of these macros in userspace. Link: https://lore.kernel.org/r/20241216013536.4487-1-zhangdongdong@eswincomputing.com Signed-off-by: Dongdong Zhang <zhangdongdong@eswincomputing.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
dd19f4116e |
Merge 6.13-rc7 into driver-core-next
We need the debugfs / driver-core fixes in here as well for testing and to build on top of. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
827ed8b159 |
drivers: core: remove device_link argument from class_compat_[create|remove]_link
After
|
||
|
|
e021e6cbfb |
vfio/pci: Expose setup ROM at ROM bar when needed
If ROM bar is missing for any reason, we can fallback to using pdev->rom to expose the ROM content to the guest. This fixes some passthrough use cases where the upstream bridge does not have enough address window. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Link: https://lore.kernel.org/r/20250102185013.15082-3-Yunxiang.Li@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
c5a8b5d740 |
vfio/pci: Remove shadow ROM specific code paths
After commit
|
||
|
|
b44a06bd28 |
vfio/pci: Remove #ifdef iowrite64 and #ifdef ioread64
Remove the #ifdef iowrite64 and #ifdef ioread64 checks around calls to 64 bit IO access. Since default implementations have been enabled, the checks are not required. Signed-off-by: Ramesh Thomas <ramesh.thomas@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20241210131938.303500-3-ramesh.thomas@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
2b938e3db3 |
vfio/pci: Enable iowrite64 and ioread64 for vfio pci
Definitions of ioread64 and iowrite64 macros in asm/io.h called by vfio pci implementations are enclosed inside check for CONFIG_GENERIC_IOMAP. They don't get defined if CONFIG_GENERIC_IOMAP is defined. Include linux/io-64-nonatomic-lo-hi.h to define iowrite64 and ioread64 macros when they are not defined. io-64-nonatomic-lo-hi.h maps the macros to generic implementation in lib/iomap.c. The generic implementation does 64 bit rw if readq/writeq is defined for the architecture, otherwise it would do 32 bit back to back rw. Note that there are two versions of the generic implementation that differs in the order the 32 bit words are written if 64 bit support is not present. This is not the little/big endian ordering, which is handled separately. This patch uses the lo followed by hi word ordering which is consistent with current back to back implementation in the vfio/pci code. Signed-off-by: Ramesh Thomas <ramesh.thomas@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20241210131938.303500-2-ramesh.thomas@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
09dfc8a5f2 |
vfio/pci: Fallback huge faults for unaligned pfn
The PFN must also be aligned to the fault order to insert a huge
pfnmap. Test the alignment and fallback when unaligned.
Fixes:
|
||
|
|
ec8e2d3889 |
VFIO fixes for v6.13-rc3
- Fix migration dirty page tracking support in the mlx5-vfio-pci
variant driver in configurations where the system page size exceeds
the device maximum message size, and anticipate device updates where
the opposite may also be required. (Yishai Hadas)
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmdZ+IMbHGFsZXgud2ls
bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsithwP/iYRj4FFMArTRxx8etEe
QE7ZCm17Rd7eAWNdJx9Rr8XE34OEigzxJy4IKHd/EajmCmq/FvRvRM/Svho1fKp7
3p4dzP6pGMC7fEf6OywrL1/51Lu8KU6Bkl7njyd81UEK6667fEPJNoifPlYXi3o9
NPczBETYjIAwhz4b6vUzHZ8mX52cgz1+wq7L2yOnMWrrwVwkQaZDUQVJWE+x3bkz
/NRSYzJqaL/UJzowGQM+YDnYPYIQs1lg81wecVvTM1Hu3Dseh/mIwpKG6r8HBpJw
kmEFqYNAfeCnNSleJA5s9iGxvX0VODBESfiFJyXYH25Tofjd8TEttTrXIJoO7dif
4Ryptc2J5cQXkIwTDgQdjwtNWlABA0qelssBUyVd+Lusf/GhhwRwUJK5uoe+faOq
0EcHUEEDrgDPhCxvy+rNvnZtYRGCaYmYnFWgP61m08cbbvFr1x6exuXRxMuMag4p
hN2uroHNGHuymAfmo/K6rB1u2c6y23rUmmr0wJ6qCx9XDGRbxnN/ZnJkvbKsfIGQ
pCEq4wHm6xJ9BpqXhebRA+3dEU5pH7Q+gAmJ680hyRile2G3cpZL+J3qXhMqFD9r
jaqrJ2droe+9wYEQiqBzQMaetm2yC9AlgWa3k//a9reVuI3tU3dI82HNcgw6S+mJ
bAEtKLe3ohs+td9Yo963anrr
=oI9e
-----END PGP SIGNATURE-----
Merge tag 'vfio-v6.13-rc3' of https://github.com/awilliam/linux-vfio
Pull vfio fix from Alex Williamson:
- Fix migration dirty page tracking support in the mlx5-vfio-pci
variant driver in configurations where the system page size exceeds
the device maximum message size, and anticipate device updates where
the opposite may also be required (Yishai Hadas)
* tag 'vfio-v6.13-rc3' of https://github.com/awilliam/linux-vfio:
vfio/mlx5: Align the page tracking max message size with the device capability
|
||
|
|
9c7c5430bc |
vfio/mlx5: Align the page tracking max message size with the device capability
Align the page tracking maximum message size with the device's
capability instead of relying on PAGE_SIZE.
This adjustment resolves a mismatch on systems where PAGE_SIZE is 64K,
but the firmware only supports a maximum message size of 4K.
Now that we rely on the device's capability for max_message_size, we
must account for potential future increases in its value.
Key considerations include:
- Supporting message sizes that exceed a single system page (e.g., an 8K
message on a 4K system).
- Ensuring the RQ size is adjusted to accommodate at least 4
WQEs/messages, in line with the device specification.
The above has been addressed as part of the patch.
Fixes:
|
||
|
|
cdd30ebb1b |
module: Convert symbol namespace to string literal
Clean up the existing export namespace code along the same lines of
commit
|
||
|
|
e70140ba0d |
Get rid of 'remove_new' relic from platform driver struct
The continual trickle of small conversion patches is grating on me, and is really not helping. Just get rid of the 'remove_new' member function, which is just an alias for the plain 'remove', and had a comment to that effect: /* * .remove_new() is a relic from a prototype conversion of .remove(). * New drivers are supposed to implement .remove(). Once all drivers are * converted to not use .remove_new any more, it will be dropped. */ This was just a tree-wide 'sed' script that replaced '.remove_new' with '.remove', with some care taken to turn a subsequent tab into two tabs to make things line up. I did do some minimal manual whitespace adjustment for places that used spaces to line things up. Then I just removed the old (sic) .remove_new member function, and this is the end result. No more unnecessary conversion noise. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
4aca98a8a1 |
VFIO updates for v6.13
- Constify an unmodified structure used in linking vfio and kvm.
(Christophe JAILLET)
- Add ID for an additional hardware SKU supported by the nvgrace-gpu
vfio-pci variant driver. (Ankit Agrawal)
- Fix incorrect signed cast in QAT vfio-pci variant driver, negating
test in check_add_overflow(), though still caught by later tests.
(Giovanni Cabiddu)
- Additional debugfs attributes exposed in hisi_acc vfio-pci variant
driver for migration debugging. (Longfang Liu)
- Migration support is added to the virtio vfio-pci variant driver,
becoming the primary feature of the driver while retaining emulation
of virtio legacy support as a secondary option. (Yishai Hadas)
- Fixes to a few unwind flows in the mlx5 vfio-pci driver discovered
through reviews of the virtio variant driver. (Yishai Hadas)
- Fix an unlikely issue where a PCI device exposed to userspace with
an unknown capability at the base of the extended capability chain
can overflow an array index. (Avihai Horon)
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmdE2SEbHGFsZXgud2ls
bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsiXa8P/ikuJ33L7sHnLJErYzHB
j2IPNY224LQrpXY+Rnfe4HVCcaSGO7Azeh95DYBFl7ZJ9QJxZbFhUt7Fl8jiKEOj
k5ag0e+SP4+5tMp2lZBehTa+xlZQLJ4QXMRxWF2kpfXyX7v6JaNKZhXWJ6lPvbrL
zco911Qr1Y5Kqc/kdgX6HGfNusoScj9d0leHNIrka2FFJnq3qZqGtmRKWe9V9zP3
Ke5idU1vYNNBDbOz51D6hZbxZLGxIkblG15sw7LNE3O1lhWznfG+gkJm7u7curlj
CrwR4XvXkgAtglsi8KOJHW84s4BO87UgAde3RUUXgXFcfkTQDSOGQuYVDVSKgFOs
eJCagrpz0p5jlS6LfrUyHU9FhK1sbDQdb8iJQRUUPVlR9U0kfxFbyv3HX7JmGoWw
csOr8Eh2dXmC4EWan9rscw2lxYdoeSmJW0qLhhcGylO7kUGxXRm8vP+MVenkfINX
9OPtsOsFhU7HDl54UsujBA5x8h03HIWmHz3rx8NllxL1E8cfhXivKUViuV8jCXB3
6rVT5mn2VHnXICiWZFXVmjZgrAK3mBfA+6ugi/nbWVdnn8VMomLuB/Df+62wSPSV
ICApuWFBhSuSVmQcJ6fsCX6a8x+E2bZDPw9xqZP7krPUdP1j5rJofgZ7wkdYToRv
HN0p5NcNwnoW2aM5chN9Ons1
=nTtY
-----END PGP SIGNATURE-----
Merge tag 'vfio-v6.13-rc1' of https://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Constify an unmodified structure used in linking vfio and kvm
(Christophe JAILLET)
- Add ID for an additional hardware SKU supported by the nvgrace-gpu
vfio-pci variant driver (Ankit Agrawal)
- Fix incorrect signed cast in QAT vfio-pci variant driver, negating
test in check_add_overflow(), though still caught by later tests
(Giovanni Cabiddu)
- Additional debugfs attributes exposed in hisi_acc vfio-pci variant
driver for migration debugging (Longfang Liu)
- Migration support is added to the virtio vfio-pci variant driver,
becoming the primary feature of the driver while retaining emulation
of virtio legacy support as a secondary option (Yishai Hadas)
- Fixes to a few unwind flows in the mlx5 vfio-pci driver discovered
through reviews of the virtio variant driver (Yishai Hadas)
- Fix an unlikely issue where a PCI device exposed to userspace with an
unknown capability at the base of the extended capability chain can
overflow an array index (Avihai Horon)
* tag 'vfio-v6.13-rc1' of https://github.com/awilliam/linux-vfio:
vfio/pci: Properly hide first-in-list PCIe extended capability
vfio/mlx5: Fix unwind flows in mlx5vf_pci_save/resume_device_data()
vfio/mlx5: Fix an unwind issue in mlx5vf_add_migration_pages()
vfio/virtio: Enable live migration once VIRTIO_PCI was configured
vfio/virtio: Add PRE_COPY support for live migration
vfio/virtio: Add support for the basic live migration functionality
virtio-pci: Introduce APIs to execute device parts admin commands
virtio: Manage device and driver capabilities via the admin commands
virtio: Extend the admin command to include the result size
virtio_pci: Introduce device parts access commands
Documentation: add debugfs description for hisi migration
hisi_acc_vfio_pci: register debugfs for hisilicon migration driver
hisi_acc_vfio_pci: create subfunction for data reading
hisi_acc_vfio_pci: extract public functions for container_of
vfio/qat: fix overflow check in qat_vf_resume_write()
vfio/nvgrace-gpu: Add a new GH200 SKU to the devid table
kvm/vfio: Constify struct kvm_device_ops
|
||
|
|
fe4bf8d0b6 |
vfio/pci: Properly hide first-in-list PCIe extended capability
There are cases where a PCIe extended capability should be hidden from
the user. For example, an unknown capability (i.e., capability with ID
greater than PCI_EXT_CAP_ID_MAX) or a capability that is intentionally
chosen to be hidden from the user.
Hiding a capability is done by virtualizing and modifying the 'Next
Capability Offset' field of the previous capability so it points to the
capability after the one that should be hidden.
The special case where the first capability in the list should be hidden
is handled differently because there is no previous capability that can
be modified. In this case, the capability ID and version are zeroed
while leaving the next pointer intact. This hides the capability and
leaves an anchor for the rest of the capability list.
However, today, hiding the first capability in the list is not done
properly if the capability is unknown, as struct
vfio_pci_core_device->pci_config_map is set to the capability ID during
initialization but the capability ID is not properly checked later when
used in vfio_config_do_rw(). This leads to the following warning [1] and
to an out-of-bounds access to ecap_perms array.
Fix it by checking cap_id in vfio_config_do_rw(), and if it is greater
than PCI_EXT_CAP_ID_MAX, use an alternative struct perm_bits for direct
read only access instead of the ecap_perms array.
Note that this is safe since the above is the only case where cap_id can
exceed PCI_EXT_CAP_ID_MAX (except for the special capabilities, which
are already checked before).
[1]
WARNING: CPU: 118 PID: 5329 at drivers/vfio/pci/vfio_pci_config.c:1900 vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
CPU: 118 UID: 0 PID: 5329 Comm: simx-qemu-syste Not tainted 6.12.0+ #1
(snip)
Call Trace:
<TASK>
? show_regs+0x69/0x80
? __warn+0x8d/0x140
? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
? report_bug+0x18f/0x1a0
? handle_bug+0x63/0xa0
? exc_invalid_op+0x19/0x70
? asm_exc_invalid_op+0x1b/0x20
? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
? vfio_pci_config_rw+0x244/0x430 [vfio_pci_core]
vfio_pci_rw+0x101/0x1b0 [vfio_pci_core]
vfio_pci_core_read+0x1d/0x30 [vfio_pci_core]
vfio_device_fops_read+0x27/0x40 [vfio]
vfs_read+0xbd/0x340
? vfio_device_fops_unl_ioctl+0xbb/0x740 [vfio]
? __rseq_handle_notify_resume+0xa4/0x4b0
__x64_sys_pread64+0x96/0xc0
x64_sys_call+0x1c3d/0x20d0
do_syscall_64+0x4d/0x120
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fixes:
|
||
|
|
341d041daa |
iommufd 6.13 merge window pull
Several new features and uAPI for iommufd:
- IOMMU_IOAS_MAP_FILE allows passing in a file descriptor as the backing
memory for an iommu mapping. To date VFIO/iommufd have used VMA's and
pin_user_pages(), this now allows using memfds and memfd_pin_folios().
Notably this creates a pure folio path from the memfd to the iommu page
table where memory is never broken down to PAGE_SIZE.
- IOMMU_IOAS_CHANGE_PROCESS moves the pinned page accounting between two
processes. Combined with the above this allows iommufd to support a VMM
re-start using exec() where something like qemu would exec() a new
version of itself and fd pass the memfds/iommufd/etc to the new
process. The memfd allows DMA access to the memory to continue while
the new process is getting setup, and the CHANGE_PROCESS updates all
the accounting.
- Support for fault reporting to userspace on non-PRI HW, such as ARM
stall-mode embedded devices.
- IOMMU_VIOMMU_ALLOC introduces the concept of a HW/driver backed virtual
iommu. This will be used by VMMs to access hardware features that are
contained with in a VM. The first use is to inform the kernel of the
virtual SID to physical SID mapping when issuing SID based invalidation
on ARM. Further uses will tie HW features that are directly accessed by
the VM, such as invalidation queue assignment and others.
- IOMMU_VDEVICE_ALLOC informs the kernel about the mapping of virtual
device to physical device within a VIOMMU. Minimially this is used to
translate VM issued cache invalidation commands from virtual to physical
device IDs.
- Enhancements to IOMMU_HWPT_INVALIDATE and IOMMU_HWPT_ALLOC to work with
the VIOMMU
- ARM SMMuv3 support for nested translation. Using the VIOMMU and VDEVICE
the driver can model this HW's behavior for nested translation. This
includes a shared branch from Will.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZzzKKwAKCRCFwuHvBreF
YaCMAQDOQAgw87eUYKnY7vFodlsTUA2E8uSxDmk6nPWySd0NKwD/flOP85MdEs9O
Ot+RoL4/J3IyNH+eg5kN68odmx4mAw8=
=ec8x
-----END PGP SIGNATURE-----
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"Several new features and uAPI for iommufd:
- IOMMU_IOAS_MAP_FILE allows passing in a file descriptor as the
backing memory for an iommu mapping. To date VFIO/iommufd have used
VMA's and pin_user_pages(), this now allows using memfds and
memfd_pin_folios(). Notably this creates a pure folio path from the
memfd to the iommu page table where memory is never broken down to
PAGE_SIZE.
- IOMMU_IOAS_CHANGE_PROCESS moves the pinned page accounting between
two processes. Combined with the above this allows iommufd to
support a VMM re-start using exec() where something like qemu would
exec() a new version of itself and fd pass the memfds/iommufd/etc
to the new process. The memfd allows DMA access to the memory to
continue while the new process is getting setup, and the
CHANGE_PROCESS updates all the accounting.
- Support for fault reporting to userspace on non-PRI HW, such as ARM
stall-mode embedded devices.
- IOMMU_VIOMMU_ALLOC introduces the concept of a HW/driver backed
virtual iommu. This will be used by VMMs to access hardware
features that are contained with in a VM. The first use is to
inform the kernel of the virtual SID to physical SID mapping when
issuing SID based invalidation on ARM. Further uses will tie HW
features that are directly accessed by the VM, such as invalidation
queue assignment and others.
- IOMMU_VDEVICE_ALLOC informs the kernel about the mapping of virtual
device to physical device within a VIOMMU. Minimially this is used
to translate VM issued cache invalidation commands from virtual to
physical device IDs.
- Enhancements to IOMMU_HWPT_INVALIDATE and IOMMU_HWPT_ALLOC to work
with the VIOMMU
- ARM SMMuv3 support for nested translation. Using the VIOMMU and
VDEVICE the driver can model this HW's behavior for nested
translation. This includes a shared branch from Will"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (51 commits)
iommu/arm-smmu-v3: Import IOMMUFD module namespace
iommufd: IOMMU_IOAS_CHANGE_PROCESS selftest
iommufd: Add IOMMU_IOAS_CHANGE_PROCESS
iommufd: Lock all IOAS objects
iommufd: Export do_update_pinned
iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE using a VIOMMU object
iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED
iommu/arm-smmu-v3: Use S2FWB for NESTED domains
iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC
Documentation: userspace-api: iommufd: Update vDEVICE
iommufd/selftest: Add vIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl
iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command
iommufd/selftest: Add mock_viommu_cache_invalidate
iommufd/viommu: Add iommufd_viommu_find_dev helper
iommu: Add iommu_copy_struct_from_full_user_array helper
iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
iommu/viommu: Add cache_invalidate to iommufd_viommu_ops
iommufd/selftest: Add IOMMU_VDEVICE_ALLOC test coverage
iommufd/viommu: Add IOMMUFD_OBJ_VDEVICE and IOMMU_VDEVICE_ALLOC ioctl
...
|
||
|
|
cb04444c24 |
vfio/mlx5: Fix unwind flows in mlx5vf_pci_save/resume_device_data()
Fix unwind flows in mlx5vf_pci_save_device_data() and
mlx5vf_pci_resume_device_data() to avoid freeing the migf pointer at the
'end' label, as this will be handled by fput(migf->filp) through
mlx5vf_release_file().
To ensure mlx5vf_release_file() functions correctly, move the
initialization of migf fields (such as migf->lock) to occur before any
potential unwind flow, as these fields may be accessed within
mlx5vf_release_file().
Fixes:
|
||
|
|
22e87bf3f7 |
vfio/mlx5: Fix an unwind issue in mlx5vf_add_migration_pages()
Fix an unwind issue in mlx5vf_add_migration_pages().
If a set of pages is allocated but fails to be added to the SG table,
they need to be freed to prevent a memory leak.
Any pages successfully added to the SG table will be freed as part of
mlx5vf_free_data_buffer().
Fixes:
|
||
|
|
40bcdb12c6 |
vfio/virtio: Enable live migration once VIRTIO_PCI was configured
Now that the driver supports live migration, only the legacy IO functionality depends on config VIRTIO_PCI_ADMIN_LEGACY. As part of that we introduce a bool configuration option as a sub menu under the driver's main live migration feature named VIRTIO_VFIO_PCI_ADMIN_LEGACY, to control the legacy IO functionality. This will let users configuring the kernel, know which features from the description might be available in the resulting driver. As of that, move the legacy IO into a separate file to be compiled only once CONFIG_VIRTIO_VFIO_PCI_ADMIN_LEGACY was configured and let the live migration depends only on VIRTIO_PCI. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20241113115200.209269-8-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
6cea64b1db |
vfio/virtio: Add PRE_COPY support for live migration
Add PRE_COPY support for live migration. This functionality may reduce the downtime upon STOP_COPY as of letting the target machine to get some 'initial data' from the source once the machine is still in its RUNNING state and let it prepares itself pre-ahead to get the final STOP_COPY data. As the Virtio specification does not support reading partial or incremental device contexts. This means that during the PRE_COPY state, the vfio-virtio driver reads the full device state. As the device state can be changed and the benefit is highest when the pre copy data closely matches the final data we read it in a rate limiter mode. This means we avoid reading new data from the device for a specified time interval after the last read. With PRE_COPY enabled, we observed a downtime reduction of approximately 70-75% in various scenarios compared to when PRE_COPY was disabled, while keeping the total migration time nearly the same. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20241113115200.209269-7-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
0bbc82e4ec |
vfio/virtio: Add support for the basic live migration functionality
Add support for basic live migration functionality in VFIO over virtio-net devices, aligned with the virtio device specification 1.4. This includes the following VFIO features: VFIO_MIGRATION_STOP_COPY, VFIO_MIGRATION_P2P. The implementation registers with the VFIO subsystem using vfio_pci_core and then incorporates the virtio-specific logic for the migration process. The migration follows the definitions in uapi/vfio.h and leverages the virtio VF-to-PF admin queue command channel for execution device parts related commands. Additional Notes: ----------------- The kernel protocol between the source and target devices contains a header with metadata, including record size, tag, and flags. The record size allows the target to recognize and read a complete image from the source before passing the device part data. This adheres to the virtio device specification, which mandates that partial device parts cannot be supplied. The tag and flags serve as placeholders for future extensions of the kernel protocol between the source and target, ensuring backward and forward compatibility. Both the source and target comply with the virtio device specification by using a device part object with a unique ID as part of the migration process. Since this resource is limited to a maximum of 255, its lifecycle is confined to periods with an active live migration flow. According to the virtio specification, a device has only two modes: RUNNING and STOPPED. As a result, certain VFIO transitions (i.e., RUNNING_P2P->STOP, STOP->RUNNING_P2P) are treated as no-ops. When transitioning to RUNNING_P2P, the device state is set to STOP, and it will remain STOPPED until the transition out of RUNNING_P2P->RUNNING, at which point it returns to RUNNING. During transition to STOP, the virtio device only stops initiating outgoing requests(e.g. DMA, MSIx, etc.) but still must accept incoming operations. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20241113115200.209269-6-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
||
|
|
b398f91779 |
hisi_acc_vfio_pci: register debugfs for hisilicon migration driver
On the debugfs framework of VFIO, if the CONFIG_VFIO_DEBUGFS macro is
enabled, the debug function is registered for the live migration driver
of the HiSilicon accelerator device.
After registering the HiSilicon accelerator device on the debugfs
framework of live migration of vfio, a directory file "hisi_acc"
of debugfs is created, and then three debug function files are
created in this directory:
vfio
|
+---<dev_name1>
| +---migration
| +--state
| +--hisi_acc
| +--dev_data
| +--migf_data
| +--cmd_state
|
+---<dev_name2>
+---migration
+--state
+--hisi_acc
+--dev_data
+--migf_data
+--cmd_state
dev_data file: read device data that needs to be migrated from the
current device in real time
migf_data file: read the migration data of the last live migration
from the current driver.
cmd_state: used to get the cmd channel state for the device.
+----------------+ +--------------+ +---------------+
| migration dev | | src dev | | dst dev |
+-------+--------+ +------+-------+ +-------+-------+
| | |
| +------v-------+ +-------v-------+
| | saving_migf | | resuming_migf |
read | | file | | file |
| +------+-------+ +-------+-------+
| | copy |
| +------------+----------+
| |
+-------v--------+ +-------v--------+
| data buffer | | debug_migf |
+-------+--------+ +-------+--------+
| |
cat | cat |
+-------v--------+ +-------v--------+
| dev_data | | migf_data |
+----------------+ +----------------+
When accessing debugfs, user can obtain the most recent status data
of the device through the "dev_data" file. It can read recent
complete status data of the device. If the current device is being
migrated, it will wait for it to complete.
The data for the last completed migration function will be stored
in debug_migf. Users can read it via "migf_data".
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20241112073322.54550-4-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
||
|
|
1962920689 |
hisi_acc_vfio_pci: create subfunction for data reading
This patch generates the code for the operation of reading data from the device into a sub-function. Then, it can be called during the device status data saving phase of the live migration process and the device status data reading function in debugfs. Thereby reducing the redundant code of the driver. Signed-off-by: Longfang Liu <liulongfang@huawei.com> Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20241112073322.54550-3-liulongfang@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |