linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-12 16:18:45 +02:00

Author	SHA1	Message	Date
Linus Torvalds	d7c8087a9c	Power management updates for 7.1-rc1 - Update qcom-hw DT bindings to include Eliza hardware (Abel Vesa) - Update cpufreq-dt-platdev blocklist (Faruque Ansari) - Minor updates to driver and dt-bindings for Tegra (Thierry Reding, Rosen Penev) - Add MAINTAINERS entry for CPPC driver (Viresh Kumar) - Add support for new features: CPPC performance priority, Dynamic EPP, Raw EPP, and new unit tests for them to amd-pstate (Gautham Shenoy, Mario Limonciello) - Fix sysfs files being present when HW missing and broken/outdated documentation in the amd-pstate driver (Ninad Naik, Gautham Shenoy) - Pass the policy to cpufreq_driver->adjust_perf() to avoid using cpufreq_cpu_get() in the .adjust_perf() callback in amd-pstate which leads to a scheduling-while-atomic bug (K Prateek Nayak) - Clean up dead code in Kconfig for cpufreq (Julian Braha) - Remove max_freq_req update for pre-existing cpufreq policy and add a boost_freq_req QoS request to save the boost constraint instead of overwriting the last scaling_max_freq constraint (Pierre Gondois) - Embed cpufreq QoS freq_req objects in cpufreq policy so they all are allocated in one go along with the policy to simplify lifetime rules and avoid error handling issues (Viresh Kumar) - Use DMI max speed when CPPC is unavailable in the acpi-cpufreq scaling driver (Henry Tseng) - Switch policy_is_shared() in cpufreq to using cpumask_nth() instead of cpumask_weight() because the former is more efficient (Yury Norov) - Use sysfs_emit() in sysfs show functions for cpufreq governor attributes (Thorsten Blum) - Update intel_pstate to stop returning an error when "off" is written to its status sysfs attribute while the driver is already off (Fabio De Francesco) - Include current frequency in the debug message printed by __cpufreq_driver_target() (Pengjie Zhang) - Refine stopped tick handling in the menu cpuidle governor and rearrange stopped tick handling in the teo cpuidle governor (Rafael Wysocki) - Add Panther Lake C-states table to the intel_idle driver (Artem Bityutskiy) - Clean up dead dependencies on CPU_IDLE in Kconfig (Julian Braha) - Simplify cpuidle_register_device() with guard() (Huisong Li) - Use performance level if available to distinguish between rates in OPP debugfs (Manivannan Sadhasivam) - Fix scoped_guard in dev_pm_opp_xlate_required_opp() (Viresh Kumar) - Return -ENODATA if the snapshot image is not loaded (Alberto Garcia) - Remove inclusion of crypto/hash.h from hibernate_64.c on x86 (Eric Biggers) - Clean up and rearrange the intel_rapl power capping driver to make the respective interface drivers (TPMI, MSR, and MMOI) hold their own settings and primitives and consolidate PL4 and PMU support flags into rapl_defaults (Kuppuswamy Sathyanarayanan) - Correct kernel-doc function parameter names in the power capping core code (Randy Dunlap) - Remove unneeded casting for HZ_PER_KHZ in devfreq (Andy Shevchenko) - Use _visible attribute to replace create/remove_sysfs_files() in devfreq (Pengjie Zhang) - Add Tegra114 support to activity monitor device in tegra30-devfreq as a preparation to upcoming EMC controller support (Svyatoslav Ryhel) - Fix mistakes in cpupower man pages, add the boost and epp options to the cpupower-frequency-info man page, and add the perf-bias option to the cpupower-info man page (Roberto Ricci) - Remove unnecessary extern declarations from getopt.h in arguments parsing functions in cpufreq-set, cpuidle-info, cpuidle-set, cpupower-info, and cpupower-set utilities (Kaushlendra Kumar) -----BEGIN PGP SIGNATURE----- iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmnY9TISHHJqd0Byand5 c29ja2kubmV0AAoJEO5fvZ0v1OO1G9gH/j5mEqfPpiwX6fQ/ZwOGdNOOPVA5w9j4 KPHSMwMD5lZkoaZfasp2vt27KY5SOoVVvRZ2DKkFJ3Jai4I3cUPZYypga2nre1ag tgzX4vOjcw2r40Eda6ezWl1h4mca/xJJBX7xH2+hn1JY+Y1in37g50CqMIjKh96z Uugkk6UZytL1XcF55PMhIUgDf6pDtRT5UOW9xOKOkUt8FVWTJ7ei3HaWyV5kDmVq b5eQ42+OH7y6sWNnoKczFd8fStvh6J/avoJurBEvcOQhMcjaIaB48G19+KjDg73E NjrVcgG20P2rltBvV2d0J1TKskZHkaP7XjIeWfkwjGZhee3FL7ssS/g= =fRCO -----END PGP SIGNATURE----- Merge tag 'pm-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "Once again, cpufreq is the most active development area, mostly because of the new feature additions and documentation updates in the amd-pstate driver, but there are also changes in the cpufreq core related to boost support and other assorted updates elsewhere. Next up are power capping changes due to the major cleanup of the Intel RAPL driver. On the cpuidle front, a new C-states table for Intel Panther Lake is added to the intel_idle driver, the stopped tick handling in the menu and teo governors is updated, and there are a couple of cleanups. Apart from the above, support for Tegra114 is added to devfreq and there are assorted cleanups of that code, there are also two updates of the operating performance points (OPP) library, two minor updates related to hibernation, and cpupower utility man pages updates and cleanups. Specifics: - Update qcom-hw DT bindings to include Eliza hardware (Abel Vesa) - Update cpufreq-dt-platdev blocklist (Faruque Ansari) - Minor updates to driver and dt-bindings for Tegra (Thierry Reding, Rosen Penev) - Add MAINTAINERS entry for CPPC driver (Viresh Kumar) - Add support for new features: CPPC performance priority, Dynamic EPP, Raw EPP, and new unit tests for them to amd-pstate (Gautham Shenoy, Mario Limonciello) - Fix sysfs files being present when HW missing and broken/outdated documentation in the amd-pstate driver (Ninad Naik, Gautham Shenoy) - Pass the policy to cpufreq_driver->adjust_perf() to avoid using cpufreq_cpu_get() in the .adjust_perf() callback in amd-pstate which leads to a scheduling-while-atomic bug (K Prateek Nayak) - Clean up dead code in Kconfig for cpufreq (Julian Braha) - Remove max_freq_req update for pre-existing cpufreq policy and add a boost_freq_req QoS request to save the boost constraint instead of overwriting the last scaling_max_freq constraint (Pierre Gondois) - Embed cpufreq QoS freq_req objects in cpufreq policy so they all are allocated in one go along with the policy to simplify lifetime rules and avoid error handling issues (Viresh Kumar) - Use DMI max speed when CPPC is unavailable in the acpi-cpufreq scaling driver (Henry Tseng) - Switch policy_is_shared() in cpufreq to using cpumask_nth() instead of cpumask_weight() because the former is more efficient (Yury Norov) - Use sysfs_emit() in sysfs show functions for cpufreq governor attributes (Thorsten Blum) - Update intel_pstate to stop returning an error when "off" is written to its status sysfs attribute while the driver is already off (Fabio De Francesco) - Include current frequency in the debug message printed by __cpufreq_driver_target() (Pengjie Zhang) - Refine stopped tick handling in the menu cpuidle governor and rearrange stopped tick handling in the teo cpuidle governor (Rafael Wysocki) - Add Panther Lake C-states table to the intel_idle driver (Artem Bityutskiy) - Clean up dead dependencies on CPU_IDLE in Kconfig (Julian Braha) - Simplify cpuidle_register_device() with guard() (Huisong Li) - Use performance level if available to distinguish between rates in OPP debugfs (Manivannan Sadhasivam) - Fix scoped_guard in dev_pm_opp_xlate_required_opp() (Viresh Kumar) - Return -ENODATA if the snapshot image is not loaded (Alberto Garcia) - Remove inclusion of crypto/hash.h from hibernate_64.c on x86 (Eric Biggers) - Clean up and rearrange the intel_rapl power capping driver to make the respective interface drivers (TPMI, MSR, and MMOI) hold their own settings and primitives and consolidate PL4 and PMU support flags into rapl_defaults (Kuppuswamy Sathyanarayanan) - Correct kernel-doc function parameter names in the power capping core code (Randy Dunlap) - Remove unneeded casting for HZ_PER_KHZ in devfreq (Andy Shevchenko) - Use _visible attribute to replace create/remove_sysfs_files() in devfreq (Pengjie Zhang) - Add Tegra114 support to activity monitor device in tegra30-devfreq as a preparation to upcoming EMC controller support (Svyatoslav Ryhel) - Fix mistakes in cpupower man pages, add the boost and epp options to the cpupower-frequency-info man page, and add the perf-bias option to the cpupower-info man page (Roberto Ricci) - Remove unnecessary extern declarations from getopt.h in arguments parsing functions in cpufreq-set, cpuidle-info, cpuidle-set, cpupower-info, and cpupower-set utilities (Kaushlendra Kumar)" * tag 'pm-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (74 commits) cpufreq/amd-pstate: Add POWER_SUPPLY select for dynamic EPP cpupower: remove extern declarations in cmd functions cpuidle: Simplify cpuidle_register_device() with guard() PM / devfreq: tegra30-devfreq: add support for Tegra114 PM / devfreq: use _visible attribute to replace create/remove_sysfs_files() PM / devfreq: Remove unneeded casting for HZ_PER_KHZ MAINTAINERS: amd-pstate: Step down as maintainer, add Prateek as reviewer cpufreq: Pass the policy to cpufreq_driver->adjust_perf() cpufreq/amd-pstate: Pass the policy to amd_pstate_update() cpufreq/amd-pstate-ut: Add a unit test for raw EPP cpufreq/amd-pstate: Add support for raw EPP writes cpufreq/amd-pstate: Add support for platform profile class cpufreq/amd-pstate: add kernel command line to override dynamic epp cpufreq/amd-pstate: Add dynamic energy performance preference Documentation: amd-pstate: fix dead links in the reference section cpufreq/amd-pstate: Cache the max frequency in cpudata Documentation/amd-pstate: Add documentation for amd_pstate_floor_{freq,count} Documentation/amd-pstate: List amd_pstate_prefcore_ranking sysfs file Documentation/amd-pstate: List amd_pstate_hw_prefcore sysfs file amd-pstate-ut: Add a testcase to validate the visibility of driver attributes ...	2026-04-13 19:47:52 -07:00
Linus Torvalds	4793dae01f	Driver core changes for 7.1-rc1 - debugfs: - Fix NULL pointer dereference in debugfs_create_str() - Fix misplaced EXPORT_SYMBOL_GPL for debugfs_create_str() - Fix soundwire debugfs NULL pointer dereference from uninitialized firmware_file - device property: - Make fwnode flags modifications thread safe; widen the field to unsigned long and use set_bit() / clear_bit() based accessors - Document how to check for the property presence - devres: - Separate struct devres_node from its "subclasses" (struct devres, struct devres_group); give struct devres_node its own release and free callbacks for per-type dispatch - Introduce struct devres_action for devres actions, avoiding the ARCH_DMA_MINALIGN alignment overhead of struct devres - Export struct devres_node and its init/add/remove/dbginfo primitives for use by Rust Devres<T> - Fix missing node debug info in devm_krealloc() - Use guard(spinlock_irqsave) where applicable; consolidate unlock paths in devres_release_group() - driver_override: - Convert PCI, WMI, vdpa, s390/cio, s390/ap, and fsl-mc to the generic driver_override infrastructure, replacing per-bus driver_override strings, sysfs attributes, and match logic; fixes a potential UAF from unsynchronized access to driver_override in bus match() callbacks - Simplify __device_set_driver_override() logic - kernfs: - Send IN_DELETE_SELF and IN_IGNORED inotify events on kernfs file and directory removal - Add corresponding selftests for memcg - platform: - Allow attaching software nodes when creating platform devices via a new 'swnode' field in struct platform_device_info - Add kerneldoc for struct platform_device_info - software node: - Move software node initialization from postcore_initcall() to driver_init(), making it available early in the boot process - Move kernel_kobj initialization (ksysfs_init) earlier to support the above - Remove software_node_exit(); dead code in a built-in unit - SoC: - Introduce of_machine_read_compatible() and of_machine_read_model() OF helpers and export soc_attr_read_machine() to replace direct accesses to of_root from SoC drivers; also enables CONFIG_COMPILE_TEST coverage for these drivers - sysfs: - Constify attribute group array pointers to 'const struct attribute_group const ' in sysfs functions, device_add_groups() / device_remove_groups(), and struct class - Rust: - Devres: - Embed struct devres_node directly in Devres<T> instead of going through devm_add_action(), avoiding the extra allocation and the unnecessary ARCH_DMA_MINALIGN alignment - I/O: - Turn IoCapable from a marker trait into a functional trait carrying the raw I/O accessor implementation (io_read / io_write), providing working defaults for the per-type Io methods - Add RelaxedMmio wrapper type, making relaxed accessors usable in code generic over the Io trait - Remove overloaded per-type Io methods and per-backend macros from Mmio and PCI ConfigSpace - I/O (Register): - Add IoLoc trait and generic read/write/update methods to the Io trait, making I/O operations parameterizable by typed locations - Add register! macro for defining hardware register types with typed bitfield accessors backed by Bounded values; supports direct, relative, and array register addressing - Add write_reg() / try_write_reg() and LocatedRegister trait - Update PCI sample driver to demonstrate the register! macro Example: ``` register! { /// UART control register. CTRL(u32) @ 0x18 { /// Receiver enable. 19:19 rx_enable => bool; /// Parity configuration. 14:13 parity ?=> Parity; } /// FIFO watermark and counter register. WATER(u32) @ 0x2c { /// Number of datawords in the receive FIFO. 26:24 rx_count; /// RX interrupt threshold. 17:16 rx_water; } } impl WATER { fn rx_above_watermark(&self) -> bool { self.rx_count() > self.rx_water() } } fn init(bar: &pci::Bar<BAR0_SIZE>) { let water = WATER::zeroed() .with_const_rx_water::<1>(); // > 3 would not compile bar.write_reg(water); let ctrl = CTRL::zeroed() .with_parity(Parity::Even) .with_rx_enable(true); bar.write_reg(ctrl); } fn handle_rx(bar: &pci::Bar<BAR0_SIZE>) { if bar.read(WATER).rx_above_watermark() { // drain the FIFO } } fn set_parity(bar: &pci::Bar<BAR0_SIZE>, parity: Parity) { bar.update(CTRL, \|r\| r.with_parity(parity)); } ``` - IRQ: - Move 'static bounds from where clauses to trait declarations for IRQ handler traits - Misc: - Enable the generic_arg_infer Rust feature - Extend Bounded with shift operations, single-bit bool conversion, and const get() - Misc: - Make deferred_probe_timeout default a Kconfig option - Drop auxiliary_dev_pm_ops; the PM core falls back to driver PM callbacks when no bus type PM ops are set - Add conditional guard support for device_lock() - Add ksysfs.c to the DRIVER CORE MAINTAINERS entry - Fix kernel-doc warnings in base.h - Fix stale reference to memory_block_add_nid() in documentation -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQS2q/xV6QjXAdC7k+1FlHeO1qrKLgUCadl5SwAKCRBFlHeO1qrK LpjDAQCSG3vYznwrngfpmRU5bCB9sdUy/pZiX5px1357+amJkwEA9LgIVQvtHAZW ZXcQ7Jr+mR3mJEdlatbkWHp3w1VHqAQ= =y1DV -----END PGP SIGNATURE----- Merge tag 'driver-core-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core Pull driver core updates from Danilo Krummrich: "debugfs: - Fix NULL pointer dereference in debugfs_create_str() - Fix misplaced EXPORT_SYMBOL_GPL for debugfs_create_str() - Fix soundwire debugfs NULL pointer dereference from uninitialized firmware_file device property: - Make fwnode flags modifications thread safe; widen the field to unsigned long and use set_bit() / clear_bit() based accessors - Document how to check for the property presence devres: - Separate struct devres_node from its "subclasses" (struct devres, struct devres_group); give struct devres_node its own release and free callbacks for per-type dispatch - Introduce struct devres_action for devres actions, avoiding the ARCH_DMA_MINALIGN alignment overhead of struct devres - Export struct devres_node and its init/add/remove/dbginfo primitives for use by Rust Devres<T> - Fix missing node debug info in devm_krealloc() - Use guard(spinlock_irqsave) where applicable; consolidate unlock paths in devres_release_group() driver_override: - Convert PCI, WMI, vdpa, s390/cio, s390/ap, and fsl-mc to the generic driver_override infrastructure, replacing per-bus driver_override strings, sysfs attributes, and match logic; fixes a potential UAF from unsynchronized access to driver_override in bus match() callbacks - Simplify __device_set_driver_override() logic kernfs: - Send IN_DELETE_SELF and IN_IGNORED inotify events on kernfs file and directory removal - Add corresponding selftests for memcg platform: - Allow attaching software nodes when creating platform devices via a new 'swnode' field in struct platform_device_info - Add kerneldoc for struct platform_device_info software node: - Move software node initialization from postcore_initcall() to driver_init(), making it available early in the boot process - Move kernel_kobj initialization (ksysfs_init) earlier to support the above - Remove software_node_exit(); dead code in a built-in unit SoC: - Introduce of_machine_read_compatible() and of_machine_read_model() OF helpers and export soc_attr_read_machine() to replace direct accesses to of_root from SoC drivers; also enables CONFIG_COMPILE_TEST coverage for these drivers sysfs: - Constify attribute group array pointers to 'const struct attribute_group const ' in sysfs functions, device_add_groups() / device_remove_groups(), and struct class Rust: - Devres: - Embed struct devres_node directly in Devres<T> instead of going through devm_add_action(), avoiding the extra allocation and the unnecessary ARCH_DMA_MINALIGN alignment - I/O: - Turn IoCapable from a marker trait into a functional trait carrying the raw I/O accessor implementation (io_read / io_write), providing working defaults for the per-type Io methods - Add RelaxedMmio wrapper type, making relaxed accessors usable in code generic over the Io trait - Remove overloaded per-type Io methods and per-backend macros from Mmio and PCI ConfigSpace - I/O (Register): - Add IoLoc trait and generic read/write/update methods to the Io trait, making I/O operations parameterizable by typed locations - Add register! macro for defining hardware register types with typed bitfield accessors backed by Bounded values; supports direct, relative, and array register addressing - Add write_reg() / try_write_reg() and LocatedRegister trait - Update PCI sample driver to demonstrate the register! macro Example: ``` register! { /// UART control register. CTRL(u32) @ 0x18 { /// Receiver enable. 19:19 rx_enable => bool; /// Parity configuration. 14:13 parity ?=> Parity; } /// FIFO watermark and counter register. WATER(u32) @ 0x2c { /// Number of datawords in the receive FIFO. 26:24 rx_count; /// RX interrupt threshold. 17:16 rx_water; } } impl WATER { fn rx_above_watermark(&self) -> bool { self.rx_count() > self.rx_water() } } fn init(bar: &pci::Bar<BAR0_SIZE>) { let water = WATER::zeroed() .with_const_rx_water::<1>(); // > 3 would not compile bar.write_reg(water); let ctrl = CTRL::zeroed() .with_parity(Parity::Even) .with_rx_enable(true); bar.write_reg(ctrl); } fn handle_rx(bar: &pci::Bar<BAR0_SIZE>) { if bar.read(WATER).rx_above_watermark() { // drain the FIFO } } fn set_parity(bar: &pci::Bar<BAR0_SIZE>, parity: Parity) { bar.update(CTRL, \|r\| r.with_parity(parity)); } ``` - IRQ: - Move 'static bounds from where clauses to trait declarations for IRQ handler traits - Misc: - Enable the generic_arg_infer Rust feature - Extend Bounded with shift operations, single-bit bool conversion, and const get() Misc: - Make deferred_probe_timeout default a Kconfig option - Drop auxiliary_dev_pm_ops; the PM core falls back to driver PM callbacks when no bus type PM ops are set - Add conditional guard support for device_lock() - Add ksysfs.c to the DRIVER CORE MAINTAINERS entry - Fix kernel-doc warnings in base.h - Fix stale reference to memory_block_add_nid() in documentation" * tag 'driver-core-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (67 commits) bus: fsl-mc: use generic driver_override infrastructure s390/ap: use generic driver_override infrastructure s390/cio: use generic driver_override infrastructure vdpa: use generic driver_override infrastructure platform/wmi: use generic driver_override infrastructure PCI: use generic driver_override infrastructure driver core: make software nodes available earlier software node: remove software_node_exit() kernel: ksysfs: initialize kernel_kobj earlier MAINTAINERS: add ksysfs.c to the DRIVER CORE entry drivers/base/memory: fix stale reference to memory_block_add_nid() device property: Document how to check for the property presence soundwire: debugfs: initialize firmware_file to empty string debugfs: fix placement of EXPORT_SYMBOL_GPL for debugfs_create_str() debugfs: check for NULL pointer in debugfs_create_str() driver core: Make deferred_probe_timeout default a Kconfig option driver core: simplify __device_set_driver_override() clearing logic driver core: auxiliary bus: Drop auxiliary_dev_pm_ops device property: Make modifications of fwnode "flags" thread safe rust: devres: embed struct devres_node directly ...	2026-04-13 19:03:11 -07:00
Linus Torvalds	d568788baa	hardening updates for v7.1-rc1 - randomize_kstack: Improve implementation across arches (Ryan Roberts) - lkdtm/fortify: Drop unneeded FORTIFY_STR_OBJECT test - refcount: Remove unused __signed_wrap function annotations -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCad16PwAKCRA2KwveOeQk u7crAP4qz8gXCjes76KsZm/YQS8PtOG5JroAVu5Oa4ohw0RfaQD+K/XLow1plcNF 4Bi8zSuv2ifcLysh9qEAbx5+wcHijgo= =woB3 -----END PGP SIGNATURE----- Merge tag 'hardening-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: - randomize_kstack: Improve implementation across arches (Ryan Roberts) - lkdtm/fortify: Drop unneeded FORTIFY_STR_OBJECT test - refcount: Remove unused __signed_wrap function annotations * tag 'hardening-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: lkdtm/fortify: Drop unneeded FORTIFY_STR_OBJECT test refcount: Remove unused __signed_wrap function annotations randomize_kstack: Unify random source across arches randomize_kstack: Maintain kstack_offset per task	2026-04-13 17:52:29 -07:00
Linus Torvalds	de639344bb	audit/stable-7.1 PR 20260410 -----BEGIN PGP SIGNATURE----- iQJIBAABCgAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmnZegUUHHBhdWxAcGF1 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNydxAApWBVRWp/AY7jtCQGWRYAa+6y+bQ0 RWfu8putXaOyk3NTeWP64e87FKsdByR/yflefYxMH+bXc2mwbuUZYAreEVmLCJ1P QxHKuwCkCNOz90n/Y7nlDSDK1GYdzlFkCgidfr4iNSCD58WMTtNNpZREzaNiR8a1 PZ3bFvJH+S7BRCGA6/S/20rNYeWTga56pSrWt6VpMwVHGJ1R4DsD60pT8z0NqMYI BTBLeZ36HlZdwUp+APldKNNDRKG1ZQVKJRO68qcSkopr4vQzK7yL/SJsCdU8MHj2 LccXTCTHHWJbpdiE7BtzPO9UobVZIdcz2wsnJHWxzHYtXlPolgM7F31111GL4HSv V/mq5o7dR3h6nn+1gkWHjOpd/f3J3xl3FaJsH9FIIhPmCRHb4oZI0WG0ZH3mHZBl o6aaWja3PBl0XNA+q87DQVBYDOyVNB4RjuaKy+d7hm4eronTRaZkg3zutrB6/XxP uFbp+Q3diWNMsYO52DKFThL/sStmnnCMIRJuTxd8QaPhLVakaFSkWZycSUH4HijD 8WMk3e4yo3TeD6rCAognwKclj0vCMHS3TLOMXlY0vMD04gwXJ2S81yfyXGT4F5De KkXj61TFMxPyiZ6yrxk86BmoqHL0DUiCDn1rMKbNdIncHedKZoNuy+O/XNLS6No/ hLRvXSI7MNthJ5E= =1rY2 -----END PGP SIGNATURE----- Merge tag 'audit-pr-20260410' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: - Improved handling of unknown status requests from userspace The current kernel code ignores unknown/unused request bits sent from userspace and returns an error code based on the results of the request(s) it does understand. The patch from Ricardo fixes this so that unknown requests return an -EINVAL to userspace, making compatibility a bit easier moving forward. - A number of small style and formatting cleanups * tag 'audit-pr-20260410' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: handle unknown status requests in audit_receive_msg() audit: fix coding style issues audit: remove redundant initialization of static variables to 0 audit: fix whitespace alignment in include/uapi/linux/audit.h	2026-04-13 14:56:54 -07:00
Linus Torvalds	ef3da345cc	vfs-7.1-rc1.misc Please consider pulling these changes from the signed vfs-7.1-rc1.misc tag. Thanks! Christian -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCadjZCwAKCRCRxhvAZXjc ohhBAQCAmQMlMRAXAgUZFYMTZpeQlcujP5rv+/vT2Tf/xS76YwD/dRDaw1FH294+ qtk/Z1NjleNixzE2sld1K9J32NxeyAc= =+g9q -----END PGP SIGNATURE----- Merge tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - coredump: add tracepoint for coredump events - fs: hide file and bfile caches behind runtime const machinery Fixes: - fix architecture-specific compat_ftruncate64 implementations - dcache: Limit the minimal number of bucket to two - fs/omfs: reject s_sys_blocksize smaller than OMFS_DIR_START - fs/mbcache: cancel shrink work before destroying the cache - dcache: permit dynamic_dname()s up to NAME_MAX Cleanups: - remove or unexport unused fs_context infrastructure - trivial ->setattr cleanups - selftests/filesystems: Assume that TIOCGPTPEER is defined - writeback: fix kernel-doc function name mismatch for wb_put_many() - autofs: replace manual symlink buffer allocation in autofs_dir_symlink - init/initramfs.c: trivial fix: FSM -> Finite-state machine - fs: remove stale and duplicate forward declarations - readdir: Introduce dirent_size() - fs: Replace user_access_{begin/end} by scoped user access - kernel: acct: fix duplicate word in comment - fs: write a better comment in step_into() concerning .mnt assignment - fs: attr: fix comment formatting and spelling issues" * tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits) dcache: permit dynamic_dname()s up to NAME_MAX fs: attr: fix comment formatting and spelling issues fs: hide file and bfile caches behind runtime const machinery fs: write a better comment in step_into() concerning .mnt assignment proc: rename proc_notify_change to proc_setattr proc: rename proc_setattr to proc_nochmod_setattr affs: rename affs_notify_change to affs_setattr adfs: rename adfs_notify_change to adfs_setattr hfs: update comments on hfs_inode_setattr kernel: acct: fix duplicate word in comment fs: Replace user_access_{begin/end} by scoped user access readdir: Introduce dirent_size() coredump: add tracepoint for coredump events fs: remove do_sys_truncate fs: pass on FTRUNCATE_* flags to do_truncate fs: fix archiecture-specific compat_ftruncate64 fs: remove stale and duplicate forward declarations init/initramfs.c: trivial fix: FSM -> Finite-state machine autofs: replace manual symlink buffer allocation in autofs_dir_symlink fs/mbcache: cancel shrink work before destroying the cache ...	2026-04-13 14:20:11 -07:00
Linus Torvalds	07c3ef5822	vfs-7.1-rc1.pidfs Please consider pulling these changes from the signed vfs-7.1-rc1.pidfs tag. Thanks! Christian -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCadjZCwAKCRCRxhvAZXjc omfuAQDckt5g7vxBr9hKdyrq1//nsu44fst/mRqr2iSYjuKfPQD/VN6Lw9e56Y/q l4hHxsPPrSSxbijwng7im36iPIGdfwI= =BbFh -----END PGP SIGNATURE----- Merge tag 'vfs-7.1-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull clone and pidfs updates from Christian Brauner: "Add three new clone3() flags for pidfd-based process lifecycle management. CLONE_AUTOREAP: CLONE_AUTOREAP makes a child process auto-reap on exit without ever becoming a zombie. This is a per-process property in contrast to the existing auto-reap mechanism via SA_NOCLDWAIT or SIG_IGN for SIGCHLD which applies to all children of a given parent. Currently the only way to automatically reap children is to set SA_NOCLDWAIT or SIG_IGN on SIGCHLD. This is a parent-scoped property affecting all children which makes it unsuitable for libraries or applications that need selective auto-reaping of specific children while still being able to wait() on others. CLONE_AUTOREAP stores an autoreap flag in the child's signal_struct. When the child exits do_notify_parent() checks this flag and causes exit_notify() to transition the task directly to EXIT_DEAD. Since the flag lives on the child it survives reparenting: if the original parent exits and the child is reparented to a subreaper or init the child still auto-reaps when it eventually exits. This is cleaner than forcing the subreaper to get SIGCHLD and then reaping it. If the parent doesn't care the subreaper won't care. If there's a subreaper that would care it would be easy enough to add a prctl() that either just turns back on SIGCHLD and turns off auto-reaping or a prctl() that just notifies the subreaper whenever a child is reparented to it. CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent to monitor the child's exit via poll() and retrieve exit status via PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget pattern. No exit signal is delivered so exit_signal must be zero. CLONE_THREAD and CLONE_PARENT are rejected: CLONE_THREAD because autoreap is a process-level property, and CLONE_PARENT because an autoreap child reparented via CLONE_PARENT could become an invisible zombie under a parent that never calls wait(). The flag is not inherited by the autoreap process's own children. Each child that should be autoreaped must be explicitly created with CLONE_AUTOREAP. CLONE_NNP: CLONE_NNP sets no_new_privs on the child at clone time. Unlike prctl(PR_SET_NO_NEW_PRIVS) which a process sets on itself, CLONE_NNP allows the parent to impose no_new_privs on the child at creation without affecting the parent's own privileges. CLONE_THREAD is rejected because threads share credentials. CLONE_NNP is useful on its own for any spawn-and-sandbox pattern but was specifically introduced to enable unprivileged usage of CLONE_PIDFD_AUTOKILL. CLONE_PIDFD_AUTOKILL: This flag ties a child's lifetime to the pidfd returned from clone3(). When the last reference to the struct file created by clone3() is closed the kernel sends SIGKILL to the child. A pidfd obtained via pidfd_open() for the same process does not keep the child alive and does not trigger autokill - only the specific struct file from clone3() has this property. This is useful for container runtimes, service managers, and sandboxed subprocess execution - any scenario where the child must die if the parent crashes or abandons the pidfd or just wants a throwaway helper process. CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD and CLONE_AUTOREAP. It requires CLONE_PIDFD because the whole point is tying the child's lifetime to the pidfd. It requires CLONE_AUTOREAP because a killed child with no one to reap it would become a zombie - the primary use case is the parent crashing or abandoning the pidfd so no one is around to call waitpid(). CLONE_THREAD is rejected because autokill targets a process not a thread. If CLONE_NNP is specified together with CLONE_PIDFD_AUTOKILL an unprivileged user may spawn a process that is autokilled. The child cannot escalate privileges via setuid/setgid exec after being spawned. If CLONE_PIDFD_AUTOKILL is specified without CLONE_NNP the caller must have have CAP_SYS_ADMIN in its user namespace" * tag 'vfs-7.1-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests: check pidfd_info->coredump_code correctness pidfds: add coredump_code field to pidfd_info kselftest/coredump: reintroduce null pointer dereference selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests selftests/pidfd: add CLONE_NNP tests selftests/pidfd: add CLONE_AUTOREAP tests pidfd: add CLONE_PIDFD_AUTOKILL clone: add CLONE_NNP clone: add CLONE_AUTOREAP	2026-04-13 13:27:11 -07:00
Linus Torvalds	dc0dfa7338	namespaces-7.1-rc1.misc Please consider pulling these changes from the signed namespaces-7.1-rc1.misc tag. Thanks! Christian -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCadjZCwAKCRCRxhvAZXjc ols1AP9rA4gjJOTwHg0/pc+GL4qLSqUP3O4KeuJ8qccBcEUITAD/frpUjR11Ibw/ F78/x1QhDPI8PCcw7kEyAPTfDb9VsgU= =5HBm -----END PGP SIGNATURE----- Merge tag 'namespaces-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull namespace update from Christian Brauner: "Add two simple helper macros for the namespace infrastructure" * tag 'namespaces-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nsproxy: Add FOR_EACH_NS_TYPE() X-macro and CLONE_NS_ALL	2026-04-13 13:02:49 -07:00
Linus Torvalds	b7d74ea0fd	vfs-7.1-rc1.kino Please consider pulling these changes from the signed vfs-7.1-rc1.kino tag. Thanks! Christian -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCadjZCgAKCRCRxhvAZXjc otmnAP4sbsxZQdz2TG2hJuOwnEZOkkxZQOUMc3ERVyZaWXIeTAEA7e5M+8FpoG9n 8ipO76UoaXdGLESrqVdp9EOhLqOW7QY= =uMeJ -----END PGP SIGNATURE----- Merge tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs i_ino updates from Christian Brauner: "For historical reasons, the inode->i_ino field is an unsigned long, which means that it's 32 bits on 32 bit architectures. This has caused a number of filesystems to implement hacks to hash a 64-bit identifier into a 32-bit field, and deprives us of a universal identifier field for an inode. This changes the inode->i_ino field from an unsigned long to a u64. This shouldn't make any material difference on 64-bit hosts, but 32-bit hosts will see struct inode grow by at least 4 bytes. This could have effects on slabcache sizes and field alignment. The bulk of the changes are to format strings and tracepoints, since the kernel itself doesn't care that much about the i_ino field. The first patch changes some vfs function arguments, so check that one out carefully. With this change, we may be able to shrink some inode structures. For instance, struct nfs_inode has a fileid field that holds the 64-bit inode number. With this set of changes, that field could be eliminated. I'd rather leave that sort of cleanups for later just to keep this simple" * tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nilfs2: fix 64-bit division operations in nilfs_bmap_find_target_in_group() EVM: add comment describing why ino field is still unsigned long vfs: remove externs from fs.h on functions modified by i_ino widening treewide: fix missed i_ino format specifier conversions ext4: fix signed format specifier in ext4_load_inode trace event treewide: change inode->i_ino from unsigned long to u64 nilfs2: widen trace event i_ino fields to u64 f2fs: widen trace event i_ino fields to u64 ext4: widen trace event i_ino fields to u64 zonefs: widen trace event i_ino fields to u64 hugetlbfs: widen trace event i_ino fields to u64 ext2: widen trace event i_ino fields to u64 cachefiles: widen trace event i_ino fields to u64 vfs: widen trace event i_ino fields to u64 net: change sock.sk_ino and sock_i_ino() to u64 audit: widen ino fields to u64 vfs: widen inode hash/lookup functions to u64	2026-04-13 12:19:01 -07:00
Linus Torvalds	28483203f7	RCU changes for v7.1 NOCB CPU management: - Consolidate rcu_nocb_cpu_offload() and rcu_nocb_cpu_deoffload() to reduce code duplication. - Extract nocb_bypass_needs_flush() helper to reduce duplication in NOCB bypass path. rcutorture/torture infrastructure: - Add NOCB01 config for RCU_LAZY torture testing. - Add NOCB02 config for NOCB poll mode testing. - Add TRIVIAL-PREEMPT config for textbook-style preemptible RCU torture. - Test call_srcu() with preemption both disabled and enabled. - Remove kvm-check-branches.sh in favor of kvm-series.sh. - Make hangs more visible in torture.sh output. - Add informative message for tests without a recheck file. - Fix numeric test comparison in srcu_lockdep.sh. - Use torture_shutdown_init() in refscale and rcuscale instead of open-coded shutdown functions. - Fix modulo-zero error in torture_hrtimeout_ns(). SRCU: - Fix SRCU read flavor macro comments. - Fix s/they disables/they disable/ typo in srcu_read_unlock_fast(). RCU Tasks: - Document that RCU Tasks Trace grace periods now imply RCU grace periods. - Remove unnecessary smp_store_release() in cblist_init_generic(). RCU stall: - Add BOOTPARAM_RCU_STALL_PANIC Kconfig option to allow triggering a kernel panic on RCU stall via kernel boot parameter. -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEcoCIrlGe4gjE06JJqA4nf2o45hAFAmnMBEAWHGpvZWxhZ25l bGZAbnZpZGlhLmNvbQAKCRCoDid/ajjmEBCdD/4rUNIocKBmSJWHFeDzl8rvMGSx RISjeaj8m+HUsQRSq7hecygUkSbGEQuj8KwFDJOsmuzhMRg3hR7yEz1AEphQiiIR TN0IdGblQd57KqAk6sHsXA5mNjSWZfJQuOkvHsMxIJdhhDRDHPTGtaCBwlYoLc6b de31LCvCfRL1QgiNAznTlPLRyehI9/+6VASLvKbmi3sQQv9B/KBBWydkryvMJbfW uc05SUn2qQq2h/NNeYvo6RllXWMPqO4/0ll8Y71ltwFhg8JdS9qXDSFX5cWlLISA uBA4pC9bk0xi7RGE5v4n9+fNNTr64Tjtd9QoxFvd/Jb6jdKDGjabwojoAYXCsdNw r8Dp3fKwGofruhUqgO6LyHyG54Ro2paVYjsqr5HW9C6jalZ9+HmH5wOnKw2h5E/i VRAM6MpwQ869ZeHhDxF8meRKf3+E+UIR95qfNZkABcFnwxgXheT+RASP/UOnoTM7 Zexuzp9c6GQu34MTt0Rz9vNfJta+ZmPlnccan1T3DgoHvcWip7IDhUD1Vqerd9/4 VK4JeeKI0t/fR5ydjs5qiA/O5caIFqtQTD0Ag2vLGQiP72pe2lRroIjD3xNZ2bWS DN4k6ZCxR2hdPUcWzdYlGe2pYmvAlqCBxZ4fNgrs55uP2EuqW+Y8aJd9RwvRPb2B wmTZepOh7Ct9+cjTBA== =wjsx -----END PGP SIGNATURE----- Merge tag 'rcu.2026.03.31a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux Pull RCU updates from Joel Fernandes: "NOCB CPU management: - Consolidate rcu_nocb_cpu_offload() and rcu_nocb_cpu_deoffload() to reduce code duplication - Extract nocb_bypass_needs_flush() helper to reduce duplication in NOCB bypass path rcutorture/torture infrastructure: - Add NOCB01 config for RCU_LAZY torture testing - Add NOCB02 config for NOCB poll mode testing - Add TRIVIAL-PREEMPT config for textbook-style preemptible RCU torture - Test call_srcu() with preemption both disabled and enabled - Remove kvm-check-branches.sh in favor of kvm-series.sh - Make hangs more visible in torture.sh output - Add informative message for tests without a recheck file - Fix numeric test comparison in srcu_lockdep.sh - Use torture_shutdown_init() in refscale and rcuscale instead of open-coded shutdown functions - Fix modulo-zero error in torture_hrtimeout_ns(). SRCU: - Fix SRCU read flavor macro comments - Fix s/they disables/they disable/ typo in srcu_read_unlock_fast() RCU Tasks: - Document that RCU Tasks Trace grace periods now imply RCU grace periods - Remove unnecessary smp_store_release() in cblist_init_generic()" * tag 'rcu.2026.03.31a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: rcutorture: Test call_srcu() with preemption disabled and not rcu: Add BOOTPARAM_RCU_STALL_PANIC Kconfig option torture: Avoid modulo-zero error in torture_hrtimeout_ns() rcu/nocb: Extract nocb_bypass_needs_flush() to reduce duplication rcu/nocb: Consolidate rcu_nocb_cpu_offload/deoffload functions rcu-tasks: Remove unnecessary smp_store_release() in cblist_init_generic() rcutorture: Add NOCB02 config for nocb poll mode testing rcutorture: Add NOCB01 config for RCU_LAZY torture testing rcu-tasks: Document that RCU Tasks Trace grace periods now imply RCU grace periods srcu: Fix s/they disables/they disable/ typo in srcu_read_unlock_fast() srcu: Fix SRCU read flavor macro comments rcuscale: Ditch rcu_scale_shutdown in favor of torture_shutdown_init() refscale: Ditch ref_scale_shutdown in favor of torture_shutdown_init() rcutorture: Fix numeric "test" comparison in srcu_lockdep.sh torture: Print informative message for test without recheck file torture: Make hangs more visible in torture.sh output kvm-check-branches.sh: Remove in favor of kvm-series.sh rcutorture: Add a textbook-style trivial preemptible RCU	2026-04-13 09:36:45 -07:00
Breno Leitao	76af546488	workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id() On uniprocessor (UP) configs such as nios2, NR_CPUS is 1, so cpu_shard_id[] is a single-element array (int[1]). In llc_populate_cpu_shard_id(), cpumask_first(sibling_cpus) returns an unsigned int that the compiler cannot prove is always 0, triggering a -Warray-bounds warning when the result is used to index cpu_shard_id[]: kernel/workqueue.c:8321:55: warning: array subscript 1 is above array bounds of 'int[1]' [-Warray-bounds] 8321 \| cpu_shard_id[c] = cpu_shard_id[cpumask_first(sibling_cpus)]; \| ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This is a false positive: sibling_cpus can never be empty here because 'c' itself is always set in it, so cpumask_first() will always return a valid CPU. However, the compiler cannot prove this statically, and the warning only manifests on UP configs where the array size is 1. Add a bounds check with WARN_ON_ONCE to silence the warning, and store the result in a local variable to make the code clearer and avoid calling cpumask_first() twice. Fixes: `5920d046f7` ("workqueue: add WQ_AFFN_CACHE_SHARD affinity scope") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202604022343.GQtkF2vO-lkp@intel.com/ Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-13 06:15:26 -10:00
Paolo Bonzini	e74c3a8891	KVM/arm64 updates for 7.1 * New features: - Add support for tracing in the standalone EL2 hypervisor code, which should help both debugging and performance analysis. This comes with a full infrastructure for 'remote' trace buffers that can be exposed by non-kernel entities such as firmware. - Add support for GICv5 Per Processor Interrupts (PPIs), as the starting point for supporting the new GIC architecture in KVM. - Finally add support for pKVM protected guests, with anonymous memory being used as a backing store. About time! * Improvements and bug fixes: - Rework the dreaded user_mem_abort() function to make it more maintainable, reducing the amount of state being exposed to the various helpers and rendering a substantial amount of state immutable. - Expand the Stage-2 page table dumper to support NV shadow page tables on a per-VM basis. - Tidy up the pKVM PSCI proxy code to be slightly less hard to follow. - Fix both SPE and TRBE in non-VHE configurations so that they do not generate spurious, out of context table walks that ultimately lead to very bad HW lockups. - A small set of patches fixing the Stage-2 MMU freeing in error cases. - Tighten-up accepted SMC immediate value to be only #0 for host SMCCC calls. - The usual cleanups and other selftest churn. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmnWdswACgkQI9DQutE9 ekNYvBAAxj5Zmsx8sJ2CYDTJc2w4XkEjSgDugA+J/s0TMgrzExeBlWCstdhVTncy 68nwOjQl3TotnIrt7q36kko9u7IdD0pHNrk34NtlggLjHfB61n9SNcAA6j4F6zJa GFkHpJSrSnZuUPqapkDnlyhuPkgTIAkEUk2Am9siksSfY4HvRyHZJm2FTdxsdIBn NN9wvQqw2wefTXOQ8gS+oHbPVp1cPbwrF2a3EhzXXv/6W3mUBstXgsijgo07UzCp W6vHCv2wqHbHdf67z3Q3hL+VXlVH6oHlyW99/swqISvqRkH/iSB90+oUojnMRrSm yB6Wmhh8jboCaajWMJhG+veZw+7GMXU4nOrGd1rbnY8cwRl/TQ5YibhRm7DIdvjO xeUluTLJ0NdweQUwE2k4OlgKOuGang3E2p0clmkUO4SstA48MdqR/kpST6guIlWw U5syuNaaaiuwP5QOi9qZmMCNmQ3ZfnZG3nseJFdoyGjhVhf5jyQyv4Du9vGZQFF/ Zkg7yTqC4OWiC+3GkW9YYAySM1MyetivLtd47PGzHPTdtaZziWhNvQ0y+8QjQ+R+ CJNvyS/DvsT7epSya4sLgMP1ZAlih9xkz5sQ6k8NJLBYYXi0v33qwqditErgLLyj S4Ci4WNhHHWIusvCVM7JUBkH0AElpmi506f7F6iHoFLlkYR4t9U= =/SuQ -----END PGP SIGNATURE----- Merge tag 'kvmarm-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 7.1 * New features: - Add support for tracing in the standalone EL2 hypervisor code, which should help both debugging and performance analysis. This comes with a full infrastructure for 'remote' trace buffers that can be exposed by non-kernel entities such as firmware. - Add support for GICv5 Per Processor Interrupts (PPIs), as the starting point for supporting the new GIC architecture in KVM. - Finally add support for pKVM protected guests, with anonymous memory being used as a backing store. About time! * Improvements and bug fixes: - Rework the dreaded user_mem_abort() function to make it more maintainable, reducing the amount of state being exposed to the various helpers and rendering a substantial amount of state immutable. - Expand the Stage-2 page table dumper to support NV shadow page tables on a per-VM basis. - Tidy up the pKVM PSCI proxy code to be slightly less hard to follow. - Fix both SPE and TRBE in non-VHE configurations so that they do not generate spurious, out of context table walks that ultimately lead to very bad HW lockups. - A small set of patches fixing the Stage-2 MMU freeing in error cases. - Tighten-up accepted SMC immediate value to be only #0 for host SMCCC calls. - The usual cleanups and other selftest churn.	2026-04-13 11:49:54 +02:00
Alexei Starovoitov	fa2942918a	Merge patch series "bpf: Fix OOB in pcpu_init_value and add a test" xulang <xulang@uniontech.com> says: ==================== Fix OOB read when copying element from a BPF_MAP_TYPE_CGROUP_STORAGE map to another pcpu map with the same value_size that is not rounded up to 8 bytes, and add a test case to reproduce the issue. The root cause is that pcpu_init_value() uses copy_map_value_long() which rounds up the copy size to 8 bytes, but CGROUP_STORAGE map values are not 8-byte aligned (e.g., 4-byte). This causes a 4-byte OOB read when the copy is performed. ==================== Link: https://lore.kernel.org/r/7653EEEC2BAB17DF+20260402073948.2185396-1-xulang@uniontech.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-12 13:36:55 -07:00
Lang Xu	576afddfee	bpf: Fix OOB in pcpu_init_value An out-of-bounds read occurs when copying element from a BPF_MAP_TYPE_CGROUP_STORAGE map to another pcpu map with the same value_size that is not rounded up to 8 bytes. The issue happens when: 1. A CGROUP_STORAGE map is created with value_size not aligned to 8 bytes (e.g., 4 bytes) 2. A pcpu map is created with the same value_size (e.g., 4 bytes) 3. Update element in 2 with data in 1 pcpu_init_value assumes that all sources are rounded up to 8 bytes, and invokes copy_map_value_long to make a data copy, However, the assumption doesn't stand since there are some cases where the source may not be rounded up to 8 bytes, e.g., CGROUP_STORAGE, skb->data. the verifier verifies exactly the size that the source claims, not the size rounded up to 8 bytes by kernel, an OOB happens when the source has only 4 bytes while the copy size(4) is rounded up to 8. Fixes: `d3bec0138b` ("bpf: Zero-fill re-used per-cpu map element") Reported-by: Kaiyan Mei <kaiyanm@hust.edu.cn> Closes: https://lore.kernel.org/all/14e6c70c.6c121.19c0399d948.Coremail.kaiyanm@hust.edu.cn/ Link: https://lore.kernel.org/r/420FEEDDC768A4BE+20260402074236.2187154-1-xulang@uniontech.com Signed-off-by: Lang Xu <xulang@uniontech.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-12 13:35:59 -07:00
Emil Tsalapatis	ac61bffe91	bpf: Allow instructions with arena source and non-arena dest registers The compiler sometimes stores the result of a PTR_TO_ARENA and SCALAR operation into the scalar register rather than the pointer register. Relax the verifier to allow operations between a source arena register and a destination non-arena register, marking the destination's value as a PTR_TO_ARENA. Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Song Liu <song@kernel.org> Fixes: `6082b6c328` ("bpf: Recognize addr_space_cast instruction in the verifier.") Link: https://lore.kernel.org/r/20260412174546.18684-2-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-12 12:47:39 -07:00
Menglong Dong	9fd19e3ed7	bpf: add missing fsession to the verifier log The fsession attach type is missed in the verifier log in check_get_func_ip(), bpf_check_attach_target() and check_attach_btf_id(). Update them to make the verifier log proper. Meanwhile, update the corresponding selftests. Acked-by: Leon Hwang <leon.hwang@linux.dev> Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20260412060346.142007-2-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-12 12:42:38 -07:00
Alexei Starovoitov	99a832a2b5	bpf: Move BTF checking logic into check_btf.c BTF validation logic is independent from the main verifier. Move it into check_btf.c Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-7-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-12 12:37:04 -07:00
Alexei Starovoitov	ed0b9710bd	bpf: Move backtracking logic to backtrack.c Move precision propagation and backtracking logic to backtrack.c to reduce verifier.c size. No functional changes. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-6-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-12 12:36:58 -07:00
Alexei Starovoitov	c82834a5a1	bpf: Move state equivalence logic to states.c verifier.c is huge. Move is_state_visited() to states.c, so that all state equivalence logic is in one file. Mechanical move. No functional changes. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-5-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-12 12:36:52 -07:00
Alexei Starovoitov	f8a8faceab	bpf: Move check_cfg() into cfg.c verifier.c is huge. Move check_cfg(), compute_postorder(), compute_scc() into cfg.c Mechanical move. No functional changes. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-4-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-12 12:36:45 -07:00
Alexei Starovoitov	fc150cddee	bpf: Move compute_insn_live_regs() into liveness.c verifier.c is huge. Move compute_insn_live_regs() into liveness.c. Mechanical move. No functional changes. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-3-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-12 12:36:38 -07:00
Alexei Starovoitov	449f08fa59	bpf: Move fixup/post-processing logic from verifier.c into fixups.c verifier.c is huge. Split fixup/post-processing logic that runs after the verifier accepted the program into fixups.c. Mechanical move. No functional changes. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-2-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-12 12:35:54 -07:00
Linus Torvalds	35bdc192d8	workqueue: Fixes for v7.0-rc7 - Fix incomplete activation of multiple inactive works when unplugging a pool_workqueue, where the pending_pwqs list wasn't being updated for subsequent works. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCadvRzg4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGeLdAP9SZN3b4QRE79fnP/1dzf2RR/XWlFq7x49Wv1nP mmpwjQEAx/B6YabLNp98bdxaoygUkdz3zBDnL7oaWx/G0p9T/QA= =80oo -----END PGP SIGNATURE----- Merge tag 'wq-for-7.0-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: "This is a fix for a stall which triggers on ordered workqueues when there are multiple inactive work items during workqueue property changes through sysfs, which doesn't happen that frequently. While really late, the fix is very low risk as it just repeats an operation which is already being performed: - Fix incomplete activation of multiple inactive works when unplugging a pool_workqueue, where the pending_pwqs list wasn't being updated for subsequent works" * tag 'wq-for-7.0-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Add pool_workqueue to pending_pwqs list when unplugging multiple inactive works	2026-04-12 10:42:40 -07:00
Linus Torvalds	ab3dee2640	Two fixes for the time/timers subsystem: - Invert the inverted fastpath decision in check_tick_dependency(), which prevents NOHZ full to stop the tick. That's a regression introduced in the 7.0 merge window. - Prevent a unpriviledged DoS in the clockevents code, where user space can starve the timer interrupt by arming a timerfd or posix interval timer in a tight loop with an absolute expiry time in the past. The fix turned out to be incomplete and was was amended yesterday to make it work on some 20 years old AMD machines as well. All issues with it have been confirmed to be resolved by various reporters. -----BEGIN PGP SIGNATURE----- iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmnboPYQHHRnbHhAa2Vy bmVsLm9yZwAKCRCmGPVMDXSYocEDD/4+aEut3QsitcZliE5KTpD4QFVo70Ky5MMe XvQgOJyvSJeUBjOW3UVROEZvBWxpb0bYRbXGTs37pCKK4DwjOmodAaY6OI5W/atb 13lnVWSiCKkZoRLs3+G+PCxCD3WuS2HhZc27fiFJmXNyLGfaLBVC2trNnheEBmUs m3DleuJYiuL3tNbQqlKJ2nIXJFp9qpHSkZ241gxinTvFhBz6QhhAkegMbq9DmHTR dps4xc3dtNm1poEcTKFeZesfpqlgJaX5LoDydtOFtDomKbXX2SzD82nvCBritUty ICcd99kIpAgREEn+aer7hokV8X3eAOBRrzPe8tqWl2mc55jcjQJHxD8dE3zIAv5a AkcnIsiItfHLCPB3iP95inUQoeBCvAX7jCWaLFH8EZCz8QfLALSetjn6HSbWeN/6 ANqFsWlKhK7b1+VcdSL8LVdPtEsE2ILEIHR8dF9GqkVXzs1OPEJJHQoqQxUfDhpa BdAahPCybY+7lcvZko+WJYJ2Xj0TPRbvav9Ol1o8c8xLF3itFDTCL+E2VMoLdpUn F8SRDo2FUwysO4NVjHzR7YRTBhlF1pC3MgvPCL2kCekPv2TDUgacHjygnd/QGQ8m cEOZlD/FlCVE+MNSaLYOUFYJ2a7QwO4qCXOhjy6OFMtO3ezJnol9plqCRCJfhkgy towg2pGIaQ== =cnPC -----END PGP SIGNATURE----- Merge tag 'timers-urgent-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "Two fixes for the time/timers subsystem: - Invert the inverted fastpath decision in check_tick_dependency(), which prevents NOHZ full to stop the tick. That's a regression introduced in the 7.0 merge window. - Prevent a unpriviledged DoS in the clockevents code, where user space can starve the timer interrupt by arming a timerfd or posix interval timer in a tight loop with an absolute expiry time in the past. The fix turned out to be incomplete and was was amended yesterday to make it work on some 20 years old AMD machines as well. All issues with it have been confirmed to be resolved by various reporters" * tag 'timers-urgent-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clockevents: Prevent timer interrupt starvation tick/nohz: Fix inverted return value in check_tick_dependency() fast path	2026-04-12 10:01:55 -07:00
Linus Torvalds	02640d8886	Fix DL server related slowdown to deferred fair tasks. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmnbSK8RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1jwlg/8DwVE6c2SeQH51lvWNpzOp+1C/Q+P7Fxv PofW6L4AQHclj0NHSjSN+NNEuQ4y4VeaxNeJHtRrRBkAV7cPB0MOEqzRU6dpgaA+ Tlcm54X6+9xXjWpkLqCZvJBxmK6/d6NUt4mCJbXa/x7omVDoo/8I646WTbdDRZ1p YIJ63x+U06+81S/3DhsUR0X4+r5GlJDB3iTlEI3wZ26XeLsT/MvWs30QEzeQ1mzM zqqCYZ36aq1ULAz/nRYmGDIEsbwkBU0vXWeZDNeqGvMzNSK8P+Vejdd3wXXPs4vS QxbVCMQv256zmK3B9l7av0aJEwszs0aMqW/PD5QVLxTrWwV35akC0QF5GdAwNVbz m52KFMWElfq/Bu/fc/HrYeuEUY4ViA6/5Wrq021YGctidpEr+BFcFiLumQB6LL8/ 3lOBqwHtiM2qAtEvVCPgfVf3P9KmGl1OAw5TzgzzqKjA5TcDqSAe2CI/wvDhc6WK sSXo6pI3IOE2Q03X5b06Lyqd/NMVhBTb6DfwHTXdYbiDloV3m92UABBFfN9t6/qh IIn3U5JJEydZPaSKP1N9IdYScaRgqFgA2tXcMSqD1EJUYBJjh8Sp2cnv49nlfvK8 fYoSniKlmfnSaS4ooHGp/z44BCceY7x+zfGBgdMfgf31EKypqwn71biXxqvquJoA UZTOfCJ/fRc= =b4Dg -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix DL server related slowdown to deferred fair tasks" * tag 'sched-urgent-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Use revised wakeup rule for dl_server	2026-04-12 08:30:20 -07:00
Alexei Starovoitov	2ec74a0536	bpf: Simplify do_check_insn() Move env->insn_idx++ to the caller, so that most of check_*() calls in do_check_insn() tail call into the next helper. Link: https://lore.kernel.org/r/20260411230001.71664-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-11 17:43:29 -07:00
Alexei Starovoitov	ae3f8ca2ba	bpf: Move checks for reserved fields out of the main pass Check reserved fields of each insn once in a prepass instead of repeatedly rechecking them during the main verifier pass. Link: https://lore.kernel.org/r/20260411200932.41797-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-11 15:24:41 -07:00
Linus Torvalds	558b9206d5	Probes fixes for v7.0-rc7 - tracing/probe: reject non-closed empty immediate strings Fixed a buffer index underflow bug that occurred when passing an non-closed empty immediate string to the probe event. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmnaJr8bHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b4PcH/29I02lnjYf4IJ6oiANa K5+RU+Xq0z+Tn079GapHI9vmF8d9M89bI76fqKAVBGHosk+QTdYPLwupbowY72Fq 5uZPpkMXD+cvNJz0qGEq2WOPeB+VGxq8zR/lnZjhZLupJP0DfBp6SpXS0lFx4d0Q x+bw6keG7Guq0ocq2mFNWPo2U8bxlh78UrX+pYoIHyjeO+LVNC7X7Ld3aRDebMck uDnS5oLNMojd4Vi5WYuP1GzGzIUJ0cXDDXnSVqHP5zdt7VV83eZNlfWOKe1YyHEq FoVTrb3dRFFWn2EdXOJB+DjQXWZeZtUsfJTxWm4UgPgV1DmLmBFurfB8Kk0JmX7S 0XU= =vHwk -----END PGP SIGNATURE----- Merge tag 'probes-fixes-v7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing probe fix from Masami Hiramatsu: "Reject non-closed empty immediate strings Fix a buffer index underflow bug that occurred when passing an non-closed empty immediate string to the probe event" * tag 'probes-fixes-v7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/probe: reject non-closed empty immediate strings	2026-04-11 11:33:08 -07:00
Alexei Starovoitov	57205e2dd9	bpf: Delete unused variable 'cnt' is set, but not used. Delete it. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202604111401.eqzyF2kx-lkp@intel.com/ Fixes: `2c167d9177` ("bpf: change logging scheme for live stack analysis") Link: https://lore.kernel.org/r/20260411141447.45932-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-11 10:04:28 -07:00
Thomas Gleixner	ff1c0c5d07	Merge branch 'timers/urgent' into timers/core to resolve the conflict with urgent fixes.	2026-04-11 07:58:33 +02:00
Amery Hung	136deea435	bpf: Remove gfp_flags plumbing from bpf_local_storage_update() Remove the check that rejects sleepable BPF programs from doing BPF_ANY/BPF_EXIST updates on local storage. This restriction was added in commit `b00fa38a9c` ("bpf: Enable non-atomic allocations in local storage") because kzalloc(GFP_KERNEL) could sleep inside local_storage->lock. This is no longer a concern: all local storage allocations now use kmalloc_nolock() which never sleeps. In addition, since kmalloc_nolock() only accepts __GFP_ACCOUNT, __GFP_ZERO and __GFP_NO_OBJ_EXT, the gfp_flags parameter plumbing from bpf__storage_get() to bpf_local_storage_update() becomes dead code. Remove gfp_flags from bpf_selem_alloc(), bpf_local_storage_alloc() and bpf_local_storage_update(). Drop the hidden 5th argument from bpf__storage_get helpers, and remove the verifier patching that injected GFP_KERNEL/GFP_ATOMIC into the fifth argument. Signed-off-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20260411015419.114016-4-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-10 21:22:32 -07:00
Amery Hung	5063e77588	bpf: Use kmalloc_nolock() universally in local storage Switch to kmalloc_nolock() universally in local storage. Socket local storage didn't move to kmalloc_nolock() when BPF memory allocator was replaced by it for performance reasons. Now that kfree_rcu() supports freeing memory allocated by kmalloc_nolock(), we can move the remaining local storages to use kmalloc_nolock() and cleanup the cluttered free paths. Use kfree() instead of kfree_nolock() in bpf_selem_free_trace_rcu() and bpf_local_storage_free_trace_rcu(). Both callbacks run in process context where spinning is allowed, so kfree_nolock() is unnecessary. Benchmark: ./bench -p 1 local-storage-create --storage-type socket \ --batch-size {16,32,64} The benchmark is a microbenchmark stress-testing how fast local storage can be created. There is no measurable throughput change for socket local storage after switching from kzalloc() to kmalloc_nolock(). Socket local storage batch creation speed diff --------------- ---- ------------------ ---- Baseline 16 433.9 ± 0.6 k/s 32 434.3 ± 1.4 k/s 64 434.2 ± 0.7 k/s After 16 439.0 ± 1.9 k/s +1.2% 32 437.3 ± 2.0 k/s +0.7% 64 435.8 ± 2.5k/s +0.4% Also worth noting that the baseline got a 5% throughput boost when sheaf replaces percpu partial slab recently [0]. [0] https://lore.kernel.org/bpf/20260123-sheaves-for-all-v4-0-041323d506f7@suse.cz/ Signed-off-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20260411015419.114016-3-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-10 21:22:32 -07:00
Tejun Heo	49d78adf95	sched_ext: Drop spurious warning on kick during scheduler disable kick_cpus_irq_workfn() warns when scx_kick_syncs is NULL, but this can legitimately happen when a BPF timer or other kick source races with free_kick_syncs() during scheduler disable. Drop the pr_warn_once() and add a comment explaining the race. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>	2026-04-10 16:38:25 -10:00
Daniel Borkmann	2f2ec8e773	bpf: Enforce regsafe base id consistency for BPF_ADD_CONST scalars When regsafe() compares two scalar registers that both carry BPF_ADD_CONST, check_scalar_ids() maps their full compound id (aka base \| BPF_ADD_CONST flag) as one idmap entry. However, it never verifies that the underlying base ids, that is, with the flag stripped are consistent with existing idmap mappings. This allows construction of two verifier states where the old state has R3 = R2 + 10 (both sharing base id A) while the current state has R3 = R4 + 10 (base id C, unrelated to R2). The idmap creates two independent entries: A->B (for R2) and A\|flag->C\|flag (for R3), without catching that A->C conflicts with A->B. State pruning then incorrectly succeeds. Fix this by additionally verifying base ID mapping consistency whenever BPF_ADD_CONST is set: after mapping the compound ids, also invoke check_ids() on the base IDs (flag bits stripped). This ensures that if A was already mapped to B from comparing the source register, any ADD_CONST derivative must also derive from B, not an unrelated C. Fixes: `98d7ca374b` ("bpf: Track delta between "linked" registers.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260410232651.559778-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-10 17:39:09 -07:00
Linus Torvalds	e774d5f1bc	RISC-V updates for v7.0-rc8 Before v7.0 is released, fix a few issues with the CFI patchset, merged earlier in v7.0-rc, that primarily affect interfaces to non-kernel code: - Improve the prctl() interface for per-task indirect branch landing pad control to expand abbreviations and to resemble the speculation control prctl() interface - Expand the "LP" and "SS" abbreviations in the ptrace uapi header file to "branch landing pad" and "shadow stack", to improve readability - Fix a typo in a CFI-related macro name in the ptrace uapi header file - Ensure that the indirect branch tracking state and shadow stack state are unlocked immediately after an exec() on the new task so that libc subsequently can control it - While working in this area, clean up the kernel-internal, cross-architecture prctl() function names by expanding the abbreviations mentioned above -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEElRDoIDdEz9/svf2Kx4+xDQu9KksFAmnYP5YACgkQx4+xDQu9 KkuoPQ//Yye5D+35EqfA12yP96Vrtg0QCKiqMotz3yLo0T7zh5KosAs/QIE5eQi7 vWRnCld5PsFa0ZS2822oPfQo8pKVO1y7M2ecFWSwaOWq865Xs82M/puqEQF3GFCS 219cg1dTVBGvvKSf4MINUBRprfZmZRT9pzhSk79qHEbHKzwCDk7uah51iUdyPJyd KX3hshYMLq3rooTHR2wD/ChTpV+pCrt2rSUVbW8+sTUWDfv2sTLauHmemKw7LpdW C0SulXvcYkGyiqsB5AXW9x2ttJ5hX9diPb73XS6eBCU0CaMl9BVZWNKeqhEMJxKR wmqIadD8pelf7Jh7wGAbNW4hWqTsO3xRpZH38Y/cGLdhs3cqvKjEmT3fOFWUP9bP hWv5027gVXVSOmvxhPiUJs7D5WWAz4Q64JZfdJSmDdEWVXcI0v/hzdukuPw4iiT6 DaqOyClTcwc+j1jawFTICXTF7wXfvZT5sjulrmPk1HX4nZ5padKpfQ77AdKHF9Q6 9pC25QHQk42h/R4ynA4lm15YnCOfYvjP25hU7K64gQnqO6qBrolfrA4kJOmdYv/g 1IXsA2YZafJbcXwyFZjWy50uu5gaCM5JhRRFdUrjmB6j3gv9HfBlWJXQywReUjPo Kq4tnFppxzFVm23COj9j5kyjsFjUhZ8KCft3+n7lrndeOCk5Z3E= =5/Ct -----END PGP SIGNATURE----- Merge tag 'riscv-for-linus-v7.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Paul Walmsley: "Before v7.0 is released, fix a few issues with the CFI patchset, merged earlier in v7.0-rc, that primarily affect interfaces to non-kernel code: - Improve the prctl() interface for per-task indirect branch landing pad control to expand abbreviations and to resemble the speculation control prctl() interface - Expand the "LP" and "SS" abbreviations in the ptrace uapi header file to "branch landing pad" and "shadow stack", to improve readability - Fix a typo in a CFI-related macro name in the ptrace uapi header file - Ensure that the indirect branch tracking state and shadow stack state are unlocked immediately after an exec() on the new task so that libc subsequently can control it - While working in this area, clean up the kernel-internal, cross-architecture prctl() function names by expanding the abbreviations mentioned above" * tag 'riscv-for-linus-v7.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: prctl: cfi: change the branch landing pad prctl()s to be more descriptive riscv: ptrace: cfi: expand "SS" references to "shadow stack" in uapi headers prctl: rename branch landing pad implementation functions to be more explicit riscv: ptrace: expand "LP" references to "branch landing pads" in uapi headers riscv: cfi: clear CFI lock status in start_thread() riscv: ptrace: cfi: fix "PRACE" typo in uapi header	2026-04-10 17:27:08 -07:00
Alexei Starovoitov	2cb27158ad	bpf: poison dead stack slots As a sanity check poison stack slots that stack liveness determined to be dead, so that any read from such slots will cause program rejection. If stack liveness logic is incorrect the poison can cause valid program to be rejected, but it also will prevent unsafe program to be accepted. Allow global subprogs "read" poisoned stack slots. The static stack liveness determined that subprog doesn't read certain stack slots, but sizeof(arg_type) based global subprog validation isn't accurate enough to know which slots will actually be read by the callee, so it needs to check full sizeof(arg_type) at the caller. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-14-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-10 15:13:38 -07:00
Eduard Zingerman	2c167d9177	bpf: change logging scheme for live stack analysis Instead of breadcrumbs like: (d2,cs15) frame 0 insn 18 +live -16 (d2,cs15) frame 0 insn 17 +live -16 Print final accumulated stack use/def data per-func_instance per-instruction. printed func_instance's are ordered by callsite and depth. For example: stack use/def subprog#0 shared_instance_must_write_overwrite (d0,cs0): 0: (b7) r1 = 1 1: (7b) (u64 )(r10 -8) = r1 ; def: fp0-8 2: (7b) (u64 )(r10 -16) = r1 ; def: fp0-16 3: (bf) r1 = r10 4: (07) r1 += -8 5: (bf) r2 = r10 6: (07) r2 += -16 7: (85) call pc+7 ; use: fp0-8 fp0-16 8: (bf) r1 = r10 9: (07) r1 += -16 10: (bf) r2 = r10 11: (07) r2 += -8 12: (85) call pc+2 ; use: fp0-8 fp0-16 13: (b7) r0 = 0 14: (95) exit stack use/def subprog#1 forwarding_rw (d1,cs7): 15: (85) call pc+1 ; use: fp0-8 fp0-16 16: (95) exit stack use/def subprog#1 forwarding_rw (d1,cs12): 15: (85) call pc+1 ; use: fp0-8 fp0-16 16: (95) exit stack use/def subprog#2 write_first_read_second (d2,cs15): 17: (7a) (u64 )(r1 +0) = 42 18: (79) r0 = (u64 )(r2 +0) ; use: fp0-8 fp0-16 19: (95) exit For groups of three or more consecutive stack slots, abbreviate as follows: 25: (85) call bpf_loop#181 ; use: fp2-8..-512 fp1-8..-512 fp0-8..-512 Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-10-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-10 15:13:37 -07:00
Eduard Zingerman	6762e3a0bc	bpf: simplify liveness to use (callsite, depth) keyed func_instances Rework func_instance identification and remove the dynamic liveness API, completing the transition to fully static stack liveness analysis. Replace callchain-based func_instance keys with (callsite, depth) pairs. The full callchain (all ancestor callsites) is no longer part of the hash key; only the immediate callsite and the call depth matter. This does not lose precision in practice and simplifies the data structure significantly: struct callchain is removed entirely, func_instance stores just callsite, depth. Drop must_write_acc propagation. Previously, must_write marks were accumulated across successors and propagated to the caller via propagate_to_outer_instance(). Instead, callee entry liveness (live_before at subprog start) is pulled directly back to the caller's callsite in analyze_subprog() after each callee returns. Since (callsite, depth) instances are shared across different call chains that invoke the same subprog at the same depth, must_write marks from one call may be stale for another. To handle this, analyze_subprog() records into a fresh_instance() when the instance was already visited (must_write_initialized), then merge_instances() combines the results: may_read is unioned, must_write is intersected. This ensures only slots written on ALL paths through all call sites are marked as guaranteed writes. This replaces commit_stack_write_marks() logic. Skip recursive descent into callees that receive no FP-derived arguments (has_fp_args() check). This is needed because global subprogram calls can push depth beyond MAX_CALL_FRAMES (max depth is 64 for global calls but only 8 frames are accommodated for FP passing). It also handles the case where a callback subprog cannot be determined by argument tracking: such callbacks will be processed by analyze_subprog() at depth 0 independently. Update lookup_instance() (used by is_live_before queries) to search for the func_instance with maximal depth at the corresponding callsite, walking depth downward from frameno to 0. This accounts for the fact that instance depth no longer corresponds 1:1 to bpf_verifier_state->curframe, since skipped non-FP calls create gaps. Remove the dynamic public liveness API from verifier.c: - bpf_mark_stack_{read,write}(), bpf_reset/commit_stack_write_marks() - bpf_update_live_stack(), bpf_reset_live_stack_callchain() - All call sites in check_stack_{read,write}_fixed_off(), check_stack_range_initialized(), mark_stack_slot_obj_read(), mark/unmark_stack_slots_{dynptr,iter,irq_flag}() - The per-instruction write mark accumulation in do_check() - The bpf_update_live_stack() call in prepare_func_exit() mark_stack_read() and mark_stack_write() become static functions in liveness.c, called only from the static analysis pass. The func_instance->updated and must_write_dropped flags are removed. Remove spis_single_slot(), spis_one_bit() helpers from bpf_verifier.h as they are no longer used. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Tested-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-9-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-10 15:13:20 -07:00
Eduard Zingerman	fed53dbcdb	bpf: record arg tracking results in bpf_liveness masks After arg tracking reaches a fixed point, perform a single linear scan over the converged at_in[] state and translate each memory access into liveness read/write masks on the func_instance: - Load/store instructions: FP-derived pointer's frame and offset(s) are converted to half-slot masks targeting per_frame_masks->{may_read,must_write} - Helper/kfunc calls: record_call_access() queries bpf_helper_stack_access_bytes() / bpf_kfunc_stack_access_bytes() for each FP-derived argument to determine access size and direction. Unknown access size (S64_MIN) conservatively marks all slots from fp_off to fp+0 as read. - Imprecise pointers (frame == ARG_IMPRECISE): conservatively mark all slots in every frame covered by the pointer's frame bitmask as fully read. - Static subprog calls with unresolved arguments: conservatively mark all frames as fully read. Instead of a call to clean_live_states(), start cleaning the current state continuously as registers and stack become dead since the static analysis provides complete liveness information. This makes clean_live_states() and bpf_verifier_state->cleaned unnecessary. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-8-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-10 15:06:14 -07:00
Eduard Zingerman	bf0c571f7f	bpf: introduce forward arg-tracking dataflow analysis The analysis is a basis for static liveness tracking mechanism introduced by the next two commits. A forward fixed-point analysis that tracks which frame's FP each register value is derived from, and at what byte offset. This is needed because a callee can receive a pointer to its caller's stack frame (e.g. r1 = fp-16 at the call site), then do (u64 )(r1 + 0) inside the callee — a cross-frame stack access that the callee's local liveness must attribute to the caller's stack. Each register holds an arg_track value from a three-level lattice: - Precise {frame=N, off=[o1,o2,...]} — known frame index and up to 4 concrete byte offsets - Offset-imprecise {frame=N, off_cnt=0} — known frame, unknown offset - Fully-imprecise {frame=ARG_IMPRECISE, mask=bitmask} — unknown frame, mask says which frames might be involved At CFG merge points the lattice moves toward imprecision (same frame+offset stays precise, same frame different offsets merges offset sets or becomes offset-imprecise, different frames become fully-imprecise with OR'd bitmask). The analysis also tracks spills/fills to the callee's own stack (at_stack_in/out), so FP derived values spilled and reloaded. This pass is run recursively per call site: when subprog A calls B with specific FP-derived arguments, B is re-analyzed with those entry args. The recursion follows analyze_subprog -> compute_subprog_args -> (for each call insn) -> analyze_subprog. Subprogs that receive no FP-derived args are skipped during recursion and analyzed independently at depth 0. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-7-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-10 15:05:05 -07:00
Eduard Zingerman	8d3219f64d	bpf: prepare liveness internal API for static analysis pass Move the `updated` check and reset from bpf_update_live_stack() into update_instance() itself, so callers outside the main loop can reuse it. Similarly, move write_insn_idx assignment out of reset_stack_write_marks() into its public caller, and thread insn_idx as a parameter to commit_stack_write_marks() instead of reading it from liveness->write_insn_idx. Drop the unused `env` parameter from alloc_frame_masks() and mark_stack_read(). Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-6-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-10 15:05:03 -07:00
Eduard Zingerman	be23266b4a	bpf: 4-byte precise clean_verifier_state Migrate clean_verifier_state() and its liveness queries from 8-byte SPI granularity to 4-byte half-slot granularity. In __clean_func_state(), each SPI is cleaned in two independent halves: - half_spi 2i (lo): slot_type[0..3] - half_spi 2i+1 (hi): slot_type[4..7] Slot types STACK_DYNPTR, STACK_ITER and STACK_IRQ_FLAG are never cleaned, as their slot type markers are required by destroy_if_dynptr_stack_slot(), is_iter_reg_valid_uninit() and is_irq_flag_reg_valid_uninit() for correctness. When only the hi half is dead, spilled_ptr metadata is destroyed and the lo half's STACK_SPILL bytes are downgraded to STACK_MISC or STACK_ZERO. When only the lo half is dead, spilled_ptr is preserved because the hi half may still need it for state comparison. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-5-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-10 15:04:59 -07:00
Eduard Zingerman	7ca5f68cda	bpf: make liveness.c track stack with 4-byte granularity Convert liveness bitmask type from u64 to spis_t, doubling the number of trackable stack slots from 64 to 128 to support 4-byte granularity. Each 8-byte SPI now maps to two consecutive 4-byte sub-slots in the bitmask: spi2 half and spi2+1 half. In verifier.c, check_stack_write_fixed_off() now reports 4-byte aligned writes of 4-byte writes as half-slot marks and 8-byte aligned 8-byte writes as two slots. Similar logic applied in check_stack_read_fixed_off(). Queries (is_live_before) are not yet migrated to half-slot granularity. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-4-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-10 15:04:56 -07:00
Eduard Zingerman	cf3ee1ecf3	bpf: save subprogram name in bpf_subprog_info Subprogram name can be computed from function info and BTF, but it is convenient to have the name readily available for logging purposes. Update comment saying that bpf_subprog_info->start has to be the first field, this is no longer true, relevant sites access .start field by it's name. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-2-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-10 15:01:56 -07:00
Eduard Zingerman	33dfc521c2	bpf: share several utility functions as internal API Namely: - bpf_subprog_is_global - bpf_vlog_alignment Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-1-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-10 15:01:55 -07:00
Thomas Gleixner	d6e152d905	clockevents: Prevent timer interrupt starvation Calvin reported an odd NMI watchdog lockup which claims that the CPU locked up in user space. He provided a reproducer, which sets up a timerfd based timer and then rearms it in a loop with an absolute expiry time of 1ns. As the expiry time is in the past, the timer ends up as the first expiring timer in the per CPU hrtimer base and the clockevent device is programmed with the minimum delta value. If the machine is fast enough, this ends up in a endless loop of programming the delta value to the minimum value defined by the clock event device, before the timer interrupt can fire, which starves the interrupt and consequently triggers the lockup detector because the hrtimer callback of the lockup mechanism is never invoked. As a first step to prevent this, avoid reprogramming the clock event device when: - a forced minimum delta event is pending - the new expiry delta is less then or equal to the minimum delta Thanks to Calvin for providing the reproducer and to Borislav for testing and providing data from his Zen5 machine. The problem is not limited to Zen5, but depending on the underlying clock event device (e.g. TSC deadline timer on Intel) and the CPU speed not necessarily observable. This change serves only as the last resort and further changes will be made to prevent this scenario earlier in the call chain as far as possible. [ tglx: Updated to restore the old behaviour vs. !force and delta <= 0 and fixed up the tick-broadcast handlers as pointed out by Borislav ] Fixes: `d316c57ff6` ("[PATCH] clockevents: add core functionality") Reported-by: Calvin Owens <calvin@wbinvd.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Calvin Owens <calvin@wbinvd.org> Tested-by: Borislav Petkov <bp@alien8.de> Link: https://lore.kernel.org/lkml/acMe-QZUel-bBYUh@mozart.vkv.me/ Link: https://patch.msgid.link/20260407083247.562657657@kernel.org	2026-04-10 22:45:38 +02:00
Sechang Lim	4406942e65	bpf: Fix RCU stall in bpf_fd_array_map_clear() Add a missing cond_resched() in bpf_fd_array_map_clear() loop. For PROG_ARRAY maps with many entries this loop calls prog_array_map_poke_run() per entry which can be expensive, and without yielding this can cause RCU stalls under load: rcu: Stack dump where RCU GP kthread last ran: CPU: 0 UID: 0 PID: 30932 Comm: kworker/0:2 Not tainted 6.14.0-13195-g967e8def1100 #2 PREEMPT(undef) Workqueue: events prog_array_map_clear_deferred RIP: 0010:write_comp_data+0x38/0x90 kernel/kcov.c:246 Call Trace: <TASK> prog_array_map_poke_run+0x77/0x380 kernel/bpf/arraymap.c:1096 __fd_array_map_delete_elem+0x197/0x310 kernel/bpf/arraymap.c:925 bpf_fd_array_map_clear kernel/bpf/arraymap.c:1000 [inline] prog_array_map_clear_deferred+0x119/0x1b0 kernel/bpf/arraymap.c:1141 process_one_work+0x898/0x19d0 kernel/workqueue.c:3238 process_scheduled_works kernel/workqueue.c:3319 [inline] worker_thread+0x770/0x10b0 kernel/workqueue.c:3400 kthread+0x465/0x880 kernel/kthread.c:464 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:153 ret_from_fork_asm+0x19/0x30 arch/x86/entry/entry_64.S:245 </TASK> Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com> Fixes: `da765a2f59` ("bpf: Add poke dependency tracking for prog array maps") Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com> Link: https://lore.kernel.org/r/20260407103823.3942156-1-rhkrqnwk98@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-10 12:10:06 -07:00
Puranjay Mohan	4cbee026db	bpf: return VMA snapshot from task_vma iterator Holding the per-VMA lock across the BPF program body creates a lock ordering problem when helpers acquire locks that depend on mmap_lock: vm_lock -> i_rwsem -> mmap_lock -> vm_lock Snapshot the VMA under the per-VMA lock in _next() via memcpy(), then drop the lock before returning. The BPF program accesses only the snapshot. The verifier only trusts vm_mm and vm_file pointers (see BTF_TYPE_SAFE_TRUSTED_OR_NULL in verifier.c). vm_file is reference- counted with get_file() under the lock and released via fput() on the next iteration or in _destroy(). vm_mm is already correct because lock_vma_under_rcu() verifies vma->vm_mm == mm. All other pointers are left as-is by memcpy() since the verifier treats them as untrusted. Fixes: `4ac4546821` ("bpf: Introduce task_vma open-coded iterator kfuncs") Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260408154539.3832150-4-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-10 12:05:16 -07:00
Puranjay Mohan	bee9ef4a40	bpf: switch task_vma iterator from mmap_lock to per-VMA locks The open-coded task_vma iterator holds mmap_lock for the entire duration of iteration, increasing contention on this highly contended lock. Switch to per-VMA locking. Find the next VMA via an RCU-protected maple tree walk and lock it with lock_vma_under_rcu(). lock_next_vma() is not used because its fallback takes mmap_read_lock(), and the iterator must work in non-sleepable contexts. lock_vma_under_rcu() is a point lookup (mas_walk) that finds the VMA containing a given address but cannot iterate across gaps. An RCU-protected vma_next() walk (mas_find) first locates the next VMA's vm_start to pass to lock_vma_under_rcu(). Between the RCU walk and the lock, the VMA may be removed, shrunk, or write-locked. On failure, advance past it using vm_end from the RCU walk. Because the VMA slab is SLAB_TYPESAFE_BY_RCU, vm_end may be stale; fall back to PAGE_SIZE advancement when it does not make forward progress. Concurrent VMA insertions at addresses already passed by the iterator are not detected. CONFIG_PER_VMA_LOCK is required; return -EOPNOTSUPP without it. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260408154539.3832150-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-10 12:05:16 -07:00
Puranjay Mohan	d8e27d2d22	bpf: fix mm lifecycle in open-coded task_vma iterator The open-coded task_vma iterator reads task->mm locklessly and acquires mmap_read_trylock() but never calls mmget(). If the task exits concurrently, the mm_struct can be freed as it is not SLAB_TYPESAFE_BY_RCU, resulting in a use-after-free. Safely read task->mm with a trylock on alloc_lock and acquire an mm reference. Drop the reference via bpf_iter_mmput_async() in _destroy() and error paths. bpf_iter_mmput_async() is a local wrapper around mmput_async() with a fallback to mmput() on !CONFIG_MMU. Reject irqs-disabled contexts (including NMI) up front. Operations used by _next() and _destroy() (mmap_read_unlock, bpf_iter_mmput_async) take spinlocks with IRQs disabled (pool->lock, pi_lock). Running from NMI or from a tracepoint that fires with those locks held could deadlock. A trylock on alloc_lock is used instead of the blocking task_lock() (get_task_mm) to avoid a deadlock when a softirq BPF program iterates a task that already holds its alloc_lock on the same CPU. Fixes: `4ac4546821` ("bpf: Introduce task_vma open-coded iterator kfuncs") Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260408154539.3832150-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-10 12:05:16 -07:00
Tejun Heo	e719e17d99	sched_ext: Warn on task-based SCX op recursion The kf_tasks[] design assumes task-based SCX ops don't nest - if they did, kf_tasks[0] would get clobbered. The old scx_kf_allow() WARN_ONCE caught invalid nesting via kf_mask, but that machinery is gone now. Add a WARN_ON_ONCE(current->scx.kf_tasks[0]) at the top of each SCX_CALL_OP_TASK*() macro. Checking kf_tasks[0] alone is sufficient: all three variants (SCX_CALL_OP_TASK, SCX_CALL_OP_TASK_RET, SCX_CALL_OP_2TASKS_RET) write to kf_tasks[0], so a non-NULL value at entry to any of the three means re-entry from somewhere in the family. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
Tejun Heo	979a98b6e9	sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok() The "kf_allowed" framing on this helper comes from the old runtime scx_kf_allowed() gate, which has been removed. Rename it to describe what it actually does in the new model. Pure rename, no functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
Cheng-Yang Chou	7cd9a5d7d4	sched_ext: Remove runtime kfunc mask enforcement Now that scx_kfunc_context_filter enforces context-sensitive kfunc restrictions at BPF load time, the per-task runtime enforcement via scx_kf_mask is redundant. Remove it entirely: - Delete enum scx_kf_mask, the kf_mask field on sched_ext_entity, and the scx_kf_allow()/scx_kf_disallow()/scx_kf_allowed() helpers along with the higher_bits()/highest_bit() helpers they used. - Strip the @mask parameter (and the BUILD_BUG_ON checks) from the SCX_CALL_OP[_RET]/SCX_CALL_OP_TASK[_RET]/SCX_CALL_OP_2TASKS_RET macros and update every call site. Reflow call sites that were wrapped only to fit the old 5-arg form and now collapse onto a single line under ~100 cols. - Remove the in-kfunc scx_kf_allowed() runtime checks from scx_dsq_insert_preamble(), scx_dsq_move(), scx_bpf_dispatch_nr_slots(), scx_bpf_dispatch_cancel(), scx_bpf_dsq_move_to_local___v2(), scx_bpf_sub_dispatch(), scx_bpf_reenqueue_local(), and the per-call guard inside select_cpu_from_kfunc(). scx_bpf_task_cgroup() and scx_kf_allowed_on_arg_tasks() were already cleaned up in the "drop redundant rq-locked check" patch. scx_kf_allowed_if_unlocked() was rewritten in the preceding "decouple" patch. No further changes to those helpers here. Co-developed-by: Juntong Deng <juntong.deng@outlook.com> Signed-off-by: Juntong Deng <juntong.deng@outlook.com> Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
Tejun Heo	d1d3c1c6ae	sched_ext: Add verifier-time kfunc context filter Move enforcement of SCX context-sensitive kfunc restrictions from per-task runtime kf_mask checks to BPF verifier-time filtering, using the BPF core's struct_ops context information. A shared .filter callback is attached to each context-sensitive BTF set and consults a per-op allow table (scx_kf_allow_flags[]) indexed by SCX ops member offset. Disallowed calls are now rejected at program load time instead of at runtime. The old model split reachability across two places: each SCX_CALL_OP*() set bits naming its op context, and each kfunc's scx_kf_allowed() check OR'd together the bits it accepted. A kfunc was callable when those two masks overlapped. The new model transposes the result to the caller side - each op's allow flags directly list the kfunc groups it may call. The old bit assignments were: Call-site bits: ops.select_cpu = ENQUEUE \| SELECT_CPU ops.enqueue = ENQUEUE ops.dispatch = DISPATCH ops.cpu_release = CPU_RELEASE Kfunc-group accepted bits: enqueue group = ENQUEUE \| DISPATCH select_cpu group = SELECT_CPU \| ENQUEUE dispatch group = DISPATCH cpu_release group = CPU_RELEASE Intersecting them yields the reachability now expressed directly by scx_kf_allow_flags[]: ops.select_cpu -> SELECT_CPU \| ENQUEUE ops.enqueue -> SELECT_CPU \| ENQUEUE ops.dispatch -> ENQUEUE \| DISPATCH ops.cpu_release -> CPU_RELEASE Unlocked ops carried no kf_mask bits and reached only unlocked kfuncs; that maps directly to UNLOCKED in the new table. Equivalence was checked by walking every (op, kfunc-group) combination across SCX ops, SYSCALL, and non-SCX struct_ops callers against the old scx_kf_allowed() runtime checks. With two intended exceptions (see below), all combinations reach the same verdict; disallowed calls are now caught at load time instead of firing scx_error() at runtime. scx_bpf_dsq_move_set_slice() and scx_bpf_dsq_move_set_vtime() are exceptions: they have no runtime check at all, but the new filter rejects them from ops outside dispatch/unlocked. The affected cases are nonsensical - the values these setters store are only read by scx_bpf_dsq_move{,_vtime}(), which is itself restricted to dispatch/unlocked, so a setter call from anywhere else was already dead code. Runtime scx_kf_mask enforcement is left in place by this patch and removed in a follow-up. Original-patch-by: Juntong Deng <juntong.deng@outlook.com> Original-patch-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
Tejun Heo	2193af26a1	sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup() scx_kf_allowed_on_arg_tasks() runs both an scx_kf_allowed(__SCX_KF_RQ_LOCKED) mask check and a kf_tasks[] check. After the preceding call-site fixes, every SCX_CALL_OP_TASK*() invocation has kf_mask & __SCX_KF_RQ_LOCKED non-zero, so the mask check is redundant whenever the kf_tasks[] check passes. Drop it and simplify the helper to take only @sch and @p. Fold the locking guarantee into the SCX_CALL_OP_TASK() comment block, which scx_bpf_task_cgroup() now points to. No functional change. Extracted from a larger verifier-time kfunc context filter patch originally written by Juntong Deng. Original-patch-by: Juntong Deng <juntong.deng@outlook.com> Cc: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
Tejun Heo	0022b32850	sched_ext: Decouple kfunc unlocked-context check from kf_mask scx_kf_allowed_if_unlocked() uses !current->scx.kf_mask as a proxy for "no SCX-tracked lock held". kf_mask is removed in a follow-up patch, so its two callers - select_cpu_from_kfunc() and scx_dsq_move() - need another basis. Add a new bool scx_rq.in_select_cpu, set across the SCX_CALL_OP_TASK_RET that invokes ops.select_cpu(), to capture the one case where SCX itself holds no lock but try_to_wake_up() holds @p's pi_lock. Together with scx_locked_rq(), it expresses the same accepted-context set. select_cpu_from_kfunc() needs a runtime test because it has to take different locking paths depending on context. Open-code as a three-way branch. The unlocked branch takes raw_spin_lock_irqsave(&p->pi_lock) directly - pi_lock alone is enough for the fields the kfunc reads, and is lighter than task_rq_lock(). scx_dsq_move() doesn't really need a runtime test - its accepted contexts could be enforced at verifier load time. But since the runtime state is already there and using it keeps the upcoming load-time filter simpler, just write it the same way: (scx_locked_rq() \|\| in_select_cpu) && !kf_allowed(DISPATCH). scx_kf_allowed_if_unlocked() is deleted with the conversions. No semantic change. v2: s/No functional change/No semantic change/ - the unlocked path now acquires pi_lock instead of the heavier task_rq_lock() (Andrea Righi). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
Tejun Heo	b470e37c1f	sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking sched_move_task() invokes ops.cgroup_move() inside task_rq_lock(tsk), so @p's rq lock is held. The SCX_CALL_OP_TASK invocation mislabels this: - kf_mask = SCX_KF_UNLOCKED (== 0), claiming no lock is held. - rq = NULL, so update_locked_rq() doesn't run and scx_locked_rq() returns NULL. Switch to SCX_KF_REST and pass task_rq(p), matching ops.set_cpumask() from set_cpus_allowed_scx(). Three effects: - scx_bpf_task_cgroup() becomes callable (was rejected by scx_kf_allowed(__SCX_KF_RQ_LOCKED)). Safe; rq lock is held. - scx_bpf_dsq_move() is now rejected (was allowed via the unlocked branch). Calling it while holding an unrelated task's rq lock is risky; rejection is correct. - scx_bpf_select_cpu_*() previously took the unlocked branch in select_cpu_from_kfunc() and called task_rq_lock(p, &rf), which would deadlock against the already-held pi_lock. Now it takes the locked-rq branch and is rejected with -EPERM via the existing kf_allowed(SCX_KF_SELECT_CPU \| SCX_KF_ENQUEUE) check. Latent deadlock fix. No in-tree scheduler is known to call any of these from ops.cgroup_move(). v2: Add Fixes: tag (Andrea Righi). Fixes: `18853ba782` ("sched_ext: Track currently locked rq") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
Tejun Heo	9fb457074f	sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask The SCX_CALL_OP_TASK call site passes rq=NULL incorrectly, leaving scx_locked_rq() unset. Pass task_rq(p) instead so update_locked_rq() reflects reality. v2: Add Fixes: tag (Andrea Righi). Fixes: `18853ba782` ("sched_ext: Track currently locked rq") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
Tejun Heo	a37e134317	sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked select_cpu_from_kfunc() has an extra scx_kf_allowed_if_unlocked() branch that accepts calls from unlocked contexts and takes task_rq_lock() itself - a "callable from unlocked" property encoded in the kfunc body rather than in set membership. That's fine while the runtime check is the authoritative gate, but the upcoming verifier-time filter uses set membership as the source of truth and needs it to reflect every context the kfunc may be called from. Add the three select_cpu kfuncs to scx_kfunc_ids_unlocked so their full set of callable contexts is captured by set membership. This follows the existing dual-set convention used by scx_bpf_dsq_move{,_vtime} and scx_bpf_dsq_move_set_{slice,vtime}, which are members of both scx_kfunc_ids_dispatch and scx_kfunc_ids_unlocked. While at it, add brief comments on each duplicate BTF_ID_FLAGS block (including the pre-existing dsq_move ones) explaining the dual membership. No runtime behavior change: the runtime check in select_cpu_from_kfunc() remains the authoritative gate until it is removed along with the rest of the scx_kf_mask enforcement in a follow-up. v2: Clarify dispatch-set comment to name scx_bpf_dsq_move*() explicitly so it doesn't appear to cover scx_bpf_sub_dispatch() (Andrea Righi). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
Tejun Heo	9b5501d3c9	sched_ext: Drop TRACING access to select_cpu kfuncs The select_cpu kfuncs - scx_bpf_select_cpu_dfl(), scx_bpf_select_cpu_and() and __scx_bpf_select_cpu_and() - take task_rq_lock() internally. Exposing them via scx_kfunc_set_idle to BPF_PROG_TYPE_TRACING is unsafe: arbitrary tracing contexts (kprobes, tracepoints, fentry, LSM) may run with @p's pi_lock state unknown. Move them out of scx_kfunc_ids_idle into a new scx_kfunc_ids_select_cpu set registered only for STRUCT_OPS and SYSCALL. Extracted from a larger verifier-time kfunc context filter patch originally written by Juntong Deng. Original-patch-by: Juntong Deng <juntong.deng@outlook.com> Cc: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-10 07:54:06 -10:00
Vincent Guittot	78cde54ea5	sched/eevdf: Clear buddies for preempt_short next buddy should not prevent shorter slice preemption. Don't take buddy into account when checking if shorter slice entity can preempt and clear it if the entity with a shorter slice can preempt current. Test on snapdragon rb5: hackbench -T -p -l 16000000 -g 2 1> /dev/null & hackbench runs in cgroup /test-A cyclictest -t 1 -i 2777 -D 63 --policy=fair --mlock -h 20000 -q cyclictest runs in cgroup /test-B tip/sched/core tip/sched/core +this patch cyclictest slice (ms) (default)2.8 8 8 hackbench slice (ms) (default)2.8 20 20 Total Samples \| 22679 22595 22686 Average (us) \| 84 94(-12%) 59( 37%) Median (P50) (us) \| 56 56( 0%) 56( 0%) 90th Percentile (us) \| 64 65(- 2%) 63( 3%) 99th Percentile (us) \| 1047 1273(-22%) 74( 94%) 99.9th Percentile (us) \| 2431 4751(-95%) 663( 86%) Maximum (us) \| 4694 8655(-84%) 3934( 55%) Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260410132321.2897789-1-vincent.guittot@linaro.org	2026-04-10 16:31:50 +02:00
Catalin Marinas	480a9e57cc	Merge branches 'for-next/misc', 'for-next/tlbflush', 'for-next/ttbr-macros-cleanup', 'for-next/kselftest', 'for-next/feat_lsui', 'for-next/mpam', 'for-next/hotplug-batched-tlbi', 'for-next/bbml2-fixes', 'for-next/sysreg', 'for-next/generic-entry' and 'for-next/acpi', remote-tracking branches 'arm64/for-next/perf' and 'arm64/for-next/read-once' into for-next/core * arm64/for-next/perf: : Perf updates perf/arm-cmn: Fix resource_size_t printk specifier in arm_cmn_init_dtc() perf/arm-cmn: Fix incorrect error check for devm_ioremap() perf: add NVIDIA Tegra410 C2C PMU perf: add NVIDIA Tegra410 CPU Memory Latency PMU perf/arm_cspmu: nvidia: Add Tegra410 PCIE-TGT PMU perf/arm_cspmu: nvidia: Add Tegra410 PCIE PMU perf/arm_cspmu: Add arm_cspmu_acpi_dev_get perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU perf/arm_cspmu: nvidia: Rename doc to Tegra241 perf/arm-cmn: Stop claiming entire iomem region arm64: cpufeature: Use pmuv3_implemented() function arm64: cpufeature: Make PMUVer and PerfMon unsigned KVM: arm64: Read PMUVer as unsigned * arm64/for-next/read-once: : Fixes for __READ_ONCE() with CONFIG_LTO=y arm64, compiler-context-analysis: Permit alias analysis through __READ_ONCE() with CONFIG_LTO=y arm64: Optimize __READ_ONCE() with CONFIG_LTO=y * for-next/misc: : Miscellaneous cleanups/fixes arm64: rsi: use linear-map alias for realm config buffer arm64: Kconfig: fix duplicate word in CMDLINE help text arm64: mte: Skip TFSR_EL1 checks and barriers in synchronous tag check mode arm64/hwcap: Generate the KERNEL_HWCAP_ definitions for the hwcaps arm64: kexec: Remove duplicate allocation for trans_pgd arm64: mm: Use generic enum pgtable_level arm64: scs: Remove redundant save/restore of SCS SP on entry to/from EL0 arm64: remove ARCH_INLINE_* * for-next/tlbflush: : Refactor the arm64 TLB invalidation API and implementation arm64: mm: __ptep_set_access_flags must hint correct TTL arm64: mm: Provide level hint for flush_tlb_page() arm64: mm: Wrap flush_tlb_page() around __do_flush_tlb_range() arm64: mm: More flags for __flush_tlb_range() arm64: mm: Refactor __flush_tlb_range() to take flags arm64: mm: Refactor flush_tlb_page() to use __tlbi_level_asid() arm64: mm: Simplify __flush_tlb_range_limit_excess() arm64: mm: Simplify __TLBI_RANGE_NUM() macro arm64: mm: Re-implement the __flush_tlb_range_op macro in C arm64: mm: Inline __TLBI_VADDR_RANGE() into __tlbi_range() arm64: mm: Push __TLBI_VADDR() into __tlbi_level() arm64: mm: Implicitly invalidate user ASID based on TLBI operation arm64: mm: Introduce a C wrapper for by-range TLB invalidation arm64: mm: Re-implement the __tlbi_level macro as a C function * for-next/ttbr-macros-cleanup: : Cleanups of the TTBR1_* macros arm64/mm: Directly use TTBRx_EL1_CnP arm64/mm: Directly use TTBRx_EL1_ASID_MASK arm64/mm: Describe TTBR1_BADDR_4852_OFFSET * for-next/kselftest: : arm64 kselftest updates selftests/arm64: Implement cmpbr_sigill() to hwcap test * for-next/feat_lsui: : Futex support using FEAT_LSUI instructions to avoid toggling PAN arm64: armv8_deprecated: Disable swp emulation when FEAT_LSUI present arm64: Kconfig: Add support for LSUI KVM: arm64: Use CAST instruction for swapping guest descriptor arm64: futex: Support futex with FEAT_LSUI arm64: futex: Refactor futex atomic operation KVM: arm64: kselftest: set_id_regs: Add test for FEAT_LSUI KVM: arm64: Expose FEAT_LSUI to guests arm64: cpufeature: Add FEAT_LSUI * for-next/mpam: (40 commits) : Expose MPAM to user-space via resctrl: : - Add architecture context-switch and hiding of the feature from KVM. : - Add interface to allow MPAM to be exposed to user-space using resctrl. : - Add errata workaoround for some existing platforms. : - Add documentation for using MPAM and what shape of platforms can use resctrl arm64: mpam: Add initial MPAM documentation arm_mpam: Quirk CMN-650's CSU NRDY behaviour arm_mpam: Add workaround for T241-MPAM-6 arm_mpam: Add workaround for T241-MPAM-4 arm_mpam: Add workaround for T241-MPAM-1 arm_mpam: Add quirk framework arm_mpam: resctrl: Call resctrl_init() on platforms that can support resctrl arm64: mpam: Select ARCH_HAS_CPU_RESCTRL arm_mpam: resctrl: Add empty definitions for assorted resctrl functions arm_mpam: resctrl: Update the rmid reallocation limit arm_mpam: resctrl: Add resctrl_arch_rmid_read() arm_mpam: resctrl: Allow resctrl to allocate monitors arm_mpam: resctrl: Add support for csu counters arm_mpam: resctrl: Add monitor initialisation and domain boilerplate arm_mpam: resctrl: Add kunit test for control format conversions arm_mpam: resctrl: Add support for 'MB' resource arm_mpam: resctrl: Wait for cacheinfo to be ready arm_mpam: resctrl: Add rmid index helpers arm_mpam: resctrl: Convert to/from MPAMs fixed-point formats arm_mpam: resctrl: Hide CDP emulation behind CONFIG_EXPERT ... * for-next/hotplug-batched-tlbi: : arm64/mm: Enable batched TLB flush in unmap_hotplug_range() arm64/mm: Reject memory removal that splits a kernel leaf mapping arm64/mm: Enable batched TLB flush in unmap_hotplug_range() * for-next/bbml2-fixes: : Fixes for realm guest and BBML2_NOABORT arm64: mm: Remove pmd_sect() and pud_sect() arm64: mm: Handle invalid large leaf mappings correctly arm64: mm: Fix rodata=full block mapping support for realm guests * for-next/sysreg: : arm64 sysreg updates arm64/sysreg: Update ID_AA64SMFR0_EL1 description to DDI0601 2025-12 arm64/sysreg: Update ID_AA64ZFR0_EL1 description to DDI0601 2025-12 arm64/sysreg: Update ID_AA64FPFR0_EL1 description to DDI0601 2025-12 arm64/sysreg: Update ID_AA64ISAR2_EL1 description to DDI0601 2025-12 arm64/sysreg: Update ID_AA64ISAR0_EL1 description to DDI0601 2025-12 arm64/sysreg: Update SMIDR_EL1 to DDI0601 2025-06 * for-next/generic-entry: : More arm64 refactoring towards using the generic entry code arm64: Check DAIF (and PMR) at task-switch time arm64: entry: Use split preemption logic arm64: entry: Use irqentry_{enter_from,exit_to}_kernel_mode() arm64: entry: Consistently prefix arm64-specific wrappers arm64: entry: Don't preempt with SError or Debug masked entry: Split preemption from irqentry_exit_to_kernel_mode() entry: Split kernel mode logic from irqentry_{enter,exit}() entry: Move irqentry_enter() prototype later entry: Remove local_irq_{enable,disable}_exit_to_user() entry: Fix stale comment for irqentry_enter() * for-next/acpi: : arm64 ACPI updates ACPI: AGDI: fix missing newline in error message	2026-04-10 14:22:24 +01:00
Rafael J. Wysocki	7431d90cfc	Merge branches 'pm-cpuidle', 'pm-opp' and 'pm-sleep' Merge cpuidle updates, OPP (operating performance points) library updates, and updates related to system suspend and hibernation for 7.1-rc1: - Refine stopped tick handling in the menu cpuidle governor and rearrange stopped tick handling in the teo cpuidle governor (Rafael Wysocki) - Add Panther Lake C-states table to the intel_idle driver (Artem Bityutskiy) - Clean up dead dependencies on CPU_IDLE in Kconfig (Julian Braha) - Simplify cpuidle_register_device() with guard() (Huisong Li) - Use performance level if available to distinguish between rates in OPP debugfs (Manivannan Sadhasivam) - Fix scoped_guard in dev_pm_opp_xlate_required_opp() (Viresh Kumar) - Return -ENODATA if the snapshot image is not loaded (Alberto Garcia) - Remove inclusion of crypto/hash.h from hibernate_64.c on x86 (Eric Biggers) * pm-cpuidle: cpuidle: Simplify cpuidle_register_device() with guard() cpuidle: clean up dead dependencies on CPU_IDLE in Kconfig intel_idle: Add Panther Lake C-states table cpuidle: governors: teo: Rearrange stopped tick handling cpuidle: governors: menu: Refine stopped tick handling * pm-opp: OPP: Move break out of scoped_guard in dev_pm_opp_xlate_required_opp() OPP: debugfs: Use performance level if available to distinguish between rates * pm-sleep: PM: hibernate: return -ENODATA if the snapshot image is not loaded PM: hibernate: x86: Remove inclusion of crypto/hash.h	2026-04-10 12:37:27 +02:00
Rafael J. Wysocki	83e990310d	Merge branch 'pm-cpufreq' Merge cpufreq updates for 7.1-rc1: - Update qcom-hw DT bindings to include Eliza hardware (Abel Vesa) - Update cpufreq-dt-platdev blocklist (Faruque Ansari) - Minor updates to driver and dt-bindings for Tegra (Thierry Reding, Rosen Penev) - Add MAINTAINERS entry for CPPC driver (Viresh Kumar) - Add support for new features: CPPC performance priority, Dynamic EPP, Raw EPP, and new unit tests for them to amd-pstate (Gautham Shenoy, Mario Limonciello) - Fix sysfs files being present when HW missing and broken/outdated documentation in the amd-pstate driver (Ninad Naik, Gautham Shenoy) - Pass the policy to cpufreq_driver->adjust_perf() to avoid using cpufreq_cpu_get() in the .adjust_perf() callback in amd-pstate which leads to a scheduling-while-atomic bug (K Prateek Nayak) - Clean up dead code in Kconfig for cpufreq (Julian Braha) - Remove max_freq_req update for pre-existing cpufreq policy and add a boost_freq_req QoS request to save the boost constraint instead of overwriting the last scaling_max_freq constraint (Pierre Gondois) - Embed cpufreq QoS freq_req objects in cpufreq policy so they all are allocated in one go along with the policy to simplify lifetime rules and avoid error handling issues (Viresh Kumar) - Use DMI max speed when CPPC is unavailable in the acpi-cpufreq scaling driver (Henry Tseng) - Switch policy_is_shared() in cpufreq to using cpumask_nth() instead of cpumask_weight() because the former is more efficient (Yury Norov) - Use sysfs_emit() in sysfs show functions for cpufreq governor attributes (Thorsten Blum) - Update intel_pstate to stop returning an error when "off" is written to its status sysfs attribute while the driver is already off (Fabio De Francesco) - Include current frequency in the debug message printed by __cpufreq_driver_target() (Pengjie Zhang) * pm-cpufreq: (38 commits) cpufreq/amd-pstate: Add POWER_SUPPLY select for dynamic EPP MAINTAINERS: amd-pstate: Step down as maintainer, add Prateek as reviewer cpufreq: Pass the policy to cpufreq_driver->adjust_perf() cpufreq/amd-pstate: Pass the policy to amd_pstate_update() cpufreq/amd-pstate-ut: Add a unit test for raw EPP cpufreq/amd-pstate: Add support for raw EPP writes cpufreq/amd-pstate: Add support for platform profile class cpufreq/amd-pstate: add kernel command line to override dynamic epp cpufreq/amd-pstate: Add dynamic energy performance preference Documentation: amd-pstate: fix dead links in the reference section cpufreq/amd-pstate: Cache the max frequency in cpudata Documentation/amd-pstate: Add documentation for amd_pstate_floor_{freq,count} Documentation/amd-pstate: List amd_pstate_prefcore_ranking sysfs file Documentation/amd-pstate: List amd_pstate_hw_prefcore sysfs file amd-pstate-ut: Add a testcase to validate the visibility of driver attributes amd-pstate-ut: Add module parameter to select testcases amd-pstate: Introduce a tracepoint trace_amd_pstate_cppc_req2() amd-pstate: Add sysfs support for floor_freq and floor_count amd-pstate: Add support for CPPC_REQ2 and FLOOR_PERF x86/cpufeatures: Add AMD CPPC Performance Priority feature. ...	2026-04-10 12:05:32 +02:00
cuitao	3348e1e83a	cgroup/rdma: fix swapped arguments in pr_warn() format string The format string says "device %p ... rdma cgroup %p" but the arguments were passed as (cg, device), printing them in the wrong order. Signed-off-by: cuitao <cuitao@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-09 22:30:08 -10:00
Jiayuan Chen	a0c584fc18	bpf: Fix use-after-free in offloaded map/prog info fill When querying info for an offloaded BPF map or program, bpf_map_offload_info_fill_ns() and bpf_prog_offload_info_fill_ns() obtain the network namespace with get_net(dev_net(offmap->netdev)). However, the associated netdev's netns may be racing with teardown during netns destruction. If the netns refcount has already reached 0, get_net() performs a refcount_t increment on 0, triggering: refcount_t: addition on 0; use-after-free. Although rtnl_lock and bpf_devs_lock ensure the netdev pointer remains valid, they cannot prevent the netns refcount from reaching zero. Fix this by using maybe_get_net() instead of get_net(). maybe_get_net() uses refcount_inc_not_zero() and returns NULL if the refcount is already zero, which causes ns_get_path_cb() to fail and the caller to return -ENOENT -- the correct behavior when the netns is being destroyed. Fixes: `675fc275a3` ("bpf: offload: report device information for offloaded programs") Fixes: `52775b33bb` ("bpf: offload: report device information about offloaded maps") Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn> Closes: https://lore.kernel.org/bpf/f0aa3678-79c9-47ae-9e8c-02a3d1df160a@hust.edu.cn/ Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260409023733.168050-1-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-09 13:24:32 -07:00
Daniel Borkmann	9f118095dd	bpf: Drop pkt_end markers on arithmetic to prevent is_pkt_ptr_branch_taken When a pkt pointer acquires AT_PKT_END or BEYOND_PKT_END range from a comparison, and then, known-constant arithmetic is performed, adjust_ptr_min_max_vals() copies the stale range via dst_reg->raw = ptr_reg->raw without clearing the negative reg->range sentinel values. This lets is_pkt_ptr_branch_taken() choose one branch direction and skip going through the other. Fix this by clearing negative pkt range values (that is, AT_PKT_END and BEYOND_PKT_END) after arithmetic on pkt pointers. This ensures is_pkt_ptr_branch_taken() returns unknown and both branches are properly verified. Fixes: `6d94e741a8` ("bpf: Support for pointers beyond pkt_end.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260409155016.536608-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-09 13:11:31 -07:00
Linus Torvalds	3ffcd57823	last dma-mapping fix for Linux 7.0 A fix for DMA-mapping subsystem, which hides annoying, false-positive warnings from DMA-API debug on coherent platforms like x86_64 (Mikhail Gavrilov). -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCadfSDQAKCRCJp1EFxbsS RPtKAQCzfZRx2zC6ACG5opRZqqsUAah5lko9ROJZ1CK2yrfvAwEA7tGeQ8YgSUxT Xu5BHh7Ff/jk0ph001/6j7OSBZtuKAY= =xdCr -----END PGP SIGNATURE----- Merge tag 'dma-mapping-7.0-2026-04-09' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fix from Marek Szyprowski: "A fix for DMA-mapping subsystem, which hides annoying, false-positive warnings from DMA-API debug on coherent platforms like x86_64 (Mikhail Gavrilov)" * tag 'dma-mapping-7.0-2026-04-09' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-debug: suppress cacheline overlap warning when arch has no DMA alignment requirement	2026-04-09 11:02:35 -07:00
Daniel Borkmann	9dba0ae973	bpf: Remove static qualifier from local subprog pointer The local subprog pointer in create_jt() and visit_abnormal_return_insn() was declared static. It is unconditionally assigned via bpf_find_containing_subprog() before every use. Thus, the static qualifier serves no purpose and rather creates confusion. Just remove it. Fixes: `e40f5a6bf8` ("bpf: correct stack liveness for tail calls") Fixes: `493d9e0d60` ("bpf, x86: add support for indirect jumps") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20260408191242.526279-3-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-08 18:43:28 -07:00
Daniel Borkmann	ee861486e3	bpf: Fix ld_{abs,ind} failure path analysis in subprogs Usage of ld_{abs,ind} instructions got extended into subprogs some time ago via commit `09b28d76ea` ("bpf: Add abnormal return checks."). These are only allowed in subprograms when the latter are BTF annotated and have scalar return types. The code generator in bpf_gen_ld_abs() has an abnormal exit path (r0=0 + exit) from legacy cBPF times. While the enforcement is on scalar return types, the verifier must also simulate the path of abnormal exit if the packet data load via ld_{abs,ind} failed. This is currently not the case. Fix it by having the verifier simulate both success and failure paths, and extend it in similar ways as we do for tail calls. The success path (r0=unknown, continue to next insn) is pushed onto stack for later validation and the r0=0 and return to the caller is done on the fall-through side. Fixes: `09b28d76ea` ("bpf: Add abnormal return checks.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260408191242.526279-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-08 18:43:28 -07:00
Daniel Borkmann	6bd96e40f3	bpf: Propagate error from visit_tailcall_insn Commit `e40f5a6bf8` ("bpf: correct stack liveness for tail calls") added visit_tailcall_insn() but did not check its return value. Fixes: `e40f5a6bf8` ("bpf: correct stack liveness for tail calls") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260408191242.526279-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-08 18:43:28 -07:00
Kumar Kartikeya Dwivedi	4f64d5b664	bpf: Make find_linfo widely available Move find_linfo() as bpf_find_linfo() into core.c to allow for its use in the verifier in subsequent patches. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260408021359.3786905-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-08 18:09:56 -07:00
Kumar Kartikeya Dwivedi	fbb98834a9	bpf: Extract bpf_get_linfo_file_line Extract bpf_get_linfo_file_line as its own function so that the logic to obtain the file, line, and line number for a given program can be shared in subsequent patches. Reviewed-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260408021359.3786905-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-08 18:09:56 -07:00
Marc Zyngier	2de32a25a3	Merge branch kvm-arm64/hyp-tracing into kvmarm-master/next * kvm-arm64/hyp-tracing: (40 commits) : . : EL2 tracing support, adding both 'remote' ring-buffer : infrastructure and the tracing itself, courtesy of : Vincent Donnefort. From the cover letter: : : "The growing set of features supported by the hypervisor in protected : mode necessitates debugging and profiling tools. Tracefs is the : ideal candidate for this task: : : * It is simple to use and to script. : : * It is supported by various tools, from the trace-cmd CLI to the : Android web-based perfetto. : : * The ring-buffer, where are stored trace events consists of linked : pages, making it an ideal structure for sharing between kernel and : hypervisor. : : This series first introduces a new generic way of creating remote events and : remote buffers. Then it adds support to the pKVM hypervisor." : . tracing: selftests: Extend hotplug testing for trace remotes tracing: Non-consuming read for trace remotes with an offline CPU tracing: Adjust cmd_check_undefined to show unexpected undefined symbols tracing: Restore accidentally removed SPDX tag KVM: arm64: avoid unused-variable warning tracing: Generate undef symbols allowlist for simple_ring_buffer KVM: arm64: tracing: add ftrace dependency tracing: add more symbols to whitelist tracing: Update undefined symbols allow list for simple_ring_buffer KVM: arm64: Fix out-of-tree build for nVHE/pKVM tracing tracing: selftests: Add hypervisor trace remote tests KVM: arm64: Add selftest event support to nVHE/pKVM hyp KVM: arm64: Add hyp_enter/hyp_exit events to nVHE/pKVM hyp KVM: arm64: Add event support to the nVHE/pKVM hyp and trace remote KVM: arm64: Add trace reset to the nVHE/pKVM hyp KVM: arm64: Sync boot clock with the nVHE/pKVM hyp KVM: arm64: Add trace remote for the nVHE/pKVM hyp KVM: arm64: Add tracing capability for the nVHE/pKVM hyp KVM: arm64: Support unaligned fixmap in the pKVM hyp KVM: arm64: Initialise hyp_nr_cpus for nVHE hyp ... Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-04-08 12:21:51 +01:00
Anshuman Khandual	5a84b60005	perf/events: Replace READ_ONCE() with standard pgtable accessors Replace raw READ_ONCE() dereferences of pgtable entries with corresponding standard page table accessors pxdp_get() in perf_get_pgtable_size(). These accessors default to READ_ONCE() on platforms that don't override them. So there is no functional change on such platforms. However arm64 platform is being extended to support 128 bit page tables via a new architecture feature i.e FEAT_D128 in which case READ_ONCE() will not provide required single copy atomic access for 128 bit page table entries. Although pxdp_get() accessors can later be overridden on arm64 platform to extend required single copy atomicity support on 128 bit entries. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260227062744.2215491-1-anshuman.khandual@arm.com	2026-04-08 13:11:46 +02:00
Michal Koutný	985215804d	sched/rt: Cleanup global RT bandwidth functions The commit `5f6bd380c7` ("sched/rt: Remove default bandwidth control") and followup changes made a few of the functions unnecessary, drop them for simplicity. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260323-sched-rert_groups-v3-3-1e7d5ed6b249@suse.com	2026-04-08 13:11:44 +02:00
Michal Koutný	4f70a0456d	sched/rt: Move group schedulability check to sched_rt_global_validate() The sched_rt_global_constraints() function is a remnant that used to set up global RT throttling but that is no more since commit `5f6bd380c7` ("sched/rt: Remove default bandwidth control") and the function ended up only doing schedulability check. Move the check into the validation function where it fits better. (The order of validations sched_dl_global_validate() and sched_rt_global_validate() shouldn't matter.) Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260323-sched-rert_groups-v3-2-1e7d5ed6b249@suse.com	2026-04-08 13:11:44 +02:00
Michal Koutný	8b016dcec9	sched/rt: Skip group schedulable check with rt_group_sched=0 The warning from the commit `87f1fb77d8` ("sched: Add RT_GROUP WARN checks for non-root task_groups") is wrong -- it assumes that only task_groups with rt_rq are traversed, however, the schedulability check would iterate all task_groups even when rt_group_sched=0 is disabled at boot time but some non-root task_groups exist. The schedulability check is supposed to validate: a) that children don't overcommit its parent, b) no RT task group overcommits global RT limit. but with rt_group_sched=0 there is no (non-trivial) hierarchy of RT groups, therefore skip the validation altogether. Otherwise, writes to the global sched_rt_runtime_us knob will be rejected with incorrect validation error. This fix is immaterial with CONFIG_RT_GROUP_SCHED=n. Fixes: `87f1fb77d8` ("sched: Add RT_GROUP WARN checks for non-root task_groups") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260323-sched-rert_groups-v3-1-1e7d5ed6b249@suse.com	2026-04-08 13:11:44 +02:00
Peter Zijlstra	14a8570564	sched/deadline: Use revised wakeup rule for dl_server John noted that commit `1151354225` ("sched/deadline: Fix 'stuck' dl_server") unfixed the issue from commit `a3a70caf79` ("sched/deadline: Fix dl_server behaviour"). The issue in commit `1151354225` was for wakeups of the server after the deadline; in which case you have to start a new period. The case for `a3a70caf79` is wakeups before the deadline. Now, because the server is effectively running a least-laxity policy, it means that any wakeup during the runnable phase means dl_entity_overflow() will be true. This means we need to adjust the runtime to allow it to still run until the existing deadline expires. Use the revised wakeup rule for dl_defer entities. Fixes: `1151354225` ("sched/deadline: Fix 'stuck' dl_server") Reported-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260404102244.GB22575@noisy.programming.kicks-ass.net	2026-04-08 13:11:43 +02:00
Mark Rutland	c5538d0141	entry: Split kernel mode logic from irqentry_{enter,exit}() The generic irqentry code has entry/exit functions specifically for exceptions taken from user mode, but doesn't have entry/exit functions specifically for exceptions taken from kernel mode. It would be helpful to have separate entry/exit functions specifically for exceptions taken from kernel mode. This would make the structure of the entry code more consistent, and would make it easier for architectures to manage logic specific to exceptions taken from kernel mode. Move the logic specific to kernel mode out of irqentry_enter() and irqentry_exit() into new irqentry_enter_from_kernel_mode() and irqentry_exit_to_kernel_mode() functions. These are marked __always_inline and placed in irq-entry-common.h, as with irqentry_enter_from_user_mode() and irqentry_exit_to_user_mode(), so that they can be inlined into architecture-specific wrappers. The existing out-of-line irqentry_enter() and irqentry_exit() functions retained as callers of the new functions. The lockdep assertion from irqentry_exit() is moved into irqentry_exit_to_user_mode() and irqentry_exit_to_kernel_mode(). This was previously missing from irqentry_exit_to_user_mode() when called directly, and any new lockdep assertion failure relating from this change is a latent bug. Aside from the lockdep change noted above, there should be no functional change as a result of this change. [ tglx: Updated kernel doc ] Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260407131650.3813777-5-mark.rutland@arm.com	2026-04-08 11:43:32 +02:00
Mark Rutland	22f66e7ef4	entry: Remove local_irq_{enable,disable}_exit_to_user() local_irq_enable_exit_to_user() and local_irq_disable_exit_to_user() are never overridden by architecture code, and are always equivalent to local_irq_enable() and local_irq_disable(). These functions were added on the assumption that arm64 would override them to manage 'DAIF' exception masking, as described by Thomas Gleixner in these threads: https://lore.kernel.org/all/20190919150809.340471236@linutronix.de/ https://lore.kernel.org/all/alpine.DEB.2.21.1910240119090.1852@nanos.tec.linutronix.de/ In practice arm64 did not need to override either. Prior to moving to the generic irqentry code, arm64's management of DAIF was reworked in commit: `97d935faac` ("arm64: Unmask Debug + SError in do_notify_resume()") Since that commit, arm64 only masks interrupts during the 'prepare' step when returning to user mode, and masks other DAIF exceptions later. Within arm64_exit_to_user_mode(), the arm64 entry code is as follows: local_irq_disable(); exit_to_user_mode_prepare_legacy(regs); local_daif_mask(); mte_check_tfsr_exit(); exit_to_user_mode(); Remove the unnecessary local_irq_enable_exit_to_user() and local_irq_disable_exit_to_user() functions. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260407131650.3813777-3-mark.rutland@arm.com	2026-04-08 11:43:31 +02:00
Amery Hung	017f5c4ef7	bpf: Allow overwriting referenced dynptr when refcnt > 1 The verifier currently does not allow overwriting a referenced dynptr's stack slot to prevent resource leak. This is because referenced dynptr holds additional resources that requires calling specific helpers to release. This limitation can be relaxed when there are multiple copies of the same dynptr. Whether it is the orignial dynptr or one of its clones, as long as there exists at least one other dynptr with the same ref_obj_id (to be used to release the reference), its stack slot should be allowed to be overwritten. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260406150548.1354271-2-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-07 18:20:49 -07:00
Daniel Borkmann	1b327732c8	bpf: Clear delta when clearing reg id for non-{add,sub} ops When a non-{add,sub} alu op such as xor is performed on a scalar register that previously had a BPF_ADD_CONST delta, the else path in adjust_reg_min_max_vals() only clears dst_reg->id but leaves dst_reg->delta unchanged. This stale delta can propagate via assign_scalar_id_before_mov() when the register is later used in a mov. It gets a fresh id but keeps the stale delta from the old (now-cleared) BPF_ADD_CONST. This stale delta can later propagate leading to a verifier-vs- runtime value mismatch. The clear_id label already correctly clears both delta and id. Make the else path consistent by also zeroing the delta when id is cleared. More generally, this introduces a helper clear_scalar_id() which internally takes care of zeroing. There are various other locations in the verifier where only the id is cleared. By using the helper we catch all current and future locations. Fixes: `98d7ca374b` ("bpf: Track delta between "linked" registers.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260407192421.508817-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-07 18:15:42 -07:00
Daniel Borkmann	d7f14173c0	bpf: Fix linked reg delta tracking when src_reg == dst_reg Consider the case of rX += rX where src_reg and dst_reg are pointers to the same bpf_reg_state in adjust_reg_min_max_vals(). The latter first modifies the dst_reg in-place, and later in the delta tracking, the subsequent is_reg_const(src_reg)/reg_const_value(src_reg) reads the post-{add,sub} value instead of the original source. This is problematic since it sets an incorrect delta, which sync_linked_regs() then propagates to linked registers, thus creating a verifier-vs-runtime mismatch. Fix it by just skipping this corner case. Fixes: `98d7ca374b` ("bpf: Track delta between "linked" registers.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260407192421.508817-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-07 18:15:42 -07:00
Andrey Grodzovsky	1870ddcd94	bpf: Prefer vmlinux symbols over module symbols for unqualified kprobes When an unqualified kprobe target exists in both vmlinux and a loaded module, number_of_same_symbols() returns a count greater than 1, causing kprobe attachment to fail with -EADDRNOTAVAIL even though the vmlinux symbol is unambiguous. When no module qualifier is given and the symbol is found in vmlinux, return the vmlinux-only count without scanning loaded modules. This preserves the existing behavior for all other cases: - Symbol only in a module: vmlinux count is 0, falls through to module scan as before. - Symbol qualified with MOD:SYM: mod != NULL, unchanged path. - Symbol ambiguous within vmlinux itself: count > 1 is returned as-is. Fixes: `926fe783c8` ("tracing/kprobes: Fix symbol counting logic by looking at modules as well") Fixes: `9d8616034f` ("tracing/kprobes: Add symbol counting check when module loads") Suggested-by: Ihor Solodrai <ihor.solodrai@linux.dev> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com> Link: https://lore.kernel.org/r/20260407203912.1787502-2-andrey.grodzovsky@crowdstrike.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-07 16:27:52 -07:00
Kumar Kartikeya Dwivedi	57b23c0f61	bpf: Retire rcu_trace_implies_rcu_gp() RCU Tasks Trace grace period implies RCU grace period, and this guarantee is expected to remain in the future. Only BPF is the user of this predicate, hence retire the API and clean up all in-tree users. RCU Tasks Trace is now implemented on SRCU-fast and its grace period mechanism always has at least one call to synchronize_rcu() as it is required for SRCU-fast's correctness (it replaces the smp_mb() that SRCU-fast readers skip). So, RCU-tt GP will always imply RCU GP. Reviewed-by: Puranjay Mohan <puranjay@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260407162234.785270-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-07 12:24:49 -07:00
Maninder Singh	034db4dd44	workqueue: use NR_STD_WORKER_POOLS instead of hardcoded value use NR_STD_WORKER_POOLS for irq_work_fns[] array definition. NR_STD_WORKER_POOLS is also 2, but better to use MACRO. Initialization loop for_each_bh_worker_pool() also uses same MACRO. Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-07 08:13:19 -10:00
Linus Torvalds	66d64899ea	8 hotfixes. All are cc:stable and 7 are for MM. All are singletons - please see the changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCadQzWgAKCRDdBJ7gKXxA jp58AP9C4nNI7ReZ9hyH3xg32JZIszk8vzwRUZte/HvvY6YsXQEAt94DZnO8wa3k DoYYRjIi9PfhZzI4aRJFXgANWI+0twc= =VqfR -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2026-04-06-15-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Eight hotfixes. All are cc:stable and seven are for MM. All are singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2026-04-06-15-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: ocfs2: fix out-of-bounds write in ocfs2_write_end_inline mm/damon/stat: deallocate damon_call() failure leaking damon_ctx mm/vma: fix memory leak in __mmap_region() mm/memory_hotplug: maintain N_NORMAL_MEMORY during hotplug mm/damon/sysfs: dealloc repeat_call_control if damon_call() fails mm: reinstate unconditional writeback start in balance_dirty_pages() liveupdate: propagate file deserialization failures mm: filemap: fix nr_pages calculation overflow in filemap_map_pages()	2026-04-07 10:24:44 -07:00
Zhan Xusheng	09c04714cb	alarmtimer: Access timerqueue node under lock in suspend In alarmtimer_suspend(), timerqueue_getnext() is called under base->lock, but next->expires is read after the lock is released. This is safe because suspend freezes all relevant task contexts, but reading the node while holding the lock makes the code easier to reason about and not worry about a theoretical UAF. Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260407143627.19405-1-zhanxusheng@xiaomi.com	2026-04-07 19:14:26 +02:00
Jiayuan Chen	beaf0e96b1	bpf: Drop task_to_inode and inet_conn_established from lsm sleepable hooks bpf_lsm_task_to_inode() is called under rcu_read_lock() and bpf_lsm_inet_conn_established() is called from softirq context, so neither hook can be used by sleepable LSM programs. Fixes: `423f16108c` ("bpf: Augment the set of sleepable LSM hooks") Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn> Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reported-by: Dongliang Mu <dzm91@hust.edu.cn> Closes: https://lore.kernel.org/bpf/3ab69731-24d1-431a-a351-452aafaaf2a5@std.uestc.edu.cn/T/#u Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260407122334.344072-1-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-07 07:57:07 -07:00
Josh Snyder	82b915051d	tick/nohz: Fix inverted return value in check_tick_dependency() fast path Commit `56534673ce` ("tick/nohz: Optimize check_tick_dependency() with early return") added a fast path that returns !val when the tick_stop tracepoint is disabled. This is inverted: the slow path returns true when a dependency IS found (val != 0), but !val returns true when val is zero (no dependency). The result is that can_stop_full_tick() sees "dependency found" when there are none, and the tick never stops on nohz_full CPUs. Fix this by returning !!val instead of !val, matching the slow-path semantics. Fixes: `56534673ce` ("tick/nohz: Optimize check_tick_dependency() with early return") Signed-off-by: Josh Snyder <josh@code406.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Assisted-by: Claude:claude-opus-4-6 Link: https://patch.msgid.link/20260402-fix-idle-tick2-v1-1-eecb589649d3@code406.com	2026-04-07 15:30:21 +02:00
K Prateek Nayak	556146ce5e	sched/fair: Avoid overflow in enqueue_entity() Here is one scenario which was triggered when running: stress-ng --yield=32 -t 10000000s& while true; do perf bench sched messaging -p -t -l 100000 -g 16; done on a 256CPUs machine after about an hour into the run: __enqeue_entity: entity_key(-141245081754) weight(90891264) overflow_mul(5608800059305154560) vlag(57498) delayed?(0) cfs_rq: zero_vruntime(3809707759657809) sum_w_vruntime(0) sum_weight(0) nr_queued(1) cfs_rq->curr: entity_key(0) vruntime(3809707759657809) deadline(3809723966988476) weight(37) The above comes from __enqueue_entity() after a place_entity(). Breaking this down: vlag_initial = 57498 vlag = (57498 * (37 + 90891264)) / 37 = 141,245,081,754 vruntime = 3809707759657809 - 141245081754 = 3,809,566,514,576,055 entity_key(se, cfs_rq) = -141,245,081,754 Now, multiplying the entity_key with its own weight results to 5,608,800,059,305,154,560 (same as what overflow_mul() suggests) but in Python, without overflow, this would be: -1,2837,944,014,404,397,056 Avoid the overflow (without doing the division for avg_vruntime()), by moving zero_vruntime to the new entity when it is heavier. Fixes: `4823725d9d` ("sched/fair: Increase weight bits for avg_vruntime") Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> [peterz: suggested 'weight > load' condition] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260407120052.GG3738010@noisy.programming.kicks-ass.net	2026-04-07 14:02:00 +02:00
Joseph Salisbury	c6e80201e0	sched: Use u64 for bandwidth ratio calculations to_ratio() computes BW_SHIFT-scaled bandwidth ratios from u64 period and runtime values, but it returns unsigned long. tg_rt_schedulable() also stores the current group limit and the accumulated child sum in unsigned long. On 32-bit builds, large bandwidth ratios can be truncated and the RT group sum can wrap when enough siblings are present. That can let an overcommitted RT hierarchy pass the schedulability check, and it also narrows the helper result for other callers. Return u64 from to_ratio() and use u64 for the RT group totals so bandwidth ratios are preserved and compared at full width on both 32-bit and 64-bit builds. Fixes: `b40b2e8eb5` ("sched: rt: multi level group constraints") Assisted-by: Codex:GPT-5 Signed-off-by: Joseph Salisbury <joseph.salisbury@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260403210014.2713404-1-joseph.salisbury@oracle.com	2026-04-07 09:23:52 +02:00
Anton Protopopov	43cd9d9520	bpf: Do not ignore offsets for loads from insn_arrays When a pointer to PTR_TO_INSN is dereferenced, the offset field of the BPF_LDX_MEM instruction can be nonzero. Patch the verifier to not ignore this field. Reported-by: Jiyong Yang <ksur673@gmail.com> Fixes: `493d9e0d60` ("bpf, x86: add support for indirect jumps") Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20260406160141.36943-2-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-06 18:38:32 -07:00
Gustavo A. R. Silva	18474aed5d	bpf: Avoid -Wflex-array-members-not-at-end warnings Apparently, struct bpf_empty_prog_array exists entirely to populate a single element of "items" in a global variable. "null_prog" is only used during the initializer. None of this is needed; globals will be correctly sized with an array initializer of a flexible-array member. So, remove struct bpf_empty_prog_array and adjust the rest of the code, accordingly. With these changes, fix the following warnings: ./include/linux/bpf.h:2369:31: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/acr7Whmn0br3xeBP@kspp Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-06 18:37:52 -07:00
Kumar Kartikeya Dwivedi	f25777056e	bpf: Enable unaligned accesses for syscall ctx Don't reject usage of fixed unaligned offsets for syscall ctx. Tests will be added in later commits. Unaligned offsets already work for variable offsets. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260406194403.1649608-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-06 15:27:26 -07:00
Kumar Kartikeya Dwivedi	ae5ef001aa	bpf: Support variable offsets for syscall PTR_TO_CTX Allow accessing PTR_TO_CTX with variable offsets in syscall programs. Fixed offsets are already enabled for all program types that do not convert their ctx accesses, since the changes we made in the commit `de6c7d99f8` ("bpf: Relax fixed offset check for PTR_TO_CTX"). Note that we also lift the restriction on passing syscall context into helpers, which was not permitted before, and passing modified syscall context into kfuncs. The structure of check_mem_access can be mostly shared and preserved, but we must use check_mem_region_access to correctly verify access with variable offsets. The check made in check_helper_mem_access is hardened to only allow PTR_TO_CTX for syscall programs to be passed in as helper memory. This was the original intention of the existing code anyway, and it makes little sense for other program types' context to be utilized as a memory buffer. In case a convincing example presents itself in the future, this check can be relaxed further. We also no longer use the last-byte access to simulate helper memory access, but instead go through check_mem_region_access. Since this no longer updates our max_ctx_offset, we must do so manually, to keep track of the maximum offset at which the program ctx may be accessed. Take care to ensure that when arg_type is ARG_PTR_TO_CTX, we do not relax any fixed or variable offset constraints around PTR_TO_CTX even in syscall programs, and require them to be passed unmodified. There are several reasons why this is necessary. First, if we pass a modified ctx, then the global subprog's accesses will not update the max_ctx_offset to its true maximum offset, and can lead to out of bounds accesses. Second, tail called program (or extension program replacing global subprog) where their max_ctx_offset exceeds the program they are being called from can also cause issues. For the latter, unmodified PTR_TO_CTX is the first requirement for the fix, the second is ensuring max_ctx_offset >= the program they are being called from, which has to be a separate change not made in this commit. All in all, we can hint using arg_type when we expect ARG_PTR_TO_CTX and make our relaxation around offsets conditional on it. Drop coverage of syscall tests from verifier_ctx.c temporarily for negative cases until they are updated in subsequent commits. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260406194403.1649608-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-06 15:27:26 -07:00
Leo Timmins	307e0c5859	liveupdate: propagate file deserialization failures luo_session_deserialize() ignored the return value from luo_file_deserialize(). As a result, a session could be left partially restored even though the /dev/liveupdate open path treats deserialization failures as fatal. Propagate the error so a failed file deserialization aborts session deserialization instead of silently continuing. Link: https://lkml.kernel.org/r/20260325044608.8407-1-leotimmins1974@gmail.com Link: https://lkml.kernel.org/r/20260325044608.8407-2-leotimmins1974@gmail.com Fixes: `16cec0d265` ("liveupdate: luo_session: add ioctls for file preservation") Signed-off-by: Leo Timmins <leotimmins1974@gmail.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-06 11:13:42 -07:00
MingTao Huang	a1aa9ef47c	bpf: Fix stale offload->prog pointer after constant blinding When a dev-bound-only BPF program (BPF_F_XDP_DEV_BOUND_ONLY) undergoes JIT compilation with constant blinding enabled (bpf_jit_harden >= 2), bpf_jit_blind_constants() clones the program. The original prog is then freed in bpf_jit_prog_release_other(), which updates aux->prog to point to the surviving clone, but fails to update offload->prog. This leaves offload->prog pointing to the freed original program. When the network namespace is subsequently destroyed, cleanup_net() triggers bpf_dev_bound_netdev_unregister(), which iterates ondev->progs and calls __bpf_prog_offload_destroy(offload->prog). Accessing the freed prog causes a page fault: BUG: unable to handle page fault for address: ffffc900085f1038 Workqueue: netns cleanup_net RIP: 0010:__bpf_prog_offload_destroy+0xc/0x80 Call Trace: __bpf_offload_dev_netdev_unregister+0x257/0x350 bpf_dev_bound_netdev_unregister+0x4a/0x90 unregister_netdevice_many_notify+0x2a2/0x660 ... cleanup_net+0x21a/0x320 The test sequence that triggers this reliably is: 1. Set net.core.bpf_jit_harden=2 (echo 2 > /proc/sys/net/core/bpf_jit_harden) 2. Run xdp_metadata selftest, which creates a dev-bound-only XDP program on a veth inside a netns (./test_progs -t xdp_metadata) 3. cleanup_net -> page fault in __bpf_prog_offload_destroy Dev-bound-only programs are unique in that they have an offload structure but go through the normal JIT path instead of bpf_prog_offload_compile(). This means they are subject to constant blinding's prog clone-and-replace, while also having offload->prog that must stay in sync. Fix this by updating offload->prog in bpf_jit_prog_release_other(), alongside the existing aux->prog update. Both are back-pointers to the prog that must be kept in sync when the prog is replaced. Fixes: `2b3486bc2d` ("bpf: Introduce device-bound XDP programs") Signed-off-by: MingTao Huang <mintaohuang@tencent.com> Link: https://lore.kernel.org/r/tencent_BCF692F45859CCE6C22B7B0B64827947D406@qq.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-05 18:48:09 -07:00
Weiming Shi	5828b9e5b2	bpf: fix end-of-list detection in cgroup_storage_get_next_key() list_next_entry() never returns NULL -- when the current element is the last entry it wraps to the list head via container_of(). The subsequent NULL check is therefore dead code and get_next_key() never returns -ENOENT for the last element, instead reading storage->key from a bogus pointer that aliases internal map fields and copying the result to userspace. Replace it with list_entry_is_head() so the function correctly returns -ENOENT when there are no more entries. Fixes: `de9cbbaadb` ("bpf: introduce cgroup storage maps") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com> Acked-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/20260403132951.43533-2-bestswngs@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-05 18:45:05 -07:00
Mykyta Yatsenko	07738bc566	bpf: Use copy_map_value_locked() in alloc_htab_elem() for BPF_F_LOCK When a BPF_F_LOCK update races with a concurrent delete, the freed element can be immediately recycled by alloc_htab_elem(). The fast path in htab_map_update_elem() performs a lockless lookup and then calls copy_map_value_locked() under the element's spin_lock. If alloc_htab_elem() recycles the same memory, it overwrites the value with plain copy_map_value(), without taking the spin_lock, causing torn writes. Use copy_map_value_locked() when BPF_F_LOCK is set so the new element's value is written under the embedded spin_lock, serializing against any stale lock holders. Fixes: `96049f3afd` ("bpf: introduce BPF_F_LOCK flag") Reported-by: Aaron Esau <aaron1esau@gmail.com> Closes: https://lore.kernel.org/all/CADucPGRvSRpkneb94dPP08YkOHgNgBnskTK6myUag_Mkjimihg@mail.gmail.com/ Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260401-bpf_map_torn_writes-v1-1-782d071c55e7@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-05 18:37:32 -07:00
Pengpeng Hou	4346be6577	tracing/probe: reject non-closed empty immediate strings parse_probe_arg() accepts quoted immediate strings and passes the body after the opening quote to __parse_imm_string(). That helper currently computes strlen(str) and immediately dereferences str[len - 1], which underflows when the body is empty and not closed with double-quotation. Reject empty non-closed immediate strings before checking for the closing quote. Link: https://lore.kernel.org/all/20260401160315.88518-1-pengpeng@iscas.ac.cn/ Fixes: `a42e3c4de9` ("tracing/probe: Add immediate string parameter support") Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2026-04-06 09:22:42 +09:00
Pratyush Yadav	22bdab8e98	kho: drop restriction on maximum page order KHO currently restricts the maximum order of a restored page to the maximum order supported by the buddy allocator. While this works fine for much of the data passed across kexec, it is possible to have pages larger than MAX_PAGE_ORDER. For one, it is possible to get a larger order when using kho_preserve_pages() if the number of pages is large enough, since it tries to combine multiple aligned 0-order preservations into one higher order preservation. For another, upcoming support for hugepages can have gigantic hugepages being preserved over KHO. There is no real reason for this limit. The KHO preservation machinery can handle any page order. Remove this artificial restriction on max page order. Link: https://lkml.kernel.org/r/20260309123410.382308-2-pratyush@kernel.org Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:24 -07:00
Pratyush Yadav (Google)	91e74fa8b1	kho: make sure preservations do not span multiple NUMA nodes The KHO restoration machinery is not capable of dealing with preservations that span multiple NUMA nodes. kho_preserve_folio() guarantees the preservation will only span one NUMA node since folios can't span multiple nodes. This leaves kho_preserve_pages(). While semantically kho_preserve_pages() only deals with 0-order pages, so all preservations should be single page only, in practice it combines preservations to higher orders for efficiency. This can result in a preservation spanning multiple nodes. Break up the preservations into a smaller order if that happens. Link: https://lkml.kernel.org/r/20260309123410.382308-1-pratyush@kernel.org Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Suggested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:24 -07:00
David Hildenbrand (Arm)	0326440c35	mm: rename zap_page_range_single() to zap_vma_range() Let's rename it to make it better match our new naming scheme. While at it, polish the kerneldoc. [akpm@linux-foundation.org: fix rustfmtcheck] Link: https://lkml.kernel.org/r/20260227200848.114019-15-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: Puranjay Mohan <puranjay@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:15 -07:00
David Hildenbrand (Arm)	de008c9ba5	mm/memory: remove "zap_details" parameter from zap_page_range_single() Nobody except memory.c should really set that parameter to non-NULL. So let's just drop it and make unmap_mapping_range_vma() use zap_page_range_single_batched() instead. [david@kernel.org: format on a single line] Link: https://lkml.kernel.org/r/8a27e9ac-2025-4724-a46d-0a7c90894ba7@kernel.org Link: https://lkml.kernel.org/r/20260227200848.114019-3-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: Puranjay Mohan <puranjay@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:13 -07:00
Kiryl Shutsemau	d50569612c	mm: rename the 'compound_head' field in the 'struct page' to 'compound_info' The 'compound_head' field in the 'struct page' encodes whether the page is a tail and where to locate the head page. Bit 0 is set if the page is a tail, and the remaining bits in the field point to the head page. As preparation for changing how the field encodes information about the head page, rename the field to 'compound_info'. Link: https://lkml.kernel.org/r/20260227194302.274384-4-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (arm) <david@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:08 -07:00
Pasha Tatashin	019fc36872	kho: fix KASAN support for restored vmalloc regions Restored vmalloc regions are currently not properly marked for KASAN, causing KASAN to treat accesses to these regions as out-of-bounds. Fix this by properly unpoisoning the restored vmalloc area using kasan_unpoison_vmalloc(). This requires setting the VM_UNINITIALIZED flag during the initial area allocation and clearing it after the pages have been mapped and unpoisoned, using the clear_vm_uninitialized_flag() helper. Link: https://lkml.kernel.org/r/20260225223857.1714801-3-pasha.tatashin@soleen.com Fixes: `a667300bd5` ("kho: add support for preserving vmalloc allocations") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reported-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Tested-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:06 -07:00
Jason Miu	6b0dd42d76	kho: remove finalize state and clients Eliminate the `kho_finalize()` function and its associated state from the KHO subsystem. The transition to a radix tree for memory tracking makes the explicit "finalize" state and its serialization step obsolete. Remove the `kho_finalize()` and `kho_finalized()` APIs and their stub implementations. Update KHO client code and the debugfs interface to no longer call or depend on the `kho_finalize()` mechanism. Complete the move towards a stateless KHO, simplifying the overall design by removing unnecessary state management. Link: https://lkml.kernel.org/r/20260206021428.3386442-3-jasonmiu@google.com Signed-off-by: Jason Miu <jasonmiu@google.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: David Matlack <dmatlack@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:04 -07:00
Jason Miu	3f2ad90060	kho: adopt radix tree for preserved memory tracking Patch series "Make KHO Stateless", v9. This series transitions KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel. The key motivations for this change are to: - Eliminate the need for data serialization before kexec. - Remove the KHO finalize state. - Pass preservation metadata more directly to the next kernel via the FDT. The new approach uses a radix tree to mark preserved pages. A page's physical address and its order are encoded into a single value. The tree is composed of multiple levels of page-sized tables, with leaf nodes being bitmaps where each set bit represents a preserved page. The physical address of the radix tree's root is passed in the FDT, allowing the next kernel to reconstruct the preserved memory map. This series is broken down into the following patches: 1. kho: Adopt radix tree for preserved memory tracking: Replaces the xarray-based tracker with the new radix tree implementation and increments the ABI version. 2. kho: Remove finalize state and clients: Removes the now-obsolete kho_finalize() function and its usage from client code and debugfs. This patch (of 2): Introduce a radix tree implementation for tracking preserved memory pages and switch the KHO memory tracking mechanism to use it. This lays the groundwork for a stateless KHO implementation that eliminates the need for serialization and the associated "finalize" state. This patch introduces the core radix tree data structures and constants to the KHO ABI. It adds the radix tree node and leaf structures, along with documentation for the radix tree key encoding scheme that combines a page's physical address and order. To support broader use by other kernel subsystems, such as hugetlb preservation, the core radix tree manipulation functions are exported as a public API. The xarray-based memory tracking is replaced with this new radix tree implementation. The core KHO preservation and unpreservation functions are wired up to use the radix tree helpers. On boot, the second kernel restores the preserved memory map by walking the radix tree whose root physical address is passed via the FDT. The ABI `compatible` version is bumped to "kho-v2" to reflect the structural changes in the preserved memory map and sub-FDT property names. This includes renaming "fdt" to "preserved-data" to better reflect that preserved state may use formats other than FDT. [ran.xiaokai@zte.com.cn: fix child node parsing for debugfs in/sub_fdts] Link: https://lkml.kernel.org/r/20260309033530.244508-1-ranxiaokai627@163.com Link: https://lkml.kernel.org/r/20260206021428.3386442-1-jasonmiu@google.com Link: https://lkml.kernel.org/r/20260206021428.3386442-2-jasonmiu@google.com Signed-off-by: Jason Miu <jasonmiu@google.com> Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: David Matlack <dmatlack@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:04 -07:00
Pratyush Yadav (Google)	63de231ef0	kho: move alloc tag init to kho_init_{folio,pages}() Commit `8f1081892d` ("kho: simplify page initialization in kho_restore_page()") cleaned up the page initialization logic by moving the folio and 0-order-page paths into separate functions. It missed moving the alloc tag initialization. Do it now to keep the two paths cleanly separated. While at it, touch up the comments to be a tiny bit shorter (mainly so it doesn't end up splitting into a multiline comment). This is purely a cosmetic change and there should be no change in behaviour. Link: https://lkml.kernel.org/r/20260213085914.2778107-1-pratyush@kernel.org Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:03 -07:00
Linus Torvalds	2ab99ad7fa	Misc scheduler fixes: - Fix zero_vruntime tracking again (Peter Zijlstra) - Fix avg_vruntime() usage in sched_debug (Peter Zijlstra) Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmnSL8ERHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1h+BA/9F8SXhsN9+jmMCFqFScoXqbUpXWapnH1x sc58NThW783sRF2CA29AjE2js9/DzXEwxV9tGtCcYwT4kANpPESpDUfiy+KZB7/X QcK74UtDmA5D1MMfS1ub5+8+vnaquxafBlWtu2S4ZKgEeZUW+W1Txdjsf0aVq00U AU5gyRHdpyMPMJ9ecrVkvWce7dKK/ejiRT0zizLHcgrqNWAI6bDyLo3N4Z4SYndo mg6kofq4ghOyTClk6SbfwU+UZYiBCPC7aew8W66Nh0GIOWR/kbVtpBanTRDQcGWg L0IXdeBuUyhEmM3fDcFEcYO1tSHgb6pWRXooo9MkTg0b2UUyTw2nhXdPgOU9gBuW 4R1vnm3vyIR/I/36IhsoEs17PuxF3TFpsD2gjv1g563GRwgWQh8Afxoud1kias7Y eLFxSyoH48UUui1Pqqh9F02EaNV0KmosJKuYZ/MMkQZ8DGq5ZrStVKJ/8YAuXP6d dzIWTitHW9vjJ4cDiFQ721RFprEX+mpXCxBc2/OaQcV4vXKGC99LjnmbIlAZIsMY sE4QFduI/P310y3XfwLS7SuV6/Q7Yx5aTiNfk/GyLlpM3IqMdnFgt0kNs+Rm/hrg vWMj3rQwRtZEo5vsGuh8fg7f4FrgU4k6ScNl5BXIq/XEON8QJNr2m81dTMcznbiI 22vp6EGuMDk= =C0xO -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: - Fix zero_vruntime tracking again (Peter Zijlstra) - Fix avg_vruntime() usage in sched_debug (Peter Zijlstra) * tag 'sched-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/debug: Fix avg_vruntime() usage sched/fair: Fix zero_vruntime tracking fix	2026-04-05 13:45:37 -07:00
Paul Walmsley	08ee155905	prctl: cfi: change the branch landing pad prctl()s to be more descriptive Per Linus' comments requesting the replacement of "INDIR_BR_LP" in the indirect branch tracking prctl()s with something more readable, and suggesting the use of the speculation control prctl()s as an exemplar, reimplement the prctl()s and related constants that control per-task forward-edge control flow integrity. This primarily involves two changes. First, the prctls are restructured to resemble the style of the speculative execution workaround control prctls PR_{GET,SET}_SPECULATION_CTRL, to make them easier to extend in the future. Second, the "indir_br_lp" abbrevation is expanded to "branch_landing_pads" to be less telegraphic. The kselftest and documentation is adjusted accordingly. Link: https://lore.kernel.org/linux-riscv/CAHk-=whhSLGZAx3N5jJpb4GLFDqH_QvS07D+6BnkPWmCEzTAgw@mail.gmail.com/ Cc: Deepak Gupta <debug@rivosinc.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Paul Walmsley <pjw@kernel.org>	2026-04-04 18:40:58 -06:00
Paul Walmsley	adfc80dd0d	prctl: rename branch landing pad implementation functions to be more explicit Per Linus' comments about the unreadability of abbreviations such as "indir_br_lp", rename the three prctl() implementation functions to be more explicit. This involves renaming "indir_br_lp_status" in the function names to "branch_landing_pad_state". While here, add _prctl_ into the function names, following the speculation control prctl implementation functions. Link: https://lore.kernel.org/linux-riscv/CAHk-=whhSLGZAx3N5jJpb4GLFDqH_QvS07D+6BnkPWmCEzTAgw@mail.gmail.com/ Cc: Deepak Gupta <debug@rivosinc.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Paul Walmsley <pjw@kernel.org>	2026-04-04 18:40:58 -06:00
Thomas Gleixner	bad28e01f2	Linux 7.0-rc6 -----BEGIN PGP SIGNATURE----- iQFSBAABCgA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmnJqkAeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwGYH/RMBphrIZUnC2zwq mS+lwIve9Tb6LTwlCw+DbR0WROsiLUWCuL6AsMy6mEsWMVtj18uFmWv0vX0RP1o8 GuFNt2oTJ+3tqZgdlUi6//IZddXntiqwyvibocfrHIdLYfNdpTFCW5D7bnVEIkl3 9z7MH8IwZNajri38c+sqqpDhhsKfG6PgAzPea3kibw/XwcLquJv1h6KeCPoFAmKe Tl8Pl96T9ESGUWa5Cu65CwQgaqITLH7BkyceVuUDXJGBJDN3wPhuD1ciPkjSCuJW ou2WyCr30uEfsmFlYrmsHR/aF6SuGYgXFGzL+kmWhOk2nCjAwi8Xxue4tIAYKD/s 0GPb+hg= =At5f -----END PGP SIGNATURE----- Merge tag 'v7.0-rc6' into irq/core to be able to merge the hyper-v patch related to randomness.	2026-04-04 20:59:34 +02:00
Rafael J. Wysocki	5cdfedf68e	amd-pstate new content for 7.1 (2026-04-02) Add support for new features: * CPPC performance priority * Dynamic EPP * Raw EPP * New unit tests for new features Fixes for: * PREEMPT_RT * sysfs files being present when HW missing * Broken/outdated documentation -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEECwtuSU6dXvs5GA2aLRkspiR3AnYFAmnOpNgTHHN1cGVybTFA a2VybmVsLm9yZwAKCRAtGSymJHcCduR7EADexgetxq0l6/iV2DyI1/YJcf+cNPoS yxE93vN9i3A2xcx87klncVF0C2zIZaZFkp6o7VY/AReL/UyUOh6snz371OXBl7pm A/uppkT5QdzTpmknJMyqkLRlHfkMjNRzWv4sdh4kyJSB3SkgaN7zSVi6Zxamt/vJ VNCgExZQeDqk4VL2X/NBfaBagYSnPnBmBdXoY6aPYqFrqKj4SlDxYNbJsQlcyE9Z z0naVGb5YPEJOaMvE+5z+DwX4EmtN3si+vfi8VuQOXPnoDGOG763rpMLnz7xYvfW poPu2fnitN39MaT96btRShD6XuCg9eaPAEmpb3j6c93n1kUo+joLLbalhfc0HMeL 1/8ndz+KatEUMQTCVgs8cboob1PpRvqhIb+vrs6aTEqCsgqUKUZ7GYgglBamyRka mivC5Q+ssCxq47/ilGfECFr8vK0oV3rTu9Ltp4MS5zN70tI0YYZk3o1454nY5dhc Byv5e9bft/n9AA576y5vXENcWCSez/8UFGl5RjoxQZ7SFKNFnbSic1BT4uMRVX/G 4QUk5TWwC8WdOp7YsO30LwZ0y9vtxmfBn8BF/6n/dYGhM1/DVQ1nX9iyzhCHZ3XH fgyrkUktdI1dsm/xKvbqxK9Djw0tkMsfH1yI6iQccefnlo4gRSvTRFiM2yepY6py E8MZpz1ML8T2Pw== =XTdh -----END PGP SIGNATURE----- Merge tag 'amd-pstate-v7.1-2026-04-02' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux Pull amd-pstate new content for 7.1 (2026-04-02) from Mario Limonciello: "Add support for new features: * CPPC performance priority * Dynamic EPP * Raw EPP * New unit tests for new features Fixes for: * PREEMPT_RT * sysfs files being present when HW missing * Broken/outdated documentation" * tag 'amd-pstate-v7.1-2026-04-02' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux: (22 commits) MAINTAINERS: amd-pstate: Step down as maintainer, add Prateek as reviewer cpufreq: Pass the policy to cpufreq_driver->adjust_perf() cpufreq/amd-pstate: Pass the policy to amd_pstate_update() cpufreq/amd-pstate-ut: Add a unit test for raw EPP cpufreq/amd-pstate: Add support for raw EPP writes cpufreq/amd-pstate: Add support for platform profile class cpufreq/amd-pstate: add kernel command line to override dynamic epp cpufreq/amd-pstate: Add dynamic energy performance preference Documentation: amd-pstate: fix dead links in the reference section cpufreq/amd-pstate: Cache the max frequency in cpudata Documentation/amd-pstate: Add documentation for amd_pstate_floor_{freq,count} Documentation/amd-pstate: List amd_pstate_prefcore_ranking sysfs file Documentation/amd-pstate: List amd_pstate_hw_prefcore sysfs file amd-pstate-ut: Add a testcase to validate the visibility of driver attributes amd-pstate-ut: Add module parameter to select testcases amd-pstate: Introduce a tracepoint trace_amd_pstate_cppc_req2() amd-pstate: Add sysfs support for floor_freq and floor_count amd-pstate: Add support for CPPC_REQ2 and FLOOR_PERF x86/cpufeatures: Add AMD CPPC Performance Priority feature. amd-pstate: Make certain freq_attrs conditionally visible ...	2026-04-04 20:55:56 +02:00
Lucas De Marchi	663385f915	module: Simplify warning on positive returns from module_init() It should now be rare to trigger this warning - it doesn't need to be so verbose. Make it follow the usual style in the module loading code. For the same reason, drop the dump_stack(). Suggested-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Lucas De Marchi <demarchi@kernel.org> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>	2026-04-04 00:04:48 +00:00
Lucas De Marchi	743f8cae54	module: Override -EEXIST module return The -EEXIST errno is reserved by the module loading functionality. When userspace calls [f]init_module(), it expects a -EEXIST to mean that the module is already loaded in the kernel. If module_init() returns it, that is not true anymore. Override the error when returning to userspace: it doesn't make sense to change potentially long error propagation call chains just because it's will end up as the return of module_init(). Closes: https://lore.kernel.org/all/aKLzsAX14ybEjHfJ@orbyte.nwl.cc/ Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: Phil Sutter <phil@nwl.cc> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Lucas De Marchi <demarchi@kernel.org> [Sami: Fixed a typo.] Signed-off-by: Sami Tolvanen <samitolvanen@google.com>	2026-04-04 00:04:42 +00:00
Linus Torvalds	631919fb12	sched_ext: Fixes for v7.0-rc6 - Fix stale direct dispatch state in ddsp_dsq_id which can cause spurious warnings in mark_direct_dispatch() on task wakeup. - Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCU configs which can lead to incorrectly dispatching migration-disabled tasks to remote CPUs. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCac//0w4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGdqUAP9kEuxvB+pxjheSKV0j7zvDHd+ksMxjQTRoBmyu PE0hIgEA5gAax8ebef9MlyRVsm9Qh7v/AmovUHt75oeCnDk++Ag= =hD7A -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-7.0-rc6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: "These are late but both fix subtle yet critical problems and the blast radius is limited strictly to sched_ext. - Fix stale direct dispatch state in ddsp_dsq_id which can cause spurious warnings in mark_direct_dispatch() on task wakeup - Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCU configs which can lead to incorrectly dispatching migration- disabled tasks to remote CPUs" * tag 'sched_ext-for-7.0-rc6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix stale direct dispatch state in ddsp_dsq_id sched_ext: Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCU	2026-04-03 12:05:06 -07:00
Tejun Heo	744ab12a5b	Merge branch 'for-7.0-fixes' into for-7.1 Conflict in kernel/sched/ext.c between: `7e0ffb72de` ("sched_ext: Fix stale direct dispatch state in ddsp_dsq_id") which clears ddsp state at individual call sites instead of dispatch_enqueue(), and sub-sched related code reorg and API updates on for-7.1. Resolved by applying the ddsp fix with for-7.1's signatures. Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-03 07:48:28 -10:00
Bartosz Golaszewski	9617b5b62c	kernel: ksysfs: initialize kernel_kobj earlier Software nodes depend on kernel_kobj which is initialized pretty late into the boot process - as a core_initcall(). Ahead of moving the software node initialization to driver_init() we must first make kernel_kobj available before it. Make ksysfs_init() visible in a new header - ksysfs.h - and call it in do_basic_setup() right before driver_init(). Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com> Link: https://patch.msgid.link/20260402-nokia770-gpio-swnodes-v5-1-d730db3dd299@oss.qualcomm.com Signed-off-by: Danilo Krummrich <dakr@kernel.org>	2026-04-03 19:39:52 +02:00
Andrea Righi	7e0ffb72de	sched_ext: Fix stale direct dispatch state in ddsp_dsq_id @p->scx.ddsp_dsq_id can be left set (non-SCX_DSQ_INVALID) triggering a spurious warning in mark_direct_dispatch() when the next wakeup's ops.select_cpu() calls scx_bpf_dsq_insert(), such as: WARNING: kernel/sched/ext.c:1273 at scx_dsq_insert_commit+0xcd/0x140 The root cause is that ddsp_dsq_id was only cleared in dispatch_enqueue(), which is not reached in all paths that consume or cancel a direct dispatch verdict. Fix it by clearing it at the right places: - direct_dispatch(): cache the direct dispatch state in local variables and clear it before dispatch_enqueue() on the synchronous path. For the deferred path, the direct dispatch state must remain set until process_ddsp_deferred_locals() consumes them. - process_ddsp_deferred_locals(): cache the dispatch state in local variables and clear it before calling dispatch_to_local_dsq(), which may migrate the task to another rq. - do_enqueue_task(): clear the dispatch state on the enqueue path (local/global/bypass fallbacks), where the direct dispatch verdict is ignored. - dequeue_task_scx(): clear the dispatch state after dispatch_dequeue() to handle both the deferred dispatch cancellation and the holding_cpu race, covering all cases where a pending direct dispatch is cancelled. - scx_disable_task(): clear the direct dispatch state when transitioning a task out of the current scheduler. Waking tasks may have had the direct dispatch state set by the outgoing scheduler's ops.select_cpu() and then been queued on a wake_list via ttwu_queue_wakelist(), when SCX_OPS_ALLOW_QUEUED_WAKEUP is set. Such tasks are not on the runqueue and are not iterated by scx_bypass(), so their direct dispatch state won't be cleared. Without this clear, any subsequent SCX scheduler that tries to direct dispatch the task will trigger the WARN_ON_ONCE() in mark_direct_dispatch(). Fixes: `5b26f7b920` ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches") Cc: stable@vger.kernel.org # v6.12+ Cc: Daniel Hodges <hodgesd@meta.com> Cc: Patrick Somaru <patsomaru@meta.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-03 07:14:49 -10:00
Linus Torvalds	1270605fd2	Power management fixes for 7.0-rc7 - Fix a NULL pointer dereference in the energy model netlink interface that may occur if a given perf domain ID is not recognized (Changwoo Min) - Avoid double free in the cpufreq_dbs_governor_init() error path when kobject_init_and_add() fails (Guangshuo Li) -----BEGIN PGP SIGNATURE----- iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmnPshwSHHJqd0Byand5 c29ja2kubmV0AAoJEO5fvZ0v1OO1iCsH/2R8XWqh/Yc5h4dobakF4AjDZ5TB7aHN GBvXXUAfZ9VRhqzfUxArbnCLDQcLnc7z/wdJo2nq/K0M+NqjBd2yY3nVirJmdpxr 4H6mORN2YLWh853roDx6ZrMWGaYrozY4KVCLx/2NJ/OqxNYDEQlpOjobxV7YUq1b omtBWm72lMHyCC/mRcyBC7JYUWt2PA3yqYb7IuVBdn4M4vY1hOMxAxoLwLiIx0qw zmWsuv7FPOAWcf7JvFyEGYaPkMT4Spf6ZYT1t6DALg9o194+0iWJSKl1IUDWkA3C 2OTxUIVv+5Mpw+sE4DQGbZSbTaglgjCQ6DJu1O+AwUbd7usRN8ZJ2Zk= =El4h -----END PGP SIGNATURE----- Merge tag 'pm-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix a potential NULL pointer dereference in the energy model netlink interface and a potential double free in an error path in the common cpufreq governor management code: - Fix a NULL pointer dereference in the energy model netlink interface that may occur if a given perf domain ID is not recognized (Changwoo Min) - Avoid double free in the cpufreq_dbs_governor_init() error path when kobject_init_and_add() fails (Guangshuo Li)" * tag 'pm-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: governor: fix double free in cpufreq_dbs_governor_init() error path PM: EM: Fix NULL pointer dereference when perf domain ID is not found	2026-04-03 09:56:32 -07:00
Alexei Starovoitov	1a1cadbd5d	bpf: Add helper and kfunc stack access size resolution The static stack liveness analysis needs to know how many bytes a helper or kfunc accesses through a stack pointer argument, so it can precisely mark the affected stack slots as stack 'def' or 'use'. Add bpf_helper_stack_access_bytes() and bpf_kfunc_stack_access_bytes() which resolve the access size for a given call argument. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-7-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:44 -07:00
Alexei Starovoitov	19dbb13474	bpf: Move verifier helpers to header Move several helpers to header as preparation for the subsequent stack liveness patches. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-6-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:41 -07:00
Alexei Starovoitov	f1606dd0ac	bpf: Add bpf_compute_const_regs() and bpf_prune_dead_branches() passes Add two passes before the main verifier pass: bpf_compute_const_regs() is a forward dataflow analysis that tracks register values in R0-R9 across the program using fixed-point iteration in reverse postorder. Each register is tracked with a six-state lattice: UNVISITED -> CONST(val) / MAP_PTR(map_index) / MAP_VALUE(map_index, offset) / SUBPROG(num) -> UNKNOWN At merge points, if two paths produce the same state and value for a register, it stays; otherwise it becomes UNKNOWN. The analysis handles: - MOV, ADD, SUB, AND with immediate or register operands - LD_IMM64 for plain constants, map FDs, map values, and subprogs - LDX from read-only maps: constant-folds the load by reading the map value directly via bpf_map_direct_read() Results that fit in 32 bits are stored per-instruction in insn_aux_data and bitmasks. bpf_prune_dead_branches() uses the computed constants to evaluate conditional branches. When both operands of a conditional jump are known constants, the branch outcome is determined statically and the instruction is rewritten to an unconditional jump. The CFG postorder is then recomputed to reflect new control flow. This eliminates dead edges so that subsequent liveness analysis doesn't propagate through dead code. Also add runtime sanity check to validate that precomputed constants match the verifier's tracked state. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-5-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:36 -07:00
Alexei Starovoitov	e6898ec751	bpf: Sort subprogs in topological order after check_cfg() Add a pass that sorts subprogs in topological order so that iterating subprog_topo_order[] walks leaf subprogs first, then their callers. This is computed as a DFS post-order traversal of the CFG. The pass runs after check_cfg() to ensure the CFG has been validated before traversing and after postorder has been computed to avoid walking dead code. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-3-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:30 -07:00
Alexei Starovoitov	503d21ef8e	bpf: Do register range validation early Instead of checking src/dst range multiple times during the main verifier pass do them once. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-2-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:26 -07:00
Alexei Starovoitov	891a05ccba	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf 7.0-rc6+ Cross-merge BPF and other fixes after downstream PR. Minor conflict in kernel/bpf/verifier.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:14:13 -07:00
Vincent Guittot	059258b0d4	sched/fair: Prevent negative lag increase during delayed dequeue Delayed dequeue feature aims to reduce the negative lag of a dequeued task while sleeping but it can happens that newly enqueued tasks will move backward the avg vruntime and increase its negative lag. When the delayed dequeued task wakes up, it has more neg lag compared to being dequeued immediately or to other tasks that have been dequeued just before theses new enqueues. Ensure that the negative lag of a delayed dequeued task doesn't increase during its delayed dequeued phase while waiting for its neg lag to diseappear. Similarly, we remove any positive lag that the delayed dequeued task could have gain during thsi period. Short slice tasks are particularly impacted in overloaded system. Test on snapdragon rb5: hackbench -T -p -l 16000000 -g 2 1> /dev/null & cyclictest -t 1 -i 2777 -D 333 --policy=fair --mlock -h 20000 -q The scheduling latency of cyclictest is: tip/sched/core tip/sched/core +this patch cyclictest slice (ms) (default)2.8 8 8 hackbench slice (ms) (default)2.8 20 20 Total Samples \| 115632 119733 119806 Average (us) \| 364 64(-82%) 61(- 5%) Median (P50) (us) \| 60 56(- 7%) 56( 0%) 90th Percentile (us) \| 1166 62(-95%) 62( 0%) 99th Percentile (us) \| 4192 73(-98%) 72(- 1%) 99.9th Percentile (us) \| 8528 2707(-68%) 1300(-52%) Maximum (us) \| 17735 14273(-20%) 13525(- 5%) Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260331162352.551501-1-vincent.guittot@linaro.org	2026-04-03 14:23:41 +02:00
Vincent Guittot	2d4cc371ba	sched/fair: Use sched_energy_enabled() Use helper sched_energy_enabled() everywhere we want to test if EAS is enabled instead of mixing sched_energy_enabled() and direct call to static_branch_unlikely(). No functional change Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260327132013.2800517-1-vincent.guittot@linaro.org	2026-04-03 14:23:41 +02:00
John Stultz	b049b81bdf	sched: Handle blocked-waiter migration (and return migration) Add logic to handle migrating a blocked waiter to a remote cpu where the lock owner is runnable. Additionally, as the blocked task may not be able to run on the remote cpu, add logic to handle return migration once the waiting task is given the mutex. Because tasks may get migrated to where they cannot run, also modify the scheduling classes to avoid sched class migrations on mutex blocked tasks, leaving find_proxy_task() and related logic to do the migrations and return migrations. This was split out from the larger proxy patch, and significantly reworked. Credits for the original patch go to: Peter Zijlstra (Intel) <peterz@infradead.org> Juri Lelli <juri.lelli@redhat.com> Valentin Schneider <valentin.schneider@arm.com> Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260324191337.1841376-11-jstultz@google.com	2026-04-03 14:23:41 +02:00
John Stultz	dec9554dc0	sched: Move attach_one_task and attach_task helpers to sched.h The fair scheduler locally introduced attach_one_task() and attach_task() helpers, but these could be generically useful so move this code to sched.h so we can use them elsewhere. One minor tweak made to utilize guard(rq_lock)(rq) to simplifiy the function. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-10-jstultz@google.com	2026-04-03 14:23:40 +02:00
John Stultz	48fda62de6	sched: Add logic to zap balance callbacks if we pick again With proxy-exec, a task is selected to run via pick_next_task(), and then if it is a mutex blocked task, we call find_proxy_task() to find a runnable owner. If the runnable owner is on another cpu, we will need to migrate the selected donor task away, after which we will pick_again can call pick_next_task() to choose something else. However, in the first call to pick_next_task(), we may have had a balance_callback setup by the class scheduler. After we pick again, its possible pick_next_task_fair() will be called which calls sched_balance_newidle() and sched_balance_rq(). This will throw a warning: [ 8.796467] rq->balance_callback && rq->balance_callback != &balance_push_callback [ 8.796467] WARNING: CPU: 32 PID: 458 at kernel/sched/sched.h:1750 sched_balance_rq+0xe92/0x1250 ... [ 8.796467] Call Trace: [ 8.796467] <TASK> [ 8.796467] ? __warn.cold+0xb2/0x14e [ 8.796467] ? sched_balance_rq+0xe92/0x1250 [ 8.796467] ? report_bug+0x107/0x1a0 [ 8.796467] ? handle_bug+0x54/0x90 [ 8.796467] ? exc_invalid_op+0x17/0x70 [ 8.796467] ? asm_exc_invalid_op+0x1a/0x20 [ 8.796467] ? sched_balance_rq+0xe92/0x1250 [ 8.796467] sched_balance_newidle+0x295/0x820 [ 8.796467] pick_next_task_fair+0x51/0x3f0 [ 8.796467] __schedule+0x23a/0x14b0 [ 8.796467] ? lock_release+0x16d/0x2e0 [ 8.796467] schedule+0x3d/0x150 [ 8.796467] worker_thread+0xb5/0x350 [ 8.796467] ? __pfx_worker_thread+0x10/0x10 [ 8.796467] kthread+0xee/0x120 [ 8.796467] ? __pfx_kthread+0x10/0x10 [ 8.796467] ret_from_fork+0x31/0x50 [ 8.796467] ? __pfx_kthread+0x10/0x10 [ 8.796467] ret_from_fork_asm+0x1a/0x30 [ 8.796467] </TASK> This is because if a RT task was originally picked, it will setup the rq->balance_callback with push_rt_tasks() via set_next_task_rt(). Once the task is migrated away and we pick again, we haven't processed any balance callbacks, so rq->balance_callback is not in the same state as it was the first time pick_next_task was called. To handle this, add a zap_balance_callbacks() helper function which cleans up the balance callbacks without running them. This should be ok, as we are effectively undoing the state set in the first call to pick_next_task(), and when we pick again, the new callback can be configured for the donor task actually selected. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-9-jstultz@google.com	2026-04-03 14:23:40 +02:00
John Stultz	f9530b3183	sched: Add assert_balance_callbacks_empty helper With proxy-exec utilizing pick-again logic, we can end up having balance callbacks set by the preivous pick_next_task() call left on the list. So pull the warning out into a helper function, and make sure we check it when we pick again. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-8-jstultz@google.com	2026-04-03 14:23:40 +02:00
John Stultz	2d76226698	sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration As we add functionality to proxy execution, we may migrate a donor task to a runqueue where it can't run due to cpu affinity. Thus, we must be careful to ensure we return-migrate the task back to a cpu in its cpumask when it becomes unblocked. Peter helpfully provided the following example with pictures: "Suppose we have a ww_mutex cycle: ,-+-* Mutex-1 <-. Task-A ---' \| \| ,-- Task-B `-> Mutex-2 -+-' Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and where Task-B holds Mutex-2 and tries to acquire Mutex-1. Then the blocked_on->owner chain will go in circles. Task-A -> Mutex-2 ^ \| \| v Mutex-1 <- Task-B We need two things: - find_proxy_task() to stop iterating the circle; - the woken task to 'unblock' and run, such that it can back-off and re-try the transaction. Now, the current code [without this patch] does: __clear_task_blocked_on(); wake_q_add(); And surely clearing ->blocked_on is sufficient to break the cycle. Suppose it is Task-B that is made to back-off, then we have: Task-A -> Mutex-2 -> Task-B (no further blocked_on) and it would attempt to run Task-B. Or worse, it could directly pick Task-B and run it, without ever getting into find_proxy_task(). Now, here is a problem because Task-B might not be runnable on the CPU it is currently on; and because !task_is_blocked() we don't get into the proxy paths, so nobody is going to fix this up. Ideally we would have dequeued Task-B alongside of clearing ->blocked_on, but alas, [the lock ordering prevents us from getting the task_rq_lock() and] spoils things." Thus we need more than just a binary concept of the task being blocked on a mutex or not. So allow setting blocked_on to PROXY_WAKING as a special value which specifies the task is no longer blocked, but needs to be evaluated for return migration before* it can be run. This will then be used in a later patch to handle proxy return-migration. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-7-jstultz@google.com	2026-04-03 14:23:40 +02:00
John Stultz	56f4b24267	sched: Fix modifying donor->blocked on without proper locking Introduce an action enum in find_proxy_task() which allows us to handle work needed to be done outside the mutex.wait_lock and task.blocked_lock guard scopes. This ensures proper locking when we clear the donor's blocked_on pointer in proxy_deactivate(), and the switch statement will be useful as we add more cases to handle later in this series. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-6-jstultz@google.com	2026-04-03 14:23:39 +02:00
John Stultz	fa4a1ff8ab	locking: Add task::blocked_lock to serialize blocked_on state So far, we have been able to utilize the mutex::wait_lock for serializing the blocked_on state, but when we move to proxying across runqueues, we will need to add more state and a way to serialize changes to this state in contexts where we don't hold the mutex::wait_lock. So introduce the task::blocked_lock, which nests under the mutex::wait_lock in the locking order, and rework the locking to use it. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-5-jstultz@google.com	2026-04-03 14:23:39 +02:00
John Stultz	f4fe6be82e	sched: Fix potentially missing balancing with Proxy Exec K Prateek pointed out that with Proxy Exec, we may have cases where we context switch in __schedule(), while the donor remains the same. This could cause balancing issues, since the put_prev_set_next() logic short-cuts if (prev == next). With proxy-exec prev is the previous donor, and next is the next donor. Should the donor remain the same, but different tasks are picked to actually run, the shortcut will have avoided enqueuing the sched class balance callback. So, if we are context switching, add logic to catch the same-donor case, and trigger the put_prev/set_next calls to ensure the balance callbacks get enqueued. Closes: https://lore.kernel.org/lkml/20ea3670-c30a-433b-a07f-c4ff98ae2379@amd.com/ Reported-by: K Prateek Nayak <kprateek.nayak@amd.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260324191337.1841376-4-jstultz@google.com	2026-04-03 14:23:39 +02:00
John Stultz	37341ec573	sched: Minimise repeated sched_proxy_exec() checking Peter noted: Compilers are really bad (as in they utterly refuse) optimizing (even when marked with __pure) the static branch things, and will happily emit multiple identical in a row. So pull out the one obvious sched_proxy_exec() branch in __schedule() and remove some of the 'implicit' ones in that path. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-3-jstultz@google.com	2026-04-03 14:23:38 +02:00
John Stultz	e0ca8991b2	sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr() With proxy-execution, the scheduler selects the donor, but for blocked donors, we end up running the lock owner. This caused some complexity, because the class schedulers make sure to remove the task they pick from their pushable task lists, which prevents the donor from being migrated, but there wasn't then anything to prevent rq->curr from being migrated if rq->curr != rq->donor. This was sort of hacked around by calling proxy_tag_curr() on the rq->curr task if we were running something other then the donor. proxy_tag_curr() did a dequeue/enqueue pair on the rq->curr task, allowing the class schedulers to remove it from their pushable list. The dequeue/enqueue pair was wasteful, and additonally K Prateek highlighted that we didn't properly undo things when we stopped proxying, leaving the lock owner off the pushable list. After some alternative approaches were considered, Peter suggested just having the RT/DL classes just avoid migrating when task_on_cpu(). So rework pick_next_pushable_dl_task() and the rt pick_next_pushable_task() functions so that they skip over the first pushable task if it is on_cpu. Then just drop all of the proxy_tag_curr() logic. Fixes: `be39617e38` ("sched: Fix proxy/current (push,pull)ability") Closes: https://lore.kernel.org/lkml/e735cae0-2cc9-4bae-b761-fcb082ed3e94@amd.com/ Reported-by: K Prateek Nayak <kprateek.nayak@amd.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260324191337.1841376-2-jstultz@google.com	2026-04-03 14:23:38 +02:00
Coiby Xu	03738dd159	crash_dump/dm-crypt: don't print in arch-specific code Patch series "kdump: Enable LUKS-encrypted dump target support in ARM64 and PowerPC", v5. CONFIG_CRASH_DM_CRYPT has been introduced to support LUKS-encrypted device dump target by addressing two challenges [1], - Kdump kernel may not be able to decrypt the LUKS partition. For some machines, a system administrator may not have a chance to enter the password to decrypt the device in kdump initramfs after the 1st kernel crashes - LUKS2 by default use the memory-hard Argon2 key derivation function which is quite memory-consuming compared to the limited memory reserved for kdump. To also enable this feature for ARM64 and PowerPC, we need to add a device tree property dmcryptkeys [2] as similar to elfcorehdr to pass the memory address of the stored info of dm-crypt keys to the kdump kernel. This patch (of 3): When the vmcore dumping target is not a LUKS-encrypted target, it's expected that there is no dm-crypt key thus no need to return -ENOENT. Also print more logs in crash_load_dm_crypt_keys. The benefit is arch-specific code can be more succinct. Link: https://lkml.kernel.org/r/20260225060347.718905-1-coxu@redhat.com Link: https://lkml.kernel.org/r/20260225060347.718905-2-coxu@redhat.com Link: https://lore.kernel.org/all/20250502011246.99238-1-coxu@redhat.com/ [1] Link: https://github.com/devicetree-org/dt-schema/pull/181 [2] Signed-off-by: Coiby Xu <coxu@redhat.com> Suggested-by: Will Deacon <will@kernel.org> Acked-by: Baoquan He <bhe@redhat.com> Cc: Arnaud Lefebvre <arnaud.lefebvre@clever-cloud.com> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Dave Young <dyoung@redhat.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Thomas Staudt <tstaudt@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-02 23:36:24 -07:00
Linus Torvalds	7b9e74c5a4	bpf-fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmnPGdMACgkQ6rmadz2v bTrNxw/9Hcn2V/Jqp/cEagmKIKqSAUFgEE+AwRbQU5YL2Yem/6Q15rnOk8pOSDT5 jqk7VbuchVmWa+a9DVy7d3XVWohk332QbvQRHfqV8P0ZpnfJa0YqdZlKg2/4/8P/ yVhLzVrGIGcvvz9CfhIynRhq/fvr7iYbSSv9JT3nig4qCYpUf7kPbXSLtxyElNWN xX36KfTxQO4xI2+iezsNwklXF25Tv59V1fNuKF2lshxS+DwaroAzAJLd3MGvTHRj 8y5kU1UDb+HeJh9DpEFjppQp4qUQjIKAiNVvXGUOe7TI/i9VTIiMfesniWKNwzYv Alo2G8fLb4nJhzNL2ol4R0I5BCYmMT55tBFvSNJQ+9Esy6azkbExmKuE1hXsUXo1 jY0TbNt58zSZEmyz9SYoFKlg4lOW4ZIMl0RtnSBRoDwtK3ThGV7QFlnKq3uPZ6ce RcpMk7cOnERLzwPnpSiACrQmzhMk+j5HG1u+Eb3rXKxYCQO6bAhpQyPDKsiXNgkL uezq2zqAnNho0/CInHGlRj7E1JnvRoHCcLBT4zzyIY/jruI8fzK0aMqGMvk/qOby BWDnJ9GG3VmGSUc/FOp3IchKCnxXhkYqsjBCP03cbIZgr1MuixZeom81OsPNmSX8 Ke+FeGNsU5zOUJ1iG2BZjdya/DAgP8hd85WVtaXyX60KKhuu45c= =w0RY -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix register equivalence for pointers to packet (Alexei Starovoitov) - Fix incorrect pruning due to atomic fetch precision tracking (Daniel Borkmann) - Fix grace period wait for bpf_link-ed tracepoints (Kumar Kartikeya Dwivedi) - Fix use-after-free of sockmap's sk->sk_socket (Kuniyuki Iwashima) - Reject direct access to nullable PTR_TO_BUF pointers (Qi Tang) - Reject sleepable kprobe_multi programs at attach time (Varun R Mallya) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Add more precision tracking tests for atomics bpf: Fix incorrect pruning due to atomic fetch precision tracking bpf: Reject sleepable kprobe_multi programs at attach time bpf: reject direct access to nullable PTR_TO_BUF pointers bpf: sockmap: Fix use-after-free of sk->sk_socket in sk_psock_verdict_data_ready(). bpf: Fix grace period wait for tracepoint bpf_link bpf: Fix regsafe() for pointers to packet	2026-04-02 18:59:56 -07:00
Harishankar Vishwanathan	b254c6d816	bpf: Simulate branches to prune based on range violations This patch fixes the invariant violations that can happen after we refine ranges & tnum based on an incorrectly-detected branch condition. For example, the branch is always true, but we miss it in is_branch_taken; we then refine based on the branch being false and end up with incoherent ranges (e.g. umax < umin). To avoid this, we can simulate the refinement on both branches. More specifically, this patch simulates both branches taken using regs_refine_cond_op and reg_bounds_sync. If the resulting register states are ill-formed on one of the branches, is_branch_taken can mark that branch as "never taken". On a more formal note, we can deduce a branch is not taken when regs_refine_cond_op or reg_bounds_sync returns an ill-formed state because the branch operators are sound (verified with Agni [1]). Soundness means that the verifier is guaranteed to produce sound outputs on the taken branches. On the non-taken branch (explored because of imprecision in the bounds), the verifier is free to produce any output. We use ill-formedness as a signal that the branch is dead and prune that branch. This patch moves the refinement logic for both branches from reg_set_min_max to their own function, simulate_both_branches_taken, which is called from is_scalar_branch_taken. As a result, reg_set_min_max now only runs sanity checks and has been renamed to reg_bounds_sanity_check_branches to reflect that. We have had five patches fixing specific cases of invariant violations in the past, all added with selftests: - commit `fbc7aef517` ("bpf: Fix u32/s32 bounds when ranges cross min/max boundary") - commit `efc11a6678` ("bpf: Improve bounds when tnum has a single possible value") - commit `f41345f47f` ("bpf: Use tnums for JEQ/JNE is_branch_taken logic") - commit `00bf8d0c6c` ("bpf: Improve bounds when s64 crosses sign boundary") - commit `6279846b9b` ("bpf: Forget ranges when refining tnum after JSET") To confirm that this patch addresses all invariant violations, we have also reverted those five commits and verified that their related selftests don't cause any invariant violation warnings anymore. Those selftests still fail but only because of misdetected branches or less-precise bounds than expected. This demonstrates that the current patch is enough to avoid the invariant violation warning AND that the previous five patches are still useful to improve branch detection. In addition to the selftests, this change was also tested with the Cilium complexity test suite: all programs were successfully loaded and it didn't change the number of processed instructions. Link: https://github.com/bpfverif/agni [1] Reported-by: syzbot+c950cc277150935cc0b5@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c950cc277150935cc0b5 Co-developed-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/a166b54a3cbbbdbcdf8a87f53045f1097176218b.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 18:23:25 -07:00
Harishankar Vishwanathan	a2a14e874b	bpf: Exit early if reg_bounds_sync gets invalid inputs In the subsequent commit, to prune dead branches we will rely on detecting ill-formed ranges using range_bounds_violations() (e.g., umin > umax) after refining register bounds using regs_refine_cond_op(). However, reg_bounds_sync() can sometimes "repair" ill-formed bounds, potentially masking a violation that was produced by regs_refine_cond_op(). This commit modifies reg_bounds_sync() to exit early if an invariant violation is already present in the input. This ensures ill-formed reg_states remain ill-formed after reg_bounds_sync(), allowing simulate_both_branches_taken() to correctly identify dead branches with a single check to range_bounds_violation(). Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/73127d628841c59cb7423d6bdcd204bf90bcdc80.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 18:23:25 -07:00
Paul Chaignon	ec1d77cb0e	bpf: Use bpf_verifier_env buffers for reg_set_min_max In a subsequent patch, the regs_refine_cond_op and reg_bounds_sync functions will be called in is_branch_taken instead of reg_set_min_max, to simulate each branch's outcome. Since they will run before we branch out, these two functions will need to work on temporary registers for the two branches. This refactoring patch prepares for that change, by introducing the temporary registers on bpf_verifier_env and using them in reg_set_min_max. This change also allows us to save one fake_reg slot as we don't need to allocate an additional temporary buffer in case of a BPF_K condition. Finally, you may notice that this patch removes the check for "false_reg1 == false_reg2" in reg_set_min_max. That check was introduced in commit `d43ad9da80` ("bpf: Skip bounds adjustment for conditional jumps on same scalar register") to avoid an invariant violation. Given that "env->false_reg1 == env->false_reg2" doesn't make sense and invariant violations are addressed in a subsequent commit, this patch just removes the check. Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Co-developed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/260b0270052944a420e1c56e6a92df4d43cadf03.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 18:23:25 -07:00
Harishankar Vishwanathan	a1311b94ef	bpf: Refactor reg_bounds_sanity_check This commit refactors reg_bounds_sanity_check to factor out the logic that performs the sanity check from the logic that does the reporting. Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/198ec3e69343e2c46dc9cbe2b1bc9be9ae2df5bd.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 18:23:24 -07:00
Michael Kelley	fd7400cfcb	genirq/chip: Invoke add_interrupt_randomness() in handle_percpu_devid_irq() handle_percpu_devid_irq() is a version of handle_percpu_irq() but with the addition of a pointer to a per-CPU devid. However, handle_percpu_irq() invokes add_interrupt_randomness(), while handle_percpu_devid_irq() currently does not. Add the missing add_interrupt_randomness(), as it is needed when per-CPU interrupts with devid's are used in VMs for interrupts from the hypervisor. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260402202400.1707-2-mhklkml@zohomail.com	2026-04-02 23:03:29 +02:00
Changwoo Min	0c4a59df37	sched_ext: Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCU Since commit `8e4f0b1ebc` ("bpf: use rcu_read_lock_dont_migrate() for trampoline.c"), the BPF prolog (__bpf_prog_enter) calls migrate_disable() only when CONFIG_PREEMPT_RCU is enabled, via rcu_read_lock_dont_migrate(). Without CONFIG_PREEMPT_RCU, the prolog never touches migration_disabled, so migration_disabled == 1 always means the task is truly migration-disabled regardless of whether it is the current task. The old unconditional p == current check was a false negative in this case, potentially allowing a migration-disabled task to be dispatched to a remote CPU and triggering scx_error in task_can_run_on_remote_rq(). Only apply the p == current disambiguation when CONFIG_PREEMPT_RCU is enabled, where the ambiguity with the BPF prolog still exists. Fixes: `8e4f0b1ebc` ("bpf: use rcu_read_lock_dont_migrate() for trampoline.c") Cc: stable@vger.kernel.org # v6.18+ Link: https://lore.kernel.org/lkml/20250821090609.42508-8-dongml2@chinatelecom.cn/ Signed-off-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-02 09:26:55 -10:00
Samuele Mariotti	b905ee77d5	sched_ext: Fix missing warning in scx_set_task_state() default case In scx_set_task_state(), the default case was setting the warn flag, but then returning immediately. This is problematic because the only purpose of the warn flag is to trigger WARN_ONCE, but the early return prevented it from ever firing, leaving invalid task states undetected and untraced. To fix this, a WARN_ONCE call is now added directly in the default case. The fix addresses two aspects: - Guarantees the invalid task states are properly logged and traced. - Provides a distinct warning message ("sched_ext: Invalid task state") specifically for states outside the defined scx_task_state enum values, making it easier to distinguish from other transition warnings. This ensures proper detection and reporting of invalid states. Signed-off-by: Samuele Mariotti <smariotti@disroot.org> Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-02 09:22:03 -10:00
Steven Rostedt	3515572dd0	tracing: Allow backup to save persistent ring buffer before it starts When the persistent ring buffer was first introduced, it did not make sense to start tracing for it on the kernel command line. That's because if there was a crash, the start of events would invalidate the events from the previous boot that had the crash. But now that there's a "backup" instance that can take a snapshot of the persistent ring buffer when boot starts, it is possible to have the persistent ring buffer start events at boot up and not lose the old events. Update the code where the boot events start after all boot time instances are created. This will allow the backup instance to copy the persistent ring buffer from the previous boot, and allow the persistent ring buffer to start tracing new events for the current boot. reserve_mem=100M:12M:trace trace_instance=boot_mapped^@trace,sched trace_instance=backup=boot_mapped The above will create a boot_mapped persistent ring buffer and enabled the scheduler events. If there's a crash, a "backup" instance will be created holding the events of the persistent ring buffer from the previous boot, while the persistent ring buffer will once again start tracing scheduler events of the current boot. Now the user doesn't have to remember to start the persistent ring buffer. It will always have the events started at each boot. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260331163924.6ccb3896@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-04-02 13:29:08 -04:00
Masami Hiramatsu (Google)	eca33fdab4	tracing: Remove the backup instance automatically after read Since the backup instance is readonly, after reading all data via pipe, no data is left on the instance. Thus it can be removed safely after closing all files. This also removes it if user resets the ring buffer manually via 'trace' file. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/177502547711.1311542.12572973358010839400.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-04-02 13:22:30 -04:00
Masami Hiramatsu (Google)	2c79da099a	tracing: Make the backup instance non-reusable Since there is no reason to reuse the backup instance, make it readonly (but erasable). Note that only backup instances are readonly, because other trace instances will be empty unless it is writable. Only backup instances have copy entries from the original. With this change, most of the trace control files are removed from the backup instance, including eventfs enable/filter etc. # find /sys/kernel/tracing/instances/backup/events/ \| wc -l 4093 # find /sys/kernel/tracing/instances/boot_map/events/ \| wc -l 9573 Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/177502546939.1311542.1826814401724828930.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-04-02 13:20:38 -04:00
Vincent Donnefort	20ad8b0888	ring-buffer: Enforce read ordering of trace_buffer cpumask and buffers On CPU hotplug, if it is the first time a trace_buffer sees a CPU, a ring_buffer_per_cpu will be allocated and its corresponding bit toggled in the cpumask. Many readers check this cpumask to know if they can safely read the ring_buffer_per_cpu but they are doing so without memory ordering and may observe the cpumask bit set while having NULL buffer pointer. Enforce the memory read ordering by sending an IPI to all online CPUs. The hotplug path is a slow-path anyway and it saves us from adding read barriers in numerous call sites. Link: https://patch.msgid.link/20260401053659.3458961-1-vdonnefort@google.com Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-04-02 13:19:09 -04:00
Daniel Borkmann	179ee84a89	bpf: Fix incorrect pruning due to atomic fetch precision tracking When backtrack_insn encounters a BPF_STX instruction with BPF_ATOMIC and BPF_FETCH, the src register (or r0 for BPF_CMPXCHG) also acts as a destination, thus receiving the old value from the memory location. The current backtracking logic does not account for this. It treats atomic fetch operations the same as regular stores where the src register is only an input. This leads the backtrack_insn to fail to propagate precision to the stack location, which is then not marked as precise! Later, the verifier's path pruning can incorrectly consider two states equivalent when they differ in terms of stack state. Meaning, two branches can be treated as equivalent and thus get pruned when they should not be seen as such. Fix it as follows: Extend the BPF_LDX handling in backtrack_insn to also cover atomic fetch operations via is_atomic_fetch_insn() helper. When the fetch dst register is being tracked for precision, clear it, and propagate precision over to the stack slot. For non-stack memory, the precision walk stops at the atomic instruction, same as regular BPF_LDX. This covers all fetch variants. Before: 0: (b7) r1 = 8 ; R1=8 1: (7b) (u64 )(r10 -8) = r1 ; R1=8 R10=fp0 fp-8=8 2: (b7) r2 = 0 ; R2=0 3: (db) r2 = atomic64_fetch_add((u64 )(r10 -8), r2) ; R2=8 R10=fp0 fp-8=mmmmmmmm 4: (bf) r3 = r10 ; R3=fp0 R10=fp0 5: (0f) r3 += r2 mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r2 stack= before 4: (bf) r3 = r10 mark_precise: frame0: regs=r2 stack= before 3: (db) r2 = atomic64_fetch_add((u64 )(r10 -8), r2) mark_precise: frame0: regs=r2 stack= before 2: (b7) r2 = 0 6: R2=8 R3=fp8 6: (b7) r0 = 0 ; R0=0 7: (95) exit After: 0: (b7) r1 = 8 ; R1=8 1: (7b) (u64 )(r10 -8) = r1 ; R1=8 R10=fp0 fp-8=8 2: (b7) r2 = 0 ; R2=0 3: (db) r2 = atomic64_fetch_add((u64 )(r10 -8), r2) ; R2=8 R10=fp0 fp-8=mmmmmmmm 4: (bf) r3 = r10 ; R3=fp0 R10=fp0 5: (0f) r3 += r2 mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r2 stack= before 4: (bf) r3 = r10 mark_precise: frame0: regs=r2 stack= before 3: (db) r2 = atomic64_fetch_add((u64 )(r10 -8), r2) mark_precise: frame0: regs= stack=-8 before 2: (b7) r2 = 0 mark_precise: frame0: regs= stack=-8 before 1: (7b) (u64 )(r10 -8) = r1 mark_precise: frame0: regs=r1 stack= before 0: (b7) r1 = 8 6: R2=8 R3=fp8 6: (b7) r0 = 0 ; R0=0 7: (95) exit Fixes: `5ffa25502b` ("bpf: Add instructions for atomic_[cmp]xchg") Fixes: `5ca419f286` ("bpf: Add BPF_FETCH field / create atomic_fetch_add instruction") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260331222020.401848-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:57:59 -07:00
Varun R Mallya	eb7024bfcc	bpf: Reject sleepable kprobe_multi programs at attach time kprobe.multi programs run in atomic/RCU context and cannot sleep. However, bpf_kprobe_multi_link_attach() did not validate whether the program being attached had the sleepable flag set, allowing sleepable helpers such as bpf_copy_from_user() to be invoked from a non-sleepable context. This causes a "sleeping function called from invalid context" splat: BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo preempt_count: 1, expected: 0 RCU nest depth: 2, expected: 0 Fix this by rejecting sleepable programs early in bpf_kprobe_multi_link_attach(), before any further processing. Fixes: `0dcac27254` ("bpf: Add multi kprobe link") Signed-off-by: Varun R Mallya <varunrmallya@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Leon Hwang <leon.hwang@linux.dev> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260401191126.440683-1-varunrmallya@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:48:46 -07:00
Qi Tang	b0db1accbc	bpf: reject direct access to nullable PTR_TO_BUF pointers check_mem_access() matches PTR_TO_BUF via base_type() which strips PTR_MAYBE_NULL, allowing direct dereference without a null check. Map iterator ctx->key and ctx->value are PTR_TO_BUF \| PTR_MAYBE_NULL. On stop callbacks these are NULL, causing a kernel NULL dereference. Add a type_may_be_null() guard to the PTR_TO_BUF branch, matching the existing PTR_TO_BTF_ID pattern. Fixes: `20b2aff4bc` ("bpf: Introduce MEM_RDONLY flag") Signed-off-by: Qi Tang <tpluszz77@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260402092923.38357-2-tpluszz77@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:47:13 -07:00
Mykyta Yatsenko	cc878b4144	bpf: Migrate dynptr file to kmalloc_nolock Replace bpf_mem_alloc/bpf_mem_free with kmalloc_nolock/kfree_nolock for bpf_dynptr_file_impl, continuing the migration away from bpf_mem_alloc now that kmalloc can be used from NMI context. freader_cleanup() runs before kfree_nolock() while the dynptr still holds exclusive access, so plain kfree_nolock() is safe — no concurrent readers can access the object. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260330-kmalloc_special-v2-2-c90403f92ff0@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:31:42 -07:00
Mykyta Yatsenko	90f51ebff2	bpf: Migrate bpf_task_work to kmalloc_nolock Replace bpf_mem_alloc/bpf_mem_free with kmalloc_nolock/kfree_rcu for bpf_task_work_ctx. Replace guard(rcu_tasks_trace)() with guard(rcu)() in bpf_task_work_irq(). The function only accesses ctx struct members (not map values), so tasks trace protection is not needed - regular RCU is sufficient since ctx is freed via kfree_rcu. The guard in bpf_task_work_callback() remains as tasks trace since it accesses map values from process context. Sleepable BPF programs hold rcu_read_lock_trace but not regular rcu_read_lock. Since kfree_rcu waits for a regular RCU grace period, the ctx memory can be freed while a sleepable program is still running. Add scoped_guard(rcu) around the pointer read and refcount tryget in bpf_task_work_acquire_ctx to close this race window. Since kfree_rcu uses call_rcu internally which is not safe from NMI context, defer destruction via irq_work when irqs are disabled. For the lost-cmpxchg path the ctx was never published, so kfree_nolock is safe. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260330-kmalloc_special-v2-1-c90403f92ff0@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:31:42 -07:00
K Prateek Nayak	c03791085a	cpufreq: Pass the policy to cpufreq_driver->adjust_perf() cpufreq_cpu_get() can sleep on PREEMPT_RT in presence of concurrent writer(s), however amd-pstate depends on fetching the cpudata via the policy's driver data which necessitates grabbing the reference. Since schedutil governor can call "cpufreq_driver->update_perf()" during sched_tick/enqueue/dequeue with rq_lock held and IRQs disabled, fetching the policy object using the cpufreq_cpu_get() helper in the scheduler fast-path leads to "BUG: scheduling while atomic" on PREEMPT_RT [1]. Pass the cached cpufreq policy object in sg_policy to the update_perf() instead of just the CPU. The CPU can be inferred using "policy->cpu". The lifetime of cpufreq_policy object outlasts that of the governor and the cpufreq driver (allocated when the CPU is onlined and only reclaimed when the CPU is offlined / the CPU device is removed) which makes it safe to be referenced throughout the governor's lifetime. Closes:https://lore.kernel.org/all/20250731092316.3191-1-spasswolf@web.de/ [1] Fixes: `1d215f0319` ("cpufreq: amd-pstate: Add fast switch function for AMD P-State") Reported-by: Bert Karwatzki <spasswolf@web.de> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Gary Guo <gary@garyguo.net> # Rust Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Reviewed-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com> Link: https://lore.kernel.org/r/20260316081849.19368-3-kprateek.nayak@amd.com Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>	2026-04-02 11:30:24 -05:00
Leon Hwang	611fe4b79a	bpf: Fix abuse of kprobe_write_ctx via freplace uprobe programs are allowed to modify struct pt_regs. Since the actual program type of uprobe is KPROBE, it can be abused to modify struct pt_regs via kprobe+freplace when the kprobe attaches to kernel functions. For example, SEC("?kprobe") int kprobe(struct pt_regs regs) { return 0; } SEC("?freplace") int freplace_kprobe(struct pt_regs regs) { regs->di = 0; return 0; } freplace_kprobe prog will attach to kprobe prog. kprobe prog will attach to a kernel function. Without this patch, when the kernel function runs, its first arg will always be set as 0 via the freplace_kprobe prog. To fix the abuse of kprobe_write_ctx=true via kprobe+freplace, disallow attaching freplace programs on kprobe programs with different kprobe_write_ctx values. Fixes: `7384893d97` ("bpf: Allow uprobe program to change context registers") Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260331145353.87606-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:29:49 -07:00
Vincent Donnefort	ce47b798ed	tracing: Non-consuming read for trace remotes with an offline CPU When a trace_buffer is created while a CPU is offline, this CPU is cleared from the trace_buffer CPU mask, preventing the creation of a non-consuming iterator (ring_buffer_iter). For trace remotes, it means the iterator fails to be allocated (-ENOMEM) even though there are available ring buffers in the trace_buffer. For non-consuming reads of trace remotes, skip missing ring_buffer_iter to allow reading the available ring buffers. Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://patch.msgid.link/20260401045100.3394299-2-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>	2026-04-02 14:16:09 +01:00
Ingo Molnar	9853914c08	Merge branch 'sched/urgent' into sched/core, to resolve conflicts The following fix in sched/urgent: `e08d007f9d` ("sched/debug: Fix avg_vruntime() usage") is in conflict with this pending commit in sched/core: `4823725d9d` ("sched/fair: Increase weight bits for avg_vruntime") Both modify the same variable definition and initialization blocks, resolve it by merging the two. Conflicts: kernel/sched/debug.c Signed-off-by: Ingo Molnar <mingo@kernel.org>	2026-04-02 15:04:09 +02:00
Peter Zijlstra	e08d007f9d	sched/debug: Fix avg_vruntime() usage John reported that stress-ng-yield could make his machine unhappy and managed to bisect it to commit `b3d99f43c7` ("sched/fair: Fix zero_vruntime tracking"). The commit in question changes avg_vruntime() from a function that is a pure reader, to a function that updates variables. This turns an unlocked sched/debug usage of this function from a minor mistake into a data corruptor. Fixes: `af4cf40470` ("sched/fair: Add cfs_rq::avg_vruntime") Fixes: `b3d99f43c7` ("sched/fair: Fix zero_vruntime tracking") Reported-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260401132355.196370805@infradead.org	2026-04-02 13:42:43 +02:00
Peter Zijlstra	1319ea5752	sched/fair: Fix zero_vruntime tracking fix John reported that stress-ng-yield could make his machine unhappy and managed to bisect it to commit `b3d99f43c7` ("sched/fair: Fix zero_vruntime tracking"). The combination of yield and that commit was specific enough to hypothesize the following scenario: Suppose we have 2 runnable tasks, both doing yield. Then one will be eligible and one will not be, because the average position must be in between these two entities. Therefore, the runnable task will be eligible, and be promoted a full slice (all the tasks do is yield after all). This causes it to jump over the other task and now the other task is eligible and current is no longer. So we schedule. Since we are runnable, there is no {de,en}queue. All we have is the __{en,de}queue_entity() from {put_prev,set_next}_task(). But per the fingered commit, those two no longer move zero_vruntime. All that moves zero_vruntime are tick and full {de,en}queue. This means, that if the two tasks playing leapfrog can reach the critical speed to reach the overflow point inside one tick's worth of time, we're up a creek. Additionally, when multiple cgroups are involved, there is no guarantee the tick will in fact hit every cgroup in a timely manner. Statistically speaking it will, but that same statistics does not rule out the possibility of one cgroup not getting a tick for a significant amount of time -- however unlikely. Therefore, just like with the yield() case, force an update at the end of every slice. This ensures the update is never more than a single slice behind and the whole thing is within 2 lag bounds as per the comment on entity_key(). Fixes: `b3d99f43c7` ("sched/fair: Fix zero_vruntime tracking") Reported-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260401132355.081530332@infradead.org	2026-04-02 13:42:43 +02:00
Jiri Pirko	f0548044a0	dma-mapping: introduce DMA_ATTR_CC_SHARED for shared memory Current CC designs don't place a vIOMMU in front of untrusted devices. Instead, the DMA API forces all untrusted device DMA through swiotlb bounce buffers (is_swiotlb_force_bounce()) which copies data into shared memory on behalf of the device. When a caller has already arranged for the memory to be shared via set_memory_decrypted(), the DMA API needs to know so it can map directly using the unencrypted physical address rather than bounce buffering. Following the pattern of DMA_ATTR_MMIO, add DMA_ATTR_CC_SHARED for this purpose. Like the MMIO case, only the caller knows what kind of memory it has and must inform the DMA API for it to work correctly. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Sumit Semwal <sumit.semwal@linaro.org> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260325192352.437608-2-jiri@resnulli.us	2026-04-02 07:29:33 +02:00
Andrey Grodzovsky	93e8fd1a56	ftrace: Use kallsyms binary search for single-symbol lookup When ftrace_lookup_symbols() is called with a single symbol (cnt == 1), use kallsyms_lookup_name() for O(log N) binary search instead of the full linear scan via kallsyms_on_each_symbol(). ftrace_lookup_symbols() was designed for batch resolution of many symbols in a single pass. For large cnt this is efficient: a single O(N) walk over all symbols with O(log cnt) binary search into the sorted input array. But for cnt == 1 it still decompresses all ~200K kernel symbols only to match one. kallsyms_lookup_name() uses the sorted kallsyms index and needs only ~17 decompressions for a single lookup. This is the common path for kprobe.session with exact function names, where libbpf sends one symbol per BPF_LINK_CREATE syscall. If binary lookup fails (duplicate symbol names where the first match is not ftrace-instrumented), the function falls through to the existing linear scan path. Before (cnt=1, 50 kprobe.session programs): Attach: 858 ms (kallsyms_expand_symbol 25% of CPU) After: Attach: 52 ms (16x faster) Cc: <bpf@vger.kernel.org> Link: https://patch.msgid.link/20260302200837.317907-3-andrey.grodzovsky@crowdstrike.com Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-04-01 16:58:36 -04:00
Breno Leitao	4cdc8a7389	workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope Set WQ_AFFN_CACHE_SHARD as the default affinity scope for unbound workqueues. On systems where many CPUs share one LLC, the previous default (WQ_AFFN_CACHE) collapses all CPUs to a single worker pool, causing heavy spinlock contention on pool->lock. WQ_AFFN_CACHE_SHARD subdivides each LLC into smaller groups, providing a better balance between locality and contention. Users can revert to the previous behavior with workqueue.default_affinity_scope=cache. On systems with 8 or fewer cores per LLC, CACHE_SHARD produces a single shard covering the entire LLC, making it functionally identical to the previous CACHE default. The sharding only activates when an LLC has more than 8 cores. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-01 10:24:18 -10:00
Breno Leitao	5920d046f7	workqueue: add WQ_AFFN_CACHE_SHARD affinity scope On systems where many CPUs share one LLC, unbound workqueues using WQ_AFFN_CACHE collapse to a single worker pool, causing heavy spinlock contention on pool->lock. For example, Chuck Lever measured 39% of cycles lost to native_queued_spin_lock_slowpath on a 12-core shared-L3 NFS-over-RDMA system. The existing affinity hierarchy (cpu, smt, cache, numa, system) offers no intermediate option between per-LLC and per-SMT-core granularity. Add WQ_AFFN_CACHE_SHARD, which subdivides each LLC into groups of at most wq_cache_shard_size cores (default 8, tunable via boot parameter). Shards are always split on core (SMT group) boundaries so that Hyper-Threading siblings are never placed in different pods. Cores are distributed across shards as evenly as possible -- for example, 36 cores in a single LLC with max shard size 8 produces 5 shards of 8+7+7+7+7 cores. The implementation follows the same comparator pattern as other affinity scopes: precompute_cache_shard_ids() pre-fills the cpu_shard_id[] array from the already-initialized WQ_AFFN_CACHE and WQ_AFFN_SMT topology, and cpus_share_cache_shard() is passed to init_pod_type(). Benchmark on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread), show cache_shard delivers ~5x the throughput and ~6.5x lower p50 latency compared to cache scope on this 72-core single-LLC system. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-01 10:24:18 -10:00
Matthew Brost	703ccb63ae	workqueue: Add pool_workqueue to pending_pwqs list when unplugging multiple inactive works In unplug_oldest_pwq(), the first inactive work item on the pool_workqueue is activated correctly. However, if multiple inactive works exist on the same pool_workqueue, subsequent works fail to activate because wq_node_nr_active.pending_pwqs is empty — the list insertion is skipped when the pool_workqueue is plugged. Fix this by checking for additional inactive works in unplug_oldest_pwq() and updating wq_node_nr_active.pending_pwqs accordingly. Fixes: `4c065dbce1` ("workqueue: Enable unbound cpumask update on ordered workqueues") Cc: stable@vger.kernel.org Cc: Carlos Santa <carlos.santa@intel.com> Cc: Ryan Neph <ryanneph@google.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Waiman Long <longman@redhat.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Waiman Long <longman@redhat.com>	2026-04-01 10:18:22 -10:00
Zhan Xusheng	c5283a1ffd	hrtimer: Fix incorrect #endif comment for BITS_PER_LONG check The #endif comment says "BITS_PER_LONG >= 64", but the corresponding #if guard is "BITS_PER_LONG < 64". The comment was originally correct when the block had a three-way #if/#else/#endif structure, where the #else branch provided a 64-bit inline version. Commit `79bf2bb335` ("[PATCH] tick-management: dyntick / highres functionality") removed the #else branch but did not update the #endif comment, leaving it inconsistent with the remaining #if condition. Fix the comment to match the preprocessor guard. Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260331074811.26147-1-zhanxusheng@xiaomi.com	2026-04-01 18:48:15 +02:00
Thomas Weißschuh	7138a8698a	timens: Use task_lock guard in timens_get() Simplify the logic in timens_get() by converting the task_lock usage to a guard(). Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260330-timens-cleanup-v1-4-936e91c9dd30@linutronix.de	2026-04-01 17:13:36 +02:00
Thomas Weißschuh	6d89dc8b1c	timens: Use mutex guard in proc_timens_set_offset() Simplify the logic in proc_timens_set_offset() by converting the mutex usage to a guard(). Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260330-timens-cleanup-v1-3-936e91c9dd30@linutronix.de	2026-04-01 17:13:35 +02:00
Thomas Weißschuh	3fa3aeb4a5	timens: Simplify some calls to put_time_ns() Use the new __free() based cleanup helpers to simplify some functions. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260330-timens-cleanup-v1-2-936e91c9dd30@linutronix.de	2026-04-01 17:13:35 +02:00
Sebastian Andrzej Siewior	34d85ad426	genirq/affinity: Remove cpus_read_lock() while reading cpu_possible_mask cpu_possible_mask is set early during boot based on information from the firmware. After that it remains read only and is never changed. Therefore there is no need to acquire the CPU-hotplug lock while reading it. Remove cpus_read_*() while accessing cpu_possible_mask. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260401121334.xeMOSC1v@linutronix.de	2026-04-01 16:09:05 +02:00
Nam Cao	00f0dadde8	rv: Allow epoll in rtapp-sleep monitor Since commit `0c43094f8c` ("eventpoll: Replace rwlock with spinlock"), epoll_wait is real-time-safe syscall for sleeping. Add epoll_wait to the list of rt-safe sleeping APIs. Signed-off-by: Nam Cao <namcao@linutronix.de> Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/r/20260401130828.3115428-1-namcao@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2026-04-01 15:18:30 +02:00
Mike Rapoport (Microsoft)	87ce9e83ab	memblock, treewide: make memblock_free() handle late freeing It shouldn't be responsibility of memblock users to detect if they free memory allocated from memblock late and should use memblock_free_late(). Make memblock_free() and memblock_phys_free() take care of late memory freeing and drop memblock_free_late(). Link: https://patch.msgid.link/20260323074836.3653702-9-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>	2026-04-01 11:20:15 +03:00
Siddharth Nayyar	b4760ff2a5	module: deprecate usage of _gpl sections in module loader The _gpl section are not being used populated by modpost anymore. Hence the module loader doesn't need to find and process these sections in modules. This patch also simplifies symbol finding logic in module loader since *_gpl sections don't have to be searched anymore. Signed-off-by: Siddharth Nayyar <sidnayyar@google.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>	2026-03-31 23:42:52 +00:00
Siddharth Nayyar	55fcb926b6	module: use kflagstab instead of _gpl sections Read kflagstab section for vmlinux and modules to determine whether kernel symbols are GPL only. This patch eliminates the need for fragmenting the ksymtab for infering the value of GPL-only symbol flag, henceforth stop populating _gpl versions of the ksymtab and kcrctab in modpost. Signed-off-by: Siddharth Nayyar <sidnayyar@google.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>	2026-03-31 23:42:52 +00:00
Kumar Kartikeya Dwivedi	c76fef7dcd	bpf: Fix grace period wait for tracepoint bpf_link Recently, tracepoints were switched from using disabled preemption (which acts as RCU read section) to SRCU-fast when they are not faultable. This means that to do a proper grace period wait for programs running in such tracepoints, we must use SRCU's grace period wait. This is only for non-faultable tracepoints, faultable ones continue using RCU Tasks Trace. However, bpf_link_free() currently does call_rcu() for all cases when the link is non-sleepable (hence, for tracepoints, non-faultable). Fix this by doing a call_srcu() grace period wait. As far RCU Tasks Trace gp -> RCU gp chaining is concerned, it is deemed unnecessary for tracepoint programs. The link and program are either accessed under RCU Tasks Trace protection, or SRCU-fast protection now. The earlier logic of chaining both RCU Tasks Trace and RCU gp waits was to generalize the logic, even if it conceded an extra RCU gp wait, however that is unnecessary for tracepoints even before this change. In practice no cost was paid since rcu_trace_implies_rcu_gp() was always true. Hence we need not chaining any RCU gp after the SRCU gp. For instance, in the non-faultable raw tracepoint, the RCU read section of the program in __bpf_trace_run() is enclosed in the SRCU gp, likewise for faultable raw tracepoint, the program is under the RCU Tasks Trace protection. Hence, the outermost scope can be waited upon to ensure correctness. Also, sleepable programs cannot be attached to non-faultable tracepoints, so whenever program or link is sleepable, only RCU Tasks Trace protection is being used for the link and prog. Fixes: `a46023d561` ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast") Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com> Reviewed-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20260331211021.1632902-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-31 16:01:13 -07:00
Alexei Starovoitov	a8502a79e8	bpf: Fix regsafe() for pointers to packet In case rold->reg->range == BEYOND_PKT_END && rcur->reg->range == N regsafe() may return true which may lead to current state with valid packet range not being explored. Fix the bug. Fixes: `6d94e741a8` ("bpf: Support for pointers beyond pkt_end.") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Amery Hung <ameryhung@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20260331204228.26726-1-alexei.starovoitov@gmail.com	2026-03-31 15:18:10 -07:00
Waiman Long	2ab7393831	workqueue: Remove HK_TYPE_WQ from affecting wq_unbound_cpumask For historical reason, wq_unbound_cpumask is initially set as intersection of HK_TYPE_DOMAIN, HK_TYPE_WQ and workqueue.unbound_cpus boot command line option. At run time, users can update the unbound cpumask via the /sys/devices/virtual/workqueue/cpumask sysfs file. Creation and modification of cpuset isolated partitions will also update wq_unbound_cpumask based on the latest HK_TYPE_DOMAIN cpumask. The HK_TYPE_WQ cpumask is out of the picture with these runtime updates. Complete the transition by taking HK_TYPE_WQ out from the workqueue code and make it depends on HK_TYPE_DOMAIN only from the housekeeping side. The final goal is to eliminate HK_TYPE_WQ as a housekeeping cpumask type. Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-31 11:43:25 -10:00
Linus Torvalds	9147566d80	sched_ext: Fixes for v7.0-rc6 - Fix SCX_KICK_WAIT deadlock where multiple CPUs waiting for each other in hardirq context form a cycle. Move the wait to a balance callback which can drop the rq lock and process IPIs. - Fix inconsistent NUMA node lookup in scx_select_cpu_dfl() where the waker_node used cpu_to_node() while prev_cpu used scx_cpu_node_if_enabled(), leading to undefined behavior when per-node idle tracking is disabled. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCacwiiQ4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGVILAP44s30JBpNyJ9JhAiCoTYzxzOXqqGbotnpQckMF +7WoJAD/Z9dJO/Sw/AH0fX6WVJDmO0QsQvFXLXJBxWy7A5XVAA0= =2DW5 -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix SCX_KICK_WAIT deadlock where multiple CPUs waiting for each other in hardirq context form a cycle. Move the wait to a balance callback which can drop the rq lock and process IPIs. - Fix inconsistent NUMA node lookup in scx_select_cpu_dfl() where the waker_node used cpu_to_node() while prev_cpu used scx_cpu_node_if_enabled(), leading to undefined behavior when per-node idle tracking is disabled. * tag 'sched_ext-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: selftests/sched_ext: Add cyclic SCX_KICK_WAIT stress test sched_ext: Fix SCX_KICK_WAIT deadlock by deferring wait to balance callback sched_ext: Fix inconsistent NUMA node lookup in scx_select_cpu_dfl()	2026-03-31 14:23:12 -07:00
Linus Torvalds	0958d657b4	workqueue: Fixes for v7.0-rc6 - Fix false positive stall reports on weakly ordered architectures where the lockless worklist/timestamp check in the watchdog can observe stale values due to memory reordering. Recheck under pool->lock to confirm. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCacwiew4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGX66AQCT5ZNUV46mQdQU8TXV4qMiAW1gdwoUT+rvcQ9Q u2jA/AD/S1uHycAWKqA+TTL8NfNvhgyhgq1AVbYUlTRXCz3iwgU= =BlI5 -----END PGP SIGNATURE----- Merge tag 'wq-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: - Fix false positive stall reports on weakly ordered architectures where the lockless worklist/timestamp check in the watchdog can observe stale values due to memory reordering. Recheck under pool->lock to confirm. * tag 'wq-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Better describe stall check workqueue: Fix false positive stall reports	2026-03-31 14:20:39 -07:00
Linus Torvalds	53d85a2056	cgroup: Fixes for v7.0-rc6 - Fix cgroup rmdir racing with dying tasks. Deferred task cgroup unlink introduced a window where cgroup.procs is empty but the cgroup is still populated, causing rmdir to fail with -EBUSY and selftest failures. Make rmdir wait for dying tasks to fully leave and fix selftests to not depend on synchronous populated updates. - Fix cpuset v1 task migration failure from empty cpusets under strict security policies. When CPU hotplug removes the last CPU from a v1 cpuset, tasks must be migrated to an ancestor without a security_task_setscheduler() check that would block the migration. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCacwibg4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGXHEAP98nVEKyl7c7+sXYtwOPn8KEhdHkdpHyPZwhpS2 1wLhaQEAm8yO49s7IgvGPWSz0s/gQdmF5/x8RAee0sJsZALvGQg= =bUUt -----END PGP SIGNATURE----- Merge tag 'cgroup-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix cgroup rmdir racing with dying tasks. Deferred task cgroup unlink introduced a window where cgroup.procs is empty but the cgroup is still populated, causing rmdir to fail with -EBUSY and selftest failures. Make rmdir wait for dying tasks to fully leave and fix selftests to not depend on synchronous populated updates. - Fix cpuset v1 task migration failure from empty cpusets under strict security policies. When CPU hotplug removes the last CPU from a v1 cpuset, tasks must be migrated to an ancestor without a security_task_setscheduler() check that would block the migration. * tag 'cgroup-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: Skip security check for hotplug induced v1 task migration cgroup/cpuset: Simplify setsched decision check in task iteration loop of cpuset_can_attach() cgroup: Fix cgroup_drain_dying() testing the wrong condition selftests/cgroup: Don't require synchronous populated update on task exit cgroup: Wait for dying tasks to leave on rmdir	2026-03-31 13:59:51 -07:00
Waiman Long	089f3fcd69	cgroup/cpuset: Skip security check for hotplug induced v1 task migration When a CPU hot removal causes a v1 cpuset to lose all its CPUs, the cpuset hotplug handler will schedule a work function to migrate tasks in that cpuset with no CPU to its ancestor to enable those tasks to continue running. If a strict security policy is in place, however, the task migration may fail when security_task_setscheduler() call in cpuset_can_attach() returns a -EACCES error. That will mean that those tasks will have no CPU to run on. The system administrators will have to explicitly intervene to either add CPUs to that cpuset or move the tasks elsewhere if they are aware of it. This problem was found by a reported test failure in the LTP's cpuset_hotplug_test.sh. Fix this problem by treating this special case as an exception to skip the setsched security check in cpuset_can_attach() when a v1 cpuset with tasks have no CPU left. With that patch applied, the cpuset_hotplug_test.sh test can be run successfully without failure. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-31 09:14:13 -10:00
Waiman Long	bbe5ab8191	cgroup/cpuset: Simplify setsched decision check in task iteration loop of cpuset_can_attach() Centralize the check required to run security_task_setscheduler() in the task iteration loop of cpuset_can_attach() outside of the loop as it has no dependency on the characteristics of the tasks themselves. There is no functional change. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-31 09:14:13 -10:00
Steven Rostedt	8053f49fed	tracing: Remove duplicate latency_fsnotify() stub When the SNAPSHOT is defined but FSNOTIFY is not the latency_fsnotify() function is turned into a static inline stub. But this stub was defined in both trace.h and trace_snapshot.c causing a error in build when CONFIG_SNAPSHOT is defined but FSNOTIFY is not. The stub is not needed in trace_snapshot.c as it will be defined in trace.h, remove it from the C file. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260330205859.24c0aae3@gandalf.local.home Fixes: `bade44fe54` ("tracing: Move snapshot code out of trace.c and into trace_snapshot.c") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202603310604.lGE9LDBK-lkp@intel.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-03-31 14:58:39 -04:00
Wesley Atwell	d1a03c2906	tracing: Preserve repeated trace_trigger boot parameters trace_trigger= tokenizes bootup_trigger_buf in place and stores pointers into that buffer for later trigger registration. Repeated trace_trigger= parameters overwrite the buffer contents from earlier calls, leaving only the last set of parsed event and trigger strings. Keep each new trace_trigger= string at the end of bootup_trigger_buf and parse only the appended range. That preserves the earlier event and trigger strings while still letting repeated parameters queue additional boot-time triggers. This also lets Bootconfig array values work naturally when they expand to repeated trace_trigger= entries. Before this change, only the last trace_trigger= instance survived boot. Link: https://patch.msgid.link/20260330181103.1851230-2-atwellwea@gmail.com Signed-off-by: Wesley Atwell <atwellwea@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-03-31 14:52:56 -04:00
Wesley Atwell	842b74e5ce	tracing: Append repeated boot-time tracing parameters Some tracing boot parameters already accept delimited value lists, but their __setup() handlers keep only the last instance seen at boot. Make repeated instances append to the same boot-time buffer in the format each parser already consumes. Use a shared trace_append_boot_param() helper for the ftrace filters, trace_options, and kprobe_event boot parameters. This also lets Bootconfig array values work naturally when they expand to repeated param=value entries. Before this change, only the last instance from each repeated parameter survived boot. Link: https://patch.msgid.link/20260330181103.1851230-1-atwellwea@gmail.com Signed-off-by: Wesley Atwell <atwellwea@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-03-31 14:52:56 -04:00
Loïc Grégoire	bf56987c11	printk: ringbuffer: fix errors in comments The printk ringbuffer implementation is described in the comment as using three ringbuffers, but the current implementation uses two (desc and data). Update the comment so it matches the code. Fix few more known issues in the comments. Signed-off-by: Loïc Grégoire <loicgre@gmail.com> Reviewed-by: John Ogness <john.ogness@linutronix.de> Link: https://patch.msgid.link/20260328021855.53956-1-loicgre@gmail.com [pmladek@suse.com: Fixed few more issues in the comments by John Ogness.] Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com>	2026-03-31 17:40:23 +02:00
Gabriele Monaco	b133207deb	rv: Add nomiss deadline monitor Add the deadline monitors collection to validate the deadline scheduler, both for deadline tasks and servers. The currently implemented monitors are: * nomiss: validate dl entities run to completion before their deadiline Reviewed-by: Nam Cao <namcao@linutronix.de> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20260330111010.153663-13-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2026-03-31 16:47:18 +02:00
Gabriele Monaco	c85dbddad7	sched/deadline: Move some utility functions to deadline.h Some utility functions on sched_dl_entity can be useful outside of deadline.c , for instance for modelling, without relying on raw structure fields. Move functions like dl_task_of and dl_is_implicit to deadline.h to make them available outside. Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20260330111010.153663-12-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2026-03-31 16:47:17 +02:00
Gabriele Monaco	820725b0eb	sched: Add deadline tracepoints Add the following tracepoints: * sched_dl_throttle(dl_se, cpu, type): Called when a deadline entity is throttled * sched_dl_replenish(dl_se, cpu, type): Called when a deadline entity's runtime is replenished * sched_dl_update(dl_se, cpu, type): Called when a deadline entity updates without throttle or replenish * sched_dl_server_start(dl_se, cpu, type): Called when a deadline server is started * sched_dl_server_stop(dl_se, cpu, type): Called when a deadline server is stopped Those tracepoints can be useful to validate the deadline scheduler with RV and are not exported to tracefs. Reviewed-by: Phil Auld <pauld@redhat.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20260330111010.153663-11-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2026-03-31 16:47:17 +02:00
Gabriele Monaco	2b406fdb33	rv: Convert the opid monitor to a hybrid automaton The opid monitor validates that wakeup and need_resched events only occur with interrupts and preemption disabled by following the preemptirq tracepoints. As reported in [1], those tracepoints might be inaccurate in some situations (e.g. NMIs). Since the monitor doesn't validate other ordering properties, remove the dependency on preemptirq tracepoints and convert the monitor to a hybrid automaton to validate the constraint during event handling. This makes the monitor more robust by also removing the workaround for interrupts missing the preemption tracepoints, which was working on PREEMPT_RT only and allows the monitor to be built on kernels without the preemptirqs tracepoints. [1] - https://lore.kernel.org/lkml/20250625120823.60600-1-gmonaco@redhat.com Reviewed-by: Nam Cao <namcao@linutronix.de> Link: https://lore.kernel.org/r/20260330111010.153663-8-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2026-03-31 16:47:17 +02:00
Gabriele Monaco	13578a0871	rv: Add sample hybrid monitor stall Add a sample monitor to showcase hybrid/timed automata. The stall monitor identifies tasks stalled for longer than a threshold and reacts when that happens. Reviewed-by: Nam Cao <namcao@linutronix.de> Link: https://lore.kernel.org/r/20260330111010.153663-7-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2026-03-31 16:47:17 +02:00
Gabriele Monaco	f5587d1b6e	rv: Add Hybrid Automata monitor type Deterministic automata define which events are allowed in every state, but cannot define more sophisticated constraint taking into account the system's environment (e.g. time or other states not producing events). Add the Hybrid Automata monitor type as an extension of Deterministic automata where each state transition is validating a constraint on a finite number of environment variables. Hybrid automata can be used to implement timed automata, where the environment variables are clocks. Also implement the necessary functionality to handle clock constraints (ns or jiffy granularity) on state and events. Reviewed-by: Nam Cao <namcao@linutronix.de> Link: https://lore.kernel.org/r/20260330111010.153663-3-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2026-03-31 16:47:16 +02:00
Marek Szyprowski	27e2e9b9b4	Merge branch 'dma-contig-for-7.1-modules-prep-v4' into dma-mapping-for-next Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>	2026-03-31 14:52:03 +02:00
Maxime Ripard	6207948f38	dma: contiguous: Export dev_get_cma_area() The CMA dma-buf heap uses the dev_get_cma_area() function to retrieve the default contiguous area. Now that this function is no longer inlined, and since we want to turn the CMA heap into a module, let's export it. Signed-off-by: Maxime Ripard <mripard@kernel.org> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260331-dma-buf-heaps-as-modules-v4-4-e18fda504419@kernel.org	2026-03-31 13:27:20 +02:00
Maxime Ripard	633040f853	dma: contiguous: Make dma_contiguous_default_area static Now that dev_get_cma_area() is no longer inline, we don't have any user of dma_contiguous_default_area() outside of contiguous.c so we can make it static. Signed-off-by: Maxime Ripard <mripard@kernel.org> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260331-dma-buf-heaps-as-modules-v4-3-e18fda504419@kernel.org	2026-03-31 13:27:20 +02:00
Maxime Ripard	b3707be95f	dma: contiguous: Make dev_get_cma_area() a proper function As we try to enable dma-buf heaps, and the CMA one in particular, to compile as modules, we need to export dev_get_cma_area(). It's currently implemented as an inline function that returns either the content of device->cma_area or dma_contiguous_default_area. Thus, it means we need to export dma_contiguous_default_area, which isn't really something we want any module to have access to. Instead, let's make dev_get_cma_area() a proper function we will be able to export so we can avoid exporting dma_contiguous_default_area. Signed-off-by: Maxime Ripard <mripard@kernel.org> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260331-dma-buf-heaps-as-modules-v4-2-e18fda504419@kernel.org	2026-03-31 13:27:20 +02:00

... 2 3 4 5 6 ...

51779 Commits