Currently, if perf_l2_init() fails, turbostat exits after issuing the
following error (which was encountered on AlderLake):
turbostat: perf_l2_init(cpu0, 0x0, 0xff24) REFS: Invalid argument
This occurs because perf_l2_init() calls err(). However, the code has been
written in such a manner that it is able to perform cleanup and continue.
Therefore, this issue can be addressed by changing the appropriate calls
to err() to warnx().
Additionally, correct the PMU type arguments passed to the warning strings
in the ecore and lcore blocks so the logs accurately reflect the failing
counter type.
Signed-off-by: David Arcari <darcari@redhat.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Merge cpufreq updates for 7.1-rc1:
- Update qcom-hw DT bindings to include Eliza hardware (Abel Vesa)
- Update cpufreq-dt-platdev blocklist (Faruque Ansari)
- Minor updates to driver and dt-bindings for Tegra (Thierry Reding,
Rosen Penev)
- Add MAINTAINERS entry for CPPC driver (Viresh Kumar)
- Add support for new features: CPPC performance priority, Dynamic EPP,
Raw EPP, and new unit tests for them to amd-pstate (Gautham Shenoy,
Mario Limonciello)
- Fix sysfs files being present when HW missing and broken/outdated
documentation in the amd-pstate driver (Ninad Naik, Gautham Shenoy)
- Pass the policy to cpufreq_driver->adjust_perf() to avoid using
cpufreq_cpu_get() in the .adjust_perf() callback in amd-pstate which
leads to a scheduling-while-atomic bug (K Prateek Nayak)
- Clean up dead code in Kconfig for cpufreq (Julian Braha)
- Remove max_freq_req update for pre-existing cpufreq policy and add a
boost_freq_req QoS request to save the boost constraint instead of
overwriting the last scaling_max_freq constraint (Pierre Gondois)
- Embed cpufreq QoS freq_req objects in cpufreq policy so they all
are allocated in one go along with the policy to simplify lifetime
rules and avoid error handling issues (Viresh Kumar)
- Use DMI max speed when CPPC is unavailable in the acpi-cpufreq
scaling driver (Henry Tseng)
- Switch policy_is_shared() in cpufreq to using cpumask_nth() instead
of cpumask_weight() because the former is more efficient (Yury Norov)
- Use sysfs_emit() in sysfs show functions for cpufreq governor
attributes (Thorsten Blum)
- Update intel_pstate to stop returning an error when "off" is written
to its status sysfs attribute while the driver is already off (Fabio
De Francesco)
- Include current frequency in the debug message printed by
__cpufreq_driver_target() (Pengjie Zhang)
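As a rough illustration of the policy_is_shared() item above: deciding
whether a mask holds more than one CPU only requires finding a second
set bit, whereas a weight computation counts every bit. The Python
sketch below models a cpumask as a plain integer. NR_CPU_BITS and all
helper names are assumptions made for this example, not kernel code.

```python
NR_CPU_BITS = 64  # assumed mask width for this sketch


def weight(mask: int) -> int:
    """Count every set bit in the mask (always inspects the whole mask)."""
    return bin(mask).count("1")


def nth_set_bit(mask: int, n: int) -> int:
    """Return the index of the n-th set bit (0-based), or NR_CPU_BITS if absent.

    The scan stops as soon as the n-th set bit is found.
    """
    seen = 0
    for bit in range(NR_CPU_BITS):
        if mask & (1 << bit):
            if seen == n:
                return bit
            seen += 1
    return NR_CPU_BITS


def is_shared_by_weight(mask: int) -> bool:
    """Shared means more than one CPU, decided by counting all bits."""
    return weight(mask) > 1


def is_shared_by_nth(mask: int) -> bool:
    """Shared means a second set bit (n == 1) exists somewhere in the mask."""
    return nth_set_bit(mask, 1) < NR_CPU_BITS
```

Both predicates agree on every mask; the second one can return early
once two CPUs have been seen.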
* pm-cpufreq: (38 commits)
cpufreq/amd-pstate: Add POWER_SUPPLY select for dynamic EPP
MAINTAINERS: amd-pstate: Step down as maintainer, add Prateek as reviewer
cpufreq: Pass the policy to cpufreq_driver->adjust_perf()
cpufreq/amd-pstate: Pass the policy to amd_pstate_update()
cpufreq/amd-pstate-ut: Add a unit test for raw EPP
cpufreq/amd-pstate: Add support for raw EPP writes
cpufreq/amd-pstate: Add support for platform profile class
cpufreq/amd-pstate: add kernel command line to override dynamic epp
cpufreq/amd-pstate: Add dynamic energy performance preference
Documentation: amd-pstate: fix dead links in the reference section
cpufreq/amd-pstate: Cache the max frequency in cpudata
Documentation/amd-pstate: Add documentation for amd_pstate_floor_{freq,count}
Documentation/amd-pstate: List amd_pstate_prefcore_ranking sysfs file
Documentation/amd-pstate: List amd_pstate_hw_prefcore sysfs file
amd-pstate-ut: Add a testcase to validate the visibility of driver attributes
amd-pstate-ut: Add module parameter to select testcases
amd-pstate: Introduce a tracepoint trace_amd_pstate_cppc_req2()
amd-pstate: Add sysfs support for floor_freq and floor_count
amd-pstate: Add support for CPPC_REQ2 and FLOOR_PERF
x86/cpufeatures: Add AMD CPPC Performance Priority feature.
...
The error path after scx_bpf_create_dsq(real_dsq_id, ...) was reporting
test_dsq_id instead of real_dsq_id in the error message, which would
mislead debugging.
Signed-off-by: fangqiurong <fangqiurong@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
The ../generated/protos.a rule had a spurious leading space before the
target name. In make, target rules must start at column 0; only recipe
lines are indented with a tab. The extra space caused make to misparse
the rule.
Remove the leading space to match the style of the adjacent
../lib/ynl.a rule.
Fixes: e0aa0c6175 ("tools: ynl: move samples to tests")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20260408-ynl_makefile-v1-1-f9624acc2ad9@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
People (do people still write code or is it all AI?) seem to not
get that ksft_run() can only be called once. If we call it
multiple times KTAP parsers will likely cut off after the first
batch has finished.
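The cut-off can be pictured with a toy model. The sketch below is not
the real ksft_run() nor a real KTAP parser: emit_batch() and
parse_until_plan_done() are hypothetical stand-ins that only mimic the
"header, plan line, one result per test" shape, under the assumption
that a parser stops once the first plan is satisfied.

```python
def emit_batch(results):
    """Toy stand-in for one runner call: header, plan, then one line per test."""
    lines = ["TAP version 13", f"1..{len(results)}"]
    for idx, (name, ok) in enumerate(results, start=1):
        lines.append(f"{'ok' if ok else 'not ok'} {idx} {name}")
    return lines


def parse_until_plan_done(lines):
    """Toy parser: read the first plan, collect that many results, then stop."""
    planned = None
    seen = []
    for line in lines:
        if planned is None:
            if line.startswith("1.."):
                planned = int(line[3:])
            continue
        if line.startswith(("ok ", "not ok ")):
            seen.append(line)
            if len(seen) == planned:
                break
    return seen


batch_a = [("test_a", True), ("test_b", True)]
batch_b = [("test_c", False)]

# Two runner calls: the second plan and the failing test_c are never parsed.
two_calls = parse_until_plan_done(emit_batch(batch_a) + emit_batch(batch_b))

# One runner call with all cases: every result is covered by the single plan.
one_call = parse_until_plan_done(emit_batch(batch_a + batch_b))
```

In this model the failure in the second batch is silently lost when the
runner is invoked twice, which is why all cases should be handed to a
single call.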
Link: https://patch.msgid.link/20260408221952.819822-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add VLAN filter propagation tests through offloaded MACsec devices via
actual traffic.
The tests create MACsec tunnels with matching SAs on both endpoints,
stack VLANs on top, and verify connectivity with ping. Covered:
- Offloaded MACsec with VLAN (filters propagate to HW)
- Software MACsec with VLAN (no HW filter propagation)
- Offload on/off toggle and verifying traffic still works
On netdevsim this makes use of the VLAN filter debugfs file to actually
validate that filters are applied/removed correctly.
On real hardware the traffic should validate actual VLAN filter
propagation.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260408115240.1636047-4-cratiu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move MACsec offload API and ethtool feature tests from
tools/testing/selftests/drivers/net/netdevsim/macsec-offload.sh to
tools/testing/selftests/drivers/net/macsec.py using the NetDrvEnv
framework so tests can run against both netdevsim (default) and real
hardware (NETIF=ethX). As some real hardware requires MACsec to use
encryption, add that to the tests.
Netdevsim-specific limit checks (max SecY, max RX SC) were moved into
separate test cases to avoid failures on real hardware.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260408115240.1636047-2-cratiu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
netkit: Support for io_uring zero-copy and AF_XDP
Containers use virtual netdevs to route traffic from a physical netdev
in the host namespace. They do not have access to the physical netdev
in the host and thus can't use memory providers or AF_XDP that require
reconfiguring/restarting queues in the physical netdev.
This patchset adds the concept of queue leasing to virtual netdevs that
allow containers to use memory providers and AF_XDP at native speed.
Leased queues are bound to a real queue in a physical netdev and act
as a proxy.
Memory providers and AF_XDP operations take an ifindex and queue id,
so containers would pass in an ifindex for a virtual netdev and a queue
id of a leased queue, which then gets proxied to the underlying real
queue.
We have implemented support for this concept in netkit and tested the
latter against Nvidia ConnectX-6 (mlx5) as well as Broadcom BCM957504
(bnxt_en) 100G NICs. For more details see the individual patches.
====================
Link: https://patch.msgid.link/20260402231031.447597-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add extensive selftests for netkit queue leasing, using io_uring zero
copy test binary inside of a netns with netkit. This checks that memory
providers can be bound against virtual queues in a netkit within a
netns that are leasing from a physical netdev in the default netns.
Also add various test cases around corner cases for the queue creation
itself as well as queue info dumping and teardown in case of netkit in
device pair and single mode.
Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20260402231031.447597-15-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a ynl netdev family operation called queue-create that creates a
new queue on a netdevice:
      name: queue-create
      attribute-set: queue
      flags: [admin-perm]
      do:
        request:
          attributes:
            - ifindex
            - type
            - lease
        reply: &queue-create-op
          attributes:
            - id
This is a generic operation such that it can be extended for various
use cases in future. Right now it is mandatory to specify ifindex,
the queue type (which must be rx) and a lease. The newly created
queue id is returned to the caller.
A queue from a virtual device can have a lease which refers to another
queue from a physical device. This is useful for memory providers
and AF_XDP operations which take an ifindex and queue id to allow
applications to bind against virtual devices in containers. The lease
couples both queues together and allows operations to be proxied from
a virtual device in a container to the physical device.
In future, the nested lease attribute can be lifted and made optional
for other use-cases such as dynamic queue creation for physical
netdevs. The lack of lease and the specification of the physical
device as an ifindex will imply that we need a real queue to be
allocated. Similarly, the queue type enforcement to rx can then be
lifted as well to support tx.
An early implementation had only driver-specific integration [0], but
in order for other virtual devices to reuse, it makes sense to have
this as a generic API in core net.
For leasing queues, the virtual netdev must have real_num_rx_queues
less than num_rx_queues at the time of calling queue-create. The
queue-type must be rx as only rx queues are supported for leasing
for now. We also enforce that the queue-create ifindex must point
to a virtual device, and that the nested lease attribute's ifindex
must point to a physical device. The nested lease attribute set
contains a netns-id attribute which is optional and can specify a
netns-id relative to the caller's netns. It requires cap_net_admin
and if the netns-id attribute is not specified, the lease ifindex
will be retrieved from the current netns. Also, it is modeled as
an s32 type similarly as done elsewhere in the stack.
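The rules in the paragraph above can be summarized as a small model.
This is only a sketch of the described checks, not the kernel
implementation: the Netdev and Lease classes, the queue_id field, the
order of the checks, the error strings, and the choice of
real_num_rx_queues as the new queue id are all assumptions for
illustration. The netns-id lookup and the cap_net_admin check are left
out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Netdev:
    ifindex: int
    is_virtual: bool
    num_rx_queues: int       # rx queue slots allocated for the device
    real_num_rx_queues: int  # rx queues currently in use


@dataclass
class Lease:
    ifindex: int                    # device to lease from
    queue_id: int                   # assumed field naming the leased queue
    netns_id: Optional[int] = None  # optional; ignored in this sketch


def check_queue_create(dev, qtype, lease, devices):
    """Return the new queue id, or raise ValueError naming the violated rule."""
    if not dev.is_virtual:
        raise ValueError("queue-create ifindex must point to a virtual device")
    if qtype != "rx":
        raise ValueError("only rx queues can be leased for now")
    if lease is None:
        raise ValueError("a lease is mandatory for now")
    target = devices.get(lease.ifindex)
    if target is None or target.is_virtual:
        raise ValueError("lease ifindex must point to a physical device")
    if dev.real_num_rx_queues >= dev.num_rx_queues:
        raise ValueError("real_num_rx_queues must be below num_rx_queues")
    new_id = dev.real_num_rx_queues  # assumed id allocation scheme
    dev.real_num_rx_queues += 1
    return new_id
```

A virtual device with one spare rx slot accepts exactly one such
request in this model; a second request fails the
real_num_rx_queues check.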
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://bpfconf.ebpf.io/bpfconf2025/bpfconf2025_material/lsfmmbpf_2025_netkit_borkmann.pdf [0]
Link: https://patch.msgid.link/20260402231031.447597-2-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extend the verifier_direct_packet_access BPF selftests to exercise the
verifier code paths which ensure that the pkt range is cleared after
add/sub alu with a known scalar. The tests reject the invalid access.
# LDLIBS=-static PKG_CONFIG='pkg-config --static' ./vmtest.sh -- ./test_progs -t verifier_direct
[...]
#592/35 verifier_direct_packet_access/direct packet access: pkt_range cleared after sub with known scalar:OK
#592/36 verifier_direct_packet_access/direct packet access: pkt_range cleared after add with known scalar:OK
#592/37 verifier_direct_packet_access/direct packet access: test3:OK
#592/38 verifier_direct_packet_access/direct packet access: test3 @unpriv:OK
#592/39 verifier_direct_packet_access/direct packet access: test34 (non-linear, cgroup_skb/ingress, too short eth):OK
#592/40 verifier_direct_packet_access/direct packet access: test35 (non-linear, cgroup_skb/ingress, too short 1):OK
#592/41 verifier_direct_packet_access/direct packet access: test36 (non-linear, cgroup_skb/ingress, long enough):OK
#592 verifier_direct_packet_access:OK
[...]
Summary: 2/47 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260409155016.536608-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Extend the PMU test suite to cover overflow interrupts. The test enables
the PMI (Performance Monitor Interrupt), sets counter 0 to one less than
the overflow value, and verifies that an interrupt is raised when the
counter overflows. A guest interrupt handler checks the interrupt cause
and disables further PMU interrupts upon success.
Signed-off-by: Song Gao <gaosong@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Introduce a basic PMU test that verifies hardware event counting for
four performance counters. The test enables the events for CPU cycles,
instructions retired, branch instructions, and branch misses, runs a
fixed number of loops, and checks that the counter values fall within
expected ranges. It also validates that the host supports PMU and that
the VM feature is enabled.
Signed-off-by: Song Gao <gaosong@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Add helper macros and functions to read and write CPU configuration
registers (cpucfg) from the guest and from the VMM. This interface is
required in upcoming selftests for querying and setting CPU features,
such as PMU capabilities.
Signed-off-by: Song Gao <gaosong@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Running test cases 126 and 128 together causes the first one, test
case 126, to fail:
# for i in $(seq 3); do ./perf test 'perf trace BTF general tests' \
'perf trace record and replay'; done
126: perf trace BTF general tests : FAILED!
128: perf trace record and replay : Ok
126: perf trace BTF general tests : FAILED!
128: perf trace record and replay : Ok
126: perf trace BTF general tests : FAILED!
128: perf trace record and replay : Ok
#
Test case 126 fails because test case 128 runs concurrently, as can
be observed by running ps -ef | grep perf in a different window. Both
test cases invoke perf trace at the same time.
Make test case 'perf trace BTF general tests' exclusive.
Output after:
# for i in $(seq 3); do ./perf test 'perf trace BTF general tests' \
'perf trace record and replay'; done
127: perf trace BTF general tests : Ok
155: perf trace record and replay : Ok
127: perf trace BTF general tests : Ok
155: perf trace record and replay : Ok
127: perf trace BTF general tests : Ok
155: perf trace record and replay : Ok
#
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Howard Chu <howardchu95@gmail.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Add resource_dump_test() which verifies dumping resources for all
devices and ports, and tests that scope=dev returns only device-level
resources and scope=port returns only port resources.
Skip if userspace does not support the scope parameter.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260407194107.148063-12-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tests that querying a specific port handle returns the expected
resource name and size.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260407194107.148063-9-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In "bpf: Disallow freplace on XDP with mismatched xdp_has_frags values" [1],
it was suggested that this XDP test be added to xdp.py.
1. Verify that updating a frag-capable prog with a non-frag-capable
prog fails when the frag-capable prog is attached to an mtu=9k driver.
The test has been verified against Mellanox CX6 and Intel 82599ES NICs.
With the other tests dropped, here is the test log.
# ethtool -i eth0
driver: mlx5_core
version: 6.19.0-061900-generic
# NETIF=eth0 python3 xdp.py
TAP version 13
1..1
ok 1 xdp.test_xdp_native_update_mb_to_sb
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
# ethtool -i eth0
driver: ixgbe
version: 6.19.0-061900-generic
# NETIF=eth0 python3 xdp.py
TAP version 13
1..1
# CMD: ip link set dev eth0 xdpdrv obj /path/to/tools/testing/selftests/net/lib/xdp_dummy.bpf.o sec xdp.frags
# EXIT: 2
# STDERR: RTNETLINK answers: Invalid argument
ok 1 xdp.test_xdp_native_update_mb_to_sb # SKIP device does not support multi-buffer XDP
# Totals: pass:0 fail:0 xfail:0 xpass:0 skip:1 error:0
Signed-off-by: Leon Hwang <leon.huangfu@shopee.com>
Link: https://patch.msgid.link/20260406072655.368173-1-leon.huangfu@shopee.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The piece of code which processes the command line arguments and
populates NETIFS based on them is far from obvious. Rewrite it so that
the intention is clear and the code is easy to follow.
Suggested-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20260407102058.867279-1-ioana.ciornei@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
use_stdio was associated with struct perf_data and not perf_data_file,
meaning there was implicit use of fd rather than fptr, which may not be
safe (for example, in perf_data_file__write). Reorganize perf_data_file
to better abstract use_stdio, add kernel-doc and more consistently use
perf_data__ accessors so that use_stdio is better respected.
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
As noticed in a sashiko review for a patch adding a missing libgen.h
in a file using basename():
https://sashiko.dev/#/patchset/20260402001740.2220481-1-acme%40kernel.org
So avoid these subtleties and instead reuse the gnu_basename() function
we had in srcline.c, renaming it to perf_basename() and replacing
basename() calls with it, simplifying several cases by removing now
needless strdups.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Use calloc() instead of zalloc(nr_entries * sizeof_entry), since that
is what calloc() is for.
Remove linux/zalloc.h in places where it is no longer needed, and add
it where it is needed but was only being included indirectly.
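One general property of calloc(), not stated in this commit message, is
that taking the count and the element size separately lets the
allocator detect a multiplication overflow, while an open-coded
nr_entries * sizeof_entry silently wraps. The arithmetic can be
sketched in Python, assuming a 64-bit size_t. The helper names are
made up for this example.

```python
SIZE_MAX = 2**64 - 1  # assumes a 64-bit size_t


def unchecked_bytes(nr_entries, entry_size):
    """Byte count as produced by nr_entries * entry_size in size_t arithmetic."""
    return (nr_entries * entry_size) & SIZE_MAX


def checked_bytes(nr_entries, entry_size):
    """Byte count with a calloc()-style overflow check; None means 'fail'."""
    if entry_size and nr_entries > SIZE_MAX // entry_size:
        return None  # the request is rejected instead of being undersized
    return nr_entries * entry_size


huge = 2**61 + 1                      # entry count read from untrusted input
wrapped = unchecked_bytes(huge, 8)    # wraps around to a tiny allocation
rejected = checked_bytes(huge, 8)     # the overflow is detected
```

With eight-byte entries the unchecked product wraps to a few bytes, so
later indexing would run far past the buffer. The checked form refuses
the request instead.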
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
As suggested in an unrelated sashiko review:
https://sashiko.dev/#/patchset/20260407195145.2372104-1-acme%40kernel.org
"
Could a malformed perf.data file provide out-of-bounds values for cpu and
domain?
These variables are read directly from the file and used as indices for
cd_map and cd_map[cpu]->domains without any validation against
env->nr_cpus_avail or max_sched_domains.
Similar to the issue above, this is an existing lack of validation that
becomes apparent when looking at the allocation boundaries.
"
Validate it.
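The kind of check being asked for can be sketched as follows. This is a
Python model, not the perf code: lookup_domain() is a hypothetical
helper, and cd_map is represented as a list of lists. Only the names
cd_map, nr_cpus_avail and max_sched_domains come from the review text
above.

```python
def lookup_domain(cd_map, cpu, domain, nr_cpus_avail, max_sched_domains):
    """Return cd_map[cpu][domain] only after both indices are range-checked."""
    if not 0 <= cpu < nr_cpus_avail:
        raise ValueError(f"cpu {cpu} out of range [0, {nr_cpus_avail})")
    if not 0 <= domain < max_sched_domains:
        raise ValueError(f"domain {domain} out of range [0, {max_sched_domains})")
    return cd_map[cpu][domain]


# Two CPUs with two domains each, as if parsed from a well-formed file.
cd_map = [["cpu0-dom0", "cpu0-dom1"], ["cpu1-dom0", "cpu1-dom1"]]
```

Values read from the file are treated as untrusted, so they are
compared against the allocation bounds before being used as indices.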
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Sashiko suggests we use some reasonable max number of args to avoid
overflows when reading perf.data files. Do it.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Those tables and variables don't change, better capture this by
explicitly using 'const'.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
`make check` will run sparse on the perf code base. A frequent warning
is "warning: symbol '...' was not declared. Should it be static?" Go
through and make global definitions without declarations static.
In some cases it is deliberate due to dlsym accessing the symbol. This
change doesn't clean up the missing declarations for perf test suites.
Sometimes things can opportunistically be made const.
Making some things static exposed unused-function warnings, so
restructuring of ifdefs was necessary for that.
These changes reduce the size of the perf binary by 568 bytes.
Committer notes:
Refreshed the patch, the original one fell thru the cracks, updated the
size reduction.
Remove the trace-event-scripting.c changes, as they break the build;
noticed with container builds and with sashiko:
https://sashiko.dev/#/patchset/20260401215306.2152898-1-acme%40kernel.org
Also make two variables static to address another sashiko review
comment:
https://sashiko.dev/#/patchset/20260402001740.2220481-1-acme%40kernel.org
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Ankur Arora <ankur.a.arora@oracle.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Howard Chu <howardchu95@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Leo Yan <leo.yan@arm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <pjw@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Yujie Liu <yujie.liu@intel.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
In fef2a73516 ("perf tools: Kill die()") the die() function was
removed, but its prototype in util.h was not. Now, when building with
LIBPERL=1 during a routine 'make -C tools/perf build-test' run, the
build fails because perl uses die() calls, which clash with this
remnant. Remove it.
Fixes: fef2a73516 ("perf tools: Kill die()")
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Merge tag 'ipsec-next-2026-04-08' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2026-04-08
1) Update outdated comment in xfrm_dst_check().
From kexinsun.
2) Drop support for HMAC-RIPEMD-160 from IPsec.
From Eric Biggers.
* tag 'ipsec-next-2026-04-08' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: Drop support for HMAC-RIPEMD-160
xfrm: update outdated comment
====================
Link: https://patch.msgid.link/20260408094258.148555-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Report an error for './runner foo' (positional arg instead of -t) and
for './runner -t foo' when the filter matches no tests. Previously both
cases produced no error output.
Pre-scan the test list before the main loop so the error is reported
immediately, avoiding spurious SKIP output from '-s' when no tests
match.
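The pre-scan can be sketched as a selection step that runs before any
test does. This is a Python model rather than the runner's actual
code: select_tests() is a hypothetical helper, substring matching for
the -t filter is an assumption, and the error strings and test names
are invented for the example.

```python
def select_tests(all_tests, positional_args, name_filter=None):
    """Return the tests to run, or raise ValueError before any test starts."""
    if positional_args:
        # './runner foo' case: a bare argument is not a valid way to pick tests.
        raise ValueError(f"unexpected argument '{positional_args[0]}'")
    if name_filter is None:
        return list(all_tests)
    # Assumed matching rule: a test is selected if its name contains the filter.
    matched = [name for name in all_tests if name_filter in name]
    if not matched:
        # './runner -t foo' case with no match: fail early, print no SKIP lines.
        raise ValueError(f"no tests match filter '{name_filter}'")
    return matched


tests = ["example_init", "example_select_cpu"]  # hypothetical test names
```

Because the error is raised before the main loop is entered, no
per-test output is produced for a filter that matches nothing.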
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a selftest to ensure that kprobe_multi programs cannot be attached
using the BPF_F_SLEEPABLE flag. This test succeeds when the kernel
rejects attachment of kprobe_multi when the BPF_F_SLEEPABLE flag is set.
Suggested-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
Link: https://lore.kernel.org/r/20260408190137.101418-3-varunrmallya@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When a parent is copied into a child the name array is populated in
address not name order. Make sure the name array isn't flagged as sorted.
Fixes: 659ad3492b ("perf maps: Switch from rbtree to lazily sorted array for addresses")
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
When an entry in the address array is replaced, the corresponding name
entry is replaced. The entries' names may sort differently and so it is
important that the sorted by name property be cleared on the maps.
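A minimal model of the invariant, assuming a container that keeps one
array in address order and a second array that is sorted by name only
on demand. The Maps class, its field names and rename_at() are
hypothetical and written in Python for brevity; they do not mirror the
perf data structures.

```python
import bisect


class Maps:
    """Toy model: one array in address order, one lazily sorted by name."""

    def __init__(self):
        self.by_address = []         # (start, name) tuples in address order
        self.by_name = []            # same entries, sorted by name on demand
        self.by_name_sorted = False  # hypothetical flag name

    def insert(self, start, name):
        bisect.insort(self.by_address, (start, name))
        self.by_name.append((start, name))
        self.by_name_sorted = False

    def rename_at(self, index, new_name):
        """Replace the entry at by_address[index] with one carrying new_name."""
        old = self.by_address[index]
        new = (old[0], new_name)
        self.by_address[index] = new
        self.by_name[self.by_name.index(old)] = new
        # The new name may sort differently, so the name order is now unknown.
        self.by_name_sorted = False

    def find_by_name(self, name):
        """Binary search by name, sorting first if the flag says it is needed."""
        if not self.by_name_sorted:
            self.by_name.sort(key=lambda entry: entry[1])
            self.by_name_sorted = True
        names = [entry[1] for entry in self.by_name]
        pos = bisect.bisect_left(names, name)
        if pos < len(names) and names[pos] == name:
            return self.by_name[pos]
        return None
```

If rename_at() left the flag set, the binary search would run over an
array that is no longer in name order and could miss an entry that is
present.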
Fixes: 0d11fab327 ("perf maps: Fixup maps_by_name when modifying maps_by_address")
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Getting debug_file can trigger warnings if not set. Avoid getting
these warnings by pushing the use under the controlling if.
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Remove global variable addr2line_timeout_ms and add it as a member
to symbol_conf structure.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Ian Rogers <irogers@google.com>
[namhyung: move the initialization to util/symbol.c]
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Make symbol_conf::addr2line_disable_warn configurable by reading
the perfconfig file.
Use section core and addr2line-disable-warn = value.
Update documentation.
Example:
# perf config -l
core.addr2line-timeout=5000
core.addr2line-disable-warn=1
#
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Rename member symbol_conf::disable_add2line_warn to
symbol_conf::addr2line_disable_warn to make it consistent with other
addr2line_xxx constants.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Introduce a new stress test to check for race conditions in the
nfnetlink_queue subsystem, where an entry is freed while another CPU is
concurrently walking the global rhashtable.
To trigger this, `nf_queue.c` is extended with two new flags:
* -O (out-of-order): Buffers packet IDs and flushes them in reverse.
* -b (bogus verdicts): Floods the kernel with non-existent packet IDs.
The bogus verdict loop forces the kernel's lookup function to perform
full rhashtable bucket traversals (-ENOENT). Combined with reverse-order
flushing and heavy parallel UDP/ping flooding across 8 queues, this puts
the nfnetlink_queue code under pressure.
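The out-of-order behaviour described for -O can be sketched as a small
buffer that holds packet IDs and releases them in reverse. This is a
Python model, not the nf_queue.c code: the ReorderBuffer class and its
capacity parameter are assumptions made for illustration.

```python
class ReorderBuffer:
    """Hold packet IDs until the buffer is full, then release them reversed."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed batch size, chosen by the caller
        self.pending = []

    def add(self, packet_id):
        """Buffer one ID; return the IDs to verdict now, newest first."""
        self.pending.append(packet_id)
        if len(self.pending) < self.capacity:
            return []
        return self.flush()

    def flush(self):
        """Release everything buffered so far in reverse arrival order."""
        out = self.pending[::-1]
        self.pending = []
        return out
```

Issuing verdicts newest-first means the entries are removed in a
different order than they were queued, which is the access pattern the
stress test is after.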
Joint work with Florian Westphal.
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
* kvm-arm64/misc-7.1:
KVM: arm64: selftests: Avoid testing the IMPDEF behavior
KVM: arm64: Destroy stage-2 page-table in kvm_arch_destroy_vm()
KVM: arm64: Don't leave mmu->pgt dangling on kvm_init_stage2_mmu() error
KVM: arm64: Prevent the host from using an smc with imm16 != 0
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/vgic-fixes-7.1:
: .
: First pass at fixing a number of vgic-v5 bugs that were found
: after the merge of the initial series.
: .
KVM: arm64: Advertise ID_AA64PFR2_EL1.GCIE
KVM: arm64: vgic-v5: Fold PPI state for all exposed PPIs
KVM: arm64: set_id_regs: Allow GICv3 support to be set at runtime
KVM: arm64: Don't advertises GICv3 in ID_PFR1_EL1 if AArch32 isn't supported
KVM: arm64: Correctly plumb ID_AA64PFR2_EL1 into pkvm idreg handling
KVM: arm64: Move GICv5 timer PPI validation into timer_irqs_are_valid()
KVM: arm64: Remove evaluation of timer state in kvm_cpu_has_pending_timer()
KVM: arm64: Kill arch_timer_context::direct field
KVM: arm64: vgic-v5: Correctly set dist->ready once initialised
KVM: arm64: vgic-v5: Make the effective priority mask a strict limit
KVM: arm64: vgic-v5: Cast vgic_apr to u32 to avoid undefined behaviours
KVM: arm64: vgic-v5: Transfer edge pending state to ICH_PPI_PENDRx_EL2
KVM: arm64: vgic-v5: Hold config_lock while finalizing GICv5 PPIs
KVM: arm64: Account for RESx bits in __compute_fgt()
KVM: arm64: Fix writeable mask for ID_AA64PFR2_EL1
arm64: Fix field references for ICH_PPI_DVIR[01]_EL2
KVM: arm64: Don't skip per-vcpu NV initialisation
KVM: arm64: vgic: Don't reset cpuif/redist addresses at finalize time
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/vgic-v5-ppi: (40 commits)
: .
: Add initial GICv5 support for KVM guests, only adding PPI support
: for the time being. Patches courtesy of Sascha Bischoff.
:
: From the cover letter:
:
: "This is v7 of the patch series to add the virtual GICv5 [1] device
: (vgic_v5). Only PPIs are supported by this initial series, and the
: vgic_v5 implementation is restricted to the CPU interface,
: only. Further patch series are to follow in due course, and will add
: support for SPIs, LPIs, the GICv5 IRS, and the GICv5 ITS."
: .
KVM: arm64: selftests: Add no-vgic-v5 selftest
KVM: arm64: selftests: Introduce a minimal GICv5 PPI selftest
KVM: arm64: gic-v5: Communicate userspace-driveable PPIs via a UAPI
Documentation: KVM: Introduce documentation for VGICv5
KVM: arm64: gic-v5: Probe for GICv5 device
KVM: arm64: gic-v5: Set ICH_VCTLR_EL2.En on boot
KVM: arm64: gic-v5: Introduce kvm_arm_vgic_v5_ops and register them
KVM: arm64: gic-v5: Hide FEAT_GCIE from NV GICv5 guests
KVM: arm64: gic: Hide GICv5 for protected guests
KVM: arm64: gic-v5: Mandate architected PPI for PMU emulation on GICv5
KVM: arm64: gic-v5: Enlighten arch timer for GICv5
irqchip/gic-v5: Introduce minimal irq_set_type() for PPIs
KVM: arm64: gic-v5: Initialise ID and priority bits when resetting vcpu
KVM: arm64: gic-v5: Create and initialise vgic_v5
KVM: arm64: gic-v5: Support GICv5 interrupts with KVM_IRQ_LINE
KVM: arm64: gic-v5: Implement direct injection of PPIs
KVM: arm64: Introduce set_direct_injection irq_op
KVM: arm64: gic-v5: Trap and mask guest ICC_PPI_ENABLERx_EL1 writes
KVM: arm64: gic-v5: Check for pending PPIs
KVM: arm64: gic-v5: Clear TWI if single task running
...
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/hyp-tracing: (40 commits)
: .
: EL2 tracing support, adding both 'remote' ring-buffer
: infrastructure and the tracing itself, courtesy of
: Vincent Donnefort. From the cover letter:
:
: "The growing set of features supported by the hypervisor in protected
: mode necessitates debugging and profiling tools. Tracefs is the
: ideal candidate for this task:
:
: * It is simple to use and to script.
:
: * It is supported by various tools, from the trace-cmd CLI to the
: Android web-based perfetto.
:
: * The ring-buffer, where trace events are stored, consists of linked
: pages, making it an ideal structure for sharing between kernel and
: hypervisor.
:
: This series first introduces a new generic way of creating remote events and
: remote buffers. Then it adds support to the pKVM hypervisor."
: .
tracing: selftests: Extend hotplug testing for trace remotes
tracing: Non-consuming read for trace remotes with an offline CPU
tracing: Adjust cmd_check_undefined to show unexpected undefined symbols
tracing: Restore accidentally removed SPDX tag
KVM: arm64: avoid unused-variable warning
tracing: Generate undef symbols allowlist for simple_ring_buffer
KVM: arm64: tracing: add ftrace dependency
tracing: add more symbols to whitelist
tracing: Update undefined symbols allow list for simple_ring_buffer
KVM: arm64: Fix out-of-tree build for nVHE/pKVM tracing
tracing: selftests: Add hypervisor trace remote tests
KVM: arm64: Add selftest event support to nVHE/pKVM hyp
KVM: arm64: Add hyp_enter/hyp_exit events to nVHE/pKVM hyp
KVM: arm64: Add event support to the nVHE/pKVM hyp and trace remote
KVM: arm64: Add trace reset to the nVHE/pKVM hyp
KVM: arm64: Sync boot clock with the nVHE/pKVM hyp
KVM: arm64: Add trace remote for the nVHE/pKVM hyp
KVM: arm64: Add tracing capability for the nVHE/pKVM hyp
KVM: arm64: Support unaligned fixmap in the pKVM hyp
KVM: arm64: Initialise hyp_nr_cpus for nVHE hyp
...
Signed-off-by: Marc Zyngier <maz@kernel.org>
Add a selftest that verifies the dst_cache in seg6 lwtunnel is not
shared between the input (forwarding) and output (locally generated)
paths.
The test creates three namespaces (ns_src, ns_router, ns_dst)
connected in a line. An SRv6 encap route on ns_router encapsulates
traffic destined to cafe::1 with SID fc00::100. The SID is
reachable only for forwarded traffic (from ns_src) via an ip rule
matching the ingress interface (iif veth-r0 lookup 100), and
blackholed in the main table.
The test verifies that:
1. A packet generated locally on ns_router does not reach
ns_dst with an empty cache, since the SID is blackholed;
2. A forwarded packet from ns_src populates the input cache
from table 100 and reaches ns_dst;
3. A packet generated locally on ns_router still does not
reach ns_dst after the input cache is populated,
confirming the output path does not reuse the input
cache entry.
Both the forwarded and local packets are pinned to the same CPU
with taskset, since dst_cache is per-cpu.
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Link: https://patch.msgid.link/20260404004405.4057-3-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The querier-interval test adds h1 (currently a slave of the VRF created
by simple_if_init) to a temporary bridge br1 acting as an outside IGMP
querier. The kernel VRF driver (drivers/net/vrf.c) calls cycle_netdev()
on every slave add and remove, toggling the interface admin-down then up.
Phylink takes the PHY down during the admin-down half of that cycle.
Since h1 and swp1 are cable-connected, swp1 also loses its link and may need
several seconds to re-negotiate.
Use setup_wait_dev $h1 0 which waits for h1 to return to UP state, so the
test can rely on the link being back up at this point.
Fixes: 4d8610ee8b ("selftests: net: bridge: add vlan mcast_querier_interval tests")
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Link: https://patch.msgid.link/c830f130860fd2efae08bfb9e5b25fd028e58ce5.1775424423.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
socat v1.8.1.0 now defaults to shut-null, so it sends an extra
0-length UDP packet when the sender disconnects. This breaks
our tests, which expect the exact packet sequence.
Add shut-none, which was the old default, where necessary.
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Joe Damato <joe@dama.to>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260404230103.2719103-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Test overwriting referenced dynptr and clones to make sure it is only
allowed when there is at least one other dynptr with the same ref_obj_id.
Also make sure slice is still invalidated after the dynptr's stack slot
is destroyed.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406150548.1354271-3-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf_fentry_shadow_test exists in both vmlinux (net/bpf/test_run.c) and
bpf_testmod (bpf_testmod.c), creating a duplicate symbol condition when
bpf_testmod is loaded. Add subtests that verify kprobe behavior with
this duplicate symbol:
In attach_probe:
- dup-sym-{default,legacy,perf,link}: unqualified attach succeeds
across all four modes, preferring vmlinux over module shadow.
- MOD:SYM qualification attaches to the module version.
In kprobe_multi_test:
- dup_sym: kprobe_multi attach with kprobe and kretprobe succeeds.
bpf_fentry_shadow_test is not invoked via test_run, so tests verify
attach and detach succeed without triggering the probe.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com>
Link: https://lore.kernel.org/r/20260407203912.1787502-3-andrey.grodzovsky@crowdstrike.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add iter_buf_null_fail with two tests and a test runner:
- iter_buf_null_deref: verifier must reject direct dereference of
ctx->key (PTR_TO_BUF | PTR_MAYBE_NULL) without a null check
- iter_buf_null_check_ok: verifier must accept dereference after
an explicit null check
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Reviewed-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Qi Tang <tpluszz77@gmail.com>
Link: https://lore.kernel.org/r/20260407145421.4315-1-tpluszz77@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
For tests that carry a __description tag, allow matching on both the
description string and program name for convenience. Before this commit,
the description string had to be spelled out to filter the tests.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260407145606.3991770-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* Add enforce_fs() for defining and enforcing a ruleset in one step
* In some places, drop "ASSERT_LE(0, fd)" checks after the
create_ruleset() call -- create_ruleset() already checks that.
* In some places, rename "file_fd" to "fd" where it is no longer
needed to disambiguate.
Signed-off-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20260327164838.38231-12-gnoack3000@gmail.com
[mic: Tweak subject]
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Even when a process is restricted with the new
LANDLOCK_ACCESS_FS_RESOLVE_UNIX right, the kernel can continue writing
its coredump to the configured coredump socket.
In the test, we create a local server and rewire the system to write
coredumps into it. We then create a child process within a Landlock
domain where LANDLOCK_ACCESS_FS_RESOLVE_UNIX is restricted and make
the process crash. The test uses SO_PEERCRED to check that the
connecting client process is the expected one.
Includes a fix by Mickaël Salaün for setting the EUID to 0 (see [1]).
Link[1]: https://lore.kernel.org/all/20260218.ohth8theu8Yi@digikod.net/
Suggested-by: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20260327164838.38231-11-gnoack3000@gmail.com
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Add an audit test to check that Landlock denials from
LANDLOCK_ACCESS_FS_RESOLVE_UNIX result in audit logs in the expected
format. (There is one audit test for each filesystem access right, so
we should add one for LANDLOCK_ACCESS_FS_RESOLVE_UNIX as well.)
Signed-off-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20260327164838.38231-10-gnoack3000@gmail.com
Signed-off-by: Mickaël Salaün <mic@digikod.net>
* Extract common helpers from an existing IOCTL test that
also uses pathname unix(7) sockets.
* These tests use the common scoped domains fixture which is also used
in other Landlock scoping tests and which was used in Tingmao Wang's
earlier patch set in [1].
These tests exercise the cross product of the following scenarios:
* Stream connect(), Datagram connect(), Datagram sendmsg() and
Seqpacket connect().
* Child-to-parent and parent-to-child communication
* The Landlock policy configuration as listed in the scoped_domains
fixture.
* In the default variant, Landlock domains are only placed where
prescribed in the fixture.
* In the "ALL_DOMAINS" variant, Landlock domains are also placed in
the places where the fixture says to omit them, but with a
LANDLOCK_RULE_PATH_BENEATH that allows connection.
Cc: Justin Suess <utilityemal77@gmail.com>
Cc: Tingmao Wang <m@maowtm.org>
Cc: Mickaël Salaün <mic@digikod.net>
Link[1]: https://lore.kernel.org/all/53b9883648225d5a08e82d2636ab0b4fda003bc9.1767115163.git.m@maowtm.org/
Signed-off-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20260327164838.38231-9-gnoack3000@gmail.com
Signed-off-by: Mickaël Salaün <mic@digikod.net>
The access_fs_16 variable was originally intended to stay frozen at 16
access rights so that audit tests would not need updating when new
access rights are added. Now that we have 17 access rights, the name
is confusing.
Replace all uses of access_fs_16 with ACCESS_ALL and delete the
variable.
Suggested-by: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20260327164838.38231-8-gnoack3000@gmail.com
Signed-off-by: Mickaël Salaün <mic@digikod.net>
* Add a new access right LANDLOCK_ACCESS_FS_RESOLVE_UNIX, which
controls the lookup operations for named UNIX domain sockets. The
resolution happens during connect() and sendmsg() (depending on
socket type).
* Change access_mask_t from u16 to u32 (see below).
* Hook into the path lookup in unix_find_bsd() in af_unix.c, using an
LSM hook. Make policy decisions based on the new access right.
* Increment the Landlock ABI version.
* Minor test adaptations to keep the tests working.
* Document the design rationale for scoped access rights,
and cross-reference it from the header documentation.
With this access right, access is granted if either of the following
conditions is met:
* The target socket's filesystem path was allow-listed using a
LANDLOCK_RULE_PATH_BENEATH rule, *or*:
* The target socket was created in the same Landlock domain in which
LANDLOCK_ACCESS_FS_RESOLVE_UNIX was restricted.
In case of a denial, connect() and sendmsg() return EACCES, which is
the same error that is returned if the user does not have the write
bit in the traditional UNIX file system permissions of that file.
The access_mask_t type grows from u16 to u32 to make space for the new
access right. This also doubles the size of struct layer_access_masks
from 32 bytes to 64 bytes. To avoid memory layout inconsistencies between
architectures (especially m68k), pack and align struct access_masks [2].
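The size arithmetic and the effect of packing can be sketched in plain C. This is a minimal illustration, not the kernel definitions: the layer count of 16 is an assumption derived from the 32-byte figure above, the struct and field names are hypothetical, and the 2-bit net and scope widths are placeholders (only the 17 filesystem rights come from this log).

```c
#include <stdint.h>

#define NUM_LAYERS 16	/* assumed: 32 bytes / sizeof(u16) */

/* One mask per layer, before and after widening the mask type. */
struct layer_masks_u16 { uint16_t access[NUM_LAYERS]; };
struct layer_masks_u32 { uint32_t access[NUM_LAYERS]; };

/*
 * Bitfield container with a layout pinned by packed + aligned, so that
 * ABIs with a smaller default int alignment (such as m68k) agree with
 * everyone else on size and alignment. Field widths are placeholders.
 */
struct packed_masks {
	unsigned int fs : 17;
	unsigned int net : 2;
	unsigned int scope : 2;
} __attribute__((packed, aligned(sizeof(uint32_t))));

_Static_assert(sizeof(struct layer_masks_u16) == 32, "16 x u16");
_Static_assert(sizeof(struct layer_masks_u32) == 64, "16 x u32");
_Static_assert(sizeof(struct packed_masks) == 4, "fits one u32");
```

The static assertions make any layout drift a build failure rather than a run-time surprise.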
Document the (possible future) interaction between scoped flags and
other access rights in struct landlock_ruleset_attr, and summarize the
rationale, as discussed in code review leading up to [3].
This feature was created with substantial discussion and input from
Justin Suess, Tingmao Wang and Mickaël Salaün.
Cc: Tingmao Wang <m@maowtm.org>
Cc: Justin Suess <utilityemal77@gmail.com>
Cc: Kuniyuki Iwashima <kuniyu@google.com>
Suggested-by: Jann Horn <jannh@google.com>
Link[1]: https://github.com/landlock-lsm/linux/issues/36
Link[2]: https://lore.kernel.org/all/20260401.Re1Eesu1Yaij@digikod.net/
Link[3]: https://lore.kernel.org/all/20260205.8531e4005118@gnoack.org/
Signed-off-by: Günther Noack <gnoack3000@gmail.com>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20260327164838.38231-5-gnoack3000@gmail.com
[mic: Fix kernel-doc formatting, pack and align access_masks]
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Domain deallocation records are emitted asynchronously from kworker
threads (via free_ruleset_work()). Stale deallocation records from a
previous test can arrive during the current test's deallocation read
loop and be picked up by audit_match_record() instead of the expected
record, causing a domain ID mismatch. The audit.layers test (which
creates 16 nested domains) is particularly vulnerable because it reads
16 deallocation records in sequence, providing a large window for stale
records to interleave.
The same issue affects audit_flags.signal, where deallocation records
from a previous test (audit.layers) can leak into the next test and be
picked up by audit_match_record() instead of the expected record.
Fix this by continuing to read records when the type matches but the
content pattern does not. Stale records are silently consumed, and the
loop only stops when both type and pattern match (or the socket times
out with -EAGAIN).
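The read loop described above can be sketched as follows. This is an illustration under stated assumptions, not the selftest code: struct record, struct queue, next_record() and match_record() are hypothetical, an in-memory array stands in for the netlink socket, and a substring test stands in for the regex match.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record and queue types, for illustration only. */
struct record {
	int type;
	const char *text;
};

struct queue {
	const struct record *recs;
	size_t len, pos;
};

/* Returns the next record, or NULL once the queue is empty ("timeout"). */
static const struct record *next_record(struct queue *q)
{
	return q->pos < q->len ? &q->recs[q->pos++] : NULL;
}

/*
 * Keep reading until a record matches both the type and the pattern.
 * A record of the right type but with the wrong content is treated as
 * stale and consumed instead of ending the search.
 */
int match_record(struct queue *q, int type, const char *pattern)
{
	const struct record *r;

	while ((r = next_record(q))) {
		if (r->type != type)
			continue;
		if (strstr(r->text, pattern))
			return 0;
		/* type matched, content did not: stale record, keep going */
	}
	return -EAGAIN;
}
```

Including the expected domain ID in the pattern is what lets such a loop skip a stale record whose other fields are identical to the expected one.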
Additionally, extend matches_log_domain_deallocated() with an
expected_domain_id parameter. When set, the regex pattern includes the
specific domain ID as a literal hex value, so that deallocation records
for a different domain do not match the pattern at all. This handles
the case where the stale record has the same denial count as the
expected one (e.g. both have denials=1), which the type+pattern loop
alone cannot distinguish. Callers that already know the expected domain
ID (from a prior denial or allocation record) now pass it to filter
precisely.
When expected_domain_id is set, matches_log_domain_deallocated() also
temporarily increases the socket timeout to audit_tv_dom_drop (1 second)
to wait for the asynchronous kworker deallocation, and restores
audit_tv_default afterward. This removes the need for callers to manage
the timeout switch manually.
Cc: Günther Noack <gnoack@google.com>
Cc: stable@vger.kernel.org
Fixes: 6a500b2297 ("selftests/landlock: Add tests for audit flags and domain IDs")
Link: https://lore.kernel.org/r/20260402192608.1458252-5-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Non-audit Landlock tests generate audit records as side effects when
audit_enabled is non-zero (e.g. from boot configuration). These records
accumulate in the kernel audit backlog while no audit daemon socket is
open. When the next test opens a new netlink socket and registers as
the audit daemon, the stale backlog is delivered, causing baseline
record count checks to fail spuriously.
Fix this by draining all pending records in audit_init() right after
setting the receive timeout. The 1-usec SO_RCVTIMEO causes audit_recv()
to return -EAGAIN once the backlog is empty, naturally terminating the
drain loop.
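A drain loop of this shape might look like the sketch below. It assumes a message-oriented socket (the selftest uses an audit netlink socket; a stream socket whose peer has closed would need an extra end-of-file check), and drain_socket() is a hypothetical name rather than a function from the selftest.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/*
 * Discard every message already queued on fd. The very short receive
 * timeout makes recv() fail with EAGAIN as soon as nothing is pending,
 * which ends the loop. Returns the number of messages discarded, or -1
 * on any other error.
 */
int drain_socket(int fd)
{
	struct timeval tv = { .tv_sec = 0, .tv_usec = 1 };
	char buf[4096];
	int drained = 0;

	if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)))
		return -1;

	for (;;) {
		ssize_t ret = recv(fd, buf, sizeof(buf), 0);

		if (ret >= 0) {
			drained++;
			continue;
		}
		if (errno == EINTR)
			continue;
		if (errno == EAGAIN || errno == EWOULDBLOCK)
			return drained;	/* backlog empty: timeout fired */
		return -1;
	}
}
```

Records that are still in flight when the timeout fires are not caught, which is why the count checks discussed next cannot rely on the drain alone.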
Domain deallocation records are emitted asynchronously from a work
queue, so they may still arrive after the drain. Remove records.domain
== 0 checks that are not preceded by audit_match_record() calls, which
would otherwise consume stale records before the count. Document this
constraint above audit_count_records().
Increasing the drain timeout to catch in-flight deallocation records was
considered but rejected: a longer timeout adds latency to every
audit_init() call even when no stale record is pending, and any fixed
timeout is still not guaranteed to catch all records under load.
Removing the unprotected checks is simpler and avoids the spurious
failures.
Cc: Günther Noack <gnoack@google.com>
Cc: stable@vger.kernel.org
Fixes: 6a500b2297 ("selftests/landlock: Add tests for audit flags and domain IDs")
Reviewed-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20260402192608.1458252-4-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>
audit_init() opens a netlink socket and configures it, but leaks the
file descriptor if audit_set_status() or setsockopt() fails. Fix this
by jumping to an error path that closes the socket before returning.
Apply the same fix to audit_init_with_exe_filter(), which leaks the file
descriptor from audit_init() if audit_init_filter_exe() or
audit_filter_exe() fails, and to audit_cleanup(), which leaks it if
audit_init_filter_exe() fails in FIXTURE_TEARDOWN_PARENT().
Cc: Günther Noack <gnoack@google.com>
Cc: stable@vger.kernel.org
Fixes: 6a500b2297 ("selftests/landlock: Add tests for audit flags and domain IDs")
Reviewed-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20260402192608.1458252-3-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>
snprintf() returns the number of characters that would have been
written, excluding the terminating NUL byte. When the output is
truncated, this return value equals or exceeds the buffer size. Fix
matches_log_domain_allocated() and matches_log_domain_deallocated() to
detect truncation with ">=" instead of ">".
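The off-by-one is easy to see in a small wrapper. The helper below is a hypothetical sketch, not code from the selftest; it only illustrates why the comparison has to be ">=".

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format into buf and report truncation. vsnprintf() returns the length
 * the full string would have had, excluding the NUL byte, so a return
 * value equal to size already means the last character was dropped.
 * Returns the string length on success, -1 on error or truncation.
 */
int format_checked(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int len;

	va_start(ap, fmt);
	len = vsnprintf(buf, size, fmt, ap);
	va_end(ap);

	if (len < 0)
		return -1;	/* output error */
	if ((size_t)len >= size)
		return -1;	/* truncated: note ">=", not ">" */
	return len;
}
```

With an 8-byte buffer, a 7-character string fits exactly, while an 8-character string is truncated even though the return value does not exceed the buffer size.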
Cc: Günther Noack <gnoack@google.com>
Cc: stable@vger.kernel.org
Fixes: 6a500b2297 ("selftests/landlock: Add tests for audit flags and domain IDs")
Reviewed-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20260402192608.1458252-2-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>
LANDLOCK_RESTRICT_SELF_TSYNC does not allow
LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF with ruleset_fd=-1, preventing
a multithreaded process from atomically propagating subdomain log muting
to all threads without creating a domain layer. Relax the fd=-1
condition to accept TSYNC alongside LOG_SUBDOMAINS_OFF, and update the
documentation accordingly.
Add flag validation tests for all TSYNC combinations with ruleset_fd=-1,
and audit tests verifying both transition directions: muting via TSYNC
(logged to not logged) and override via TSYNC (not logged to logged).
Cc: Günther Noack <gnoack@google.com>
Cc: stable@vger.kernel.org
Fixes: 42fc7e6543 ("landlock: Multithreading support for landlock_restrict_self()")
Reviewed-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20260407164107.2012589-2-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>
hook_cred_transfer() only copies the Landlock security blob when the
source credential has a domain. This is inconsistent with
landlock_restrict_self() which can set LOG_SUBDOMAINS_OFF on a
credential without creating a domain (via the ruleset_fd=-1 path): the
field is committed but not preserved across fork() because the child's
prepare_creds() calls hook_cred_transfer() which skips the copy when
domain is NULL.
This breaks the documented use case where a process mutes subdomain logs
before forking sandboxed children: the children lose the muting and
their domains produce unexpected audit records.
Fix this by unconditionally copying the Landlock credential blob.
Cc: Günther Noack <gnoack@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: stable@vger.kernel.org
Fixes: ead9079f75 ("landlock: Add LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF")
Reviewed-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20260407164107.2012589-1-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>
CO-RE accessor strings are colon-separated indices that describe a path
from a root BTF type to a target field, e.g. "0:1:2" walks through
nested struct members. bpf_core_parse_spec() parses each component with
sscanf("%d"), so negative values like -1 are silently accepted. The
subsequent bounds checks (access_idx >= btf_vlen(t)) only guard the
upper bound and always pass for negative values because C integer
promotion converts the __u16 btf_vlen result to int, making the
comparison (int)(-1) >= (int)(N) false for any positive N.
When -1 reaches btf_member_bit_offset() it gets cast to u32 0xffffffff,
producing an out-of-bounds read far past the members array. A crafted
BPF program with a negative CO-RE accessor on any struct that exists in
vmlinux BTF (e.g. task_struct) crashes the kernel deterministically
during BPF_PROG_LOAD on any system with CONFIG_DEBUG_INFO_BTF=y
(default on major distributions). The bug is reachable with CAP_BPF:
BUG: unable to handle page fault for address: ffffed11818b6626
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
Oops: Oops: 0000 [#1] SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 85 Comm: poc Not tainted 7.0.0-rc6 #18 PREEMPT(full)
RIP: 0010:bpf_core_parse_spec (tools/lib/bpf/relo_core.c:354)
RAX: 00000000ffffffff
Call Trace:
<TASK>
bpf_core_calc_relo_insn (tools/lib/bpf/relo_core.c:1321)
bpf_core_apply (kernel/bpf/btf.c:9507)
check_core_relo (kernel/bpf/verifier.c:19475)
bpf_check (kernel/bpf/verifier.c:26031)
bpf_prog_load (kernel/bpf/syscall.c:3089)
__sys_bpf (kernel/bpf/syscall.c:6228)
</TASK>
CO-RE accessor indices are inherently non-negative (struct member index,
array element index, or enumerator index), so reject negative values
immediately after parsing.
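Both halves of the problem can be reproduced in user space. The sketch below is loosely modelled on the description above and is not the libbpf code: parse_accessor() and rejected_by_upper_bound() are hypothetical names, and the parser is deliberately simplified.

```c
#include <stdio.h>

/*
 * Upper-bound-only check, as described above. vlen is promoted to int,
 * so for idx == -1 the comparison is false and the index is accepted.
 * Returns nonzero when idx is rejected.
 */
int rejected_by_upper_bound(int idx, unsigned short vlen)
{
	return idx >= vlen;
}

/*
 * Parse a colon-separated accessor such as "0:1:2" into out[].
 * Returns the number of indices, or -1 for a malformed spec, a negative
 * index, or more than max components.
 */
int parse_accessor(const char *spec, int *out, int max)
{
	int n = 0, idx, used;

	while (*spec) {
		if (*spec == ':')
			spec++;
		if (sscanf(spec, "%d%n", &idx, &used) != 1)
			return -1;
		if (idx < 0)	/* "%d" accepts "-1", so check explicitly */
			return -1;
		if (n == max)
			return -1;
		out[n++] = idx;
		spec += used;
	}
	return n;
}
```

Rejecting the value at parse time means no later bounds check has to reason about signedness.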
Fixes: ddc7c30426 ("libbpf: implement BPF CO-RE offset relocation algorithm")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/20260404161221.961828-2-bestswngs@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Enable the following tests on s390:
* memslot_modification_stress_test
* memslot_perf_test
* mmu_stress_test
Since the first two tests are now supported on all architectures, move
them into TEST_GEN_PROGS_COMMON and out of the individual architectures.
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Remove the 1M memslot alignment requirement for s390, since it is not
needed anymore.
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Add --rdonly_shmem_buf option to kublk that registers shared memory
buffers with UBLK_SHMEM_BUF_READ_ONLY (read-only pinning without
FOLL_WRITE) and mmaps with PROT_READ only.
Add test_shmemzc_04.sh which exercises the new flag with a null target,
hugetlbfs buffer, and write workload. Write I/O works because the
server only reads from the shared buffer — the data flows from client
to kernel to the shared pages, and the server reads them out.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-11-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add test_shmemzc_03.sh which exercises shmem_zc through the full
filesystem stack: mkfs ext4 on the ublk device, mount it, then run
fio verify on a file inside the filesystem with --mem=mmaphuge.
Extend _mkfs_mount_test() to accept an optional command that runs
between mount and umount. The function cd's into the mount directory
so the command can use relative file paths. Existing callers that
pass only the device are unaffected.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-10-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add test_shmem_zc_02.sh which tests the UBLK_IO_F_SHMEM_ZC zero-copy
path on the loop target using a hugetlbfs shared buffer. Both kublk and
fio mmap the same hugetlbfs file with MAP_SHARED, sharing physical
pages. The kernel's PFN matching enables zero-copy — the loop target
reads/writes directly from the shared buffer to the backing file.
Uses standard fio --mem=mmaphuge:<path> (supported since fio 1.10),
no patched fio required.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-9-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add test_shmem_zc_01.sh which tests UBLK_IO_F_SHMEM_ZC on the null
target using a hugetlbfs shared buffer. Both kublk (--htlb) and fio
(--mem=mmaphuge:<path>) mmap the same hugetlbfs file with MAP_SHARED,
sharing physical pages. The kernel PFN match enables zero-copy I/O.
Uses standard fio --mem=mmaphuge:<path> (supported since fio 1.10),
no patched fio required.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add loop_queue_shmem_zc_io() which handles I/O requests marked with
UBLK_IO_F_SHMEM_ZC. When the kernel sets this flag, the request data
lives in a registered shared memory buffer — decode index + offset
from iod->addr and use the server's mmap as the I/O buffer.
The dispatch check in loop_queue_tgt_rw_io() routes SHMEM_ZC requests
to this new function, bypassing the normal buffer registration path.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add infrastructure for UBLK_F_SHMEM_ZC shared memory zero-copy:
- kublk.h: struct ublk_shmem_entry and table for tracking registered
shared memory buffers
- kublk.c: per-device unix socket listener that accepts memfd
registrations from clients via SCM_RIGHTS fd passing. The listener
mmaps the memfd and registers the VA range with the kernel for PFN
matching. Also adds --shmem_zc command line option.
- kublk.c: --htlb <path> option to open a pre-allocated hugetlbfs
file, mmap it with MAP_SHARED|MAP_POPULATE, and register it with
the kernel via ublk_ctrl_reg_buf(). Any process that mmaps the same
hugetlbfs file shares the same physical pages, enabling zero-copy
without socket-based fd passing.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260331153207.3635125-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Fixes errors in the cpupower-frequency-info short option names in
its manpage.
- Fixes the cpupower-idle-info perf option name in its manpage.
- Adds the boost and epp options of cpupower-frequency-info to its
manpage.
- Adds a description of the perf-bias option of cpupower-info to its
manpage.
- Removes unnecessary extern declarations from getopt.h in the argument
parsing functions of the cpufreq-set, cpuidle-info, cpuidle-set,
cpupower-info, and cpupower-set utilities. These functions are
defined in the getopt.h file.
Merge tag 'linux-cpupower-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux
Pull cpupower utility updates for 7.1-rc1 from Shuah Khan:
"- Fixes errors in the cpupower-frequency-info short option names in
its manpage.
- Fixes the cpupower-idle-info perf option name in its manpage.
- Adds the boost and epp options of cpupower-frequency-info to its
manpage.
- Adds a description of the perf-bias option of cpupower-info to its
manpage.
- Removes unnecessary extern declarations from getopt.h in the argument
parsing functions of the cpufreq-set, cpuidle-info, cpuidle-set,
cpupower-info, and cpupower-set utilities. These functions are
defined in the getopt.h file."
* tag 'linux-cpupower-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux:
cpupower: remove extern declarations in cmd functions
cpupower-info.1: describe the --perf-bias option
cpupower-frequency-info.1: document --boost and --epp options
cpupower-frequency-info.1: use the proper name of the --perf option
cpupower-idle-info.1: fix short option names
Drop support for HMAC-RIPEMD-160 from IPsec to reduce the UAPI surface
and simplify future maintenance. It's almost certainly unused.
RIPEMD-160 received some attention in the early 2000s when SHA-* weren't
quite as well established. But it never received much adoption outside
of certain niches such as Bitcoin.
It's actually unclear that Linux + IPsec + HMAC-RIPEMD-160 has *ever*
been used, even historically. When support for it was added in 2003, it
was done so in a "cleanup" commit without any justification [1]. It
didn't actually work until someone happened to fix it 5 years later [2].
That person didn't use or test it either [3]. Finally, also note that
"hmac(rmd160)" is by far the slowest of the algorithms in aalg_list[].
Of course, today IPsec is usually used with an AEAD, such as AES-GCM.
But even for IPsec users still using a dedicated auth algorithm, they
almost certainly aren't using, and shouldn't use, HMAC-RIPEMD-160.
Thus, let's just drop support for it. Note: no kconfig update is
needed, since CRYPTO_RMD160 wasn't actually being selected anyway.
References:
[1] linux-history commit d462985fc1941a47
("[IPSEC]: Clean up key manager algorithm handling.")
[2] linux commit a13366c632
("xfrm: xfrm_algo: correct usage of RIPEMD-160")
[3] https://lore.kernel.org/all/1212340578-15574-1-git-send-email-rueegsegger@swiss-it.ch
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The automatic skipping of tests on ENOSYS returns was introduced in
commit 349afc8a52 ("selftests/nolibc: skip tests for unimplemented
syscalls"). It handled the fact that nolibc would return ENOSYS for many
syscall wrappers on riscv32.
Nowadays nolibc handles all these correctly, so this logic is not used
anymore. To make missing nolibc functionality more obvious, fail the
tests again if something is not implemented.
Revert the mentioned commit.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260406-nolibc-no-skip-enosys-v1-2-c046b1ac7d73@weissschuh.net/
Add some standard functions to convert between different byte orders.
Conveniently the UAPI headers provide all the necessary functionality.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260405-nolibc-bswap-v1-1-f7699ca9cee0@weissschuh.net
The standard syscall() function or macro uses the libc return value
convention. Errors returned from the kernel as negative values are
stored in errno and -1 is returned. Users who want to avoid using
errno don't have a way to call raw syscalls and check the returned
error.
Add a new macro _syscall() which works like the standard syscall()
but passes through the return value from the kernel unchanged.
The naming scheme and return values match the named _sys_foo()
system call wrappers already part of nolibc.
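The difference between the two conventions can be sketched with a small conversion helper. to_libc_ret() is a hypothetical name used only for this illustration and is not the nolibc implementation; the 4095 bound is the kernel's maximum errno value.

```c
#include <errno.h>

/*
 * Convert a raw kernel return value (a negative errno on failure) into
 * the libc convention (-1 with errno set). A caller that wants to avoid
 * errno would inspect the raw value directly instead.
 */
long to_libc_ret(long raw)
{
	if (raw < 0 && raw >= -4095) {	/* the kernel's error range */
		errno = (int)-raw;
		return -1;
	}
	return raw;
}
```

In these terms, a raw interface returns the value before this conversion, while a libc-style syscall() returns the value after it.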
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260405-nolibc-syscall-v1-3-e5b12bc63211@weissschuh.net
__sysret() transforms the return value from the kernel into the libc
return value convention. There is no reason for it to be called in the
middle of the internals of the syscall() implementation macros.
Move the call up, directly into syscall(), to make the code simpler.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260405-nolibc-syscall-v1-2-e5b12bc63211@weissschuh.net
These macros are the internal implementation of syscall().
They cannot be used by users. Align them with the standard naming
scheme for internal symbols.
The current name also prevents the addition of an application-usable
_syscall() symbol.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260405-nolibc-syscall-v1-1-e5b12bc63211@weissschuh.net
In this "delete re-add signal" MPTCP Join subtest, the endpoint linked
to the initial subflow is removed, but re-added once with a different ID.
It appears that there was an issue when reusing the same ID, recently
fixed by commit d191101dee ("mptcp: pm: in-kernel: always set ID as
avail when rm endp"). The test now reuses the same ID the first
time, but continues to use another one (88) the second time.
This should cover more cases.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/615
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260403-net-next-mptcp-msg_eor-misc-v1-5-b0b33bea3fed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When send() or recv() returns -1 with errno == EINTR, the code skips
the break but still adds the return value to nwritten/nread, making it
decrease by 1. This leads to wrong buffer offsets and a wrong byte count.
Fix it by explicitly continuing the loop on EINTR, so the return value
is only added when it is positive.
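The corrected loop shape can be sketched as below. send_all() and recv_all() are illustrative names, not the vsock test utilities; the point is only that the EINTR branch continues before the return value is ever added to the running total.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Send len bytes; returns the bytes sent, or -1 on a non-EINTR error. */
ssize_t send_all(int fd, const void *buf, size_t len, int flags)
{
	size_t nwritten = 0;

	while (nwritten < len) {
		ssize_t ret = send(fd, (const char *)buf + nwritten,
				   len - nwritten, flags);

		if (ret < 0) {
			if (errno == EINTR)
				continue;	/* retry; never add -1 */
			return -1;
		}
		if (ret == 0)
			break;
		nwritten += ret;
	}
	return (ssize_t)nwritten;
}

/* Receive up to len bytes; stops early if the peer closes. */
ssize_t recv_all(int fd, void *buf, size_t len, int flags)
{
	size_t nread = 0;

	while (nread < len) {
		ssize_t ret = recv(fd, (char *)buf + nread,
				   len - nread, flags);

		if (ret < 0) {
			if (errno == EINTR)
				continue;	/* retry; never add -1 */
			return -1;
		}
		if (ret == 0)
			break;	/* peer closed the connection */
		nread += ret;
	}
	return (ssize_t)nread;
}
```

Keeping the counters unsigned and only adding positive return values means the buffer offset cannot move backwards.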
Fixes: a8ed71a27e ("vsock/test: add recv_buf() utility function")
Fixes: 12329bd51f ("vsock/test: add send_buf() utility function")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Link: https://patch.msgid.link/20260403093251.30662-1-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since the maximum user-defined headroom in a umem has changed, adjust
testapp_stats_rx_dropped() so that it passes the updated headroom
validation in xdp_umem_reg() and still drops half of the frames.
The test runs on a non-mbuf setup, so __xsk_pool_get_rx_frame_size(),
called from xsk_rcv_check(), does not account for the skb_shared_info
size. The test must still take the tailroom size into account because
xdp_umem_reg() respects it by default.
Reviewed-by: Björn Töpel <bjorn@kernel.org>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20260402154958.562179-9-maciej.fijalkowski@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently two different XDP programs share a static variable for
different purposes (picking where to redirect on shared umem test &
whether to drop a packet). This can be a problem when running the full
test suite: idx can be written by the shared umem test, and this value can
cause incorrect behavior within the XDP drop half test.
Introduce a dedicated variable for the drop half test so that these two
don't step on each other's toes. There is no real need for using
__sync_fetch_and_add here as XSK tests are executed on a single CPU.
Reviewed-by: Björn Töpel <bjorn@kernel.org>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20260402154958.562179-8-maciej.fijalkowski@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Skip tail adjust tests in xskxceiver for SKB mode as it does not handle
them well. The multi-buffer case does not work because the xdp_rxq_info
registered for generic XDP does not report ::frag_size. The non-mbuf
path copies the packet via skb_pp_cow_data(), which only accounts for
headroom, leaving no tailroom and therefore causing the underlying XDP
prog to drop packets.
For the multi-buffer test in other modes, change the number of bytes used
for growth: assume the worst-case scenario and account for headroom and
tailroom.
Reviewed-by: Björn Töpel <bjorn@kernel.org>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20260402154958.562179-7-maciej.fijalkowski@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Parametrize the current way of getting the MAX_SKB_FRAGS value from
{sys,proc}fs so that it can be reused to get the cache line size of the
system's CPU. All of this is needed to mimic and compute the size of the
kernel's struct skb_shared_info, which xsk and the test suite interpret
as tailroom.
Introduce two variables in the ifobject struct that carry the count of skb
frags and the tailroom size. Do the reading and computing once, at the
beginning of test suite execution in xskxceiver. For test_progs this
is not possible, as in that environment each test sets up and tears
down its ifobject structs.
Reviewed-by: Björn Töpel <bjorn@kernel.org>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20260402154958.562179-6-maciej.fijalkowski@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A `gotox rX` instruction accepts only values of type PTR_TO_INSN.
The only way to create such a value is to load it from a map of
type insn_array:
rX = *(rY + offset) # rY was read from an insn_array
...
gotox rX
Add instruction-level and C-level selftests to validate loads
with nonzero offsets.
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20260406160141.36943-3-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Sometimes it's hard to spot the ok / not ok lines in the output.
This is especially true for the GRO tests, which retry a lot,
so there's a wall of non-fatal output printed.
Try to color the crucial lines green / red / yellow when running
in a terminal.
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260402215444.1589893-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ensure we reject programs that access beyond the maximum syscall ctx
size, i.e. U16_MAX either through direct accesses or helpers/kfuncs.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406194403.1649608-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add coverage for unaligned access with fixed offsets and variable
offsets, and through helpers or kfuncs.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406194403.1649608-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ensure that global subprogs and tail calls can only accept an unmodified
PTR_TO_CTX for syscall programs. For all other program types, fixed or
variable offsets on PTR_TO_CTX are rejected when passed into an argument
of any call instruction type, through the unified logic of
check_func_arg_reg_off.
Finally, add a positive example of a case that should succeed with all
our previous changes.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406194403.1649608-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add various tests to exercise fixed and variable offsets on PTR_TO_CTX
for syscall programs, and cover disallowed cases for other program types
lacking convert_ctx_access callback. Load verifier_ctx with CAP_SYS_ADMIN
so that kfunc related logic can be tested. While at it, convert assembly
tests to C. Unfortunately, ctx_pointer_to_helper_2's unpriv case conflicts
with usage of kfuncs in the file and cannot be run.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406194403.1649608-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Convert existing tests from ASM to C, in prep for future changes to add
more comprehensive tests.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406194403.1649608-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Allow accessing PTR_TO_CTX with variable offsets in syscall programs.
Fixed offsets are already enabled for all program types that do not
convert their ctx accesses, since the changes we made in the commit
de6c7d99f8 ("bpf: Relax fixed offset check for PTR_TO_CTX"). Note
that we also lift the restriction on passing syscall context into
helpers, which was not permitted before, and passing modified syscall
context into kfuncs.
The structure of check_mem_access can be mostly shared and preserved,
but we must use check_mem_region_access to correctly verify access with
variable offsets.
The check made in check_helper_mem_access is hardened to only allow
PTR_TO_CTX for syscall programs to be passed in as helper memory. This
was the original intention of the existing code anyway, and it makes
little sense for other program types' context to be utilized as a memory
buffer. In case a convincing example presents itself in the future, this
check can be relaxed further.
We also no longer use the last-byte access to simulate helper memory
access, but instead go through check_mem_region_access. Since this no
longer updates our max_ctx_offset, we must do so manually, to keep track
of the maximum offset at which the program ctx may be accessed.
Take care to ensure that when arg_type is ARG_PTR_TO_CTX, we do not
relax any fixed or variable offset constraints around PTR_TO_CTX even in
syscall programs, and require them to be passed unmodified. There are
several reasons why this is necessary. First, if we pass a modified ctx,
then the global subprog's accesses will not update the max_ctx_offset to
its true maximum offset, and can lead to out of bounds accesses. Second,
a tail called program (or an extension program replacing a global subprog)
whose max_ctx_offset exceeds that of the program it is called from can
also cause issues. For the latter, unmodified PTR_TO_CTX is the first
requirement for the fix; the second is ensuring max_ctx_offset >= the
program they are being called from, which has to be a separate change
not made in this commit.
All in all, we can hint using arg_type when we expect ARG_PTR_TO_CTX and
make our relaxation around offsets conditional on it.
Drop coverage of syscall tests from verifier_ctx.c temporarily for
negative cases until they are updated in subsequent commits.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406194403.1649608-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
kunit.py will attempt to catch SIGINT / ^C in order to ensure the TTY isn't
messed up, but never actually attempts to terminate the running kernel (be
it UML or QEMU). This can lead to a bit of frustration if the kernel has
crashed or hung.
Terminate the kernel process in the signal handler, if it's running. This
requires plumbing through the process handle in a few more places (and
having some checks to see if the kernel is still running in places where it
may have already been killed).
Reported-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Closes: https://lore.kernel.org/all/aaFmiAmg9S18EANA@smile.fi.intel.com/
Signed-off-by: David Gow <david@davidgow.net>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
run_kernel() cleanup and signal_handler() invoke stty unconditionally.
When stdin is not a tty (for example in CI or unit tests), this writes
noise to stderr.
Call stty only when stdin is a tty.
Add regression tests for these paths:
- run_kernel() with non-tty stdin
- signal_handler() with non-tty stdin
- signal_handler() with tty stdin
Signed-off-by: Shuvam Pandey <shuvampandey1@gmail.com>
Reviewed-by: David Gow <david@davidgow.net>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
If no KTAP header is found in the kernel output (e.g., because the kernel
crashed before the KUnit executor was run), it's very useful to re-run the
test with --raw_output=all, as that will show any error output (such as a
stacktrace, log message, BUG, etc). This is not particularly intuitive,
however, as --raw_output=all is not well known.
Add an extra log line to advertise --raw_output=all in this case, as it's
a terrible user experience to just get "Did any KUnit tests run?"
Signed-off-by: David Gow <david@davidgow.net>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Currently, kunit.py allows listing all individual tests via --list_tests.
However, users often need to see only the available test suites.
Add --list_suites to show suites. This option parses the test list output
from the kernel and prints only the suite names.
Example of the output of --list_suites:
example_init
miscdev_init
printk-ringbuffer
Signed-off-by: Ryota Sakamoto <sakamo.ryota@gmail.com>
Reviewed-by: David Gow <david@davidgow.net>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
scx_alloc_free_idx() zeroes the payload of a freed arena allocation
one word at a time. The loop bound was alloc->pool.elem_size / 8, but
elem_size includes sizeof(struct sdt_data) (the 8-byte union sdt_id
header). This caused the loop to write one extra u64 past the
allocation, corrupting the tid field of the adjacent pool element.
Fix the loop bound to (elem_size - sizeof(struct sdt_data)) / 8 so
only the payload portion is zeroed.
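The corrected bound can be sketched as follows. This is only an
illustration: struct sketch_hdr is a hypothetical 8-byte header standing
in for struct sdt_data, and sketch_zero_payload() is not the scheduler's
actual function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-byte per-element header (stand-in for struct sdt_data). */
struct sketch_hdr {
	uint64_t tid;
};

/*
 * Zero only the payload of one element. elem_size covers header plus
 * payload, so the header size is subtracted before counting words.
 */
static void sketch_zero_payload(uint64_t *payload, size_t elem_size)
{
	size_t words, i;

	assert(elem_size >= sizeof(struct sketch_hdr));
	words = (elem_size - sizeof(struct sketch_hdr)) / 8;

	for (i = 0; i < words; i++)
		payload[i] = 0;
}
```

With a 24-byte element, the loop writes two words; using elem_size / 8
instead would write three and clobber the next element's header.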
Test plan:
- Add a temporary sanity check in scx_task_free() before the free call:
if (mval->data->tid.idx != mval->tid.idx)
scx_bpf_error("tid corruption: arena=%d storage=%d",
mval->data->tid.idx, (int)mval->tid.idx);
- stress-ng --fork 100 -t 10 & sudo ./build/bin/scx_sdt
Without this fix, running scx_sdt under fork-heavy load triggers the
corruption error. With the fix applied, the same workload completes
without error.
Fixes: 36929ebd17 ("tools/sched_ext: add arena based scheduler")
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
nolibc should work without libgcc to be compatible with as many
toolchains as possible. Currently the functionality tested by
nolibc-test does not depend on libgcc; make sure it stays
this way by not linking libgcc anymore.
On the ppc target GCC always emits references to '_restgpr_' functions,
so keep linking libgcc there.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260404-nolibc-libgcc-v1-1-eb3ecfe0e176@weissschuh.net
On some architectures without native division instructions
the division can generate calls into libgcc/compiler-rt.
This library might not be available, so its use should be avoided.
Use the compiler builtin to check for overflows without needing a
division. The builtin has been available since GCC 3 and clang 3.8.
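The message does not name the builtin. The sketch below assumes it is
__builtin_mul_overflow() and contrasts it with a division-based check;
both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Division-based check: on targets without a divide instruction the
 * compiler may emit a call into libgcc/compiler-rt for the division.
 */
static bool sketch_overflows_div(size_t nmemb, size_t size)
{
	return nmemb != 0 && size > SIZE_MAX / nmemb;
}

/*
 * Builtin-based check: returns true on overflow, otherwise stores the
 * product in *total. No division is involved.
 */
static bool sketch_overflows_builtin(size_t nmemb, size_t size, size_t *total)
{
	assert(total != NULL);
	return __builtin_mul_overflow(nmemb, size, total);
}
```

Both variants agree on whether the multiplication overflows; only the
second one also yields the product as a side effect.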
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260404-nolibc-asprintf-v2-1-17d2d0df9763@weissschuh.net
extern char *optarg and extern int optind, opterr, optopt are
already declared by <getopt.h>, which is included at the top of
the file. Repeating extern declarations inside a function body
is misleading and unnecessary.
Signed-off-by: Kaushlendra Kumar <kaushlendra.kumar@intel.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Before the fix, teardown of a ublk server that was attempting to recover
a device, but died when it had submitted a nonempty proper subset of the
fetch commands to any queue, would loop forever. Add a test to verify
that, after the fix, teardown completes. This is done by:
- Adding a new argument to the fault_inject target that causes it to die
after fetching a nonempty proper subset of the IOs to a queue
- Using that argument in a new test while trying to recover an
already-created device
- Attempting to delete the ublk device at the end of the test; this
hangs forever if teardown from the fault-injected ublk server never
completed.
It was manually verified that the test passes with the fix and hangs
without it.
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260405-cancel-v2-2-02d711e643c2@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Running perf sched stats requires root and it fails to open the
schedstat file for regular users. Let's skip the test.
$ perf sched stats true
Failed to open /proc/sys/kernel/sched_schedstats
Reviewed-by: Ian Rogers <irogers@google.com>
Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
When the evlist is expanded the metric leader wasn't being updated. As
the original evsel is deleted this creates a use-after-free in
stat-shadow's prepare_metric. This was detected running the "perf stat
--bpf-counters --for-each-cgroup test" with sanitizers.
The change itself puts the copied evsel into the priv field (known
unused because of evsel__clone use) and then in a second pass over the
list updates the copied values using the priv pointer.
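The two-pass approach can be sketched on a simplified list. Everything
here is hypothetical (struct sketch_evsel is not perf's struct evsel),
and the sketch assumes that every metric leader is itself on the list
being copied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-in for an evsel. */
struct sketch_evsel {
	struct sketch_evsel *next;
	struct sketch_evsel *metric_leader;
	void *priv;	/* scratch: maps an original to its copy */
};

/*
 * Copy a list so that metric_leader pointers in the copy refer to
 * copies, never to originals that may be freed later. Returns NULL for
 * an empty list or on allocation failure.
 */
static struct sketch_evsel *sketch_clone_list(struct sketch_evsel *head)
{
	struct sketch_evsel *pos, *copy, *new_head = NULL, **tail = &new_head;

	/* Pass 1: copy each element and remember the copy in priv. */
	for (pos = head; pos; pos = pos->next) {
		assert(pos->priv == NULL);	/* priv must be unused */
		copy = calloc(1, sizeof(*copy));
		if (!copy)
			goto err;
		copy->metric_leader = pos->metric_leader; /* still an original */
		pos->priv = copy;
		*tail = copy;
		tail = &copy->next;
	}

	/* Pass 2: redirect each leader through the original's priv. */
	for (copy = new_head; copy; copy = copy->next) {
		if (copy->metric_leader)
			copy->metric_leader = copy->metric_leader->priv;
	}
	return new_head;

err:
	while (new_head) {
		copy = new_head->next;
		free(new_head);
		new_head = copy;
	}
	return NULL;
}
```

The second pass is needed because a leader may appear later in the list
than the elements that point to it, so its copy may not exist yet
during the first pass.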
Fixes: d1c5a0e86a ("perf stat: Add --for-each-cgroup option")
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Sun Jian <sun.jian.kdev@gmail.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Add the evsel from evsel__parse_sample into the struct
perf_sample. Sometimes we want to alter the evsel associated with a
sample, such as with off-cpu bpf-output events. In general the evsel
and perf_sample are passed as a pair, but this makes an altered evsel
something of a chore to keep checking for and setting up. Later
patches will remove passing an evsel with the perf_sample and switch
to just using the perf_sample's value.
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The deferred stack trace code wasn't using perf_sample__init/exit. Add
the deferred stack trace clean up to perf_sample__exit which requires
proper NULL initialization in perf_sample__init. Make the
perf_sample__exit robust to being called more than once by using
zfree. Make the error paths in evsel__parse_sample exit the
sample. Add a merged_callchain boolean to capture that callchain is
allocated; deferred_callchain doesn't suffice for this. Pack the struct
variables to avoid padding bytes for this.
Similarly powerpc_vpadtl_sample wasn't using perf_sample__init/exit,
use it for consistency and to avoid potential issues with uninitialized
variables.
Similarly guest_session__inject_events in builtin-inject wasn't using
perf_sample__init/exit. The lifetime management for fetched events is
somewhat complex there, but when an event is fetched the sample should
be initialized and needs exiting on error. The sample may be left in
place so that future injects have access to it.
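The idempotent-exit idea can be sketched as below. The struct and
function names are hypothetical, and sketch_zfree() only imitates the
free-and-NULL behaviour the message attributes to zfree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Free *ptr and reset it to NULL so that a second call is harmless. */
#define sketch_zfree(ptr) do { free(*(ptr)); *(ptr) = NULL; } while (0)

/* Hypothetical, reduced stand-in for a sample with an owned callchain. */
struct sketch_sample {
	void *callchain;
	bool merged_callchain;	/* true only when callchain was allocated */
};

/* NULL-initialize everything the exit function may later free. */
static void sketch_sample_init(struct sketch_sample *s)
{
	assert(s != NULL);
	s->callchain = NULL;
	s->merged_callchain = false;
}

/* Safe to call more than once, including from error paths. */
static void sketch_sample_exit(struct sketch_sample *s)
{
	assert(s != NULL);
	if (s->merged_callchain)
		sketch_zfree(&s->callchain);
	s->merged_callchain = false;
}
```

Because the pointer is reset after freeing, an error path and a later
regular cleanup can both call the exit function without a double free.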
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Add kernel-doc for struct perf_sample capturing the somewhat unusual
population of fields and lifetime relationships.
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Store cacheline size during perf record in header, so that cacheline
size can be used for other features, like sort keys for perf report.
Testing example with feat enabled:
$ perf record ./Example
$ perf report --header-only | grep -C 3 cacheline
CPU_DOMAIN_INFO info available, use -I to display
e_machine : 62
e_flags : 0
cacheline size: 64
missing features: TRACING_DATA BUILD_ID BRANCH_STACK GROUP_DESC AUXTRACE \
STAT CLOCKID DIR_FORMAT COMPRESSED CLOCK_DATA
========
[namhyung: Update the commit message and remove blank lines]
Signed-off-by: Ricky Ringler <ricky.ringler@proton.me>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Writing to the perf.data file can fail in various contexts such as
continual testing. Other tests write to a mktemp-ed file; make the "perf
sched stats tests" follow this convention.
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Doing a `perf sched record` then `perf sched stats report` crashes as
the tp_handler isn't set. Add a dummy tp_handler for it rather than
adding an extra check.
Reported-by: Ian Rogers <irogers@google.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
tc_tunnel test is based on a send_and_test_data function which takes a
subtest configuration, and a boolean indicating whether the connection
is supposed to fail or not. This boolean is always passed as
true, and is a remnant from the first (not integrated) attempts to
convert tc_tunnel to test_progs: those versions validated, for
example, that a connection properly fails when only one side of the
connection has tunneling enabled. This specific testing has not been
integrated because it involved large timeouts which considerably
increased the test duration, for little added value.
Remove the unused boolean from send_and_test_data to simplify the
generic part of subtests.
Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
Acked-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/20260403-tc_tunnel_cleanup-v1-1-4f1bb113d3ab@bootlin.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Verify that bpf_map__get_next_key() correctly returns -ENOENT when
called on the last (and only) key in a cgroup_storage map. Before the
fix in the previous patch, this would succeed with bogus key data
instead of failing.
Suggested-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Acked-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/20260403132951.43533-3-bestswngs@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a consistency subtest to htab_reuse that detects torn writes
caused by the BPF_F_LOCK lockless update racing with element
reallocation in alloc_htab_elem().
The test uses three thread roles started simultaneously via a pipe:
- locked updaters: BPF_F_LOCK|BPF_EXIST in-place updates
- delete+update workers: delete then BPF_ANY|BPF_F_LOCK insert
- locked readers: BPF_F_LOCK lookup checking value consistency
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260401-bpf_map_torn_writes-v1-2-782d071c55e7@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Merge tag 'riscv-for-linus-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Paul Walmsley:
- Fix a CONFIG_SPARSEMEM crash on RV32 by avoiding early phys_to_page()
- Prevent runtime const infrastructure from being used by modules,
similar to what was done for x86
- Avoid problems when shutting down ACPI systems with IOMMUs by adding
a device dependency between IOMMU and devices that use it
- Fix a bug where the CPU pointer masking state isn't properly reset
when tagged addresses aren't enabled for a task
- Fix some incorrect register assignments, and add some missing ones,
in kgdb support code
- Fix compilation of non-kernel code that uses the ptrace uapi header
by replacing BIT() with _BITUL()
- Fix compilation of the validate_v_ptrace kselftest by working around
kselftest macro expansion issues
* tag 'riscv-for-linus-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
ACPI: RIMT: Add dependency between iommu and devices
selftests: riscv: Add braces around EXPECT_EQ()
riscv: use _BITUL macro rather than BIT() in ptrace uapi and kselftests
riscv: Reset pmm when PR_TAGGED_ADDR_ENABLE is not set
riscv: make runtime const not usable by modules
riscv: patch: Avoid early phys_to_page()
riscv: kgdb: fix several debug register assignment bugs
A user can invoke mmap_action_map_kernel_pages() to specify that the
mapping should map a specified number of kernel pages, provided in an
array, starting from desc->start.
In order to implement this, adjust mmap_action_prepare() to be able to
return an error code, as it makes sense to assert that the specified
parameters are valid as quickly as possible as well as updating the VMA
flags to include VMA_MIXEDMAP_BIT as necessary.
This provides an mmap_prepare equivalent of vm_insert_pages(). We
additionally update the existing vm_insert_pages() code to use
range_in_vma() and add a new range_in_vma_desc() helper function for the
mmap_prepare case, sharing the code between the two in range_is_subset().
We add both mmap_action_map_kernel_pages() and
mmap_action_map_kernel_pages_full() to allow for both partial and full VMA
mappings.
We update the documentation to reflect the new features.
Finally, we update the VMA tests accordingly to reflect the changes.
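The sharing of the range check between the VMA and descriptor cases
could look roughly like the sketch below. The types and function bodies
are hypothetical simplifications written for illustration; only the
idea of one common subset check with two thin wrappers is taken from
the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced views of a VMA and a VMA descriptor. */
struct sketch_vma {
	unsigned long vm_start, vm_end;
};

struct sketch_desc {
	unsigned long start, end;
};

/* True when [start, end) lies entirely within [outer_start, outer_end). */
static bool sketch_range_is_subset(unsigned long outer_start,
				   unsigned long outer_end,
				   unsigned long start, unsigned long end)
{
	return outer_start <= start && start <= end && end <= outer_end;
}

/* VMA flavour: a NULL VMA never contains the range. */
static bool sketch_range_in_vma(const struct sketch_vma *vma,
				unsigned long start, unsigned long end)
{
	return vma && sketch_range_is_subset(vma->vm_start, vma->vm_end,
					     start, end);
}

/* Descriptor flavour, for use before the VMA exists. */
static bool sketch_range_in_vma_desc(const struct sketch_desc *desc,
				     unsigned long start, unsigned long end)
{
	assert(desc != NULL);
	return sketch_range_is_subset(desc->start, desc->end, start, end);
}
```

Validating the range against the descriptor lets invalid parameters be
rejected at prepare time, before any mapping is established.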
Link: https://lkml.kernel.org/r/926ac961690d856e67ec847bee2370ab3c6b9046.1774045440.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Torgue <alexandre.torgue@foss.st.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bodo Stroesser <bostroesser@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Long Li <longli@microsoft.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
While the conversion of mmap hooks to mmap_prepare is underway, we will
encounter situations where mmap hooks need to invoke nested mmap_prepare
hooks.
The nesting of mmap hooks is termed 'stacking'. In order to flexibly
facilitate the conversion of custom mmap hooks in drivers which stack, we
must split up the existing __compat_vma_mmap() function into two separate
functions:
* compat_set_desc_from_vma() - This allows the setting of a vm_area_desc
object's fields to the relevant fields of a VMA.
* __compat_vma_mmap() - Once an mmap_prepare hook has been executed upon a
vm_area_desc object, this function performs any mmap actions specified by
the mmap_prepare hook and then invokes its vm_ops->mapped() hook if one
was specified.
In ordinary cases, where a file's f_op->mmap_prepare() hook simply needs
to be invoked in a stacked mmap() hook, compat_vma_mmap() can be used.
However some drivers define their own nested hooks, which are invoked in
turn by another hook.
A concrete example is vmbus_channel->mmap_ring_buffer(), which is invoked
in turn by bin_attribute->mmap():
vmbus_channel->mmap_ring_buffer() has a signature of:
int (*mmap_ring_buffer)(struct vmbus_channel *channel,
struct vm_area_struct *vma);
And bin_attribute->mmap() has a signature of:
int (*mmap)(struct file *, struct kobject *,
const struct bin_attribute *attr,
struct vm_area_struct *vma);
And so compat_vma_mmap() cannot be used here for incremental conversion of
hooks from mmap() to mmap_prepare().
There are many such instances like this, where conversion to mmap_prepare
would otherwise cascade to a huge change set due to nesting of this kind.
The changes in this patch mean we could now instead convert
vmbus_channel->mmap_ring_buffer() to
vmbus_channel->mmap_prepare_ring_buffer(), and implement something like:
struct vm_area_desc desc;
int err;
compat_set_desc_from_vma(&desc, file, vma);
err = channel->mmap_prepare_ring_buffer(channel, &desc);
if (err)
return err;
return __compat_vma_mmap(&desc, vma);
Allowing us to incrementally update this logic, and other logic like it.
Unfortunately, as part of this change, we need to be able to flexibly
assign to the VMA descriptor, so have to remove some of the const
declarations within the structure.
Also update the VMA tests to reflect the changes.
Link: https://lkml.kernel.org/r/24aac3019dd34740e788d169fccbe3c62781e648.1774045440.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Torgue <alexandre.torgue@foss.st.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bodo Stroesser <bostroesser@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Long Li <longli@microsoft.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently drivers use vm_iomap_memory() as a simple helper function for
I/O remapping memory over a range starting at a specified physical address
over a specified length.
In order to utilise this from mmap_prepare, separate out the core logic
into __simple_ioremap_prep(), update vm_iomap_memory() to use it, and add
simple_ioremap_prepare() to do the same with a VMA descriptor object.
We also add MMAP_SIMPLE_IO_REMAP and relevant fields to the struct
mmap_action type to permit this operation also.
We use mmap_action_ioremap() to set up the actual I/O remap operation once
we have checked and figured out the parameters, which makes
simple_ioremap_prepare() easy to implement.
We then add mmap_action_simple_ioremap() to allow drivers to make use of
this mode.
We update the mmap_prepare documentation to describe this mode. Finally,
we update the VMA tests to reflect this change.
Link: https://lkml.kernel.org/r/a08ef1c4542202684da63bb37f459d5dbbeddd91.1774045440.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Torgue <alexandre.torgue@foss.st.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bodo Stroesser <bostroesser@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Long Li <longli@microsoft.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Previously, when a driver needed to do something like establish a
reference count, it could do so in the mmap hook in the knowledge that the
mapping would succeed.
With the introduction of f_op->mmap_prepare this is no longer the case, as
it is invoked prior to actually establishing the mapping.
mmap_prepare is not appropriate for this kind of thing, as it is called
before any merge might take place and an error might still occur
afterwards, meaning resources could be leaked.
To take this into account, introduce a new vm_ops->mapped callback which
is invoked when the VMA is first mapped (though notably - not when it is
merged - which is correct and mirrors existing mmap/open/close behaviour).
We do better than vm_ops->open() here, as this callback can return an
error, at which point the VMA will be unmapped.
Note that vm_ops->mapped() is invoked after any mmap action is complete
(such as I/O remapping).
We intentionally do not expose the VMA at this point, exposing only the
fields that could be used, and an output parameter in case the operation
needs to update the vma->vm_private_data field.
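A minimal userspace model of this flow is sketched below. Everything in it
is hypothetical: the sketch_* types, the callback signature and the driver
state are invented for illustration and are not the kernel's definitions.
It only shows the shape of the idea, namely that the hook runs once the
mapping exists, sees selected fields plus an output parameter, and causes
the mapping to be undone if it fails.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for a VMA. */
struct sketch_vma {
	unsigned long start;
	unsigned long end;
	void *private_data;
	bool mapped;
};

/*
 * Hypothetical callback type: it sees selected fields rather than the VMA
 * itself, plus an output parameter for the private data pointer.
 */
typedef int (*sketch_mapped_fn)(unsigned long start, unsigned long end,
				void **private_data);

/* Model of the core: establish the mapping, then invoke the hook. */
static int sketch_finish_mapping(struct sketch_vma *vma,
				 sketch_mapped_fn mapped)
{
	void *private_data = vma->private_data;
	int err;

	vma->mapped = true;
	if (!mapped)
		return 0;

	err = mapped(vma->start, vma->end, &private_data);
	if (err) {
		/* The hook failed, so the new mapping is torn down again. */
		vma->mapped = false;
		return err;
	}

	vma->private_data = private_data;
	return 0;
}

/* Hypothetical driver state holding a reference count. */
struct sketch_device {
	int refcount;
	bool fail_next;
};

static struct sketch_device sketch_dev;

/* Hypothetical driver hook: take a reference only once mapped. */
static int sketch_driver_mapped(unsigned long start, unsigned long end,
				void **private_data)
{
	(void)start;
	(void)end;

	if (sketch_dev.fail_next)
		return -ENOMEM;

	sketch_dev.refcount++;
	*private_data = &sketch_dev;
	return 0;
}
```

Because the reference is only taken after the mapping is established, a
failure earlier in the mmap path cannot leak it in this model.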
In order to deal with stacked filesystems which invoke an inner
filesystem's mmap() hook, add __compat_vma_mapped() and invoke it on
vfs_mmap() (via compat_vma_mmap()) to ensure that the mapped callback is
handled when an mmap() caller invokes a nested filesystem's mmap_prepare()
callback.
Update the mmap_prepare documentation to describe the mapped hook and make
it clear what its intended use is.
The vm_ops->mapped() call is handled by the mmap complete logic to ensure
the same code paths are handled by both the compatibility and VMA layers.
Additionally, update VMA userland test headers to reflect the change.
Link: https://lkml.kernel.org/r/4c5e98297eb0aae9565c564e1c296a112702f144.1774045440.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Torgue <alexandre.torgue@foss.st.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bodo Stroesser <bostroesser@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Long Li <longli@microsoft.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rather than have the callers handle both the rmap lock release and
unmapping the VMA on error, handle these within the mmap_action_complete()
logic where it makes sense to, being careful not to unlock twice.
This simplifies the logic and makes it harder to make mistakes, while
retaining correct behaviour with regard to avoiding deadlocks.
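The idea can be modelled in a few lines of userspace C. The state
structure and sketch_* helpers below are hypothetical, and the ordering
(dropping the lock before undoing the mapping) is an assumption of this
sketch rather than something stated above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the state touched when completing an action. */
struct sketch_map_state {
	bool rmap_locked;	/* is the rmap lock currently held? */
	bool mapped;		/* is the VMA currently mapped? */
	int unlock_calls;	/* how many times the lock was released */
};

/* Release the lock only if still held, so a second call is a no-op. */
static void sketch_rmap_unlock(struct sketch_map_state *state)
{
	if (!state->rmap_locked)
		return;
	state->rmap_locked = false;
	state->unlock_calls++;
}

/*
 * Hypothetical completion step: action_err is the result of the action
 * itself (0 on success). Callers no longer unlock or unmap themselves.
 */
static int sketch_action_complete(struct sketch_map_state *state,
				  int action_err)
{
	/* Assumption: the lock is dropped before any unmapping. */
	sketch_rmap_unlock(state);

	if (action_err)
		state->mapped = false;	/* undo the mapping on failure */

	return action_err;
}
```

With both steps in one place, a caller simply propagates the return value
and cannot forget either the unlock or the unmap.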
Also replace the call_action_complete() function with a direct invocation
of mmap_action_complete() as the abstraction is no longer required.
Also update the VMA tests to reflect this change.
Link: https://lkml.kernel.org/r/8d1ee8ebd3542d006a47e8382fb80cf5b57ecf10.1774045440.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Torgue <alexandre.torgue@foss.st.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bodo Stroesser <bostroesser@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Long Li <longli@microsoft.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In the mmap_prepare compatibility layer, we don't need to hold the rmap
lock, as we are being called from an .mmap handler.
The .mmap_prepare hook, when invoked in the VMA logic, is called prior to
the VMA being instantiated, but the completion hook is called after the VMA
is linked into the maple tree, meaning rmap walkers can reach it.
The mmap hook does not link the VMA into the tree, so this cannot happen.
Therefore it's safe to simply disable this in the mmap_prepare
compatibility layer.
Also update VMA tests code to reflect current compatibility layer state.
[akpm@linux-foundation.org: fix comment typo, per Vlastimil]
Link: https://lkml.kernel.org/r/dda74230d26a1fcd79a3efab61fa4101dd1cac64.1774045440.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Torgue <alexandre.torgue@foss.st.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bodo Stroesser <bostroesser@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Long Li <longli@microsoft.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Describe when the operation is invoked and the context in which it is
invoked, matching the description already added for vm_ops->close().
While we're here, update all outdated references to an 'area' field for
VMAs to the more consistent 'vma'.
Link: https://lkml.kernel.org/r/7d0ca833c12014320f0fa00f816f95e6e10076f2.1774045440.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Torgue <alexandre.torgue@foss.st.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bodo Stroesser <bostroesser@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Long Li <longli@microsoft.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: expand mmap_prepare functionality and usage", v4.
This series expands the mmap_prepare functionality, which is intended to
replace the deprecated f_op->mmap hook which has been the source of bugs
and security issues for some time.
This series starts with some cleanup of existing mmap_prepare logic, then
adds documentation for the mmap_prepare call to make it easier for
filesystem and driver writers to understand how it works.
It then importantly adds a vm_ops->mapped hook, a key feature that was
missing from mmap_prepare previously - this is invoked when a driver which
specifies mmap_prepare has successfully been mapped but not merged with
another VMA.
mmap_prepare is invoked prior to a merge being attempted, so you cannot
manipulate state such as reference counts as if it were a new mapping.
The vm_ops->mapped hook allows a driver to perform tasks required at this
stage, and provides symmetry against subsequent vm_ops->open,close calls.
The series uses this to correct the afs implementation which wrongly
manipulated reference count at mmap_prepare time.
It then adds an mmap_prepare equivalent of vm_iomap_memory() -
mmap_action_simple_ioremap(), then uses this to update a number of drivers.
It then splits out the mmap_prepare compatibility layer (which allows for
invocation of mmap_prepare hooks in an mmap() hook) in such a way as to
allow for more incremental implementation of mmap_prepare hooks.
It then uses this to extend mmap_prepare usage in drivers.
Finally it adds an mmap_prepare equivalent of vm_map_pages(), which lays
the foundation for future work which will extend mmap_prepare to DMA
coherent mappings.
This patch (of 21):
Rather than passing arbitrary fields, pass a vm_area_desc pointer to mmap
prepare functions, and an action and vma pointer to mmap complete
functions, in order to put all the action-specific logic in the function
actually doing the work.
Additionally, allow mmap prepare functions to return an error so we can
error out as soon as possible if there is something logically incorrect in
the input.
Update remap_pfn_range_prepare() to properly check the input range for the
CoW case.
Also remove io_remap_pfn_range_complete(), as we can simply set up the
fields correctly in io_remap_pfn_range_prepare() and use
remap_pfn_range_complete() for this.
While we're here, make remap_pfn_range_prepare_vma() a little neater, and
pass mmap_action directly to call_action_complete().
Then, update compat_vma_mmap() to perform its logic directly, as
__compat_vma_map() is not used by anything so we don't need to export it.
Also update compat_vma_mmap() to use vfs_mmap_prepare() rather than
calling the mmap_prepare op directly.
Finally, update the VMA userland tests to reflect the changes.
Link: https://lkml.kernel.org/r/cover.1774045440.git.ljs@kernel.org
Link: https://lkml.kernel.org/r/99f408e4694f44ab12bdc55fe0bd9685d3bd1117.1774045440.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Torgue <alexandre.torgue@foss.st.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bodo Stroesser <bostroesser@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Long Li <longli@microsoft.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Update the mmap() implementation logic implemented in __mmap_region() and
functions invoked by it. The mmap_region() function converts its input
vm_flags_t parameter to a vma_flags_t value which it then passes to
__mmap_region() which uses the vma_flags_t value consistently from then
on.
As part of the change, we convert map_deny_write_exec() to use
vma_flags_t (it was incorrectly using unsigned long before), and place it
in vma.h, as it is only used internally by mm.
With this change, we eliminate the legacy is_shared_maywrite_vm_flags()
helper function which is now no longer required.
We are also able to update the MMAP_STATE() and VMG_MMAP_STATE() macros to
use the vma_flags_t value.
Finally, we update the VMA tests to reflect the change.
Link: https://lkml.kernel.org/r/1fc33a404c962f02da778da100387cc19bd62153.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Update the vma_modify_flags() and vma_modify_flags_uffd() functions to
accept a vma_flags_t parameter rather than a vm_flags_t one, and propagate
the changes as needed to implement this change.
Also add vma_flags_reset_once() as a replacement for
vm_flags_reset_once(). We still need to be careful here because we need to
avoid tearing, so maintain the assumption that the flags in the first
system word are the only ones that require protection from tearing, and
retain this functionality.
We can copy the remainder of VMA flags above 64 bits normally. But
hopefully by the time that happens, we will have replaced the logic that
requires these WRITE_ONCE()'s with something else.
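A rough userspace sketch of this split is shown below. The sketch_flags_t
type and its 128-bit size are assumptions for illustration, and a volatile
store stands in for WRITE_ONCE(); this is not the kernel's
vma_flags_reset_once() implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for vma_flags_t: a fixed-size bitmap. */
#define SKETCH_NUM_FLAG_BITS 128
#define SKETCH_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_NUM_WORDS (SKETCH_NUM_FLAG_BITS / SKETCH_BITS_PER_WORD)

typedef struct {
	unsigned long words[SKETCH_NUM_WORDS];
} sketch_flags_t;

static void sketch_flags_reset_once(sketch_flags_t *dst,
				    const sketch_flags_t *src)
{
	size_t i;

	/* One volatile store, so the compiler cannot split or repeat it. */
	*(volatile unsigned long *)&dst->words[0] = src->words[0];

	/* The remaining words carry no such requirement in this sketch. */
	for (i = 1; i < SKETCH_NUM_WORDS; i++)
		dst->words[i] = src->words[i];
}
```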
We also replace instances of vm_flags_reset() with a simple write of VMA
flags. We no longer perform a number of checks, most notably the VMA flags
asserts, because:
1. We might be operating on a VMA that is not yet added to the tree.
2. We might be operating on a VMA that is now detached.
3. Really in all but core code, you should be using vma_desc_xxx().
4. Other VMA fields are manipulated with no such checks.
5. It'd be egregious to have to add variants of flag functions just to
account for cases such as the above, especially when we don't do so for
other VMA fields. Drivers are the problematic cases and the reason the
checks were especially important (and also for debug as VMA locks were
introduced); the mmap_prepare work is solving this generally.
Additionally, we can fairly safely assume by this point the soft dirty
flags are being set correctly, so it's reasonable to drop this also.
Finally, update the VMA tests to reflect this.
Link: https://lkml.kernel.org/r/51afbb2b8c3681003cc7926647e37335d793836e.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now we have established a good foundation for vm_flags_t to vma_flags_t
changes, update mm/vma.c to utilise vma_flags_t wherever possible.
We are able to convert VM_STARTGAP_FLAGS entirely as this is only used in
mm/vma.c, and to account for the fact we can't use VM_NONE to make life
easier, place the definition of this within existing #ifdef's to be
cleaner.
Generally the remaining changes are mechanical.
Also update the VMA tests to reflect the changes.
Link: https://lkml.kernel.org/r/5fdeaf8af9a12c2a5d68497495f52fa627d05a5b.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The tests have existing flag clearing logic, so simply expand this to use
the new VMA-specific flag clearing helpers.
Also correct a trivial formatting issue in a macro definition.
Link: https://lkml.kernel.org/r/f5da681d3c33039dd4a838188385796eb8d58373.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce a helper function and helper macro to easily clear a VMA's flags
using the new vma_flags_t vma->flags field:
* vma_clear_flags_mask() - Clears all of the flags in a specified mask in
the VMA's flags field.
* vma_clear_flags() - Clears all of the specified individual VMA flag bits
in a VMA's flags field.
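The relationship between the two helpers can be sketched in userspace C as
follows. The sketch_* names, the bitmap layout and the variadic macro form
are assumptions made for the example; the kernel definitions may well look
different.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for vma_flags_t: a fixed-size bitmap. */
#define SKETCH_NUM_FLAG_BITS 128
#define SKETCH_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_NUM_WORDS (SKETCH_NUM_FLAG_BITS / SKETCH_BITS_PER_WORD)

typedef struct {
	unsigned long words[SKETCH_NUM_WORDS];
} sketch_flags_t;

static void sketch_flags_set(sketch_flags_t *flags, unsigned int bit)
{
	flags->words[bit / SKETCH_BITS_PER_WORD] |=
		1UL << (bit % SKETCH_BITS_PER_WORD);
}

static bool sketch_flags_test(const sketch_flags_t *flags, unsigned int bit)
{
	return flags->words[bit / SKETCH_BITS_PER_WORD] &
		(1UL << (bit % SKETCH_BITS_PER_WORD));
}

/* Clear every bit set in mask from flags. */
static void sketch_clear_flags_mask(sketch_flags_t *flags,
				    const sketch_flags_t *mask)
{
	size_t i;

	for (i = 0; i < SKETCH_NUM_WORDS; i++)
		flags->words[i] &= ~mask->words[i];
}

/* Build a mask from a list of individual bit numbers. */
static sketch_flags_t sketch_mk_mask(const unsigned int *bits, size_t count)
{
	sketch_flags_t mask = { { 0 } };
	size_t i;

	for (i = 0; i < count; i++)
		sketch_flags_set(&mask, bits[i]);
	return mask;
}

/* Clear individually listed bits by building a mask and reusing the above. */
#define sketch_clear_flags(flags, ...) do {				\
	const unsigned int sketch_bits_[] = { __VA_ARGS__ };		\
	const sketch_flags_t sketch_mask_ = sketch_mk_mask(sketch_bits_,	\
		sizeof(sketch_bits_) / sizeof(sketch_bits_[0]));	\
	sketch_clear_flags_mask((flags), &sketch_mask_);		\
} while (0)
```

In this sketch, sketch_clear_flags(&flags, 1, 70) clears bits 1 and 70 and
leaves every other bit untouched.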
Also update the VMA tests to reflect the change.
Link: https://lkml.kernel.org/r/9bd15da35c2c90e7441265adf01b5c2d3b5c6d41.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In order to be able to do this, we need to change VM_DATA_DEFAULT_FLAGS
and friends and update the architecture-specific definitions also.
We then have to update some KSM logic to handle VMA flags, and introduce
VMA_STACK_FLAGS to define the vma_flags_t equivalent of VM_STACK_FLAGS.
We also introduce two helper functions for use during the time we are
converting legacy flags to vma_flags_t values - vma_flags_to_legacy() and
legacy_to_vma_flags().
This enables us to make the conversion iteratively, breaking the changes
up into separate parts.
We use these explicitly here to keep VM_STACK_FLAGS around for certain
users which need to maintain the legacy vm_flags_t values for the time
being.
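To illustrate the role of such conversion helpers, here is a userspace
sketch. It assumes the legacy value is a single unsigned long occupying the
first word of the new bitmap; the sketch_* names and flag values are
hypothetical and do not reproduce the kernel's definitions.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical legacy type and flag values. */
typedef unsigned long sketch_legacy_flags_t;
#define SKETCH_LEGACY_READ  0x1UL
#define SKETCH_LEGACY_WRITE 0x2UL

/* Hypothetical stand-in for vma_flags_t: a fixed-size bitmap. */
#define SKETCH_NUM_FLAG_BITS 128
#define SKETCH_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_NUM_WORDS (SKETCH_NUM_FLAG_BITS / SKETCH_BITS_PER_WORD)

typedef struct {
	unsigned long words[SKETCH_NUM_WORDS];
} sketch_flags_t;

/* Assumption: legacy flags live entirely in the first word. */
static sketch_legacy_flags_t sketch_flags_to_legacy(sketch_flags_t flags)
{
	return flags.words[0];
}

static sketch_flags_t sketch_legacy_to_flags(sketch_legacy_flags_t legacy)
{
	sketch_flags_t flags = { { 0 } };

	flags.words[0] = legacy;
	return flags;
}

/* A not-yet-converted user that still expects the legacy type. */
static bool sketch_old_is_writable(sketch_legacy_flags_t legacy)
{
	return legacy & SKETCH_LEGACY_WRITE;
}
```

Converted code can keep calling unconverted code through
sketch_flags_to_legacy(), which is what allows a conversion of this kind to
proceed one piece at a time.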
We are no longer able to rely on the simple VM_xxx being set to zero if
the feature is not enabled, so in the case of VM_DROPPABLE we introduce
VMA_DROPPABLE as the vma_flags_t equivalent, which is set to
EMPTY_VMA_FLAGS if the droppable flag is not available.
While we're here, we make the description of do_brk_flags() into a kdoc
comment, as it almost was already.
We use vma_flags_to_legacy() to avoid needing to update the
vm_get_page_prot() logic at this time.
Note that in create_init_stack_vma() we have to replace the BUILD_BUG_ON()
with a VM_WARN_ON_ONCE() as the tested values are no longer available at
build time.
We also update mprotect_fixup() to use VMA flags where possible, though we
have to live with a little duplication between vm_flags_t and vma_flags_t
values for the time being until further conversions are made.
While we're here, update VM_SPECIAL to be defined in terms of
VMA_SPECIAL_FLAGS now we have vma_flags_to_legacy().
Finally, we update the VMA tests to reflect these changes.
Link: https://lkml.kernel.org/r/d02e3e45d9a33d7904b149f5604904089fd640ae.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com> [SELinux]
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Update the VMA tests to assert that vma_flags_count() behaves as expected,
as well as vma_flags_test_single_mask() and vma_test_single_mask().
For the test functions we can simply update the existing vma_test(), et
al. tests to also test the single_mask variants.
We also add some explicit testing of an empty VMA flag to this test to
ensure this is handled properly.
In order to test vma_flags_count() we simply take an existing set of flags
and gradually remove flags ensuring the count remains as expected
throughout.
We also update the vma[_flags]_test_all() tests to make clear the
semantics that we expect vma[_flags]_test_all(..., EMPTY_VMA_FLAGS) to
return true, as trivially, all flags of none are always set in VMA flags.
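The empty-mask semantics fall out naturally from a word-by-word subset
check, as the following userspace sketch shows. The sketch_* names and the
bitmap layout are hypothetical, and the "any" variant is included only for
contrast.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for vma_flags_t: a fixed-size bitmap. */
#define SKETCH_NUM_FLAG_BITS 128
#define SKETCH_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_NUM_WORDS (SKETCH_NUM_FLAG_BITS / SKETCH_BITS_PER_WORD)

typedef struct {
	unsigned long words[SKETCH_NUM_WORDS];
} sketch_flags_t;

/* True if every bit of mask is set in flags; vacuously true if empty. */
static bool sketch_flags_test_all(const sketch_flags_t *flags,
				  const sketch_flags_t *mask)
{
	size_t i;

	for (i = 0; i < SKETCH_NUM_WORDS; i++) {
		if ((flags->words[i] & mask->words[i]) != mask->words[i])
			return false;
	}
	return true;
}

/* True if at least one bit of mask is set in flags; false if empty. */
static bool sketch_flags_test_any(const sketch_flags_t *flags,
				  const sketch_flags_t *mask)
{
	size_t i;

	for (i = 0; i < SKETCH_NUM_WORDS; i++) {
		if (flags->words[i] & mask->words[i])
			return true;
	}
	return false;
}
```

An empty mask has no bit that could be missing from the flags, so the
"all" check succeeds, whereas the "any" check has no bit that could match.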
Link: https://lkml.kernel.org/r/4af95d559cd2af0ba3388de1e1386b9f94c0e009.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
vma_flags_count() determines how many bits are set in VMA flags, using
bitmap_weight().
vma_flags_test_single_mask() determines if a vma_flags_t set of flags
contains a single flag specified as another vma_flags_t value. If the
sought flag mask is empty, it is defined to return false.
This is useful when we want to declare a VMA flag as optionally a single
flag in a mask or empty depending on kernel configuration.
This allows us to have VM_NONE-like semantics when checking whether the
flag is set.
In a subsequent patch, we introduce the use of VMA_DROPPABLE of type
vma_flags_t using precisely these semantics.
It would be actively confusing to use vma_flags_test_any_single_mask() for
this (and vma_flags_test_all_mask() is not correct to use here, as it
trivially returns true when tested against an empty vma flags mask).
We introduce vma_flags_count() to be able to assert that the compared flag
mask is singular or empty, checked when CONFIG_DEBUG_VM is enabled.
Also update the VMA tests as part of this change.
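The semantics described above can be illustrated with a minimal userland
sketch. It models a flag set as a single 64-bit word rather than the
kernel's bitmap, and every sketch_ name, along with SKETCH_HAVE_DROPPABLE,
is hypothetical; this is not the kernel implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for vma_flags_t: a single 64-bit word (assumption). */
typedef struct { uint64_t bits; } sketch_flags_t;

/* A flag mask that is one bit or empty, depending on configuration. */
#ifdef SKETCH_HAVE_DROPPABLE
#define SKETCH_DROPPABLE ((sketch_flags_t){ UINT64_C(1) << 40 })
#else
#define SKETCH_DROPPABLE ((sketch_flags_t){ 0 })
#endif

/* Count the set bits, as bitmap_weight() would for a bitmap. */
static inline unsigned int sketch_flags_count(sketch_flags_t flags)
{
	unsigned int count = 0;

	/* Each iteration clears the lowest set bit. */
	for (; flags.bits; flags.bits &= flags.bits - 1)
		count++;
	return count;
}

/* True if every bit in mask is set; trivially true for an empty mask. */
static inline bool sketch_flags_test_all_mask(sketch_flags_t flags,
					      sketch_flags_t mask)
{
	return (flags.bits & mask.bits) == mask.bits;
}

/* True if the single flag in mask is set; false for an empty mask. */
static inline bool sketch_flags_test_single_mask(sketch_flags_t flags,
						 sketch_flags_t mask)
{
	/* Stands in for the check made only under CONFIG_DEBUG_VM. */
	assert(sketch_flags_count(mask) <= 1);
	return (flags.bits & mask.bits) != 0;
}
```

With SKETCH_HAVE_DROPPABLE undefined, the single-mask test against
SKETCH_DROPPABLE returns false while the all-mask test returns true, which
is why the latter is unsuitable for config-dependent flags.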
Link: https://lkml.kernel.org/r/cd778dd02b9f2a01eb54d25a49dea8ec2ddf7753.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Update the existing test logic to assert that vma_test(), vma_test_any()
and vma_test_any_mask() (implicitly tested via vma_test_any()) are
functioning correctly.
We already have tests for other variants like this, so it's simply a
matter of expanding those tests to also include tests for the VMA-specific
helpers.
Link: https://lkml.kernel.org/r/dea3e97c6c3dd86f1a3f1a0703241b03f6e3a33f.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce helper functions and macros to make it convenient to test flags
and flag masks for VMAs, specifically:
* vma_test() - determine if a single VMA flag is set in a VMA.
* vma_test_any_mask() - determine if any flags in a vma_flags_t value are
set in a VMA.
* vma_test_any() - Helper macro to test if any of specific flags are set.
Also, there is a mix of 'inline's and '__always_inline's in VMA helper
function declarations; update these to consistently use __always_inline.
Finally, update the VMA tests to reflect the changes.
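As a rough illustration of how the three helpers relate, the following
userland sketch models a VMA as a struct holding a single 64-bit flag word.
The sketch_ names and flag bit numbers are hypothetical, and the variadic
macro is only an assumption about how a "test any of these flags" helper
could be layered on a mask-based one:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for vma_flags_t and a VMA (assumptions). */
typedef struct { uint64_t bits; } sketch_flags_t;
struct sketch_vma { sketch_flags_t flags; };

/* Hypothetical flag bit numbers, for illustration only. */
enum { SKETCH_READ_BIT, SKETCH_WRITE_BIT, SKETCH_EXEC_BIT, SKETCH_SHARED_BIT };

/* Build a flag mask from an array of bit numbers. */
static inline sketch_flags_t sketch_mk_flags_arr(size_t count, const int *bits)
{
	sketch_flags_t flags = { 0 };
	size_t i;

	for (i = 0; i < count; i++)
		flags.bits |= UINT64_C(1) << bits[i];
	return flags;
}

#define sketch_mk_flags(...) \
	sketch_mk_flags_arr(sizeof((const int[]){ __VA_ARGS__ }) / sizeof(int), \
			    (const int[]){ __VA_ARGS__ })

/* Test a single flag, given by bit number. */
static inline bool sketch_vma_test(const struct sketch_vma *vma, int bit)
{
	return (vma->flags.bits >> bit) & 1;
}

/* Test whether any flag in mask is set. */
static inline bool sketch_vma_test_any_mask(const struct sketch_vma *vma,
					    sketch_flags_t mask)
{
	return (vma->flags.bits & mask.bits) != 0;
}

/* Convenience wrapper: list the flag bits directly. */
#define sketch_vma_test_any(vma, ...) \
	sketch_vma_test_any_mask(vma, sketch_mk_flags(__VA_ARGS__))

/* Small usage example. */
static inline void sketch_vma_example(void)
{
	struct sketch_vma vma;

	vma.flags = sketch_mk_flags(SKETCH_READ_BIT, SKETCH_SHARED_BIT);
	assert(sketch_vma_test(&vma, SKETCH_READ_BIT));
	assert(!sketch_vma_test(&vma, SKETCH_WRITE_BIT));
	assert(sketch_vma_test_any(&vma, SKETCH_WRITE_BIT, SKETCH_SHARED_BIT));
}
```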
Link: https://lkml.kernel.org/r/be1d71f08307d747a82232cbd8664a88c0f41419.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
While we are still converting VMA flags from vm_flags_t to vma_flags_t,
introduce helpers to convert between the two to allow for iterative
development without having to 'change the world' in a single commit.
Also update VMA flags tests to reflect the change.
Finally, refresh vma_flags_overwrite_word(),
vma_flag_overwrite_word_once(), vma_flags_set_word() and
vma_flags_clear_word() in the VMA tests to reflect current kernel
implementations - this should make no functional difference, but keeps the
logic consistent between the two.
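To illustrate the idea of converting between a single-word legacy type and
a wider bitmap type, here is a minimal userland sketch. It assumes the
legacy value corresponds to the first word of the bitmap; the sketch_ names
and the two-word size are hypothetical, and the kernel helpers may differ:

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_NUM_WORDS	2
#define SKETCH_BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

/* Stand-ins for vm_flags_t and vma_flags_t respectively (assumptions). */
typedef unsigned long sketch_legacy_flags_t;
typedef struct { unsigned long words[SKETCH_NUM_WORDS]; } sketch_flags_t;

/* Set one flag, given by bit number, in the wide representation. */
static inline void sketch_flags_set(sketch_flags_t *flags, unsigned int bit)
{
	flags->words[bit / SKETCH_BITS_PER_WORD] |=
		1UL << (bit % SKETCH_BITS_PER_WORD);
}

/* Wide to legacy: in this sketch only the first word is representable. */
static inline sketch_legacy_flags_t sketch_flags_to_legacy(sketch_flags_t flags)
{
	return flags.words[0];
}

/* Legacy to wide: first word is the legacy value, the rest are zero. */
static inline sketch_flags_t sketch_legacy_to_flags(sketch_legacy_flags_t legacy)
{
	sketch_flags_t flags = { { legacy } };

	return flags;
}

/* Small usage example: a legacy value survives a round trip. */
static inline void sketch_convert_example(void)
{
	sketch_flags_t flags = sketch_legacy_to_flags(0x5UL);

	assert(sketch_flags_to_legacy(flags) == 0x5UL);
}
```

In this model a flag set beyond the first word is not visible in the legacy
value, which is one reason such helpers suit a transition period rather
than long-term use.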
Link: https://lkml.kernel.org/r/d3569470dbb3ae79134ca7c3eb3fc4df7086e874.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add helpers to determine if two sets of VMA flags are precisely the same,
that is, every flag set in one is set in the other, and neither contains
any flags not set in the other.
We also introduce vma_flags_same_pair() for cases where we want to compare
two sets of VMA flags which are both non-const values.
Also update the VMA tests to reflect the change. We already implicitly
test that this functions correctly, having used it for testing purposes
previously.
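A minimal sketch of the equality check, assuming a two-word flag bitmap.
The sketch_ names are hypothetical, and the split between a by-value mask
form and a two-pointer pair form mirrors the naming described above rather
than the kernel source:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NUM_WORDS 2

/* Simplified stand-in for vma_flags_t (assumption). */
typedef struct { unsigned long words[SKETCH_NUM_WORDS]; } sketch_flags_t;

/* Compare a stored flag set against a by-value (typically constant) mask. */
static inline bool sketch_flags_same_mask(const sketch_flags_t *flags,
					  sketch_flags_t mask)
{
	size_t i;

	for (i = 0; i < SKETCH_NUM_WORDS; i++)
		if (flags->words[i] != mask.words[i])
			return false;
	return true;
}

/* Compare two flag sets that both live in variables. */
static inline bool sketch_flags_same_pair(const sketch_flags_t *a,
					  const sketch_flags_t *b)
{
	return sketch_flags_same_mask(a, *b);
}

/* Small usage example. */
static inline void sketch_same_example(void)
{
	sketch_flags_t a = { { 0x3UL, 0x1UL } };
	sketch_flags_t b = { { 0x3UL, 0x1UL } };

	assert(sketch_flags_same_pair(&a, &b));
}
```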
Link: https://lkml.kernel.org/r/4f764bf619e77205837c7c819b62139ef6337ca3.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a simple test for append_vma_flags() to assert that it behaves as
expected.
Additionally, include the VMA_REMAP_FLAGS definition in the VMA tests to
allow us to use this value in the testing.
Link: https://lkml.kernel.org/r/eebd946c5325ad7fae93027245a562eb1aeb68a2.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In order to be able to efficiently combine VMA flag masks with additional
VMA flag bits we need to extend the concept introduced in mk_vma_flags()
and __mk_vma_flags() by allowing the specification of a VMA flag mask to
append VMA flag bits to.
Update __mk_vma_flags() to allow for this and update mk_vma_flags()
accordingly, and also provide append_vma_flags() to allow for the caller
to specify which VMA flags mask to append to.
Finally, update the VMA flags tests to reflect the change.
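The relationship between building a mask from scratch and appending to an
existing one can be sketched in userland as follows. The sketch_ names, the
flag bit numbers and the single-word flag type are all hypothetical; only
the general shape (a builder taking a base mask plus a list of bits) is
taken from the description above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for vma_flags_t (assumption). */
typedef struct { uint64_t bits; } sketch_flags_t;

/* Hypothetical flag bit numbers, for illustration only. */
enum { SKETCH_READ_BIT, SKETCH_WRITE_BIT, SKETCH_EXEC_BIT, SKETCH_IO_BIT };

#define SKETCH_EMPTY_FLAGS ((sketch_flags_t){ 0 })

/* Copy base (passed by value), then set each listed bit in the copy. */
static inline sketch_flags_t sketch_mk_flags_from(sketch_flags_t base,
						  size_t count, const int *bits)
{
	size_t i;

	for (i = 0; i < count; i++)
		base.bits |= UINT64_C(1) << bits[i];
	return base;
}

#define SKETCH_COUNT(...) (sizeof((const int[]){ __VA_ARGS__ }) / sizeof(int))

/* Append the listed bits to a caller-supplied mask. */
#define sketch_append_flags(base, ...) \
	sketch_mk_flags_from(base, SKETCH_COUNT(__VA_ARGS__), \
			     (const int[]){ __VA_ARGS__ })

/* Building from scratch is appending to the empty mask. */
#define sketch_mk_flags(...) \
	sketch_append_flags(SKETCH_EMPTY_FLAGS, __VA_ARGS__)

/* Small usage example: the base mask is left unmodified. */
static inline void sketch_append_example(void)
{
	const sketch_flags_t rw = sketch_mk_flags(SKETCH_READ_BIT,
						  SKETCH_WRITE_BIT);
	const sketch_flags_t rwx = sketch_append_flags(rw, SKETCH_EXEC_BIT);

	assert(rw.bits == 0x3);
	assert(rwx.bits == 0x7);
}
```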
Link: https://lkml.kernel.org/r/9f928cd4688270002f2c0c3777fcc9b49cc7a8ea.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The VMA tests are incorrectly referencing NUM_VMA_FLAGS, which doesn't
exist; they should reference NUM_VMA_FLAG_BITS instead.
Additionally, remove the custom-written implementation of __mk_vma_flags(),
as this means we are not testing the code as present in the kernel.
Instead, add the actual __mk_vma_flags() to dup.h and add #ifdef's to
handle declarations differently depending on NUM_VMA_FLAG_BITS.
Link: https://lkml.kernel.org/r/b19c63af3d5efdfe712bf5d5f89368a5360a60f7.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use the new vma_flags_t flags implementation to perform the logic around
sticky flags and what flags are ignored on VMA merge.
We make use of the new vma_flags_empty(), vma_flags_diff_pair(), and
vma_flags_and_mask() functionality.
Also update the VMA tests accordingly.
Link: https://lkml.kernel.org/r/369574f06360ffa44707047e3b58eb4897345fba.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Convert the test code to utilise vma_flags_t as opposed to the deprecated
vm_flags_t as much as possible.
As part of this change, add VMA_STICKY_FLAGS and VMA_SPECIAL_FLAGS as
early versions of what these defines will look like in the kernel logic
once this logic is implemented.
Link: https://lkml.kernel.org/r/df90efe29300bd899989f695be4ae3adc901a828.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In order to utilise the new vma_flags_t type, we currently place it in
union with legacy vm_flags fields of type vm_flags_t to make the
transition smoother.
Add vma_flags_t union entries for mm->def_flags and vmg->vm_flags -
mm->def_vma_flags and vmg->vma_flags respectively.
Once the conversion is complete, these will be replaced with vma_flags_t
entries alone.
Also update the VMA tests to reflect the change.
Link: https://lkml.kernel.org/r/d507d542c089ba132e9da53f2ff7f80ca117c3b4.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add VMA unit tests to assert that:
* vma_flags_empty()
* vma_flags_diff_pair()
* vma_flags_and_mask()
* vma_flags_and()
All function as expected.
In addition to the added tests, in order to make testing easier, add
vma_flags_same_mask() and vma_flags_same() for testing only. If/when
these are required in kernel code, they can be moved over.
Also add ASSERT_FLAGS_[NOT_]SAME[_MASK](), ASSERT_FLAGS_[NON]EMPTY() test
helpers to make asserting flag state easier and more convenient.
Link: https://lkml.kernel.org/r/471ce7ceb1d32e5fc9c0660966b9eacdf899b4d1.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/vma: convert vm_flags_t to vma_flags_t in vma code", v4.
This series converts a lot of the existing use of the legacy vm_flags_t
data type to the new vma_flags_t type which replaces it.
In order to do so it adds a number of additional helpers:
* vma_flags_empty() - Determines whether a vma_flags_t value has no bits
set.
* vma_flags_and() - Performs a bitwise AND between two vma_flags_t values.
* vma_flags_diff_pair() - Determines which flags are not shared between a
pair of VMA flags (typically non-constant values)
* append_vma_flags() - Similar to mk_vma_flags(), but allows a vma_flags_t
value to be specified (typically a constant value), which is copied and
then extended with the additional flags specified to create a new
vma_flags_t value.
* vma_flags_same() - Determines if a vma_flags_t value is exactly equal to
a set of VMA flags.
* vma_flags_same_mask() - Determines if a vma_flags_t value is exactly equal
to another vma_flags_t value (typically constant).
* vma_flags_same_pair() - Determines if a pair of vma_flags_t values are
exactly equal to one another (typically both non-constant).
* vma_flags_to_legacy() - Converts a vma_flags_t value to a vm_flags_t
value, used to enable more iterative introduction of the use of
vma_flags_t.
* legacy_to_vma_flags() - Converts a vm_flags_t value to a vma_flags_t
value, for the same purpose.
* vma_flags_test_single_mask() - Tests whether a vma_flags_t value contains
the single flag specified in an input vma_flags_t flag mask. If that flag
mask is empty, it is defined to return false. Useful for
config-predicated VMA flag mask defines.
* vma_test() - Tests whether a VMA's flags contain a specific singular VMA
flag.
* vma_test_any() - Tests whether a VMA's flags contain any of a set of VMA
flags.
* vma_test_any_mask() - Tests whether a VMA's flags contain any of the
flags specified in another, typically constant, vma_flags_t value.
* vma_test_single_mask() - Tests whether a VMA's flags contain the single
flag specified in an input vma_flags_t flag mask, or if that flag mask is
empty, is defined to return false. Useful for config-predicated VMA flag
mask defines.
* vma_clear_flags() - Clears a specific set of VMA flags from a vma_flags_t
value.
* vma_clear_flags_mask() - Clears those flags set in a vma_flags_t value
(typically constant) from a (typically not constant) vma_flags_t value.
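As a rough userland illustration of a few of the helpers listed above, the
following sketch models a flag set as a small struct-wrapped bitmap. All
demo_* names, the two-word bitmap size and the exact semantics shown are
assumptions made for illustration; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for vma_flags_t: a struct-wrapped bitmap. */
#define DEMO_NUM_WORDS 2
#define DEMO_BITS_PER_WORD (8 * sizeof(unsigned long))

typedef struct {
	unsigned long bits[DEMO_NUM_WORDS];
} demo_vma_flags_t;

/* Set one flag bit, identified by its bit number. */
static inline void demo_flags_set_bit(demo_vma_flags_t *flags, unsigned int bit)
{
	flags->bits[bit / DEMO_BITS_PER_WORD] |= 1UL << (bit % DEMO_BITS_PER_WORD);
}

/* True if both flag sets hold exactly the same bits. */
static inline bool demo_flags_same_pair(const demo_vma_flags_t *a,
					const demo_vma_flags_t *b)
{
	return memcmp(a->bits, b->bits, sizeof(a->bits)) == 0;
}

/*
 * True if every bit in mask is set in flags. An empty mask is defined to
 * return false, which suits masks that compile down to nothing when a
 * config option is off. This sketch does not verify that the mask holds
 * only one bit.
 */
static inline bool demo_flags_test_single_mask(const demo_vma_flags_t *flags,
					       demo_vma_flags_t mask)
{
	bool mask_empty = true;

	for (int i = 0; i < DEMO_NUM_WORDS; i++) {
		if (mask.bits[i])
			mask_empty = false;
		if ((flags->bits[i] & mask.bits[i]) != mask.bits[i])
			return false;
	}
	return !mask_empty;
}

/* Legacy conversion: in this sketch the legacy value is simply word 0. */
static inline unsigned long demo_flags_to_legacy(demo_vma_flags_t flags)
{
	return flags.bits[0];
}

static inline demo_vma_flags_t demo_legacy_to_flags(unsigned long legacy)
{
	demo_vma_flags_t flags = { { legacy } };

	return flags;
}

/* Small usage check exercising the helpers above. */
static inline void demo_flags_selfcheck(void)
{
	demo_vma_flags_t a = { { 0 } };
	demo_vma_flags_t none = { { 0 } };
	demo_vma_flags_t b;

	demo_flags_set_bit(&a, 1);
	b = demo_legacy_to_flags(1UL << 1);
	assert(demo_flags_same_pair(&a, &b));
	assert(demo_flags_test_single_mask(&a, b));
	assert(!demo_flags_test_single_mask(&a, none));
	assert(demo_flags_to_legacy(a) == 2UL);
}
```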
The series mostly focuses on the VMA-specific code, especially that
contained in mm/vma.c and mm/vma.h.
It updates both brk() and mmap() logic to utilise vma_flags_t values as much
as is practically possible at this point, changing surrounding logic to be
able to do so.
It also updates the vma_modify_xxx() functions where they interact with VMA
flags directly to use vma_flags_t values where possible.
There is extensive testing added in the VMA userland tests to assert that
all of these new VMA flag functions work correctly.
This patch (of 25):
Firstly, add the ability to determine if VMA flags are empty, that is no
flags are set in a vma_flags_t value.
Next, add the ability to obtain the equivalent of the bitwise AND of two
vma_flags_t values, via vma_flags_and_mask().
Next, add the ability to obtain the difference between two sets of VMA
flags, that is the equivalent to the exclusive bitwise OR of the two sets
of flags, via vma_flags_diff_pair().
vma_flags_xxx_mask() typically operates on a pointer to a vma_flags_t
value, which is assumed to be an lvalue of some kind (such as a field in a
struct or a stack variable) and an rvalue of some kind (typically a
constant set of VMA flags obtained e.g. via mk_vma_flags() or
equivalent).
However vma_flags_diff_pair() is intended to operate on two lvalues, so
use the _pair() suffix to make this clear.
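The three operations described above (empty test, bitwise AND, and XOR-based
difference) can be sketched in userland as follows. The demo_* names and the
two-word bitmap are hypothetical and only mirror the calling conventions
described in the text: a pointer for the lvalue, a by-value mask for the
rvalue, and two pointers for the _pair() form.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for vma_flags_t. */
#define DEMO_NUM_WORDS 2

typedef struct {
	unsigned long bits[DEMO_NUM_WORDS];
} demo_vma_flags_t;

/* True if no bit at all is set. */
static inline bool demo_flags_empty(const demo_vma_flags_t *flags)
{
	for (int i = 0; i < DEMO_NUM_WORDS; i++)
		if (flags->bits[i])
			return false;
	return true;
}

/* _mask() style: pointer to an lvalue plus a by-value (rvalue) mask. */
static inline demo_vma_flags_t demo_flags_and_mask(const demo_vma_flags_t *flags,
						   demo_vma_flags_t mask)
{
	demo_vma_flags_t out;

	for (int i = 0; i < DEMO_NUM_WORDS; i++)
		out.bits[i] = flags->bits[i] & mask.bits[i];
	return out;
}

/* _pair() style: two lvalues; the result holds the bits they do not share. */
static inline demo_vma_flags_t demo_flags_diff_pair(const demo_vma_flags_t *a,
						    const demo_vma_flags_t *b)
{
	demo_vma_flags_t out;

	for (int i = 0; i < DEMO_NUM_WORDS; i++)
		out.bits[i] = a->bits[i] ^ b->bits[i];
	return out;
}

/* Small usage check: bits 0,1 versus bits 1,2. */
static inline void demo_diff_selfcheck(void)
{
	demo_vma_flags_t a = { { 0x3UL } };
	demo_vma_flags_t b = { { 0x6UL } };
	demo_vma_flags_t both = demo_flags_and_mask(&a, b);
	demo_vma_flags_t diff = demo_flags_diff_pair(&a, &b);
	demo_vma_flags_t self = demo_flags_diff_pair(&a, &a);

	assert(both.bits[0] == 0x2UL);	/* only bit 1 is shared */
	assert(diff.bits[0] == 0x5UL);	/* bits 0 and 2 differ */
	assert(demo_flags_empty(&self));	/* nothing differs from itself */
	assert(!demo_flags_empty(&diff));
}
```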
Finally, update VMA userland tests to add these helpers.
We also port bitmap_xor() and __bitmap_xor() to the tools/ headers and
source to allow the tests to work with vma_flags_diff_pair().
Link: https://lkml.kernel.org/r/cover.1774034900.git.ljs@kernel.org
Link: https://lkml.kernel.org/r/53ab55b7da91425775e42c03177498ad6de88ef4.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The added folio_split_race_test is a modified C port of the race condition
test from [1]. The test creates shmem huge pages, where the main thread
punches holes in the shmem to cause folio_split() in the kernel and a set
of 16 threads reads the shmem to cause filemap_get_entry() in the kernel.
filemap_get_entry() reads the folio and xarray split by folio_split()
locklessly. The original test[2] is written in Rust and uses memfd (shmem
backed). This C port uses shmem directly and uses a single process.
Note: the initial Rust-to-C conversion was done by Cursor.
Link: https://lore.kernel.org/all/CAKNNEtw5_kZomhkugedKMPOG-sxs5Q5OLumWJdiWXv+C9Yct0w@mail.gmail.com/ [1]
Link: https://github.com/dfinity/thp-madv-remove-test [2]
Link: https://lkml.kernel.org/r/20260323163717.184107-1-ziy@nvidia.com
Co-developed-by: Bas van Dijk <bas@dfinity.org>
Signed-off-by: Bas van Dijk <bas@dfinity.org>
Co-developed-by: Adam Bratschi-Kaye <adam.bratschikaye@dfinity.org>
Signed-off-by: Adam Bratschi-Kaye <adam.bratschikaye@dfinity.org>
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Concurrent reads and writes of sysctl_max_map_count are possible, so we
should READ_ONCE() and WRITE_ONCE().
The sysctl procfs logic already enforces WRITE_ONCE(), so abstract the
read side with get_sysctl_max_map_count().
While we're here, also move the field to mm/internal.h and add the getter
there since only mm interacts with it, there's no need for anybody else to
have access.
Finally, update the VMA userland tests to reflect the change.
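A minimal userland sketch of the accessor pattern described above follows.
The DEMO_* macros are simplified stand-ins for READ_ONCE() and WRITE_ONCE()
restricted to int, and the variable, its initial value and the function
names are hypothetical; the point is only that readers go through a single
getter which performs one volatile load.

```c
#include <assert.h>

/* Simplified int-only stand-ins: force exactly one volatile access. */
#define DEMO_READ_ONCE_INT(x)		(*(const volatile int *)&(x))
#define DEMO_WRITE_ONCE_INT(x, val)	(*(volatile int *)&(x) = (val))

/* Hypothetical file-local tunable; the initial value is illustrative. */
static int demo_max_map_count = 65530;

/* Read side: every reader uses this getter rather than the variable. */
static inline int demo_get_max_map_count(void)
{
	return DEMO_READ_ONCE_INT(demo_max_map_count);
}

/* Write side: stands in for the sysctl handler's store. */
static inline void demo_set_max_map_count(int val)
{
	DEMO_WRITE_ONCE_INT(demo_max_map_count, val);
}

/* Example caller: take one snapshot and use it for the whole check. */
static inline int demo_map_count_ok(int current_count)
{
	int limit = demo_get_max_map_count();

	return current_count < limit;
}

/* Small usage check. */
static inline void demo_once_selfcheck(void)
{
	demo_set_max_map_count(10);
	assert(demo_get_max_map_count() == 10);
	assert(demo_map_count_ok(9));
	assert(!demo_map_count_ok(10));
}
```

Keeping the variable static to one file and exposing only the getter is the
userland analogue of moving the field into an internal header.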
Link: https://lkml.kernel.org/r/0715259eb37cbdfde4f9e5db92a20ec7110a1ce5.1773249037.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Jann Horn <jannh@google.com>
Cc: Jianzhou Zhao <luckd0g@163.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The test_memcg_sock test in memcontrol.c sets up an IPv6 socket and sends
data over it to consume memory and verify that memory.stat.sock and
memory.current values are close.
On systems where IPv6 isn't enabled or isn't configured to support
SOCK_STREAM, the test_memcg_sock test always fails. When the socket()
call fails, there is no way we can test the memory consumption and verify
the above claim. I believe it is better to just skip the test in this
case instead of reporting a test failure hinting that there may be
something wrong with the memcg code.
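The skip-instead-of-fail idea can be sketched as below. This is not the
selftest's code: demo_sock_test() is a hypothetical function, and the
DEMO_KSFT_* constants are local definitions chosen to mirror the kselftest
pass and skip exit codes.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Local constants mirroring the kselftest pass/skip exit codes. */
#define DEMO_KSFT_PASS 0
#define DEMO_KSFT_SKIP 4

/*
 * Hypothetical test body: if the IPv6 stream socket cannot be created,
 * the precondition for the test is missing, so report a skip rather
 * than a failure of the code under test.
 */
static int demo_sock_test(void)
{
	int fd = socket(AF_INET6, SOCK_STREAM, 0);

	if (fd < 0)
		return DEMO_KSFT_SKIP;

	/* A real test would send data and compare counters here. */
	close(fd);
	return DEMO_KSFT_PASS;
}

/* Small usage check: either outcome is acceptable, a failure is not. */
static inline void demo_sock_selfcheck(void)
{
	int ret = demo_sock_test();

	assert(ret == DEMO_KSFT_PASS || ret == DEMO_KSFT_SKIP);
}
```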
Link: https://lkml.kernel.org/r/20260311200526.885899-1-longman@redhat.com
Fixes: 5f8f019380 ("selftests: cgroup/memcontrol: add basic test for socket accounting")
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Extend the near-full DAMON parameters commit selftest to commit goal_tuner
and confirm the internal status is updated as expected.
Link: https://lkml.kernel.org/r/20260310010529.91162-12-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Update drgn_dump_damon_status.py, which is being used to dump the
in-kernel DAMON status for tests, to dump goal_tuner setup status.
Link: https://lkml.kernel.org/r/20260310010529.91162-11-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add support of goal_tuner setup to the test-purpose DAMON sysfs interface
control helper, _damon_sysfs.py.
Link: https://lkml.kernel.org/r/20260310010529.91162-10-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Droppable mappings must not be lockable. There is a check for VMAs with
VM_DROPPABLE set in mlock_fixup() along with checks for other types of
unlockable VMAs which ensures this when calling mlock()/mlock2().
For mlockall(MCL_FUTURE), the check for unlockable VMAs is different. In
apply_mlockall_flags(), if the flags parameter has MCL_FUTURE set, the
current task's mm's default VMA flag field mm->def_flags has VM_LOCKED
applied to it. VM_LOCKONFAULT is also applied if MCL_ONFAULT is also set.
When these flags are set as default in this manner they are cleared in
__mmap_complete() for new mappings that do not support mlock. A check for
VM_DROPPABLE in __mmap_complete() is missing, resulting in droppable
mappings created with VM_LOCKED set. To fix this and reduce the chance
of similar bugs in the future, introduce and use vma_supports_mlock().
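The shape of the fix can be illustrated with a small userland model. The
flag values, the set of unlockable types and the demo_* functions below are
assumptions made for the sketch and do not reproduce the kernel's
definitions; the point is that one shared predicate decides lockability, so
a newly added unlockable type only needs to be listed once.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for illustration only. */
#define DEMO_VM_LOCKED		(1UL << 0)
#define DEMO_VM_LOCKONFAULT	(1UL << 1)
#define DEMO_VM_SPECIAL		(1UL << 2)
#define DEMO_VM_HUGETLB		(1UL << 3)
#define DEMO_VM_DROPPABLE	(1UL << 4)
#define DEMO_VM_LOCKED_MASK	(DEMO_VM_LOCKED | DEMO_VM_LOCKONFAULT)

/* Single predicate shared by every path that may lock a mapping. */
static inline bool demo_supports_mlock(unsigned long vm_flags)
{
	return !(vm_flags & (DEMO_VM_SPECIAL | DEMO_VM_HUGETLB |
			     DEMO_VM_DROPPABLE));
}

/*
 * Model of creating a mapping after mlockall(MCL_FUTURE): the default
 * flags are applied first, then the lock bits are dropped again if the
 * mapping type cannot be locked.
 */
static inline unsigned long demo_new_mapping_flags(unsigned long vm_flags,
						   unsigned long def_flags)
{
	vm_flags |= def_flags;
	if (!demo_supports_mlock(vm_flags))
		vm_flags &= ~DEMO_VM_LOCKED_MASK;
	return vm_flags;
}

/* Small usage check. */
static inline void demo_mlock_selfcheck(void)
{
	/* An ordinary mapping keeps the default lock flag. */
	assert(demo_new_mapping_flags(0, DEMO_VM_LOCKED) == DEMO_VM_LOCKED);
	/* A droppable mapping must end up unlocked. */
	assert(demo_new_mapping_flags(DEMO_VM_DROPPABLE, DEMO_VM_LOCKED_MASK) ==
	       DEMO_VM_DROPPABLE);
}
```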
Link: https://lkml.kernel.org/r/20260310155821.17869-1-anthony.yznaga@oracle.com
Fixes: 9651fcedf7 ("mm: add MAP_DROPPABLE for designating always lazily freeable mappings")
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Suggested-by: David Hildenbrand <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Tested-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
vma_mmu_pagesize() is also queried on non-hugetlb VMAs and does not really
belong in hugetlb.c.
PPC64 provides a custom override with CONFIG_HUGETLB_PAGE, see
arch/powerpc/mm/book3s64/slice.c, so we cannot easily make this a static
inline function.
So let's move it to vma.c and add some proper kerneldoc.
To make vma tests happy, add a simple vma_kernel_pagesize() stub in
tools/testing/vma/include/custom.h.
Link: https://lkml.kernel.org/r/20260309151901.123947-3-david@kernel.org
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
CONFIG_DAMON_DEBUG_SANITY is recommended for DAMON development and test
setups. Enable it on the build config for DAMON selftests.
Link: https://lkml.kernel.org/r/20260306152914.86303-11-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now we have helpers which test singular VMA flags - vma_flags_test() and
vma_desc_test() - add a test to explicitly assert that these behave as
expected.
[ljs@kernel.org: test_vma_flags_test(): use struct initializer, per David]
Link: https://lkml.kernel.org/r/f6f396d2-1ba2-426f-b756-d8cc5985cc7c@lucifer.local
Link: https://lkml.kernel.org/r/376a39eb9e134d2c8ab10e32720dd292970b080a.1772704455.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Babu Moger <babu.moger@amd.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chatre, Reinette <reinette.chatre@intel.com>
Cc: Chunhai Guo <guochunhai@vivo.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Martin <dave.martin@arm.com>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hongbo Li <lihongbo22@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Morse <james.morse@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Johannes Thumshirn <jth@kernel.org>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naohiro Aota <naohiro.aota@wdc.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Sandeep Dhavale <dhavale@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Yue Hu <zbestahu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Similar to vma_flags_test(), we have previously renamed vma_desc_test() to
vma_desc_test_any(). Now that is in place, we can reintroduce
vma_desc_test() to explicitly check for a single VMA flag.
As with vma_flags_test(), this is useful as often flag tests are against a
single flag, and vma_desc_test_any(flags, VMA_READ_BIT) reads oddly and
potentially causes confusion.
As with vma_flags_test() a combination of sparse and vma_flags_t being a
struct means that users cannot misuse this function without it getting
flagged.
Also update the VMA tests to reflect this change.
Link: https://lkml.kernel.org/r/3a65ca23defb05060333f0586428fe279a484564.1772704455.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Babu Moger <babu.moger@amd.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chatre, Reinette <reinette.chatre@intel.com>
Cc: Chunhai Guo <guochunhai@vivo.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Martin <dave.martin@arm.com>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hongbo Li <lihongbo22@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Morse <james.morse@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Johannes Thumshirn <jth@kernel.org>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naohiro Aota <naohiro.aota@wdc.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Sandeep Dhavale <dhavale@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Yue Hu <zbestahu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since we've now renamed vma_flags_test() to vma_flags_test_any() to be
very clear as to what we are in fact testing, we now have the opportunity
to bring vma_flags_test() back, but for explicitly testing a single VMA
flag.
This is useful, as often flag tests are against a single flag, and
vma_flags_test_any(flags, VMA_READ_BIT) reads oddly and potentially causes
confusion.
We use sparse to enforce that users won't accidentally pass vm_flags_t to
this function without it being flagged so this should make it harder to
get this wrong.
Of course, passing vma_flags_t to the function is impossible, as it is a
struct.
Also update the VMA tests to reflect this change.
Link: https://lkml.kernel.org/r/f33f8d7f16c3f3d286a1dc2cba12c23683073134.1772704455.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Babu Moger <babu.moger@amd.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chatre, Reinette <reinette.chatre@intel.com>
Cc: Chunhai Guo <guochunhai@vivo.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Martin <dave.martin@arm.com>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hongbo Li <lihongbo22@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Morse <james.morse@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Johannes Thumshirn <jth@kernel.org>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naohiro Aota <naohiro.aota@wdc.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Sandeep Dhavale <dhavale@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Yue Hu <zbestahu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Be explicit about __mk_vma_flags() (which is used by the mk_vma_flags()
macro) always being inline, as we rely on the compiler to evaluate the
loop in this function and determine that it can replace the code with
an equivalent constant value, e.g. that:
__mk_vma_flags(2, (const vma_flag_t []){ VMA_WRITE_BIT, VMA_EXEC_BIT });
Can be replaced with:
(1UL << VMA_WRITE_BIT) | (1UL << VMA_EXEC_BIT)
= (1UL << 1) | (1UL << 2) = 6
Most likely an 'inline' will suffice for this, but be as explicit as we
can be.
Also update all of the functions __mk_vma_flags() ultimately invokes to be
always inline too.
Note that test_bitmap_const_eval() asserts that the relevant bitmap
functions result in build time constant values.
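The constant-folding expectation can be illustrated with a userland sketch.
The demo_* names are hypothetical, the bit numbers are taken from the
arithmetic in the text above, and the always-inline attribute shown is the
GCC/Clang spelling; whether the loop is actually folded depends on the
compiler and optimisation level.

```c
#include <assert.h>
#include <stddef.h>

/* Bit numbers matching the worked arithmetic in the text above. */
enum { DEMO_WRITE_BIT = 1, DEMO_EXEC_BIT = 2 };

/* GCC/Clang spelling; used so the loop is always visible to the caller. */
#define demo_always_inline inline __attribute__((__always_inline__))

/* Build a flag word from an array of bit numbers. */
static demo_always_inline unsigned long demo_mk_flags(size_t count,
						      const int *bits)
{
	unsigned long flags = 0;

	for (size_t i = 0; i < count; i++)
		flags |= 1UL << bits[i];
	return flags;
}

/* Variadic wrapper: counts its arguments and passes a compound literal. */
#define DEMO_MK_FLAGS(...)						\
	demo_mk_flags(sizeof((const int[]){ __VA_ARGS__ }) / sizeof(int), \
		      (const int[]){ __VA_ARGS__ })

/* Small usage check: (1UL << 1) | (1UL << 2) is 6. */
static inline void demo_mk_flags_selfcheck(void)
{
	unsigned long flags = DEMO_MK_FLAGS(DEMO_WRITE_BIT, DEMO_EXEC_BIT);

	assert(flags == 6UL);
}
```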
Additionally, vma_flag_set() operates on a vma_flags_t type, so it is
inconsistently named versus other VMA flags functions.
We only use vma_flag_set() in __mk_vma_flags() so we don't need to worry
about its new name being rather cumbersome, so rename it to
vma_flags_set_flag() to disambiguate it from vma_flags_set().
Also update the VMA test headers to reflect the changes.
Link: https://lkml.kernel.org/r/241f49c52074d436edbb9c6a6662a8dc142a8f43.1772704455.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Babu Moger <babu.moger@amd.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chatre, Reinette <reinette.chatre@intel.com>
Cc: Chunhai Guo <guochunhai@vivo.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Martin <dave.martin@arm.com>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hongbo Li <lihongbo22@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Morse <james.morse@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Johannes Thumshirn <jth@kernel.org>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naohiro Aota <naohiro.aota@wdc.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Sandeep Dhavale <dhavale@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Yue Hu <zbestahu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
erofs and zonefs are using vma_desc_test_any() twice to check whether both
VMA_SHARED_BIT and VMA_MAYWRITE_BIT are set. This is silly, so add
vma_desc_test_all() to test all flags and update erofs and zonefs to use
it.
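The difference between the two forms can be sketched in userland as follows.
The demo_* names, the single-word flag struct and the bit numbers are
hypothetical and chosen only to show why one "all" test replaces two "any"
tests joined with &&.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit numbers, for illustration only. */
enum { DEMO_SHARED_BIT = 3, DEMO_MAYWRITE_BIT = 5 };

/* Hypothetical single-word flag set wrapped in a struct. */
typedef struct {
	unsigned long bits;
} demo_flags_t;

/* Build a mask from a list of bit numbers. */
static inline unsigned long demo_mask_of(size_t count, const int *bits)
{
	unsigned long mask = 0;

	for (size_t i = 0; i < count; i++)
		mask |= 1UL << bits[i];
	return mask;
}

#define DEMO_MASK(...)							\
	demo_mask_of(sizeof((const int[]){ __VA_ARGS__ }) / sizeof(int), \
		     (const int[]){ __VA_ARGS__ })

/* True if at least one bit of mask is set. */
static inline bool demo_test_any_mask(demo_flags_t flags, unsigned long mask)
{
	return (flags.bits & mask) != 0;
}

/* True only if every bit of mask is set. */
static inline bool demo_test_all_mask(demo_flags_t flags, unsigned long mask)
{
	return (flags.bits & mask) == mask;
}

#define demo_test_any(flags, ...) demo_test_any_mask(flags, DEMO_MASK(__VA_ARGS__))
#define demo_test_all(flags, ...) demo_test_all_mask(flags, DEMO_MASK(__VA_ARGS__))

/* Small usage check. */
static inline void demo_any_all_selfcheck(void)
{
	demo_flags_t shared_only = { 1UL << DEMO_SHARED_BIT };
	demo_flags_t both = { (1UL << DEMO_SHARED_BIT) |
			      (1UL << DEMO_MAYWRITE_BIT) };
	bool any_one = demo_test_any(shared_only, DEMO_SHARED_BIT, DEMO_MAYWRITE_BIT);
	bool all_one = demo_test_all(shared_only, DEMO_SHARED_BIT, DEMO_MAYWRITE_BIT);
	bool all_two = demo_test_all(both, DEMO_SHARED_BIT, DEMO_MAYWRITE_BIT);

	assert(any_one);	/* one of the two bits is enough for "any" */
	assert(!all_one);	/* but not for "all" */
	assert(all_two);
}
```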
While we're here, update the helper function comments to be more
consistent.
Also add the same to the VMA test headers.
Link: https://lkml.kernel.org/r/568c8f8d6a84ff64014f997517cba7a629f7eed6.1772704455.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Babu Moger <babu.moger@amd.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chatre, Reinette <reinette.chatre@intel.com>
Cc: Chunhai Guo <guochunhai@vivo.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Martin <dave.martin@arm.com>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hongbo Li <lihongbo22@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Morse <james.morse@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Johannes Thumshirn <jth@kernel.org>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naohiro Aota <naohiro.aota@wdc.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Sandeep Dhavale <dhavale@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Yue Hu <zbestahu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: vma flag tweaks".
The ongoing work around introducing non-system word VMA flags has
introduced a number of helper functions and macros to make life easier
when working with these flags and to make conversions from the legacy use
of VM_xxx flags more straightforward.
This series improves these to reduce confusion as to what they do and to
improve consistency and readability.
Firstly the series renames vma_flags_test() to vma_flags_test_any() to
make it abundantly clear that this function tests whether any of the flags
are set (as opposed to vma_flags_test_all()).
It then renames vma_desc_test_flags() to vma_desc_test_any() for the same
reason. Note that we drop the 'flags' suffix here, as
vma_desc_test_any_flags() would be cumbersome and 'test' implies a flag
test.
Similarly, we rename vma_test_all_flags() to vma_test_all() for
consistency.
Next, we have a couple of instances (erofs, zonefs) where we are now
testing for vma_desc_test_any(desc, VMA_SHARED_BIT) &&
vma_desc_test_any(desc, VMA_MAYWRITE_BIT).
This is silly, so this series introduces vma_desc_test_all() so these
callers can instead invoke vma_desc_test_all(desc, VMA_SHARED_BIT,
VMA_MAYWRITE_BIT).
We then observe that quite a few instances of vma_flags_test_any() and
vma_desc_test_any() are in fact only testing against a single flag.
Using the _any() variant here is just confusing - 'any' of a single item
reads strangely and is liable to cause confusion.
So in these instances the series reintroduces vma_flags_test() and
vma_desc_test() as helpers which test against a single flag.
The fact that vma_flags_t is a struct and that vma_flag_t utilises sparse
to avoid confusion with vm_flags_t makes it impossible for a user to
misuse these helpers without it getting flagged somewhere.
The series also updates __mk_vma_flags() and functions invoked by it to
explicitly mark them always inline to match expectation and to be
consistent with other VMA flag helpers.
It also renames vma_flag_set() to vma_flags_set_flag() (a function only
used by __mk_vma_flags()) to be consistent with other VMA flag helpers.
Finally it updates the VMA tests for each of these changes, and introduces
explicit tests for vma_flags_test() and vma_desc_test() to assert that
they behave as expected.
This patch (of 6):
On reflection, it's confusing to have vma_flags_test() and
vma_desc_test_flags() test whether any comma-separated VMA flag bit is
set, while also having vma_flags_test_all() and vma_test_all_flags()
separately test whether all flags are set.
Firstly, rename vma_flags_test() to vma_flags_test_any() to eliminate this
confusion.
Secondly, since the VMA descriptor flag functions are becoming rather
cumbersome, prefer vma_desc_test*() to vma_desc_test_flags*(), and also
rename vma_desc_test_flags() to vma_desc_test_any().
Finally, rename vma_test_all_flags() to vma_test_all() to keep the
VMA-specific helper consistent with the VMA descriptor naming convention
and to help avoid confusion vs. vma_flags_test_all().
While we're here, also update whitespace to be consistent in helper
functions.
Link: https://lkml.kernel.org/r/cover.1772704455.git.ljs@kernel.org
Link: https://lkml.kernel.org/r/0f9cb3c511c478344fac0b3b3b0300bb95be95e9.1772704455.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Suggested-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Babu Moger <babu.moger@amd.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chatre, Reinette <reinette.chatre@intel.com>
Cc: Chunhai Guo <guochunhai@vivo.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Martin <dave.martin@arm.com>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hongbo Li <lihongbo22@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Morse <james.morse@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Johannes Thumshirn <jth@kernel.org>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naohiro Aota <naohiro.aota@wdc.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Sandeep Dhavale <dhavale@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Yue Hu <zbestahu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Eliminate the `kho_finalize()` function and its associated state from the
KHO subsystem. The transition to a radix tree for memory tracking makes
the explicit "finalize" state and its serialization step obsolete.
Remove the `kho_finalize()` and `kho_finalized()` APIs and their stub
implementations. Update KHO client code and the debugfs interface to no
longer call or depend on the `kho_finalize()` mechanism.
Complete the move towards a stateless KHO, simplifying the overall design
by removing unnecessary state management.
Link: https://lkml.kernel.org/r/20260206021428.3386442-3-jasonmiu@google.com
Signed-off-by: Jason Miu <jasonmiu@google.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove duplicate inclusion of unistd.h in memory-failure.c to clean up
redundant code.
Link: https://lkml.kernel.org/r/20260211064311.2981726-1-nichen@iscas.ac.cn
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add test_zswap_incompressible() to verify that the zswap_incomp memcg stat
correctly tracks incompressible pages.
The test allocates memory filled with random data from /dev/urandom, which
cannot be effectively compressed by zswap. When this data is swapped out
to zswap, it should be stored as-is and tracked by the zswap_incomp
counter.
The test verifies that:
1. Pages are swapped out to zswap (zswpout increases)
2. Incompressible pages are tracked (zswap_incomp increases)
test:
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo Y > /sys/module/zswap/parameters/enabled
./test_zswap
TAP version 13
1..8
ok 1 test_zswap_usage
ok 2 test_swapin_nozswap
ok 3 test_zswapin
ok 4 test_zswap_writeback_enabled
ok 5 test_zswap_writeback_disabled
ok 6 test_no_kmem_bypass
ok 7 test_no_invasive_cgroup_shrink
ok 8 test_zswap_incompressible
Totals: pass:8 fail:0 xfail:0 xpass:0 skip:0 error:0
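The data-preparation step described above (filling memory with bytes from
/dev/urandom so that it does not compress well) might look roughly like the
sketch below. demo_fill_random() is a hypothetical helper, not the
selftest's code, and it treats any short or failed read as an error rather
than retrying on EINTR.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: fill buf with len random bytes read from
 * /dev/urandom. Returns 0 on success and -1 on any error.
 */
static int demo_fill_random(unsigned char *buf, size_t len)
{
	size_t done = 0;
	int fd = open("/dev/urandom", O_RDONLY);

	if (fd < 0)
		return -1;

	while (done < len) {
		ssize_t n = read(fd, buf + done, len - done);

		if (n <= 0) {
			close(fd);
			return -1;
		}
		done += (size_t)n;
	}

	close(fd);
	return 0;
}

/* Small usage check: one page worth of random bytes. */
static inline void demo_fill_selfcheck(void)
{
	unsigned char page[4096];

	assert(demo_fill_random(page, sizeof(page)) == 0);
}
```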
Link: https://lkml.kernel.org/r/20260213071827.5688-3-jiayuan.chen@linux.dev
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, the migration test asserts that numa_available() returns 0. On
systems where NUMA is not available (returning -1), such as certain ARM64
configurations or single-node systems, this assertion fails and crashes
the test.
Update the test to check the return value of numa_available(). If it is
less than 0, skip the test gracefully instead of failing.
This aligns the behavior with other MM selftests (like rmap) that skip
when NUMA support is missing.
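A minimal sketch of the skip decision described above. The helper name
numa_check() and the SKETCH_* constants are hypothetical; the value 4 mirrors
the kselftest skip exit code, and the caller is assumed to pass in whatever
numa_available() returned.

```c
#include <stdio.h>

/* Hypothetical exit codes; 4 mirrors the kselftest "skip" convention. */
enum { SKETCH_PASS = 0, SKETCH_SKIP = 4 };

/*
 * Decide what to do with the value returned by numa_available().
 * A negative value means NUMA is not usable, so skip instead of asserting.
 */
static int numa_check(int numa_ret)
{
	if (numa_ret < 0) {
		printf("# SKIP NUMA not available\n");
		return SKETCH_SKIP;
	}
	return SKETCH_PASS;
}
```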
Link: https://lkml.kernel.org/r/20260218163941.13499-1-anishm7030@gmail.com
Fixes: 0c2d087284 ("mm: add selftests for migration entries")
Signed-off-by: AnishMulay <anishm7030@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Tested-by: Sayali Patil <sayalip@linux.ibm.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Instead of using the maple big node, use the maple copy node to reduce
stack usage and to align with mas_wr_rebalance() and
mas_wr_spanning_store().
Splitting a node is similar to rebalancing, but a new evaluation of when
to ascend is needed. The only other difference is that the data is pushed
and never rebalanced at each level.
The testing must also align with the changes to this commit to ensure the
test suite continues to pass.
Link: https://lkml.kernel.org/r/20260130205935.2559335-27-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
During the big node removal, an incorrect rebalance step went too far up
the tree causing insufficient nodes. Test the faulty condition by
recreating the scenario in the userspace testing.
Link: https://lkml.kernel.org/r/20260130205935.2559335-24-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stop using the maple subtree state and big node in favour of using three
destinations in the maple copy node. That is, expand the way leaves were
handled to all levels of the tree and use the maple copy node to track the
new nodes.
Extract out the sibling init into the data calculation since this is where
the insufficient data can be detected. The remainder of the sibling code
to shift the next iteration is moved to the spanning_ascend() function,
since it is not always needed.
Next introduce the dst_setup() function which will decide how many nodes
are needed to contain the data at this level. Using the destination
count, populate the copy node's dst array with the new nodes and set
d_count to the correct value. Note that this can be tricky in the case of
a leaf node with exactly enough room because of the rule against NULLs at
the end of leaves.
Once the destinations are ready, copy the data by altering the
cp_data_write() function to copy from the sources to the destinations
directly. This eliminates the use of the big node in this code path. On
node completion, node_finalise() will zero out the remaining area and set
the metadata, if necessary.
spanning_ascend() is used to decide if the operation is complete. It may
create a new root, converge into one destination, or continue upwards by
ascending the left and right write maple states.
One test case setup needed to be tweaked so that the targeted node was
surrounded by full nodes.
[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20260130205935.2559335-18-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Spanning store had some corner cases which showed up during rcu stress
testing. Add explicit tests for those cases.
At the same time add some locking for easier visibility of the rcu stress
testing. Only a single dump of the tree will happen on the first detected
issue instead of flooding the console with output.
Link: https://lkml.kernel.org/r/20260130205935.2559335-13-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andrew Ballance <andrewjballance@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This version includes the following changes:
- Setting current base frequency as maximum for SST-BF with
kernel QOS changes
- Harmonize extended family decoding with the rest of the kernel
- Minor changes for error codes and messages
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
When running intel-speed-select on unsupported CLX platforms, it prints
intel-speed-select: Invalid CPU model (85)
: Success
This happens because the failure is not a system error, so errno is not
set and err() appends the description of errno 0.
Replace err() with exit().
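The difference can be sketched as follows. Both helpers are hypothetical and
only format into a buffer; the first approximates what err() appends to the
message, the second is the plain diagnostic one would print before calling
exit(). The exact text returned by strerror(0) depends on the C library.

```c
#include <stdio.h>
#include <string.h>

/* Roughly what err() produces: the message, ": ", then strerror(errnum). */
static int format_err_style(char *buf, size_t len, const char *msg, int errnum)
{
	return snprintf(buf, len, "%s: %s", msg, strerror(errnum));
}

/* Plain diagnostic for failures that are not system errors. */
static int format_plain(char *buf, size_t len, const char *msg)
{
	return snprintf(buf, len, "%s", msg);
}
```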
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
When running an old version of the intel-speed-select tool on newer platforms,
even with "intel-speed-select -v", the tool only complains about
"Incompatible API version", without giving the current version info.
Print the version info whenever an incompatible API version is detected.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
When running the "intel-speed-select -h" command, it returns
1. 0 when using a version that is API incompatible.
2. 1 when using a version that is API compatible.
This is confusing.
Fix the program to return 0 for the "-h" parameter, and return 1 whenever
"Incompatible API versions" is detected.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
When decoding and using the CPU extended family ID in intel-speed-select,
there are several potential issues:
1. Masking with 0x0f to get the CPU extended family ID is bogus because
the CPU extended family ID takes 8 bits (bits 27:20).
2. Using the CPU extended family ID field without checking the CPU family
ID is risky, because the Intel SDM says, "The Extended Family ID needs to
be examined only when the Family ID is 0FH."
3. Saving the CPU family ID and the CPU extended family ID separately
doesn't align with the Linux kernel, and it may bring extra complexity
when making family-specific changes in the future.
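A minimal sketch of decoding a single combined family value from CPUID leaf 1
EAX, following the SDM rule quoted above. The function name cpuid_family() is
hypothetical and is not taken from the tool.

```c
#include <stdint.h>

/*
 * Family ID lives in bits 11:8. The 8-bit Extended Family ID (bits 27:20)
 * is added only when the Family ID is 0xF.
 */
static unsigned int cpuid_family(uint32_t eax)
{
	unsigned int family = (eax >> 8) & 0xf;

	if (family == 0xf)
		family += (eax >> 20) & 0xff;

	return family;
}
```

For example, a signature of 0x00050654 decodes to family 6, and the extended
family bits are never consulted.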
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
SST-PP level change results in online/offline of CPUs with -o option.
The Linux intel-pstate driver internally stores the current HWP_REQ MSR
value during offline and restores them during online.
It is possible that during SST-PP level change, the new HWP_CAP limits
can be updated. So, when a CPU is online, the HWP_REQ MSR should be
updated to new values based on HWP_CAP values.
This is particularly problematic when either turbo is disabled or the
current HWP_REQ value (stored before online) is less than the base
frequency from the updated HWP_CAP MSR guaranteed value. If the HWP_REQ
MSR is not updated, then the performance will be limited to the value
before perf level change.
Hence the tool updates cpufreq scaling_max_freq to the newer
base_frequency value in this case. This step is not required when HWP
interrupts are enabled, as the perf level change should result in a new
interrupt with HWP_GUARANTEED_PERF_CHANGE_STATUS and the intel_pstate
driver will update to new limits.
However, the tool has no way to know whether HWP interrupts are enabled,
so it must handle the case where they are not. It therefore still has to
update scaling_max_freq.
With the QOS changes in the kernel, user space writes to scaling_max_freq
are treated as hard limits. So, when base frequency is increased with
SST-BF enabled, the cpufreq subsystem will still not allow setting to the
SST-BF high priority core frequency. So, the HWP_REQ MSR will still be
capped to the user-set scaling_max_freq after SST-PP level change.
To address this, instead of setting scaling_max_freq to the current HWP_CAP
highest frequency, set it to the maximum integer value to set the QOS limit
as unconstrained. In this case, the actual HWP_REQ maximum frequency will
still be capped to HWP_CAP highest performance by the intel-pstate driver.
So, it will not result in invalid HWP_REQ values.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Per Linus' comments requesting the replacement of "INDIR_BR_LP" in the
indirect branch tracking prctl()s with something more readable, and
suggesting the use of the speculation control prctl()s as an exemplar,
reimplement the prctl()s and related constants that control per-task
forward-edge control flow integrity.
This primarily involves two changes. First, the prctls are
restructured to resemble the style of the speculative execution
workaround control prctls PR_{GET,SET}_SPECULATION_CTRL, to make them
easier to extend in the future. Second, the "indir_br_lp" abbreviation
is expanded to "branch_landing_pads" to be less telegraphic. The
kselftest and documentation are adjusted accordingly.
Link: https://lore.kernel.org/linux-riscv/CAHk-=whhSLGZAx3N5jJpb4GLFDqH_QvS07D+6BnkPWmCEzTAgw@mail.gmail.com/
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Paul Walmsley <pjw@kernel.org>
Similar to the recent change to expand "LP" to "branch landing pad",
let's expand "SS" in the ptrace uapi macros to "shadow stack" as well.
This aligns with the existing prctl() arguments, which use the
expanded "shadow stack" names, rather than just the abbreviation.
Link: https://lore.kernel.org/linux-riscv/CAHk-=whhSLGZAx3N5jJpb4GLFDqH_QvS07D+6BnkPWmCEzTAgw@mail.gmail.com/
Cc: Deepak Gupta <debug@rivosinc.com>
Signed-off-by: Paul Walmsley <pjw@kernel.org>
Per Linus' comments about the unreadability of abbreviations such as
"LP", rename the RISC-V ptrace landing pad CFI macro names to be more
explicit. This primarily involves expanding "LP" in the names to some
variant of "branch landing pad."
Link: https://lore.kernel.org/linux-riscv/CAHk-=whhSLGZAx3N5jJpb4GLFDqH_QvS07D+6BnkPWmCEzTAgw@mail.gmail.com/
Cc: Deepak Gupta <debug@rivosinc.com>
Signed-off-by: Paul Walmsley <pjw@kernel.org>
EXPECT_EQ() expands to multiple lines, breaking up one-line if
statements. This issue was not present in the patch on the mailing list
but was instead introduced by the maintainer when attempting to fix up
checkpatch warnings. Add braces around EXPECT_EQ() to avoid the error
even though checkpatch suggests removing them:
validate_v_ptrace.c:626:17: error: ‘else’ without a previous ‘if’
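The failure mode can be sketched with a hypothetical stand-in macro.
CHECK_EQ() below is not the kselftest EXPECT_EQ(); it is only assumed to
share the property that it expands to more than one statement. Used as the
unbraced body of an 'if', only the first statement would belong to the 'if'
and a following 'else' would have nothing to pair with.

```c
static int failures;

/* Hypothetical macro that expands to two statements. */
#define CHECK_EQ(expected, seen)			\
	if ((expected) != (seen))			\
		failures++;				\
	(void)0

static int classify(int value)
{
	int branch;

	/* Braces keep the whole expansion inside the 'if' so 'else' still pairs up. */
	if (value > 0) {
		CHECK_EQ(1, value);
		branch = 1;
	} else {
		branch = 0;
	}

	return branch;
}
```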
Fixes: 3789d5eecd ("selftests: riscv: verify syscalls discard vector context")
Fixes: 30eb191c89 ("selftests: riscv: verify ptrace rejects invalid vector csr inputs")
Fixes: 849f05ae1e ("selftests: riscv: verify ptrace accepts valid vector csr values")
Signed-off-by: Charlie Jenkins <thecharlesjenkins@gmail.com>
Reviewed-and-tested-by: Sergey Matyukevich <geomatsi@gmail.com>
Link: https://patch.msgid.link/20260309-fix_selftests-v2-2-9d5a553a531e@gmail.com
Signed-off-by: Paul Walmsley <pjw@kernel.org>
Add support for new features:
* CPPC performance priority
* Dynamic EPP
* Raw EPP
* New unit tests for new features
Fixes for:
* PREEMPT_RT
* sysfs files being present when HW missing
* Broken/outdated documentation
Merge tag 'amd-pstate-v7.1-2026-04-02' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux
Pull amd-pstate new content for 7.1 (2026-04-02) from Mario Limonciello:
"Add support for new features:
* CPPC performance priority
* Dynamic EPP
* Raw EPP
* New unit tests for new features
Fixes for:
* PREEMPT_RT
* sysfs files being present when HW missing
* Broken/outdated documentation"
* tag 'amd-pstate-v7.1-2026-04-02' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux: (22 commits)
MAINTAINERS: amd-pstate: Step down as maintainer, add Prateek as reviewer
cpufreq: Pass the policy to cpufreq_driver->adjust_perf()
cpufreq/amd-pstate: Pass the policy to amd_pstate_update()
cpufreq/amd-pstate-ut: Add a unit test for raw EPP
cpufreq/amd-pstate: Add support for raw EPP writes
cpufreq/amd-pstate: Add support for platform profile class
cpufreq/amd-pstate: add kernel command line to override dynamic epp
cpufreq/amd-pstate: Add dynamic energy performance preference
Documentation: amd-pstate: fix dead links in the reference section
cpufreq/amd-pstate: Cache the max frequency in cpudata
Documentation/amd-pstate: Add documentation for amd_pstate_floor_{freq,count}
Documentation/amd-pstate: List amd_pstate_prefcore_ranking sysfs file
Documentation/amd-pstate: List amd_pstate_hw_prefcore sysfs file
amd-pstate-ut: Add a testcase to validate the visibility of driver attributes
amd-pstate-ut: Add module parameter to select testcases
amd-pstate: Introduce a tracepoint trace_amd_pstate_cppc_req2()
amd-pstate: Add sysfs support for floor_freq and floor_count
amd-pstate: Add support for CPPC_REQ2 and FLOOR_PERF
x86/cpufeatures: Add AMD CPPC Performance Priority feature.
amd-pstate: Make certain freq_attrs conditionally visible
...
The current custom implementation of offsetof() fails UBSAN:
runtime error: member access within null pointer of type 'struct ...'
This means that all its users, including container_of(), free() and
realloc(), fail.
Use __builtin_offsetof() instead which does not have this issue and
has been available since GCC 4 and clang 3.
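A minimal sketch of the approach, using hypothetical my_offsetof() and
my_container_of() names rather than the nolibc definitions. The builtin
computes the offset at compile time without dereferencing a null pointer.

```c
#include <stddef.h>

/* Compiler builtin: no null-pointer member access involved. */
#define my_offsetof(type, member) __builtin_offsetof(type, member)

/* Recover the enclosing structure from a pointer to one of its members. */
#define my_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - my_offsetof(type, member)))

struct item {
	int id;
	long payload;
};

static struct item *item_from_payload(long *payload)
{
	return my_container_of(payload, struct item, payload);
}
```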
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260401-nolibc-asprintf-v1-1-46292313439f@weissschuh.net
fstatat() contains two open-coded copies of makedev() to handle minor
numbers >= 256. Now that the regular makedev() handles both large minor
and major numbers correctly, use the common function.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260404-nolibc-makedev-v2-6-456a429bf60c@weissschuh.net
statx() returns both 32-bit minor and major numbers. For both of them to
fit into the 'dev_t' in 'struct stat', that needs to be 64 bits wide.
The other uses of 'dev_t' in nolibc are makedev() and friends and
mknod(). makedev() and friends are going to be adapted in an upcoming
commit and mknod() will silently truncate 'dev_t' to 'unsigned int' in
the kernel, similar to other libcs.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260404-nolibc-makedev-v2-4-456a429bf60c@weissschuh.net
Functions make it easier to keep the input and output types straight and
avoid duplicate evaluations of their arguments.
These functions will also become a bit more complex in order to handle the
full 64-bit 'dev_t', which is easier to read in a function.
Stay compatible with code which expects these to be macros.
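A sketch of what function-based helpers for a 64-bit device number can look
like. The sketch_* names are hypothetical, and the bit layout shown is the
one glibc uses; it is an assumption here, not necessarily the layout nolibc
ends up with. The self-referential #define is one way to keep "#ifdef"
checks working for code that expects a macro.

```c
#include <stdint.h>

typedef uint64_t sketch_dev_t;

/* Arguments are evaluated exactly once, unlike in a plain macro. */
static sketch_dev_t sketch_makedev(unsigned int major, unsigned int minor)
{
	return ((sketch_dev_t)(major & 0x00000fffu) << 8) |
	       ((sketch_dev_t)(major & 0xfffff000u) << 32) |
	       ((sketch_dev_t)(minor & 0x000000ffu)) |
	       ((sketch_dev_t)(minor & 0xffffff00u) << 12);
}
#define sketch_makedev(major, minor) sketch_makedev(major, minor)

static unsigned int sketch_major(sketch_dev_t dev)
{
	return (unsigned int)(((dev >> 8) & 0x00000fffu) |
			      ((dev >> 32) & 0xfffff000u));
}

static unsigned int sketch_minor(sketch_dev_t dev)
{
	return (unsigned int)((dev & 0x000000ffu) |
			      ((dev >> 12) & 0xffffff00u));
}
```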
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260404-nolibc-makedev-v2-3-456a429bf60c@weissschuh.net
These functions/macros are about to be changed.
Add some tests to make sure they continue working.
As they only handle small dev_t values, only test those for now.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260404-nolibc-makedev-v2-1-456a429bf60c@weissschuh.net
The test checks both invalid GPAs as well as unmappable GPAs, so drop
'invalid' from its name.
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260316202732.3164936-10-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
The test currently allegedly makes sure that VMRUN causes a #GP if the
vmcb12 GPA is valid but unmappable. However, it calls run_guest() with
the test vmcb12 GPA, and the #GP is produced from VMLOAD, not VMRUN.
Additionally, the underlying logic just changed to match architectural
behavior, and all of VMRUN/VMLOAD/VMSAVE fail emulation if vmcb12 cannot
be mapped. The CPU still injects a #GP if the vmcb12 GPA exceeds
maxphyaddr.
Rework the test to use the KVM_ONE_VCPU_TEST[_SUITE] harness, and
test all of VMRUN/VMLOAD/VMSAVE with both an invalid GPA (-1ULL) causing
a #GP, and a valid but unmappable GPA causing emulation failure. Execute
the instructions directly from L1 instead of run_guest() to make sure
the #GP or emulation failure is produced by the right instruction.
Leave the #VMEXIT with unmappable GPA test case as-is, but wrap it with
a test harness as well.
Opportunistically drop gp_triggered, as the test already checks that
a #GP was injected through a SYNC. Also, use the first unmapped GPA
instead of the maximum legal GPA, as some CPUs inject a #GP for the
maximum legal GPA (likely in a reserved area).
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260316202732.3164936-9-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
We have a test for coalescing with bad TCP checksum, let's also
test bad IPv4 header checksum.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260402210000.1512696-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We explicitly test ipip encap. Let's add ip6ip6, too. Having
just ipip seems like favoring IPv4 which we should not do :)
Testing all combinations is left for future work, not sure
it's actually worth it.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260402210000.1512696-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When constructing the packets for large_* test cases we use
a static value for packet count and MSS. It works okay for
ipv4 vs ipv6 but the gap between ipv4 and ip6ip6 is going to
be quite significant.
Make the defines calculate the worst case values, those
are only used for sizing stack arrays. Create helpers for
calculating precise values based on the exact test case.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260402210000.1512696-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Willem points out TOTAL_HDR_LEN is identical to MAX_HDR_LEN.
This seems to have been the case ever since the test was added.
Replace the uses of TOTAL_HDR_LEN with MAX_HDR_LEN, MAX seems
more common for what this value is.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260402210000.1512696-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Try to use already calculated offsets and not depend on the ipip
flag as much. This patch should not change any functionality,
it's just a cleanup to make ip6ip6 support easier.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260402210000.1512696-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The new capacity/order test exits as soon as it sees the expected
packet sequence. This may allow the "flushing" FIN packet to spill
over to the next test. Let's always wait for the FIN before exiting.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260402210000.1512696-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Small IPv4 packets get padded to 60B, this may break / confuse
some buggy implementations. Add a test to coalesce a 1B payload.
Keep this separate from the lrg_sml test because I suspect some
implementations may not handle this case (treat padded frames
as ineligible for coalescing).
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260402210000.1512696-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a test trying to induce a GRO context timeout followed
by another sequence of packets for the same flow. The second
burst arrives 100ms after the first one so any implementation
(SW or HW) must time out waiting at that point. We expect both
bursts to be aggregated successfully but separately.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260402210000.1512696-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Refactor CXL core/region code to make region code more manageable by
splitting out DAX and PMEM code from RAM handling code.
cxl/core: use cleanup.h for devm_cxl_add_dax_region
cxl/core/region: move dax region device logic into region_dax.c
cxl/core/region: move pmem region driver logic into region_pmem.c
The series addresses conflicts between HMEM and CXL when handling Soft
Reserved memory ranges. CXL makes a best-effort attempt to claim the Soft
Reserved memory ranges that are CXL regions. If that fails, it punts
back to HMEM.
tools/testing/cxl: Test dax_hmem takeover of CXL regions
tools/testing/cxl: Simulate auto-assembly failure
dax/hmem: Parent dax_hmem devices
dax/hmem: Fix singleton confusion between dax_hmem_work and hmem devices
dax/hmem: Reduce visibility of dax_cxl coordination symbols
cxl/region: Constify cxl_region_resource_contains()
cxl/region: Limit visibility of cxl_region_contains_resource()
dax/cxl: Fix HMEM dependencies
cxl/region: Fix use-after-free from auto assembly failure
dax/hmem, cxl: Defer and resolve Soft Reserved ownership
cxl/region: Add helper to check Soft Reserved containment by CXL regions
dax: Track all dax_region allocations under a global resource tree
dax/cxl, hmem: Initialize hmem early and defer dax_cxl binding
dax/hmem: Gate Soft Reserved deferral on DEV_DAX_CXL
dax/hmem: Request cxl_acpi and cxl_pci before walking Soft Reserved ranges
dax/hmem: Factor HMEM registration into __hmem_register_device()
dax/bus: Use dax_region_put() in alloc_dax_region() error path
Prep patches for CXL type2 accelerator basic support
cxl/region: Factor out interleave granularity setup
cxl/region: Factor out interleave ways setup
cxl: Make region type based on endpoint type
cxl/pci: Remove redundant cxl_pci_find_port() call
cxl: Move pci generic code from cxl_pci to core/cxl_pci
cxl: export internal structs for external Type2 drivers
cxl: support Type2 when initializing cxl_dev_state
The cxl_test module currently hard-codes auto regions in the mock
topology, limiting coverage of the driver's region auto-assembly
logic.
Teach cxl_test to replay previously committed decoder programming
across a cxl_acpi unbind/bind cycle. Decoder programming is recorded
in a registry keyed by a stable port identity and decoder id. The
registry is updated on decoder commit and reset events and consulted
during enumeration to restore previously enabled decoders.
This allows regions created through the user interface to be replayed
during enumeration and treated as auto-discovered regions, enabling
testing of region auto-assembly using configurations created in the
cxl_test topology.
Example workflow:
# cxl create-region ...
# echo 1 > /sys/bus/platform/devices/cxl_acpi.0/decoder_reset_preserve_registry
# echo cxl_acpi.0 > /sys/bus/platform/drivers/cxl_acpi/unbind
# echo cxl_acpi.0 > /sys/bus/platform/drivers/cxl_acpi/bind
# echo 0 > /sys/bus/platform/devices/cxl_acpi.0/decoder_reset_preserve_registry
The NDCTL CXL unit test, cxl-region-replay.sh, demonstrates the usage.
Co-developed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Co-developed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Link: https://patch.msgid.link/20260314061952.2221030-1-alison.schofield@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Drop the explicit KVM_SEV_LAUNCH_UPDATE_VMSA call when creating an SEV-ES
VM in the SEV migration test, as sev_vm_create() automatically updates the
VMSA pages for SEV-ES guests. The only reason the duplicate call doesn't
cause visible problems is because the test doesn't actually try to run the
vCPUs. That will change when KVM adds a check to prevent userspace from
re-launching a VMSA (which corrupts the VMSA page due to KVM writing
encrypted private memory).
Fixes: 69f8e15ab6 ("KVM: selftests: Use the SEV library APIs in the intra-host migration test")
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260310234829.2608037-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add two passes before the main verifier pass:
bpf_compute_const_regs() is a forward dataflow analysis that tracks
register values in R0-R9 across the program using fixed-point
iteration in reverse postorder. Each register is tracked with
a six-state lattice:
UNVISITED -> CONST(val) / MAP_PTR(map_index) /
MAP_VALUE(map_index, offset) / SUBPROG(num) -> UNKNOWN
At merge points, if two paths produce the same state and value for
a register, it stays; otherwise it becomes UNKNOWN.
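The merge rule can be sketched as below. The type and function names are
hypothetical, and treating UNVISITED as "no information yet" (so the other
side wins) is an assumption made for this sketch rather than something the
description above spells out.

```c
#include <stdint.h>

enum sk_state {
	SK_UNVISITED, SK_CONST, SK_MAP_PTR, SK_MAP_VALUE, SK_SUBPROG, SK_UNKNOWN
};

struct sk_reg {
	enum sk_state state;
	uint32_t val;	/* constant, map index or subprog number */
	uint32_t off;	/* offset, only meaningful for SK_MAP_VALUE */
};

/* Join the states that two incoming paths computed for one register. */
static struct sk_reg sk_merge(struct sk_reg a, struct sk_reg b)
{
	struct sk_reg unknown = { SK_UNKNOWN, 0, 0 };

	if (a.state == SK_UNVISITED)
		return b;
	if (b.state == SK_UNVISITED)
		return a;
	/* Same state and same value: the fact survives the merge. */
	if (a.state == b.state && a.val == b.val && a.off == b.off)
		return a;

	return unknown;
}
```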
The analysis handles:
- MOV, ADD, SUB, AND with immediate or register operands
- LD_IMM64 for plain constants, map FDs, map values, and subprogs
- LDX from read-only maps: constant-folds the load by reading the
map value directly via bpf_map_direct_read()
Results that fit in 32 bits are stored per-instruction in
insn_aux_data and bitmasks.
bpf_prune_dead_branches() uses the computed constants to evaluate
conditional branches. When both operands of a conditional jump are
known constants, the branch outcome is determined statically and the
instruction is rewritten to an unconditional jump.
The CFG postorder is then recomputed to reflect new control flow.
This eliminates dead edges so that subsequent liveness analysis
doesn't propagate through dead code.
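A sketch of evaluating a conditional jump once both operands are known
constants. The enum and function names are hypothetical and cover only a few
comparison kinds; a jump that is always taken can become an unconditional
jump to the target, and one that is never taken can fall through.

```c
#include <stdbool.h>
#include <stdint.h>

enum sk_jmp { SK_JEQ, SK_JNE, SK_JGT, SK_JSGT };

/* Returns true if the branch is always taken for these constant operands. */
static bool sk_branch_taken(enum sk_jmp op, uint64_t dst, uint64_t src)
{
	switch (op) {
	case SK_JEQ:
		return dst == src;
	case SK_JNE:
		return dst != src;
	case SK_JGT:	/* unsigned comparison */
		return dst > src;
	case SK_JSGT:	/* signed comparison */
		return (int64_t)dst > (int64_t)src;
	}
	return false;
}
```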
Also add runtime sanity check to validate that precomputed
constants match the verifier's tracked state.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260403024422.87231-5-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a few tests for topo sort:
- linear chain: main -> A -> B
- diamond: main -> A, main -> B, A -> C, B -> C
- mixed global/static: main -> global -> static leaf
- shared callee: main -> leaf, main -> global -> leaf
- duplicate calls: main calls same subprog twice
- no calls: single subprog
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260403024422.87231-4-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a pass that sorts subprogs in topological order so that iterating
subprog_topo_order[] walks leaf subprogs first, then their callers.
This is computed as a DFS post-order traversal of the CFG.
The pass runs after check_cfg() to ensure the CFG has been validated
before traversing and after postorder has been computed to avoid
walking dead code.
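The ordering can be sketched as a DFS post-order walk over a small call
graph. All names below are hypothetical, the graph is assumed to be acyclic,
and subprog 0 is assumed to be main; only subprogs reachable from main are
emitted.

```c
#define SK_MAX_SUBPROGS 8

struct sk_callgraph {
	int nr;
	/* calls[a][b] != 0 means subprog a calls subprog b */
	unsigned char calls[SK_MAX_SUBPROGS][SK_MAX_SUBPROGS];
};

static void sk_visit(const struct sk_callgraph *g, int caller,
		     unsigned char *seen, int *order, int *cnt)
{
	int callee;

	seen[caller] = 1;
	for (callee = 0; callee < g->nr; callee++)
		if (g->calls[caller][callee] && !seen[callee])
			sk_visit(g, callee, seen, order, cnt);

	/* Post-order: a subprog is emitted only after all of its callees. */
	order[(*cnt)++] = caller;
}

/* Fills order[] with leaves first and main last; returns the entry count. */
static int sk_topo_order(const struct sk_callgraph *g, int *order)
{
	unsigned char seen[SK_MAX_SUBPROGS] = { 0 };
	int cnt = 0;

	sk_visit(g, 0, seen, order, &cnt);
	return cnt;
}
```

With a diamond-shaped graph (main calls A and B, both call C), C comes out
first and main last.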
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260403024422.87231-3-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The metric code uses the event parsing code but it generally assumes
all events are supported. Arnaldo reported AMD supporting
stalled-cycles-frontend but not stalled-cycles-backend [1]. An issue
with this is that before parsing happens the metric code tries to
share events within groups to reduce the number of events and
multiplexing. If the group has some supported and some unsupported
events, the whole group will become broken. To avoid this situation,
add has_event tests to the metrics for stalled-cycles-frontend and
stalled-cycles-backend. has_event is evaluated when parsing the
metric and its result is constant propagated (with if-elses) to reduce
the number of events. This means when the metric code considers
sharing the events, only supported events will be shared.
Note for backporting: this change updates
tools/perf/pmu-events/empty-pmu-events.c, a convenience file for builds
on systems without python present. While the metrics.json code should
backport easily there can be conflicts on empty-pmu-events.c. In this
case the build will have left a file test-empty-pmu-events.c that can
be copied over empty-pmu-events.c to resolve issues and make an
appropriate empty-pmu-events.c for the json in the source tree at the
time of the build.
[1] https://lore.kernel.org/lkml/abm1nR-2xjOUBroD@x1/
Reported-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Closes: https://lore.kernel.org/lkml/abm1nR-2xjOUBroD@x1/
Fixes: c7adeb0974 ("perf jevents: Add set of common metrics based on default ones")
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Add basic kwork coverage tests for record, report, latency, timehist
and top.
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Handle the finished_round event. Set up the CTF events when the
feature event desc is read. In pipe mode the attr events will create
the evsels and the feature event desc events will name the evsels. The
CTF events need the evsel name, so wait until feature event descs are
read (in pipe mode) before setting up the events except for tracepoint
events. Handle the tracing_data event so that tracepoint information
is available when setting up tracepoint events.
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
In situations like the perf data converter the evsel__name will be
used to create babeltrace events. If the events have the same name
then creation can fail. Avoid these failures by including more
information into the unknown event names.
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Some event processing functions like perf_event__process_tracing_data
return a zero or positive value on success. Ordered event processing
handles any non-zero value as an error, which is inconsistent with
reader__process_events and reader__read_event that only treat negative
values as errors. Make the ordered events error handling consistent
with that of the events reader.
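
A minimal sketch of the convention described above, assuming a delivery
loop that receives one return value per event. The names
demo_ret_is_error() and demo_deliver_all() are hypothetical and are not
perf functions:

```c
#include <stdbool.h>

/* Hypothetical helper: only negative values count as errors. */
static bool demo_ret_is_error(int ret)
{
	return ret < 0;
}

/*
 * Hypothetical delivery loop: rets[] stands in for the value returned by
 * each event handler. Zero and positive values are both treated as success.
 */
static int demo_deliver_all(const int *rets, int n)
{
	for (int i = 0; i < n; i++) {
		if (demo_ret_is_error(rets[i]))
			return rets[i];
	}
	return 0;
}
```

With this check a handler that returns a positive byte count no longer
aborts the round, while a negative errno still does.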
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
In non-pipe/data mode the header has a 256-bit bitmap representing
whether a feature is enabled or not. In pipe mode features are written
out in perf_event__synthesize_features as PERF_RECORD_HEADER_FEATURE
events with a special zero sized marker for the last feature. If a new
feature is added the last feature marker event appears as that feature
from old pipe mode perf data. As the event is zero sized it will fail
to be processed and generally terminate perf.
Add a last_feat variable to the header that in non-pipe/data mode is
just HEADER_LAST_FEATURE. In pipe mode compute the last_feat by
handling zero sized feature events, assuming they are the marker and
updating last_feat accordingly. Potentially a feature event could be
zero sized and so still process the feature event, just ignore the
error if it fails.
As perf_event__process_feature can properly handle pipe mode data,
migrate users to it except for report that still wants to group events
and stop header printing with the last feature marker. Make
perf_event__process_feature non-fatal in the case of a newer feature
than this version of perf's HEADER_LAST_FEATURE, which was the
behavior all users wanted.
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Print log information in ordered event processing so that the cause of
finished round failing is clearer. Print the event name along with its
number when an event isn't processed. Add extra detail about where the
failure happened.
The following log lines come from running `perf data convert`. Before:
0xa250 [0x10]: failed to process type: 80
After:
0xa250 [0x10]: piped event processing failed for event of type: FEATURE (80)
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
By removing the features from feat_ops with ifdefs the previous logic
would print "# (null)" when perf processed a feature that lacked
builtin support. Remove the ifdefs from feat_ops and in the relevant
functions print errors/messages about the lack of support.
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
For logging and debug messages it can be convenient to convert a
feature number to a name. Add header_feat__name for this and reuse the
data already within the feat_ops struct.
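
A rough sketch of the idea, assuming a table indexed by feature number
that already stores a name per entry. The enum values, demo_ops and
demo_feat__name() are illustrative stand-ins, not the real perf
definitions:

```c
#include <stddef.h>

/* Illustrative subset of features; the real list lives in perf's header code. */
enum demo_feat {
	DEMO_FEAT_TRACING_DATA,
	DEMO_FEAT_BUILD_ID,
	DEMO_FEAT_HOSTNAME,
	DEMO_FEAT_LAST,
};

struct demo_feat_ops {
	const char *name;
};

/* The name is stored next to the per-feature callbacks, so reuse it. */
static const struct demo_feat_ops demo_ops[DEMO_FEAT_LAST] = {
	[DEMO_FEAT_TRACING_DATA] = { .name = "TRACING_DATA" },
	[DEMO_FEAT_BUILD_ID]     = { .name = "BUILD_ID" },
	[DEMO_FEAT_HOSTNAME]     = { .name = "HOSTNAME" },
};

/* Map a feature number to a printable name, never returning NULL. */
static const char *demo_feat__name(unsigned int id)
{
	if (id >= DEMO_FEAT_LAST || !demo_ops[id].name)
		return "UNKNOWN";
	return demo_ops[id].name;
}
```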
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
clockid_t is declared in time.h but the include is missing. Reordering
header files may result in build breakages. Add the include to avoid
this.
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmnPGdMACgkQ6rmadz2v
bTrNxw/9Hcn2V/Jqp/cEagmKIKqSAUFgEE+AwRbQU5YL2Yem/6Q15rnOk8pOSDT5
jqk7VbuchVmWa+a9DVy7d3XVWohk332QbvQRHfqV8P0ZpnfJa0YqdZlKg2/4/8P/
yVhLzVrGIGcvvz9CfhIynRhq/fvr7iYbSSv9JT3nig4qCYpUf7kPbXSLtxyElNWN
xX36KfTxQO4xI2+iezsNwklXF25Tv59V1fNuKF2lshxS+DwaroAzAJLd3MGvTHRj
8y5kU1UDb+HeJh9DpEFjppQp4qUQjIKAiNVvXGUOe7TI/i9VTIiMfesniWKNwzYv
Alo2G8fLb4nJhzNL2ol4R0I5BCYmMT55tBFvSNJQ+9Esy6azkbExmKuE1hXsUXo1
jY0TbNt58zSZEmyz9SYoFKlg4lOW4ZIMl0RtnSBRoDwtK3ThGV7QFlnKq3uPZ6ce
RcpMk7cOnERLzwPnpSiACrQmzhMk+j5HG1u+Eb3rXKxYCQO6bAhpQyPDKsiXNgkL
uezq2zqAnNho0/CInHGlRj7E1JnvRoHCcLBT4zzyIY/jruI8fzK0aMqGMvk/qOby
BWDnJ9GG3VmGSUc/FOp3IchKCnxXhkYqsjBCP03cbIZgr1MuixZeom81OsPNmSX8
Ke+FeGNsU5zOUJ1iG2BZjdya/DAgP8hd85WVtaXyX60KKhuu45c=
=w0RY
-----END PGP SIGNATURE-----
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Fix register equivalence for pointers to packet (Alexei Starovoitov)
- Fix incorrect pruning due to atomic fetch precision tracking (Daniel
Borkmann)
- Fix grace period wait for bpf_link-ed tracepoints (Kumar Kartikeya
Dwivedi)
- Fix use-after-free of sockmap's sk->sk_socket (Kuniyuki Iwashima)
- Reject direct access to nullable PTR_TO_BUF pointers (Qi Tang)
- Reject sleepable kprobe_multi programs at attach time (Varun R
Mallya)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add more precision tracking tests for atomics
bpf: Fix incorrect pruning due to atomic fetch precision tracking
bpf: Reject sleepable kprobe_multi programs at attach time
bpf: reject direct access to nullable PTR_TO_BUF pointers
bpf: sockmap: Fix use-after-free of sk->sk_socket in sk_psock_verdict_data_ready().
bpf: Fix grace period wait for tracepoint bpf_link
bpf: Fix regsafe() for pointers to packet
With the changes to the verifier in previous commits, we're not
expecting any invariant violations anymore. We should therefore always
enable BPF_F_TEST_REG_INVARIANTS to fail on invariant violations. Turns
out that's already the case and we've been explicitly setting this flag
in selftests when it wasn't necessary. This commit removes those flags
from selftests, which should hopefully make clearer that it's always
enabled.
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/9afce92510a7d44569dc3af63c9b8c608e69298a.1775142354.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch adds a selftest for the change in the previous patch. The
selftest is derived from a syzbot reproducer from [1] (among the 22
reproducers on that page, only 4 still reproduced on latest bpf tree,
all being small variants of the same invariant violation).
The test case failure without the previous patch is shown below.
0: R1=ctx() R10=fp0
0: (85) call bpf_get_prandom_u32#7 ; R0=scalar()
1: (bf) r5 = r0 ; R0=scalar(id=1) R5=scalar(id=1)
2: (57) r5 &= -4 ; R5=scalar(smax=0x7ffffffffffffffc,umax=0xfffffffffffffffc,smax32=0x7ffffffc,umax32=0xfffffffc,var_off=(0x0; 0xfffffffffffffffc))
3: (bf) r7 = r0 ; R0=scalar(id=1) R7=scalar(id=1)
4: (57) r7 &= 1 ; R7=scalar(smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1))
5: (07) r7 += -43 ; R7=scalar(smin=smin32=-43,smax=smax32=-42,umin=0xffffffffffffffd5,umax=0xffffffffffffffd6,umin32=0xffffffd5,umax32=0xffffffd6,var_off=(0xffffffffffffffd4; 0x3))
6: (5e) if w5 != w7 goto pc+1
verifier bug: REG INVARIANTS VIOLATION (false_reg1): range bounds violation u64=[0xffffffd5, 0xffffffffffffffd4] s64=[0x80000000ffffffd5, 0x7fffffffffffffd4] u32=[0xffffffd5, 0xffffffd4] s32=[0xffffffd5, 0xffffffd4] var_off=(0xffffffd4, 0xffffffff00000000)
R5 and R7 are prepared such that their tnums intersection results in a
known constant but that constant isn't within R7's u32 bounds.
is_branch_taken isn't able to detect this case today, so the verifier
walks the impossible fallthrough branch. After regs_refine_cond_op and
reg_bounds_sync refine R5 on the assumption that the branch is taken,
the impossibility becomes apparent and results in an invariant violation
for R5: umin32 is greater than umax32.
The previous patch fixes this by using regs_refine_cond_op and
reg_bounds_sync in is_branch_taken to detect the impossible branch. The
fallthrough branch is therefore correctly detected as dead code.
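
The tnum argument can be reproduced by hand. The sketch below narrows
the values from the log to 32 bits and applies the usual value/mask
intersection formula (assumed here to match the kernel's
tnum_intersect()). struct tnum32 and the demo_* names are illustrative
only:

```c
#include <stdbool.h>
#include <stdint.h>

/* Tristate number: bits set in mask are unknown, all others equal value. */
struct tnum32 {
	uint32_t value;
	uint32_t mask;
};

/* Known bits of either input stay known; only commonly unknown bits remain. */
static struct tnum32 tnum32_intersect(struct tnum32 a, struct tnum32 b)
{
	uint32_t v = a.value | b.value;
	uint32_t mu = a.mask & b.mask;

	return (struct tnum32){ .value = v & ~mu, .mask = mu };
}

/* Low 32 bits of the var_off values printed for R5 and R7 at insn 6. */
static const struct tnum32 demo_w5 = { 0x0, 0xfffffffc };
static const struct tnum32 demo_w7 = { 0xffffffd4, 0x3 };
/* R7's u32 bounds from the same log line. */
static const uint32_t demo_w7_umin32 = 0xffffffd5;
static const uint32_t demo_w7_umax32 = 0xffffffd6;

/* Can w5 == w7 hold? If the intersection is a constant, check the bounds. */
static bool demo_w5_eq_w7_possible(void)
{
	struct tnum32 t = tnum32_intersect(demo_w5, demo_w7);

	if (t.mask == 0)
		return t.value >= demo_w7_umin32 && t.value <= demo_w7_umax32;
	return true; /* undecided from tnums alone in this sketch */
}
```

The intersection is the constant 0xffffffd4, which lies below umin32 =
0xffffffd5, so the equal case cannot happen.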
Link: https://syzkaller.appspot.com/bug?extid=c950cc277150935cc0b5 [1]
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/b1e22233a3206ead522f02eda27b9c5c991a0de9.1775142354.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The build_id parsing functions calculate a filename length from the
event header size and read directly into a stack buffer of PATH_MAX
bytes without bounds checking. A malformed perf.data file with a
crafted header.size can cause the length to be negative or exceed
PATH_MAX, resulting in a stack buffer overflow.
Add bounds checking for the filename length in both
perf_header__read_build_ids() and the ABI quirk variant. Print a
warning message when invalid length is detected.
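
A minimal sketch of such a check, assuming the filename length is the
event size minus the size of the fixed part of the record.
demo_filename_len_ok() is a hypothetical helper, and the strict
comparison against PATH_MAX (leaving room for a terminating NUL) is a
choice made for this example:

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef PATH_MAX
#define PATH_MAX 4096	/* fallback for systems where limits.h lacks it */
#endif

/*
 * header_size: total size claimed by the (untrusted) event header.
 * fixed_size:  size of the record without the trailing filename.
 */
static bool demo_filename_len_ok(size_t header_size, size_t fixed_size)
{
	long long len = (long long)header_size - (long long)fixed_size;

	/* Reject negative lengths and anything that would fill the buffer. */
	return len >= 0 && len < PATH_MAX;
}
```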
Signed-off-by: SeungJu Cheon <suunj1331@gmail.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Return -ENOENT when no metric/group matches, and directly use the return
value from expr__find_ids(), so -EINVAL is reserved for parse failures.
Print separate logs to make it clear.
Before:
perf stat -C 5 -vvv
Using CPUID 0x00000000410fd490
metric expr 100 * (STALL_SLOT_BACKEND / (CPU_CYCLES * #slots) - BR_MIS_PRED * 3 / CPU_CYCLES) for backend_bound
parsing metric: 100 * (STALL_SLOT_BACKEND / (CPU_CYCLES * #slots) - BR_MIS_PRED * 3 / CPU_CYCLES)
Failure to read '#slots'
literal: #slots = nan
syntax error
Cannot find metric or group `Default'
After:
perf stat -C 5 -vvv
Using CPUID 0x00000000410fd490
metric expr 100 * (STALL_SLOT_BACKEND / (CPU_CYCLES * #slots) - BR_MIS_PRED * 3 / CPU_CYCLES) for backend_bound
parsing metric: 100 * (STALL_SLOT_BACKEND / (CPU_CYCLES * #slots) - BR_MIS_PRED * 3 / CPU_CYCLES)
Failure to read '#slots'
literal: #slots = nan
syntax error
Fail to parse metric or group `Default'
Signed-off-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Add a trailing newline for logs.
Before:
perf stat -C 5
Failure to read '#slots'Cannot find metric or group `Default'
After:
perf stat -C 5
Failure to read '#slots'
Cannot find metric or group `Default'
Signed-off-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
expr__find_ids() propagates the parser return value directly. For syntax
errors, the parser can return a positive value, but callers treat it as
success, e.g., in the case below on an Arm64 platform:
metric expr 100 * (STALL_SLOT_BACKEND / (CPU_CYCLES * #slots) - BR_MIS_PRED * 3 / CPU_CYCLES) for backend_bound
parsing metric: 100 * (STALL_SLOT_BACKEND / (CPU_CYCLES * #slots) - BR_MIS_PRED * 3 / CPU_CYCLES)
Failure to read '#slots' literal: #slots = nan
syntax error
Convert positive parser returns in expr__find_ids() to -EINVAL so that
callers respect the error value.
Before:
perf stat -C 5
Failure to read '#slots'Failure to read '#slots'Failure to read '#slots'Failure to read '#slots'Segmentation fault
After:
perf stat -C 5
Failure to read '#slots'Cannot find metric or group `Default'
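
The conversion amounts to the following sketch. demo_normalize_parse_ret()
is a hypothetical wrapper, shown on a bare integer rather than on the
real parser call:

```c
#include <errno.h>

/*
 * parser_ret: 0 on success, a negative errno on failure, or a positive
 * value when the generated parser reports a syntax error.
 */
static int demo_normalize_parse_ret(int parser_ret)
{
	/* Callers only test for ret < 0, so fold positive values into -EINVAL. */
	return parser_ret > 0 ? -EINVAL : parser_ret;
}
```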
Fixes: ded80bda8b ("perf expr: Migrate expr ids table to a hashmap")
Signed-off-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
If TLD_FREE_DATA_ON_THREAD_EXIT is not enabled in a translation unit
that calls __tld_create_key() first, another translation unit that
enables it will not get the auto cleanup feature, as the pthread key is
only created once, when allocating metadata. Fix it by always trying to
create the pthread key when __tld_create_key() is called.
Also improve the documentation:
- Discourage users from using different options in different translation
units
- Specify calling tld_free() before thread exit as undefined behavior
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260331213555.1993883-6-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
TLD_READ_ONCE() is redundant as the only reference passed to it is
defined as _Atomic. The load is guaranteed to be atomic by the C11
standard (6.2.6.1). Drop the macro.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Sun Jian <sun.jian.kdev@gmail.com>
Link: https://lore.kernel.org/r/20260331213555.1993883-5-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Without specifying constructor priority of the hidden constructor
function defined by TLD_DEFINE_KEY, __tld_create_key(..., dyn_data =
false) may run after tld_get_data() called from other constructors.
Threads calling tld_get_data() before __tld_create_key(..., dyn_data
= false) will not allocate enough memory for all TLDs and later result
in OOB access. Therefore, set it to the lowest value available to
users. Note that lower means higher priority and 0-100 is reserved for
the compiler.
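
The ordering effect can be seen with the GCC/Clang constructor
attribute. In this sketch 101 is used as the first priority outside the
reserved range. The demo_* names and the value 64 are hypothetical and
only stand in for key creation and a dependent constructor:

```c
/* State that an early constructor must set up before anyone reads it. */
static int demo_total_size;
static int demo_seen_size = -1;

/* Priority 101: runs before constructors with larger or no priority. */
__attribute__((constructor(101)))
static void demo_key_ctor(void)
{
	demo_total_size = 64;
}

/* No explicit priority: runs after all prioritized constructors. */
__attribute__((constructor))
static void demo_user_ctor(void)
{
	demo_seen_size = demo_total_size;
}
```

Without the explicit priority the relative order of the two
constructors would depend on link order, and demo_user_ctor() could
observe a size of 0.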
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Sun Jian <sun.jian.kdev@gmail.com>
Link: https://lore.kernel.org/r/20260331213555.1993883-4-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Simplify data allocation by always using aligned_alloc() and passing
size_pot, the size rounded up to the closest power of two, as the alignment.
Currently, aligned_alloc(page_size, size) is only intended to be used
with memory allocators that can fulfill the request without rounding
size up to page_size to conserve memory. This is enabled by defining
TLD_DATA_USE_ALIGNED_ALLOC. The reason to align to page_size is due to
the limitation of UPTR where only a page can be pinned to the kernel.
Otherwise, malloc(size * 2) is used to allocate memory for data.
However, we don't need to call aligned_alloc(page_size, size) to get
a contiguous memory of size bytes within a page. aligned_alloc(size_pot,
...) will also do the trick. Therefore, just use aligned_alloc(size_pot,
...) universally.
As for the size argument, create a new option,
TLD_DONT_ROUND_UP_DATA_SIZE, to specify not rounding up the size.
This preserves the current TLD_DATA_USE_ALIGNED_ALLOC behavior, allowing
memory allocators with low overhead aligned_alloc() to not waste memory.
To enable this, users need to make sure it is not undefined behavior
for the memory allocator to have size not being an integral multiple of
alignment.
Compared to the current implementation, !TLD_DATA_USE_ALIGNED_ALLOC
used to always waste size bytes of memory due to malloc(size * 2).
Now the worst case becomes size - 1 and the best case is 0 when the size
is already a power of two.
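
A small sketch of why aligning to size_pot is enough, assuming size_pot
does not exceed the page size. The helper names are hypothetical, and
the sketch passes size_pot as both alignment and size, i.e. the default
behavior without TLD_DONT_ROUND_UP_DATA_SIZE:

```c
#include <stdint.h>
#include <stdlib.h>

/* Round size up to the next power of two (size must be >= 1). */
static size_t demo_round_up_pow2(size_t size)
{
	size_t pot = 1;

	while (pot < size)
		pot <<= 1;
	return pot;
}

/*
 * A block aligned to size_pot starts at a multiple of size_pot, so as long
 * as size_pot divides the page size it can never straddle a page boundary.
 */
static void *demo_alloc_within_page(size_t size)
{
	size_t size_pot = demo_round_up_pow2(size);

	return aligned_alloc(size_pot, size_pot);
}

/* Check that [p, p + size) lies within a single page. */
static int demo_in_one_page(const void *p, size_t size, size_t page_size)
{
	uintptr_t start = (uintptr_t)p;

	return start / page_size == (start + size - 1) / page_size;
}
```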
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260331213555.1993883-3-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, when allocating memory for data, size of tld_data_u->start
is not taken into account. This may cause OOB access. Fix it by adding
the size of the non-flexible array part of tld_data_u.
Besides, explicitly align tld_data_u->data to 8 bytes in case some
fields are added before data in the future. Otherwise, such a change
could break the assumption that every data field is 8-byte aligned, and
sizeof(tld_data_u) would no longer be equal to
offsetof(struct tld_data_u, data), which we use interchangeably.
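
The layout assumption can be illustrated with a stand-in structure.
struct demo_tld_data, the type of its start field and demo_alloc_size()
are hypothetical, and the aligned attribute is a GCC/Clang extension:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for tld_data_u: a fixed header plus flexible data. */
struct demo_tld_data {
	uint64_t start;
	/* Explicit alignment keeps data 8-byte aligned if the header grows. */
	char data[] __attribute__((aligned(8)));
};

/* The allocation must cover the fixed header as well as the data bytes. */
static size_t demo_alloc_size(size_t data_size)
{
	return sizeof(struct demo_tld_data) + data_size;
}
```

With the explicit alignment, sizeof(struct demo_tld_data) and
offsetof(struct demo_tld_data, data) stay equal even if a smaller field
is later added in front of data.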
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Sun Jian <sun.jian.kdev@gmail.com>
Link: https://lore.kernel.org/r/20260331213555.1993883-2-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, attach_probe covers manual single-kprobe attaches by
func_name, but not the raw-address form that the PMU-based
single-kprobe path can accept.
This commit adds PERF and LINK raw-address coverage. It resolves
SYS_NANOSLEEP_KPROBE_NAME through kallsyms, passes the absolute address
in bpf_kprobe_opts.offset with func_name = NULL, and verifies that
kprobe and kretprobe are still triggered. It also verifies that LEGACY
rejects the same form.
Signed-off-by: Hoyeon Lee <hoyeon.lee@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20260401143116.185049-4-hoyeon.lee@suse.com
bpf_program__attach_kprobe_opts() documents single-kprobe attach
through func_name, with an optional offset. For the PMU-based path,
func_name = NULL with an absolute address in offset already works as
well, but that is not described in the API.
This commit clarifies this existing non-legacy behavior. For PMU-based
attach, callers can use func_name = NULL with an absolute address in
offset as the raw-address form. For legacy tracefs/debugfs kprobes,
reject this form explicitly.
Signed-off-by: Hoyeon Lee <hoyeon.lee@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20260401143116.185049-3-hoyeon.lee@suse.com
perf_event_open_probe() and perf_event_{k,u}probe_open_legacy() helpers
return negative error codes directly on failure. This commit
changes bpf_program__attach_{k,u}probe_opts() to use those return
values directly instead of re-reading a possibly changed errno.
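
The hazard being avoided can be sketched as follows. All three functions
are hypothetical stand-ins, not libbpf code. The point is only that
errno may be overwritten between the failing call and the place where
the error is reported:

```c
#include <errno.h>

/* Hypothetical helper that fails and returns the reason as -errno. */
static int demo_open_probe(void)
{
	errno = ENOENT;
	return -errno;
}

/* Hypothetical logging/cleanup step that happens to overwrite errno. */
static void demo_log_failure(void)
{
	errno = EBADF;
}

static int demo_attach(void)
{
	int pfd = demo_open_probe();

	if (pfd < 0) {
		demo_log_failure();	/* errno is EBADF from here on */
		return pfd;		/* still -ENOENT */
	}
	return 0;
}
```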
Signed-off-by: Hoyeon Lee <hoyeon.lee@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20260401143116.185049-2-hoyeon.lee@suse.com
Align bpf_program__clone() with bpf_object_load_prog() by gating
BTF func/line info on FEAT_BTF_FUNC kernel support, and resolve
caller-provided prog_btf_fd before checking obj->btf so that callers
with their own BTF can use clone() even when the object has no BTF
loaded.
While at it, treat func_info and line_info fields as atomic groups
to prevent mismatches between pointer and count from different sources.
Move bpf_program__clone() to libbpf 1.8.
Fixes: 970bd2dced ("libbpf: Introduce bpf_program__clone()")
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260401151640.356419-1-mykyta.yatsenko5@gmail.com
Test case 'perf data type profiling tests' fails on s390 with this
error:
# ./perf mem record -- ./perf test -w code_with_type
failed: no PMU supports the memory events
# echo $?
255
#
because s390 does not support memory events at all. According to the
man page, perf annotate --code-with-type works with memory
instructions only. As command 'perf mem record ...' is not supported
on s390, skip this test for s390.
Output before:
# ./perf test 'perf data type profiling tests'
77: perf data type profiling tests : FAILED!
Output after:
# ./perf test 'perf data type profiling tests'
77: perf data type profiling tests : Skip
Fixes: f60a5c2296 ("perf tests: Test annotate with data type profiling and rust")
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Cc: Dmitrii Dolgov <9erthalion6@gmail.com>
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Suggested-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
When sorting the dso array we sometimes get a crash due to null
comparisons in comparator functions. So prevent __dsos__add from
adding null to the dso array to avoid out-of-memory related errors.
Signed-off-by: Anubhav Shelat <ashelat@redhat.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
test__ratio_to_prev() assumed the first event in a group is the leader,
which is not the case when the event is expanded into two event groups
on hybrid PMUs with auto counter reload support. Instead, iterate over the
event group generated for each core PMU. Also update "wrong leader" test to
check that the subordinate event has the correct leader instead of checking
that it is not the group leader. Finally, do not exit immediately if a PMU
without auto counter reload support is found.
Signed-off-by: Thomas Falcon <thomas.falcon@intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Fixes: 56be0fe5f6 ("perf record: Add auto counter reload parse and regression tests")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
When perf resolves symbols from kernel module ELF files (ET_REL),
it converts symbol addresses to file offsets so that sample IPs
can be matched to the correct symbol. The conversion adjusts each
symbol's st_value:
sym->st_value -= shdr->sh_addr - shdr->sh_offset;
For vmlinux (ET_EXEC), st_value is a virtual address and sh_addr
is the section's virtual base, so subtracting sh_addr and adding
sh_offset correctly yields a file offset.
For kernel modules (ET_REL), st_value is a section-relative
offset. The module loader ignores sh_addr entirely and places
symbols at module_base + st_value. Converting to file offset
requires only adding sh_offset; subtracting sh_addr introduces an
error equal to sh_addr bytes.
When .text has sh_addr == 0 -- the historical norm for simple
modules -- both formulas produce the same result and the bug is
latent. As modules gain more metadata sections before .text (.note,
.static_call.text, etc.), the linker assigns .text a non-zero
sh_addr, exposing the defect. For example, nfsd.ko on this kernel
has sh_addr=0xa80, kvm-intel.ko has sh_addr=0x1e90.
The effect is that all .text symbols in affected modules
shift by sh_addr bytes relative to sample IPs, causing perf
report to attribute samples to incorrect, nearby symbols. This
was observed as 13% of LLC-load-miss samples misattributed
to nfsd_file_get_dio_attrs when the actual hot function was
nfsd_cache_lookup, approximately 0xa80 bytes away in the symbol
table.
Use the existing dso__rel() flag (already set for ET_REL modules)
to select the correct adjustment: add sh_offset for ET_REL,
subtract (sh_addr - sh_offset) for ET_EXEC/ET_DYN.
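
The two adjustments can be compared in a short sketch.
demo_sym_file_offset() is a hypothetical helper, and the st_value and
sh_offset numbers in the usage note are made up. Only sh_addr = 0xa80
comes from the nfsd.ko example above:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Convert a symbol's st_value to a file offset.
 * ET_REL:         st_value is section-relative, so only add sh_offset.
 * ET_EXEC/ET_DYN: st_value is a virtual address, so also remove sh_addr.
 */
static uint64_t demo_sym_file_offset(bool is_rel, uint64_t st_value,
				     uint64_t sh_addr, uint64_t sh_offset)
{
	if (is_rel)
		return st_value + sh_offset;
	return st_value - (sh_addr - sh_offset);
}
```

For a module symbol with st_value = 0x40 in a section with sh_addr =
0xa80 and sh_offset = 0x1000, the ET_REL branch yields 0x1040 while the
old formula yields 0x5c0, a difference of exactly sh_addr. With sh_addr
== 0 both branches agree.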
Fixes: 0131c4ec79 ("perf tools: Make it possible to read object code from kernel modules")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
It needs to synthesize task info for the comm name. The mmap
information is only needed for callchain symbolization, which is not used
by the summary mode. Also, the total and cgroup summary modes don't
require the task info. Let's skip the processing if possible.
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Perf stat is crashing on arm64 hosts with the following issue:
# make -C tools/perf DEBUG=1
# perf stat sleep 1
perf: util/evsel.c:2034: get_group_fd: Assertion `!(!leader->core.fd)' failed.
[1] 1220794 IOT instruction (core dumped) ./perf stat
The sorting function introduced by commit a745c0831c ("perf stat:
Sort default events/metrics") compares events based on their individual
properties. This can cause events from different groups to be
interleaved, resulting in group members appearing before their leaders
in the sorted evlist.
When the iterator opens events in list order, a group member may be
processed before its leader has been opened.
For example, CPU_CYCLES (idx=32) with leader STALL_SLOT_BACKEND (idx=37)
could be sorted before its leader, causing the crash when CPU_CYCLES
tries to get its group fd from the not-yet-opened leader.
Fix this by comparing events based on their leader's attributes instead
of their own attributes when the events are in different groups. This
ensures all members of a group share the same sort key as their leader,
keeping groups together and guaranteeing leaders are opened before their
members.
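
One way to picture the leader-based comparison is the qsort() sketch
below. struct demo_evsel, its sort_key field and the tie-breaking rules
are assumptions made for illustration, not the actual perf comparator:

```c
#include <stdlib.h>

struct demo_evsel {
	int idx;
	int sort_key;			/* hypothetical per-event ordering property */
	struct demo_evsel *leader;	/* points to itself for a group leader */
};

static int demo_cmp(const void *a, const void *b)
{
	const struct demo_evsel *l = *(const struct demo_evsel *const *)a;
	const struct demo_evsel *r = *(const struct demo_evsel *const *)b;

	if (l->leader != r->leader) {
		/* Different groups: order by the leaders' keys, never the members'. */
		if (l->leader->sort_key != r->leader->sort_key)
			return l->leader->sort_key < r->leader->sort_key ? -1 : 1;
		return l->leader->idx < r->leader->idx ? -1 : 1;
	}
	/* Same group: the leader comes first, members follow by index. */
	if (l == l->leader)
		return r == l ? 0 : -1;
	if (r == r->leader)
		return 1;
	return (l->idx > r->idx) - (l->idx < r->idx);
}
```

Sorting a member with a small own key together with its leader and an
unrelated event keeps the member directly behind its leader, whereas
comparing the events' own keys would have placed it first.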
Fixes: a745c0831c ("perf stat: Sort default events/metrics")
Reported-by: Denis Yaroshevskiy <dyaroshev@meta.com>
Tested-by: Dmitry Ilvokhin <d@ilvokhin.com>
Tested-by: Ian Rogers <irogers@google.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Add verifier precision tracking tests for BPF atomic fetch operations.
Validate that backtrack_insn correctly propagates precision from the
fetch dst_reg to the stack slot for {fetch_add,xchg,cmpxchg} atomics.
For the first two, src_reg gets the old memory value; for the last
one, r0 does. The fetched register is used for pointer arithmetic to
trigger backtracking. Also add coverage for fetch_{or,and,xor} flavors,
which exercise the bitwise atomic fetch variants going through the same
insn->imm & BPF_FETCH check but with different imm values.
Add dual-precision regression tests for fetch_add and cmpxchg where
both the fetched value and a reread of the same stack slot are tracked
for precision. After the atomic operation, the stack slot is STACK_MISC,
so the ldx does not set INSN_F_STACK_ACCESS. These tests verify that
stack precision propagates solely through the atomic fetch's load side.
Add map-based tests for fetch_add and cmpxchg which validate that non-
stack atomic fetch completes precision tracking without falling back
to mark_all_scalars_precise. Lastly, add 32-bit variants for {fetch_add,
cmpxchg} on map values to cover the second valid atomic operand size.
# LDLIBS=-static PKG_CONFIG='pkg-config --static' ./vmtest.sh -- ./test_progs -t verifier_precision
[...]
+ /etc/rcS.d/S50-startup
./test_progs -t verifier_precision
[ 1.697105] bpf_testmod: loading out-of-tree module taints kernel.
[ 1.700220] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
[ 1.777043] tsc: Refined TSC clocksource calibration: 3407.986 MHz
[ 1.777619] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fc6d7268, max_idle_ns: 440795260133 ns
[ 1.778658] clocksource: Switched to clocksource tsc
#633/1 verifier_precision/bpf_neg:OK
#633/2 verifier_precision/bpf_end_to_le:OK
#633/3 verifier_precision/bpf_end_to_be:OK
#633/4 verifier_precision/bpf_end_bswap:OK
#633/5 verifier_precision/bpf_load_acquire:OK
#633/6 verifier_precision/bpf_store_release:OK
#633/7 verifier_precision/state_loop_first_last_equal:OK
#633/8 verifier_precision/bpf_cond_op_r10:OK
#633/9 verifier_precision/bpf_cond_op_not_r10:OK
#633/10 verifier_precision/bpf_atomic_fetch_add_precision:OK
#633/11 verifier_precision/bpf_atomic_xchg_precision:OK
#633/12 verifier_precision/bpf_atomic_fetch_or_precision:OK
#633/13 verifier_precision/bpf_atomic_fetch_and_precision:OK
#633/14 verifier_precision/bpf_atomic_fetch_xor_precision:OK
#633/15 verifier_precision/bpf_atomic_cmpxchg_precision:OK
#633/16 verifier_precision/bpf_atomic_fetch_add_dual_precision:OK
#633/17 verifier_precision/bpf_atomic_cmpxchg_dual_precision:OK
#633/18 verifier_precision/bpf_atomic_fetch_add_map_precision:OK
#633/19 verifier_precision/bpf_atomic_cmpxchg_map_precision:OK
#633/20 verifier_precision/bpf_atomic_fetch_add_32bit_precision:OK
#633/21 verifier_precision/bpf_atomic_cmpxchg_32bit_precision:OK
#633/22 verifier_precision/bpf_neg_2:OK
#633/23 verifier_precision/bpf_neg_3:OK
#633/24 verifier_precision/bpf_neg_4:OK
#633/25 verifier_precision/bpf_neg_5:OK
#633 verifier_precision:OK
Summary: 1/25 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260331222020.401848-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
to each PR carrying 30%+ more fixes than in previous era. The good
news is that so far none of the "extra" fixes are themselves
causing real regressions. Not sure how much comfort that is.
Current release - fix to a fix:
- netdevsim: fix build if SKB_EXTENSIONS=n
- eth: stmmac: skip VLAN restore when VLAN hash ops are missing
Previous releases - regressions:
- wifi: iwlwifi: mvm: don't send a 6E related command when
not supported
Previous releases - always broken:
- some info leak fixes
- add missing clearing of skb->cb[] on ICMP paths from tunnels
- ipv6: flowlabel: defer exclusive option free until RCU teardown
- ipv6: avoid overflows in ip6_datagram_send_ctl()
- mpls: add seqcount to protect platform_labels from OOB access
- bridge: improve safety of parsing ND options
- Bluetooth: fix leaks, overflows and races in hci_sync
- netfilter: add more input validation, some to address bugs directly
some to prevent exploits from cooking up broken configurations
- wifi: ath: avoid poor performance due to stopping the wrong
aggregation session
- wifi: virt_wifi: remove SET_NETDEV_DEV to avoid use-after-free
- eth: fec: fix the PTP periodic output sysfs interface
- eth: enetc: safely reinitialize TX BD ring when it has unsent frames
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmnOldMACgkQMUZtbf5S
IrvaPQ/9EdZIY8AnvdgZmzVrMkTbbshpOy/lLxkpFE4yX1Hgw9BLSZqoC3rq2b41
78Q6Zk7tbOHQb8rBLawi3+YuY+Eq5R4ajt4MNWWd1sYaaHnOXwp91jO4rvocSCjz
8o8/Z3VU4znG+cK85mcuYqNZcar/0dI8m01136Dtoi0dtZ4KKdUBBDT/Zq7Ov3gJ
pKrSMZBFT5UwnhlLi+xZ65KjdUMlbTujlQf0vH815p+iM+5E8fJNK5h+a6ZefXB4
Un+jXxhD/Vj5TBwq8ZouDSAWVCAG26Yy9RGcn5O7w0mlzv48mWB1bIoXFEyc2F8s
EbsiEqCNygHLoVTsBU1+0psYqey7aZDfceokzYMONHpJgpWbFmmHjfcFxfgeq9Of
iI3DU7IQMBKdN7uC4dCKc94Ty9Jye+DvCnkeMUEwxV4Dkhnr+2wP0pGqo6r2K0sT
9mFBh8YP2KyRd5+Ei8D4zmQrGpqpsXwSIwrhnGHEkWGjMAW+TltyOPzPzUgvMBHX
XllZIAFpTFaZiR9ZZU8PRyUNRfh93AmV0tY4xYCqVArf85A/LjqmJCw6K6Pthcmw
RzezpyQUCJ044EyDfDhjVgK/YEEkdT+wUcKKLw31pdOvQVAPJ4pI95pWbeVz4kLk
30DE7PR+2hExm44GHUfG/v8MJTE2OkSRu26Ci4dQsm3sT2zvv2g=
=3Pjk
-----END PGP SIGNATURE-----
Merge tag 'net-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"With fixes from wireless, bluetooth and netfilter included we're back
to each PR carrying 30%+ more fixes than in previous era.
The good news is that so far none of the "extra" fixes are themselves
causing real regressions. Not sure how much comfort that is.
Current release - fix to a fix:
- netdevsim: fix build if SKB_EXTENSIONS=n
- eth: stmmac: skip VLAN restore when VLAN hash ops are missing
Previous releases - regressions:
- wifi: iwlwifi: mvm: don't send a 6E related command when
not supported
Previous releases - always broken:
- some info leak fixes
- add missing clearing of skb->cb[] on ICMP paths from tunnels
- ipv6:
- flowlabel: defer exclusive option free until RCU teardown
- avoid overflows in ip6_datagram_send_ctl()
- mpls: add seqcount to protect platform_labels from OOB access
- bridge: improve safety of parsing ND options
- bluetooth: fix leaks, overflows and races in hci_sync
- netfilter: add more input validation, some to address bugs directly
some to prevent exploits from cooking up broken configurations
- wifi:
- ath: avoid poor performance due to stopping the wrong
aggregation session
- virt_wifi: remove SET_NETDEV_DEV to avoid use-after-free
- eth:
- fec: fix the PTP periodic output sysfs interface
- enetc: safely reinitialize TX BD ring when it has unsent frames"
* tag 'net-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (95 commits)
eth: fbnic: Increase FBNIC_QUEUE_SIZE_MIN to 64
ipv6: avoid overflows in ip6_datagram_send_ctl()
net: hsr: fix VLAN add unwind on slave errors
net: hsr: serialize seq_blocks merge across nodes
vsock: initialize child_ns_mode_locked in vsock_net_init()
selftests/tc-testing: add tests for cls_fw and cls_flow on shared blocks
net/sched: cls_flow: fix NULL pointer dereference on shared blocks
net/sched: cls_fw: fix NULL pointer dereference on shared blocks
net/x25: Fix overflow when accumulating packets
net/x25: Fix potential double free of skb
bnxt_en: Restore default stat ctxs for ULP when resource is available
bnxt_en: Don't assume XDP is never enabled in bnxt_init_dflt_ring_mode()
bnxt_en: Refactor some basic ring setup and adjustment logic
net/mlx5: Fix switchdev mode rollback in case of failure
net/mlx5: Avoid "No data available" when FW version queries fail
net/mlx5: lag: Check for LAG device before creating debugfs
net: macb: properly unregister fixed rate clocks
net: macb: fix clk handling on PCI glue driver removal
virtio_net: clamp rss_max_key_size to NETDEV_RSS_KEY_LEN
net/sched: sch_netem: fix out-of-bounds access in packet corruption
...
Add a test to verify the issue: kprobe_write_ctx can be abused to modify
struct pt_regs of kernel functions via kprobe_write_ctx=true freplace
progs.
Without the fix, the issue is verified:
kprobe_write_ctx=true freplace prog is allowed to attach to
kprobe_write_ctx=false kprobe prog. Then, the first arg of
bpf_fentry_test1 will be set as 0, and bpf_prog_test_run_opts() gets
-EFAULT instead of 0.
With the fix, the issue is rejected at attach time.
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260331145353.87606-3-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Some future AMD processors have feature named "CPPC Performance
Priority" which lets userspace specify different floor performance
levels for different CPUs. The platform firmware takes these different
floor performance levels into consideration while throttling the CPUs
under power/thermal constraints. The presence of this feature is
indicated by bit 16 of the EDX register for CPUID leaf
0x80000007. More details can be found in AMD Publication titled "AMD64
Collaborative Processor Performance Control (CPPC) Performance
Priority" Revision 1.10.
Define a new feature bit named X86_FEATURE_CPPC_PERF_PRIO to map to
CPUID 0x80000007.EDX[16].
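
The bit test described above can be sketched as follows. This is a minimal illustration in Python, not kernel code: the helper name is hypothetical, and the EDX value of CPUID leaf 0x80000007 is assumed to have been read by some other means.

```python
# Bit position taken from the text above: CPUID 0x80000007.EDX[16].
CPPC_PERF_PRIO_BIT = 16


def has_cppc_perf_prio(edx: int) -> bool:
    """Return True if bit 16 of the given EDX value is set.

    Hypothetical helper; 'edx' is assumed to hold the EDX output of
    CPUID leaf 0x80000007.
    """
    return bool((edx >> CPPC_PERF_PRIO_BIT) & 1)
```
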
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
kvm_arch_has_default_irqchip is required for irqfd_test and returns
true if an in-kernel interrupt controller is supported.
Fixes: a133052666 ("KVM: selftests: Fix irqfd_test for non-x86 architectures")
Signed-off-by: Mayuresh Chitale <mayuresh.chitale@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260402101818.2982071-1-mayuresh.chitale@oss.qualcomm.com
Signed-off-by: Anup Patel <anup@brainfault.org>
The hotplug testing only tries reading a trace remote buffer, loaded
before a CPU is offline. Extend this testing to cover:
* A trace remote buffer loaded after a CPU is offline.
* A trace remote buffer loaded before a CPU is online.
Because of these added test cases, move the hotplug testing into a
separate hotplug.tc file.
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://patch.msgid.link/20260401045100.3394299-3-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Regression tests for the shared-block NULL derefs fixed in the previous
two patches:
- fw: attempt to attach an empty fw filter to a shared block and
verify the configuration is rejected with EINVAL.
- flow: create a flow filter on a shared block without a baseclass
and verify the configuration is rejected with EINVAL.
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260331050217.504278-3-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add a new selftest - ethtool_std_stats.sh - which validates the
eth-ctrl, eth-mac and pause standard statistics exported by an
interface. Collision related eth-mac counters as well as the error ones
will be checked against zero since that is the most likely correct
scenario.
The central part of this patch is the traffic_test() function which
gathers the 'before' counter values, sends a batch of traffic and then
interrogates again the same counters in order to determine if the delta
is on target. The function receives an array through which the caller
can request which counters are interrogated and, for each of them, what
its target delta value is.
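
The before/after comparison can be modeled roughly as below. This is an illustrative Python sketch, not the bash traffic_test() itself: the function name is hypothetical, and exact equality between delta and target is an assumption (the real script may apply a different comparison).

```python
def check_deltas(before, after, targets):
    """Compare per-counter deltas against their target values.

    before, after: dicts mapping counter name -> sampled value.
    targets: dict mapping counter name -> expected delta.
    Returns a dict mapping counter name -> True if the delta is on target.
    Exact equality is assumed here purely for illustration.
    """
    results = {}
    for name, target in targets.items():
        delta = after[name] - before[name]
        results[name] = delta == target
    return results
```

Error and collision counters would be requested with a target of 0, matching the "checked against zero" behavior described earlier in this message.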
The output from this selftest looks as follows on a LX2160ARDB board:
$ ./run_kselftest.sh -t drivers/net/hw:ethtool_std_stats.sh
TAP version 13
1..1
# timeout set to 0
# selftests: drivers/net/hw: ethtool_std_stats.sh
# TAP version 13
# 1..26
# ok 1 ethtool_std_stats.eth-ctrl-MACControlFramesTransmitted
# ok 2 ethtool_std_stats.eth-ctrl-MACControlFramesReceived
# ok 3 ethtool_std_stats.eth-mac-FrameCheckSequenceErrors
# ok 4 ethtool_std_stats.eth-mac-AlignmentErrors
# ok 5 ethtool_std_stats.eth-mac-FramesLostDueToIntMACXmitError
# ok 6 ethtool_std_stats.eth-mac-CarrierSenseErrors # SKIP
# ok 7 ethtool_std_stats.eth-mac-FramesLostDueToIntMACRcvError
# ok 8 ethtool_std_stats.eth-mac-InRangeLengthErrors # SKIP
# ok 9 ethtool_std_stats.eth-mac-OutOfRangeLengthField # SKIP
# ok 10 ethtool_std_stats.eth-mac-FrameTooLongErrors # SKIP
# ok 11 ethtool_std_stats.eth-mac-FramesAbortedDueToXSColls # SKIP
# ok 12 ethtool_std_stats.eth-mac-SingleCollisionFrames # SKIP
# ok 13 ethtool_std_stats.eth-mac-MultipleCollisionFrames # SKIP
# ok 14 ethtool_std_stats.eth-mac-FramesWithDeferredXmissions # SKIP
# ok 15 ethtool_std_stats.eth-mac-LateCollisions # SKIP
# ok 16 ethtool_std_stats.eth-mac-FramesWithExcessiveDeferral # SKIP
# ok 17 ethtool_std_stats.eth-mac-BroadcastFramesXmittedOK
# ok 18 ethtool_std_stats.eth-mac-OctetsTransmittedOK
# ok 19 ethtool_std_stats.eth-mac-BroadcastFramesReceivedOK
# ok 20 ethtool_std_stats.eth-mac-OctetsReceivedOK
# ok 21 ethtool_std_stats.eth-mac-FramesTransmittedOK
# ok 22 ethtool_std_stats.eth-mac-MulticastFramesXmittedOK
# ok 23 ethtool_std_stats.eth-mac-FramesReceivedOK
# ok 24 ethtool_std_stats.eth-mac-MulticastFramesReceivedOK
# ok 25 ethtool_std_stats.pause-tx_pause_frames
# ok 26 ethtool_std_stats.pause-rx_pause_frames
# # 10 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
# # Totals: pass:16 fail:0 xfail:0 xpass:0 skip:10 error:0
ok 1 selftests: drivers/net/hw: ethtool_std_stats.sh
Please note that not all MACs are counting the software injected pause
frames as real Tx pause. For example, on a LS1028ARDB the selftest
output will reflect the fact that neither the ENETC MAC, nor the Felix
switch MAC are able to detect Tx pause frames injected by software.
$ ./run_kselftest.sh -t drivers/net/hw:ethtool_std_stats.sh
(...)
# # software sent pause frames not detected
# ok 25 ethtool_std_stats.pause-tx_pause_frames # XFAIL
# ok 26 ethtool_std_stats.pause-rx_pause_frames
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Acked-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20260330152933.2195885-10-ioana.ciornei@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This patch finalizes the transition to work with a single local
interface for the ethtool_rmon.sh test. Each 'ip link' and 'ethtool'
command used by the test is annotated with the necessary run_on in
order to be executed on the necessary target system, be it local, in
another network namespace or through ssh.
Since we need NETIF up and running also for control traffic, we now
expect that the interfaces are up and running and do not bring
them up or down at the end of the test. This is also documented in the
drivers/net/README.rst.
The ethtool_rmon.sh script can still be used in the older fashion by
passing two interfaces as command line arguments, the only restriction
is that those interfaces need to be already up.
$ DRIVER_TEST_CONFORMANT=no ./ethtool_rmon.sh eth0 eth1
As part of the kselftest infrastructure, this test can be run in the
following manner:
$ make -C tools/testing/selftests/ TARGETS="drivers/net drivers/net/hw" \
install INSTALL_PATH=/tmp/ksft-net-drv
$ cd /tmp/ksft-net-drv/
$ cat > ./drivers/net/net.config <<EOF
NETIF=endpmac17
LOCAL_V4=17.0.0.1
REMOTE_V4=17.0.0.2
REMOTE_TYPE=ssh
REMOTE_ARGS=root@192.168.5.200
EOF
$ ./run_kselftest.sh -t drivers/net/hw:ethtool_rmon.sh
TAP version 13
1..1
# timeout set to 0
# selftests: drivers/net/hw: ethtool_rmon.sh
# TAP version 13
# 1..14
# ok 1 ethtool_rmon.rx-pkts64to64
# ok 2 ethtool_rmon.rx-pkts65to127
# ok 3 ethtool_rmon.rx-pkts128to255
# ok 4 ethtool_rmon.rx-pkts256to511
# ok 5 ethtool_rmon.rx-pkts512to1023
# ok 6 ethtool_rmon.rx-pkts1024to1518
# ok 7 ethtool_rmon.rx-pkts1519to10240
# ok 8 ethtool_rmon.tx-pkts64to64
# ok 9 ethtool_rmon.tx-pkts65to127
# ok 10 ethtool_rmon.tx-pkts128to255
# ok 11 ethtool_rmon.tx-pkts256to511
# ok 12 ethtool_rmon.tx-pkts512to1023
# ok 13 ethtool_rmon.tx-pkts1024to1518
# ok 14 ethtool_rmon.tx-pkts1519to10240
# # Totals: pass:14 fail:0 xfail:0 xpass:0 skip:0 error:0
ok 1 selftests: drivers/net/hw: ethtool_rmon.sh
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20260330152933.2195885-9-ioana.ciornei@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Update the ethtool_rmon.sh test so that it uses the KTAP format for its
output. This is achieved by using the helpers found in ktap_helpers.sh.
An example output can be found below.
$ ./ethtool_rmon.sh endpmac3 endpmac4
TAP version 13
1..14
ok 1 ethtool_rmon.rx-pkts64to64
ok 2 ethtool_rmon.rx-pkts65to127
ok 3 ethtool_rmon.rx-pkts128to255
ok 4 ethtool_rmon.rx-pkts256to511
ok 5 ethtool_rmon.rx-pkts512to1023
ok 6 ethtool_rmon.rx-pkts1024to1518
ok 7 ethtool_rmon.rx-pkts1519to10240
ok 8 ethtool_rmon.tx-pkts64to64
ok 9 ethtool_rmon.tx-pkts65to127
ok 10 ethtool_rmon.tx-pkts128to255
ok 11 ethtool_rmon.tx-pkts256to511
ok 12 ethtool_rmon.tx-pkts512to1023
ok 13 ethtool_rmon.tx-pkts1024to1518
ok 14 ethtool_rmon.tx-pkts1519to10240
# Totals: pass:14 fail:0 xfail:0 xpass:0 skip:0 error:0
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20260330152933.2195885-8-ioana.ciornei@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The ethtool_rmon.sh script checks that the number of packets sent /
received during a test matches the expected value with a 1% tolerance.
Since in the next patches this test will gain the capability to also be
run on systems with a single interface where the traffic generator is
accessible through ssh, use the UINT32_MAX as the upper limit. This is
necessary since the same interface will be used also for control traffic
(the ssh commands) as well as the mausezahn generated one.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20260330152933.2195885-7-ioana.ciornei@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The selftests in drivers/net are slowly transitioning to being able to
be used on systems with a single network interface. The first step for the
ethtool_rmon.sh test is to only validate that the rmon counters are
properly exported on the first interface supplied as an argument.
Remove the rmon_histogram calls which intend to test also the rmon
counters on the 2nd interface. This also removes the need for the remote
system, which should be used only to inject traffic, to also support
rmon counters.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20260330152933.2195885-6-ioana.ciornei@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
If run on the ethtool_rmon.sh script, shellcheck generates a bunch of
false positive errors. Suppress those checks that generate them.
Also clean up the remaining warnings by using double quoting around the
used variables.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20260330152933.2195885-5-ioana.ciornei@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Update some helpers so that they are capable of running commands on
different targets than the local one. This patch makes the necessary
modification for those helpers / sections of code which are needed for
the ethtool_rmon.sh test that will be converted in the next patches.
For example, mac_addr_prepare() and mac_addr_restore() used when
STABLE_MAC_ADDRS=yes need to ensure stable MAC addresses on interfaces
located even in other namespaces. In order to do that, append the 'ip
link' commands with a 'run_on $dev' tag.
The same run_on is necessary also when verifying if all the interfaces
listed in NETIFS are indeed available.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20260330152933.2195885-4-ioana.ciornei@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Extend lib.sh so that it's able to parse drivers/net/net.config and
environment variables such as NETIF, REMOTE_TYPE, LOCAL_V4 etc described
in drivers/net/README.rst.
In order to make the transition towards running with a single local
interface smoother for the bash networking driver tests, beside sourcing
the net.config file also translate the new env variables into the old
style based on the NETIFS array. Since the NETIFS array only holds the
network interface names, also add a new array - TARGETS - which keeps
track of the target on which a specific interface resides - local,
netns or accessible through an ssh command.
For example, a net.config which looks like below:
NETIF=eth0
LOCAL_V4=192.168.1.1
REMOTE_V4=192.168.1.2
REMOTE_TYPE=ssh
REMOTE_ARGS=root@192.168.1.2
will generate the NETIFS and TARGETS arrays with the following data.
NETIFS[p1]="eth0"
NETIFS[p2]="eth2"
TARGETS[eth0]="local:"
TARGETS[eth2]="ssh:root@192.168.1.2"
The above will be true if on the remote target, the interface which has
the 192.168.1.2 address is named eth2.
Since the TARGETS array is indexed by the network interface name,
document a new restriction in README.rst which states that the remote
interface cannot have the same name as the local one. Keep the old way
of populating the NETIFS variable based on the command line arguments.
This will be invoked in case DRIVER_TEST_CONFORMANT = "no".
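
The translation from net.config to the two arrays can be modeled as follows. This is an illustrative Python sketch, not the bash code in lib.sh: the function names are hypothetical, the remote interface name is assumed to have been discovered separately, and the "TYPE:ARGS" target format is inferred from the ssh example above.

```python
def parse_net_config(text):
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


def build_arrays(config, remote_ifname):
    """Build dict models of the NETIFS and TARGETS arrays.

    remote_ifname is the name of the interface on the remote side; it is
    passed in because this sketch does not query the remote system.
    """
    local_ifname = config["NETIF"]
    # TARGETS is indexed by interface name, so the two names must differ.
    if remote_ifname == local_ifname:
        raise ValueError("remote interface must not share the local name")
    netifs = {"p1": local_ifname, "p2": remote_ifname}
    targets = {
        local_ifname: "local:",
        remote_ifname: f"{config['REMOTE_TYPE']}:{config['REMOTE_ARGS']}",
    }
    return netifs, targets
```

Fed the example net.config and a remote interface named eth2, this yields the same NETIFS and TARGETS contents listed above.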
Also add a couple of helpers which can be used by tests which need to
run a specific bash command on a different target than the local system,
be it either another netns or a remote system accessible through ssh.
The __run_on() function is passed through $1 the target on which the
command should be executed while run_on() is passed the name of the
interface that is then used to retrieve the target from the TARGETS
array.
Also add a stub run_on() function in net/lib.sh so that users of the
net/lib.sh are going through the stub only since neither NETIFS nor
TARGETS are valid in that circumstance.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20260330152933.2195885-3-ioana.ciornei@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Even though pause frame statistics are not exported through the same
ethtool command, there is no point in adding another helper just for
them. Extend the ethtool_std_stats_get() function so that we are able to
interrogate using the same helper all the standard statistics.
And since we are touching the function, convert the initial ethtool call
as well to the jq --arg form in order to be easier to read.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20260330152933.2195885-2-ioana.ciornei@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add test case for dirty tracking on a domain attached to PASID, also
confirm attachment to PASID fails if device doesn't support dirty tracking.
Suggested-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20260330101108.12594-5-zhenzhong.duan@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ6NaUOruQGUkvPdG4raS+Z+3y5EwUCac2n1wAKCRAraS+Z+3y5
E7INAPwOyqMJws2kswrIPZ8jqfaBIcNVe9MM9a9Ldp8qZmWUHAD/ayqW4hHP6eMA
WBNcVCDGStYeI4lyINS5AqPN8mMhOAI=
=iquL
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:
====================
pull-request: bpf-next 2026-04-01
We've added 2 non-merge commits during the last 2 day(s) which contain
a total of 3 files changed, 139 insertions(+), 23 deletions(-).
The main changes are:
1) skb_dst_drop(skb) when bpf prog does an encap or decap,
from Jakub Kicinski
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
selftests/bpf: Test that dst is cleared on same-protocol encap
net: Clear the dst when performing encap / decap
====================
Link: https://patch.msgid.link/20260401233956.4133413-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The test constantly fails on my Intel hybrid machine. The issue was it
has two events in the output even if I only gave it one event.
$ perf stat -e instructions -- perf test -w sqrtloop
Performance counter stats for 'perf test -w sqrtloop':
910,856,421 cpu_atom/instructions/ (28.05%)
14,852,865,997 cpu_core/instructions/ (96.79%)
1.014313341 seconds time elapsed
1.004114000 seconds user
0.008174000 seconds sys
Let's modify the awk script to add the values for each line and print
the total. The variable 'i' holds the number of input lines that have valid
output and variable 'c' holds the sum of actual counter values. That way
it should work on any platform.
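
The summing logic can be sketched like this. It is a Python model of the idea, not the awk script from the test: the function name is hypothetical, and the parsing assumes the human-readable perf stat layout shown above.

```python
def sum_counters(output, event="instructions"):
    """Return (i, c): matching line count and summed counter value.

    A line counts when its second field names the event and its first
    field is an integer, optionally with thousands separators.
    """
    i = 0  # number of input lines with valid output
    c = 0  # sum of the counter values on those lines
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2 or event not in fields[1]:
            continue
        value = fields[0].replace(",", "")
        if not value.isdigit():
            continue
        i += 1
        c += int(value)
    return i, c
```

On a non-hybrid machine the same code sees a single matching line, so the total equals that one counter value.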
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Writing to the test output files in the current working directory can
fail in various contexts such as continual test. Other tests write to
a mktemp-ed file, make the "perf script task-analyzer tests" follow
this convention too. Currently this isn't possible for the perf.data
file due to a lack of perf script support, add a variable for when
this support is available.
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The index into the cpumap array and the number of entries within the
array can never be negative, so let's make them unsigned. This is
prompted by reports that gcc 13 with -O6 is giving
alloc-size-larger-than errors. The change makes the cpumap changes and
then updates the declaration of index variables throughout perf and
libperf to be unsigned. The two things are hard to separate as
compiler warnings about mixing signed and unsigned types break the
build.
Reported-by: Chingbin Li <liqb365@163.com>
Closes: https://lore.kernel.org/lkml/20260212025127.841090-1-liqb365@163.com/
Tested-by: Chingbin Li <liqb365@163.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The WQ_AFFN_CACHE_SHARD affinity scope was added to the kernel but
wq_dump.py was not updated to enumerate it. Add the missing constant
lookup and include it in the affinity scopes iteration so that drgn
output shows the CACHE_SHARD pod topology alongside the other scopes.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
For (eg) "%*.*s" treat a negative field width as a request to left align
the output (the same as the '-' flag), and a negative precision to
request the default precision.
Set the default precision to -1 (not INT_MAX) and add explicit checks
to the string handling for negative values (makes the test unsigned).
For numeric output check for 'precision >= 0' instead of testing
_NOLIBC_PF_FLAGS_CONTAIN(flags, '.').
This needs an inverted test, some extra goto and removes an indentation.
The changed conditionals fix printf("%0-#o", 0) - but '0' and '-' shouldn't
both be specified.
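
The handling of negative '*' arguments for "%*.*s" can be modeled as below. This is a Python sketch of the described behavior, not the nolibc C code: both function names are hypothetical, and -1 stands for "default precision" as in the text above.

```python
def normalize_star_args(width, precision, left_align=False):
    """Interpret width/precision values supplied through '*'.

    A negative width requests left alignment (same as the '-' flag).
    A negative precision requests the default precision, modeled as -1.
    """
    if width < 0:
        left_align = True
        width = -width
    if precision < 0:
        precision = -1
    return width, precision, left_align


def format_str(s, width, precision, left_align=False):
    """Format a string roughly the way "%*.*s" would."""
    width, precision, left_align = normalize_star_args(
        width, precision, left_align)
    if precision >= 0:
        s = s[:precision]  # an explicit precision truncates the string
    return s.ljust(width) if left_align else s.rjust(width)
```
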
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Link: https://patch.msgid.link/20260323112247.3196-1-david.laight.linux@gmail.com
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
When platform firmware is committed to publishing EFI_CONVENTIONAL_MEMORY
in the memory map, but CXL fails to assemble the region, dax_hmem can
attempt to attach a dax device to the memory range.
Take advantage of the new ability to support multiple "hmem_platform"
devices to enable regression testing of several scenarios:
* CXL correctly assembles a region, check dax_hmem fails to attach dax
* CXL fails to assemble a region, check dax_hmem successfully attaches dax
* Check that loading the dax_cxl driver loads the dax_hmem driver
* Attempt to race cxl_mock_mem async probe vs dax_hmem probe flushing.
Check both positive and negative cases.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Link: https://patch.msgid.link/20260327052821.440749-10-dan.j.williams@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Add a cxl_test module option to skip setting up one of the members of the
default auto-assembled region.
This simulates a device failing between firmware setup and OS boot, or
region configuration interrupted by an event like kexec.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://patch.msgid.link/20260327052821.440749-9-dan.j.williams@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
set_id_regs creates a GIC3 guest when possible, and then proceeds
to write the ID registers as if they were not affected by the presence
of a GIC. As it turns out, ID_AA64PFR1_EL1 is the proof of the
contrary.
KVM now makes a point in exposing the GIC support to the guest,
no matter what userspace says (userspace such as QEMU is known to
write silly things at times).
Accommodate for this level of nonsense by teaching set_id_regs about
fields that are mutable, and only compare registers that have been
re-sanitised first.
Reported-by: Mark Brown <broonie@kernel.org>
Link: https://patch.msgid.link/20260401103611.357092-17-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Since commit 0c43094f8c ("eventpoll: Replace rwlock with spinlock"),
epoll_wait is a real-time-safe syscall for sleeping.
Add epoll_wait to the list of rt-safe sleeping APIs.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20260401130828.3115428-1-namcao@linutronix.de
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
free_reserved_area() is related to memblock as it frees reserved memory
back to the buddy allocator, similar to what memblock_free_late() does.
Move free_reserved_area() to mm/memblock.c to prepare for further
consolidation of the functions that free reserved memory.
No functional changes.
Link: https://patch.msgid.link/20260323074836.3653702-5-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
reserve_bootmem_region() is only called from
memmap_init_reserved_pages() and it was in mm/mm_init.c because of its
dependencies on static init_deferred_page().
Since init_deferred_page() is not static anymore, move
reserve_bootmem_region(), rename it to memmap_init_reserved_range() and
make it static.
Update the comment describing it to better reflect what the function
does and drop bogus comment about reserved pages in free_bootmem_page().
Update memblock test stubs to reflect the core changes.
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Link: https://patch.msgid.link/20260323072042.3651061-1-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
When using the "reserve_mem" parameter, users aim at having an
area that (hopefully) persists across boots, so pstore infrastructure
(like ramoops module) can make use of that to save oops/ftrace logs,
for example.
There is no easy way to determine if this kernel parameter is properly
set though; the kernel doesn't show information about this memory in
memblock debugfs, /proc/iomem or dmesg. This is relevant
information for tools like kdumpst[0], to determine if it's reliable
to use the reserved area as ramoops persistent storage; checking only
/proc/cmdline is not sufficient as it doesn't tell if the reservation
effectively succeeded or not.
Add here a new file under memblock debugfs showing properly set memory
reservations, with name and size as passed to "reserve_mem". Notice that
if no "reserve_mem=" is passed on command-line or if the reservation
attempts fail, the file is not created.
[0] https://aur.archlinux.org/packages/kdumpst
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Link: https://patch.msgid.link/20260324012839.1991765-2-gpiccoli@igalia.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
The _fill_states() method returns a list of strings, but the type
annotation incorrectly specified str. Update the annotation to
list[str] to match the actual return value.
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20260223162407.147003-20-wander@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Pyright static analysis reports a "possibly unbound variable" warning
for the loop variable `i` in the `abbreviate_atoms` function. The
variable is accessed after the inner loop terminates to slice the atom
string. While the loop logic currently ensures execution, the analyzer
flags the reliance on the loop variable persisting outside its scope.
Refactor the prefix length calculation into a nested `find_share_length`
helper function. This encapsulates the search logic and uses explicit
return statements, ensuring the length value is strictly defined. This
satisfies the type checker and improves code readability without
altering the runtime behavior.
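
The refactoring pattern can be illustrated with a simplified example. This is a hypothetical sketch, not the actual abbreviate_atoms code: it only computes the prefix two strings share, and every name other than find_share_length is invented for illustration.

```python
def shared_prefix(atom: str, other: str) -> str:
    """Return the prefix that 'atom' and 'other' have in common."""

    def find_share_length() -> int:
        # Every path ends in an explicit return, so the caller never
        # depends on a loop variable surviving past the loop.
        for i, (a, b) in enumerate(zip(atom, other)):
            if a != b:
                return i
        return min(len(atom), len(other))

    return atom[:find_share_length()]
```
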
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20260223162407.147003-19-wander@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
The __get_state_variables() method parses DOT files to identify the
automaton's initial state. If the input file lacks a node with the
required initialization prefix, the initial_state variable is referenced
before assignment, causing an UnboundLocalError or a generic error
during the state removal step.
Initialize the variable explicitly and validate that a start node was
found after parsing. Raise a descriptive AutomataError if the definition
is missing to improve debugging and ensure the automaton is valid.
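
A minimal sketch of the initialize-then-validate pattern, assuming a simplified input of already extracted node names. The AutomataError class here is a local stand-in, and the function name and error text are hypothetical:

```python
class AutomataError(Exception):
    """Local stand-in for the exception named in the text above."""


INIT_MARKER = "__init_"


def find_initial_state(node_names):
    """Return the initial state name, or raise if none is marked."""
    initial_state = None  # explicit initialization avoids UnboundLocalError
    for name in node_names:
        if name.startswith(INIT_MARKER):
            initial_state = name[len(INIT_MARKER):]
    if initial_state is None:
        raise AutomataError("no initial state found in the DOT file")
    return initial_state
```
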
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20260223162407.147003-18-wander@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Add a node_marker class constant to the Automata class to replace the
hardcoded "{node" string literal used throughout the DOT file parsing
logic. This follows the existing pattern established by the init_marker
and invalid_state_str class constants in the same class.
The "{node" string is used as a marker to identify node declaration
lines in DOT files during state variable extraction and cursor
positioning. Extracting it to a named constant improves code
maintainability and makes the marker's purpose explicit.
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20260223162407.147003-17-wander@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
The Variable.expand() method in ltl2ba.py performs contradiction
detection by checking if a negated variable already exists in the
graph node's old set. However, the isinstance check was incorrectly
testing the ASTNode wrapper instead of the wrapped operator, causing
the check to always return False.
The old set contains ASTNode instances which wrap LTL operators via
their .op attribute. The fix changes isinstance(f, NotOp) to
isinstance(f.op, NotOp) to correctly examine the wrapped operator
type. This follows the established pattern used elsewhere in the
file, such as the iteration at lines 572-574 which accesses
o.op.is_temporal() on items from node.old.
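
The difference between the two checks can be seen with stand-in classes. This is a minimal sketch, assuming only that ASTNode wraps an operator in its .op attribute as described above; the class bodies and helper names are hypothetical and much simpler than those in ltl2ba.py.

```python
class NotOp:
    """Stand-in for the negation operator."""

    def __init__(self, child):
        self.child = child


class Variable:
    """Stand-in for a variable operator."""

    def __init__(self, name):
        self.name = name


class ASTNode:
    """Stand-in wrapper holding an operator in .op."""

    def __init__(self, op):
        self.op = op


def has_negation_buggy(old):
    # Tests the wrapper itself, which is never a NotOp.
    return any(isinstance(f, NotOp) for f in old)


def has_negation_fixed(old):
    # Tests the wrapped operator, as the fix does.
    return any(isinstance(f.op, NotOp) for f in old)
```
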
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20260223162407.147003-16-wander@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Add required=True to the monitor subcommand arguments for class, spec,
and monitor_type in rvgen. These arguments are essential for monitor
generation and attempting to run without them would cause AttributeError
exceptions later in the code when the script tries to access them.
Making these arguments explicitly required provides clearer error
messages to users at parse time rather than cryptic exceptions during
execution. This improves the user experience by catching missing
arguments early with helpful usage information.
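
The effect of required=True can be sketched with a standalone parser. The option spellings and the program name below are assumptions made for illustration and may not match the real rvgen command line:

```python
import argparse


def build_parser():
    """Build a toy parser with a 'monitor' subcommand."""
    parser = argparse.ArgumentParser(prog="rvgen-sketch")
    sub = parser.add_subparsers(dest="subcmd", required=True)
    monitor = sub.add_parser("monitor")
    # Hypothetical option names; each one is mandatory.
    monitor.add_argument("--class", dest="monitor_class", required=True)
    monitor.add_argument("--spec", required=True)
    monitor.add_argument("--monitor_type", required=True)
    return parser
```

With this setup, omitting any of the three options makes argparse print a usage message and exit during parse_args(), instead of failing later with an AttributeError.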
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20260223162407.147003-15-wander@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
The __get_main_name() method in the generator module is never called
from anywhere in the codebase. Remove this dead code to improve
maintainability.
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20260223162407.147003-14-wander@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
The sys module was imported in the dot2c frontend script but never
used. This import was likely left over from earlier development or
copied from a template that required sys for exit handling.
Remove the unused import to clean up the code and satisfy linters
that flag unused imports as errors.
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20260223162407.147003-13-wander@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Refactor the DOT file parsing logic in automata.py to use Python's
iterator-based patterns instead of manual cursor indexing. The previous
implementation relied on while loops with explicit cursor management,
which made the code prone to off-by-one errors and would crash on
malformed input files containing empty lines.
The new implementation uses enumerate and itertools.islice to iterate
over lines, eliminating manual cursor tracking. Functions that search
for specific markers now use for loops with early returns and explicit
AutomataError exceptions for missing markers, rather than assuming the
markers exist. Additional bounds checking ensures that split line
arrays have sufficient elements before accessing specific indices,
preventing IndexError exceptions on malformed DOT files.
The matrix creation and event variable extraction methods now use
functional patterns with map combined with itertools.islice,
making the intent clearer while maintaining the same behavior. Minor
improvements include using extend instead of append in a loop, adding
empty file validation, and replacing enumerate with range where the
enumerated value was unused.
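
The iterator-based style can be sketched as follows. This is an illustrative example rather than the code in automata.py: AutomataError is a local stand-in, and the helper names and messages are hypothetical.

```python
from itertools import islice


class AutomataError(Exception):
    """Local stand-in for the exception named in the text above."""


def find_marker(lines, marker, start=0):
    """Return the index of the first line at or after 'start' containing marker."""
    for i, line in enumerate(islice(lines, start, None), start):
        if marker in line:
            return i  # early return replaces manual cursor tracking
    raise AutomataError(f"marker {marker!r} not found")


def second_token(line):
    """Return the second whitespace-separated token, with bounds checking."""
    parts = line.split()
    if len(parts) < 2:
        raise AutomataError(f"malformed line: {line!r}")
    return parts[1]
```

Empty lines are simply skipped by find_marker(), and second_token() reports them as malformed rather than raising IndexError.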
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20260223162407.147003-12-wander@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Replace hardcoded string literal and magic number with a class
constant for the initial state marker in DOT file parsing. The
previous implementation used the magic string "__init_" directly
in the code along with a hardcoded length of 7 for substring
extraction, which made the code less maintainable and harder to
understand.
This change introduces a class constant init_marker to serve as
a single source of truth for the initial state prefix. The code
now uses startswith() for clearer intent and calculates the
substring position dynamically using len(), eliminating the magic
number. If the marker value needs to change in the future, only
the constant definition requires updating rather than multiple
locations in the code.
The refactoring improves code readability and maintainability
while preserving the exact same runtime behavior.
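A minimal sketch of the idea, assuming a class shaped roughly like the parser. The init_marker name and the "__init_" value come from the description above; the strip_init_marker method is hypothetical and exists only for illustration.

```python
class Automata:
    # Single source of truth for the initial-state prefix.
    init_marker = "__init_"

    def strip_init_marker(self, state):
        """Return the state name without the prefix, or None if it is absent.

        Hypothetical method, not part of the real class.
        """
        if state.startswith(self.init_marker):
            # len() replaces the hardcoded 7 used previously.
            return state[len(self.init_marker):]
        return None
```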
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20260223162407.147003-11-wander@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Fix incorrect boolean logic in automata DOT file format validation
that allowed malformed files to pass undetected. The previous
implementation used a logical AND operator where OR was required,
causing the validation to only reject files when both the first
token was not "digraph" AND the second token was not
"state_automaton". This meant a file starting with "digraph" but
having an incorrect second token would incorrectly pass validation.
The corrected logic properly rejects DOT files where either the
first token is not "digraph" or the second token is not
"state_automaton", ensuring that only properly formatted automaton
definition files are accepted for processing. Without this fix,
invalid DOT files could cause downstream parsing failures or
generate incorrect C code for runtime verification monitors.
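The difference between the two conditions can be sketched as follows. Both function names are hypothetical; they only model the rejection test applied to the tokens of the first line.

```python
def rejects_with_and(tokens):
    """Old condition: reject only when both tokens are wrong."""
    return tokens[0] != "digraph" and tokens[1] != "state_automaton"


def rejects_with_or(tokens):
    """Corrected condition: reject when either token is wrong."""
    return tokens[0] != "digraph" or tokens[1] != "state_automaton"


# A header with a valid first token but a wrong second token.
bad_header = "digraph wrong_name {".split()
```

With bad_header, the AND form returns False (the file is accepted) while the OR form returns True (the file is rejected).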
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20260223162407.147003-10-wander@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Fix whitespace violations throughout the rvgen codebase to comply
with PEP 8 style guidelines. The changes address missing whitespace
after commas, around operators, and in collection literals that
were flagged by pycodestyle.
The fixes include adding whitespace after commas in string replace
chains and function arguments, adding whitespace around arithmetic
operators, removing extra whitespace in list comprehensions, and
fixing dictionary literal spacing. These changes improve code
readability and consistency with Python coding standards.
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20260223162407.147003-9-wander@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Fix two typos in the Automata class documentation that have been
present since the initial implementation. The class docstring
read "part it" instead of "parses it". Additionally, a
comment describing transition labels contained the misspelling
"lables" instead of "labels".
Fix a typo in the comment describing the insertion of the initial
state into the states list: "bein og" should be "beginning of".
Fix typo in the module docstring: "Abtract" should be "Abstract".
Fix several occurrences of "automata" where it should be the singular
form "automaton".
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20260223162407.147003-8-wander@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
GCC 15 reports the below false positive '-Wmaybe-uninitialized' warning
in vphn_unpack_associativity() when building the powerpc selftests.
# make -C tools/testing/selftests TARGETS="powerpc"
[...]
  CC       test-vphn
In file included from test-vphn.c:3:
In function ‘vphn_unpack_associativity’,
    inlined from ‘test_one’ at test-vphn.c:371:2,
    inlined from ‘test_vphn’ at test-vphn.c:399:9:
test-vphn.c:10:33: error: ‘be_packed’ may be used uninitialized [-Werror=maybe-uninitialized]
   10 | #define be16_to_cpup(x)         bswap_16(*x)
      |                                 ^~~~~~~~
vphn.c:42:27: note: in expansion of macro ‘be16_to_cpup’
   42 |                 u16 new = be16_to_cpup(field++);
      |                           ^~~~~~~~~~~~
In file included from test-vphn.c:19:
vphn.c: In function ‘test_vphn’:
vphn.c:27:16: note: ‘be_packed’ declared here
   27 |         __be64 be_packed[VPHN_REGISTER_COUNT];
      |                ^~~~~~~~~
cc1: all warnings being treated as errors
When vphn_unpack_associativity() is called from hcall_vphn() in the
kernel, the error is not seen while building vphn.c during kernel
compilation. This is because the top-level Makefile always passes the
'-fno-strict-aliasing' flag.
The issue here is that GCC 15 emits '-Wmaybe-uninitialized' due to type
punning between __be64[] and __be16* when accessing the buffer via
be16_to_cpup(). The underlying object is fully initialized but GCC 15
fails to track the aliasing due to the strict aliasing violation here.
Please refer to [1] and [2]. This results in a false positive warning which
is promoted to an error under '-Werror'. This problem is not seen when
the compilation is performed with GCC 13 and 14. An issue [1] has also
been created on GCC bugzilla.
The selftest compiles fine with '-fno-strict-aliasing'. Since this GCC
flag is used to compile vphn.c in kernel too, the same flag should be
used to build vphn tests when compiling vphn.c in the selftest as well.
Fix this by including '-fno-strict-aliasing' during vphn.c compilation
in the selftest. This keeps the build working while limiting the scope
of the suppression to building vphn tests.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124427
[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99768
Fixes: 58dae82843 ("selftests/powerpc: Add test for VPHN")
Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260313165426.43259-1-amachhiw@linux.ibm.com
As it is not really used when compiling anything, just being parsed to
collect number->string tables for 'perf trace'.
$ git grep fadvise.h tools/
tools/perf/Makefile.perf:$(fadvise_advice_array): $(beauty_uapi_linux_dir)/fadvise.h $(fadvise_advice_tbl)
tools/perf/check-headers.sh: "include/uapi/linux/fadvise.h"
tools/perf/trace/beauty/fadvise.sh:grep -E $regex ${header_dir}/fadvise.h | \
tools/perf/trace/beauty/fadvise.sh:# tools/include/uapi/linux/fadvise.h for details.
$
Link: https://lore.kernel.org/r/CAP-5=fVBNQVF8k3JUQjH1nkP69ZVp8BqP+uwygcx=xO0zC4xrg@mail.gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
As it is used only to parse ioctl numbers, not to build perf, and so far
no other tool living in tools/ uses it, move it to the directory where
just the tools/perf/trace/beauty/ scripts can use it to generate tables
used by perf. This cleans up tools/include/ so that it is used just for
building tools, giving access to things available in the kernel but not
yet in the system headers.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Commit 3bc753c06d ("kbuild: treat char as always unsigned") made
chars unsigned by default in the Linux kernel. To avoid similar kinds
of bugs and warnings, make unsigned chars the default for the perf tool.
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Test authors need to know about variants. Existing tests don't use
them because variants are relatively recent.
Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260331001930.3411279-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The str* family of fortified functions has used member-sized limits
for a while now, so the FORTIFY_STR_OBJECT test is redundant with
FORTIFY_STR_MEMBER. While here, replace the strncpy() use with strscpy(),
as strncpy() is being removed.
Link: https://patch.msgid.link/20260324020726.work.624-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
When running veristat across many BPF objects, expected load failures
produce noisy stderr output that obscures actual issues. Gate these
diagnostic messages behind --verbose.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260331172634.57402-2-mykyta.yatsenko5@gmail.com
Add a test that accesses the bpf_ringbuf through the map pointer.
The test reads "consumer_pos" and "producer_pos". We reserve 128 bytes
in the ringbuf to test the producer_pos, which should then be
"128 + BPF_RINGBUF_HDR_SZ".
This is helpful when evaluating the usage of the ringbuf in a bpf prog
by means of the consumer and producer positions.
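The expected value can be worked out with a short sketch. BPF_RINGBUF_HDR_SZ is 8 in the uapi header; the rounding of each record to a multiple of 8 bytes and the helper name are assumptions made for this illustration, not something stated in this change.

```python
# Header size as defined in include/uapi/linux/bpf.h.
BPF_RINGBUF_HDR_SZ = 8


def expected_producer_pos(reserved_bytes, start=0):
    """Producer position after one reservation (hypothetical helper).

    Assumes each record is padded to a multiple of 8 bytes.
    """
    record = reserved_bytes + BPF_RINGBUF_HDR_SZ
    padded = (record + 7) // 8 * 8
    return start + padded
```

For a 128-byte reservation on an empty ring buffer this gives 136, which matches "128 + BPF_RINGBUF_HDR_SZ".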
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/bpf/20260331070434.10037-1-dongml2@chinatelecom.cn
Clarify bpf_ringbuf_discard() documentation for BPF_RB_NO_WAKEUP.
Discarded ring buffer records are still left in the ring buffer and are
only skipped when user space consumes them. This can matter when
BPF_RB_NO_WAKEUP is used: a later submit relying on adaptive wakeup
might not wake the consumer, because the discarded record still needs to
be consumed first.
Scenario:
epoll_wait(rb_fd); // blocks
rec = bpf_ringbuf_reserve(&rb, ...);
bpf_ringbuf_discard(rec, BPF_RB_NO_WAKEUP);
rec = bpf_ringbuf_reserve(&rb, ...);
bpf_ringbuf_submit(rec, 0); // valid record, but no wakeup
Document this in bpf_ringbuf_discard() to make the interaction between
discarded records, user-space consumption, and adaptive wakeups explicit.
Reported-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260331130612.3762433-1-eyal.birger@gmail.com
----
v2: adapt wording per feedback from Andrii.
If CONFIG_HOTPLUG_CPU is disabled, /sys/devices/system/cpu/cpu*
directories are still populated, so the test fails to correctly detect
that CPU hotplug is not supported.
Fix this by checking for the presence of 'online' files in those
directories instead. The 'online' node is created for the given CPU if
and only if this CPU supports hotplug. So if none of the CPUs have
'online' nodes, it means CPU hotplug is not supported.
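The detection logic amounts to the sketch below, written in Python for illustration (the selftest itself is not Python). The function name and the configurable root directory are assumptions added so the check can be tried against a fake sysfs tree.

```python
import glob
import os


def hotplug_supported(sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Return True if at least one cpuN directory has an 'online' file."""
    # cpu[0-9]* matches cpu0, cpu1, ... but not cpufreq or cpuidle.
    pattern = os.path.join(sysfs_cpu_dir, "cpu[0-9]*", "online")
    return bool(glob.glob(pattern))
```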
Signed-off-by: Dmytro Maluka <dmaluka@chromium.org>
Link: https://lore.kernel.org/r/20260319153825.2813576-1-dmaluka@chromium.org
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
- Fix SCX_KICK_WAIT deadlock where multiple CPUs waiting for each other in
hardirq context form a cycle. Move the wait to a balance callback which
can drop the rq lock and process IPIs.
- Fix inconsistent NUMA node lookup in scx_select_cpu_dfl() where the
waker_node used cpu_to_node() while prev_cpu used
scx_cpu_node_if_enabled(), leading to undefined behavior when per-node
idle tracking is disabled.
Merge tag 'sched_ext-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix SCX_KICK_WAIT deadlock where multiple CPUs waiting for each other
in hardirq context form a cycle. Move the wait to a balance callback
which can drop the rq lock and process IPIs.
- Fix inconsistent NUMA node lookup in scx_select_cpu_dfl() where
the waker_node used cpu_to_node() while prev_cpu used
scx_cpu_node_if_enabled(), leading to undefined behavior when
per-node idle tracking is disabled.
* tag 'sched_ext-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
selftests/sched_ext: Add cyclic SCX_KICK_WAIT stress test
sched_ext: Fix SCX_KICK_WAIT deadlock by deferring wait to balance callback
sched_ext: Fix inconsistent NUMA node lookup in scx_select_cpu_dfl()
Commit 85506aca2e ("selftests/mqueue: Set timeout to 180 seconds")
intended to increase the timeout for mq_perf_tests from the default
kselftest limit of 45 seconds to 180 seconds.
Unfortunately, the file storing this information was incorrectly named
`setting` instead of `settings`, so the kselftest runner did not pick
up the limit and kept using the default 45-second limit.
Fix this by renaming it to `settings` to ensure that the kselftest
runner uses the increased timeout of 180 seconds for this test.
Fixes: 85506aca2e ("selftests/mqueue: Set timeout to 180 seconds")
Cc: <stable@vger.kernel.org> # 5.10.y
Signed-off-by: Simon Liebold <simonlie@amazon.de>
Link: https://lore.kernel.org/r/20260312140200.2224850-1-simonlie@amazon.de
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
- Fix cgroup rmdir racing with dying tasks. Deferred task cgroup unlink
introduced a window where cgroup.procs is empty but the cgroup is still
populated, causing rmdir to fail with -EBUSY and selftest failures. Make
rmdir wait for dying tasks to fully leave and fix selftests to not depend
on synchronous populated updates.
- Fix cpuset v1 task migration failure from empty cpusets under strict
security policies. When CPU hotplug removes the last CPU from a v1
cpuset, tasks must be migrated to an ancestor without a
security_task_setscheduler() check that would block the migration.
Merge tag 'cgroup-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- Fix cgroup rmdir racing with dying tasks.
Deferred task cgroup unlink introduced a window where cgroup.procs
is empty but the cgroup is still populated, causing rmdir to fail
with -EBUSY and selftest failures.
Make rmdir wait for dying tasks to fully leave and fix selftests to
not depend on synchronous populated updates.
- Fix cpuset v1 task migration failure from empty cpusets under strict
security policies.
When CPU hotplug removes the last CPU from a v1 cpuset, tasks must be
migrated to an ancestor without a security_task_setscheduler() check
that would block the migration.
* tag 'cgroup-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Skip security check for hotplug induced v1 task migration
cgroup/cpuset: Simplify setsched decision check in task iteration loop of cpuset_can_attach()
cgroup: Fix cgroup_drain_dying() testing the wrong condition
selftests/cgroup: Don't require synchronous populated update on task exit
cgroup: Wait for dying tasks to leave on rmdir
Instead of manually writing ktap messages, we should use the formal
ktap helpers in runner.sh. Brendan did some work in commit d9e6269e33
("selftests/run_kselftest.sh: exit with error if tests fail") to make
run_kselftest.sh exit with the correct return value. However, the output
does not include the total results, such as how many tests passed or failed.
Let’s convert all manually printed messages in runner.sh to use the
formal ktap helpers. Here is what I changed:
1. Move TAP header from runner.sh to run_kselftest.sh, since
run_kselftest.sh is the only caller of run_many().
2. In run_kselftest.sh, call run_many() in main process to count the
pass/fail numbers.
3. In run_kselftest.sh, do not generate kselftest_failures_file. Just
use ktap_print_totals to report the result.
4. In runner.sh run_one(), get the return value and use ktap helpers for
all pass/fail reporting. This allows counting pass/fail numbers in the
main process.
5. In runner.sh run_in_netns(), also return the correct rc, so we can
count results during wait.
After the change, the printed result looks like:
not ok 4 4 selftests: clone3: clone3_cap_checkpoint_restore # exit=1
# Totals: pass:3 fail:1 xfail:0 xpass:0 skip:0 error:0
]# echo $?
1
Fixed change log commit description errors and long lines:
Shuah Khan <skhan@linuxfoundation.org>
Tested-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Link: https://lore.kernel.org/r/20260225010833.11301-1-liuhangbin@gmail.com
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Users may accidentally use the ksft_test_result_*() functions in
their harness tests. If ksft_finished() is not used, the results
reported in this way are silently ignored.
Detect such false-positive cases and fail the test.
A more correct test would be to reject *any* usage of the ksft APIs but
that would force code churn on users.
Correct usages, which do use ksft_finished(), will not trigger this
validation as the test will exit before it.
Reported-by: Yuwen Chen <ywen.chen@foxmail.com>
Link: https://lore.kernel.org/lkml/tencent_56D79AF3D23CEFAF882E83A2196EC1F12107@qq.com/
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Link: https://lore.kernel.org/r/20260302-kselftest-harness-v2-4-3143aa41d989@linutronix.de
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
The harness treats these tests as successful, as does pytest.
Align kselftest.h to the rest of the ecosystem.
None of the Linux selftests seem to actually use this anyway.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Link: https://lore.kernel.org/r/20260302-kselftest-harness-v2-1-3143aa41d989@linutronix.de
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Since commit a0aa283c53 ("selftest/ftrace: Generalise ftracetest to
use with RV") moved the default LOG_DIR setting after the --logdir
option parser, the default overwrites the user-given LOG_DIR.
Fix it by checking the --logdir option parameter when setting the new
default LOG_DIR with the new TOP_DIR.
Fixes: a0aa283c53 ("selftest/ftrace: Generalise ftracetest to use with RV")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Tested-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/177071725191.2369897.14781037901532893911.stgit@mhiramat.tok.corp.google.com
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Replace manual file open and close operations with context managers
throughout the rvgen codebase. The previous implementation used
explicit open() and close() calls, which could lead to resource leaks
if exceptions occurred between opening and closing the file handles.
This change affects three file operations: reading DOT specification
files in the automata parser, reading template files in the generator
base class, and writing generated monitor files. All now use the with
statement to ensure proper resource cleanup even in error conditions.
Context managers provide automatic cleanup through the with statement,
which guarantees that file handles are closed when the with block
exits regardless of whether an exception occurred. This follows PEP
343 recommendations and is the standard Python idiom for resource
management. The change also reduces code verbosity while improving
safety and maintainability.
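The pattern looks roughly like the sketch below. The function names are hypothetical stand-ins for the template reader and the monitor file writer, and round_trip is only there to exercise them in a temporary directory.

```python
import os
import tempfile


def read_template(path):
    """Read a whole file; the handle is closed even if read() raises."""
    with open(path, "r") as template:
        return template.read()


def write_monitor_file(path, content):
    """Write a whole file; the handle is closed even if write() raises."""
    with open(path, "w") as output:
        output.write(content)


def round_trip(content):
    """Write `content` to a temporary file and read it back."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "monitor.c")
        write_monitor_file(path, content)
        return read_template(path)
```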
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20260223162407.147003-7-wander@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Remove unnecessary semicolons from Python code in the rvgen tool.
Python does not require semicolons to terminate statements, and
their presence goes against PEP 8 style guidelines. These semicolons
were likely added out of habit from C-style languages.
This cleanup improves consistency with Python coding standards and
aligns with the recent improvements to remove other Python
anti-patterns from the codebase.
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20260223162407.147003-6-wander@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Replace all direct calls to the __len__() dunder method with the
idiomatic len() built-in function across the rvgen codebase. This
change eliminates a Python anti-pattern where dunder methods are
called directly instead of using their corresponding built-in
functions.
The changes affect nine instances across two files. In automata.py,
the empty string check is further improved by using truthiness
testing instead of explicit length comparison. In dot2c.py, all
length checks in the get_minimun_type, __get_max_strlen_of_states,
and get_aut_init_function methods now use the standard len()
function. Additionally, spacing around keyword arguments has been
corrected to follow PEP 8 guidelines.
Direct calls to dunder methods like __len__() are discouraged in
Python because they bypass the language's abstraction layer and
reduce code readability. Using len() provides the same functionality
while adhering to Python community standards and making the code more
familiar to Python developers.
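For illustration only, the sketch below uses a hypothetical StateList container to show that len() dispatches to __len__(), and that an empty string can be tested by truthiness instead of comparing its length to zero.

```python
class StateList:
    """Hypothetical container, used only to demonstrate len()."""

    def __init__(self, states):
        self._states = list(states)

    def __len__(self):
        return len(self._states)


states = StateList(["idle", "running"])
count = len(states)  # preferred over states.__len__()


def is_blank(line):
    """Truthiness test instead of an explicit length comparison."""
    return not line
```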
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20260223162407.147003-5-wander@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Replace all instances of percent-style string formatting with
f-strings across the rvgen codebase. This modernizes the string
formatting to use Python 3.6+ features, providing clearer and more
maintainable code while improving runtime performance.
The conversion handles all formatting cases including simple variable
substitution, multi-variable formatting, and complex format specifiers.
Dynamic width formatting is converted from "%*s" to "{var:>{width}}"
using proper alignment syntax. Template strings for generated C code
properly escape braces using double-brace syntax to produce literal
braces in the output.
F-strings provide approximately 2x performance improvement over percent
formatting and are the recommended approach in modern Python.
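The two less obvious conversions can be sketched as follows. The variable names and the generated C fragment are made up for the example and are not taken from rvgen.

```python
name = "idle"
width = 8

# Dynamic width: "%*s" right-aligns within `width` characters.
old_style = "%*s" % (width, name)
new_style = f"{name:>{width}}"

# Literal braces in generated C code are written as double braces.
enum_name = "states"
c_snippet = f"enum {enum_name} {{\n\t{name},\n}};"
```

Both old_style and new_style produce the same right-aligned string, and c_snippet contains single braces in its output.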
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20260223162407.147003-4-wander@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Remove bare except clauses from the generator module that were
catching all exceptions including KeyboardInterrupt and SystemExit.
This follows the same exception handling improvements made in the
previous AutomataError commit and addresses PEP 8 violations.
The bare except clause in __create_directory was silently catching
and ignoring all errors after printing a message, which could mask
serious issues. For __write_file, the bare except created a critical
bug where the file variable could remain undefined if open() failed,
causing a NameError when attempting to write to or close the file.
These methods now let OSError propagate naturally, allowing callers
to handle file system errors appropriately. This provides clearer
error reporting and allows Python's exception handling to show
complete stack traces with proper error types and locations.
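The __write_file problem can be reproduced with a sketch like the one below. Both functions are simplified, hypothetical versions of the method. Note that Python reports the unassigned local as UnboundLocalError, which is a subclass of NameError.

```python
def write_file_buggy(path, content):
    """Old pattern: the bare except hides the open() failure."""
    try:
        file = open(path, "w")
    except:  # noqa: E722 - deliberately bare, to show the bug
        print(f"Fail to create file: {path}")
    # If open() failed, `file` was never assigned and this line raises.
    file.write(content)
    file.close()


def write_file(path, content):
    """New pattern: let OSError propagate to the caller."""
    with open(path, "w") as file:
        file.write(content)
```

Calling write_file() with a path in a directory that does not exist raises an OSError that names the path, whereas the old version fails later with an error that no longer points at the real cause.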
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20260223162407.147003-3-wander@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Replace the generic except Exception block with a custom AutomataError
class that inherits from Exception. This provides more precise exception
handling for automata parsing and validation errors while avoiding
overly broad exception catches that could mask programming errors like
SyntaxError or TypeError.
The AutomataError class is raised when DOT file processing fails due to
invalid format, I/O errors, or malformed automaton definitions. The
main entry point catches this specific exception and provides a
user-friendly error message to stderr before exiting.
Also, replace the generic exceptions raised in HA and LTL with
AutomataError.
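A minimal sketch of the resulting structure, assuming a parser function and an entry point. Only the AutomataError name comes from this change; parse_dot and main are hypothetical and far simpler than the real code.

```python
import sys


class AutomataError(Exception):
    """Raised for invalid or unreadable automaton definitions."""


def parse_dot(text):
    """Hypothetical parser: reject empty input, return the tokens."""
    if not text.strip():
        raise AutomataError("empty DOT file")
    return text.split()


def main(text):
    """Catch only AutomataError; other exceptions still propagate."""
    try:
        parse_dot(text)
    except AutomataError as err:
        print(f"error: {err}", file=sys.stderr)
        return 1
    return 0
```

Because only AutomataError is caught, a TypeError caused by a programming mistake still surfaces with a full traceback.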
Co-authored-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20260223162407.147003-2-wander@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Add the deadline monitors collection to validate the deadline scheduler,
both for deadline tasks and servers.
The currently implemented monitors are:
* nomiss:
validate dl entities run to completion before their deadline
Reviewed-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20260330111010.153663-13-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
The special per-object monitor type was just introduced in RV. It
requires the user to define some functions and types specific to the
object.
Adapt rvgen to add stub definitions for the monitor_target type and
other modifications required to create per-object monitors.
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20260330111010.153663-10-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
The opid monitor validates that wakeup and need_resched events only
occur with interrupts and preemption disabled by following the
preemptirq tracepoints.
As reported in [1], those tracepoints might be inaccurate in some
situations (e.g. NMIs).
Since the monitor doesn't validate other ordering properties, remove the
dependency on preemptirq tracepoints and convert the monitor to a hybrid
automaton to validate the constraint during event handling.
This makes the monitor more robust by also removing the workaround for
interrupts missing the preemption tracepoints, which was working on
PREEMPT_RT only, and allows the monitor to be built on kernels without
the preemptirq tracepoints.
[1] - https://lore.kernel.org/lkml/20250625120823.60600-1-gmonaco@redhat.com
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20260330111010.153663-8-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Add a sample monitor to showcase hybrid/timed automata.
The stall monitor identifies tasks stalled for longer than a threshold
and reacts when that happens.
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20260330111010.153663-7-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Add the possibility to parse dot files as hybrid automata and generate
the necessary code from rvgen.
Hybrid automata are very similar to deterministic ones and most
functionality is shared. The dot files also include constraints together
with event names (separated by ;) and state names (separated by \n).
The tool can now generate the appropriate code to validate constraints
at runtime according to the dot specification.
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20260330111010.153663-5-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Currently the automata parser assumes event strings don't contain any
spaces. This holds true for event names, but can be a wrong assumption
if we want to store other information in the event strings (e.g.
constraints for hybrid automata).
Adapt the parser logic to allow spaces in the event strings.
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20260330111010.153663-4-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Verify that bpf_skb_adjust_room() clears the routing dst even when
the encap L3 protocol matches the original packet (e.g. IPIP).
The dst selected for the inner packet is not valid for the
encapsulated result; a stale dst could lead to misrouting.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20260329180428.2657785-2-kuba@kernel.org
This function was a masterclass in bad naming, for various historical
reasons.
It claimed to be a non-cached user copy. It is literally _neither_ of
those things. It's a specialty memory copy routine that uses
non-temporal stores for the destination (but not the source), and that
does exception handling for both source and destination accesses.
Also note that while it works for unaligned targets, any unaligned parts
(whether at beginning or end) will not use non-temporal stores, since
only words and quadwords can be non-temporal on x86.
The exception handling means that it _can_ be used for user space
accesses, but not on its own - it needs all the normal "start user space
access" logic around it.
But typically the user space access would be the source, not the
non-temporal destination. That was the original intention of this,
where the destination was some fragile persistent memory target that
needed non-temporal stores in order to catch machine check exceptions
synchronously and deal with them gracefully.
Thus that non-descriptive name: one use case was to copy from user space
into a non-cached kernel buffer. However, the existing users are a mix
of that intended use-case, and a couple of random drivers that just did
this as a performance tweak.
Some of those random drivers then actively misused the user copying
version (with STAC/CLAC and all) to do kernel copies without ever even
caring about the exception handling, _just_ for the non-temporal
destination.
Rename it as a first small step to actually make it halfway sane, and
change the prototype to be more normal: it doesn't take a user pointer
unless the caller has done the proper conversion, and the argument size
is the full size_t (it still won't actually copy more than 4GB in one
go, but there's also no reason to silently truncate the size argument in
the caller).
Finally, use this now sanely named function in the NTB code, which
mis-used a user copy version (with STAC/CLAC and all) of this interface
despite it not actually being a user copy at all.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Fix build failure when libbpf does not exist
RTLA supports building without BPF libraries, but a recent change
added a libbpf.h include outside of the BPF protection which caused
build failures when libbpf was not installed.
Merge tag 'trace-rtla-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull rtla build fix from Steven Rostedt:
- Fix build failure when libbpf does not exist
RTLA supports building without BPF libraries, but a recent change
added a libbpf.h include outside of the BPF protection which caused
build failures when libbpf was not installed.
* tag 'trace-rtla-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rtla: Fix build without libbpf header
Add new rcutorture config NOCB02 that enables rcu_nocb_poll boot
parameter combined with CONFIG_RCU_NOCB_CPU to exercise the polling
mode code paths in the NOCB implementation.
This config exercises poll-mode paths not covered by other configs,
where callback invocation uses active polling instead of kthread
wakeups.
This config is not added to CFLIST to avoid increasing the default
test duration; it can be run explicitly when poll-mode testing
is needed.
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Add new rcutorture config NOCB01 that enables CONFIG_RCU_LAZY combined
with CONFIG_RCU_NOCB_CPU to exercise the lazy callback code paths in
the NOCB implementation.
This config exercises lazy callback paths not covered by other configs,
including lazy-only wake and lazy defer logic.
This config is not added to CFLIST to avoid increasing the default
test duration; it can be run explicitly when lazy callback testing
is needed.
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
The torture_shutdown_init() function spawns a shutdown kthread in
a manner very similar to that implemented by rcu_scale_shutdown().
This commit therefore re-implements rcu_scale_shutdown() in terms of
torture_shutdown_init().
This patch was generated by Claude given as input the patch making the
same transformation of ref_scale_shutdown().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
The torture_shutdown_init() function spawns a shutdown kthread in
a manner very similar to that implemented by ref_scale_shutdown().
This commit therefore re-implements ref_scale_shutdown() in terms of
torture_shutdown_init().
The initial draft of this patch was generated by version 2.1.16 of the
Claude AI/LLM, but trained and configured for use by my employer, and
prompted to refer to Linux-kernel source code. This initial draft failed
to provide a forward reference to ref_scale_cleanup(), passed zero to
torture_shutdown_init() for an unwelcome insta-shutdown, and failed to
pass the kvm.sh --duration argument in as a refscale module parameter.
On the other hand, it did catch the need to NULL main_task on the
post-test self-shutdown code path, which I might well have forgotten
to do.
This version of the patch fixes those problems, and in fact very little
of the initial draft remains.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
This commit switches from "-eq" to "=" to handle the non-numeric
comparisons in srcu_lockdep.sh. While in the area, adjust SRCU flavor
to improve coverage.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
If a type of torture test lacks a recheck file, a bash diagnostic is
printed, which looks like a torture-test bug. This commit gets rid of
this false positive by explicitly checking for the file, invoking it if
it exists, otherwise printing an informative non-diagnostic message.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
This commit labels "QEMU killed" lines so that they will be picked up
by torture.sh processing.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
The kvm-series.sh script is an order-of-magnitude optimization of
kvm-check-branches.sh, so remove the old script.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
This commit adds a trivial textbook implementation of preemptible RCU
to rcutorture ("torture_type=trivial-preempt"), similar in spirit to the
existing "torture_type=trivial" textbook implementation of non-preemptible
RCU. Neither trivial RCU implementation has any value for production use;
both are intended only to keep Paul honest in his introductory writings
and presentations.
[ paulmck: Apply kernel test robot feedback. ]
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Conflict in kernel/sched/ext.c init_sched_ext_class() between:
415cb193bb ("sched_ext: Fix SCX_KICK_WAIT deadlock by deferring wait
to balance callback")
which adds cpus_to_sync cpumask allocation, and:
84b1a0ea0b ("sched_ext: Implement scx_bpf_dsq_reenq() for user DSQs")
8c1b9453fd ("sched_ext: Convert deferred_reenq_locals from llist to
regular list")
which add deferred_reenq init code at the same location. Both are
independent additions. Include both.
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a test that creates a 3-CPU kick_wait cycle (A->B->C->A). A BPF
scheduler kicks the next CPU in the ring with SCX_KICK_WAIT on every
enqueue while userspace workers generate continuous scheduling churn via
sched_yield(). Without the preceding fix, this hangs the machine within seconds.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Checking for regressions in kernel-doc can be hard. Add a helper
tool to make this task easier.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <24b3116a78348b13a74d1ff5e141160ef9705dd3.1774551940.git.mchehab+huawei@kernel.org>
rtla supports building without libbpf. However, BPF actions
patchset [1] adds an include of bpf/libbpf.h into timerlat_bpf.h,
which breaks build on systems that don't have libbpf headers
installed.
This is a leftover from a draft version of the patchset where
timerlat_bpf_set_action() (which takes a struct bpf_program * argument)
was defined in the header. timerlat_bpf.c already includes bpf/libbpf.h
via timerlat.skel.h when libbpf is present.
Remove the redundant include to fix build on systems without libbpf
headers.
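The guarding pattern involved can be sketched as follows. This is only
an illustration, not the actual rtla header: the macro name
HAVE_BPF_SKEL_SKETCH and the function name are hypothetical, and the
point is simply that nothing from libbpf is referenced outside the
build-time guard.

```c
/*
 * Hypothetical header sketch: libbpf is only referenced when the build
 * system defines HAVE_BPF_SKEL_SKETCH (an assumed macro name).
 */
#ifdef HAVE_BPF_SKEL_SKETCH
#include <bpf/libbpf.h>	/* only reached when libbpf headers are installed */

static inline int timerlat_bpf_available_sketch(void)
{
	return 1;
}
#else
/* Stub for builds without libbpf: no libbpf include is needed here. */
static inline int timerlat_bpf_available_sketch(void)
{
	return 0;
}
#endif
```

With the include kept inside the guard, a build that does not define
the macro never looks for bpf/libbpf.h.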
[1] https://lore.kernel.org/linux-trace-kernel/20251126144205.331954-1-tglozar@redhat.com/T/
Cc: John Kacur <jkacur@redhat.com>
Cc: Luis Goncalves <lgoncalv@redhat.com>
Cc: Crystal Wood <crwood@redhat.com>
Cc: Costa Shulyupin <costa.shul@redhat.com>
Link: https://patch.msgid.link/20260330091207.16184-1-tglozar@redhat.com
Reported-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Closes: https://lore.kernel.org/linux-trace-kernel/20260329122202.65a8b575@robin/
Fixes: 8cd0f08ac7 ("rtla/timerlat: Support tail call from BPF program")
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Reviewed-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
As reported by Jacob, there are problems when KBUILD_VERBOSE is
set in the environment.
Fix it in both kernel-doc and sphinx-build-wrapper.
Reported-by: Jacob Keller <jacob.e.keller@intel.com>
Closes: https://lore.kernel.org/linux-doc/9367d899-53af-4d9c-9320-22fc4dbadca5@intel.com/
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Tested-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <7a99788db75630fb14828d612c0fd77c45ec1891.1774591065.git.mchehab+huawei@kernel.org>
Add a target module and livepatch pair that verify module function
patching via a proc entry. Two test cases cover both the
klp_enable_patch path (target loaded before livepatch) and the
klp_module_coming path (livepatch loaded before target).
Signed-off-by: Pablo Alessandro Santos Hugen <phugen@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Link: https://patch.msgid.link/20260320201135.1203992-1-phugen@redhat.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Please consider pulling these changes from the signed vfs-7.0-rc6.fixes tag.
Thanks!
Christian
Merge tag 'vfs-7.0-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix netfs_limit_iter() hitting BUG() when an ITER_KVEC iterator
reaches it via core dump writes to 9P filesystems. Add ITER_KVEC
handling following the same pattern as the existing ITER_BVEC code.
- Fix a NULL pointer dereference in the netfs unbuffered write retry
path when the filesystem (e.g., 9P) doesn't set the prepare_write
operation.
- Clear I_DIRTY_TIME in sync_lazytime for filesystems implementing
->sync_lazytime. Without this the flag stays set and may cause
additional unnecessary calls during inode deactivation.
- Increase tmpfs size in mount_setattr selftests. A recent commit
bumped the ext4 image size to 2 GB but didn't adjust the tmpfs
backing store, so mkfs.ext4 fails with ENOSPC writing metadata.
- Fix an invalid folio access in iomap when i_blkbits matches the folio
size but differs from the I/O granularity. The cur_folio pointer
would not get invalidated and iomap_read_end() would still be called
on it despite the IO helper owning it.
- Fix hash_name() docstring.
- Fix read abandonment during netfs retry where the subreq variable
used for abandonment could be uninitialized on the first pass or
point to a deleted subrequest on later passes.
- Don't block sync for filesystems with no data integrity guarantees.
Add a SB_I_NO_DATA_INTEGRITY superblock flag replacing the per-inode
AS_NO_DATA_INTEGRITY mapping flag so sync kicks off writeback but
doesn't wait for flusher threads. This fixes a suspend-to-RAM hang on
fuse-overlayfs where the flusher thread blocks when the fuse daemon
is frozen.
- Fix a lockdep splat in iomap when reads fail. iomap_read_end_io()
invokes fserror_report() which calls igrab() taking i_lock in hardirq
context while i_lock is normally held with interrupts enabled. Kick
failed read handling to a workqueue.
- Remove the redundant netfs_io_stream::front member and use
stream->subrequests.next instead, fixing a potential issue in the
direct write code path.
* tag 'vfs-7.0-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
netfs: Fix the handling of stream->front by removing it
iomap: fix lockdep complaint when reads fail
writeback: don't block sync for filesystems with no data integrity guarantees
netfs: Fix read abandonment during retry
vfs: fix docstring of hash_name()
iomap: fix invalid folio access when i_blkbits differs from I/O granularity
selftests/mount_setattr: increase tmpfs size for idmapped mount tests
fs: clear I_DIRTY_TIME in sync_lazytime
netfs: Fix NULL pointer dereference in netfs_unbuffered_write() on retry
netfs: Fix kernel BUG in netfs_limit_iter() for ITER_KVEC iterators
This test loads xdp_metadata.bpf which calls bpf_xdp_metadata_rx_hash() on
incoming packets. The metadata from that packet is then sent to a BPF
map for validation. It borrows structure from xdp.py, reusing common
functions.
The test checks the device's xdp-rx-metadata-features via netlink
before running and skips on devices that do not advertise hash support.
This can be run on veth devices as well as real hardware.
The test is fairly simple and just verifies that a TCP or UDP packet can be
identified as an L4 flow. This minimal test also passes if run on a veth
device.
Signed-off-by: Chris J Arges <carges@cloudflare.com>
Link: https://patch.msgid.link/20260325201139.2501937-7-carges@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This moves a few functions that can be useful to other Python programs
that manipulate XDP programs. It also updates xdp.py to use the
moved functions.
Signed-off-by: Chris J Arges <carges@cloudflare.com>
Link: https://patch.msgid.link/20260325201139.2501937-6-carges@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a few more alu32 shift tests using div-by-zero on provably dead paths
to check both verifier correctness and JIT translation (that is, runtime)
correctness.
If the verifier mistracks the result, it rejects due to the div by 0;
if the JIT computes a wrong value, then runtime hits the dead path and
retval changes.
# LDLIBS=-static PKG_CONFIG='pkg-config --static' ./vmtest.sh -- ./test_progs -t verifier_subreg
[...]
#644/76 verifier_subreg/arsh32_imm1_value:OK
#644/77 verifier_subreg/lsh32_reg0_zero_extend_check:OK
#644/78 verifier_subreg/rsh32_reg0_zero_extend_check:OK
#644/79 verifier_subreg/arsh32_reg0_zero_extend_check:OK
#644/80 verifier_subreg/lsh32_imm31_value:OK
#644/81 verifier_subreg/rsh32_imm31_value:OK
#644/82 verifier_subreg/arsh32_imm31_value:OK
#644/83 verifier_subreg/lsh32_unknown_precise_bounds:OK
#644/84 verifier_subreg/rsh32_unknown_bounds:OK
#644 verifier_subreg:OK
Summary: 1/84 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260327220629.343327-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Update selftests to use the new non-_impl kfuncs marked with
KF_IMPLICIT_ARGS by removing redundant declarations and macros from
bpf_experimental.h (the new kfuncs are present in the vmlinux.h) and
updating relevant callsites.
Fix spin_lock verifier-log matching for lock_id_kptr_preserve by
accepting variable instruction numbers. The calls to kfuncs with
implicit arguments do not have register moves (e.g. r5 = 0)
corresponding to dummy arguments anymore, so the order of instructions
has shifted.
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260327203241.3365046-2-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The following kfuncs currently accept void *meta__ign argument:
* bpf_obj_new_impl
* bpf_obj_drop_impl
* bpf_percpu_obj_new_impl
* bpf_percpu_obj_drop_impl
* bpf_refcount_acquire_impl
* bpf_list_push_back_impl
* bpf_list_push_front_impl
* bpf_rbtree_add_impl
The __ign suffix is an indicator for the verifier to skip the argument
in check_kfunc_args(). Then, in fixup_kfunc_call() the verifier may
set the value of this argument to struct btf_struct_meta *
kptr_struct_meta from insn_aux_data.
BPF programs must pass a dummy NULL value when calling these kfuncs.
Additionally, the list and rbtree _impl kfuncs also accept an implicit
u64 argument, which doesn't require __ign suffix because it's a
scalar, and BPF programs explicitly pass 0.
Add new kfuncs with KF_IMPLICIT_ARGS [1], that correspond to each
_impl kfunc accepting meta__ign. The existing _impl kfuncs remain
unchanged for backwards compatibility.
To support this, add "btf_struct_meta" to the list of recognized
implicit argument types in resolve_btfids.
Implement is_kfunc_arg_implicit() in the verifier, which determines
implicit args by inspecting the non-_impl BTF prototype of the
kfunc.
Update the special_kfunc_list in the verifier and relevant checks to
support both the old _impl and the new KF_IMPLICIT_ARGS variants of
btf_struct_meta users.
[1] https://lore.kernel.org/bpf/20260120222638.3976562-1-ihor.solodrai@linux.dev/
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260327203241.3365046-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Update kcpuid's CSV to version 3.0, as generated by x86-cpuid-db.
Summary of the v2.5 changes:
- Reduce the verbosity of leaf and bitfields descriptions, as formerly
requested by Boris.
- Leaf 0x8000000a: Add Page Modification Logging (PML) bit.
Summary of the v3.0 changes:
- Leaf 0x23: Introduce subleaf 2, Auto Counter Reload (ACR)
- Leaf 0x23: Introduce subleaf 4/5, PEBS capabilities and counters
- Leaf 0x1c: Return LBR depth as a bitmask instead of individual bits
- Leaf 0x0a: Use more descriptive PMU bitfield names
- Leaf 0x0a: Add various missing PMU events
- Leaf 0x06: Add missing IA32_HWP_CTL flag
- Leaf 0x0f: Add missing non-CPU (IO) Intel RDT bits
Thanks to Dave Hansen for reporting multiple missing bits.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://gitlab.com/x86-cpuid.org/x86-cpuid-db/-/blob/v2.5/CHANGELOG.rst
Link: https://gitlab.com/x86-cpuid.org/x86-cpuid-db/-/blob/v3.0/CHANGELOG.rst
Several selftests Makefiles (e.g. prctl, breakpoints, etc.) attempt to
normalize the ARCH variable by converting x86_64 and i.86 to x86.
However, they use the conditional assignment operator '?='.
When ARCH is passed as a command-line argument (e.g., during an rpmbuild
process), the '?=' assignment is skipped because ARCH is already defined,
so the shell command and its sed transformation never run. This leads to
an incorrect ARCH value being used, which causes build failures:
# make -C tools/testing/selftests TARGETS=prctl ARCH=x86_64
make: Entering directory '/build/tools/testing/selftests'
make[1]: Entering directory '/build/tools/testing/selftests/prctl'
make[1]: *** No targets. Stop.
make[1]: Leaving directory '/build/tools/testing/selftests/prctl'
make: *** [Makefile:197: all] Error 2
Change the assignment to use 'override' and ':=' to ensure the
normalization logic is applied regardless of how the ARCH variable was
initially defined.
Link: https://lkml.kernel.org/r/20260309205145.572778-1-aleksey.oladko@virtuozzo.com
Signed-off-by: Aleksei Oladko <aleksey.oladko@virtuozzo.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Bala-Vignesh-Reddy <reddybalavignesh9979@gmail.com>
Cc: Chelsy Ratnawat <chelsyratnawat2001@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The include directory ../../usr/include is only present if an in-tree
kernel build with CONFIG_HEADERS_INSTALL was done before.
Otherwise the system UAPI headers are used, which most likely are not
the most recent ones.
To make sure to always have access to up-to-date UAPI headers,
use the static copy in tools/include/uapi.
Link: https://lkml.kernel.org/r/20260307-accounting-taskstats-h-v1-2-0b75915c6ce5@weissschuh.net
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202603062103.Z5fecwZD-lkp@intel.com/
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Jiang Kun <jiang.kun2@zte.com.cn>
Cc: Wang Yaxin <wang.yaxin@zte.com.cn>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "tools/getdelays: use the static UAPI headers from
tools/include/uapi".
The include directory ../../usr/include is only present if an in-tree
kernel build with CONFIG_HEADERS_INSTALL was done before. Otherwise the
system UAPI headers are used, which most likely are not the most recent
ones.
To make sure to always have access to up-to-date UAPI headers, use the
static copy in tools/include/uapi.
This patch (of 2):
To give the accounting tools access to the new fields introduced in commit
503efe850c ("delayacct: add timestamp of delay max")
Link: https://lkml.kernel.org/r/20260307-accounting-taskstats-h-v1-0-0b75915c6ce5@weissschuh.net
Link: https://lkml.kernel.org/r/20260307-accounting-taskstats-h-v1-1-0b75915c6ce5@weissschuh.net
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Jiang Kun <jiang.kun2@zte.com.cn>
Cc: Wang Yaxin <wang.yaxin@zte.com.cn>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The fchmodat2 test program open codes a version of ksft_finished(). Use
the standard version instead.
Link: https://lkml.kernel.org/r/20260226-selftests-fchmodat2-v4-2-a6419435f2e8@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Alexey Gladkov <legion@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "selftests/fchmodat2: Error handling and general", v4.
I looked at the fchmodat2() tests since I've been experiencing some random
intermittent segfaults with them in my test systems. While doing so I
noticed these two issues. Unfortunately I haven't figured out the original
problem yet, unless I managed to fix it unwittingly.
This patch (of 2):
The fchmodat2() test program creates a temporary directory with a file and
a symlink for every test it runs but never cleans these up, resulting in
${TMPDIR} getting left with stale files after every run. Restructure the
program a bit to ensure that we clean these up. This is more invasive than
it might otherwise be due to the extensive use of ksft_exit_fail_msg() in
the program.
As a side effect this also ensures that we report a consistent test name
for the tests and always try both tests even if they are skipped.
Link: https://lkml.kernel.org/r/20260226-selftests-fchmodat2-v4-0-a6419435f2e8@kernel.org
Link: https://lkml.kernel.org/r/20260226-selftests-fchmodat2-v4-1-a6419435f2e8@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Alexey Gladkov <legion@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
msgque kselftest uses msgrcv(..., MSG_COPY) to copy messages. When the
kernel is built without CONFIG_CHECKPOINT_RESTORE, prepare_copy() is
stubbed out and msgrcv() returns -ENOSYS. The test currently reports this
as a failure even though it is simply a missing feature/configuration.
Skip the test when msgrcv() fails with ENOSYS.
Link: https://lkml.kernel.org/r/20260210135359.178636-1-jouyeol8739@gmail.com
Signed-off-by: UYeol Jo <jouyeol8739@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a regression test for the divide-by-zero in rtsc_min() triggered
when m2sm() converts a large m1 value (e.g. 32gbit) to a u64 scaled
slope reaching 2^32. rtsc_min() stores the difference of two such u64
values (sm1 - sm2) in a u32 variable `dsm`, truncating 2^32 to zero
and causing a divide-by-zero oops in the concave-curve intersection
path. The test configures an HFSC class with m1=32gbit d=1ms m2=0bit,
sends a packet to activate the class, waits for it to drain and go
idle, then sends another packet to trigger reactivation through
rtsc_min().
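The truncation described above can be reproduced with plain user-space
arithmetic. The sketch below is not the kernel code; the helper names
are hypothetical, and uint32_t/uint64_t stand in for the kernel's
u32/u64.

```c
#include <stdint.h>

/* Mirrors the buggy pattern: a 64-bit slope difference kept in 32 bits. */
static uint32_t dsm_truncated_sketch(uint64_t sm1, uint64_t sm2)
{
	return (uint32_t)(sm1 - sm2);
}

/* Keeps the full 64-bit difference, so a value of 2^32 stays non-zero. */
static uint64_t dsm_full_sketch(uint64_t sm1, uint64_t sm2)
{
	return sm1 - sm2;
}
```

With sm1 = 2^32 and sm2 = 0, the truncated variant yields 0, which is
the value that would then be used as a divisor, while the 64-bit
variant preserves the difference.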
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260326204310.1549327-2-xmei5@asu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
core/region.c is overloaded with per-region control logic (pmem, dax,
sysram, etc). Move the CXL DAX region device infrastructure from
region.c into a new region_dax.c file.
This will also allow us to add additional dax-driver integration paths
that don't further dirty the core region.c logic.
No functional changes.
Signed-off-by: Gregory Price <gourry@gourry.net>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Link: https://patch.msgid.link/20260327020203.876122-3-gourry@gourry.net
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
core/region.c is overloaded with per-region control logic (pmem, dax,
sysram, etc). Move the pmem region driver logic from region.c into
region_pmem.c to make it clear that this code only applies to pmem regions.
No functional changes.
[ dj: Fixed up some tabbing issues, may be from original code. ]
Signed-off-by: Gregory Price <gourry@gourry.net>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Link: https://patch.msgid.link/20260327020203.876122-2-gourry@gourry.net
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
scx_central currently assumes that ops.init() runs on the selected
central CPU and aborts otherwise. This is no longer true, as ops.init()
is invoked from the scx_enable_helper thread, which can run on any
CPU.
As a result, sched_setaffinity() from userspace doesn't work, causing
scx_central to fail when loading with:
[ 1985.319942] sched_ext: central: scx_central.bpf.c:314: init from non-central CPU
[ 1985.320317] scx_exit+0xa3/0xd0
[ 1985.320535] scx_bpf_error_bstr+0xbd/0x220
[ 1985.320840] bpf_prog_3a445a8163fa8149_central_init+0x103/0x1ba
[ 1985.321073] bpf__sched_ext_ops_init+0x40/0xa8
[ 1985.321286] scx_root_enable_workfn+0x507/0x1650
[ 1985.321461] kthread_worker_fn+0x260/0x940
[ 1985.321745] kthread+0x303/0x3e0
[ 1985.321901] ret_from_fork+0x589/0x7d0
[ 1985.322065] ret_from_fork_asm+0x1a/0x30
DEBUG DUMP
===================================================================
central: root
scx_enable_help[134] triggered exit kind 1025:
scx_bpf_error (scx_central.bpf.c:314: init from non-central CPU)
Fix this by:
- Deferring bpf_timer_start() to the first dispatch on the central CPU.
- Initializing the BPF timer in central_init() and kicking the central CPU
to guarantee entering the dispatch path on the central CPU immediately.
- Removing the unnecessary sched_setaffinity() call in userspace.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
The current sbi_pmu_test attempts to read firmware counters without
configuring them first with SBI_EXT_PMU_COUNTER_CFG_MATCH.
Previously this did not fail because KVM incorrectly allowed the read
and accessed fw_event[] with an out-of-bounds index when the counter
was unconfigured. After fixing that bug, the read now correctly returns
SBI_ERR_INVALID_PARAM, causing the selftest to fail.
Update the test to configure a firmware event before reading the
counter. Also add a negative test to ensure that attempting to read an
unconfigured firmware counter fails gracefully.
Signed-off-by: Jiakai Xu <xujiakai2025@iscas.ac.cn>
Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com>
Reviewed-by: Andrew Jones <andrew.jones@oss.qualcomm.com>
Reviewed-by: Nutty Liu <nutty.liu@hotmail.com>
Link: https://lore.kernel.org/r/20260316014533.2312254-3-xujiakai2025@iscas.ac.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
The timer_f.utimer test hard-fails with ASSERT_EQ when
SNDRV_TIMER_IOCTL_CREATE returns -1 on kernels without
CONFIG_SND_UTIMER. This causes the entire alsa kselftest suite to
report a failure rather than skipping the unsupported test.
When CONFIG_SND_UTIMER is not enabled, the ioctl is not recognised and
the kernel returns -ENOTTY. If the timer device or subdevice does not
exist, -ENXIO is returned. Skip the test in both cases, but still fail
on any other unexpected error.
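The decision described above can be sketched as a small classifier.
This is an illustration only, not the selftest code: the enum and the
function name are hypothetical, and ret/err stand for the ioctl()
return value and the errno it left behind.

```c
#include <errno.h>

enum utimer_outcome_sketch {
	UTIMER_PASS_SKETCH,
	UTIMER_SKIP_SKETCH,
	UTIMER_FAIL_SKETCH,
};

/* Map an ioctl() result to pass, skip, or fail. */
static enum utimer_outcome_sketch classify_create_sketch(int ret, int err)
{
	if (ret >= 0)
		return UTIMER_PASS_SKETCH;
	if (err == ENOTTY || err == ENXIO)
		return UTIMER_SKIP_SKETCH;	/* feature or device absent */
	return UTIMER_FAIL_SKETCH;		/* anything else is unexpected */
}
```

Keeping the skip list explicit means a new, unexpected errno still
shows up as a failure rather than being silently skipped.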
Suggested-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/linux-kselftest/0e9c25d3-efbd-433b-9fb1-0923010101b9@stanley.mountain/
Signed-off-by: Ben Copeland <ben.copeland@linaro.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://patch.msgid.link/20260319124521.191491-1-ben.copeland@linaro.org
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Address the "grep: warning: stray \ before white space" warning from GNU
grep 3.12. It warns about misplaced backslashes before whitespace
(e.g. \\' ' or '\ '), which lead to unspecified behavior [1].
We can simply remove the backslashes before whitespace, as POSIX says:
Enclosing characters in single-quotes ('') shall preserve the literal
value of each character within the single-quotes.
and Bourne-compatible shells behave accordingly.
[1]: https://lists.gnu.org/r/bug-gnulib/2022-05/msg00057.html
Signed-off-by: Yohei Kojima <yk@y-koj.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/dd0bbd48cdf468da56ec34fd61cecd4d2111d7ba.1774372510.git.yk@y-koj.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extend srv6_hencap_red_l3vpn_test.sh to include checks for the new
"tunsrc" feature. If there is no support for tunsrc, it silently
falls back to the encap config without tunsrc.
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Justin Iurman <justin.iurman@6wind.com>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Link: https://patch.msgid.link/20260324091434.359341-3-justin.iurman@6wind.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit c50dcf5331.
The tests are superficial, likely AI-generated slop, and flaky. They
don't add actual value and just churn the selftests.
Signed-off-by: Tejun Heo <tj@kernel.org>
The "comm" column allows grouping events by the process command. It is
intended to group like programs, despite having different PIDs. But some
workloads may adjust their own command, so that a unique identifier
(e.g. a PID or some other numeric value) is part of the command name.
This destroys the utility of "comm", forcing perf to place each unique
process name into its own bucket, which can contribute to a
combinatorial explosion of memory use in perf report.
Create a less strict version of this column, which ignores digits when
comparing command names. Commands whose names are the same (ignoring
digits) are sorted into the same histogram buckets, and displayed with
the placeholder value "<N>" in the place of digits. For example,
hypothetical command names "kworker/1" "kworker/2" "kworker/3" would
sort into the same bucket and be represented as "kworker/<N>".
Committer testing:
$ perf report -s comm,comm_nodigit | grep -F "<N>"
0.01% CPU 6/TCG CPU <N>/TCG
0.01% kworker/53:2-mm kworker/<N>:<N>-mm
0.01% migration/24 migration/<N>
0.01% kworker/24:1-ev kworker/<N>:<N>-ev
0.01% llvmpipe-8 llvmpipe-<N>
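The digit folding shown in the output above can be sketched as follows.
This is not perf's implementation; the helper name is hypothetical, and
the sketch only reproduces the displayed form, in which each run of
digits becomes a single "<N>".

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy 'in' to 'out', folding each run of digits
 * into the placeholder "<N>". The output is truncated to fit outsz.
 */
static void fold_digits_sketch(const char *in, char *out, size_t outsz)
{
	static const char ph[] = "<N>";
	size_t o = 0;

	if (outsz == 0)
		return;

	while (*in && o + 1 < outsz) {
		if (isdigit((unsigned char)*in)) {
			size_t n = strlen(ph);

			if (o + n >= outsz)
				break;		/* placeholder would not fit */
			memcpy(out + o, ph, n);
			o += n;
			while (isdigit((unsigned char)*in))
				in++;		/* skip the rest of the run */
		} else {
			out[o++] = *in++;
		}
	}
	out[o] = '\0';
}
```

Two command names that fold to the same string would land in the same
bucket under this scheme.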
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Commit f5803651b4 ("perf stat: Choose the most disaggregate command
line option") changed aggregation option handling for `perf stat` but
not `perf stat report` leading to parse_cache_level being passed a
struct in the `perf stat` case but erroneously an aggr_mode enum value
for `perf stat report`. Change the `perf stat report` aggregation
handling to use the same opt_aggr_mode as `perf stat`. Also, just pass
the boolean for consistency with other boolean argument handling.
Fixes: f5803651b4 ("perf stat: Choose the most disaggregate command line option")
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The value is a void* and the address of an int, max_stack_depth, is
set up in the perf lock options. The parse_max_stack function treats
the int* as a long*. Make this more correct by declaring the value to
be an int*.
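The pattern at issue can be sketched in user space as below. This is
not the perf callback itself: the function name and the accepted range
are assumptions made for illustration. The point is that the opaque
pointer is converted back to the type of the storage it really refers
to, so the store writes sizeof(int) bytes and no more.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for an option callback: 'value' is the opaque
 * pointer registered with the option, and it really points at an int.
 */
static int parse_int_option_sketch(const char *str, void *value)
{
	int *out = value;	/* match the real type of the storage */
	char *end;
	long val;

	errno = 0;
	val = strtol(str, &end, 0);
	if (errno || end == str || *end != '\0')
		return -1;
	if (val < 0 || val > INT_MAX)
		return -1;

	*out = (int)val;	/* writes sizeof(int) bytes, no more */
	return 0;
}
```

Parsing into a long and range-checking before the store keeps strtol()
semantics while leaving the destination untouched on error.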
Fixes: 0a277b6226 ("perf lock contention: Check --max-stack option")
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Commit e5e66adfe4 ("perf regs: Remove __weak attributive arch_sdt_arg_parse_op() function")
removed the arch_sdt_arg_parse_op() functions and revealed missing s390 support.
The following warning is printed:
Unknown ELF machine 22, standard arguments parse will be skipped.
ELF machine 22 is the EM_S390 host. This happens with the command
# ./perf record -v -- stress-ng -t 1s --matrix 0
when the event is not specified.
Add s390 specific __perf_sdt_arg_parse_op_s390() function to support
-architecture calls to arch_sdt_arg_parse_op() for s390.
The warning disappears.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Tested-by: Jan Polensky <japo@linux.ibm.com>
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The perf static build reports that the BPF skeleton is disabled due to
the missing libopenssl feature.
Use PKG_CONFIG to determine the link flags for libopenssl. Add
"--static" to the PKG_CONFIG command for static linking.
Fixes: 7678523109 ("tools/build: Add a feature test for libopenssl")
Signed-off-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Verify that btf__new_empty_opts() adds layouts for all supported kinds.
After adding kind-related types for an unknown kind, ensure that
parsing uses this info when that kind is encountered rather than
giving up.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-9-alan.maguire@oracle.com
Add a FEAT_BTF_LAYOUT feature check which checks if the
kernel supports BTF layout information. Also sanitize
BTF if it contains layout data but the kernel does not
support it. The sanitization requires rewriting raw
BTF data to update the header and eliminate the layout
section (since it lies between the types and strings),
so refactor sanitization to do the raw BTF retrieval
and creation of updated BTF, returning that new BTF
on success.
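A minimal sketch of the raw rewrite, under stated assumptions: the
struct hdr fields mirror the names used in this series but the
structure itself is illustrative, strip_layout() is a hypothetical
helper, the input is assumed to be already validated with a
layout-extended header, and shrinking the header back to its original
size is an assumption of the sketch rather than something stated above.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative layout-extended header (assumed, 32 bytes). */
struct hdr {
	uint16_t magic;
	uint8_t  version;
	uint8_t  flags;
	uint32_t hdr_len;
	uint32_t type_off, type_len;
	uint32_t str_off, str_len;
	uint32_t layout_off, layout_len;
};

#define ORIG_HDR_LEN 24u	/* header size without the layout fields */

/*
 * Return a newly allocated copy of 'raw' without the layout section;
 * *out_sz receives its size. Section offsets are relative to the end
 * of the header, and the sections are ordered types, layout, strings.
 */
static void *strip_layout(const void *raw, size_t *out_sz)
{
	struct hdr h;
	const uint8_t *data;
	uint8_t *out;

	memcpy(&h, raw, sizeof(h));
	data = (const uint8_t *)raw + h.hdr_len;

	*out_sz = ORIG_HDR_LEN + h.type_len + h.str_len;
	out = malloc(*out_sz);
	if (!out)
		return NULL;

	/* Types stay first; strings move up to close the gap. */
	memcpy(out + ORIG_HDR_LEN, data + h.type_off, h.type_len);
	memcpy(out + ORIG_HDR_LEN + h.type_len, data + h.str_off, h.str_len);

	h.hdr_len = ORIG_HDR_LEN;
	h.type_off = 0;
	h.str_off = h.type_len;
	memcpy(out, &h, ORIG_HDR_LEN);	/* layout fields are not copied */
	return out;
}
```

Because the strings follow the layout section, dropping the layout
changes str_off, which is why the header has to be rewritten and not
just truncated.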
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-7-alan.maguire@oracle.com
BTF parsing can use layout to navigate unknown kinds, so
btf_validate_type() should take layout information into
account to avoid failure when an unrecognized kind is met.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-6-alan.maguire@oracle.com
Support encoding of BTF layout data via btf__new_empty_opts().
Currently supported opts are base_btf and add_layout.
Layout information is maintained in btf.c in the layouts[] array;
when BTF is created with the add_layout option it represents the
current view of supported BTF kinds.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-5-alan.maguire@oracle.com
This allows BTF parsing to proceed even if we do not know the
kind. Fall back to base BTF layout if layout information is
not in split BTF.
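The fallback can be sketched as follows. The types and the
find_layout() helper are hypothetical and heavily simplified; they are
not the libbpf structures, and only model "use the split BTF layout if
present, otherwise the base BTF layout".

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative per-kind layout entry (assumed field names). */
struct kind_layout {
	uint8_t  info_sz;
	uint8_t  elem_sz;
	uint16_t flags;
};

/* Hypothetical BTF object holding only what the lookup needs. */
struct btf_obj {
	const struct kind_layout *layout;	/* NULL if no layout section */
	uint32_t nr_layouts;			/* number of kinds described */
	const struct btf_obj *base;		/* base BTF for split BTF, else NULL */
};

/*
 * Find the layout entry for 'kind'. An object without a layout section
 * defers to its base BTF; NULL means no layout describes the kind.
 */
static const struct kind_layout *find_layout(const struct btf_obj *btf,
					     uint32_t kind)
{
	for (; btf; btf = btf->base) {
		if (!btf->layout)
			continue;	/* nothing here, try the base BTF */
		if (kind < btf->nr_layouts)
			return &btf->layout[kind];
		return NULL;	/* layout present but kind not covered */
	}
	return NULL;
}
```

A NULL result is the point at which a parser would still have to give
up, since nothing describes how large the unknown kind is.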
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-4-alan.maguire@oracle.com
Support reading in the layout section, fixing endianness issues on read;
also support writing the layout section to a raw BTF object.
There is not yet an API to populate the layout with meaningful
information.
As part of this, we need to consider multiple valid BTF header
sizes: the original and the layout-extended headers.
To support this, the "struct btf" representation is modified
to contain a "struct btf_header" and we copy the valid
portion from the raw data to it; this means we can always safely
check fields like btf->hdr.layout_len.
Note that if parsed-in BTF has extra header information beyond
sizeof(struct btf_header), we make that BTF ineligible
for modification by setting btf->has_hdr_extra.
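A minimal sketch of the header handling, assuming host byte order. The
struct hdr fields follow the names in the text, but the structure and
the load_hdr() helper are illustrative assumptions, not the libbpf
code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative header with the layout extension at the end (assumed). */
struct hdr {
	uint16_t magic;
	uint8_t  version;
	uint8_t  flags;
	uint32_t hdr_len;
	uint32_t type_off, type_len;
	uint32_t str_off, str_len;
	uint32_t layout_off, layout_len;	/* absent in original headers */
};

#define ORIG_HDR_LEN 24u	/* header size before the layout fields */

/*
 * Copy the valid part of a raw header into a zeroed, full-size struct so
 * that fields missing from a short header read as 0. *has_extra is set
 * when the raw header is longer than the struct we know about.
 */
static int load_hdr(struct hdr *dst, bool *has_extra,
		    const void *raw, size_t raw_sz)
{
	uint32_t hdr_len;

	if (raw_sz < ORIG_HDR_LEN)
		return -1;
	memcpy(&hdr_len, (const uint8_t *)raw + offsetof(struct hdr, hdr_len),
	       sizeof(hdr_len));
	/* Accept the original size, or anything at least layout-extended. */
	if (hdr_len != ORIG_HDR_LEN && hdr_len < sizeof(*dst))
		return -1;
	if (hdr_len > raw_sz)
		return -1;

	memset(dst, 0, sizeof(*dst));
	memcpy(dst, raw, hdr_len < sizeof(*dst) ? hdr_len : sizeof(*dst));
	*has_extra = hdr_len > sizeof(*dst);
	return 0;
}
```

With an original-size header, dst->layout_len is simply 0 after the
call, so callers can test it without first checking the header length.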
Ensure that we handle endianness issues for the BTF layout section,
though currently the only field that needs this (flags) is unused.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-3-alan.maguire@oracle.com
BTF kind layouts provide information to parse BTF kinds. By separating
parsing BTF from using all the information it provides, we allow BTF
to encode new features even if they cannot be used by readers. This
will be helpful in particular for cases where older tools are used
to parse newer BTF with kinds the older tools do not recognize;
the BTF can still be parsed in such cases using kind layout.
The intent is to support encoding of kind layouts optionally so that
tools like pahole can add this information. For each kind, we record
- length of singular element following struct btf_type
- length of each of the btf_vlen() elements following
- a (currently unused) flags field
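As a sketch of how the three recorded values are enough to step over a
type of unknown kind: the struct and field names below are assumptions
for illustration, and type_record_size() is a hypothetical helper. The
12-byte size of struct btf_type is part of the existing BTF format.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative per-kind layout record; field names are assumptions. */
struct kind_layout {
	uint8_t  info_sz;	/* bytes of the single element after struct btf_type */
	uint8_t  elem_sz;	/* bytes of each of the btf_vlen() elements */
	uint16_t flags;		/* currently unused */
};

/* sizeof(struct btf_type): name_off, info, and the size/type union. */
#define TYPE_HDR_SZ 12u

/*
 * Total size of one type record, derived only from its layout entry and
 * its vlen, so a reader can skip a kind it does not otherwise understand.
 */
static size_t type_record_size(const struct kind_layout *l, uint32_t vlen)
{
	return TYPE_HDR_SZ + l->info_sz + (size_t)vlen * l->elem_sz;
}
```

For example, an enum-like kind with no singular element and 8-byte
members would be described as { 0, 8, 0 }, and an int-like kind with a
single 4-byte trailer as { 4, 0, 0 }.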
The ideas here were discussed at [1], [2]; hence
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-2-alan.maguire@oracle.com
[1] https://lore.kernel.org/bpf/CAEf4BzYjWHRdNNw4B=eOXOs_ONrDwrgX4bn=Nuc1g8JPFC34MA@mail.gmail.com/
[2] https://lore.kernel.org/bpf/20230531201936.1992188-1-alan.maguire@oracle.com/