Commit Graph

7162 Commits

Author SHA1 Message Date
Linus Torvalds
50647a1176 file->f_path constification
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaN3daAAKCRBZ7Krx/gZQ
 6zNWAP9kD6rOJRNqDgea4pibDPa47Tps/WM5tsDv3dsLliY29gEA6sveOWZ3guAj
 4oY3ts/NtHLWXvhI7Vd/1mr2aTKEZQk=
 =YNK+
 -----END PGP SIGNATURE-----

Merge tag 'pull-f_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull file->f_path constification from Al Viro:
 "Only one thing was modifying ->f_path of an opened file - acct(2).

  Massaging that away and constifying a bunch of struct path * arguments
  in functions that might be given &file->f_path ends up with the
  situation where we can turn ->f_path into an anon union of const
  struct path f_path and struct path __f_path, the latter modified only
  in a few places in fs/{file_table,open,namei}.c, all for struct file
  instances that are yet to be opened"

* tag 'pull-f_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (23 commits)
  Have cc(1) catch attempts to modify ->f_path
  kernel/acct.c: saner struct file treatment
  configfs:get_target() - release path as soon as we grab configfs_item reference
  apparmor/af_unix: constify struct path * arguments
  ovl_is_real_file: constify realpath argument
  ovl_sync_file(): constify path argument
  ovl_lower_dir(): constify path argument
  ovl_get_verity_digest(): constify path argument
  ovl_validate_verity(): constify {meta,data}path arguments
  ovl_ensure_verity_loaded(): constify datapath argument
  ksmbd_vfs_set_init_posix_acl(): constify path argument
  ksmbd_vfs_inherit_posix_acl(): constify path argument
  ksmbd_vfs_kern_path_unlock(): constify path argument
  ksmbd_vfs_path_lookup_locked(): root_share_path can be const struct path *
  check_export(): constify path argument
  export_operations->open(): constify path argument
  rqst_exp_get_by_name(): constify path argument
  nfs: constify path argument of __vfs_getattr()
  bpf...d_path(): constify path argument
  done_path_create(): constify path argument
  ...
2025-10-03 16:32:36 -07:00
Linus Torvalds
51e9889ab1 vfs_parse_fs_string() stuff
change on vfs_parse_fs_string() calling conventions - get rid of
 the length argument (almost all callers pass strlen() of the
 string argument there), add vfs_parse_fs_qstr() for the cases
 that do want separate length
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaNhllQAKCRBZ7Krx/gZQ
 6wyiAP9TmyFLBWKC/sDNtRAiGRybEtcwvVZj1agpx0kZIWshUwD7BVg4AfDs+vN3
 RoYYg9DR1SP5kZF7h2Ve1T39XDq6ZQU=
 =YZFG
 -----END PGP SIGNATURE-----

Merge tag 'pull-fs_context' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull fs_context updates from Al Viro:
 "Change vfs_parse_fs_string() calling conventions

  Get rid of the length argument (almost all callers pass strlen() of
  the string argument there), add vfs_parse_fs_qstr() for the cases that
  do want separate length"

* tag 'pull-fs_context' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  do_nfs4_mount(): switch to vfs_parse_fs_string()
  change the calling conventions for vfs_parse_fs_string()
2025-10-03 10:51:44 -07:00
Sasha Levin
61e19cd2e5 tracing: Fix lock imbalance in s_start() memory allocation failure path
When s_start() fails to allocate memory for set_event_iter, it returns NULL
before acquiring event_mutex. However, the corresponding s_stop() function
always tries to unlock the mutex, causing a lock imbalance warning:

  WARNING: bad unlock balance detected!
  6.17.0-rc7-00175-g2b2e0c04f78c #7 Not tainted
  -------------------------------------
  syz.0.85611/376514 is trying to release lock (event_mutex) at:
  [<ffffffff8dafc7a4>] traverse.part.0.constprop.0+0x2c4/0x650 fs/seq_file.c:131
  but there are no more locks to release!

The issue was introduced by commit b355247df1 ("tracing: Cache ':mod:'
events for modules not loaded yet") which added the kzalloc() allocation before
the mutex lock, creating a path where s_start() could return without locking
the mutex while s_stop() would still try to unlock it.

Fix this by unconditionally acquiring the mutex immediately after allocation,
regardless of whether the allocation succeeded.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250929113238.3722055-1-sashal@kernel.org
Fixes: b355247df1 ("tracing: Cache ":mod:" events for modules not loaded yet")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-03 12:13:12 -04:00
Yuan Chen
9cf9aa7b0a tracing: Fix race condition in kprobe initialization causing NULL pointer dereference
There is a critical race condition in kprobe initialization that can lead to
NULL pointer dereference and kernel crash.

[1135630.084782] Unable to handle kernel paging request at virtual address 0000710a04630000
...
[1135630.260314] pstate: 404003c9 (nZcv DAIF +PAN -UAO)
[1135630.269239] pc : kprobe_perf_func+0x30/0x260
[1135630.277643] lr : kprobe_dispatcher+0x44/0x60
[1135630.286041] sp : ffffaeff4977fa40
[1135630.293441] x29: ffffaeff4977fa40 x28: ffffaf015340e400
[1135630.302837] x27: 0000000000000000 x26: 0000000000000000
[1135630.312257] x25: ffffaf029ed108a8 x24: ffffaf015340e528
[1135630.321705] x23: ffffaeff4977fc50 x22: ffffaeff4977fc50
[1135630.331154] x21: 0000000000000000 x20: ffffaeff4977fc50
[1135630.340586] x19: ffffaf015340e400 x18: 0000000000000000
[1135630.349985] x17: 0000000000000000 x16: 0000000000000000
[1135630.359285] x15: 0000000000000000 x14: 0000000000000000
[1135630.368445] x13: 0000000000000000 x12: 0000000000000000
[1135630.377473] x11: 0000000000000000 x10: 0000000000000000
[1135630.386411] x9 : 0000000000000000 x8 : 0000000000000000
[1135630.395252] x7 : 0000000000000000 x6 : 0000000000000000
[1135630.403963] x5 : 0000000000000000 x4 : 0000000000000000
[1135630.412545] x3 : 0000710a04630000 x2 : 0000000000000006
[1135630.421021] x1 : ffffaeff4977fc50 x0 : 0000710a04630000
[1135630.429410] Call trace:
[1135630.434828]  kprobe_perf_func+0x30/0x260
[1135630.441661]  kprobe_dispatcher+0x44/0x60
[1135630.448396]  aggr_pre_handler+0x70/0xc8
[1135630.454959]  kprobe_breakpoint_handler+0x140/0x1e0
[1135630.462435]  brk_handler+0xbc/0xd8
[1135630.468437]  do_debug_exception+0x84/0x138
[1135630.475074]  el1_dbg+0x18/0x8c
[1135630.480582]  security_file_permission+0x0/0xd0
[1135630.487426]  vfs_write+0x70/0x1c0
[1135630.493059]  ksys_write+0x5c/0xc8
[1135630.498638]  __arm64_sys_write+0x24/0x30
[1135630.504821]  el0_svc_common+0x78/0x130
[1135630.510838]  el0_svc_handler+0x38/0x78
[1135630.516834]  el0_svc+0x8/0x1b0

kernel/trace/trace_kprobe.c: 1308
0xffff3df8995039ec <kprobe_perf_func+0x2c>:     ldr     x21, [x24,#120]
include/linux/compiler.h: 294
0xffff3df8995039f0 <kprobe_perf_func+0x30>:     ldr     x1, [x21,x0]

kernel/trace/trace_kprobe.c
1308: head = this_cpu_ptr(call->perf_events);
1309: if (hlist_empty(head))
1310: 	return 0;

crash> struct trace_event_call -o
struct trace_event_call {
  ...
  [120] struct hlist_head *perf_events;  //(call->perf_event)
  ...
}

crash> struct trace_event_call ffffaf015340e528
struct trace_event_call {
  ...
  perf_events = 0xffff0ad5fa89f088, //this value is correct, but x21 = 0
  ...
}

Race Condition Analysis:

The race occurs between kprobe activation and perf_events initialization:

  CPU0                                    CPU1
  ====                                    ====
  perf_kprobe_init
    perf_trace_event_init
      tp_event->perf_events = list;(1)
      tp_event->class->reg (2)← KPROBE ACTIVE
                                          Debug exception triggers
                                          ...
                                          kprobe_dispatcher
                                            kprobe_perf_func (tk->tp.flags & TP_FLAG_PROFILE)
                                              head = this_cpu_ptr(call->perf_events)(3)
                                              (perf_events is still NULL)

Problem:
1. CPU0 executes (1) assigning tp_event->perf_events = list
2. CPU0 executes (2) enabling kprobe functionality via class->reg()
3. CPU1 triggers and reaches kprobe_dispatcher
4. CPU1 checks TP_FLAG_PROFILE - condition passes (step 2 completed)
5. CPU1 calls kprobe_perf_func() and crashes at (3) because
   call->perf_events is still NULL

CPU1 sees that kprobe functionality is enabled but does not see that
perf_events has been assigned.

Add pairing read and write memory barriers to guarantee that if CPU1
sees that kprobe functionality is enabled, it must also see that
perf_events has been assigned.

Link: https://lore.kernel.org/all/20251001022025.44626-1-chenyuan_fl@163.com/

Fixes: 50d7805607 ("tracing/kprobes: Add probe handler dispatcher to support perf and ftrace concurrent use")
Cc: stable@vger.kernel.org
Signed-off-by: Yuan Chen <chenyuan@kylinos.cn>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-10-02 08:05:01 +09:00
Linus Torvalds
ae28ed4578 bpf-next-6.18
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmjZH40ACgkQ6rmadz2v
 bTrG7w//X/5CyDoKIYJCqynYRdMtfqYuCe8Jhud4p5++iBVqkDyS6Y8EFLqZVyg/
 UHTqaSE4Nz8/pma0WSjhUYn6Chs1AeH+Rw/g109SovE/YGkek2KNwY3o2hDrtPMX
 +oD0my8qF2HLKgEyteXXyZ5Ju+AaF92JFiGko4/wNTX8O99F9nyz2pTkrctS9Vl9
 VwuTxrEXpmhqrhP3WCxkfNfcbs9HP+AALpgOXZKdMI6T4KI0N1gnJ0ZWJbiXZ8oT
 tug0MTPkNRidYMl0wHY2LZ6ZG8Q3a7Sgc+M0xFzaHGvGlJbBg1HjsDMtT6j34CrG
 TIVJ/O8F6EJzAnQ5Hio0FJk8IIgMRgvng5Kd5GXidU+mE6zokTyHIHOXitYkBQNH
 Hk+lGA7+E2cYqUqKvB5PFoyo+jlucuIH7YwrQlyGfqz+98n65xCgZKcmdVXr0hdB
 9v3WmwJFtVIoPErUvBC3KRANQYhFk4eVk1eiGV/20+eIVyUuNbX6wqSWSA9uEXLy
 n5fm/vlk4RjZmrPZHxcJ0dsl9LTF1VvQQHkgoC1Sz/Cc+jA6k4I+ECVHAqEbk36p
 1TUF52yPOD2ViaJKkj+962JaaaXlUn6+Dq7f1GMP6VuyHjz4gsI3mOo4XarqNdWd
 c7TnYmlGO/cGwqd4DdbmWiF1DDsrBcBzdbC8+FgffxQHLPXGzUg=
 =LeQi
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:

 - Support pulling non-linear xdp data with bpf_xdp_pull_data() kfunc
   (Amery Hung)

   Applied as a stable branch in bpf-next and net-next trees.

 - Support reading skb metadata via bpf_dynptr (Jakub Sitnicki)

   Also a stable branch in bpf-next and net-next trees.

 - Enforce expected_attach_type for tailcall compatibility (Daniel
   Borkmann)

 - Replace path-sensitive with path-insensitive live stack analysis in
   the verifier (Eduard Zingerman)

   This is a significant change in the verification logic. More details,
   motivation, long term plans are in the cover letter/merge commit.

 - Support signed BPF programs (KP Singh)

   This is another major feature that took years to materialize.

   Algorithm details are in the cover letter/marge commit

 - Add support for may_goto instruction to s390 JIT (Ilya Leoshkevich)

 - Add support for may_goto instruction to arm64 JIT (Puranjay Mohan)

 - Fix USDT SIB argument handling in libbpf (Jiawei Zhao)

 - Allow uprobe-bpf program to change context registers (Jiri Olsa)

 - Support signed loads from BPF arena (Kumar Kartikeya Dwivedi and
   Puranjay Mohan)

 - Allow access to union arguments in tracing programs (Leon Hwang)

 - Optimize rcu_read_lock() + migrate_disable() combination where it's
   used in BPF subsystem (Menglong Dong)

 - Introduce bpf_task_work_schedule*() kfuncs to schedule deferred
   execution of BPF callback in the context of a specific task using the
   kernel’s task_work infrastructure (Mykyta Yatsenko)

 - Enforce RCU protection for KF_RCU_PROTECTED kfuncs (Kumar Kartikeya
   Dwivedi)

 - Add stress test for rqspinlock in NMI (Kumar Kartikeya Dwivedi)

 - Improve the precision of tnum multiplier verifier operation
   (Nandakumar Edamana)

 - Use tnums to improve is_branch_taken() logic (Paul Chaignon)

 - Add support for atomic operations in arena in riscv JIT (Pu Lehui)

 - Report arena faults to BPF error stream (Puranjay Mohan)

 - Search for tracefs at /sys/kernel/tracing first in bpftool (Quentin
   Monnet)

 - Add bpf_strcasecmp() kfunc (Rong Tao)

 - Support lookup_and_delete_elem command in BPF_MAP_STACK_TRACE (Tao
   Chen)

* tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (197 commits)
  libbpf: Replace AF_ALG with open coded SHA-256
  selftests/bpf: Add stress test for rqspinlock in NMI
  selftests/bpf: Add test case for different expected_attach_type
  bpf: Enforce expected_attach_type for tailcall compatibility
  bpftool: Remove duplicate string.h header
  bpf: Remove duplicate crypto/sha2.h header
  libbpf: Fix error when st-prefix_ops and ops from differ btf
  selftests/bpf: Test changing packet data from kfunc
  selftests/bpf: Add stacktrace map lookup_and_delete_elem test case
  selftests/bpf: Refactor stacktrace_map case with skeleton
  bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE
  selftests/bpf: Fix flaky bpf_cookie selftest
  selftests/bpf: Test changing packet data from global functions with a kfunc
  bpf: Emit struct bpf_xdp_sock type in vmlinux BTF
  selftests/bpf: Task_work selftest cleanup fixes
  MAINTAINERS: Delete inactive maintainers from AF_XDP
  bpf: Mark kfuncs as __noclone
  selftests/bpf: Add kprobe multi write ctx attach test
  selftests/bpf: Add kprobe write ctx attach test
  selftests/bpf: Add uprobe context ip register change test
  ...
2025-09-30 17:58:11 -07:00
Michal Koutný
2378a191f4 tracing: Ensure optimized hashing works
If ever PID_MAX_DEFAULT changes, it must be compatible with tracing
hashmaps assumptions.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250924113810.2433478-1-mkoutny@suse.com
Link: https://lore.kernel.org/r/20240409110126.651e94cb@gandalf.local.home/
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-30 17:27:58 -04:00
Vladimir Riabchun
4099b98203 ftrace: Fix softlockup in ftrace_module_enable
A soft lockup was observed when loading amdgpu module.
If a module has a lot of tracable functions, multiple calls
to kallsyms_lookup can spend too much time in RCU critical
section and with disabled preemption, causing kernel panic.
This is the same issue that was fixed in
commit d0b24b4e91 ("ftrace: Prevent RCU stall on PREEMPT_VOLUNTARY
kernels") and commit 42ea22e754 ("ftrace: Add cond_resched() to
ftrace_graph_set_hash()").

Fix it the same way by adding cond_resched() in ftrace_module_enable.

Link: https://lore.kernel.org/aMQD9_lxYmphT-up@vova-pc
Signed-off-by: Vladimir Riabchun <ferr.lambarginio@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-30 17:27:58 -04:00
Linus Torvalds
8f9736633f tracing fixes for v6.17
- Fix buffer overflow in osnoise_cpu_write()
 
   The allocated buffer to read user space did not add a nul terminating byte
   after copying from user the string. It then reads the string, and if user
   space did not add a nul byte, the read will continue beyond the string.
   Add a nul terminating byte after reading the string.
 
 - Fix missing check for lockdown on tracing
 
   There's a path from kprobe events or uprobe events that can update the
   tracing system even if lockdown on tracing is activate. Add a check in the
   dynamic event path.
 
 - Add a recursion check for the function graph return path
 
   Now that fprobes can hook to the function graph tracer and call different
   code between the entry and the exit, the exit code may now call functions
   that are not called in entry. This means that the exit handler can possibly
   trigger recursion that is not caught and cause the system to crash.
   Add the same recursion checks in the function exit handler as exists in the
   entry handler path.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaNkbyxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qh6dAQDTFqLRb01RzaZuF/xG9A7UqNz9abq5
 fQVwu1RG9xXnnAD/X9PfKfnqLhK/M2EJZT17PJ+nUlFqFoVL6lLJyrDLSw4=
 =4j5J
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix buffer overflow in osnoise_cpu_write()

   The allocated buffer to read user space did not add a nul terminating
   byte after copying from user the string. It then reads the string,
   and if user space did not add a nul byte, the read will continue
   beyond the string.

   Add a nul terminating byte after reading the string.

 - Fix missing check for lockdown on tracing

   There's a path from kprobe events or uprobe events that can update
   the tracing system even if lockdown on tracing is activate. Add a
   check in the dynamic event path.

 - Add a recursion check for the function graph return path

   Now that fprobes can hook to the function graph tracer and call
   different code between the entry and the exit, the exit code may now
   call functions that are not called in entry. This means that the exit
   handler can possibly trigger recursion that is not caught and cause
   the system to crash.

   Add the same recursion checks in the function exit handler as exists
   in the entry handler path.

* tag 'trace-v6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: fgraph: Protect return handler from recursion loop
  tracing: dynevent: Add a missing lockdown check on dynevent
  tracing/osnoise: Fix slab-out-of-bounds in _parse_integer_limit()
2025-09-28 10:26:35 -07:00
Masami Hiramatsu (Google)
0db0934e7f tracing: fgraph: Protect return handler from recursion loop
function_graph_enter_regs() prevents itself from recursion by
ftrace_test_recursion_trylock(), but __ftrace_return_to_handler(),
which is called at the exit, does not prevent such recursion.
Therefore, while it can prevent recursive calls from
fgraph_ops::entryfunc(), it is not able to prevent recursive calls
to fgraph from fgraph_ops::retfunc(), resulting in a recursive loop.
This can lead an unexpected recursion bug reported by Menglong.

 is_endbr() is called in __ftrace_return_to_handler -> fprobe_return
  -> kprobe_multi_link_exit_handler -> is_endbr.

To fix this issue, acquire ftrace_test_recursion_trylock() in the
__ftrace_return_to_handler() after unwind the shadow stack to mark
this section must prevent recursive call of fgraph inside user-defined
fgraph_ops::retfunc().

This is essentially a fix to commit 4346ba1604 ("fprobe: Rewrite
fprobe on function-graph tracer"), because before that fgraph was
only used from the function graph tracer. Fprobe allowed user to run
any callbacks from fgraph after that commit.

Reported-by: Menglong Dong <menglong8.dong@gmail.com>
Closes: https://lore.kernel.org/all/20250918120939.1706585-1-dongml2@chinatelecom.cn/
Fixes: 4346ba1604 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/175852292275.307379.9040117316112640553.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Menglong Dong <menglong8.dong@gmail.com>
Acked-by: Menglong Dong <menglong8.dong@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-27 09:04:05 -04:00
Linus Torvalds
bf40f4b877 Probes fixes for v6.17-rc7
- fprobe: Even if there is a memory allocation failure, try to remove
   the addresses recorded until then from the filter. Previously we
   just skip it.
 - tracing: dynevent: Add a missing lockdown check on dynevent. This
   dynevent is the interface for all probe events. Thus if there is no
   check, any probe events can be added after lock down the tracefs.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmjUl5obHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b4x4H/05k/zgZGXHEZizE+Zzx
 tS8kikesEcnlTv1F4goYVvMzcUpBpfTgjFEJQPSyEHNUcWtWYAlGSWZnQLfzmrr/
 2qHEXCVXk+jjp9kwTwvpWioLaU51tGC5tkFD+G4WQdgNZaOsejgXR1UEKum80g0p
 cuYYvwEKq3o6Vas+yrMQpJbxiu0CLcxY0EJwULlZrdAzdECifn11lCyUHTNevF7/
 3Xb8sAR7Xxd+PGhvYk9RPzosocXQ5Kb0zvyWr64njDNTstGxkpamwQ8Z0GUd1L1Z
 nnqm+BfU6kdYOtggiNewyFm1r7K383zIf708z2ySUBJKm2YrE4Jl19tCK5Q5PbcG
 puY=
 =+z7x
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fixes from Masami Hiramatsu:

 - fprobe: Even if there is a memory allocation failure, try to remove
   the addresses recorded until then from the filter. Previously we just
   skipped it.

 - tracing: dynevent: Add a missing lockdown check on dynevent. This
   dynevent is the interface for all probe events. Thus if there is no
   check, any probe events can be added after lock down the tracefs.

* tag 'probes-fixes-v6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: dynevent: Add a missing lockdown check on dynevent
  tracing: fprobe: Fix to remove recorded module addresses from filter
2025-09-24 19:17:07 -07:00
Masami Hiramatsu (Google)
456c32e3c4 tracing: dynevent: Add a missing lockdown check on dynevent
Since dynamic_events interface on tracefs is compatible with
kprobe_events and uprobe_events, it should also check the lockdown
status and reject if it is set.

Link: https://lore.kernel.org/all/175824455687.45175.3734166065458520748.stgit@devnote2/

Fixes: 17911ff38a ("tracing: Add locked_down checks to the open calls of files created for tracefs")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
2025-09-25 00:22:46 +09:00
Masami Hiramatsu (Google)
c539feff3c tracing: fprobe: Fix to remove recorded module addresses from filter
Even if there is a memory allocation failure in fprobe_addr_list_add(),
there is a partial list of module addresses. So remove the recorded
addresses from filter if exists.
This also removes the redundant ret local variable.

Fixes: a3dc2983ca ("tracing: fprobe: Cleanup fprobe hash when module unloading")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Reviewed-by: Menglong Dong <menglong8.dong@gmail.com>
2025-09-24 23:18:26 +09:00
Jiri Olsa
7384893d97 bpf: Allow uprobe program to change context registers
Currently uprobe (BPF_PROG_TYPE_KPROBE) program can't write to the
context registers data. While this makes sense for kprobe attachments,
for uprobe attachment it might make sense to be able to change user
space registers to alter application execution.

Since uprobe and kprobe programs share the same type (BPF_PROG_TYPE_KPROBE),
we can't deny write access to context during the program load. We need
to check on it during program attachment to see if it's going to be
kprobe or uprobe.

Storing the program's write attempt to context and checking on it
during the attachment.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20250916215301.664963-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-24 02:25:06 -07:00
Masami Hiramatsu (Google)
1da3f145ed tracing: dynevent: Add a missing lockdown check on dynevent
Since dynamic_events interface on tracefs is compatible with
kprobe_events and uprobe_events, it should also check the lockdown
status and reject if it is set.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/175824455687.45175.3734166065458520748.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-23 11:02:14 -04:00
Wang Liang
a2501032de tracing/osnoise: Fix slab-out-of-bounds in _parse_integer_limit()
When config osnoise cpus by write() syscall, the following KASAN splat may
be observed:

BUG: KASAN: slab-out-of-bounds in _parse_integer_limit+0x103/0x130
Read of size 1 at addr ffff88810121e3a1 by task test/447
CPU: 1 UID: 0 PID: 447 Comm: test Not tainted 6.17.0-rc6-dirty #288 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x55/0x70
 print_report+0xcb/0x610
 kasan_report+0xb8/0xf0
 _parse_integer_limit+0x103/0x130
 bitmap_parselist+0x16d/0x6f0
 osnoise_cpus_write+0x116/0x2d0
 vfs_write+0x21e/0xcc0
 ksys_write+0xee/0x1c0
 do_syscall_64+0xa8/0x2a0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
 </TASK>

This issue can be reproduced by below code:

const char *cpulist = "1";
int fd=open("/sys/kernel/debug/tracing/osnoise/cpus", O_WRONLY);
write(fd, cpulist, strlen(cpulist));

Function bitmap_parselist() was called to parse cpulist, it require that
the parameter 'buf' must be terminated with a '\0' or '\n'. Fix this issue
by adding a '\0' to 'buf' in osnoise_cpus_write().

Cc: <mhiramat@kernel.org>
Cc: <mathieu.desnoyers@efficios.com>
Cc: <tglozar@redhat.com>
Link: https://lore.kernel.org/20250916063948.3154627-1-wangliang74@huawei.com
Fixes: 17f89102fe ("tracing/osnoise: Allow arbitrarily long CPU string")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-23 10:59:52 -04:00
Marco Crivellari
70bd70c303 tracing: replace use of system_wq with system_percpu_wq
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.

This lack of consistentcy cannot be addressed without refactoring the API.

system_wq is a per-CPU worqueue, yet nothing in its name tells about that
CPU affinity constraint, which is very often not required by users. Make
it clear by adding a system_percpu_wq.

queue_work() / queue_delayed_work() mod_delayed_work() will now use the
new per-cpu wq: whether the user still stick on the old name a warn will
be printed along a wq redirect to the new one.

This patch add the new system_percpu_wq except for mm, fs and net
subsystem, whom are handled in separated patches.

The old wq will be kept for a few release cylces.

Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/20250905091040.109772-2-marco.crivellari@suse.com
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-23 10:44:54 -04:00
Liao Yuanhong
8613a55ac5 tracing: Remove redundant 0 value initialization
The saved_cmdlines_buffer struct is already zeroed by memset(). It's
redundant to initialize s->cmdline_idx to 0.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250825123200.306272-1-liaoyuanhong@vivo.com
Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-23 10:31:14 -04:00
Fushuai Wang
1d67d67a8c tracing/osnoise: Use for_each_online_cpu() instead of for_each_cpu()
Replace the opencoded for_each_cpu(cpu, cpu_online_mask) loop with the
more readable and equivalent for_each_online_cpu(cpu) macro.

Link: https://lore.kernel.org/20250811064158.2456-1-wangfushuai@baidu.com
Signed-off-by: Fushuai Wang <wangfushuai@baidu.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-23 09:34:57 -04:00
Qianfeng Rong
09da59344a tracing: Use vmalloc_array() to improve code
Remove array_size() calls and replace vmalloc() with vmalloc_array() in
tracing_map_sort_entries().  vmalloc_array() is optimized better, uses
fewer instructions, and handles overflow more concisely[1].

[1]: https://lore.kernel.org/lkml/abc66ec5-85a4-47e1-9759-2f60ab111971@vivo.com/

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250817084725.59477-1-rongqianfeng@vivo.com
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-23 09:31:58 -04:00
Steven Rostedt
3add2d34bd tracing: Have syscall trace events show "0x" for values greater than 10
Currently the syscall trace events show each value as hexadecimal, but
without adding "0x" it can be confusing:

   sys_write(fd: 4, buf: 0x55c4a1fa9270, count: 44)

Looks like the above write wrote 44 bytes, when in reality it wrote 68
bytes.

Add a "0x" for all values greater or equal to 10 to remove the ambiguity.
For values less than 10, leave off the "0x" as that just adds noise to the
output.

Also change the iterator to check if "i" is nonzero and print the ", "
delimiter at the start, then adding the logic to the trace_seq_printf() at
the end.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Takaya Saeki <takayas@google.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ian Rogers <irogers@google.com>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Link: https://lore.kernel.org/20250923130713.764558957@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-23 09:29:29 -04:00
Steven Rostedt
17a1a107d0 tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
The syscall events are pseudo events that hook to the raw syscalls. The
ftrace_syscall_enter/exit() callback is called by the raw_syscall
enter/exit tracepoints respectively whenever any of the syscall events are
enabled.

The trace_array has an array of syscall "files" that correspond to the
system calls based on their __NR_SYSCALL number. The array is read and if
there's a pointer to a trace_event_file then it is considered enabled and
if it is NULL that syscall event is considered disabled.

Currently it uses an rcu_dereference_sched() to get this pointer and a
rcu_assign_ptr() or RCU_INIT_POINTER() to write to it. This is unnecessary
as the file pointer will not go away outside the synchronization of the
tracepoint logic itself. And this code adds no extra RCU synchronization
that uses this.

Replace these functions with a simple READ_ONCE() and WRITE_ONCE() which
is all they need. This will also allow this code to not depend on
preemption being disabled as system call tracepoints are now allowed to
fault.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Takaya Saeki <takayas@google.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ian Rogers <irogers@google.com>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Link: https://lore.kernel.org/20250923130713.594320290@kernel.org
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-23 09:29:29 -04:00
KP Singh
8cd189e414 bpf: Move the signature kfuncs to helpers.c
No functional changes, except for the addition of the headers for the
kfuncs so that they can be used for signature verification.

Signed-off-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/20250914215141.15144-8-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-18 19:11:42 -07:00
Linus Torvalds
097a6c336d Runtime Verifier fixes for v6.17
- Fix build in some RISC-V flavours
 
   Some system calls only are available for the 64bit RISC-V machines.
   #ifdef out the cases of clock_nanosleep and futex in the sleep monitor
   if they are not supported by the architecture.
 
 - Fix wrong cast, obsolete after refactoring
 
   Use container_of() to get to the rv_monitor structure from the
   enable_monitors_next() 'p' pointer. The assignment worked only because
   the list field used happened to be the first field of the structure.
 
 - Remove redundant include files
 
   Some include files were listed twice. Remove the extra ones and sort
   the includes.
 
 - Fix missing unlock on failure
 
   There was an error path that exited the rv_register_monitor() function
   without releasing a lock. Change that to goto the lock release.
 
 - Add Gabriele Monaco to be Runtime Verifier maintainer
 
   Gabriele is doing most of the work on RV as well as collecting patches.
   Add him to the maintainers file for Runtime Verification.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaMxsBRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qk5oAP4tlGnMBoLpZXVBpAubVUQOVRfQo5dI
 ar9LpXdgnj4xQAEA9Q5uIvhCI/CMXTK98gFhR31p9O4Sqtn0JlCViBbVSQg=
 =tUQG
 -----END PGP SIGNATURE-----

Merge tag 'trace-rv-v6.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull runtime verifier fixes from Steven Rostedt:

 - Fix build in some RISC-V flavours

   Some system calls only are available for the 64bit RISC-V machines.
   #ifdef out the cases of clock_nanosleep and futex in the sleep
   monitor if they are not supported by the architecture.

 - Fix wrong cast, obsolete after refactoring

   Use container_of() to get to the rv_monitor structure from the
   enable_monitors_next() 'p' pointer. The assignment worked only
   because the list field used happened to be the first field of the
   structure.

 - Remove redundant include files

   Some include files were listed twice. Remove the extra ones and sort
   the includes.

 - Fix missing unlock on failure

   There was an error path that exited the rv_register_monitor()
   function without releasing a lock. Change that to goto the lock
   release.

 - Add Gabriele Monaco to be Runtime Verifier maintainer

   Gabriele is doing most of the work on RV as well as collecting
   patches. Add him to the maintainers file for Runtime Verification.

* tag 'trace-rv-v6.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rv: Add Gabriele Monaco as maintainer for Runtime Verification
  rv: Fix missing mutex unlock in rv_register_monitor()
  include/linux/rv.h: remove redundant include file
  rv: Fix wrong type cast in enabled_monitors_next()
  rv: Support systems with time64-only syscalls
2025-09-18 15:22:00 -07:00
Wang Liang
dc3382fffd tracing: kprobe-event: Fix null-ptr-deref in trace_kprobe_create_internal()
A crash was observed with the following output:

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 1 UID: 0 PID: 2899 Comm: syz.2.399 Not tainted 6.17.0-rc5+ #5 PREEMPT(none)
RIP: 0010:trace_kprobe_create_internal+0x3fc/0x1440 kernel/trace/trace_kprobe.c:911
Call Trace:
 <TASK>
 trace_kprobe_create_cb+0xa2/0xf0 kernel/trace/trace_kprobe.c:1089
 trace_probe_create+0xf1/0x110 kernel/trace/trace_probe.c:2246
 dyn_event_create+0x45/0x70 kernel/trace/trace_dynevent.c:128
 create_or_delete_trace_kprobe+0x5e/0xc0 kernel/trace/trace_kprobe.c:1107
 trace_parse_run_command+0x1a5/0x330 kernel/trace/trace.c:10785
 vfs_write+0x2b6/0xd00 fs/read_write.c:684
 ksys_write+0x129/0x240 fs/read_write.c:738
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x5d/0x2d0 arch/x86/entry/syscall_64.c:94
 </TASK>

Function kmemdup() may return NULL in trace_kprobe_create_internal(), add
check for it's return value.

Link: https://lore.kernel.org/all/20250916075816.3181175-1-wangliang74@huawei.com/

Fixes: 33b4e38baa ("tracing: kprobe-event: Allocate string buffers from heap")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-09-18 07:36:41 +09:00
Al Viro
1b8abbb121 bpf...d_path(): constify path argument
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:17:08 -04:00
Zhen Ni
9b5096761c rv: Fix missing mutex unlock in rv_register_monitor()
If create_monitor_dir() fails, the function returns directly without
releasing rv_interface_lock. This leaves the mutex locked and causes
subsequent monitor registration attempts to deadlock.

Fix it by making the error path jump to out_unlock, ensuring that the
mutex is always released before returning.

Fixes: 24cbfe18d5 ("rv: Merge struct rv_monitor_def into struct rv_monitor")
Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20250903065112.1878330-1-zhen.ni@easystack.cn
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-09-15 08:36:35 +02:00
Nam Cao
de090d1cca rv: Fix wrong type cast in enabled_monitors_next()
Argument 'p' of enabled_monitors_next() is not a pointer to struct
rv_monitor, it is actually a pointer to the list_head inside struct
rv_monitor. Therefore it is wrong to cast 'p' to struct rv_monitor *.

This wrong type cast has been there since the beginning. But it still
worked because the list_head was the first field in struct rv_monitor_def.
This is no longer true since commit 24cbfe18d5 ("rv: Merge struct
rv_monitor_def into struct rv_monitor") moved the list_head, and this wrong
type cast became a functional problem.

Properly use container_of() instead.

Fixes: 24cbfe18d5 ("rv: Merge struct rv_monitor_def into struct rv_monitor")
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20250806120911.989365-1-namcao@linutronix.de
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-09-15 08:36:35 +02:00
Palmer Dabbelt
03ee64b5e5 rv: Support systems with time64-only syscalls
Some systems (like 32-bit RISC-V) only have the 64-bit time_t versions
of syscalls.  So handle the 32-bit time_t version of those being
undefined.

Fixes: f74f8bb246 ("rv: Add rtapp_sleep monitor")
Closes: https://lore.kernel.org/oe-kbuild-all/202508160204.SsFyNfo6-lkp@intel.com
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
Acked-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20250804194518.97620-2-palmer@dabbelt.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-09-15 08:36:27 +02:00
Alexei Starovoitov
5d87e96a49 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc5
Cross-merge BPF and other fixes after downstream PR.

No conflicts.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-11 09:34:37 -07:00
Pu Lehui
cd4453c5e9 tracing: Silence warning when chunk allocation fails in trace_pid_write
Syzkaller trigger a fault injection warning:

WARNING: CPU: 1 PID: 12326 at tracepoint_add_func+0xbfc/0xeb0
Modules linked in:
CPU: 1 UID: 0 PID: 12326 Comm: syz.6.10325 Tainted: G U 6.14.0-rc5-syzkaller #0
Tainted: [U]=USER
Hardware name: Google Compute Engine/Google Compute Engine
RIP: 0010:tracepoint_add_func+0xbfc/0xeb0 kernel/tracepoint.c:294
Code: 09 fe ff 90 0f 0b 90 0f b6 74 24 43 31 ff 41 bc ea ff ff ff
RSP: 0018:ffffc9000414fb48 EFLAGS: 00010283
RAX: 00000000000012a1 RBX: ffffffff8e240ae0 RCX: ffffc90014b78000
RDX: 0000000000080000 RSI: ffffffff81bbd78b RDI: 0000000000000001
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffffffffffffffef
R13: 0000000000000000 R14: dffffc0000000000 R15: ffffffff81c264f0
FS:  00007f27217f66c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2e80dff8 CR3: 00000000268f8000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 tracepoint_probe_register_prio+0xc0/0x110 kernel/tracepoint.c:464
 register_trace_prio_sched_switch include/trace/events/sched.h:222 [inline]
 register_pid_events kernel/trace/trace_events.c:2354 [inline]
 event_pid_write.isra.0+0x439/0x7a0 kernel/trace/trace_events.c:2425
 vfs_write+0x24c/0x1150 fs/read_write.c:677
 ksys_write+0x12b/0x250 fs/read_write.c:731
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

We can reproduce the warning by following the steps below:
1. echo 8 >> set_event_notrace_pid. Let tr->filtered_pids owns one pid
   and register sched_switch tracepoint.
2. echo ' ' >> set_event_pid, and perform fault injection during chunk
   allocation of trace_pid_list_alloc. Let pid_list with no pid and
assign to tr->filtered_pids.
3. echo ' ' >> set_event_pid. Let pid_list is NULL and assign to
   tr->filtered_pids.
4. echo 9 >> set_event_pid, will trigger the double register
   sched_switch tracepoint warning.

The reason is that syzkaller injects a fault into the chunk allocation
in trace_pid_list_alloc, causing a failure in trace_pid_list_set, which
may trigger double register of the same tracepoint. This only occurs
when the system is about to crash, but to suppress this warning, let's
add failure handling logic to trace_pid_list_set.

Link: https://lore.kernel.org/20250908024658.2390398-1-pulehui@huaweicloud.com
Fixes: 8d6e90983a ("tracing: Create a sparse bitmask for pid filtering")
Reported-by: syzbot+161412ccaeff20ce4dde@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67cb890e.050a0220.d8275.022e.GAE@google.com
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-08 14:56:43 -04:00
Wang Liang
c1628c00c4 tracing/osnoise: Fix null-ptr-deref in bitmap_parselist()
A crash was observed with the following output:

BUG: kernel NULL pointer dereference, address: 0000000000000010
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 2 UID: 0 PID: 92 Comm: osnoise_cpus Not tainted 6.17.0-rc4-00201-gd69eb204c255 #138 PREEMPT(voluntary)
RIP: 0010:bitmap_parselist+0x53/0x3e0
Call Trace:
 <TASK>
 osnoise_cpus_write+0x7a/0x190
 vfs_write+0xf8/0x410
 ? do_sys_openat2+0x88/0xd0
 ksys_write+0x60/0xd0
 do_syscall_64+0xa4/0x260
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
 </TASK>

This issue can be reproduced by below code:

fd=open("/sys/kernel/debug/tracing/osnoise/cpus", O_WRONLY);
write(fd, "0-2", 0);

When user pass 'count=0' to osnoise_cpus_write(), kmalloc() will return
ZERO_SIZE_PTR (16) and cpulist_parse() treat it as a normal value, which
trigger the null pointer dereference. Add check for the parameter 'count'.

Cc: <mhiramat@kernel.org>
Cc: <mathieu.desnoyers@efficios.com>
Cc: <tglozar@redhat.com>
Link: https://lore.kernel.org/20250906035610.3880282-1-wangliang74@huawei.com
Fixes: 17f89102fe ("tracing/osnoise: Allow arbitrarily long CPU string")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-06 12:12:38 -04:00
Guenter Roeck
ab1396af75 trace/fgraph: Fix error handling
Commit edede7a6dc ("trace/fgraph: Fix the warning caused by missing
unregister notifier") added a call to unregister the PM notifier if
register_ftrace_graph() failed. It does so unconditionally. However,
the PM notifier is only registered with the first call to
register_ftrace_graph(). If the first registration was successful and
a subsequent registration failed, the notifier is now unregistered even
if ftrace graphs are still registered.

Fix the problem by only unregistering the PM notifier during error handling
if there are no active fgraph registrations.

Fixes: edede7a6dc ("trace/fgraph: Fix the warning caused by missing unregister notifier")
Closes: https://lore.kernel.org/all/63b0ba5a-a928-438e-84f9-93028dd72e54@roeck-us.net/
Cc: Ye Weihua <yeweihua4@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250906050618.2634078-1-linux@roeck-us.net
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-06 12:12:38 -04:00
Al Viro
b28f9eba12 change the calling conventions for vfs_parse_fs_string()
Absolute majority of callers are passing the 4th argument equal to
strlen() of the 3rd one.

Drop the v_size argument, add vfs_parse_fs_qstr() for the cases that
want independent length.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-04 15:20:51 -04:00
Luo Gengkun
3d62ab32df tracing: Fix tracing_marker may trigger page fault during preempt_disable
Both tracing_mark_write and tracing_mark_raw_write call
__copy_from_user_inatomic during preempt_disable. But in some case,
__copy_from_user_inatomic may trigger page fault, and will call schedule()
subtly. And if a task is migrated to other cpu, the following warning will
be trigger:
        if (RB_WARN_ON(cpu_buffer,
                       !local_read(&cpu_buffer->committing)))

An example can illustrate this issue:

process flow						CPU
---------------------------------------------------------------------

tracing_mark_raw_write():				cpu:0
   ...
   ring_buffer_lock_reserve():				cpu:0
      ...
      cpu = raw_smp_processor_id()			cpu:0
      cpu_buffer = buffer->buffers[cpu]			cpu:0
      ...
   ...
   __copy_from_user_inatomic():				cpu:0
      ...
      # page fault
      do_mem_abort():					cpu:0
         ...
         # Call schedule
         schedule()					cpu:0
	 ...
   # the task schedule to cpu1
   __buffer_unlock_commit():				cpu:1
      ...
      ring_buffer_unlock_commit():			cpu:1
	 ...
	 cpu = raw_smp_processor_id()			cpu:1
	 cpu_buffer = buffer->buffers[cpu]		cpu:1

As shown above, the process will acquire cpuid twice and the return values
are not the same.

To fix this problem using copy_from_user_nofault instead of
__copy_from_user_inatomic, as the former performs 'access_ok' before
copying.

Link: https://lore.kernel.org/20250819105152.2766363-1-luogengkun@huaweicloud.com
Fixes: 656c7f0d2d ("tracing: Replace kmap with copy_from_user() in trace_marker writing")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-02 12:02:42 -04:00
Qianfeng Rong
81ac63321e trace: Remove redundant __GFP_NOWARN
Commit 16f5dfbc85 ("gfp: include __GFP_NOWARN in GFP_NOWAIT")
made GFP_NOWAIT implicitly include __GFP_NOWARN.

Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT
(e.g., `GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean
up these redundant flags across subsystems.

No functional changes.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250805023630.335719-1-rongqianfeng@vivo.com
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-02 11:48:16 -04:00
Linus Torvalds
e1d8f9ccb2 tracing fixes for v6.17-rc2:
- Fix rtla and latency tooling pkg-config errors
 
   If libtraceevent and libtracefs is installed, but their corresponding '.pc'
   files are not installed, it reports that the libraries are missing and
   confuses the developer. Instead, report that the pkg-config files are
   missing and should be installed.
 
 - Fix overflow bug of the parser in trace_get_user()
 
   trace_get_user() uses the parsing functions to parse the user space strings.
   If the parser fails due to incorrect processing, it doesn't terminate the
   buffer with a nul byte. Add a "failed" flag to the parser that gets set when
   parsing fails and is used to know if the buffer is fine to use or not.
 
 - Remove a semicolon that was at an end of a comment line
 
 - Fix register_ftrace_graph() to unregister the pm notifier on error
 
   The register_ftrace_graph() registers a pm notifier but there's an error
   path that can exit the function without unregistering it. Since the function
   returns an error, it will never be unregistered.
 
 - Allocate and copy ftrace hash for reader of ftrace filter files
 
   When the set_ftrace_filter or set_ftrace_notrace files are open for read,
   an iterator is created and sets its hash pointer to the associated hash that
   represents filtering or notrace filtering to it. The issue is that the hash
   it points to can change while the iteration is happening. All the locking
   used to access the tracer's hashes are released which means those hashes can
   change or even be freed. Using the hash pointed to by the iterator can cause
   UAF bugs or similar.
 
   Have the read of these files allocate and copy the corresponding hashes and
   use that as that will keep them the same while the iterator is open. This
   also simplifies the code as opening it for write already does an allocate
   and copy, and now that the read is doing the same, there's no need to check
   which way it was opened on the release of the file, and the iterator hash
   can always be freed.
 
 - Fix function graph to copy args into temp storage
 
   The output of the function graph tracer shows both the entry and the exit of
   a function. When the exit is right after the entry, it combines the two
   events into one with the output of "function();", instead of showing:
 
   function() {
   }
 
   In order to do this, the iterator descriptor that reads the events includes
   storage that saves the entry event while it peaks at the next event in
   the ring buffer. The peek can free the entry event so the iterator must
   store the information to use it after the peek.
 
   With the addition of function graph tracer recording the args, where the
   args are a dynamic array in the entry event, the temp storage does not save
   them. This causes the args to be corrupted or even cause a read of unsafe
   memory.
 
   Add space to save the args in the temp storage of the iterator.
 
 - Fix race between ftrace_dump and reading trace_pipe
 
   ftrace_dump() is used when a crash occurs where the ftrace buffer will be
   printed to the console. But it can also be triggered by sysrq-z. If a
   sysrq-z is triggered while a task is reading trace_pipe it can cause a race
   in the ftrace_dump() where it checks if the buffer has content, then it
   checks if the next event is available, and then prints the output
   (regardless if the next event was available or not). Reading trace_pipe
   at the same time can cause it to not be available, and this triggers a
   WARN_ON in the print. Move the printing into the check if the next event
   exists or not.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaKnAGRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qotPAQD02idezasiFi0vakLTR+0x/uAI2UOL
 5RLfTwmZW7S1FwEAwOvGpKx3k/kUwDp5EReP34A+1Fqyc5Mvps4UCE1s4gM=
 =ENHu
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.17-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix rtla and latency tooling pkg-config errors

   If libtraceevent and libtracefs is installed, but their corresponding
   '.pc' files are not installed, it reports that the libraries are
   missing and confuses the developer. Instead, report that the
   pkg-config files are missing and should be installed.

 - Fix overflow bug of the parser in trace_get_user()

   trace_get_user() uses the parsing functions to parse the user space
   strings. If the parser fails due to incorrect processing, it doesn't
   terminate the buffer with a nul byte. Add a "failed" flag to the
   parser that gets set when parsing fails and is used to know if the
   buffer is fine to use or not.

 - Remove a semicolon that was at an end of a comment line

 - Fix register_ftrace_graph() to unregister the pm notifier on error

   The register_ftrace_graph() registers a pm notifier but there's an
   error path that can exit the function without unregistering it. Since
   the function returns an error, it will never be unregistered.

 - Allocate and copy ftrace hash for reader of ftrace filter files

   When the set_ftrace_filter or set_ftrace_notrace files are open for
   read, an iterator is created and sets its hash pointer to the
   associated hash that represents filtering or notrace filtering to it.
   The issue is that the hash it points to can change while the
   iteration is happening. All the locking used to access the tracer's
   hashes are released which means those hashes can change or even be
   freed. Using the hash pointed to by the iterator can cause UAF bugs
   or similar.

   Have the read of these files allocate and copy the corresponding
   hashes and use that as that will keep them the same while the
   iterator is open. This also simplifies the code as opening it for
   write already does an allocate and copy, and now that the read is
   doing the same, there's no need to check which way it was opened on
   the release of the file, and the iterator hash can always be freed.

 - Fix function graph to copy args into temp storage

   The output of the function graph tracer shows both the entry and the
   exit of a function. When the exit is right after the entry, it
   combines the two events into one with the output of "function();",
   instead of showing:

     function() {
     }

   In order to do this, the iterator descriptor that reads the events
   includes storage that saves the entry event while it peaks at the
   next event in the ring buffer. The peek can free the entry event so
   the iterator must store the information to use it after the peek.

   With the addition of function graph tracer recording the args, where
   the args are a dynamic array in the entry event, the temp storage
   does not save them. This causes the args to be corrupted or even
   cause a read of unsafe memory.

   Add space to save the args in the temp storage of the iterator.

 - Fix race between ftrace_dump and reading trace_pipe

   ftrace_dump() is used when a crash occurs where the ftrace buffer
   will be printed to the console. But it can also be triggered by
   sysrq-z. If a sysrq-z is triggered while a task is reading trace_pipe
   it can cause a race in the ftrace_dump() where it checks if the
   buffer has content, then it checks if the next event is available,
   and then prints the output (regardless if the next event was
   available or not). Reading trace_pipe at the same time can cause it
   to not be available, and this triggers a WARN_ON in the print. Move
   the printing into the check if the next event exists or not

* tag 'trace-v6.17-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ftrace: Also allocate and copy hash for reading of filter files
  ftrace: Fix potential warning in trace_printk_seq during ftrace_dump
  fgraph: Copy args in intermediate storage with entry
  trace/fgraph: Fix the warning caused by missing unregister notifier
  ring-buffer: Remove redundant semicolons
  tracing: Limit access to parser->buffer when trace_get_user failed
  rtla: Check pkg-config install
  tools/latency-collector: Check pkg-config install
2025-08-23 10:11:34 -04:00
Steven Rostedt
bfb336cf97 ftrace: Also allocate and copy hash for reading of filter files
Currently the reader of set_ftrace_filter and set_ftrace_notrace just adds
the pointer to the global tracer hash to its iterator. Unlike the writer
that allocates a copy of the hash, the reader keeps the pointer to the
filter hashes. This is problematic because this pointer is static across
function calls that release the locks that can update the global tracer
hashes. This can cause UAF and similar bugs.

Allocate and copy the hash for reading the filter files like it is done
for the writers. This not only fixes UAF bugs, but also makes the code a
bit simpler as it doesn't have to differentiate when to free the
iterator's hash between writers and readers.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20250822183606.12962cc3@batman.local.home
Fixes: c20489dad1 ("ftrace: Assign iter->hash to filter or notrace hashes on seq read")
Closes: https://lore.kernel.org/all/20250813023044.2121943-1-wutengda@huaweicloud.com/
Closes: https://lore.kernel.org/all/20250822192437.GA458494@ax162/
Reported-by: Tengda Wu <wutengda@huaweicloud.com>
Tested-by: Tengda Wu <wutengda@huaweicloud.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-22 19:58:35 -04:00
Tengda Wu
4013aef2ce ftrace: Fix potential warning in trace_printk_seq during ftrace_dump
When calling ftrace_dump_one() concurrently with reading trace_pipe,
a WARN_ON_ONCE() in trace_printk_seq() can be triggered due to a race
condition.

The issue occurs because:

CPU0 (ftrace_dump)                              CPU1 (reader)
echo z > /proc/sysrq-trigger

!trace_empty(&iter)
trace_iterator_reset(&iter) <- len = size = 0
                                                cat /sys/kernel/tracing/trace_pipe
trace_find_next_entry_inc(&iter)
  __find_next_entry
    ring_buffer_empty_cpu <- all empty
  return NULL

trace_printk_seq(&iter.seq)
  WARN_ON_ONCE(s->seq.len >= s->seq.size)

In the context between trace_empty() and trace_find_next_entry_inc()
during ftrace_dump, the ring buffer data was consumed by other readers.
This caused trace_find_next_entry_inc to return NULL, failing to populate
`iter.seq`. At this point, due to the prior trace_iterator_reset, both
`iter.seq.len` and `iter.seq.size` were set to 0. Since they are equal,
the WARN_ON_ONCE condition is triggered.

Move the trace_printk_seq() into the if block that checks to make sure the
return value of trace_find_next_entry_inc() is non-NULL in
ftrace_dump_one(), ensuring the 'iter.seq' is properly populated before
subsequent operations.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@elte.hu>
Link: https://lore.kernel.org/20250822033343.3000289-1-wutengda@huaweicloud.com
Fixes: d769041f86 ("ring_buffer: implement new locking")
Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-22 17:32:36 -04:00
Steven Rostedt
e3d01979e4 fgraph: Copy args in intermediate storage with entry
The output of the function graph tracer has two ways to display its
entries. One way for leaf functions with no events recorded within them,
and the other is for functions with events recorded inside it. As function
graph has an entry and exit event, to simplify the output of leaf
functions it combines the two, where as non leaf functions are separate:

 2)               |              invoke_rcu_core() {
 2)               |                raise_softirq() {
 2)   0.391 us    |                  __raise_softirq_irqoff();
 2)   1.191 us    |                }
 2)   2.086 us    |              }

The __raise_softirq_irqoff() function above is really two events that were
merged into one. Otherwise it would have looked like:

 2)               |              invoke_rcu_core() {
 2)               |                raise_softirq() {
 2)               |                  __raise_softirq_irqoff() {
 2)   0.391 us    |                  }
 2)   1.191 us    |                }
 2)   2.086 us    |              }

In order to do this merge, the reading of the trace output file needs to
look at the next event before printing. But since the pointer to the event
is on the ring buffer, it needs to save the entry event before it looks at
the next event as the next event goes out of focus as soon as a new event
is read from the ring buffer. After it reads the next event, it will print
the entry event with either the '{' (non leaf) or ';' and timestamps (leaf).

The iterator used to read the trace file has storage for this event. The
problem happens when the function graph tracer has arguments attached to
the entry event as the entry now has a variable length "args" field. This
field only gets set when funcargs option is used. But the args are not
recorded in this temp data and garbage could be printed. The entry field
is copied via:

  data->ent = *curr;

Where "curr" is the entry field. But this method only saves the non
variable length fields from the structure.

Add a helper structure to the iterator data that adds the max args size to
the data storage in the iterator. Then simply copy the entire entry into
this storage (with size protection).

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20250820195522.51d4a268@gandalf.local.home
Reported-by: Sasha Levin <sashal@kernel.org>
Tested-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/all/aJaxRVKverIjF4a6@lappy/
Fixes: ff5c9c576e ("ftrace: Add support for function argument to graph tracer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-22 17:32:35 -04:00
Masami Hiramatsu (Google)
ec879e1a0b tracing: fprobe-event: Sanitize wildcard for fprobe event name
Fprobe event accepts wildcards for the target functions, but unless user
specifies its event name, it makes an event with the wildcards.

  /sys/kernel/tracing # echo 'f mutex*' >> dynamic_events
  /sys/kernel/tracing # cat dynamic_events
  f:fprobes/mutex*__entry mutex*
  /sys/kernel/tracing # ls events/fprobes/
  enable         filter         mutex*__entry

To fix this, replace the wildcard ('*') with an underscore.

Link: https://lore.kernel.org/all/175535345114.282990.12294108192847938710.stgit@devnote2/

Fixes: 334e5519c3 ("tracing/probes: Add fprobe events for tracing function entry and exit.")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
2025-08-20 23:41:58 +09:00
Ye Weihua
edede7a6dc trace/fgraph: Fix the warning caused by missing unregister notifier
This warning was triggered during testing on v6.16:

notifier callback ftrace_suspend_notifier_call already registered
WARNING: CPU: 2 PID: 86 at kernel/notifier.c:23 notifier_chain_register+0x44/0xb0
...
Call Trace:
 <TASK>
 blocking_notifier_chain_register+0x34/0x60
 register_ftrace_graph+0x330/0x410
 ftrace_profile_write+0x1e9/0x340
 vfs_write+0xf8/0x420
 ? filp_flush+0x8a/0xa0
 ? filp_close+0x1f/0x30
 ? do_dup2+0xaf/0x160
 ksys_write+0x65/0xe0
 do_syscall_64+0xa4/0x260
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

When writing to the function_profile_enabled interface, the notifier was
not unregistered after start_graph_tracing failed, causing a warning the
next time function_profile_enabled was written.

Fixed by adding unregister_pm_notifier in the exception path.

Link: https://lore.kernel.org/20250818073332.3890629-1-yeweihua4@huawei.com
Fixes: 4a2b8dda3f ("tracing/function-graph-tracer: fix a regression while suspend to disk")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Ye Weihua <yeweihua4@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-20 09:21:03 -04:00
Liao Yuanhong
cd6e4faba9 ring-buffer: Remove redundant semicolons
Remove unnecessary semicolons.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250813095114.559530-1-liaoyuanhong@vivo.com
Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-20 09:20:30 -04:00
Pu Lehui
6a909ea83f tracing: Limit access to parser->buffer when trace_get_user failed
When the length of the string written to set_ftrace_filter exceeds
FTRACE_BUFF_MAX, the following KASAN alarm will be triggered:

BUG: KASAN: slab-out-of-bounds in strsep+0x18c/0x1b0
Read of size 1 at addr ffff0000d00bd5ba by task ash/165

CPU: 1 UID: 0 PID: 165 Comm: ash Not tainted 6.16.0-g6bcdbd62bd56-dirty
Hardware name: linux,dummy-virt (DT)
Call trace:
 show_stack+0x34/0x50 (C)
 dump_stack_lvl+0xa0/0x158
 print_address_description.constprop.0+0x88/0x398
 print_report+0xb0/0x280
 kasan_report+0xa4/0xf0
 __asan_report_load1_noabort+0x20/0x30
 strsep+0x18c/0x1b0
 ftrace_process_regex.isra.0+0x100/0x2d8
 ftrace_regex_release+0x484/0x618
 __fput+0x364/0xa58
 ____fput+0x28/0x40
 task_work_run+0x154/0x278
 do_notify_resume+0x1f0/0x220
 el0_svc+0xec/0xf0
 el0t_64_sync_handler+0xa0/0xe8
 el0t_64_sync+0x1ac/0x1b0

The reason is that trace_get_user will fail when processing a string
longer than FTRACE_BUFF_MAX, but not set the end of parser->buffer to 0.
Then an OOB access will be triggered in ftrace_regex_release->
ftrace_process_regex->strsep->strpbrk. We can solve this problem by
limiting access to parser->buffer when trace_get_user failed.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250813040232.1344527-1-pulehui@huaweicloud.com
Fixes: 8c9af478c0 ("ftrace: Handle commands when closing set_ftrace_filter file")
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-20 09:20:30 -04:00
Tao Chen
abdaf49be5 bpf: Remove migrate_disable in kprobe_multi_link_prog_run
Graph tracer framework ensures we won't migrate, kprobe_multi_link_prog_run
called all the way from graph tracer, which disables preemption in
function_graph_enter_regs, as Jiri and Yonghong suggested, there is no
need to use migrate_disable. As a result, some overhead may will be reduced.
And add cant_sleep check for __this_cpu_inc_return.

Fixes: 0dcac27254 ("bpf: Add multi kprobe link")
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250814121430.2347454-1-chen.dylane@linux.dev
2025-08-15 16:49:31 -07:00
Linus Torvalds
e991acf1bc Significant patch series in this pull request:
- The 2 patch series "squashfs: Remove page->mapping references" from
   Matthew Wilcox gets us closer to being able to remove page->mapping.
 
 - The 5 patch series "relayfs: misc changes" from Jason Xing does some
   maintenance and minor feature addition work in relayfs.
 
 - The 5 patch series "kdump: crashkernel reservation from CMA" from Jiri
   Bohac switches us from static preallocation of the kdump crashkernel's
   working memory over to dynamic allocation.  So the difficulty of
   a-priori estimation of the second kernel's needs is removed and the
   first kernel obtains extra memory.
 
 - The 5 patch series "generalize panic_print's dump function to be used
   by other kernel parts" from Feng Tang implements some consolidation and
   rationalizatio of the various ways in which a faiing kernel splats
   information at the operator.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaI+82gAKCRDdBJ7gKXxA
 jj4JAP9xb+w9DrBY6sa+7KTPIb+aTqQ7Zw3o9O2m+riKQJv6jAEA6aEwRnDA0451
 fDT5IqVlCWGvnVikdZHSnvhdD7TGsQ0=
 =rT71
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "Significant patch series in this pull request:

   - "squashfs: Remove page->mapping references" (Matthew Wilcox) gets
     us closer to being able to remove page->mapping

   - "relayfs: misc changes" (Jason Xing) does some maintenance and
     minor feature addition work in relayfs

   - "kdump: crashkernel reservation from CMA" (Jiri Bohac) switches
     us from static preallocation of the kdump crashkernel's working
     memory over to dynamic allocation. So the difficulty of a-priori
     estimation of the second kernel's needs is removed and the first
     kernel obtains extra memory

   - "generalize panic_print's dump function to be used by other
     kernel parts" (Feng Tang) implements some consolidation and
     rationalization of the various ways in which a failing kernel
     splats information at the operator

* tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (80 commits)
  tools/getdelays: add backward compatibility for taskstats version
  kho: add test for kexec handover
  delaytop: enhance error logging and add PSI feature description
  samples: Kconfig: fix spelling mistake "instancess" -> "instances"
  fat: fix too many log in fat_chain_add()
  scripts/spelling.txt: add notifer||notifier to spelling.txt
  xen/xenbus: fix typo "notifer"
  net: mvneta: fix typo "notifer"
  drm/xe: fix typo "notifer"
  cxl: mce: fix typo "notifer"
  KVM: x86: fix typo "notifer"
  MAINTAINERS: add maintainers for delaytop
  ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below()
  ucount: fix atomic_long_inc_below() argument type
  kexec: enable CMA based contiguous allocation
  stackdepot: make max number of pools boot-time configurable
  lib/xxhash: remove unused functions
  init/Kconfig: restore CONFIG_BROKEN help text
  lib/raid6: update recov_rvv.c zero page usage
  docs: update docs after introducing delaytop
  ...
2025-08-03 16:23:09 -07:00
Linus Torvalds
3c4a063b1f tracing cleanups for v6.17:
- Remove unneeded goto out statements
 
   Over time, the logic was restructured but left a "goto out" where the
   out label simply did a "return ret;". Instead of jumping to this out
   label, simply return immediately and remove the out label.
 
 - Add guard(ring_buffer_nest)
 
   Some calls to the tracing ring buffer can happen when the ring buffer is
   already being written to at the same context (for example, a
   trace_printk() in between a ring_buffer_lock_reserve() and a
   ring_buffer_unlock_commit()).
 
   In order to not trigger the recursion detection, these functions use
   ring_buffer_nest_start() and ring_buffer_nest_end(). Create a guard() for
   these functions so that their use cases can be simplified and not need to
   use goto for the release.
 
 - Clean up the tracing code with guard() and __free() logic
 
   There were several locations that were prime candidates for using guard()
   and __free() helpers. Switch them over to use them.
 
 - Fix output of function argument traces for unsigned int values
 
   The function tracer with "func-args" option set will record up to 6 argument
   registers and then use BTF to format them for human consumption when the
   trace file is read. There's several arguments that are "unsigned long" and
   even "unsigned int" that are either and address or a mask. It is easier to
   understand if they were printed using hexadecimal instead of decimal.
   The old method just printed all non-pointer values as signed integers,
   which made it even worse for unsigned integers.
 
   For instance, instead of:
 
     __local_bh_disable_ip(ip=-2127311112, cnt=256) <-handle_softirqs
 
   Show:
 
    __local_bh_disable_ip(ip=0xffffffff8133cef8, cnt=0x100) <-handle_softirqs
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaI9pOBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qkhoAQD+moa8M+WWUS9T9utwREytolfyNKEO
 dW0dPVzquX3L6gEAnc7zNla4QZJsdU1bHyhpDTn/Zhu11aMrzoxcBcdrSwI=
 =x79z
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull more tracing updates from Steven Rostedt:

 - Remove unneeded goto out statements

   Over time, the logic was restructured but left a "goto out" where the
   out label simply did a "return ret;". Instead of jumping to this out
   label, simply return immediately and remove the out label.

 - Add guard(ring_buffer_nest)

   Some calls to the tracing ring buffer can happen when the ring buffer
   is already being written to at the same context (for example, a
   trace_printk() in between a ring_buffer_lock_reserve() and a
   ring_buffer_unlock_commit()).

   In order to not trigger the recursion detection, these functions use
   ring_buffer_nest_start() and ring_buffer_nest_end(). Create a guard()
   for these functions so that their use cases can be simplified and not
   need to use goto for the release.

 - Clean up the tracing code with guard() and __free() logic

   There were several locations that were prime candidates for using
   guard() and __free() helpers. Switch them over to use them.

 - Fix output of function argument traces for unsigned int values

   The function tracer with "func-args" option set will record up to 6
   argument registers and then use BTF to format them for human
   consumption when the trace file is read. There are several arguments
   that are "unsigned long" and even "unsigned int" that are either and
   address or a mask. It is easier to understand if they were printed
   using hexadecimal instead of decimal. The old method just printed all
   non-pointer values as signed integers, which made it even worse for
   unsigned integers.

   For instance, instead of:

     __local_bh_disable_ip(ip=-2127311112, cnt=256) <-handle_softirqs

   show:

     __local_bh_disable_ip(ip=0xffffffff8133cef8, cnt=0x100) <-handle_softirqs"

* tag 'trace-v6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Have unsigned int function args displayed as hexadecimal
  ring-buffer: Convert ring_buffer_write() to use guard(preempt_notrace)
  tracing: Use __free(kfree) in trace.c to remove gotos
  tracing: Add guard() around locks and mutexes in trace.c
  tracing: Add guard(ring_buffer_nest)
  tracing: Remove unneeded goto out logic
2025-08-03 15:03:04 -07:00
Linus Torvalds
8877fcb70f This is a small set of changes for modules, primarily to extend module users
to use the module data structures in combination with the already no-op stub
 module functions, even when support for modules is disabled in the kernel
 configuration. This change follows the kernel's coding style for conditional
 compilation and allows kunit code to drop all CONFIG_MODULES ifdefs, which is
 also part of the changes. This should allow others part of the kernel to do the
 same cleanup.
 
 Note that this had a conflict with sysctl changes [1] but should be fixed now as I
 rebased on top.
 
 The remaining changes include a fix for module name length handling which could
 potentially lead to the removal of an incorrect module, and various cleanups.
 
 The module name fix and related cleanup has been in linux-next since Thursday
 (July 31) while the rest of the changes for a bit more than 3 weeks.
 
 Note that this currently has conflicts in next with kbuild's tree [2].
 
 Link: https://lore.kernel.org/all/20250714175916.774e6d79@canb.auug.org.au/ [1]
 Link: https://lore.kernel.org/all/20250801132941.6815d93d@canb.auug.org.au/ [2]
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE73Ua4R8Pc+G5xjxTQJ6jxB8ZUfsFAmiPQgkACgkQQJ6jxB8Z
 UfuPTA//XrRguJFBhQh6cUWqVleTNQJuhjiPsOSO5S52aVET4wsrnRNeM2eM5oqw
 0+6ELvhIJINQ1LjpOP8D67d8P5Ds1/qM1pbQIkQsoKiEj6E7Q4dXH5N0uyf/BzO3
 HaosLG9cpqcomlSorYEiYoPjqy9EChQzsi+YAYWAB+fW6bvU/AdUHTRH88m3ppBJ
 Y22BTTPOKKyj5/QgfY+kwH8TTnrzCzY8aoOqW7uimLI5h4c9dFQ2PigRJnoMfDG1
 11w5VshOTzZJvNFrUk5GVSirwlxdJDbW6dKfG0DD5+eNWK5dfIEc+/EcuhaGoPvO
 Euwv8VQubdxHTAG6kzHI0MtxAVQUM1gyz8zHiu18eW++GTtnTFs6m8E6H9AC176G
 nDkUh3qSxJN2HHgxtS9VUExEEZpYqtWeB9Zts8K3oSWvTaQenHWpVHPADkxzS4JU
 Jvkjq8SiKo+RqHxaOKfyf1RfOtYe5tjMCLrP7zX39d1+cwGxuc6mip/omY9HFDgn
 op132fYdt24JSHoioJDzRz9mTfvj3nICEmgX4D4WDQx5lP27CUcLugPnBNHPp0fu
 5hL+ajy8M8nq4zm/42Y+F7VS74TIA6mSnJKs9dMCknUWueD6HrDEU9xHi1YMpUMZ
 cBUSpU+P94dCIScwEzkp926vDnHyxCHLbpF1Jsq5qNNdj7AelHk=
 =4bGB
 -----END PGP SIGNATURE-----

Merge tag 'modules-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux

Pull module updates from Daniel Gomez:
 "This is a small set of changes for modules, primarily to extend module
  users to use the module data structures in combination with the
  already no-op stub module functions, even when support for modules is
  disabled in the kernel configuration. This change follows the kernel's
  coding style for conditional compilation and allows kunit code to drop
  all CONFIG_MODULES ifdefs, which is also part of the changes. This
  should allow others part of the kernel to do the same cleanup.

  The remaining changes include a fix for module name length handling
  which could potentially lead to the removal of an incorrect module,
  and various cleanups"

* tag 'modules-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
  module: Rename MAX_PARAM_PREFIX_LEN to __MODULE_NAME_LEN
  tracing: Replace MAX_PARAM_PREFIX_LEN with MODULE_NAME_LEN
  module: Restore the moduleparam prefix length check
  module: Remove unnecessary +1 from last_unloaded_module::name size
  module: Prevent silent truncation of module name in delete_module(2)
  kunit: test: Drop CONFIG_MODULE ifdeffery
  module: make structure definitions always visible
  module: move 'struct module_use' to internal.h
2025-08-03 14:16:52 -07:00
Steven Rostedt
3ca824369b tracing: Have unsigned int function args displayed as hexadecimal
Most function arguments that are passed in as unsigned int or unsigned
long are better displayed as hexadecimal than normal integer. For example,
the functions:

static void __create_object(unsigned long ptr, size_t size,
				int min_count, gfp_t gfp, unsigned int objflags);

static bool stack_access_ok(struct unwind_state *state, unsigned long _addr,
			    size_t len);

void __local_bh_disable_ip(unsigned long ip, unsigned int cnt);

Show up in the trace as:

    __create_object(ptr=-131387050520576, size=4096, min_count=1, gfp=3264, objflags=0) <-kmem_cache_alloc_noprof
    stack_access_ok(state=0xffffc9000233fc98, _addr=-60473102566256, len=8) <-unwind_next_frame
    __local_bh_disable_ip(ip=-2127311112, cnt=256) <-handle_softirqs

Instead, by displaying unsigned as hexadecimal, they look more like this:

    __create_object(ptr=0xffff8881028d2080, size=0x280, min_count=1, gfp=0x82820, objflags=0x0) <-kmem_cache_alloc_node_noprof
    stack_access_ok(state=0xffffc90000003938, _addr=0xffffc90000003930, len=0x8) <-unwind_next_frame
    __local_bh_disable_ip(ip=0xffffffff8133cef8, cnt=0x100) <-handle_softirqs

Which is much easier to understand as most unsigned longs are usually just
pointers. Even the "unsigned int cnt" in __local_bh_disable_ip() looks
better as hexadecimal as a lot of flags are passed as unsigned.

Changes since v2: https://lore.kernel.org/20250801111453.01502861@gandalf.local.home

- Use btf_int_encoding() instead of open coding it (Martin KaFai Lau)

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Link: https://lore.kernel.org/20250801165601.7770d65c@gandalf.local.home
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-01 19:14:51 -04:00
Steven Rostedt
db5f0c3e3e ring-buffer: Convert ring_buffer_write() to use guard(preempt_notrace)
The function ring_buffer_write() has a goto out to only do a
preempt_enable_notrace(). This can be replaced by a guard.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250801203858.205479143@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-01 16:49:15 -04:00
Steven Rostedt
12d5189615 tracing: Use __free(kfree) in trace.c to remove gotos
There's a couple of locations that have goto out in trace.c for the only
purpose of freeing a variable that was allocated. These can be replaced
with __free(kfree).

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250801203858.040892777@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-01 16:49:15 -04:00
Steven Rostedt
debe57fbe1 tracing: Add guard() around locks and mutexes in trace.c
There's several locations in trace.c that can be simplified by using
guards around raw_spin_lock_irqsave, mutexes and preempt disabling.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250801203857.879085376@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-01 16:49:15 -04:00
Steven Rostedt
788fa4b47c tracing: Add guard(ring_buffer_nest)
Some calls to the tracing ring buffer can happen when the ring buffer is
already being written to by the same context (for example, a
trace_printk() in between a ring_buffer_lock_reserve() and a
ring_buffer_unlock_commit()).

In order to not trigger the recursion detection, these functions use
ring_buffer_nest_start() and ring_buffer_nest_end(). Create a guard() for
these functions so that their use cases can be simplified and not need to
use goto for the release.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250801203857.710501021@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-01 16:49:15 -04:00
Steven Rostedt
c89504a703 tracing: Remove unneeded goto out logic
Several places in the trace.c file there's a goto out where the out is
simply a return. There's no reason to jump to the out label if it's not
doing any more logic but simply returning from the function.

Replace the goto outs with a return and remove the out labels.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250801203857.538726745@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-01 16:49:14 -04:00
Linus Torvalds
d6f38c1239 tracing changes for 6.17
- Deprecate auto-mounting tracefs to /sys/kernel/debug/tracing
 
   When tracefs was first introduced back in 2014, the directory
   /sys/kernel/tracing was added and is the designated location to mount
   tracefs. To keep backward compatibility, tracefs was auto-mounted in
   /sys/kernel/debug/tracing as well.
 
   All distros now mount tracefs on /sys/kernel/tracing. Having it seen in two
   different locations has lead to various issues and inconsistencies.
 
   The VFS folks have to also maintain debugfs_create_automount() for this
   single user.
 
   It's been over 10 years. Tooling and scripts should start replacing the
   debugfs location with the tracefs one. The reason tracefs was created in the
   first place was to allow access to the tracing facilities without the need
   to configure debugfs into the kernel. Using tracefs should now be more
   robust.
 
   A new config is created: CONFIG_TRACEFS_AUTOMOUNT_DEPRECATED
   which is default y, so that the kernel is still built with the automount.
   This config allows those that want to remove the automount from debugfs to
   do so.
 
   When tracefs is accessed from /sys/kernel/debug/tracing, the following
   printk is triggerd:
 
    pr_warn("NOTICE: Automounting of tracing to debugfs is deprecated and will be removed in 2030\n");
 
   This gives users another 5 years to fix their scripts.
 
 - Use queue_rcu_work() instead of call_rcu() for freeing event filters
 
   The number of filters to be free can be many depending on the number of
   events within an event system. Freeing them from softirq context can
   potentially cause undesired latency. Use the RCU workqueue to free them
   instead.
 
 - Remove pointless memory barriers in latency code
 
   Memory barriers were added to some of the latency code a long time ago with
   the idea of "making them visible", but that's not what memory barriers are
   for. They are to synchronize access between different variables. There was
   no synchronization here making them pointless.
 
 - Remove "__attribute__()" from the type field of event format
 
   When LLVM is used to compile the kernel with CONFIG_DEBUG_INFO_BTF=y and
   PAHOLE_HAS_BTF_TAG=y, some of the format fields get expanded with the
   following:
 
     field:const char * filename;      offset:24;      size:8; signed:0;
 
   Turns into:
 
     field:const char __attribute__((btf_type_tag("user"))) * filename;      offset:24;      size:8; signed:0;
 
   This confuses parsers. Add code to strip these tags from the strings.
 
 - Add eprobe config option CONFIG_EPROBE_EVENTS
 
   Eprobes were added back in 5.15 but were only enabled when another probe was
   enabled (kprobe, fprobe, uprobe, etc). The eprobes had no config option
   of their own. Add one as they should be a separate entity.
 
   It's default y to keep with the old kernels but still has dependencies on
   TRACING and HAVE_REGS_AND_STACK_ACCESS_API.
 
 - Add eprobe documentation
 
   When eprobes were added back in 5.15 no documentation was added to describe
   them. This needs to be rectified.
 
 - Replace open coded cpumask_next_wrap() in move_to_next_cpu()
 
 - Have preemptirq_delay_run() use off-stack CPU mask
 
 - Remove obsolete comment about pelt_cfs event
 
   DECLARE_TRACE() appends "_tp" to trace events now, but the comment above
   pelt_cfs still mentioned appending it manually.
 
 - Remove EVENT_FILE_FL_SOFT_MODE flag
 
   The SOFT_MODE flag was required when the soft enabling and disabling of
   trace events was first introduced. But there was a bug with this approach
   as it only worked for a single instance. When multiple users required soft
   disabling and disabling the code was changed to have a ref count. The
   SOFT_MODE flag is now set iff the ref count is non zero. This is redundant
   and just reading the ref count is good enough.
 
 - Fix typo in comment
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaIt5ZRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qvriAPsEbOEgMrPF1Tdj1mHLVajYTxI8ft5J
 aX5bfM2cDDRVcgEA57JHOXp4d05dj555/hgAUuCWuFp/E0Anp45EnFTedgQ=
 =wKZW
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

 - Deprecate auto-mounting tracefs to /sys/kernel/debug/tracing

   When tracefs was first introduced back in 2014, the directory
   /sys/kernel/tracing was added and is the designated location to mount
   tracefs. To keep backward compatibility, tracefs was auto-mounted in
   /sys/kernel/debug/tracing as well.

   All distros now mount tracefs on /sys/kernel/tracing. Having it seen
   in two different locations has lead to various issues and
   inconsistencies.

   The VFS folks have to also maintain debugfs_create_automount() for
   this single user.

   It's been over 10 years. Tooling and scripts should start replacing
   the debugfs location with the tracefs one. The reason tracefs was
   created in the first place was to allow access to the tracing
   facilities without the need to configure debugfs into the kernel.
   Using tracefs should now be more robust.

   A new config is created: CONFIG_TRACEFS_AUTOMOUNT_DEPRECATED which is
   default y, so that the kernel is still built with the automount. This
   config allows those that want to remove the automount from debugfs to
   do so.

   When tracefs is accessed from /sys/kernel/debug/tracing, the
   following printk is triggerd:

     pr_warn("NOTICE: Automounting of tracing to debugfs is deprecated and will be removed in 2030\n");

   This gives users another 5 years to fix their scripts.

 - Use queue_rcu_work() instead of call_rcu() for freeing event filters

   The number of filters to be free can be many depending on the number
   of events within an event system. Freeing them from softirq context
   can potentially cause undesired latency. Use the RCU workqueue to
   free them instead.

 - Remove pointless memory barriers in latency code

   Memory barriers were added to some of the latency code a long time
   ago with the idea of "making them visible", but that's not what
   memory barriers are for. They are to synchronize access between
   different variables. There was no synchronization here making them
   pointless.

 - Remove "__attribute__()" from the type field of event format

   When LLVM is used to compile the kernel with CONFIG_DEBUG_INFO_BTF=y
   and PAHOLE_HAS_BTF_TAG=y, some of the format fields get expanded with
   the following:

     field:const char * filename;      offset:24;      size:8; signed:0;

   Turns into:

     field:const char __attribute__((btf_type_tag("user"))) * filename;      offset:24;      size:8; signed:0;

   This confuses parsers. Add code to strip these tags from the strings.

 - Add eprobe config option CONFIG_EPROBE_EVENTS

   Eprobes were added back in 5.15 but were only enabled when another
   probe was enabled (kprobe, fprobe, uprobe, etc). The eprobes had no
   config option of their own. Add one as they should be a separate
   entity.

   It's default y to keep with the old kernels but still has
   dependencies on TRACING and HAVE_REGS_AND_STACK_ACCESS_API.

 - Add eprobe documentation

   When eprobes were added back in 5.15 no documentation was added to
   describe them. This needs to be rectified.

 - Replace open coded cpumask_next_wrap() in move_to_next_cpu()

 - Have preemptirq_delay_run() use off-stack CPU mask

 - Remove obsolete comment about pelt_cfs event

   DECLARE_TRACE() appends "_tp" to trace events now, but the comment
   above pelt_cfs still mentioned appending it manually.

 - Remove EVENT_FILE_FL_SOFT_MODE flag

   The SOFT_MODE flag was required when the soft enabling and disabling
   of trace events was first introduced. But there was a bug with this
   approach as it only worked for a single instance. When multiple users
   required soft disabling and disabling the code was changed to have a
   ref count. The SOFT_MODE flag is now set iff the ref count is non
   zero. This is redundant and just reading the ref count is good
   enough.

 - Fix typo in comment

* tag 'trace-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  Documentation: tracing: Add documentation about eprobes
  tracing: Have eprobes have their own config option
  tracing: Remove "__attribute__()" from the type field of event format
  tracing: Deprecate auto-mounting tracefs in debugfs
  tracing: Fix comment in trace_module_remove_events()
  tracing: Remove EVENT_FILE_FL_SOFT_MODE flag
  tracing: Remove pointless memory barriers
  tracing/sched: Remove obsolete comment on suffixes
  kernel: trace: preemptirq_delay_test: use offstack cpu mask
  tracing: Use queue_rcu_work() to free filters
  tracing: Replace opencoded cpumask_next_wrap() in move_to_next_cpu()
2025-08-01 10:29:36 -07:00
Petr Pavlu
a7c54b2b41
tracing: Replace MAX_PARAM_PREFIX_LEN with MODULE_NAME_LEN
Use the MODULE_NAME_LEN definition in module_exists() to obtain the maximum
size of a module name, instead of using MAX_PARAM_PREFIX_LEN. The values
are the same but MODULE_NAME_LEN is more appropriate in this context.
MAX_PARAM_PREFIX_LEN was added in commit 730b69d225 ("module: check
kernel param length at compile time, not runtime") only to break a circular
dependency between module.h and moduleparam.h, and should mostly be limited
to use in moduleparam.h.

Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20250630143535.267745-5-petr.pavlu@suse.com
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
2025-07-31 13:57:44 +02:00
Linus Torvalds
2be6a7503d Remove or hide unused tracepoints
Tracepoints take up memory (around 5K per tracepoint) even when they are
 unused. Changes are being made to detect when a tracepoint is defined but
 unused and a warning is shown at build. But those changes are not yet
 ready for inclusion.
 
 - Fix some of the unused tracepoints that it detected
 
   Some tracepoints were removed and others were hidden by config settings
   to match the config settings of where they are instantiated. Some
   tracepoints were moved into architecture specific code as only one
   architecture used them.
 
 - Call the ftrace_test_filter tracepoint in an unreachable if statement
 
   The ftrace_test_filter tracepoint which is defined when ftrace selftests
   are configured and is used to test the filter logic, but the tracepoint is
   not actually called. It is put into an if statement to not have it get
   compiled out, but also not warn for not being used.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaIlYqxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qisrAQD+pu2en9LAXLcgbFxQOwhbACpxOpmT
 3LiE2+MvDR3ckQD/Vyi31XebdRmj3leJ7ENf28oa155y1pyK/onrPgDHyQ4=
 =nFfn
 -----END PGP SIGNATURE-----

Merge tag 'trace-unused-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracepoint cleanup from Steven Rostedt:
 "Remove or hide unused tracepoints

  Tracepoints take up memory (around 5K per tracepoint) even when they
  are unused. Changes are being made to detect when a tracepoint is
  defined but unused and a warning is shown at build. But those changes
  are not yet ready for inclusion.

   - Fix some of the unused tracepoints that it detected

     Some tracepoints were removed and others were hidden by config
     settings to match the config settings of where they are
     instantiated. Some tracepoints were moved into architecture
     specific code as only one architecture used them.

   - Call the ftrace_test_filter tracepoint in an unreachable if
     statement

     The ftrace_test_filter tracepoint which is defined when ftrace
     selftests are configured and is used to test the filter logic, but
     the tracepoint is not actually called. It is put into an if
     statement to not have it get compiled out, but also not warn for
     not being used"

* tag 'trace-unused-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: sched: Hide numa events under CONFIG_NUMA_BALANCING
  powerpc/thp: tracing: Hide hugepage events under CONFIG_PPC_BOOK3S_64
  tracing: Call trace_ftrace_test_filter() for the event
  tracing: arm: arm64: Hide trace events ipi_raise, ipi_entry and ipi_exit
  binder: Remove unused binder lock events
  PM: tracing: Hide power_domain_target event under ARCH_OMAP2PLUS
  PM: tracing: Hide device_pm_callback events under PM_SLEEP
  PM: tracing: Hide psci_domain_idle events under ARM_PSCI_CPUIDLE
  PM: cpufreq: powernv/tracing: Move powernv_throttle trace event
  alarmtimer: Hide alarmtimer_suspend event when RTC_CLASS is not configured
  tracing, AER: Hide PCIe AER event when PCIEAER is not configured
2025-07-30 16:41:58 -07:00
Linus Torvalds
4ff261e725 Runtime verification changes for 6.17
- Added Linear temporal logic monitors for RT application
 
   Real-time applications may have design flaws causing them to have
   unexpected latency. For example, the applications may raise page faults, or
   may be blocked trying to take a mutex without priority inheritance.
 
   However, while attempting to implement DA monitors for these real-time
   rules, deterministic automaton is found to be inappropriate as the
   specification language. The automaton is complicated, hard to understand,
   and error-prone.
 
   For these cases, linear temporal logic is found to be more suitable. The
   LTL is more concise and intuitive.
 
 - Make printk_deferred() public
 
   The new monitors needed access to printk_deferred(). Make them visible for
   the entire kernel.
 
 - Add a vpanic() to allow for va_list to be passed to panic.
 
 - Add rtapp container monitor.
 
   A collection of monitors that check for common problems with real-time
   applications that cause unexpected latency.
 
 - Add page fault tracepoints to risc-v
 
   These tracepoints are necessary to for the RV monitor to run on risc-v.
 
 - Fix the behaviour of the rv tool with -s and idle tasks.
 
 - Allow the rv tool to gracefully terminate with SIGTERM
 
 - Adjusts dot2c not to create lines over 100 columns
 
 - Properly order nested monitors in the RV Kconfig file
 
 - Return the registration error in all DA monitor instead of 0
 
 - Update and add new sched collection monitors
 
   Replace tss and sncid monitors with more complete sts:
   Not only prove that switches occur in scheduling context and scheduling
   needs interrupt disabled but also that each call to the scheduler
   disables interrupts to (optionally) switch.
 
   New monitor: nrp
    Preemption requires need resched which is cleared by any switch
    (includes a non optimal workaround for /nested/ preemptions)
 
   New monitor: sssw
    suspension requires setting the task to sleepable and, after the
    switch occurs, the task requires a wakeup to come back to runnable
 
   New monitor: opid
    waking and need-resched operations occur with interrupts and
    preemption disabled or in IRQ without explicitly disabling preemption
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaIk8cBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qi3DAQCFu6DM7uPSh94oggWlH2LukOYVGk2b
 CvGrqMFuefae7QD/aK9nCMfzaBehixMOMQHLHELEh527Hd+RwQCrlnLALQU=
 =r5HZ
 -----END PGP SIGNATURE-----

Merge tag 'trace-rv-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull runtime verification updates from Steven Rostedt:

 - Added Linear temporal logic monitors for RT application

   Real-time applications may have design flaws causing them to have
   unexpected latency. For example, the applications may raise page
   faults, or may be blocked trying to take a mutex without priority
   inheritance.

   However, while attempting to implement DA monitors for these
   real-time rules, deterministic automaton is found to be inappropriate
   as the specification language. The automaton is complicated, hard to
   understand, and error-prone.

   For these cases, linear temporal logic is found to be more suitable.
   The LTL is more concise and intuitive.

 - Make printk_deferred() public

   The new monitors needed access to printk_deferred(). Make them
   visible for the entire kernel.

 - Add a vpanic() to allow for va_list to be passed to panic.

 - Add rtapp container monitor.

   A collection of monitors that check for common problems with
   real-time applications that cause unexpected latency.

 - Add page fault tracepoints to risc-v

   These tracepoints are necessary to for the RV monitor to run on
   risc-v.

 - Fix the behaviour of the rv tool with -s and idle tasks.

 - Allow the rv tool to gracefully terminate with SIGTERM

 - Adjusts dot2c not to create lines over 100 columns

 - Properly order nested monitors in the RV Kconfig file

 - Return the registration error in all DA monitor instead of 0

 - Update and add new sched collection monitors

   Replace tss and sncid monitors with more complete sts:

   Not only prove that switches occur in scheduling context and scheduling
   needs interrupt disabled but also that each call to the scheduler
   disables interrupts to (optionally) switch.

   New monitor: nrp
     Preemption requires need resched which is cleared by any switch
     (includes a non optimal workaround for /nested/ preemptions)

   New monitor: sssw
     suspension requires setting the task to sleepable and, after the
     switch occurs, the task requires a wakeup to come back to runnable

   New monitor: opid
      waking and need-resched operations occur with interrupts and
      preemption disabled or in IRQ without explicitly disabling
      preemption"

* tag 'trace-rv-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (48 commits)
  rv: Add opid per-cpu monitor
  rv: Add nrp and sssw per-task monitors
  rv: Replace tss and sncid monitors with more complete sts
  sched: Adapt sched tracepoints for RV task model
  rv: Retry when da monitor detects race conditions
  rv: Adjust monitor dependencies
  rv: Use strings in da monitors tracepoints
  rv: Remove trailing whitespace from tracepoint string
  rv: Add da_handle_start_run_event_ to per-task monitors
  rv: Fix wrong type cast in reactors_show() and monitor_reactor_show()
  rv: Fix wrong type cast in monitors_show()
  rv: Remove struct rv_monitor::reacting
  rv: Remove rv_reactor's reference counter
  rv: Merge struct rv_reactor_def into struct rv_reactor
  rv: Merge struct rv_monitor_def into struct rv_monitor
  rv: Remove unused field in struct rv_monitor_def
  rv: Return init error when registering monitors
  verification/rvgen: Organise Kconfig entries for nested monitors
  tools/dot2c: Fix generated files going over 100 column limit
  tools/rv: Stop gracefully also on SIGTERM
  ...
2025-07-30 16:23:12 -07:00
Linus Torvalds
d50b07d05c ring-buffer changes for v6.17:
- Rewind persistent ring buffer on boot
 
   When the persistent ring buffer is being used for live kernel tracing and
   the system crashes, the tool that is reading the trace may not have recorded
   the data when the system crashed. Although the persistent ring buffer still
   has that data, when reading it after a reboot, it will start where it left
   off. That is, what was read will not be accessible.
 
   Instead, on reboot, have the persistent ring buffer restart where the data
   starts and this will allow the tooling to recover what was lost when the
   crash occurred.
 
 - Remove the ring_buffer_read_prepare_sync() logic
 
   Reading the trace file required stopping writing to the ring buffer as the
   trace file is only an iterator and does not consume what it read. It was
   originally not safe to read the ring buffer in this mode and required
   disabling writing. The ring_buffer_read_prepare_sync() logic was used to
   stop each per_cpu ring buffer, call synchronize_rcu() and then start the
   iterator. This was used instead of calling synchronize_rcu() for each
   per_cpu buffer.
 
   Today, the iterator has been updated where it is safe to read the trace file
   while writing to the ring buffer is still occurring. There is no more need
   to do this synchronization and it is causing large delays on machines with
   many CPUs. Remove this unneeded synchronization.
 
 - Make static string array a constant in show_irq_str()
 
   Making the string array into a constant has shown to decrease code text/data
   size.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaIkfURQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qnx4AQCNXOuKYJXDvXrkwf449agwrn0lCVyI
 vV0L65nyIrakpAD8COV/lw8DhlCpb/Lijlzzo5L0n9QpEElNpq5uEntNwgE=
 =1YIy
 -----END PGP SIGNATURE-----

Merge tag 'trace-ringbuffer-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ring-buffer updates from Steven Rostedt:

 - Rewind persistent ring buffer on boot

   When the persistent ring buffer is being used for live kernel tracing
   and the system crashes, the tool that is reading the trace may not
   have recorded the data when the system crashed.

   Although the persistent ring buffer still has that data, when reading
   it after a reboot, it will start where it left off. That is, what was
   read will not be accessible.

   Instead, on reboot, have the persistent ring buffer restart where the
   data starts and this will allow the tooling to recover what was lost
   when the crash occurred.

 - Remove the ring_buffer_read_prepare_sync() logic

   Reading the trace file required stopping writing to the ring buffer
   as the trace file is only an iterator and does not consume what it
   read. It was originally not safe to read the ring buffer in this mode
   and required disabling writing. The ring_buffer_read_prepare_sync()
   logic was used to stop each per_cpu ring buffer, call
   synchronize_rcu() and then start the iterator. This was used instead
   of calling synchronize_rcu() for each per_cpu buffer.

   Today, the iterator has been updated where it is safe to read the
   trace file while writing to the ring buffer is still occurring. There
   is no more need to do this synchronization and it is causing large
   delays on machines with many CPUs. Remove this unneeded
   synchronization.

 - Make static string array a constant in show_irq_str()

   Making the string array into a constant has shown to decrease code
   text/data size.

* tag 'trace-ringbuffer-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Make the const read-only 'type' static
  ring-buffer: Remove ring_buffer_read_prepare_sync()
  tracing: ring_buffer: Rewind persistent ring buffer on reboot
2025-07-30 16:16:58 -07:00
Linus Torvalds
90a871f74b ftrace changes for v6.17:
- Keep track of when fgraph_ops are registered or not
 
   Keep accounting of when fgraph_ops are registered as if a fgraph_ops is
   registered twice it can mess up the accounting and it will not work as
   expected later. Trigger a warning if something registers it twice as to
   catch bugs before they are found by things just not working as expected.
 
 - Make DYNAMIC_FTRACE always enabled for architectures that support it
 
   As static ftrace (where all functions are always traced) is very expensive
   and only exists to help architectures support ftrace, do not make it an
   option. As soon as an architecture supports DYNAMIC_FTRACE make it use it.
   This simplifies the code.
 
 - Remove redundant config HAVE_FTRACE_MCOUNT_RECORD
 
   The CONFIG_HAVE_FTRACE_MCOUNT was added to help simplify the
   DYNAMIC_FTRACE work, but now every architecture that implements
   DYNAMIC_FTRACE also has HAVE_FTRACE_MCOUNT set too, making it redundant
   with the HAVE_DYNAMIC_FTRACE.
 
 - Make pid_ptr string size match the comment
 
   In print_graph_proc() the pid_ptr string is of size 11, but the comment says
   /* sign + log10(MAX_INT) + '\0' */ which is actually 12.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaIkVkRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qmdxAPsGcyT/gnyX/wf70cI63QoODrlRAd7M
 tg3R0J0H41U05QD/apttbA9GSdZ8bDLLSFAXTJgr8f4GvYvbUsmu2sMBBA8=
 =gd9V
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace updates from Steven Rostedt:

 - Keep track of when fgraph_ops are registered or not

   Keep accounting of when fgraph_ops are registered as if a fgraph_ops
   is registered twice it can mess up the accounting and it will not
   work as expected later. Trigger a warning if something registers it
   twice as to catch bugs before they are found by things just not
   working as expected.

 - Make DYNAMIC_FTRACE always enabled for architectures that support it

   As static ftrace (where all functions are always traced) is very
   expensive and only exists to help architectures support ftrace, do
   not make it an option. As soon as an architecture supports
   DYNAMIC_FTRACE make it use it. This simplifies the code.

 - Remove redundant config HAVE_FTRACE_MCOUNT_RECORD

   The CONFIG_HAVE_FTRACE_MCOUNT was added to help simplify the
   DYNAMIC_FTRACE work, but now every architecture that implements
   DYNAMIC_FTRACE also has HAVE_FTRACE_MCOUNT set too, making it
   redundant with the HAVE_DYNAMIC_FTRACE.

 - Make pid_ptr string size match the comment

   In print_graph_proc() the pid_ptr string is of size 11, but the
   comment says /* sign + log10(MAX_INT) + '\0' */ which is actually 12.

* tag 'ftrace-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Remove redundant config HAVE_FTRACE_MCOUNT_RECORD
  ftrace: Make DYNAMIC_FTRACE always enabled for architectures that support it
  fgraph: Keep track of when fgraph_ops are registered or not
  fgraph: Make pid_str size match the comment
2025-07-30 16:04:10 -07:00
Linus Torvalds
b7dbc2e813 Probes updates for v6.17:
- Stack usage reduction for probe events:
    - Allocate string buffers from the heap for uprobe, eprobe, kprobe,
      and fprobe events to avoid stack overflow.
    - Allocate traceprobe_parse_context from the heap to prevent
      potential stack overflow.
    - Fix a typo in the above commit.
 
  - New features for eprobe and tprobe events:
    - Add support for arrays in eprobes.
    - Support multiple tprobes on the same tracepoint.
 
  - Improve efficiency:
    - Register fprobe-events only when it is enabled to reduce overhead.
    - Register tracepoints for tprobe events only when enabled to
      resolve a lock dependency.
 
  - Code Cleanup:
    - Add kerneldoc for traceprobe_parse_event_name() and __get_insn_slot().
    - Sort #include alphabetically in the probes code.
    - Remove the unused 'mod' field from the tprobe-event.
    - Clean up the entry-arg storing code in probe-events.
 
  - Selftest update
    - Enable fprobe events before checking enable_functions in selftests.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmiJ2DQbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bSfkH/06Zn5I55rU85FKSBQll
 FN4hipmef/9Nd13skDwpEuFyzLPNS4P1up/UBUuyDQUTlO74+t2zSFO2dpcNrWmu
 sPTenQ+6h82H3K591WTIC23VzF54syIbFLXEj8iMBALT3wyU4Nn0bs4DCbnTo5HX
 R3NVo77rk6wxNJoKYOtT6ALf/lHonuNlGF+KTUGWP8UbWsIY3fIp0RWWy572M0bt
 +YBE8D8RIVrw+ZY+vNKn1LdZdWlR1ton518XDf1gV9isTCfKErcd/6HJKwuj5q2v
 qMgwiaKK+Gne/ylAKmWLEg2oNDo7kpyfW+612oiECitgZkqxOXhyYYfWgRt1lFNp
 Wb8=
 =E+Z6
 -----END PGP SIGNATURE-----

Merge tag 'probes-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes updates from Masami Hiramatsu:
 "Stack usage reduction for probe events:
   - Allocate string buffers from the heap for uprobe, eprobe, kprobe,
     and fprobe events to avoid stack overflow
   - Allocate traceprobe_parse_context from the heap to prevent
     potential stack overflow
   - Fix a typo in the above commit

  New features for eprobe and tprobe events:
   - Add support for arrays in eprobes
   - Support multiple tprobes on the same tracepoint

  Improve efficiency:
   - Register fprobe-events only when it is enabled to reduce overhead
   - Register tracepoints for tprobe events only when enabled to resolve
     a lock dependency

  Code Cleanup:
   - Add kerneldoc for traceprobe_parse_event_name() and
     __get_insn_slot()
   - Sort #include alphabetically in the probes code
   - Remove the unused 'mod' field from the tprobe-event
   - Clean up the entry-arg storing code in probe-events

  Selftest update
   - Enable fprobe events before checking enable_functions in selftests"

* tag 'probes-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: trace_fprobe: Fix typo of the semicolon
  tracing: Have eprobes handle arrays
  tracing: probes: Add a kerneldoc for traceprobe_parse_event_name()
  tracing: uprobe-event: Allocate string buffers from heap
  tracing: eprobe-event: Allocate string buffers from heap
  tracing: kprobe-event: Allocate string buffers from heap
  tracing: fprobe-event: Allocate string buffers from heap
  tracing: probe: Allocate traceprobe_parse_context from heap
  tracing: probes: Sort #include alphabetically
  kprobes: Add missing kerneldoc for __get_insn_slot
  tracing: tprobe-events: Register tracepoint when enable tprobe event
  selftests: tracing: Enable fprobe events before checking enable_functions
  tracing: fprobe-events: Register fprobe-events only when it is enabled
  tracing: tprobe-events: Support multiple tprobes on the same tracepoint
  tracing: tprobe-events: Remove mod field from tprobe-event
  tracing: probe-events: Cleanup entry-arg storing code
2025-07-30 15:38:01 -07:00
Linus Torvalds
a03eec7420 Probes fixes for v6.16:
- Fix a potential infinite recursion in fprobe by using preempt_*_notrace().
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmiIdp4bHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8boY8IAMQGLspd1mATbCfnrQKY
 2X86OVygOJx7Iq1RCmOV6fhroe5EoNVR/b1RXmZJf2gIoN176zdBdYrBIFC97lYO
 J1XaU/Ns1McBuKrOjc3TSYYioVPHJrKLiZ1vAoCicTkUsS34MQJXbbAlfdn424pb
 J1wUeIDJF0WrFH9yVJ4mEs1dH81oCQ3iSG0CYx5/qLggcoubUFrVl4QessJwAuI6
 VM+cKDsqMCltBovXFw/fAgWfiQp79z/uq9umOFLdZGsesqutMYTMgJXBS6slKl3a
 qE2EQ57Op39A2zpk2hUoVoyv5Ey/XkfEjLU7WIMfqjLOL201IGQEKuyvR/mS54Kc
 HDw=
 =EeVm
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fix from Masami Hiramatsu:

 - Fix a potential infinite recursion in fprobe by using preempt_*_notrace()

* tag 'probes-fixes-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: fprobe: Fix infinite recursion using preempt_*_notrace()
2025-07-30 15:35:57 -07:00
Linus Torvalds
d9104cec3e bpf-next-6.17
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmiINnEACgkQ6rmadz2v
 bToBnA/9F+A3R6rTwGk4HK3xpfc/nm2Tanl3oRN7S2ub/mskDOtWSIyG6cVFZ0UG
 1fK6IkByyRIpAF/5qhdlw8drRXHkQtGLA0lP2L9llm4X1mHLofB18y9OeLrDE1WN
 KwNP06+IGX9W802lCGSIXOY+VmRscVfXSMokyQt2ilHplKjOnDqJcYkWupi3T2rC
 mz79FY9aEl2YrIcpj9RXz+8nwP49pZBuW2P0IM5PAIj4BJBXShrUp8T1nz94okNe
 NFsnAyRxjWpUT0McEgtA9WvpD9lZqujYD8Qp0KlGZWmI3vNpV5d9S1+dBcEb1n7q
 dyNMkTF3oRrJhhg4VqoHc6fVpzSEoZ9ZxV5Hx4cs+ganH75D4YbdGqx/7mR3DUgH
 MZh6rHF1pGnK7TAm7h5gl3ZRAOkZOaahbe1i01NKo9CEe5fSh3AqMyzJYoyGHRKi
 xDN39eQdWBNA+hm1VkbK2Bv93Rbjrka2Kj+D3sSSO9Bo/u3ntcknr7LW39idKz62
 Q8dkKHcCEtun7gjk0YXPF013y81nEohj1C+52BmJ2l5JitM57xfr6YOaQpu7DPDE
 AJbHx6ASxKdyEETecd0b+cXUPQ349zmRXy0+CDMAGKpBicC0H0mHhL14cwOY1Hfu
 EIpIjmIJGI3JNF6T5kybcQGSBOYebdV0FFgwSllzPvuYt7YsHCs=
 =/O3j
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:

 - Remove usermode driver (UMD) framework (Thomas Weißschuh)

 - Introduce Strongly Connected Component (SCC) in the verifier to
   detect loops and refine register liveness (Eduard Zingerman)

 - Allow 'void *' cast using bpf_rdonly_cast() and corresponding
   '__arg_untrusted' for global function parameters (Eduard Zingerman)

 - Improve precision for BPF_ADD and BPF_SUB operations in the verifier
   (Harishankar Vishwanathan)

 - Teach the verifier that constant pointer to a map cannot be NULL
   (Ihor Solodrai)

 - Introduce BPF streams for error reporting of various conditions
   detected by BPF runtime (Kumar Kartikeya Dwivedi)

 - Teach the verifier to insert runtime speculation barrier (lfence on
   x86) to mitigate speculative execution instead of rejecting the
   programs (Luis Gerhorst)

 - Various improvements for 'veristat' (Mykyta Yatsenko)

 - For CONFIG_DEBUG_KERNEL config warn on internal verifier errors to
   improve bug detection by syzbot (Paul Chaignon)

 - Support BPF private stack on arm64 (Puranjay Mohan)

 - Introduce bpf_cgroup_read_xattr() kfunc to read xattr of cgroup's
   node (Song Liu)

 - Introduce kfuncs for read-only string opreations (Viktor Malik)

 - Implement show_fdinfo() for bpf_links (Tao Chen)

 - Reduce verifier's stack consumption (Yonghong Song)

 - Implement mprog API for cgroup-bpf programs (Yonghong Song)

* tag 'bpf-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (192 commits)
  selftests/bpf: Migrate fexit_noreturns case into tracing_failure test suite
  selftests/bpf: Add selftest for attaching tracing programs to functions in deny list
  bpf: Add log for attaching tracing programs to functions in deny list
  bpf: Show precise rejected function when attaching fexit/fmod_ret to __noreturn functions
  bpf: Fix various typos in verifier.c comments
  bpf: Add third round of bounds deduction
  selftests/bpf: Test invariants on JSLT crossing sign
  selftests/bpf: Test cross-sign 64bits range refinement
  selftests/bpf: Update reg_bound range refinement logic
  bpf: Improve bounds when s64 crosses sign boundary
  bpf: Simplify bounds refinement from s32
  selftests/bpf: Enable private stack tests for arm64
  bpf, arm64: JIT support for private stack
  bpf: Move bpf_jit_get_prog_name() to core.c
  bpf, arm64: Fix fp initialization for exception boundary
  umd: Remove usermode driver framework
  bpf/preload: Don't select USERMODE_DRIVER
  selftests/bpf: Fix test dynptr/test_dynptr_memset_xdp_chunks failure
  selftests/bpf: Fix test dynptr/test_dynptr_copy_xdp failure
  selftests/bpf: Increase xdp data size for arm64 64K page size
  ...
2025-07-30 09:58:50 -07:00
Steven Rostedt
0dd1274a05 tracing: Have eprobes have their own config option
Eprobes were added in 5.15 and were selected whenever any of the other
probe events were selected. If kprobe events were enabled (which it is by
default if kprobes are enabled) it would enable eprobe events as well. The
same for uprobes and fprobes.

Have eprobes have its own config and it gets enabled by default if tracing
is enabled.

Link: https://lore.kernel.org/all/20250729102636.b7cce553e7cc263722b12365@kernel.org/

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/20250730140945.360286733@kernel.org
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-30 10:38:43 -04:00
Colin Ian King
6443cdf567 ring-buffer: Make the const read-only 'type' static
Don't populate the read-only 'type' on the stack at run time,
instead make it static.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250714160858.1234719-1-colin.i.king@gmail.com
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-29 15:08:57 -04:00
Masami Hiramatsu (Google)
1a967e92bf tracing: Remove "__attribute__()" from the type field of event format
With CONFIG_DEBUG_INFO_BTF=y and PAHOLE_HAS_BTF_TAG=y, `__user` is
converted to `__attribute__((btf_type_tag("user")))`. In this case,
some syscall events have it for __user data, like below;

/sys/kernel/tracing # cat events/syscalls/sys_enter_openat/format
name: sys_enter_openat
ID: 720
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:int __syscall_nr; offset:8;       size:4; signed:1;
        field:int dfd;  offset:16;      size:8; signed:0;
        field:const char __attribute__((btf_type_tag("user"))) * filename;      offset:24;      size:8; signed:0;
        field:int flags;        offset:32;      size:8; signed:0;
        field:umode_t mode;     offset:40;      size:8; signed:0;

Then the trace event filter fails to set the string acceptable flag
(FILTER_PTR_STRING) to the field and rejects setting string filter;

 # echo 'filename.ustring ~ "*ftracetest-dir.wbx24v*"' \
    >> events/syscalls/sys_enter_openat/filter
 sh: write error: Invalid argument
 # cat error_log
 [  723.743637] event filter parse error: error: Expecting numeric field
   Command: filename.ustring ~ "*ftracetest-dir.wbx24v*"

Since this __attribute__ makes format parsing complicated and not
needed, remove the __attribute__(.*) from the type string.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/175376583493.1688759.12333973498014733551.stgit@mhiramat.tok.corp.google.com
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-29 12:30:41 -04:00
Masami Hiramatsu (Google)
a3e892ab0f tracing: fprobe: Fix infinite recursion using preempt_*_notrace()
Since preempt_count_add/del() are tracable functions, it is not allowed
to use preempt_disable/enable() in ftrace handlers. Without this fix,
probing on `preempt_count_add%return` will cause an infinite recursion
of fprobes.

To fix this problem, use preempt_disable/enable_notrace() in
fprobe_return().

Link: https://lore.kernel.org/all/175374642359.1471729.1054175011228386560.stgit@mhiramat.tok.corp.google.com/

Fixes: 4346ba1604 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-07-29 16:19:05 +09:00
Linus Torvalds
6e11664f14 for-6.17/block-20250728
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmiHdZ8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptRED/9o3dQ1QHL5yNM/AyCCGox0V4zra8qGS/Vc
 cBWpAVrmPGRw0IYlLZENtN9PdwKcbMzJq3l6cxeC7dBnAZP0AxTzP4YYJYUNVsqo
 WtJ3d/k5+cVp0OyOp4uabaqNeMeLoPk9/JXe1Ml2KxtDmHtj5yee0JRh7zlPZmZj
 tsrpIUTeHgAPn6yR1EI+0ybx/mjCb05Mv2Y8gF5hkUPA2PuON+MTFixJmqoy2ySh
 n+22mz/prqlyOSYh/VVv1+9jcQ94wMjcW0JIpg9lM3Kg8BCPU4IetvO1UiX6X33v
 154zEh2aJJDBx+yORS4BM4JMXjRZI7lYea2dkHM8Cajctu1Wpja9bNwnK9ibXvEc
 WtyBwztleLbAZef25fA/W87JE23fGa/r3nwIb2cF4QqkAFslCvhjA93WkOzNJCgQ
 qsWOrlCh3IK2NUu4b1Ncs3ZHOPvc51+zzjMzC6SUr54xhrxDK+gngDPhRy7XDqWJ
 DTMpIlr366o8GdJqnib0/e/CPBrThS6Vl6u0tgLnNbwdpK1svgo/uHW5ksKvDqHX
 kGEIhyRRJJC+4wyl4dsYKXa2twcyFrlWdAE+pZguEC2nZRYqYl9uXftOtvfp1x0y
 /skDX0FIDjvyjRqCLcqF03FSGqwCGS8WuWXZjPhVhcfz47NvbHeFDh1G/jMzsbpj
 S9zrPve/DQ==
 =e86T
 -----END PGP SIGNATURE-----

Merge tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - MD pull request via Yu:
      - call del_gendisk synchronously (Xiao)
      - cleanup unused variable (John)
      - cleanup workqueue flags (Ryo)
      - fix faulty rdev can't be removed during resync (Qixing)

 - NVMe pull request via Christoph:
      - try PCIe function level reset on init failure (Keith Busch)
      - log TLS handshake failures at error level (Maurizio Lombardi)
      - pci-epf: do not complete commands twice if nvmet_req_init()
        fails (Rick Wertenbroek)
      - misc cleanups (Alok Tiwari)

 - Removal of the pktcdvd driver

   This has been more than a decade coming at this point, and some
   recently revealed breakages that had it causing issues even for cases
   where it isn't required made me re-pull the trigger on this one. It's
   known broken and nobody has stepped up to maintain the code

 - Series for ublk supporting batch commands, enabling the use of
   multishot where appropriate

 - Speed up ublk exit handling

 - Fix for the two-stage elevator fixing which could leak data

 - Convert NVMe to use the new IOVA based API

 - Increase default max transfer size to something more reasonable

 - Series fixing write operations on zoned DM devices

 - Add tracepoints for zoned block device operations

 - Prep series working towards improving blk-mq queue management in the
   presence of isolated CPUs

 - Don't allow updating of the block size of a loop device that is
   currently under exclusively ownership/open

 - Set chunk sectors from stacked device stripe size and use it for the
   atomic write size limit

 - Switch to folios in bcache read_super()

 - Fix for CD-ROM MRW exit flush handling

 - Various tweaks, fixes, and cleanups

* tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux: (94 commits)
  block: restore two stage elevator switch while running nr_hw_queue update
  cdrom: Call cdrom_mrw_exit from cdrom_release function
  sunvdc: Balance device refcount in vdc_port_mpgroup_check
  nvme-pci: try function level reset on init failure
  dm: split write BIOs on zone boundaries when zone append is not emulated
  block: use chunk_sectors when evaluating stacked atomic write limits
  dm-stripe: limit chunk_sectors to the stripe size
  md/raid10: set chunk_sectors limit
  md/raid0: set chunk_sectors limit
  block: sanitize chunk_sectors for atomic write limits
  ilog2: add max_pow_of_two_factor()
  nvmet: pci-epf: Do not complete commands twice if nvmet_req_init() fails
  nvme-tcp: log TLS handshake failures at error level
  docs: nvme: fix grammar in nvme-pci-endpoint-target.rst
  nvme: fix typo in status code constant for self-test in progress
  nvmet: remove redundant assignment of error code in nvmet_ns_enable()
  nvme: fix incorrect variable in io cqes error message
  nvme: fix multiple spelling and grammar issues in host drivers
  block: fix blk_zone_append_update_request_bio() kernel-doc
  md/raid10: fix set but not used variable in sync_request_write()
  ...
2025-07-28 16:43:54 -07:00
Masami Hiramatsu (Google)
133c302a0c tracing: trace_fprobe: Fix typo of the semicolon
Fix a typo that uses ',' instead of ';' for line delimiter.

Link: https://lore.kernel.org/linux-trace-kernel/175366879192.487099.5714468217360139639.stgit@mhiramat.tok.corp.google.com/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-07-29 08:37:52 +09:00
Gabriele Monaco
614384533d rv: Add opid per-cpu monitor
Add a per-cpu monitor as part of the sched model:
* opid: operations with preemption and irq disabled
    Monitor to ensure wakeup and need_resched occur with irq and
    preemption disabled or in irq handlers.

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250728135022.255578-10-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Acked-by: Nam Cao <namcao@linutronix.de>
Tested-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 16:47:35 -04:00
Gabriele Monaco
e8440a88e5 rv: Add nrp and sssw per-task monitors
Add 2 per-task monitors as part of the sched model:

* nrp: need-resched preempts
    Monitor to ensure preemption requires need resched.
* sssw: set state sleep and wakeup
    Monitor to ensure sched_set_state to sleepable leads to sleeping and
    sleeping tasks require wakeup.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20250728135022.255578-9-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Acked-by: Nam Cao <namcao@linutronix.de>
Tested-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 16:47:34 -04:00
Gabriele Monaco
d0096c2f9c rv: Replace tss and sncid monitors with more complete sts
The tss monitor currently guarantees task switches can happen only while
scheduling, whereas the sncid monitor enforces scheduling occurs with
interrupt disabled.

Replace the monitors with a more comprehensive specification which
implies both but also ensures that:
* each scheduler call disable interrupts to switch
* each task switch happens with interrupts disabled

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Nam Cao <namcao@linutronix.de>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20250728135022.255578-8-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 16:47:34 -04:00
Gabriele Monaco
adcc3bfa88 sched: Adapt sched tracepoints for RV task model
Add the following tracepoint:
* sched_set_need_resched(tsk, cpu, tif)
    Called when a task is set the need resched [lazy] flag

Remove the unused ip parameter from sched_entry and sched_exit and alter
sched_entry to have a value of preempt consistent with the one used in
sched_switch.

Also adapt all monitors using sched_{entry,exit} to avoid breaking build.

These tracepoints are useful to describe the Linux task model and are
adapted from the patches by Daniel Bristot de Oliveira
(https://bristot.me/linux-task-model/).

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Nam Cao <namcao@linutronix.de>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250728135022.255578-7-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 16:47:34 -04:00
Gabriele Monaco
9d475d80c9 rv: Retry when da monitor detects race conditions
DA monitor can be accessed from multiple cores simultaneously, this is
likely, for instance when dealing with per-task monitors reacting on
events that do not always occur on the CPU where the task is running.
This can cause race conditions where two events change the next state
and we see inconsistent values. E.g.:

  [62] event_srs: 27: sleepable x sched_wakeup -> running (final)
  [63] event_srs: 27: sleepable x sched_set_state_sleepable -> sleepable
  [63] error_srs: 27: event sched_switch_suspend not expected in the state running

In this case the monitor fails because the event on CPU 62 wins against
the one on CPU 63, although the correct state should have been
sleepable, since the task get suspended.

Detect if the current state was modified by using try_cmpxchg while
storing the next value. If it was, try again reading the current state.
After a maximum number of failed retries, react by calling a special
tracepoint, print on the console and reset the monitor.

Remove the functions da_monitor_curr_state() and da_monitor_set_state()
as they only hide the underlying implementation in this case.

Monitors where this type of condition can occur must be able to account
for racing events in any possible order, as we cannot know the winner.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20250728135022.255578-6-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 16:47:34 -04:00
Gabriele Monaco
79de661707 rv: Adjust monitor dependencies
RV monitors relying on the preemptirqs tracepoints are set as dependent
on PREEMPT_TRACER and IRQSOFF_TRACER. In fact, those configurations do
enable the tracepoints but are not the minimal configurations enabling
them, which are TRACE_PREEMPT_TOGGLE and TRACE_IRQFLAGS (not selectable
manually).

Set TRACE_PREEMPT_TOGGLE and TRACE_IRQFLAGS as dependencies for
monitors.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250728135022.255578-5-gmonaco@redhat.com
Fixes: fbe6c09b7e ("rv: Add scpd, snep and sncid per-cpu monitors")
Acked-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 16:47:34 -04:00
Gabriele Monaco
7f904ff6e5 rv: Use strings in da monitors tracepoints
Using DA monitors tracepoints with KASAN enabled triggers the following
warning:

 BUG: KASAN: global-out-of-bounds in do_trace_event_raw_event_event_da_monitor+0xd6/0x1a0
 Read of size 32 at addr ffffffffaada8980 by task ...
 Call Trace:
  <TASK>
 [...]
  do_trace_event_raw_event_event_da_monitor+0xd6/0x1a0
  ? __pfx_do_trace_event_raw_event_event_da_monitor+0x10/0x10
  ? trace_event_sncid+0x83/0x200
  trace_event_sncid+0x163/0x200
 [...]
 The buggy address belongs to the variable:
  automaton_snep+0x4e0/0x5e0

This is caused by the tracepoints reading 32 bytes __array instead of
__string from the automata definition. Such strings are literals and
reading 32 bytes ends up in out of bound memory accesses (e.g. the next
automaton's data in this case).
The error is harmless as, while printing the string, we stop at the null
terminator, but it should still be fixed.

Use the __string facilities while defining the tracepoints to avoid
reading out of bound memory.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250728135022.255578-4-gmonaco@redhat.com
Fixes: 792575348f ("rv/include: Add deterministic automata monitor definition via C macros")
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 16:47:34 -04:00
Gabriele Monaco
7b70ac4cad rv: Remove trailing whitespace from tracepoint string
RV event tracepoints print a line with the format:
    "event_xyz: S0 x event -> S1 "
    "event_xyz: S1 x event -> S0 (final)"

While printing an event leading to a non-final state, the line
has a trailing white space (visible above before the closing ").

Adapt the format string not to print the trailing whitespace if we are
not printing "(final)".

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250728135022.255578-3-gmonaco@redhat.com
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 16:47:34 -04:00
Nam Cao
3cfb9c1a7d rv: Fix wrong type cast in reactors_show() and monitor_reactor_show()
Argument 'p' of reactors_show() and monitor_reactor_show() is not a pointer
to struct rv_reactor, it is actually a pointer to the list_head inside
struct rv_reactor. Therefore it's wrong to cast 'p' to struct rv_reactor *.

This wrong type cast has been there since the beginning. But it still
worked because the list_head was the first field in struct rv_reactor_def.
This is no longer true since commit 3d3c376118 ("rv: Merge struct
rv_reactor_def into struct rv_reactor") moved the list_head, and this wrong
type cast became a functional problem.

Properly use container_of() instead.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/b4febbd6844311209e4c8768b65d508b81bd8c9b.1753625621.git.namcao@linutronix.de
Fixes: 3d3c376118 ("rv: Merge struct rv_reactor_def into struct rv_reactor")
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 10:39:34 -04:00
Nam Cao
e82aea50fe rv: Fix wrong type cast in monitors_show()
Argument 'p' of monitors_show() is not a pointer to struct rv_monitor, it
is actually a pointer to the list_head inside struct rv_monitor. Therefore
it is wrong to cast 'p' to struct rv_monitor *.

This wrong type cast has been there since the beginning. But it still
worked because the list_head was the first field in struct rv_monitor_def.
This is no longer true since commit 24cbfe18d5 ("rv: Merge struct
rv_monitor_def into struct rv_monitor") moved the list_head, and this wrong
type cast became a functional problem.

Properly use container_of() instead.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/35e49e97696007919ceacf73796487a2e15a3d02.1753625621.git.namcao@linutronix.de
Fixes: 24cbfe18d5 ("rv: Merge struct rv_monitor_def into struct rv_monitor")
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 10:39:34 -04:00
Nam Cao
b8a7fba39c rv: Remove struct rv_monitor::reacting
The field 'reacting' in struct rv_monitor is set but never used. Delete it.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/a6c16f845d2f1a09c4d0934ab83f3cb14478a71d.1753378331.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-25 09:04:14 -04:00
Nam Cao
3d3800b4f7 rv: Remove rv_reactor's reference counter
rv_reactor has a reference counter to ensure it is not removed while
monitors are still using it.

However, this is futile, as __exit functions are not expected to fail and
will proceed normally despite rv_unregister_reactor() returning an error.

At the moment, reactors do not support being built as modules, therefore
they are never removed and the reference counters are not necessary.

If we support building RV reactors as modules in the future, kernel
module's centralized facilities such as try_module_get(), module_put() or
MODULE_SOFTDEP should be used instead of this custom implementation.

Remove this reference counter.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/bb946398436a5e17fb0f5b842ef3313c02291852.1753378331.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-25 09:04:14 -04:00
Nam Cao
3d3c376118 rv: Merge struct rv_reactor_def into struct rv_reactor
Each struct rv_reactor has a unique struct rv_reactor_def associated with
it. struct rv_reactor is statically allocated, while struct rv_reactor_def
is dynamically allocated.

This makes the code more complicated than it should be:

  - Lookup is required to get the associated rv_reactor_def from rv_reactor

  - Dynamic memory allocation is required for rv_reactor_def. This is
    harder to get right compared to static memory. For instance, there is
    an existing mistake: rv_unregister_reactor() does not free the memory
    allocated by rv_register_reactor(). This is fortunately not a real
    memory leak problem as rv_unregister_reactor() is never called.

Simplify and merge rv_reactor_def into rv_reactor.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/71cb91c86cd40df5b8c492b788787f2a73c3eaa3.1753378331.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-25 09:04:14 -04:00
Nam Cao
24cbfe18d5 rv: Merge struct rv_monitor_def into struct rv_monitor
Each struct rv_monitor has a unique struct rv_monitor_def associated with
it. struct rv_monitor is statically allocated, while struct rv_monitor_def
is dynamically allocated.

This makes the code more complicated than it should be:

  - Lookup is required to get the associated rv_monitor_def from rv_monitor

  - Dynamic memory allocation is required for rv_monitor_def. This is
    harder to get right compared to static memory. For instance, there is
    an existing mistake: rv_unregister_monitor() does not free the memory
    allocated by rv_register_monitor(). This is fortunately not a real
    memory leak problem, as rv_unregister_monitor() is never called.

Simplify and merge rv_monitor_def into rv_monitor.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/194449c00f87945c207aab4c96920c75796a4f53.1753378331.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-25 09:04:14 -04:00
Nam Cao
b0c08dd534 rv: Remove unused field in struct rv_monitor_def
rv_monitor_def::task_monitor is not used. Delete it.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/502d94f2696435690a2b1fdbe80a9e56c96fcabf.1753378331.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-25 09:04:13 -04:00
Steven Rostedt
8c4e53a1a0 tracing: Call trace_ftrace_test_filter() for the event
The trace event filter bootup self test tests a bunch of filter logic
against the ftrace_test_filter event, but does not actually call the
event. Work is being done to cause a warning if an event is defined but
not used. To quiet the warning call the trace event under an if statement
where it is disabled so it doesn't get optimized out.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicolas Schier <nicolas.schier@linux.dev>
Cc: Nick Desaulniers <nick.desaulniers+lkml@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20250723194212.274458858@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-24 21:33:35 -04:00
Gabriele Monaco
58d5f0d437 rv: Return init error when registering monitors
Monitors generated with dot2k have their registration function (the one
called during monitor initialisation) return always 0, even if the
registration failed on RV side.
This can hide potential errors.

Return the value returned by the RV register function.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250723161240.194860-6-gmonaco@redhat.com
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-24 10:43:46 -04:00
Gabriele Monaco
560473f2e2 verification/rvgen: Organise Kconfig entries for nested monitors
The current behaviour of rvgen when running with the -a option is to
append the necessary lines at the end of the configuration for Kconfig,
Makefile and tracepoints.
This is not always the desired behaviour in case of nested monitors:
while tracepoints are not affected by nesting and the Makefile's only
requirement is that the parent monitor is built before its children, in
the Kconfig it is better to have children defined right after their
parent, otherwise the result has wrong indentation:

[*]   foo_parent monitor
[*]     foo_child1 monitor
[*]     foo_child2 monitor
[*]   bar_parent monitor
[*]     bar_child1 monitor
[*]     bar_child2 monitor
[*]   foo_child3 monitor
[*]   foo_child4 monitor

Adapt rvgen to look for a different marker for nested monitors in the
Kconfig file and append the line right after the last sibling, instead
of the last monitor.
Also add the marker when creating a new parent monitor.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250723161240.194860-5-gmonaco@redhat.com
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-24 10:43:46 -04:00
Gabriele Monaco
9efcf59082 tools/dot2c: Fix generated files going over 100 column limit
The dot2c.py script generates all states in a single line. This breaks the
100 column limit when the state machines are non-trivial.

Change dot2c.py to generate the states in separate lines in case the
generated line is going to be too long.

Also adapt existing monitors with line length over the limit.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250723161240.194860-4-gmonaco@redhat.com
Suggested-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-24 10:43:46 -04:00
Steven Rostedt
dabd3e7dcc tracing: Have eprobes handle arrays
eprobes are dynamic events that can read other events using their fields
to create new events. Currently it doesn't work with arrays. When the new
event field is attached to the old event field, it looks at the size of
the field to determine what type of field the new field should be. For 1
byte fields it's a char, for 2 bytes, it's a short and for 4 bytes it's an
integer. For all other sizes it just defaults to "long". This also reads
the contents of the field for such cases.

For arrays that are bigger than the size of long, return the value of the
address of the content itself. This will allow eprobes to read other
values in the array of the old event.

This is useful when raw_syscalls is enabled but the syscall events are
not. The syscall events are created from the raw_syscalls as they have an
array of "args" that holds the 6 long words passed to the syscall entry
point. To read the value of "filename" from sys_openat, the eprobe could
attach to the raw_syscall and read the second value.

It can then even be passed to a synthetic event and converted back to
another eprobe to get the value of "filename" after it has been read by
the kernel during the system call:

 [
   Create an eprobe called "sys" and attach it to sys_enter.
   Read the id of the system call and the second argument
 ]
 # echo 'e:sys raw_syscalls.sys_enter nr=$id:u32 arg2=+8($args):u64' >> /sys/kernel/tracing/dynamic_events

 [
   Create a synthetic event "path" that will hold the address of the
   sys_openat filename. This is on a 64bit machine, so make it 64 bits
 ]
 # echo 's:path u64 file;' >> /sys/kernel/tracing/dynamic_events

 [
   Add a histogram to the eprobe/sys which tiggers if the "nr" field is
   257 (sys_openat), and save the filename in the "file" variable.
 ]
 # echo 'hist:keys=common_pid:file=arg2 if nr == 257' > /sys/kernel/tracing/events/eprobes/sys/trigger

 [
   Attach a histogram to sys_exit event that triggers the "path" synthetic
   event and records the "filename" that was passed from the sys eprobe.
 ]
 # echo 'hist:keys=common_pid:f=$file:onmatch(eprobes.sys).trace(path,$f)' >> /sys/kernel/tracing/events/raw_syscalls/sys_exit/trigger

 [
   Create another eprobe that dereferences the "file" field as a user
   space string and displays it.
 ]
 # echo 'e:open synthetic.path file=+0($file):ustring' >> /sys/kernel/tracing/dynamic_events

 # echo 1 > /sys/kernel/tracing/events/eprobes/open/enable
 # cat trace_pipe
             cat-1142    [003] ...5.   799.521912: open: (synthetic.path) file="/etc/ld.so.cache"
             cat-1142    [003] ...5.   799.521934: open: (synthetic.path) file="/etc/ld.so.cache"
             cat-1142    [003] ...5.   799.522065: open: (synthetic.path) file="/etc/ld.so.cache"
             cat-1142    [003] ...5.   799.522080: open: (synthetic.path) file="/etc/ld.so.cache"
             cat-1142    [003] ...5.   799.522296: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libc.so.6"
             cat-1142    [003] ...5.   799.522319: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libc.so.6"
            less-1143    [005] ...5.   799.522327: open: (synthetic.path) file="/etc/ld.so.cache"
             cat-1142    [003] ...5.   799.522333: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libc.so.6"
             cat-1142    [003] ...5.   799.522348: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libc.so.6"
            less-1143    [005] ...5.   799.522349: open: (synthetic.path) file="/etc/ld.so.cache"
             cat-1142    [003] ...5.   799.522363: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libc.so.6"
            less-1143    [005] ...5.   799.522477: open: (synthetic.path) file="/etc/ld.so.cache"
             cat-1142    [003] ...5.   799.522489: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libc.so.6"
            less-1143    [005] ...5.   799.522492: open: (synthetic.path) file="/etc/ld.so.cache"
            less-1143    [005] ...5.   799.522720: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libtinfo.so.6"
            less-1143    [005] ...5.   799.522744: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libtinfo.so.6"
            less-1143    [005] ...5.   799.522759: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libtinfo.so.6"
             cat-1142    [003] ...5.   799.522850: open: (synthetic.path) file="/lib/x86_64-linux-gnu/libc.so.6"

Link: https://lore.kernel.org/all/20250723124202.4f7475be@batman.local.home/

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-07-24 22:57:32 +09:00
Steven Rostedt
9f0cb91767 tracing: arm: arm64: Hide trace events ipi_raise, ipi_entry and ipi_exit
The ipi tracepoints are mostly generic, but the tracepoints ipi_raise,
ipi_entry and ipi_exit are only used by arm and arm64. This means these
trace events are wasting memory in all the other architectures that do not
use them.

Add CONFIG_HAVE_EXTRA_IPI_TRACEPOINTS and have arm and arm64 select it to
enable these trace events. The config makes it easy if other architectures
decide to trace these as well.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Will Deacon <will@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Nicolas Pitre <nico@fluxnic.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20250722103714.64eba013@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
2025-07-23 14:58:55 -04:00
Masami Hiramatsu (Google)
558d5f3cd2 tracing: probes: Add a kerneldoc for traceprobe_parse_event_name()
Since traceprobe_parse_event_name() is a bit complicated, add a
kerneldoc for explaining the behavior.

Link: https://lore.kernel.org/all/175323430565.57270.2602609519355112748.stgit@devnote2/

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-24 00:22:06 +09:00
Masami Hiramatsu (Google)
97e8230f89 tracing: uprobe-event: Allocate string buffers from heap
Allocate temporary string buffers for parsing uprobe-events
from heap instead of stack.

Link: https://lore.kernel.org/all/175323429593.57270.12369235525923902341.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-24 00:21:58 +09:00
Masami Hiramatsu (Google)
4c6edb43ea tracing: eprobe-event: Allocate string buffers from heap
Allocate temporary string buffers for parsing eprobe-events
from heap instead of stack.

Link: https://lore.kernel.org/all/175323428599.57270.988038042425748956.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-24 00:21:51 +09:00
Masami Hiramatsu (Google)
33b4e38baa tracing: kprobe-event: Allocate string buffers from heap
Allocate temporary string buffers for parsing kprobe-events
from heap instead of stack.

Link: https://lore.kernel.org/all/175323427627.57270.5105357260879695051.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-24 00:21:44 +09:00
Masami Hiramatsu (Google)
d643eaa708 tracing: fprobe-event: Allocate string buffers from heap
Allocate temporary string buffers for fprobe-event from heap
instead of stack. This fixes the stack frame exceed limit error.

Link: https://lore.kernel.org/all/175323426643.57270.6657152008331160704.stgit@devnote2/

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202506240416.nZIhDXoO-lkp@intel.com/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-24 00:21:37 +09:00
Masami Hiramatsu (Google)
43beb5e89b tracing: probe: Allocate traceprobe_parse_context from heap
Instead of allocating traceprobe_parse_context on stack, allocate it
dynamically from heap (slab).

This change is likely intended to prevent potential stack overflow
issues, which can be a concern in the kernel environment where stack
space is limited.

Link: https://lore.kernel.org/all/175323425650.57270.280750740753792504.stgit@devnote2/

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202506240416.nZIhDXoO-lkp@intel.com/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-24 00:21:30 +09:00
Masami Hiramatsu (Google)
2f02a61d84 tracing: probes: Sort #include alphabetically
Sort the #include directives in trace_probe* files alphabetically for
easier maintenance and avoid double includes.
This also groups headers as linux-generic, asm-generic, and local
headers.

Link: https://lore.kernel.org/all/175323424678.57270.11975372127870059007.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-07-24 00:21:22 +09:00
Steven Rostedt
9ba817fb7c tracing: Deprecate auto-mounting tracefs in debugfs
In January 2015, tracefs was created to allow access to the tracing
infrastructure without needing to compile in debugfs. When tracefs is
configured, the directory /sys/kernel/tracing will exist and tooling is
expected to use that path to access the tracing infrastructure.

To allow backward compatibility, when debugfs is mounted, it would
automount tracefs in its "tracing" directory so that tooling that had hard
coded /sys/kernel/debug/tracing would still work.

It has been over 10 years since the new interface was introduced, and all
tooling should now be using it. Start the process of deprecating the old
path so that it doesn't need to be maintained anymore.

A new config is added to allow distributions to disable automounting of
tracefs on debugfs.

If /sys/kernel/debug/tracing is accessed, a pr_warn() will trigger stating:

"NOTICE: Automounting of tracing to debugfs is deprecated and will be removed in 2030"

Expect to remove this feature in 5 years (2030).

Cc: <linux-trace-users@vger.kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/20250722170806.40c068c6@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-23 10:47:26 -04:00
Steven Rostedt
502ffa4399 tracing: Fix comment in trace_module_remove_events()
Fix typo "allocade" -> "allocated".

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250710095628.42ed6b06@batman.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-22 20:28:54 -04:00
Steven Rostedt
4d6d0a6263 tracing: Remove redundant config HAVE_FTRACE_MCOUNT_RECORD
Ftrace is tightly coupled with architecture specific code because it
requires the use of trampolines written in assembly. This means that when
a new feature or optimization is made, it must be done for all
architectures. To simplify the approach, CONFIG_HAVE_FTRACE_* configs are
added to denote which architecture has the new enhancement so that other
architectures can still function until they too have been updated.

The CONFIG_HAVE_FTRACE_MCOUNT was added to help simplify the
DYNAMIC_FTRACE work, but now every architecture that implements
DYNAMIC_FTRACE also has HAVE_FTRACE_MCOUNT set too, making it redundant
with the HAVE_DYNAMIC_FTRACE.

Remove the HAVE_FTRACE_MCOUNT config and use DYNAMIC_FTRACE directly where
applicable.

Link: https://lore.kernel.org/all/20250703154916.48e3ada7@gandalf.local.home/

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20250704104838.27a18690@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-22 20:15:56 -04:00
Steven Rostedt
07c3f391bc tracing: Remove EVENT_FILE_FL_SOFT_MODE flag
When soft disabling of trace events was first created, it needed to have a
way to know if a file had a user that was using it with soft disabled (for
triggers that need to enable or disable events from a context that can not
really enable or disable the event, it would set SOFT_DISABLED to state it
is disabled). The flag SOFT_MODE was used to denote that an event had a
user that would enable or disable it via the SOFT_DISABLED flag.

Commit 1cf4c0732d ("tracing: Modify soft-mode only if there's no other
referrer") fixed a bug where if two users were using the SOFT_DISABLED
flag the accounting would get messed up as the SOFT_MODE flag could only
handle one user. That commit added the sm_ref counter which kept track of
how many users were using the event in "soft mode". This made the
SOFT_MODE flag redundant as it should only be set if the sm_ref counter is
non zero.

Remove the SOFT_MODE flag and just use the sm_ref counter to know the
event is in soft mode or not. This makes the code a bit simpler.

Link: https://lore.kernel.org/all/20250702111908.03759998@batman.local.home/

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Link: https://lore.kernel.org/20250702143657.18dd1882@batman.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-22 20:15:51 -04:00
Nam Cao
c897c1e5b1 tracing: Remove pointless memory barriers
Memory barriers are useful to ensure memory accesses from one CPU appear in
the original order as seen by other CPUs.

Some smp_rmb() and smp_wmb() are used, but they are not ordering multiple
memory accesses.

Remove them.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/20250626151940.1756398-1-namcao@linutronix.de
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-22 20:15:51 -04:00
Steven Rostedt
9b4d5d330f ftrace: Make DYNAMIC_FTRACE always enabled for architectures that support it
ftrace has two flavors:

 1) static: Where every function always calls the ftrace trampoline

 2) dynamic: Where each function has nops that can be changed on demand to
             jump to the ftrace trampoline when needed.

The static flavor has very high performance overhead and was only created
to make it easier for architectures to implement the dynamic flavor. An
architecture developer can first implement the static ftrace to make sure
the trampolines work before working on the more complicated dynamic aspect
of ftrace. Once the architecture can support dynamic ftrace, there's no
reason to continue to support the static flavor. In fact, the static
flavor tends to bitrot and bugs start to appear in them.

Remove the prompt to pick DYNAMIC_FTRACE and simply enable it if the
architecture supports it.

Link: https://lore.kernel.org/all/f7e12c6d-892e-4ca3-9ef0-fbb524d04a48@ghiti.fr/

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: ChenMiao <chenmiao.ku@gmail.com>
Link: https://lore.kernel.org/20250703115222.2d7c8cd5@batman.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-22 20:13:16 -04:00
Steven Rostedt
218d372ce8 fgraph: Keep track of when fgraph_ops are registered or not
Add a warning if unregister_ftrace_graph() is called without ever
registering it, or if register_ftrace_graph() is called twice. This can
detect errors when they happen and not later when there's a side effect:

Link: https://lore.kernel.org/all/20250617120830.24fbdd62@gandalf.local.home/

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20250701194451.22e34724@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-22 20:06:02 -04:00
Steven Rostedt
119a5d5736 ring-buffer: Remove ring_buffer_read_prepare_sync()
When the ring buffer was first introduced, reading the non-consuming
"trace" file required disabling the writing of the ring buffer. To make
sure the writing was fully disabled before iterating the buffer with a
non-consuming read, it would set the disable flag of the buffer and then
call an RCU synchronization to make sure all the buffers were
synchronized.

The function ring_buffer_read_start() originally  would initialize the
iterator and call an RCU synchronization, but this was for each individual
per CPU buffer where this would get called many times on a machine with
many CPUs before the trace file could be read. The commit 72c9ddfd4c
("ring-buffer: Make non-consuming read less expensive with lots of cpus.")
separated ring_buffer_read_start into ring_buffer_read_prepare(),
ring_buffer_read_sync() and then ring_buffer_read_start() to allow each of
the per CPU buffers to be prepared, call the read_buffer_read_sync() once,
and then the ring_buffer_read_start() for each of the CPUs which made
things much faster.

The commit 1039221cc2 ("ring-buffer: Do not disable recording when there
is an iterator") removed the requirement of disabling the recording of the
ring buffer in order to iterate it, but it did not remove the
synchronization that was happening that was required to wait for all the
buffers to have no more writers. It's now OK for the buffers to have
writers and no synchronization is needed.

Remove the synchronization and put back the interface for the ring buffer
iterator back before commit 72c9ddfd4c was applied.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250630180440.3eabb514@batman.local.home
Reported-by: David Howells <dhowells@redhat.com>
Fixes: 1039221cc2 ("ring-buffer: Do not disable recording when there is an iterator")
Tested-by: David Howells <dhowells@redhat.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-22 20:01:41 -04:00
Steven Rostedt
647fe16b46 PM: cpufreq: powernv/tracing: Move powernv_throttle trace event
As the trace event powernv_throttle is only used by the powernv code, move
it to a separate include file and have that code directly enable it.

Trace events can take up around 5K of memory when they are defined
regardless if they are used or not. It wastes memory to have them defined
in configurations where the tracepoint is not used.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/20250612145407.906308844@goodmis.org
Fixes: 0306e481d4 ("cpufreq: powernv/tracing: Add powernv_throttle tracepoint")
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-21 16:40:56 -04:00
Linus Torvalds
2013e8c2e6 Tracing fixes for 6.16:
- Fix timerlat with use of FORTIFY_SOURCE
 
   FORTIFY_SOURCE was added to the stack tracer where it compares the
   entry->caller array to having entry->size elements.
 
   timerlat has the following:
 
      memcpy(&entry->caller, fstack->calls, size);
      entry->size = size;
 
   Which triggers FORTIFY_SOURCE as the caller is populated before the
   entry->size is initialized.
 
   Swap the order to satisfy FORTIFY_SOURCE logic.
 
 - Add down_write(trace_event_sem) when adding trace events in modules
 
   Trace events being added to the ftrace_events array are protected by
   the trace_event_sem semaphore. But when loading modules that have
   trace events, the addition of the events are not protected by the
   semaphore and loading two modules that have events at the same time
   can corrupt the list.
 
   Also add a lockdep_assert_held(trace_event_sem) to
   _trace_add_event_dirs() to confirm its held when iterating the list.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaH06gBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qoJsAP0a+/E0f+5g7O/OtYPVEDSCREv1vj9c
 3dr0iWopqaOC7gEAw8Vc5iWIHKcB/JuJ+GqALoutL+lihruG26MWkFFsOgU=
 =zH5J
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix timerlat with use of FORTIFY_SOURCE

   FORTIFY_SOURCE was added to the stack tracer where it compares the
   entry->caller array to having entry->size elements.

   timerlat has the following:

      memcpy(&entry->caller, fstack->calls, size);
      entry->size = size;

   Which triggers FORTIFY_SOURCE as the caller is populated before the
   entry->size is initialized.

   Swap the order to satisfy FORTIFY_SOURCE logic.

 - Add down_write(trace_event_sem) when adding trace events in modules

   Trace events being added to the ftrace_events array are protected by
   the trace_event_sem semaphore. But when loading modules that have
   trace events, the addition of the events are not protected by the
   semaphore and loading two modules that have events at the same time
   can corrupt the list.

   Also add a lockdep_assert_held(trace_event_sem) to
   _trace_add_event_dirs() to confirm it is held when iterating the
   list.

* tag 'trace-v6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Add down_write(trace_event_sem) when adding trace event
  tracing/osnoise: Fix crash in timerlat_dump_stack()
2025-07-20 13:03:31 -07:00
Steven Rostedt
b5e8acc14d tracing: Add down_write(trace_event_sem) when adding trace event
When a module is loaded, it adds trace events defined by the module. It
may also need to modify the modules trace printk formats to replace enum
names with their values.

If two modules are loaded at the same time, the adding of the event to the
ftrace_events list can corrupt the walking of the list in the code that is
modifying the printk format strings and crash the kernel.

The addition of the event should take the trace_event_sem for write while
it adds the new event.

Also add a lockdep_assert_held() on that semaphore in
__trace_add_event_dirs() as it iterates the list.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/20250718223158.799bfc0c@batman.local.home
Reported-by: Fusheng Huang(黄富生)  <Fusheng.Huang@luxshare-ict.com>
Closes: https://lore.kernel.org/all/20250717105007.46ccd18f@batman.local.home/
Fixes: 110bf2b764 ("tracing: add protection around module events unload")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-19 13:54:59 -04:00
Tomas Glozar
85a3bce695 tracing/osnoise: Fix crash in timerlat_dump_stack()
We have observed kernel panics when using timerlat with stack saving,
with the following dmesg output:

memcpy: detected buffer overflow: 88 byte write of buffer size 0
WARNING: CPU: 2 PID: 8153 at lib/string_helpers.c:1032 __fortify_report+0x55/0xa0
CPU: 2 UID: 0 PID: 8153 Comm: timerlatu/2 Kdump: loaded Not tainted 6.15.3-200.fc42.x86_64 #1 PREEMPT(lazy)
Call Trace:
 <TASK>
 ? trace_buffer_lock_reserve+0x2a/0x60
 __fortify_panic+0xd/0xf
 __timerlat_dump_stack.cold+0xd/0xd
 timerlat_dump_stack.part.0+0x47/0x80
 timerlat_fd_read+0x36d/0x390
 vfs_read+0xe2/0x390
 ? syscall_exit_to_user_mode+0x1d5/0x210
 ksys_read+0x73/0xe0
 do_syscall_64+0x7b/0x160
 ? exc_page_fault+0x7e/0x1a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

__timerlat_dump_stack() constructs the ftrace stack entry like this:

struct stack_entry *entry;
...
memcpy(&entry->caller, fstack->calls, size);
entry->size = fstack->nr_entries;

Since commit e7186af7fb ("tracing: Add back FORTIFY_SOURCE logic to
kernel_stack event structure"), struct stack_entry marks its caller
field with __counted_by(size). At the time of the memcpy, entry->size
contains garbage from the ringbuffer, which under some circumstances is
zero, triggering a kernel panic by buffer overflow.

Populate the size field before the memcpy so that the out-of-bounds
check knows the correct size. This is analogous to
__ftrace_trace_stack().

Cc: stable@vger.kernel.org
Cc: John Kacur <jkacur@redhat.com>
Cc: Luis Goncalves <lgoncalv@redhat.com>
Cc: Attila Fazekas <afazekas@redhat.com>
Link: https://lore.kernel.org/20250716143601.7313-1-tglozar@redhat.com
Fixes: e7186af7fb ("tracing: Add back FORTIFY_SOURCE logic to kernel_stack event structure")
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-18 15:51:35 -04:00
Alexei Starovoitov
beb1097ec8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc6
Cross-merge BPF and other fixes after downstream PR.

No conflicts.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-18 12:15:59 -07:00
Feng Yang
62ef449b8d bpf: Clean up individual BTF_ID code
Use BTF_ID_LIST_SINGLE(a, b, c) instead of
BTF_ID_LIST(a)
BTF_ID(b, c)

Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Link: https://lore.kernel.org/r/20250710055419.70544-1-yangfeng59949@163.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-16 18:34:42 -07:00
Nathan Chancellor
1ed171a3af tracing/probes: Avoid using params uninitialized in parse_btf_arg()
After a recent change in clang to strengthen uninitialized warnings [1],
it points out that in one of the error paths in parse_btf_arg(), params
is used uninitialized:

  kernel/trace/trace_probe.c:660:19: warning: variable 'params' is uninitialized when used here [-Wuninitialized]
    660 |                         return PTR_ERR(params);
        |                                        ^~~~~~

Match many other NO_BTF_ENTRY error cases and return -ENOENT, clearing
up the warning.

Link: https://lore.kernel.org/all/20250715-trace_probe-fix-const-uninit-warning-v1-1-98960f91dd04@kernel.org/

Cc: stable@vger.kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/2110
Fixes: d157d76944 ("tracing/probes: Support BTF field access from $retval")
Link: 2464313eef [1]
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-07-16 14:01:54 +09:00
Johannes Thumshirn
bd116214d5 blktrace: add zoned block commands to blk_fill_rwbs
Add zoned block commands to blk_fill_rwbs:

- ZONE APPEND will be decoded as 'ZA'
- ZONE RESET will be decoded as 'ZR'
- ZONE RESET ALL will be decoded as 'ZRA'
- ZONE FINISH will be decoded as 'ZF'
- ZONE OPEN will be decoded as 'ZO'
- ZONE CLOSE will be decoded as 'ZC'

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250715115324.53308-2-johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15 08:03:48 -06:00
Tao Chen
b725441f02 bpf: Add attach_type field to bpf_link
Attach_type will be set when a link is created by user. It is better to
record attach_type in bpf_link generically and have it available
universally for all link types. So add the attach_type field in bpf_link
and move the sleepable field to avoid unnecessary gap padding.

Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20250710032038.888700-2-chen.dylane@linux.dev
2025-07-11 10:51:55 -07:00
Jason Xing
7f2173894f blktrace: use rbuf->stats.full as a drop indicator in relayfs
Replace internal subbuf_start in blktrace with the default policy in
relayfs.

Remove dropped field from struct blktrace.  Correspondingly, call the
common helper in relay.  By incrementing full_count to keep track of how
many times we encountered a full buffer issue, user space will know how
many events were lost.

Link: https://lkml.kernel.org/r/20250612061201.34272-5-kerneljasonxing@gmail.com
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Yushan Zhou <katrinzhou@tencent.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:57:52 -07:00
Jason Xing
2489e95812 relayfs: abolish prev_padding
Patch series "relayfs: misc changes", v5.

The series mostly focuses on the error counters which helps every user
debug their own kernel module.


This patch (of 5):

prev_padding represents the unused space of certain subbuffer.  If the
content of a call of relay_write() exceeds the limit of the remainder of
this subbuffer, it will skip storing in the rest space and record the
start point as buf->prev_padding in relay_switch_subbuf().  Since the buf
is a per-cpu big buffer, the point of prev_padding as a global value for
the whole buffer instead of a single subbuffer (whose padding info is
stored in buf->padding[]) seems meaningless from the real use cases, so we
don't bother to record it any more.

Link: https://lkml.kernel.org/r/20250612061201.34272-1-kerneljasonxing@gmail.com
Link: https://lkml.kernel.org/r/20250612061201.34272-2-kerneljasonxing@gmail.com
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Yushan Zhou <katrinzhou@tencent.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:57:51 -07:00
Artem Sadovnikov
437889b926 fgraph: Make pid_str size match the comment
The comment above buffer mentions sign, 10 bytes width for number and null
terminator, but buffer itself isn't large enough to hold that much data.

This is a cosmetic change, since PID cannot be negative, other than -1.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Link: https://lore.kernel.org/20250617152110.2530-1-a.sadovnikov@ispras.ru
Signed-off-by: Artem Sadovnikov <a.sadovnikov@ispras.ru>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09 20:39:00 -04:00
Masami Hiramatsu (Google)
ca296d32ec tracing: ring_buffer: Rewind persistent ring buffer on reboot
Rewind persistent ring buffer pages which have been read in the previous
boot. Those pages are highly possible to be lost before writing it to the
disk if the previous kernel crashed. In this case, the trace data is kept
on the persistent ring buffer, but it can not be read because its commit
size has been reset after read.  This skips clearing the commit size of
each sub-buffer and recover it after reboot.

Note: If you read the previous boot data via trace_pipe, that is not
accessible in that time. But reboot without clearing (or reusing) the read
data, the read data is recovered again in the next boot.

Thus, when you read the previous boot data, clear it by `echo > trace`.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/174899582116.955054.773265393511190051.stgit@mhiramat.tok.corp.google.com
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09 20:38:42 -04:00
Nam Cao
fac5493251 rv: Allow to configure the number of per-task monitor
Now that there are 2 monitors for real-time applications, users may want to
enable both of them simultaneously. Make the number of per-task monitor
configurable. Default it to 2 for now.

Cc: John Ogness <john.ogness@linutronix.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/93e83313fc4ba7f6e66f4abe80ca5f5494d658d0.1752088709.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09 15:27:02 -04:00
Nam Cao
f74f8bb246 rv: Add rtapp_sleep monitor
Add a monitor for checking that real-time tasks do not go to sleep in a
manner that may cause undesirable latency.

Also change
	RV depends on TRACING
to
	RV select TRACING
to avoid the following recursive dependency:

 error: recursive dependency detected!
	symbol TRACING is selected by PREEMPTIRQ_TRACEPOINTS
	symbol PREEMPTIRQ_TRACEPOINTS depends on TRACE_IRQFLAGS
	symbol TRACE_IRQFLAGS is selected by RV_MON_SLEEP
	symbol RV_MON_SLEEP depends on RV
	symbol RV depends on TRACING

Cc: John Ogness <john.ogness@linutronix.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/75bc5bcc741d153aa279c95faf778dff35c5c8ad.1752088709.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09 15:27:01 -04:00
Nam Cao
9162620eb6 rv: Add rtapp_pagefault monitor
Userspace real-time applications may have design flaws that they raise
page faults in real-time threads, and thus have unexpected latencies.

Add an linear temporal logic monitor to detect this scenario.

Cc: John Ogness <john.ogness@linutronix.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/78fea8a2de6d058241d3c6502c1a92910772b0ed.1752088709.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09 15:27:01 -04:00
Nam Cao
886fc86e94 rv: Add rtapp container monitor
Add the container "rtapp" which is the monitor collection for detecting
problems with real-time applications. The monitors will be added in the
follow-up commits.

Cc: John Ogness <john.ogness@linutronix.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/fb18b87631d386271de00959d8d4826f23fcd1cd.1752088709.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09 15:27:01 -04:00
Nam Cao
a9769a5b98 rv: Add support for LTL monitors
While attempting to implement DA monitors for some complex specifications,
deterministic automaton is found to be inappropriate as the specification
language. The automaton is complicated, hard to understand, and
error-prone.

For these cases, linear temporal logic is more suitable as the
specification language.

Add support for linear temporal logic runtime verification monitor.

Cc: John Ogness <john.ogness@linutronix.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/d366c1fed60ed4e8f6451f3c15a99755f2740b5f.1752088709.git.namcao@linutronix.de
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09 15:27:01 -04:00
Nam Cao
c94d27c01b rv: rename CONFIG_DA_MON_EVENTS to CONFIG_RV_MON_EVENTS
CONFIG_DA_MON_EVENTS is not specific to deterministic automaton. It could
be used for other monitor types. Therefore rename it to
CONFIG_RV_MON_EVENTS.

This prepares for the introduction of linear temporal logic monitor.

Cc: John Ogness <john.ogness@linutronix.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/507210517123d887c1d208aa2fd45ec69765d3f0.1752088709.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09 15:27:01 -04:00
Nam Cao
ff4e233d8a rv: Let the reactors take care of buffers
Each RV monitor has one static buffer to send to the reactors. If multiple
errors are detected simultaneously, the one buffer could be overwritten.

Instead, leave it to the reactors to handle buffering.

Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09 15:27:00 -04:00
Nam Cao
2d08876263 rv: Add #undef TRACE_INCLUDE_FILE
Without "#undef TRACE_INCLUDE_FILE", there could be a build error due to
TRACE_INCLUDE_FILE being redefined. Therefore add it.

Also fix a typo while at it.

Cc: John Ogness <john.ogness@linutronix.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/f805e074581e927bb176c742c981fa7675b6ebe5.1752088709.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09 15:27:00 -04:00
Arnd Bergmann
adc353c0bf kernel: trace: preemptirq_delay_test: use offstack cpu mask
A CPU mask on the stack is broken for large values of CONFIG_NR_CPUS:

kernel/trace/preemptirq_delay_test.c: In function ‘preemptirq_delay_run’:
kernel/trace/preemptirq_delay_test.c:143:1: error: the frame size of 8512 bytes is larger than 1536 bytes [-Werror=frame-larger-than=]

Fall back to dynamic allocation here.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Song Chen <chensong_2000@189.cn>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250620111215.3365305-1-arnd@kernel.org
Fixes: 4b9091e1c1 ("kernel: trace: preemptirq_delay_test: add cpu affinity")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-08 18:17:38 -04:00
Steven Rostedt
3aceaa539c tracing: Use queue_rcu_work() to free filters
Freeing of filters requires to wait for both an RCU grace period as well as
a RCU task trace wait period after they have been detached from their
lists. The trace task period can be quite large so the freeing of the
filters was moved to use the call_rcu*() routines. The problem with that is
that the callback functions of call_rcu*() is done from a soft irq and can
cause latencies if the callback takes a bit of time.

The filters are freed per event in a system and the syscalls system
contains an event per system call, which can be over 700 events. Freeing 700
filters in a bottom half is undesirable.

Instead, move the freeing to use queue_rcu_work() which is done in task
context.

Link: https://lore.kernel.org/all/9a2f0cd0-1561-4206-8966-f93ccd25927f@paulmck-laptop/

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250609131732.04fd303b@gandalf.local.home
Fixes: a9d0aab5eb ("tracing: Fix regression of filter waiting a long time on RCU synchronization")
Suggested-by: "Paul E. McKenney" <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-08 18:17:29 -04:00
Yury Norov
88c79ecfb6 tracing: Replace opencoded cpumask_next_wrap() in move_to_next_cpu()
The dedicated cpumask_next_wrap() is more verbose and effective than
cpumask_next() followed by cpumask_first().

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250605000651.45281-1-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-08 18:17:29 -04:00
Tao Chen
da7e9c0a7f bpf: Add show_fdinfo for kprobe_multi
Show kprobe_multi link info with fdinfo, the info as follows:

link_type:	kprobe_multi
link_id:	1
prog_tag:	a69740b9746f7da8
prog_id:	21
kprobe_cnt:	8
missed:	0
cookie	 func
1	 bpf_fentry_test1+0x0/0x20
7	 bpf_fentry_test2+0x0/0x20
2	 bpf_fentry_test3+0x0/0x20
3	 bpf_fentry_test4+0x0/0x20
4	 bpf_fentry_test5+0x0/0x20
5	 bpf_fentry_test6+0x0/0x20
6	 bpf_fentry_test7+0x0/0x20
8	 bpf_fentry_test8+0x0/0x10

Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Link: https://lore.kernel.org/r/20250702153958.639852-3-chen.dylane@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-03 19:29:42 -07:00
Tao Chen
b4dfe26fbf bpf: Add show_fdinfo for uprobe_multi
Show uprobe_multi link info with fdinfo, the info as follows:

link_type:	uprobe_multi
link_id:	9
prog_tag:	e729f789e34a8eca
prog_id:	39
uprobe_cnt:	3
pid:	0
path:	/home/dylane/bpf/tools/testing/selftests/bpf/test_progs
cookie	 offset	 ref_ctr_offset
3	 0xa69f13	 0x0
1	 0xa69f1e	 0x0
2	 0xa69f29	 0x0

Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Link: https://lore.kernel.org/r/20250702153958.639852-2-chen.dylane@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-03 19:29:42 -07:00
Tao Chen
803f0700a3 bpf: Show precise link_type for {uprobe,kprobe}_multi fdinfo
Alexei suggested, 'link_type' can be more precise and differentiate
for human in fdinfo. In fact BPF_LINK_TYPE_KPROBE_MULTI includes
kretprobe_multi type, the same as BPF_LINK_TYPE_UPROBE_MULTI, so we
can show it more concretely.

link_type:	kprobe_multi
link_id:	1
prog_tag:	d2b307e915f0dd37
...
link_type:	kretprobe_multi
link_id:	2
prog_tag:	ab9ea0545870781d
...
link_type:	uprobe_multi
link_id:	9
prog_tag:	e729f789e34a8eca
...
link_type:	uretprobe_multi
link_id:	10
prog_tag:	7db356c03e61a4d4

Co-developed-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Link: https://lore.kernel.org/r/20250702153958.639852-1-chen.dylane@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-03 19:29:42 -07:00
Masami Hiramatsu (Google)
2867495dea tracing: tprobe-events: Register tracepoint when enable tprobe event
As same as fprobe, register tracepoint stub function only when enabling
tprobe events. The major changes are introducing a list of
tracepoint_user and its lock, and tprobe_event_module_nb, which is
another module notifier for module loading/unloading.  By spliting the
lock from event_mutex and a module notifier for trace_fprobe, it
solved AB-BA lock dependency issue between event_mutex and
tracepoint_module_list_mutex.

Link: https://lore.kernel.org/all/174343538901.843280.423773753642677941.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-07-02 09:45:35 +09:00
Masami Hiramatsu (Google)
2db832ec90 tracing: fprobe-events: Register fprobe-events only when it is enabled
Currently fprobe events are registered when it is defined. Thus it will
give some overhead even if it is disabled. This changes it to register the
fprobe only when it is enabled.

Link: https://lore.kernel.org/all/174343537128.843280.16131300052837035043.stgit@devnote2/

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-07-02 09:45:09 +09:00
Masami Hiramatsu (Google)
e3d6e1b9a3 tracing: tprobe-events: Support multiple tprobes on the same tracepoint
Allow user to set multiple tracepoint-probe events on the same
tracepoint. After the last tprobe-event is removed, the tracepoint
callback is unregistered.

Link: https://lore.kernel.org/all/174343536245.843280.6548776576601537671.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-07-02 09:44:55 +09:00
Masami Hiramatsu (Google)
c135ab4a96 tracing: tprobe-events: Remove mod field from tprobe-event
Remove unneeded 'mod' struct module pointer field from trace_fprobe
because we don't need to save this info.

Link: https://lore.kernel.org/all/174343535351.843280.5868426549023332120.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-07-02 09:44:36 +09:00
Masami Hiramatsu (Google)
ddb017ec9c tracing: probe-events: Cleanup entry-arg storing code
Cleanup __store_entry_arg() so that it is easier to understand.
The main complexity may come from combining the loops for finding
stored-entry-arg and max-offset and appending new entry.

This split those different loops into 3 parts, lookup the same
entry-arg, find the max offset and append new entry.

Link: https://lore.kernel.org/all/174323039929.348535.4705349977127704120.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-07-02 09:44:12 +09:00
Edward Adam Davis
6921d1e07c tracing: Fix filter logic error
If the processing of the tr->events loop fails, the filter that has been
added to filter_head will be released twice in free_filter_list(&head->rcu)
and __free_filter(filter).

After adding the filter of tr->events, add the filter to the filter_head
process to avoid triggering uaf.

Link: https://lore.kernel.org/tencent_4EF87A626D702F816CD0951CE956EC32CD0A@qq.com
Fixes: a9d0aab5eb ("tracing: Fix regression of filter waiting a long time on RCU synchronization")
Reported-by: syzbot+daba72c4af9915e9c894@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=daba72c4af9915e9c894
Tested-by: syzbot+daba72c4af9915e9c894@syzkaller.appspotmail.com
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-27 15:51:36 -04:00
Alexei Starovoitov
886178a33a Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc3
Cross-merge BPF, perf and other fixes after downstream PRs.
It restores BPF CI to green after critical fix
commit bc4394e5e7 ("perf: Fix the throttle error of some clock events")

No conflicts.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-26 09:49:39 -07:00
Steven Rostedt
327e286643 fgraph: Do not enable function_graph tracer when setting funcgraph-args
When setting the funcgraph-args option when function graph tracer is net
enabled, it incorrectly enables it. Worse, it unregisters itself when it
was never registered. Then when it gets enabled again, it will register
itself a second time causing a WARNing.

 ~# echo 1 > /sys/kernel/tracing/options/funcgraph-args
 ~# head -20 /sys/kernel/tracing/trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 813/26317372   #P:8
 #
 #                                _-----=> irqs-off/BH-disabled
 #                               / _----=> need-resched
 #                              | / _---=> hardirq/softirq
 #                              || / _--=> preempt-depth
 #                              ||| / _-=> migrate-disable
 #                              |||| /     delay
 #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
 #              | |         |   |||||     |         |
           <idle>-0       [007] d..4.   358.966010:  7)   1.692 us    |          fetch_next_timer_interrupt(basej=4294981640, basem=357956000000, base_local=0xffff88823c3ae040, base_global=0xffff88823c3af300, tevt=0xffff888100e47cb8);
           <idle>-0       [007] d..4.   358.966012:  7)               |          tmigr_cpu_deactivate(nextexp=357988000000) {
           <idle>-0       [007] d..4.   358.966013:  7)               |            _raw_spin_lock(lock=0xffff88823c3b2320) {
           <idle>-0       [007] d..4.   358.966014:  7)   0.981 us    |              preempt_count_add(val=1);
           <idle>-0       [007] d..5.   358.966017:  7)   1.058 us    |              do_raw_spin_lock(lock=0xffff88823c3b2320);
           <idle>-0       [007] d..4.   358.966019:  7)   5.824 us    |            }
           <idle>-0       [007] d..5.   358.966021:  7)               |            tmigr_inactive_up(group=0xffff888100cb9000, child=0x0, data=0xffff888100e47bc0) {
           <idle>-0       [007] d..5.   358.966022:  7)               |              tmigr_update_events(group=0xffff888100cb9000, child=0x0, data=0xffff888100e47bc0) {

Notice the "tracer: nop" at the top there. The current tracer is the "nop"
tracer, but the content is obviously the function graph tracer.

Enabling function graph tracing will cause it to register again and
trigger a warning in the accounting:

 ~# echo function_graph > /sys/kernel/tracing/current_tracer
 -bash: echo: write error: Device or resource busy

With the dmesg of:

 ------------[ cut here ]------------
 WARNING: CPU: 7 PID: 1095 at kernel/trace/ftrace.c:3509 ftrace_startup_subops+0xc1e/0x1000
 Modules linked in: kvm_intel kvm irqbypass
 CPU: 7 UID: 0 PID: 1095 Comm: bash Not tainted 6.16.0-rc2-test-00006-gea03de4105d3 #24 PREEMPT
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 RIP: 0010:ftrace_startup_subops+0xc1e/0x1000
 Code: 48 b8 22 01 00 00 00 00 ad de 49 89 84 24 88 01 00 00 8b 44 24 08 89 04 24 e9 c3 f7 ff ff c7 04 24 ed ff ff ff e9 b7 f7 ff ff <0f> 0b c7 04 24 f0 ff ff ff e9 a9 f7 ff ff c7 04 24 f4 ff ff ff e9
 RSP: 0018:ffff888133cff948 EFLAGS: 00010202
 RAX: 0000000000000001 RBX: 1ffff1102679ff31 RCX: 0000000000000000
 RDX: 1ffffffff0b27a60 RSI: ffffffff8593d2f0 RDI: ffffffff85941140
 RBP: 00000000000c2041 R08: ffffffffffffffff R09: ffffed1020240221
 R10: ffff88810120110f R11: ffffed1020240214 R12: ffffffff8593d2f0
 R13: ffffffff8593d300 R14: ffffffff85941140 R15: ffffffff85631100
 FS:  00007f7ec6f28740(0000) GS:ffff8882b5251000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f7ec6f181c0 CR3: 000000012f1d0005 CR4: 0000000000172ef0
 Call Trace:
  <TASK>
  ? __pfx_ftrace_startup_subops+0x10/0x10
  ? find_held_lock+0x2b/0x80
  ? ftrace_stub_direct_tramp+0x10/0x10
  ? ftrace_stub_direct_tramp+0x10/0x10
  ? trace_preempt_on+0xd0/0x110
  ? __pfx_trace_graph_entry_args+0x10/0x10
  register_ftrace_graph+0x4d2/0x1020
  ? tracing_reset_online_cpus+0x14b/0x1e0
  ? __pfx_register_ftrace_graph+0x10/0x10
  ? ring_buffer_record_enable+0x16/0x20
  ? tracing_reset_online_cpus+0x153/0x1e0
  ? __pfx_tracing_reset_online_cpus+0x10/0x10
  ? __pfx_trace_graph_return+0x10/0x10
  graph_trace_init+0xfd/0x160
  tracing_set_tracer+0x500/0xa80
  ? __pfx_tracing_set_tracer+0x10/0x10
  ? lock_release+0x181/0x2d0
  ? _copy_from_user+0x26/0xa0
  tracing_set_trace_write+0x132/0x1e0
  ? __pfx_tracing_set_trace_write+0x10/0x10
  ? ftrace_graph_func+0xcc/0x140
  ? ftrace_stub_direct_tramp+0x10/0x10
  ? ftrace_stub_direct_tramp+0x10/0x10
  ? ftrace_stub_direct_tramp+0x10/0x10
  vfs_write+0x1d0/0xe90
  ? __pfx_vfs_write+0x10/0x10

Have the setting of the funcgraph-args check if function_graph tracer is
the current tracer of the instance, and if not, do nothing, as there's
nothing to do (the option is checked when function_graph tracing starts).

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20250618073801.057ea636@gandalf.local.home
Fixes: c7a60a733c ("ftrace: Have funcgraph-args take affect during tracing")
Closes: https://lore.kernel.org/all/4ab1a7bdd0174ab09c7b0d68cdbff9a4@huawei.com/
Reported-by: Changbin Du <changbin.du@huawei.com>
Tested-by: Changbin Du <changbin.du@huawei.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-18 07:43:22 -04:00
James Bottomley
bd07bd12f2 bpf: Fix key serial argument of bpf_lookup_user_key()
The underlying lookup_user_key() function uses a signed 32 bit integer
for key serial numbers because legitimate serial numbers are positive
(and > 3) and keyrings are negative.  Using a u32 for the keyring in
the bpf function doesn't currently cause any conversion problems but
will start to trip the signed to unsigned conversion warnings when the
kernel enables them, so convert the argument to signed (and update the
tests accordingly) before it acquires more users.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com>
Link: https://lore.kernel.org/r/84cdb0775254d297d75e21f577089f64abdfbd28.camel@HansenPartnership.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-17 18:15:27 -07:00
Steven Rostedt
8a157d8a00 tracing: Do not free "head" on error path of filter_free_subsystem_filters()
The variable "head" is allocated and initialized as a list before
allocating the first "item" for the list. If the allocation of "item"
fails, it frees "head" and then jumps to the label "free_now" which will
process head and free it.

This will cause a UAF of "head", and it doesn't need to free it before
jumping to the "free_now" label as that code will free it.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250610093348.33c5643a@gandalf.local.home
Fixes: a9d0aab5eb ("tracing: Fix regression of filter waiting a long time on RCU synchronization")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202506070424.lCiNreTI-lkp@intel.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-10 09:39:58 -04:00
Linus Torvalds
538c429a4b tracing fixes:
- Fix regression of waiting a long time on updating trace event filters
 
   When the faultable trace points were added, it needed task trace RCU
   synchronization. This was added to the tracepoint_synchronize_unregister()
   function. The filter logic always called this function whenever it
   updated the trace event filters before freeing the old filters.
   This increased the time of "trace-cmd record" from taking 13 seconds
   to running over 2 minutes to complete.
 
   Move the freeing of the filters to call_rcu*() logic, which brings the
   time back down to 13 seconds.
 
 - Fix ring_buffer_subbuf_order_set() error path lock protection
 
   The error path of the ring_buffer_subbuf_order_set() released the
   mutex too early and allowed subsequent accesses to setting the
   subbuffer size to corrupt the data and cause a bug.
 
   By moving the mutex locking to the end of the error path, it prevents
   the reentrant access to the critical data and also allows the function
   to convert the taking of the mutex over to the guard() logic.
 
 - Remove unused power management clock events
 
   The clock events were added in 2010 for power management. In 2011
   arm used them. In 2013 the code they were used in was removed.
   These events have been wasting memory since then.
 
 - Fix sparse warnings
 
   There was a few places that sparse warned about trace_events_filter.c
   where file->filter was referenced directly, but it is annotated with
   an __rcu tag. Use the helper functions and fix them up to use
   rcu_dereference() properly.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaEST0xQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qgdSAPoD7L17oeiP5KQkM0wPuPBz0tmJF7XE
 2VmHp1lBu5rYwgEAyHTD7SqWvInMMp9sGt5tzkByXpOsYC65/RprkbFpXwA=
 =s4wK
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull more tracing fixes from Steven Rostedt:

 - Fix regression of waiting a long time on updating trace event filters

   When the faultable trace points were added, it needed task trace RCU
   synchronization.

   This was added to the tracepoint_synchronize_unregister() function.
   The filter logic always called this function whenever it updated the
   trace event filters before freeing the old filters. This increased
   the time of "trace-cmd record" from taking 13 seconds to running over
   2 minutes to complete.

   Move the freeing of the filters to call_rcu*() logic, which brings
   the time back down to 13 seconds.

 - Fix ring_buffer_subbuf_order_set() error path lock protection

   The error path of the ring_buffer_subbuf_order_set() released the
   mutex too early and allowed subsequent accesses to setting the
   subbuffer size to corrupt the data and cause a bug.

   By moving the mutex locking to the end of the error path, it prevents
   the reentrant access to the critical data and also allows the
   function to convert the taking of the mutex over to the guard()
   logic.

 - Remove unused power management clock events

   The clock events were added in 2010 for power management. In 2011 arm
   used them. In 2013 the code they were used in was removed. These
   events have been wasting memory since then.

 - Fix sparse warnings

   There was a few places that sparse warned about trace_events_filter.c
   where file->filter was referenced directly, but it is annotated with
   an __rcu tag. Use the helper functions and fix them up to use
   rcu_dereference() properly.

* tag 'trace-v6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Add rcu annotation around file->filter accesses
  tracing: PM: Remove unused clock events
  ring-buffer: Fix buffer locking in ring_buffer_subbuf_order_set()
  tracing: Fix regression of filter waiting a long time on RCU synchronization
2025-06-08 08:19:01 -07:00
Steven Rostedt
549e914c96 tracing: Add rcu annotation around file->filter accesses
Running sparse on trace_events_filter.c triggered several warnings about
file->filter being accessed directly even though it's annotated with __rcu.

Add rcu_dereference() around it and shuffle the logic slightly so that
it's always referenced via accessor functions.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250607102821.6c7effbf@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-07 10:31:22 -04:00
Linus Torvalds
119b1e61a7 RISC-V Patches for the 6.16 Merge Window, Part 1
* Support for the FWFT SBI extension, which is part of SBI 3.0 and a
   dependency for many new SBI and ISA extensions.
 * Support for getrandom() in the VDSO.
 * Support for mseal.
 * Optimized routines for raid6 syndrome and recovery calculations.
 * kexec_file() supports loading Image-formatted kernel binaries.
 * Improvements to the instruction patching framework to allow for atomic
   instruction patching, along with rules as to how systems need to
   behave in order to function correctly.
 * Support for a handful of new ISA extensions: Svinval, Zicbop, Zabha,
   some SiFive vendor extensions.
 * Various fixes and cleanups, including: misaligned access handling, perf
   symbol mangling, module loading, PUD THPs, and improved uaccess
   routines.
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCAA3FiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmhDLP8ZHHBhbG1lcmRh
 YmJlbHRAZ29vZ2xlLmNvbQAKCRAuExnzX7sYiZhFD/4+Zikkld812VjFb9dTF+Wj
 n/x9h86zDwAEFgf2BMIpUQhHru6vtdkO2l/Ky6mQblTPMWLafF4eK85yCsf84sQ0
 +RX4sOMLZ0+qvqxKX+aOFe9JXOWB0QIQuPvgBfDDOV4UTm60sglIxwqOpKcsBEHs
 2nplXXjiv0ckaMFLos8xlwu1uy4A/jMfT3Y9FDcABxYCqBoKOZ1frcL9ezJZbHbv
 BoOKLDH8ZypFxIG/eQ511lIXXtrnLas0l4jHWjrfsWu6pmXTgJasKtbGuH3LoLnM
 G/4qvHufR6lpVUOIL5L0V6PpsmYwDi/ciFIFlc8NH2oOZil3qiVaGSEbJIkWGFu9
 8lWTXQWnbinZbfg2oYbWp8GlwI70vKomtDyYNyB9q9Cq9jyiTChMklRNODr4764j
 ZiEnzc/l4KyvaxUg8RLKCT595lKECiUDnMytbIbunJu05HBqRCoGpBtMVzlQsyUd
 ybkRt3BA7eOR8/xFA7ZZQeJofmiu2yxkBs5ggMo8UnSragw27hmv/OA0mWMXEuaD
 aaWc4ZKpKqf7qLchLHOvEl5ORUhsisyIJgZwOqdme5rQoWorVtr51faA4AKwFAN4
 vcKgc5qJjK8vnpW+rl3LNJF9LtH+h4TgmUI853vUlukPoH2oqRkeKVGSkxG0iAze
 eQy2VjP1fJz6ciRtJZn9aw==
 =cZGy
 -----END PGP SIGNATURE-----

Merge tag 'riscv-for-linus-6.16-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V updates from Palmer Dabbelt:

 - Support for the FWFT SBI extension, which is part of SBI 3.0 and a
   dependency for many new SBI and ISA extensions

 - Support for getrandom() in the VDSO

 - Support for mseal

 - Optimized routines for raid6 syndrome and recovery calculations

 - kexec_file() supports loading Image-formatted kernel binaries

 - Improvements to the instruction patching framework to allow for
   atomic instruction patching, along with rules as to how systems need
   to behave in order to function correctly

 - Support for a handful of new ISA extensions: Svinval, Zicbop, Zabha,
   some SiFive vendor extensions

 - Various fixes and cleanups, including: misaligned access handling,
   perf symbol mangling, module loading, PUD THPs, and improved uaccess
   routines

* tag 'riscv-for-linus-6.16-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (69 commits)
  riscv: uaccess: Only restore the CSR_STATUS SUM bit
  RISC-V: vDSO: Wire up getrandom() vDSO implementation
  riscv: enable mseal sysmap for RV64
  raid6: Add RISC-V SIMD syndrome and recovery calculations
  riscv: mm: Add support for Svinval extension
  RISC-V: Documentation: Add enough title underlines to CMODX
  riscv: Improve Kconfig help for RISCV_ISA_V_PREEMPTIVE
  MAINTAINERS: Update Atish's email address
  riscv: uaccess: do not do misaligned accesses in get/put_user()
  riscv: process: use unsigned int instead of unsigned long for put_user()
  riscv: make unsafe user copy routines use existing assembly routines
  riscv: hwprobe: export Zabha extension
  riscv: Make regs_irqs_disabled() more clear
  perf symbols: Ignore mapping symbols on riscv
  RISC-V: Kconfig: Fix help text of CMDLINE_EXTEND
  riscv: module: Optimize PLT/GOT entry counting
  riscv: Add support for PUD THP
  riscv: xchg: Prefetch the destination word for sc.w
  riscv: Add ARCH_HAS_PREFETCH[W] support with Zicbop
  riscv: Add support for Zicbop
  ...
2025-06-06 18:05:18 -07:00
Dmitry Antipov
40ee2afafc ring-buffer: Fix buffer locking in ring_buffer_subbuf_order_set()
Enlarge the critical section in ring_buffer_subbuf_order_set() to
ensure that error handling takes place with per-buffer mutex held,
thus preventing list corruption and other concurrency-related issues.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Link: https://lore.kernel.org/20250606112242.1510605-1-dmantipov@yandex.ru
Reported-by: syzbot+05d673e83ec640f0ced9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=05d673e83ec640f0ced9
Fixes: f9b94daa54 ("ring-buffer: Set new size of the ring buffer sub page")
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-06 20:25:55 -04:00
Steven Rostedt
a9d0aab5eb tracing: Fix regression of filter waiting a long time on RCU synchronization
When faultable trace events were added, a trace event may no longer use
normal RCU to synchronize but instead used synchronize_rcu_tasks_trace().
This synchronization takes a much longer time to synchronize.

The filter logic would free the filters by calling
tracepoint_synchronize_unregister() after it unhooked the filter strings
and before freeing them. With this function now calling
synchronize_rcu_tasks_trace() this increased the time to free a filter
tremendously. On a PREEMPT_RT system, it was even more noticeable.

 # time trace-cmd record -p function sleep 1
 [..]
 real	2m29.052s
 user	0m0.244s
 sys	0m20.136s

As trace-cmd would clear out all the filters before recording, it could
take up to 2 minutes to do a recording of "sleep 1".

To find out where the issues was:

 ~# trace-cmd sqlhist -e -n sched_stack  select start.prev_state as state, end.next_comm as comm, TIMESTAMP_DELTA_USECS as delta,  start.STACKTRACE as stack from sched_switch as start join sched_switch as end on start.prev_pid = end.next_pid

Which will produce the following commands (and -e will also execute them):

 echo 's:sched_stack s64 state; char comm[16]; u64 delta; unsigned long stack[];' >> /sys/kernel/tracing/dynamic_events
 echo 'hist:keys=prev_pid:__arg_18057_2=prev_state,__arg_18057_4=common_timestamp.usecs,__arg_18057_7=common_stacktrace' >> /sys/kernel/tracing/events/sched/sched_switch/trigger
 echo 'hist:keys=next_pid:__state_18057_1=$__arg_18057_2,__comm_18057_3=next_comm,__delta_18057_5=common_timestamp.usecs-$__arg_18057_4,__stack_18057_6=$__arg_18057_7:onmatch(sched.sched_switch).trace(sched_stack,$__state_18057_1,$__comm_18057_3,$__delta_18057_5,$__stack_18057_6)' >> /sys/kernel/tracing/events/sched/sched_switch/trigger

The above creates a synthetic event that creates a stack trace when a task
schedules out and records it with the time it scheduled back in. Basically
the time a task is off the CPU. It also records the state of the task when
it left the CPU (running, blocked, sleeping, etc). It also saves the comm
of the task as "comm" (needed for the next command).

~# echo 'hist:keys=state,stack.stacktrace:vals=delta:sort=state,delta if comm == "trace-cmd" &&  state & 3' > /sys/kernel/tracing/events/synthetic/sched_stack/trigger

The above creates a histogram with buckets per state, per stack, and the
value of the total time it was off the CPU for that stack trace. It filters
on tasks with "comm == trace-cmd" and only the sleeping and blocked states
(1 - sleeping, 2 - blocked).

~# trace-cmd record -p function sleep 1

~# cat /sys/kernel/tracing/events/synthetic/sched_stack/hist | tail -18
{ state:          2, stack.stacktrace         __schedule+0x1545/0x3700
         schedule+0xe2/0x390
         schedule_timeout+0x175/0x200
         wait_for_completion_state+0x294/0x440
         __wait_rcu_gp+0x247/0x4f0
         synchronize_rcu_tasks_generic+0x151/0x230
         apply_subsystem_event_filter+0xa2b/0x1300
         subsystem_filter_write+0x67/0xc0
         vfs_write+0x1e2/0xeb0
         ksys_write+0xff/0x1d0
         do_syscall_64+0x7b/0x420
         entry_SYSCALL_64_after_hwframe+0x76/0x7e
} hitcount:        237  delta:   99756288  <<--------------- Delta is 99 seconds!

Totals:
    Hits: 525
    Entries: 21
    Dropped: 0

This shows that this particular trace waited for 99 seconds on
synchronize_rcu_tasks() in apply_subsystem_event_filter().

In fact, there's a lot of places in the filter code that spends a lot of
time waiting for synchronize_rcu_tasks_trace() in order to free the
filters.

Add helper functions that will use call_rcu*() variants to asynchronously
free the filters. This brings the timings back to normal:

 # time trace-cmd record -p function sleep 1
 [..]
 real	0m14.681s
 user	0m0.335s
 sys	0m28.616s

And the histogram also shows this:

~# cat /sys/kernel/tracing/events/synthetic/sched_stack/hist | tail -21
{ state:          2, stack.stacktrace         __schedule+0x1545/0x3700
         schedule+0xe2/0x390
         schedule_timeout+0x175/0x200
         wait_for_completion_state+0x294/0x440
         __wait_rcu_gp+0x247/0x4f0
         synchronize_rcu_normal+0x3db/0x5c0
         tracing_reset_online_cpus+0x8f/0x1e0
         tracing_open+0x335/0x440
         do_dentry_open+0x4c6/0x17a0
         vfs_open+0x82/0x360
         path_openat+0x1a36/0x2990
         do_filp_open+0x1c5/0x420
         do_sys_openat2+0xed/0x180
         __x64_sys_openat+0x108/0x1d0
         do_syscall_64+0x7b/0x420
} hitcount:          2  delta:      77044

Totals:
    Hits: 55
    Entries: 28
    Dropped: 0

Where the total waiting time of synchronize_rcu_tasks_trace() is 77
milliseconds.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Andreas Ziegler <ziegler.andreas@siemens.com>
Cc: Felix MOESSBAUER <felix.moessbauer@siemens.com>
Link: https://lore.kernel.org/20250606201936.1e3d09a9@batman.local.home
Reported-by: "Flot, Julien" <julien.flot@siemens.com>
Tested-by: Julien Flot <julien.flot@siemens.com>
Fixes: a363d27cdb ("tracing: Allow system call tracepoints to handle page faults")
Closes: https://lore.kernel.org/all/240017f656631c7dd4017aa93d91f41f653788ea.camel@siemens.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-06 20:25:55 -04:00
Palmer Dabbelt
2670a39b1e
Merge tag 'riscv-mw2-6.16-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/alexghiti/linux into for-next
riscv patches for 6.16-rc1, part 2

* Performance improvements
  - Add support for vdso getrandom
  - Implement raid6 calculations using vectors
  - Introduce svinval tlb invalidation

* Cleanup
  - A bunch of deduplication of the macros we use for manipulating instructions

* Misc
  - Introduce a kunit test for kprobes
  - Add support for mseal as riscv fits the requirements (thanks to Lorenzo for making sure of that :))

[Palmer: There was a rebase between part 1 and part 2, so I've had to do
some more git surgery here... at least two rounds of surgery...]

* alex-pr-2: (866 commits)
  RISC-V: vDSO: Wire up getrandom() vDSO implementation
  riscv: enable mseal sysmap for RV64
  raid6: Add RISC-V SIMD syndrome and recovery calculations
  riscv: mm: Add support for Svinval extension
  riscv: Add kprobes KUnit test
  riscv: kprobes: Remove duplication of RV_EXTRACT_ITYPE_IMM
  riscv: kprobes: Remove duplication of RV_EXTRACT_UTYPE_IMM
  riscv: kprobes: Remove duplication of RV_EXTRACT_RD_REG
  riscv: kprobes: Remove duplication of RVC_EXTRACT_BTYPE_IMM
  riscv: kprobes: Remove duplication of RVC_EXTRACT_C2_RS1_REG
  riscv: kproves: Remove duplication of RVC_EXTRACT_JTYPE_IMM
  riscv: kprobes: Remove duplication of RV_EXTRACT_BTYPE_IMM
  riscv: kprobes: Remove duplication of RV_EXTRACT_RS1_REG
  riscv: kprobes: Remove duplication of RV_EXTRACT_JTYPE_IMM
  riscv: kprobes: Move branch_funct3 to insn.h
  riscv: kprobes: Move branch_rs2_idx to insn.h
  Linux 6.15-rc6
  Input: xpad - fix xpad_device sorting
  Input: xpad - add support for several more controllers
  Input: xpad - fix Share button on Xbox One controllers
  ...
2025-06-05 14:03:16 -07:00
Andy Chiu
500e626c4a
kernel: ftrace: export ftrace_sync_ipi
The following ftrace patch for riscv uses a data store to update ftrace
function. Therefore, a romote fence is required to order it against
function_trace_op updates. The mechanism is similar to the fence between
function_trace_op and update_ftrace_func in the generic ftrace, so we
leverage the same ftrace_sync_ipi function.

[ alex: Fix build warning when !CONFIG_DYNAMIC_FTRACE ]

Signed-off-by: Andy Chiu <andybnac@gmail.com>
Link: https://lore.kernel.org/r/20250407180838.42877-4-andybnac@gmail.com
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05 11:09:24 -07:00
Ye Bin
5834a59738 ftrace: Don't allocate ftrace module map if ftrace is disabled
If ftrace is disabled, it is meaningless to allocate a module map.
Add a check in allocate_ftrace_mod_map() to not allocate if ftrace is
disabled.

Link: https://lore.kernel.org/20250529111955.2349189-3-yebin@huaweicloud.com
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-02 13:12:26 -04:00
Ye Bin
f914b52c37 ftrace: Fix UAF when lookup kallsym after ftrace disabled
The following issue happens with a buggy module:

BUG: unable to handle page fault for address: ffffffffc05d0218
PGD 1bd66f067 P4D 1bd66f067 PUD 1bd671067 PMD 101808067 PTE 0
Oops: Oops: 0000 [#1] SMP KASAN PTI
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
RIP: 0010:sized_strscpy+0x81/0x2f0
RSP: 0018:ffff88812d76fa08 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffffc0601010 RCX: dffffc0000000000
RDX: 0000000000000038 RSI: dffffc0000000000 RDI: ffff88812608da2d
RBP: 8080808080808080 R08: ffff88812608da2d R09: ffff88812608da68
R10: ffff88812608d82d R11: ffff88812608d810 R12: 0000000000000038
R13: ffff88812608da2d R14: ffffffffc05d0218 R15: fefefefefefefeff
FS:  00007fef552de740(0000) GS:ffff8884251c7000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffc05d0218 CR3: 00000001146f0000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ftrace_mod_get_kallsym+0x1ac/0x590
 update_iter_mod+0x239/0x5b0
 s_next+0x5b/0xa0
 seq_read_iter+0x8c9/0x1070
 seq_read+0x249/0x3b0
 proc_reg_read+0x1b0/0x280
 vfs_read+0x17f/0x920
 ksys_read+0xf3/0x1c0
 do_syscall_64+0x5f/0x2e0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

The above issue may happen as follows:
(1) Add kprobe tracepoint;
(2) insmod test.ko;
(3)  Module triggers ftrace disabled;
(4) rmmod test.ko;
(5) cat /proc/kallsyms; --> Will trigger UAF as test.ko already removed;
ftrace_mod_get_kallsym()
...
strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN);
...

The problem is when a module triggers an issue with ftrace and
sets ftrace_disable. The ftrace_disable is set when an anomaly is
discovered and to prevent any more damage, ftrace stops all text
modification. The issue that happened was that the ftrace_disable stops
more than just the text modification.

When a module is loaded, its init functions can also be traced. Because
kallsyms deletes the init functions after a module has loaded, ftrace
saves them when the module is loaded and function tracing is enabled. This
allows the output of the function trace to show the init function names
instead of just their raw memory addresses.

When a module is removed, ftrace_release_mod() is called, and if
ftrace_disable is set, it just returns without doing anything more. The
problem here is that it leaves the mod_list still around and if kallsyms
is called, it will call into this code and access the module memory that
has already been freed as it will return:

  strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN);

Where the "mod" no longer exists and triggers a UAF bug.

Link: https://lore.kernel.org/all/20250523135452.626d8dcd@gandalf.local.home/

Cc: stable@vger.kernel.org
Fixes: aba4b5c22c ("ftrace: Save module init functions kallsyms symbols for tracing")
Link: https://lore.kernel.org/20250529111955.2349189-2-yebin@huaweicloud.com
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-02 13:09:48 -04:00
Linus Torvalds
8bf722c684 ring-buffer changes for v6.16:
- Allow the persistent ring buffer to be memory mapped
 
   In the last merge window there was issues with the implementation of
   mapping the persistent ring buffer because it was assumed that the
   persistent memory was just physical memory without being part of the
   kernel virtual address space. But this was incorrect and the persistent
   ring buffer can be mapped the same way as the allocated ring buffer is
   mapped.
 
   The meta data for the persistent ring buffer is different than the normal
   ring buffer and the organization of mapping it to user space is a little
   different. Make the updates needed to the meta data to allow the
   persistent ring buffer to be mapped to user space.
 
 - Fix cpus_read_lock() with buffer->mutex and cpu_buffer->mapping_lock
 
   Mapping the ring buffer to user space uses the cpu_buffer->mapping_lock.
   The buffer->mutex can be taken when the mapping_lock is held, giving the
   locking order of: cpu_buffer->mapping_lock -->> buffer->mutex. But there
   also exists the ordering:
 
       buffer->mutex -->> cpus_read_lock()
       mm->mmap_lock -->> cpu_buffer->mapping_lock
       cpus_read_lock() -->> mm->mmap_lock
 
   causing a circular chain of:
 
       cpu_buffer->mapping_lock -> buffer->mutex -->> cpus_read_lock() -->>
          mm->mmap_lock -->> cpu_buffer->mapping_lock
 
   By moving the cpus_read_lock() outside the buffer->mutex where:
   cpus_read_lock() -->> buffer->mutex, breaks the deadlock chain.
 
 - Do not trigger WARN_ON() for commit overrun
 
   When the ring buffer is user space mapped and there's a "commit overrun"
   (where an interrupt preempted an event, and then added so many events it
    filled the buffer having to drop events when it hit the preempted event)
   a WARN_ON() was triggered if this was read via a memory mapped buffer.
   This is due to "missed events" being non zero when the reader page ended
   up with the commit page. The idea was, if the writer is on the reader page,
   there's only one page that has been written to and there should be no
   missed events. But if a commit overrun is done where the writer is off the
   commit page and looped around to the commit page causing missed events, it
   is possible that the reader page is the commit page with missed events.
 
   Instead of triggering a WARN_ON() when the reader page is the commit page
   with missed events, trigger it when the reader page is the tail_page with
   missed events. That's because the writer is always on the tail_page if
   an event was interrupted (which holds the commit event) and continues off
   the commit page.
 
 - Reset the persistent buffer if it is fully consumed
 
   On boot up, if the user fully consumes the last boot buffer of the
   persistent buffer, if it reboots without enabling it, there will still be
   events in the buffer which can cause confusion. Instead, reset the buffer
   when it is fully consumed, so that the data is not read again.
 
 - Clean up some goto out jumps
 
   There's a few cases that the code jumps to the "out:" label that simply
   returns a value. There used to be more work done at those labels but now
   that they simply return a value use a return instead of jumping to a
   label.
 
 - Use guard() to simplify some of the code
 
   Add guard() around some locking instead of jumping to a label to do the
   unlocking.
 
 - Use free() to simplify some of the code
 
   Use free(kfree) on variables that will get freed on error and use
   return_ptr() to return the variable when its not freed. There's one
   instance where free(kfree) simplifies the code on a temp variable that was
   allocated just for the function use.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaDjJMxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qkDzAP468AZOnjIxezfzYEmtcDl8ZUgf2U3I
 XtXjn7aKH/gZiwD/dCCZX2IY2gddqAb6s9Bo4/AWgtYbjacLPL+pWYbTJwQ=
 =DOfF
 -----END PGP SIGNATURE-----

Merge tag 'trace-ringbuffer-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ring-buffer updates from Steven Rostedt:

 - Allow the persistent ring buffer to be memory mapped

   In the last merge window there was issues with the implementation of
   mapping the persistent ring buffer because it was assumed that the
   persistent memory was just physical memory without being part of the
   kernel virtual address space. But this was incorrect and the
   persistent ring buffer can be mapped the same way as the allocated
   ring buffer is mapped.

   The metadata for the persistent ring buffer is different than the
   normal ring buffer and the organization of mapping it to user space
   is a little different. Make the updates needed to the meta data to
   allow the persistent ring buffer to be mapped to user space.

 - Fix cpus_read_lock() with buffer->mutex and cpu_buffer->mapping_lock

   Mapping the ring buffer to user space uses the
   cpu_buffer->mapping_lock. The buffer->mutex can be taken when the
   mapping_lock is held, giving the locking order of:
   cpu_buffer->mapping_lock -->> buffer->mutex. But there also exists
   the ordering:

       buffer->mutex -->> cpus_read_lock()
       mm->mmap_lock -->> cpu_buffer->mapping_lock
       cpus_read_lock() -->> mm->mmap_lock

   causing a circular chain of:

       cpu_buffer->mapping_lock -> buffer->mutex -->> cpus_read_lock() -->>
          mm->mmap_lock -->> cpu_buffer->mapping_lock

   By moving the cpus_read_lock() outside the buffer->mutex where:
   cpus_read_lock() -->> buffer->mutex, breaks the deadlock chain.

 - Do not trigger WARN_ON() for commit overrun

   When the ring buffer is user space mapped and there's a "commit
   overrun" (where an interrupt preempted an event, and then added so
   many events it filled the buffer having to drop events when it hit
   the preempted event) a WARN_ON() was triggered if this was read via a
   memory mapped buffer.

   This is due to "missed events" being non zero when the reader page
   ended up with the commit page. The idea was, if the writer is on the
   reader page, there's only one page that has been written to and there
   should be no missed events.

   But if a commit overrun is done where the writer is off the commit
   page and looped around to the commit page causing missed events, it
   is possible that the reader page is the commit page with missed
   events.

   Instead of triggering a WARN_ON() when the reader page is the commit
   page with missed events, trigger it when the reader page is the
   tail_page with missed events. That's because the writer is always on
   the tail_page if an event was interrupted (which holds the commit
   event) and continues off the commit page.

 - Reset the persistent buffer if it is fully consumed

   On boot up, if the user fully consumes the last boot buffer of the
   persistent buffer, if it reboots without enabling it, there will
   still be events in the buffer which can cause confusion. Instead,
   reset the buffer when it is fully consumed, so that the data is not
   read again.

 - Clean up some goto out jumps

   There's a few cases that the code jumps to the "out:" label that
   simply returns a value. There used to be more work done at those
   labels but now that they simply return a value use a return instead
   of jumping to a label.

 - Use guard() to simplify some of the code

   Add guard() around some locking instead of jumping to a label to do
   the unlocking.

 - Use free() to simplify some of the code

   Use free(kfree) on variables that will get freed on error and use
   return_ptr() to return the variable when its not freed. There's one
   instance where free(kfree) simplifies the code on a temp variable
   that was allocated just for the function use.

* tag 'trace-ringbuffer-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Simplify functions with __free(kfree) to free allocations
  ring-buffer: Make ring_buffer_{un}map() simpler with guard(mutex)
  ring-buffer: Simplify ring_buffer_read_page() with guard()
  ring-buffer: Simplify reset_disabled_cpu_buffer() with use of guard()
  ring-buffer: Remove jump to out label in ring_buffer_swap_cpu()
  ring-buffer: Removed unnecessary if() goto out where out is the next line
  tracing: Reset last-boot buffers when reading out all cpu buffers
  ring-buffer: Allow reserve_mem persistent ring buffers to be mmapped
  ring-buffer: Do not trigger WARN_ON() due to a commit_overrun
  ring-buffer: Move cpus_read_lock() outside of buffer->mutex
2025-05-30 21:20:11 -07:00
Linus Torvalds
0f70f5b08a automount wart removal
Calling conventions of ->d_automount() made saner (flagday change)
 vfs_submount() is gone - its sole remaining user (trace_automount) had
 been switched to saner primitives.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaDoRWQAKCRBZ7Krx/gZQ
 6wxMAQCzuMc2GiGBMXzeK4SGA7d5rsK71unf+zczOd8NvbTImQEAs1Cu3u3bF3pq
 EmHQWFTKBpBf+RHsLSoDHwUA+9THowM=
 =GXLi
 -----END PGP SIGNATURE-----

Merge tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull automount updates from Al Viro:
 "Automount wart removal

  A bunch of odd boilerplate gone from instances - the reason for
  those was the need to protect the yet-to-be-attched mount from
  mark_mounts_for_expiry() deciding to take it out.

  But that's easy to detect and take care of in mark_mounts_for_expiry()
  itself; no need to have every instance simulate mount being busy by
  grabbing an extra reference to it, with finish_automount() undoing
  that once it attaches that mount.

  Should've done it that way from the very beginning... This is a
  flagday change, thankfully there are very few instances.

  vfs_submount() is gone - its sole remaining user (trace_automount)
  had been switched to saner primitives"

* tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kill vfs_submount()
  saner calling conventions for ->d_automount()
2025-05-30 15:38:29 -07:00
Linus Torvalds
b78f1293f9 tracing updates for v6.16:
- Have module addresses get updated in the persistent ring buffer
 
   The addresses of the modules from the previous boot are saved in the
   persistent ring buffer. If the same modules are loaded and an address is
   in the old buffer points to an address that was both saved in the
   persistent ring buffer and is loaded in memory, shift the address to point
   to the address that is loaded in memory in the trace event.
 
 - Print function names for irqs off and preempt off callsites
 
   When ignoring the print fmt of a trace event and just printing the fields
   directly, have the fields for preempt off and irqs off events still show
   the function name (via kallsyms) instead of just showing the raw address.
 
 - Clean ups of the histogram code
 
   The histogram functions saved over 800 bytes on the stack to process
   events as they come in. Instead, create per-cpu buffers that can hold this
   information and have a separate location for each context level (thread,
   softirq, IRQ and NMI).
 
   Also add some more comments to the code.
 
 - Add "common_comm" field for histograms
 
   Add "common_comm" that uses the current->comm as a field in an event
   histogram and acts like any of the other fields of the event.
 
 - Show "subops" in the enabled_functions file
 
   When the function graph infrastructure is used, a subsystem has a "subops"
   that it attaches its callback function to. Instead of the
   enabled_functions just showing a function calling the function that calls
   the subops functions, also show the subops functions that will get called
   for that function too.
 
 - Add "copy_trace_marker" option to instances
 
   There are cases where an instance is created for tooling to write into,
   but the old tooling has the top level instance hardcoded into the
   application. New tools want to consume the data from an instance and not
   the top level buffer. By adding a copy_trace_marker option, whenever the
   top instance trace_marker is written into, a copy of it is also written
   into the instance with this option set. This allows new tools to read what
   old tools are writing into the top buffer.
 
   If this option is cleared by the top instance, then what is written into
   the trace_marker is not written into the top instance. This is a way to
   redirect the trace_marker writes into another instance.
 
 - Have tracepoints created by DECLARE_TRACE() use trace_<name>_tp()
 
   If a tracepoint is created by DECLARE_TRACE() instead of TRACE_EVENT(),
   then it will not be exposed via tracefs. Currently there's no way to
   differentiate in the kernel the tracepoint functions between those that
   are exposed via tracefs or not. A calling convention has been made
   manually to append a "_tp" prefix for events created by DECLARE_TRACE().
   Instead of doing this manually, force it so that all DECLARE_TRACE()
   events have this notation.
 
 - Use __string() for task->comm in some sched events
 
   Instead of hardcoding the comm to be TASK_COMM_LEN in some of the
   scheduler events use __string() which makes it dynamic. Note, if these
   events are parsed by user space it they may break, and the event may have
   to be converted back to the hardcoded size.
 
 - Have function graph "depth" be unsigned to the user
 
   Internally to the kernel, the "depth" field of the function graph event is
   signed due to -1 being used for end of boundary. What actually gets
   recorded in the event itself is zero or positive. Reflect this to user
   space by showing "depth" as unsigned int and be consistent across all
   events.
 
 - Allow an arbitrary long CPU string to osnoise_cpus_write()
 
   The filtering of which CPUs to write to can exceed 256 bytes. If a machine
   has 256 CPUs, and the filter is to filter every other CPU, the write would
   take a string larger than 256 bytes. Instead of using a fixed size buffer
   on the stack that is 256 bytes, allocate it to handle what is passed in.
 
 - Stop having ftrace check the per-cpu data "disabled" flag
 
   The "disabled" flag in the data structure passed to most ftrace functions
   is checked to know if tracing has been disabled or not. This flag was
   added back in 2008 before the ring buffer had its own way to disable
   tracing. The "disable" flag is now not always set when needed, and the
   ring buffer flag should be used in all locations where the disabled is
   needed. Since the "disable" flag is redundant and incorrect, stop using it.
   Fix up some locations that use the "disable" flag to use the ring buffer
   info.
 
 - Use a new tracer_tracing_disable/enable() instead of data->disable flag
 
   There's a few cases that set the data->disable flag to stop tracing, but
   this flag is not consistently used. It is also an on/off switch where if a
   function set it and calls another function that sets it, the called
   function may incorrectly enable it.
 
   Use a new trace_tracing_disable() and tracer_tracing_enable() that uses a
   counter and can be nested. These use the ring buffer flags which are
   always checked making the disabling more consistent.
 
 - Save the trace clock in the persistent ring buffer
 
   Save what clock was used for tracing in the persistent ring buffer and set
   it back to that clock after a reboot.
 
 - Remove unused reference to a per CPU data pointer in mmiotrace functions
 
 - Remove unused buffer_page field from trace_array_cpu structure
 
 - Remove more strncpy() instances
 
 - Other minor clean ups and fixes
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaDhiqRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qkheAQDpyRHoXF1AIoEqyahDax8f3vpZQeCH
 B/mn+YJmU1wuVgEA7AFALov5SHKv4IzoARz68GXtR0jGhP5D8uebUhUqDAQ=
 =WmFG
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

 - Have module addresses get updated in the persistent ring buffer

   The addresses of the modules from the previous boot are saved in the
   persistent ring buffer. If the same modules are loaded and an address
   is in the old buffer points to an address that was both saved in the
   persistent ring buffer and is loaded in memory, shift the address to
   point to the address that is loaded in memory in the trace event.

 - Print function names for irqs off and preempt off callsites

   When ignoring the print fmt of a trace event and just printing the
   fields directly, have the fields for preempt off and irqs off events
   still show the function name (via kallsyms) instead of just showing
   the raw address.

 - Clean ups of the histogram code

   The histogram functions saved over 800 bytes on the stack to process
   events as they come in. Instead, create per-cpu buffers that can hold
   this information and have a separate location for each context level
   (thread, softirq, IRQ and NMI).

   Also add some more comments to the code.

 - Add "common_comm" field for histograms

   Add "common_comm" that uses the current->comm as a field in an event
   histogram and acts like any of the other fields of the event.

 - Show "subops" in the enabled_functions file

   When the function graph infrastructure is used, a subsystem has a
   "subops" that it attaches its callback function to. Instead of the
   enabled_functions just showing a function calling the function that
   calls the subops functions, also show the subops functions that will
   get called for that function too.

 - Add "copy_trace_marker" option to instances

   There are cases where an instance is created for tooling to write
   into, but the old tooling has the top level instance hardcoded into
   the application. New tools want to consume the data from an instance
   and not the top level buffer. By adding a copy_trace_marker option,
   whenever the top instance trace_marker is written into, a copy of it
   is also written into the instance with this option set. This allows
   new tools to read what old tools are writing into the top buffer.

   If this option is cleared by the top instance, then what is written
   into the trace_marker is not written into the top instance. This is a
   way to redirect the trace_marker writes into another instance.

 - Have tracepoints created by DECLARE_TRACE() use trace_<name>_tp()

   If a tracepoint is created by DECLARE_TRACE() instead of
   TRACE_EVENT(), then it will not be exposed via tracefs. Currently
   there's no way to differentiate in the kernel the tracepoint
   functions between those that are exposed via tracefs or not. A
   calling convention has been made manually to append a "_tp" prefix
   for events created by DECLARE_TRACE(). Instead of doing this
   manually, force it so that all DECLARE_TRACE() events have this
   notation.

 - Use __string() for task->comm in some sched events

   Instead of hardcoding the comm to be TASK_COMM_LEN in some of the
   scheduler events use __string() which makes it dynamic. Note, if
   these events are parsed by user space it they may break, and the
   event may have to be converted back to the hardcoded size.

 - Have function graph "depth" be unsigned to the user

   Internally to the kernel, the "depth" field of the function graph
   event is signed due to -1 being used for end of boundary. What
   actually gets recorded in the event itself is zero or positive.
   Reflect this to user space by showing "depth" as unsigned int and be
   consistent across all events.

 - Allow an arbitrary long CPU string to osnoise_cpus_write()

   The filtering of which CPUs to write to can exceed 256 bytes. If a
   machine has 256 CPUs, and the filter is to filter every other CPU,
   the write would take a string larger than 256 bytes. Instead of using
   a fixed size buffer on the stack that is 256 bytes, allocate it to
   handle what is passed in.

 - Stop having ftrace check the per-cpu data "disabled" flag

   The "disabled" flag in the data structure passed to most ftrace
   functions is checked to know if tracing has been disabled or not.
   This flag was added back in 2008 before the ring buffer had its own
   way to disable tracing. The "disable" flag is now not always set when
   needed, and the ring buffer flag should be used in all locations
   where the disabled is needed. Since the "disable" flag is redundant
   and incorrect, stop using it. Fix up some locations that use the
   "disable" flag to use the ring buffer info.

 - Use a new tracer_tracing_disable/enable() instead of data->disable
   flag

   There's a few cases that set the data->disable flag to stop tracing,
   but this flag is not consistently used. It is also an on/off switch
   where if a function set it and calls another function that sets it,
   the called function may incorrectly enable it.

   Use a new trace_tracing_disable() and tracer_tracing_enable() that
   uses a counter and can be nested. These use the ring buffer flags
   which are always checked making the disabling more consistent.

 - Save the trace clock in the persistent ring buffer

   Save what clock was used for tracing in the persistent ring buffer
   and set it back to that clock after a reboot.

 - Remove unused reference to a per CPU data pointer in mmiotrace
   functions

 - Remove unused buffer_page field from trace_array_cpu structure

 - Remove more strncpy() instances

 - Other minor clean ups and fixes

* tag 'trace-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (36 commits)
  tracing: Fix compilation warning on arm32
  tracing: Record trace_clock and recover when reboot
  tracing/sched: Use __string() instead of fixed lengths for task->comm
  tracepoint: Have tracepoints created with DECLARE_TRACE() have _tp suffix
  tracing: Cleanup upper_empty() in pid_list
  tracing: Allow the top level trace_marker to write into another instances
  tracing: Add a helper function to handle the dereference arg in verifier
  tracing: Remove unnecessary "goto out" that simply returns ret is trigger code
  tracing: Fix error handling in event_trigger_parse()
  tracing: Rename event_trigger_alloc() to trigger_data_alloc()
  tracing: Replace deprecated strncpy() with strscpy() for stack_trace_filter_buf
  tracing: Remove unused buffer_page field from trace_array_cpu structure
  tracing: Use atomic_inc_return() for updating "disabled" counter in irqsoff tracer
  tracing: Convert the per CPU "disabled" counter to local from atomic
  tracing: branch: Use trace_tracing_is_on_cpu() instead of "disabled" field
  ring-buffer: Add ring_buffer_record_is_on_cpu()
  tracing: Do not use per CPU array_buffer.data->disabled for cpumask
  ftrace: Do not disabled function graph based on "disabled" field
  tracing: kdb: Use tracer_tracing_on/off() instead of setting per CPU disabled
  tracing: Use tracer_tracing_disable() instead of "disabled" field for ftrace_dump_one()
  ...
2025-05-29 21:04:36 -07:00
Steven Rostedt
99d2328044 ring-buffer: Simplify functions with __free(kfree) to free allocations
The function rb_allocate_pages() allocates cpu_buffer and on error needs
to free it. It has a single return. Use __free(kfree) and return directly
on errors and have the return use return_ptr(cpu_buffer).

The function alloc_buffer() allocates buffer and on error needs to free
it. It has a single return. Use __free(kfree) and return directly on
errors and have the return use return_ptr(buffer).

The function __rb_map_vma() allocates a temporary array "pages". Have it
use __free() and not worry about freeing it when returning.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527143144.6edc4625@gandalf.local.home
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29 08:24:08 -04:00
Steven Rostedt
60bc720e10 ring-buffer: Make ring_buffer_{un}map() simpler with guard(mutex)
Convert the taking of the buffer->mutex and the cpu_buffer->mapping_lock
over to guard(mutex) and simplify the ring_buffer_map() and
ring_buffer_unmap() functions.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250527122009.267efb72@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29 08:24:08 -04:00
Steven Rostedt
b2e7c6ed26 ring-buffer: Simplify ring_buffer_read_page() with guard()
The function ring_buffer_read_page() had two gotos. One was simply
returning "ret" and the other was unlocking the reader_lock.

There's no reason to use goto to simply return the "ret" variable. Instead
just return the value.

The jump to the unlocking of the reader_lock can be replaced by
guard(raw_spinlock_irqsave)(&cpu_buffer->reader_lock).

With these two changes the "ret" variable is no longer used and can be
removed. The return value on non-error is what was read and is stored in
the "read" variable.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527145216.0187cf36@gandalf.local.home
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29 08:24:07 -04:00
Steven Rostedt
f0d8cbc8cc ring-buffer: Simplify reset_disabled_cpu_buffer() with use of guard()
Use guard(raw_spinlock_irqsave)() in reset_disabled_cpu_buffer() to
simplify the locking.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527144623.77a9cc47@gandalf.local.home
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29 08:24:07 -04:00
Steven Rostedt
f115d2b70b ring-buffer: Remove jump to out label in ring_buffer_swap_cpu()
The function ring_buffer_swap_cpu() has a bunch of jumps to the label out
that simply returns "ret". There's no reason to jump to a label that
simply returns a value. Just return directly from there.

This goes back to almost the beginning when commit 8aabee573d
("ring-buffer: remove unneeded get_online_cpus") was introduced. That
commit removed a put_online_cpus() from that label, but never updated all
the jumps to it that now no longer needed to do anything but return a
value.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527145753.6b45d840@gandalf.local.home
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29 08:24:07 -04:00
Steven Rostedt
2d22216521 ring-buffer: Removed unnecessary if() goto out where out is the next line
In the function ring_buffer_discard_commit() there's an if statement that
jumps to the next line:

	if (rb_try_to_discard(cpu_buffer, event))
		goto out;
 out:

This was caused by the change that modified the way timestamps were taken
in interrupt context, and removed the code between the if statement and
the goto, but failed to update the conditional logic.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527155116.227f35be@gandalf.local.home
Fixes: a389d86f7f ("ring-buffer: Have nested events still record running time stamp")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29 08:24:07 -04:00
Masami Hiramatsu (Google)
32dc004252 tracing: Reset last-boot buffers when reading out all cpu buffers
Reset the last-boot ring buffers when read() reads out all cpu
buffers through trace_pipe/trace_pipe_raw. This prevents ftrace to
unwind ring buffer read pointer next boot.

Note that this resets only when all per-cpu buffers are empty, and
read via read(2) syscall. For example, if you read only one of the
per-cpu trace_pipe, it does not reset it. Also, reading buffer by
splice(2) syscall does not reset because some data in the reader
(the last) page.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/174792929202.496143.8184644221859580999.stgit@mhiramat.tok.corp.google.com
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29 08:24:07 -04:00
Steven Rostedt
c2a0831142 ring-buffer: Allow reserve_mem persistent ring buffers to be mmapped
When the persistent ring buffer is created from the memory returned by
reserve_mem there is nothing prohibiting it to be memory mapped to user
space. The memory is the same as the pages allocated by alloc_page().

The way the memory is managed by the ring buffer code is slightly
different though and needs to be addressed.

The persistent memory uses the page->id for its own purpose where as the
user mmap buffer currently uses that for the subbuf array mapped to user
space. If the buffer is a persistent buffer, use the page index into that
buffer as the identifier instead of the page->id.

That is, the page->id for a persistent buffer, represents the order of the
buffer is in the link list. ->id == 0 means it is the reader page.
When a reader page is swapped, the new reader page's ->id gets zero, and
the old reader page gets the ->id of the page that it swapped with.

The user space mapping has the ->id is the index of where it was mapped in
user space and does not change while it is mapped.

Since the persistent buffer is fixed in its location, the index of where
a page is in the memory range can be used as the "id" to put in the meta
page array, and it can be mapped in the same order to user space as it is
in the persistent memory.

A new rb_page_id() helper function is used to get and set the id depending
on if the page is a normal memory allocated buffer or a physical memory
mapped buffer.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/20250401203332.246646011@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29 08:24:07 -04:00
Steven Rostedt
4fc78a7c9c ring-buffer: Do not trigger WARN_ON() due to a commit_overrun
When reading a memory mapped buffer the reader page is just swapped out
with the last page written in the write buffer. If the reader page is the
same as the commit buffer (the buffer that is currently being written to)
it was assumed that it should never have missed events. If it does, it
triggers a WARN_ON_ONCE().

But there just happens to be one scenario where this can legitimately
happen. That is on a commit_overrun. A commit overrun is when an interrupt
preempts an event being written to the buffer and then the interrupt adds
so many new events that it fills and wraps the buffer back to the commit.
Any new events would then be dropped and be reported as "missed_events".

In this case, the next page to read is the commit buffer and after the
swap of the reader page, the reader page will be the commit buffer, but
this time there will be missed events and this triggers the following
warning:

 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 1127 at kernel/trace/ring_buffer.c:7357 ring_buffer_map_get_reader+0x49a/0x780
 Modules linked in: kvm_intel kvm irqbypass
 CPU: 2 UID: 0 PID: 1127 Comm: trace-cmd Not tainted 6.15.0-rc7-test-00004-g478bc2824b45-dirty #564 PREEMPT
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 RIP: 0010:ring_buffer_map_get_reader+0x49a/0x780
 Code: 00 00 00 48 89 fe 48 c1 ee 03 80 3c 2e 00 0f 85 ec 01 00 00 4d 3b a6 a8 00 00 00 0f 85 8a fd ff ff 48 85 c0 0f 84 55 fe ff ff <0f> 0b e9 4e fe ff ff be 08 00 00 00 4c 89 54 24 58 48 89 54 24 50
 RSP: 0018:ffff888121787dc0 EFLAGS: 00010002
 RAX: 00000000000006a2 RBX: ffff888100062800 RCX: ffffffff8190cb49
 RDX: ffff888126934c00 RSI: 1ffff11020200a15 RDI: ffff8881010050a8
 RBP: dffffc0000000000 R08: 0000000000000000 R09: ffffed1024d26982
 R10: ffff888126934c17 R11: ffff8881010050a8 R12: ffff888126934c00
 R13: ffff8881010050b8 R14: ffff888101005000 R15: ffff888126930008
 FS:  00007f95c8cd7540(0000) GS:ffff8882b576e000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f95c8de4dc0 CR3: 0000000128452002 CR4: 0000000000172ef0
 Call Trace:
  <TASK>
  ? __pfx_ring_buffer_map_get_reader+0x10/0x10
  tracing_buffers_ioctl+0x283/0x370
  __x64_sys_ioctl+0x134/0x190
  do_syscall_64+0x79/0x1c0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 RIP: 0033:0x7f95c8de48db
 Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
 RSP: 002b:00007ffe037ba110 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
 RAX: ffffffffffffffda RBX: 00007ffe037bb2b0 RCX: 00007f95c8de48db
 RDX: 0000000000000000 RSI: 0000000000005220 RDI: 0000000000000006
 RBP: 00007ffe037ba180 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007ffe037bb6f8 R14: 00007f95c9065000 R15: 00005575c7492c90
  </TASK>
 irq event stamp: 5080
 hardirqs last  enabled at (5079): [<ffffffff83e0adb0>] _raw_spin_unlock_irqrestore+0x50/0x70
 hardirqs last disabled at (5080): [<ffffffff83e0aa83>] _raw_spin_lock_irqsave+0x63/0x70
 softirqs last  enabled at (4182): [<ffffffff81516122>] handle_softirqs+0x552/0x710
 softirqs last disabled at (4159): [<ffffffff815163f7>] __irq_exit_rcu+0x107/0x210
 ---[ end trace 0000000000000000 ]---

The above was triggered by running on a kernel with both lockdep and KASAN
as well as kmemleak enabled and executing the following command:

 # perf record -o perf-test.dat -a -- trace-cmd record --nosplice  -e all -p function hackbench 50

With perf interjecting a lot of interrupts and trace-cmd enabling all
events as well as function tracing, with lockdep, KASAN and kmemleak
enabled, it could cause an interrupt preempting an event being written to
add enough events to wrap the buffer. trace-cmd was modified to have
--nosplice use mmap instead of reading the buffer.

The way to differentiate this case from the normal case of there only
being one page written to where the swap of the reader page received that
one page (which is the commit page), check if the tail page is on the
reader page. The difference between the commit page and the tail page is
that the tail page is where new writes go to, and the commit page holds
the first write that hasn't been committed yet. In the case of an
interrupt preempting the write of an event and filling the buffer, it
would move the tail page but not the commit page.

Have the warning only trigger if the tail page is also on the reader page,
and also print out the number of events dropped by a commit overrun as
that can not yet be safely added to the page so that the reader can see
there were events dropped.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250528121555.2066527e@gandalf.local.home
Fixes: fe832be05a ("ring-buffer: Have mmapped ring buffer keep track of missed events")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29 08:23:48 -04:00
Linus Torvalds
90b83efa67 bpf-next-6.16
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmg3NqgACgkQ6rmadz2v
 bTpNUQ/8DPeYtn3nskpsP2OwFy6O3hhfCe6gjOAmUVSk000xbG+AcI/h1DnGZWgk
 xlVcEs93ekzUzHd7k1+RJ2c5yDLXieLJAtb66rbFU1enkxs2cWlcWSKE6K/gaoh3
 G1BCARVlKwtrJhrVrsXtYP/eGZxKRSUZFK7xhtCk7lp7sRI3xkTLE+FJBcDkTJ6W
 HwF14i3zO+BkqNGdFwwlASCCqRItSNBBiM3KjW1DbETOTfAKlvCTrcgdUiODqxhF
 PNnULW+xmICABDFlKfDMlUAGNlSHKjiI3+g31LdblA5eyEhIqiCRgBGFYoCnsluk
 qUauRSie61KqC7fxN3qVpC3bXJfD1td7uIvoqSkDLtTv8a5+HAoiohzi1qBzCayl
 LAGkBYewAfDtdDDjNY38JLH2RCdyY6zG9DhqghPHdPlM7zj7L5zZgj34igEwesMM
 mfj9TuFFF99yfX5UUeSxKpDGR1eO4Ew0p7tg8CRs8Fqh6AIQSmboREZrsncVRCTS
 4SDHSI4KcO4LO2pEKzy+X4dewganN7aESnQG34iG0liyvDDwJOgUnDWLRwPLas7k
 3b/zIfBLxOJpA5R+0hhAMtjMA4NgyKJf4yFZwEieuasQjvzwTApi24YhZ/b3HSEB
 2Dp8kHEEbwezv0OFFz/fJ88dNQnrDmtJ+QByN/liA8kj4Yuh2+Q=
 =j3t8
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:

 - Fix and improve BTF deduplication of identical BTF types (Alan
   Maguire and Andrii Nakryiko)

 - Support up to 12 arguments in BPF trampoline on arm64 (Xu Kuohai and
   Alexis Lothoré)

 - Support load-acquire and store-release instructions in BPF JIT on
   riscv64 (Andrea Parri)

 - Fix uninitialized values in BPF_{CORE,PROBE}_READ macros (Anton
   Protopopov)

 - Streamline allowed helpers across program types (Feng Yang)

 - Support atomic update for hashtab of BPF maps (Hou Tao)

 - Implement json output for BPF helpers (Ihor Solodrai)

 - Several s390 JIT fixes (Ilya Leoshkevich)

 - Various sockmap fixes (Jiayuan Chen)

 - Support mmap of vmlinux BTF data (Lorenz Bauer)

 - Support BPF rbtree traversal and list peeking (Martin KaFai Lau)

 - Tests for sockmap/sockhash redirection (Michal Luczaj)

 - Introduce kfuncs for memory reads into dynptrs (Mykyta Yatsenko)

 - Add support for dma-buf iterators in BPF (T.J. Mercier)

 - The verifier support for __bpf_trap() (Yonghong Song)

* tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (135 commits)
  bpf, arm64: Remove unused-but-set function and variable.
  selftests/bpf: Add tests with stack ptr register in conditional jmp
  bpf: Do not include stack ptr register in precision backtracking bookkeeping
  selftests/bpf: enable many-args tests for arm64
  bpf, arm64: Support up to 12 function arguments
  bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem()
  bpf: Avoid __bpf_prog_ret0_warn when jit fails
  bpftool: Add support for custom BTF path in prog load/loadall
  selftests/bpf: Add unit tests with __bpf_trap() kfunc
  bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variable
  bpf: Remove special_kfunc_set from verifier
  selftests/bpf: Add test for open coded dmabuf_iter
  selftests/bpf: Add test for dmabuf_iter
  bpf: Add open coded dmabuf iterator
  bpf: Add dmabuf iterator
  dma-buf: Rename debugfs symbols
  bpf: Fix error return value in bpf_copy_from_user_dynptr
  libbpf: Use mmap to parse vmlinux BTF from sysfs
  selftests: bpf: Add a test for mmapable vmlinux BTF
  btf: Allow mmap of vmlinux btf
  ...
2025-05-28 15:52:42 -07:00
Pan Taixi
2fbdb6d8e0 tracing: Fix compilation warning on arm32
On arm32, size_t is defined to be unsigned int, while PAGE_SIZE is
unsigned long. This hence triggers a compilation warning as min()
asserts the type of two operands to be equal. Casting PAGE_SIZE to size_t
solves this issue and works on other target architectures as well.

Compilation warning details:

kernel/trace/trace.c: In function 'tracing_splice_read_pipe':
./include/linux/minmax.h:20:28: warning: comparison of distinct pointer types lacks a cast
  (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                            ^
./include/linux/minmax.h:26:4: note: in expansion of macro '__typecheck'
   (__typecheck(x, y) && __no_side_effects(x, y))
    ^~~~~~~~~~~

...

kernel/trace/trace.c:6771:8: note: in expansion of macro 'min'
        min((size_t)trace_seq_used(&iter->seq),
        ^~~

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250526013731.1198030-1-pantaixi@huaweicloud.com
Fixes: f5178c41bb ("tracing: Fix oob write in trace_seq_to_buffer()")
Reviewed-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Pan Taixi <pantaixi@huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-28 16:10:43 -04:00
Linus Torvalds
f1975e4765 Summary
* Move kern_table members out of kernel/sysctl.c
 
   Moved a subset (tracing, panic, signal, stack_tracer and sparc) out of the
   kern_table array. The goal is for kern_table to only have sysctl elements. All
   this increases modularity by placing the ctl_tables closer to where they are
   used while reducing the chances of merge conflicts in kernel/sysctl.c.
 
 * Fixed sysctl unit test panic by relocating it to selftests
 
 * Testing
 
   These have been in linux-next from rc2, so they have had more than a month
   worth of testing.
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmgwLsAACgkQupfNUreW
 QU9ghwv/VKZW+IXEvSjc8OiwntWkL7e5ddHY6O2Vf44MzhBefLTXmfx2HfkEA0Xw
 RaOQ28Hf/zQL83RqHHnXqI7JdGWQJUm8bCPwk4H3DCaF8qOfPVvblVYmfNL2auSY
 oyRRpRzZuY5EtKcrNjiHFHL2WIC8KvPVwS748oHY1eZY7kn1fcs8DDnNO4iuWop+
 uJeDxu87wkRCFXF3DIM+MAHRvxSa8GHtZvb9EjAl/EHMbAyVSz3uTb7FdQDdnE09
 s7P30EC03RHtgi3sd2Ku04dJsHLz7VErvpToxSH2KFlcdpJuWuCSCTT8XaD8kII8
 kYYCxNpmPOf4LzEy/J2vVZB0PSHrHvuQCH7iGy+8wOPk9GHTOMkKMMXVmeGnAsef
 AiosPYroxXp/nBFcuNs6/1LKpsdpFr2F6u6oMgbzLaW1Xe/oc+6oynuOgeVj9LuM
 FrSxSwaVvpdwHYHujYPQAAWIgKRzITiEXnCgtSyohFquKb+7E8ZspwjOqYH2xWMQ
 WwABNRqY
 =45X2
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl

Pull sysctl updates from Joel Granados:

 - Move kern_table members out of kernel/sysctl.c

   Moved a subset (tracing, panic, signal, stack_tracer and sparc) out
   of the kern_table array. The goal is for kern_table to only have
   sysctl elements. All this increases modularity by placing the
   ctl_tables closer to where they are used while reducing the chances
   of merge conflicts in kernel/sysctl.c.

 - Fixed sysctl unit test panic by relocating it to selftests

* tag 'sysctl-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
  sysctl: Close test ctl_headers with a for loop
  sysctl: call sysctl tests with a for loop
  sysctl: Add 0012 to test the u8 range check
  sysctl: move u8 register test to lib/test_sysctl.c
  sparc: mv sparc sysctls into their own file under arch/sparc/kernel
  stack_tracer: move sysctl registration to kernel/trace/trace_stack.c
  tracing: Move trace sysctls into trace.c
  signal: Move signal ctl tables into signal.c
  panic: Move panic ctl tables into panic.c
2025-05-27 20:43:35 -07:00
Steven Rostedt
c98cc9797b ring-buffer: Move cpus_read_lock() outside of buffer->mutex
Running a modified trace-cmd record --nosplice where it does a mmap of the
ring buffer when '--nosplice' is set, caused the following lockdep splat:

 ======================================================
 WARNING: possible circular locking dependency detected
 6.15.0-rc7-test-00002-gfb7d03d8a82f #551 Not tainted
 ------------------------------------------------------
 trace-cmd/1113 is trying to acquire lock:
 ffff888100062888 (&buffer->mutex){+.+.}-{4:4}, at: ring_buffer_map+0x11c/0xe70

 but task is already holding lock:
 ffff888100a5f9f8 (&cpu_buffer->mapping_lock){+.+.}-{4:4}, at: ring_buffer_map+0xcf/0xe70

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #5 (&cpu_buffer->mapping_lock){+.+.}-{4:4}:
        __mutex_lock+0x192/0x18c0
        ring_buffer_map+0xcf/0xe70
        tracing_buffers_mmap+0x1c4/0x3b0
        __mmap_region+0xd8d/0x1f70
        do_mmap+0x9d7/0x1010
        vm_mmap_pgoff+0x20b/0x390
        ksys_mmap_pgoff+0x2e9/0x440
        do_syscall_64+0x79/0x1c0
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 -> #4 (&mm->mmap_lock){++++}-{4:4}:
        __might_fault+0xa5/0x110
        _copy_to_user+0x22/0x80
        _perf_ioctl+0x61b/0x1b70
        perf_ioctl+0x62/0x90
        __x64_sys_ioctl+0x134/0x190
        do_syscall_64+0x79/0x1c0
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 -> #3 (&cpuctx_mutex){+.+.}-{4:4}:
        __mutex_lock+0x192/0x18c0
        perf_event_init_cpu+0x325/0x7c0
        perf_event_init+0x52a/0x5b0
        start_kernel+0x263/0x3e0
        x86_64_start_reservations+0x24/0x30
        x86_64_start_kernel+0x95/0xa0
        common_startup_64+0x13e/0x141

 -> #2 (pmus_lock){+.+.}-{4:4}:
        __mutex_lock+0x192/0x18c0
        perf_event_init_cpu+0xb7/0x7c0
        cpuhp_invoke_callback+0x2c0/0x1030
        __cpuhp_invoke_callback_range+0xbf/0x1f0
        _cpu_up+0x2e7/0x690
        cpu_up+0x117/0x170
        cpuhp_bringup_mask+0xd5/0x120
        bringup_nonboot_cpus+0x13d/0x170
        smp_init+0x2b/0xf0
        kernel_init_freeable+0x441/0x6d0
        kernel_init+0x1e/0x160
        ret_from_fork+0x34/0x70
        ret_from_fork_asm+0x1a/0x30

 -> #1 (cpu_hotplug_lock){++++}-{0:0}:
        cpus_read_lock+0x2a/0xd0
        ring_buffer_resize+0x610/0x14e0
        __tracing_resize_ring_buffer.part.0+0x42/0x120
        tracing_set_tracer+0x7bd/0xa80
        tracing_set_trace_write+0x132/0x1e0
        vfs_write+0x21c/0xe80
        ksys_write+0xf9/0x1c0
        do_syscall_64+0x79/0x1c0
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 -> #0 (&buffer->mutex){+.+.}-{4:4}:
        __lock_acquire+0x1405/0x2210
        lock_acquire+0x174/0x310
        __mutex_lock+0x192/0x18c0
        ring_buffer_map+0x11c/0xe70
        tracing_buffers_mmap+0x1c4/0x3b0
        __mmap_region+0xd8d/0x1f70
        do_mmap+0x9d7/0x1010
        vm_mmap_pgoff+0x20b/0x390
        ksys_mmap_pgoff+0x2e9/0x440
        do_syscall_64+0x79/0x1c0
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 other info that might help us debug this:

 Chain exists of:
   &buffer->mutex --> &mm->mmap_lock --> &cpu_buffer->mapping_lock

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&cpu_buffer->mapping_lock);
                                lock(&mm->mmap_lock);
                                lock(&cpu_buffer->mapping_lock);
   lock(&buffer->mutex);

  *** DEADLOCK ***

 2 locks held by trace-cmd/1113:
  #0: ffff888106b847e0 (&mm->mmap_lock){++++}-{4:4}, at: vm_mmap_pgoff+0x192/0x390
  #1: ffff888100a5f9f8 (&cpu_buffer->mapping_lock){+.+.}-{4:4}, at: ring_buffer_map+0xcf/0xe70

 stack backtrace:
 CPU: 5 UID: 0 PID: 1113 Comm: trace-cmd Not tainted 6.15.0-rc7-test-00002-gfb7d03d8a82f #551 PREEMPT
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x6e/0xa0
  print_circular_bug.cold+0x178/0x1be
  check_noncircular+0x146/0x160
  __lock_acquire+0x1405/0x2210
  lock_acquire+0x174/0x310
  ? ring_buffer_map+0x11c/0xe70
  ? ring_buffer_map+0x11c/0xe70
  ? __mutex_lock+0x169/0x18c0
  __mutex_lock+0x192/0x18c0
  ? ring_buffer_map+0x11c/0xe70
  ? ring_buffer_map+0x11c/0xe70
  ? function_trace_call+0x296/0x370
  ? __pfx___mutex_lock+0x10/0x10
  ? __pfx_function_trace_call+0x10/0x10
  ? __pfx___mutex_lock+0x10/0x10
  ? _raw_spin_unlock+0x2d/0x50
  ? ring_buffer_map+0x11c/0xe70
  ? ring_buffer_map+0x11c/0xe70
  ? __mutex_lock+0x5/0x18c0
  ring_buffer_map+0x11c/0xe70
  ? do_raw_spin_lock+0x12d/0x270
  ? find_held_lock+0x2b/0x80
  ? _raw_spin_unlock+0x2d/0x50
  ? rcu_is_watching+0x15/0xb0
  ? _raw_spin_unlock+0x2d/0x50
  ? trace_preempt_on+0xd0/0x110
  tracing_buffers_mmap+0x1c4/0x3b0
  __mmap_region+0xd8d/0x1f70
  ? ring_buffer_lock_reserve+0x99/0xff0
  ? __pfx___mmap_region+0x10/0x10
  ? ring_buffer_lock_reserve+0x99/0xff0
  ? __pfx_ring_buffer_lock_reserve+0x10/0x10
  ? __pfx_ring_buffer_lock_reserve+0x10/0x10
  ? bpf_lsm_mmap_addr+0x4/0x10
  ? security_mmap_addr+0x46/0xd0
  ? lock_is_held_type+0xd9/0x130
  do_mmap+0x9d7/0x1010
  ? 0xffffffffc0370095
  ? __pfx_do_mmap+0x10/0x10
  vm_mmap_pgoff+0x20b/0x390
  ? __pfx_vm_mmap_pgoff+0x10/0x10
  ? 0xffffffffc0370095
  ksys_mmap_pgoff+0x2e9/0x440
  do_syscall_64+0x79/0x1c0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 RIP: 0033:0x7fb0963a7de2
 Code: 00 00 00 0f 1f 44 00 00 41 f7 c1 ff 0f 00 00 75 27 55 89 cd 53 48 89 fb 48 85 ff 74 3b 41 89 ea 48 89 df b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 5b 5d c3 0f 1f 00 48 8b 05 e1 9f 0d 00 64
 RSP: 002b:00007ffdcc8fb878 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb0963a7de2
 RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000
 RBP: 0000000000000001 R08: 0000000000000006 R09: 0000000000000000
 R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007ffdcc8fbe68 R14: 00007fb096628000 R15: 00005633e01a5c90
  </TASK>

The issue is that cpus_read_lock() is taken within buffer->mutex. The
memory mapped pages are taken with the mmap_lock held. The buffer->mutex
is taken within the cpu_buffer->mapping_lock. There's quite a chain with
all these locks, where the deadlock can be fixed by moving the
cpus_read_lock() outside the taking of the buffer->mutex.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250527105820.0f45d045@gandalf.local.home
Fixes: 117c39200d ("ring-buffer: Introducing ring-buffer mapping functions")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-27 15:46:39 -04:00
Linus Torvalds
6f59de9bc0 for-6.16/block-20250523
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgwnGYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpq9aD/4iqOts77xhWWLrOJWkkhOcV5rREeyppq8X
 MKYul9S4cc4Uin9Xou9a+nab31QBQEk3nsN3kX9o3yAXvkh6yUm36HD8qYNW/46q
 IUkwRQQJ0COyTnexMZQNTbZPQDIYcenXmQxOcrEJ5jC1Jcz0sOKHsgekL+ab3kCy
 fLnuz2ozvjGDMala/NmE8fN5qSlj4qQABHgbamwlwfo4aWu07cwfqn5G/FCYJgDO
 xUvsnTVclom2g4G+7eSSvGQI1QyAxl5QpviPnj/TEgfFBFnhbCSoBTEY6ecqhlfW
 6u59MF/Uw8E+weiuGY4L87kDtBhjQs3UMSLxCuwH7MxXb25ff7qB4AIkcFD0kKFH
 3V5NtwqlU7aQT0xOjGxaHhfPwjLD+FVss4ARmuHS09/Kn8egOW9yROPyetnuH84R
 Oz0Ctnt1IPLFjvGeg3+rt9fjjS9jWOXLITb9Q6nX9gnCt7orCwIYke8YCpmnJyhn
 i+fV4CWYIQBBRKxIT0E/GhJxZOmL0JKpomnbpP2dH8npemnsTCuvtfdrK9gfhH2X
 chBVqCPY8MNU5zKfzdEiavPqcm9392lMzOoOXW2pSC1eAKqnAQ86ZT3r7rLntqE8
 75LxHcvaQIsnpyG+YuJVHvoiJ83TbqZNpyHwNaQTYhDmdYpp2d/wTtTQywX4DuXb
 Y6NDJw5+kQ==
 =1PNK
 -----END PGP SIGNATURE-----

Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - ublk updates:
      - Add support for updating the size of a ublk instance
      - Zero-copy improvements
      - Auto-registering of buffers for zero-copy
      - Series simplifying and improving GET_DATA and request lookup
      - Series adding quiesce support
      - Lots of selftests additions
      - Various cleanups

 - NVMe updates via Christoph:
      - add per-node DMA pools and use them for PRP/SGL allocations
        (Caleb Sander Mateos, Keith Busch)
      - nvme-fcloop refcounting fixes (Daniel Wagner)
      - support delayed removal of the multipath node and optionally
        support the multipath node for private namespaces (Nilay Shroff)
      - support shared CQs in the PCI endpoint target code (Wilfred
        Mallawa)
      - support admin-queue only authentication (Hannes Reinecke)
      - use the crc32c library instead of the crypto API (Eric Biggers)
      - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
        Reinecke, Leon Romanovsky, Gustavo A. R. Silva)

 - MD updates via Yu:
      - Fix that normal IO can be starved by sync IO, found by mkfs on
        newly created large raid5, with some clean up patches for bdev
        inflight counters

 - Clean up brd, getting rid of atomic kmaps and bvec poking

 - Add loop driver specifically for zoned IO testing

 - Eliminate blk-rq-qos calls with a static key, if not enabled

 - Improve hctx locking for when a plug has IO for multiple queues
   pending

 - Remove block layer bouncing support, which in turn means we can
   remove the per-node bounce stat as well

 - Improve blk-throttle support

 - Improve delay support for blk-throttle

 - Improve brd discard support

 - Unify IO scheduler switching. This should also fix a bunch of lockdep
   warnings we've been seeing, after enabling lockdep support for queue
   freezing/unfreezeing

 - Add support for block write streams via FDP (flexible data placement)
   on NVMe

 - Add a bunch of block helpers, facilitating the removal of a bunch of
   duplicated boilerplate code

 - Remove obsolete BLK_MQ pci and virtio Kconfig options

 - Add atomic/untorn write support to blktrace

 - Various little cleanups and fixes

* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
  selftests: ublk: add test for UBLK_F_QUIESCE
  ublk: add feature UBLK_F_QUIESCE
  selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
  traceevent/block: Add REQ_ATOMIC flag to block trace events
  ublk: run auto buf unregisgering in same io_ring_ctx with registering
  io_uring: add helper io_uring_cmd_ctx_handle()
  ublk: remove io argument from ublk_auto_buf_reg_fallback()
  ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
  selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
  selftests: ublk: support UBLK_F_AUTO_BUF_REG
  ublk: support UBLK_AUTO_BUF_REG_FALLBACK
  ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
  ublk: prepare for supporting to register request buffer automatically
  ublk: convert to refcount_t
  selftests: ublk: make IO & device removal test more stressful
  nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
  nvme: introduce multipath_always_on module param
  nvme-multipath: introduce delayed removal of the multipath head node
  nvme-pci: derive and better document max segments limits
  nvme-pci: use struct_size for allocation struct nvme_dev
  ...
2025-05-26 11:39:36 -07:00
Mykyta Yatsenko
079e5c56a5 bpf: Fix error return value in bpf_copy_from_user_dynptr
On error, copy_from_user returns number of bytes not copied to
destination, but current implementation of copy_user_data_sleepable does
not handle that correctly and returns it as error value, which may
confuse user, expecting meaningful negative error value.

Fixes: a498ee7576 ("bpf: Implement dynptr copy kfuncs")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250523181705.261585-1-mykyta.yatsenko5@gmail.com
2025-05-23 13:25:02 -07:00
Ritesh Harjani (IBM)
927244f6ef traceevent/block: Add REQ_ATOMIC flag to block trace events
Filesystems like XFS can implement atomic write I/O using either
REQ_ATOMIC flag set in the bio or via CoW operation. It will be useful
if we have a flag in trace events to distinguish between the two. This
patch adds char 'U' (Untorn writes) to rwbs field of the trace events
if REQ_ATOMIC flag is set in the bio.

<W/ REQ_ATOMIC>
=================
xfs_io-4238    [009] .....  4148.126843: block_rq_issue: 259,0 WFSU 16384 () 768 + 32 none,0,0 [xfs_io]
<idle>-0       [009] d.h1.  4148.129864: block_rq_complete: 259,0 WFSU () 768 + 32 none,0,0 [0]

<W/O REQ_ATOMIC>
===============
xfs_io-4237    [010] .....  4143.325616: block_rq_issue: 259,0 WS 16384 () 768 + 32 none,0,0 [xfs_io]
<idle>-0       [010] d.H1.  4143.329138: block_rq_complete: 259,0 WS () 768 + 32 none,0,0 [0]

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/44317cb2ec4588f6a2c1501a96684e6a1196e8ba.1747921498.git.ritesh.list@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-23 09:18:48 -06:00
Di Shen
4e2e6841ff bpf: Revert "bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach logic"
This reverts commit 4a8f635a60.

Althought get_pid_task() internally already calls rcu_read_lock() and
rcu_read_unlock(), the find_vpid() was not.

The documentation for find_vpid() clearly states:
"Must be called with the tasklist_lock or rcu_read_lock() held."

Add proper rcu_read_lock/unlock() to protect the find_vpid().

Fixes: 4a8f635a60 ("bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach logic")
Reported-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Di Shen <di.shen@unisoc.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20250520054943.5002-1-xuewen.yan@unisoc.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-22 07:49:32 -07:00
Linus Torvalds
c94d59a126 tracing fixes for 6.15:
- Fix sample code that uses trace_array_printk()
 
   The sample code for in kernel use of trace_array (that creates an instance
   for use within the kernel) and shows how to use trace_array_printk() that
   writes into the created instance, used trace_printk_init_buffers(). But
   that function is used to initialize normal trace_printk() and produces the
   NOTICE banner which is not needed for use of trace_array_printk(). The
   function to initialize that is trace_array_init_printk() that takes the
   created trace array instance as a parameter.
 
   Update the sample code to reflect the proper usage.
 
 - Fix preemption count output for stacktrace event
 
   The tracing buffer shows the preempt count level when an event executes.
   Because writing the event itself disables preemption, this needs to be
   accounted for when recording. The stacktrace event did not account for
   this so the output of the stacktrace event showed preemption was disabled
   while the event that triggered the stacktrace shows preemption is enabled
   and this leads to confusion. Account for preemption being disabled for the
   stacktrace event.
 
   The same happened for stack traces triggered by function tracer.
 
 - Fix persistent ring buffer when trace_pipe is used
 
   The ring buffer swaps the reader page with the next page to read from the
   write buffer when trace_pipe is used. If there's only a page of data in
   the ring buffer, this swap will cause the "commit" pointer (last data
   written) to be on the reader page. If more data is written to the buffer,
   it is added to the reader page until it falls off back into the write
   buffer.
 
   If the system reboots and the commit pointer is still on the reader page,
   even if new data was written, the persistent buffer validator will miss
   finding the commit pointer because it only checks the write buffer and
   does not check the reader page. This causes the validator to fail the
   validation and clear the buffer, where the new data is lost.
 
   There was a check for this, but it checked the "head pointer", which was
   incorrect, because the "head pointer" always stays on the write buffer and
   is the next page to swap out for the reader page. Fix the logic to catch
   this case and allow the user to still read the data after reboot.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaCTZHBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qu04AQDjOS46Y8d58MuwjLrQAotOUnANZADz
 7d+5snlcMjhqkAEAo+zc2z9LgqBAnv1VG3GEPgac0JmyPeOnqSJRWRpRXAM=
 =UveQ
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix sample code that uses trace_array_printk()

   The sample code for in kernel use of trace_array (that creates an
   instance for use within the kernel) and shows how to use
   trace_array_printk() that writes into the created instance, used
   trace_printk_init_buffers(). But that function is used to initialize
   normal trace_printk() and produces the NOTICE banner which is not
   needed for use of trace_array_printk(). The function to initialize
   that is trace_array_init_printk() that takes the created trace array
   instance as a parameter.

   Update the sample code to reflect the proper usage.

 - Fix preemption count output for stacktrace event

   The tracing buffer shows the preempt count level when an event
   executes. Because writing the event itself disables preemption, this
   needs to be accounted for when recording. The stacktrace event did
   not account for this so the output of the stacktrace event showed
   preemption was disabled while the event that triggered the stacktrace
   shows preemption is enabled and this leads to confusion. Account for
   preemption being disabled for the stacktrace event.

   The same happened for stack traces triggered by function tracer.

 - Fix persistent ring buffer when trace_pipe is used

   The ring buffer swaps the reader page with the next page to read from
   the write buffer when trace_pipe is used. If there's only a page of
   data in the ring buffer, this swap will cause the "commit" pointer
   (last data written) to be on the reader page. If more data is written
   to the buffer, it is added to the reader page until it falls off back
   into the write buffer.

   If the system reboots and the commit pointer is still on the reader
   page, even if new data was written, the persistent buffer validator
   will miss finding the commit pointer because it only checks the write
   buffer and does not check the reader page. This causes the validator
   to fail the validation and clear the buffer, where the new data is
   lost.

   There was a check for this, but it checked the "head pointer", which
   was incorrect, because the "head pointer" always stays on the write
   buffer and is the next page to swap out for the reader page. Fix the
   logic to catch this case and allow the user to still read the data
   after reboot.

* tag 'trace-v6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Fix persistent buffer when commit page is the reader page
  ftrace: Fix preemption accounting for stacktrace filter command
  ftrace: Fix preemption accounting for stacktrace trigger command
  tracing: samples: Initialize trace_array_printk() with the correct function
2025-05-14 11:24:19 -07:00
Steven Rostedt
1d6c39c89f ring-buffer: Fix persistent buffer when commit page is the reader page
The ring buffer is made up of sub buffers (sometimes called pages as they
are by default PAGE_SIZE). It has the following "pages":

  "tail page" - this is the page that the next write will write to
  "head page" - this is the page that the reader will swap the reader page with.
  "reader page" - This belongs to the reader, where it will swap the head
                  page from the ring buffer so that the reader does not
                  race with the writer.

The writer may end up on the "reader page" if the ring buffer hasn't
written more than one page, where the "tail page" and the "head page" are
the same.

The persistent ring buffer has meta data that points to where these pages
exist so on reboot it can re-create the pointers to the cpu_buffer
descriptor. But when the commit page is on the reader page, the logic is
incorrect.

The check to see if the commit page is on the reader page checked if the
head page was the reader page, which would never happen, as the head page
is always in the ring buffer. The correct check would be to test if the
commit page is on the reader page. If that's the case, then it can exit
out early as the commit page is only on the reader page when there's only
one page of data in the buffer. There's no reason to iterate the ring
buffer pages to find the "commit page" as it is already found.

To trigger this bug:

  # echo 1 > /sys/kernel/tracing/instances/boot_mapped/events/syscalls/sys_enter_fchownat/enable
  # touch /tmp/x
  # chown sshd /tmp/x
  # reboot

On boot up, the dmesg will have:
 Ring buffer meta [0] is from previous boot!
 Ring buffer meta [1] is from previous boot!
 Ring buffer meta [2] is from previous boot!
 Ring buffer meta [3] is from previous boot!
 Ring buffer meta [4] commit page not found
 Ring buffer meta [5] is from previous boot!
 Ring buffer meta [6] is from previous boot!
 Ring buffer meta [7] is from previous boot!

Where the buffer on CPU 4 had a "commit page not found" error and that
buffer is cleared and reset causing the output to be empty and the data lost.

When it works correctly, it has:

  # cat /sys/kernel/tracing/instances/boot_mapped/trace_pipe
        <...>-1137    [004] .....   998.205323: sys_enter_fchownat: __syscall_nr=0x104 (260) dfd=0xffffff9c (4294967196) filename=(0xffffc90000a0002c) user=0x3e8 (1000) group=0xffffffff (4294967295) flag=0x0 (0

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250513115032.3e0b97f7@gandalf.local.home
Fixes: 5f3b6e839f ("ring-buffer: Validate boot range memory events")
Reported-by: Tasos Sahanidis <tasos@tasossah.com>
Tested-by: Tasos Sahanidis <tasos@tasossah.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-14 13:53:23 -04:00
pengdonglin
11aff32439 ftrace: Fix preemption accounting for stacktrace filter command
The preemption count of the stacktrace filter command to trace ksys_read
is consistently incorrect:

$ echo ksys_read:stacktrace > set_ftrace_filter

   <...>-453     [004] ...1.    38.308956: <stack trace>
=> ksys_read
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe

The root cause is that the trace framework disables preemption when
invoking the filter command callback in function_trace_probe_call:

   preempt_disable_notrace();
   probe_ops->func(ip, parent_ip, probe_opsbe->tr, probe_ops, probe->data);
   preempt_enable_notrace();

Use tracing_gen_ctx_dec() to account for the preempt_disable_notrace(),
which will output the correct preemption count:

$ echo ksys_read:stacktrace > set_ftrace_filter

   <...>-410     [006] .....    31.420396: <stack trace>
=> ksys_read
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe

Cc: stable@vger.kernel.org
Fixes: 36590c50b2 ("tracing: Merge irqflags + preempt counter.")
Link: https://lore.kernel.org/20250512094246.1167956-2-dolinux.peng@gmail.com
Signed-off-by: pengdonglin <dolinux.peng@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-14 13:53:23 -04:00
pengdonglin
e333332657 ftrace: Fix preemption accounting for stacktrace trigger command
When using the stacktrace trigger command to trace syscalls, the
preemption count was consistently reported as 1 when the system call
event itself had 0 (".").

For example:

root@ubuntu22-vm:/sys/kernel/tracing/events/syscalls/sys_enter_read
$ echo stacktrace > trigger
$ echo 1 > enable

    sshd-416     [002] .....   232.864910: sys_read(fd: a, buf: 556b1f3221d0, count: 8000)
    sshd-416     [002] ...1.   232.864913: <stack trace>
 => ftrace_syscall_enter
 => syscall_trace_enter
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

The root cause is that the trace framework disables preemption in __DO_TRACE before
invoking the trigger callback.

Use the tracing_gen_ctx_dec() that will accommodate for the increase of
the preemption count in __DO_TRACE when calling the callback. The result
is the accurate reporting of:

    sshd-410     [004] .....   210.117660: sys_read(fd: 4, buf: 559b725ba130, count: 40000)
    sshd-410     [004] .....   210.117662: <stack trace>
 => ftrace_syscall_enter
 => syscall_trace_enter
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

Cc: stable@vger.kernel.org
Fixes: ce33c845b0 ("tracing: Dump stacktrace trigger to the corresponding instance")
Link: https://lore.kernel.org/20250512094246.1167956-1-dolinux.peng@gmail.com
Signed-off-by: pengdonglin <dolinux.peng@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-14 13:53:09 -04:00
Masami Hiramatsu (Google)
2632a2013f tracing: Record trace_clock and recover when reboot
Record trace_clock information in the trace_scratch area and recover
the trace_clock when boot, so that reader can docode the timestamp
correctly.
Note that since most trace_clocks records the timestamp in nano-
seconds, this is not a bug. But some trace_clock, like counter and
tsc will record the counter value. Only for those trace_clock user
needs this information.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/174720625803.1925039.1815089037443798944.stgit@mhiramat.tok.corp.google.com
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-14 11:23:28 -04:00
Yury Norov
45c28cdce7 tracing: Cleanup upper_empty() in pid_list
Instead of find_first_bit() use the dedicated bitmap_empty(),
and make upper_empty() a nice one-liner.

While there, fix opencoded BITS_PER_TYPE().

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250429195119.620204-1-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-14 11:19:32 -04:00
Tao Chen
3880cdbed1 bpf: Fix WARN() in get_bpf_raw_tp_regs
syzkaller reported an issue:

WARNING: CPU: 3 PID: 5971 at kernel/trace/bpf_trace.c:1861 get_bpf_raw_tp_regs+0xa4/0x100 kernel/trace/bpf_trace.c:1861
Modules linked in:
CPU: 3 UID: 0 PID: 5971 Comm: syz-executor205 Not tainted 6.15.0-rc5-syzkaller-00038-g707df3375124 #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:get_bpf_raw_tp_regs+0xa4/0x100 kernel/trace/bpf_trace.c:1861
RSP: 0018:ffffc90003636fa8 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000003 RCX: ffffffff81c6bc4c
RDX: ffff888032efc880 RSI: ffffffff81c6bc83 RDI: 0000000000000005
RBP: ffff88806a730860 R08: 0000000000000005 R09: 0000000000000003
R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000004
R13: 0000000000000001 R14: ffffc90003637008 R15: 0000000000000900
FS:  0000000000000000(0000) GS:ffff8880d6cdf000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7baee09130 CR3: 0000000029f5a000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ____bpf_get_stack_raw_tp kernel/trace/bpf_trace.c:1934 [inline]
 bpf_get_stack_raw_tp+0x24/0x160 kernel/trace/bpf_trace.c:1931
 bpf_prog_ec3b2eefa702d8d3+0x43/0x47
 bpf_dispatcher_nop_func include/linux/bpf.h:1316 [inline]
 __bpf_prog_run include/linux/filter.h:718 [inline]
 bpf_prog_run include/linux/filter.h:725 [inline]
 __bpf_trace_run kernel/trace/bpf_trace.c:2363 [inline]
 bpf_trace_run3+0x23f/0x5a0 kernel/trace/bpf_trace.c:2405
 __bpf_trace_mmap_lock_acquire_returned+0xfc/0x140 include/trace/events/mmap_lock.h:47
 __traceiter_mmap_lock_acquire_returned+0x79/0xc0 include/trace/events/mmap_lock.h:47
 __do_trace_mmap_lock_acquire_returned include/trace/events/mmap_lock.h:47 [inline]
 trace_mmap_lock_acquire_returned include/trace/events/mmap_lock.h:47 [inline]
 __mmap_lock_do_trace_acquire_returned+0x138/0x1f0 mm/mmap_lock.c:35
 __mmap_lock_trace_acquire_returned include/linux/mmap_lock.h:36 [inline]
 mmap_read_trylock include/linux/mmap_lock.h:204 [inline]
 stack_map_get_build_id_offset+0x535/0x6f0 kernel/bpf/stackmap.c:157
 __bpf_get_stack+0x307/0xa10 kernel/bpf/stackmap.c:483
 ____bpf_get_stack kernel/bpf/stackmap.c:499 [inline]
 bpf_get_stack+0x32/0x40 kernel/bpf/stackmap.c:496
 ____bpf_get_stack_raw_tp kernel/trace/bpf_trace.c:1941 [inline]
 bpf_get_stack_raw_tp+0x124/0x160 kernel/trace/bpf_trace.c:1931
 bpf_prog_ec3b2eefa702d8d3+0x43/0x47

Tracepoint like trace_mmap_lock_acquire_returned may cause nested call
as the corner case show above, which will be resolved with more general
method in the future. As a result, WARN_ON_ONCE will be triggered. As
Alexei suggested, remove the WARN_ON_ONCE first.

Fixes: 9594dc3c7e ("bpf: fix nested bpf tracepoints with per-cpu data")
Reported-by: syzbot+45b0c89a0fc7ae8dbadc@syzkaller.appspotmail.com
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250513042747.757042-1-chen.dylane@linux.dev

Closes: https://lore.kernel.org/bpf/8bc2554d-1052-4922-8832-e0078a033e1d@gmail.com
2025-05-13 09:32:56 -07:00
Masami Hiramatsu (Google)
fd837de3c9 tracing: probes: Fix a possible race in trace_probe_log APIs
Since the shared trace_probe_log variable can be accessed and
modified via probe event create operation of kprobe_events,
uprobe_events, and dynamic_events, it should be protected.
In the dynamic_events, all operations are serialized by
`dyn_event_ops_mutex`. But kprobe_events and uprobe_events
interfaces are not serialized.

To solve this issue, introduces dyn_event_create(), which runs
create() operation under the mutex, for kprobe_events and
uprobe_events. This also uses lockdep to check the mutex is
held when using trace_probe_log* APIs.

Link: https://lore.kernel.org/all/174684868120.551552.3068655787654268804.stgit@devnote2/

Reported-by: Paul Cacheux <paulcacheux@gmail.com>
Closes: https://lore.kernel.org/all/20250510074456.805a16872b591e2971a4d221@kernel.org/
Fixes: ab105a4fb8 ("tracing: Use tracing error_log with probe events")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-05-13 22:23:34 +09:00
Mykyta Yatsenko
a498ee7576 bpf: Implement dynptr copy kfuncs
This patch introduces a new set of kfuncs for working with dynptrs in
BPF programs, enabling reading variable-length user or kernel data
into dynptr directly. To enable memory-safety, verifier allows only
constant-sized reads via existing bpf_probe_read_{user|kernel} etc.
kfuncs, dynptr-based kfuncs allow dynamically-sized reads without memory
safety shortcomings.

The following kfuncs are introduced:
* `bpf_probe_read_kernel_dynptr()`: probes kernel-space data into a dynptr
* `bpf_probe_read_user_dynptr()`: probes user-space data into a dynptr
* `bpf_probe_read_kernel_str_dynptr()`: probes kernel-space string into
a dynptr
* `bpf_probe_read_user_str_dynptr()`: probes user-space string into a
dynptr
* `bpf_copy_from_user_dynptr()`: sleepable, copies user-space data into
a dynptr for the current task
* `bpf_copy_from_user_str_dynptr()`: sleepable, copies user-space string
into a dynptr for the current task
* `bpf_copy_from_user_task_dynptr()`: sleepable, copies user-space data
of the task into a dynptr
* `bpf_copy_from_user_task_str_dynptr()`: sleepable, copies user-space
string of the task into a dynptr

The implementation is built on two generic functions:
 * __bpf_dynptr_copy
 * __bpf_dynptr_copy_str
These functions take function pointers as arguments, enabling the
copying of data from various sources, including both kernel and user
space.
Use __always_inline for generic functions and callbacks to make sure the
compiler doesn't generate indirect calls into callbacks, which is more
expensive, especially on some kernel configurations. Inlining allows
compiler to put direct calls into all the specific callback implementations
(copy_user_data_sleepable, copy_user_data_nofault, and so on).

Reviewed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20250512205348.191079-3-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-12 18:31:51 -07:00
Paul Cacheux
e41b5af451 tracing: add missing trace_probe_log_clear for eprobes
Make sure trace_probe_log_clear is called in the tracing
eprobe code path, matching the trace_probe_log_init call.

Link: https://lore.kernel.org/all/20250504-fix-trace-probe-log-race-v3-1-9e99fec7eddc@gmail.com/

Signed-off-by: Paul Cacheux <paulcacheux@gmail.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-05-10 08:44:50 +09:00
Breno Leitao
9dda18a32b tracing: fprobe: Fix RCU warning message in list traversal
When CONFIG_PROVE_RCU_LIST is enabled, fprobe triggers the following
warning:

    WARNING: suspicious RCU usage
    kernel/trace/fprobe.c:457 RCU-list traversed in non-reader section!!

    other info that might help us debug this:
	#1: ffffffff863c4e08 (fprobe_mutex){+.+.}-{4:4}, at: fprobe_module_callback+0x7b/0x8c0

    Call Trace:
	fprobe_module_callback
	notifier_call_chain
	blocking_notifier_call_chain

This warning occurs because fprobe_remove_node_in_module() traverses an
RCU list using RCU primitives without holding an RCU read lock. However,
the function is only called from fprobe_module_callback(), which holds
the fprobe_mutex lock that provides sufficient protection for safely
traversing the list.

Fix the warning by specifying the locking design to the
CONFIG_PROVE_RCU_LIST mechanism. Add the lockdep_is_held() argument to
hlist_for_each_entry_rcu() to inform the RCU checker that fprobe_mutex
provides the required protection.

Link: https://lore.kernel.org/all/20250410-fprobe-v1-1-068ef5f41436@debian.org/

Fixes: a3dc2983ca ("tracing: fprobe: Cleanup fprobe hash when module unloading")
Signed-off-by: Breno Leitao <leitao@debian.org>
Tested-by: Antonio Quartulli <antonio@mandelbit.com>
Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-05-10 08:28:02 +09:00
Jiri Olsa
8231533340 bpf: Add support to retrieve ref_ctr_offset for uprobe perf link
Adding support to retrieve ref_ctr_offset for uprobe perf link,
which got somehow omitted from the initial uprobe link info changes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/bpf/20250509153539.779599-2-jolsa@kernel.org
2025-05-09 13:01:07 -07:00
Steven Rostedt
7b382efd5e tracing: Allow the top level trace_marker to write into another instances
There are applications that have it hard coded to write into the top level
trace_marker instance (/sys/kernel/tracing/trace_marker). This can be
annoying if a profiler is using that instance for other work, or if it
needs all writes to go into a new instance.

A new option is created called "copy_trace_marker". By default, the top
level has this set, as that is the default buffer that writing into the
top level trace_marker file will go to. But now if an instance is created
and sets this option, all writes into the top level trace_marker will also
be written into that instance buffer just as if an application were to
write into the instance's trace_marker file.

If the top level instance disables this option, then writes to its own
trace_marker and trace_marker_raw files will not go into its buffer.

If no instance has this option set, then the write will return an error
and errno will contain ENODEV.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250508095639.39f84eda@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-09 15:19:11 -04:00
Steven Rostedt
6956ea9fdc tracing: Add a helper function to handle the dereference arg in verifier
Add a helper function called handle_dereference_arg() to replace the logic
that is identical in two locations of test_event_printk().

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250507191703.5dd8a61d@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-09 15:19:11 -04:00
Steven Rostedt
f75340d73c tracing: Remove unnecessary "goto out" that simply returns ret is trigger code
There's several functions that have "goto out;" where the label out is just:

 out:
	return ret;

Simplify the code by just doing the return in the location and removing
all the out labels and jumps.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://lore.kernel.org/20250507145456.121186494@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-09 15:19:11 -04:00
Miaoqian Lin
c5dd28e7fb tracing: Fix error handling in event_trigger_parse()
According to trigger_data_alloc() doc, trigger_data_free() should be
used to free an event_trigger_data object. This fixes a mismatch introduced
when kzalloc was replaced with trigger_data_alloc without updating
the corresponding deallocation calls.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://lore.kernel.org/20250507145455.944453325@goodmis.org
Link: https://lore.kernel.org/20250318112737.4174-1-linmq006@gmail.com
Fixes: e1f187d09e ("tracing: Have existing event_command.parse() implementations use helpers")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
[ SDR: Changed event_trigger_alloc/free() to trigger_data_alloc/free() ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-09 15:19:11 -04:00
Steven Rostedt
f2947c4b7d tracing: Rename event_trigger_alloc() to trigger_data_alloc()
The function event_trigger_alloc() creates an event_trigger_data
descriptor and states that it needs to be freed via event_trigger_free().
This is incorrect, it needs to be freed by trigger_data_free() as
event_trigger_free() adds ref counting.

Rename event_trigger_alloc() to trigger_data_alloc() and state that it
needs to be freed via trigger_data_free(). This naming convention
was introducing bugs.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://lore.kernel.org/20250507145455.776436410@goodmis.org
Fixes: 86599dbe2c ("tracing: Add helper functions to simplify event_command.parse() callback handling")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-09 15:19:11 -04:00
Devaansh Kumar
73207746d3 tracing: Replace deprecated strncpy() with strscpy() for stack_trace_filter_buf
strncpy() is deprecated for NUL-terminated destination buffers and must
be replaced by strscpy().

See issue: https://github.com/KSPP/linux/issues/90

Link: https://lore.kernel.org/20250507133837.19640-1-devaanshk840@gmail.com
Signed-off-by: Devaansh Kumar <devaanshk840@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-09 15:19:11 -04:00
Steven Rostedt
6e3b3acaf4 tracing: Remove unused buffer_page field from trace_array_cpu structure
The trace_array_cpu had a "buffer_page" field that was originally going to
be used as a backup page for the ring buffer. But the ring buffer has its
own way of reusing pages and this field was never used.

Remove it.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212236.738849456@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-09 15:19:10 -04:00
Steven Rostedt
c4a80c0615 tracing: Use atomic_inc_return() for updating "disabled" counter in irqsoff tracer
The irqsoff tracer uses the per CPU "disabled" field to prevent corruption
of the accounting when it starts to trace interrupts disabled, but there's
a slight race that could happen if for some reason it was called twice.
Use atomic_inc_return() instead.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212236.567884756@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-09 15:19:10 -04:00
Steven Rostedt
90633c34c3 tracing: Convert the per CPU "disabled" counter to local from atomic
The per CPU "disabled" counter is used for the latency tracers and stack
tracers to make sure that their accounting isn't messed up by an NMI or
interrupt coming in and affecting the same CPU data. But the counter is an
atomic_t type. As it only needs to synchronize against the current CPU,
switch it over to local_t type.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212236.394925376@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-09 15:19:10 -04:00
Steven Rostedt
cf64792f0a tracing: branch: Use trace_tracing_is_on_cpu() instead of "disabled" field
The branch tracer currently checks the per CPU "disabled" field to know if
tracing is enabled or not for the CPU. As the "disabled" value is not used
anymore to turn of tracing generically, use tracing_tracer_is_on_cpu()
instead.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212236.224658526@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-09 15:19:10 -04:00
Steven Rostedt
092a38565e ring-buffer: Add ring_buffer_record_is_on_cpu()
Add the function ring_buffer_record_is_on_cpu() that returns true if the
ring buffer for a give CPU is writable and false otherwise.

Also add tracer_tracing_is_on_cpu() to return if the ring buffer for a
given CPU is writeable for a given trace_array.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212236.059853898@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-09 15:19:10 -04:00
Steven Rostedt
969043af15 tracing: Do not use per CPU array_buffer.data->disabled for cpumask
The per CPU "disabled" value was the original way to disable tracing when
the tracing subsystem was first created. Today, the ring buffer
infrastructure has its own way to disable tracing. In fact, things have
changed so much since 2008 that many things ignore the disable flag.

Do not bother setting the per CPU disabled flag of the array_buffer data
to use to determine what CPUs can write to the buffer and only rely on the
ring buffer code itself to disabled it.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212235.885452497@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-09 15:19:10 -04:00
Steven Rostedt
f62e3de375 ftrace: Do not disabled function graph based on "disabled" field
The per CPU "disabled" value was the original way to disable tracing when
the tracing subsystem was first created. Today, the ring buffer
infrastructure has its own way to disable tracing. In fact, things have
changed so much since 2008 that many things ignore the disable flag.

Do not bother disabling the function graph tracer if the per CPU disabled
field is set. Just record as normal. If tracing is disabled in the ring
buffer it will not be recorded.

Also, when tracing is enabled again, it will not drop the return call of
the function.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212235.715752008@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-09 15:19:10 -04:00
Steven Rostedt
a9839d2048 tracing: kdb: Use tracer_tracing_on/off() instead of setting per CPU disabled
The per CPU "disabled" value was the original way to disable tracing when
the tracing subsystem was first created. Today, the ring buffer
infrastructure has its own way to disable tracing. In fact, things have
changed so much since 2008 that many things ignore the disable flag.

The kdb_ftdump() function iterates over all the current tracing CPUs and
increments the "disabled" counter before doing the dump, and decrements it
afterward.

As the disabled flag can be ignored, doing this today is not reliable.
Instead, simply call tracer_tracing_off() and then tracer_tracing_on() to
disable and then enabled the entire ring buffer in one go!

Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Daniel Thompson <danielt@kernel.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/20250505212235.549033722@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-09 15:18:47 -04:00
Steven Rostedt
6ba3e0533f tracing: Use tracer_tracing_disable() instead of "disabled" field for ftrace_dump_one()
The per CPU "disabled" value was the original way to disable tracing when
the tracing subsystem was first created. Today, the ring buffer
infrastructure has its own way to disable tracing. In fact, things have
changed so much since 2008 that many things ignore the disable flag.

The ftrace_dump_one() function iterates over all the current tracing CPUs and
increments the "disabled" counter before doing the dump, and decrements it
afterward.

As the disabled flag can be ignored, doing this today is not reliable.
Instead use the new tracer_tracing_disable() that calls into the ring
buffer code to do the disabling.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212235.381188238@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-09 15:18:40 -04:00
Steven Rostedt
dbecef68ad tracing: Add tracer_tracing_disable/enable() functions
Allow a tracer to disable writing to its buffer for a temporary amount of
time and re-enable it.

The tracer_tracing_disable() will disable writing to the trace array
buffer, and requires a tracer_tracing_enable() to re-enable it.

The difference between tracer_tracing_disable() and tracer_tracing_off()
is that the disable version can nest, and requires as many enable() calls
as disable() calls to re-enable the buffer. Where as the off() function
can be called multiple times and only requires a singe tracer_tracing_on()
to re-enable the buffer.

Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Daniel Thompson <danielt@kernel.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/20250505212235.210330010@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-09 15:18:21 -04:00
Feng Yang
ee971630f2 bpf: Allow some trace helpers for all prog types
if it works under NMI and doesn't use any context-dependent things,
should be fine for any program type. The detailed discussion is in [1].

[1] https://lore.kernel.org/all/CAEf4Bza6gK3dsrTosk6k3oZgtHesNDSrDd8sdeQ-GiS6oJixQg@mail.gmail.com/

Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20250506061434.94277-2-yangfeng59949@163.com
2025-05-09 10:37:10 -07:00
Steven Rostedt
1577683a92 tracing: Just use this_cpu_read() to access ignore_pid
The ignore_pid boolean on the per CPU data descriptor is updated at
sched_switch when a new task is scheduled in. If the new task is to be
ignored, it is set to true, otherwise it is set to false. The current task
should always have the correct value as it is updated when the task is
scheduled in.

Instead of breaking up the read of this value, which requires preemption
to be disabled, just use this_cpu_read() which gives a snapshot of the
value. Since the value will always be correct for a given task (because
it's updated at sched switch) it doesn't need preemption disabled.

This will also allow trace events to be called with preemption enabled.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212235.038958766@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-08 09:36:09 -04:00
Steven Rostedt
c638ebd823 ftrace: Do not bother checking per CPU "disabled" flag
The per CPU "disabled" value was the original way to disable tracing when
the tracing subsystem was first created. Today, the ring buffer
infrastructure has its own way to disable tracing. In fact, things have
changed so much since 2008 that many things ignore the disable flag.

There's no reason for the function tracer to check it, if tracing is
disabled, the ring buffer will not record the event anyway.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212234.868972758@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-08 09:36:09 -04:00
Steven Rostedt
6936298393 tracing/mmiotrace: Remove reference to unused per CPU data pointer
The mmiotracer referenced the per CPU array_buffer->data descriptor but
never actually used it. Remove the references to it.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212234.696945463@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-08 09:36:09 -04:00
Tomas Glozar
17f89102fe tracing/osnoise: Allow arbitrarily long CPU string
Allocate kernel memory for processing CPU string
(/sys/kernel/tracing/osnoise/cpus) also in osnoise_cpus_write to allow
the writing of a CPU string of an arbitrary length.

This replaces the 256-byte buffer, which is insufficient with the rising
number of CPUs. For example, if I wanted to measure on every even CPU
on a system with 256 CPUs, the string would be 456 characters long.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250425091839.343289-1-tglozar@redhat.com
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-08 09:36:09 -04:00
Steven Rostedt
a54665ab7c ftrace: Comment that ftrace_func_mapper is freed with free_ftrace_hash()
The structure ftrace_func_mapper only contains a single field and that is
a ftrace_hash. It is used to abstract it out from a normal hash to control
users of how it gets modified.

The freeing of a ftrace_func_mapper structure is:

  free_ftrace_hash(&mapper->hash);

Without context, this looks like a bug. It should be commented that it is
not a bug and it is freed this way.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20250416165420.5c717420@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-08 09:36:08 -04:00
Ilya Leoshkevich
761ef34228 ftrace: Expose call graph depth as unsigned int
Depth is stored as int because the code uses negative values to break
out of iterations. But what is recorded is always zero or positive. So
expose it as unsigned int instead of int.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250412221847.17310-3-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-08 09:36:08 -04:00
Steven Rostedt
88cefd99ee ftrace: Show subops in enabled_functions
The function graph infrastructure uses subops of the function tracer.
These are not shown in enabled_functions. Add a "subops:" section to the
enabled_functions line to show what functions are attached via subops. If
the subops is from the function_graph infrastructure, then show the entry
and return callbacks that are attached.

Here's an example of the output:

schedule_on_each_cpu (1)                tramp: 0xffffffffc03ef000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60     subops: {ent:trace_graph_entry+0x0/0x20 ret:trace_graph_return+0x0/0x150}

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250410153830.5d97f108@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-08 09:36:08 -04:00
Al Viro
2dbf6e0df4 kill vfs_submount()
The last remaining user of vfs_submount() (tracefs) is easy to convert
to fs_context_for_submount(); do that and bury that thing, along with
SB_SUBMOUNT

Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-06 12:49:07 -04:00
Steven Rostedt
54c53dfdb6 tracing: Add common_comm to histograms
If one wants to trace the name of the task that wakes up a process and
pass that to the synthetic events, there's nothing currently that lets the
synthetic events do that. Add a "common_comm" to the histogram logic that
allows histograms save the current->comm as a variable that can be passed
through and added to a synthetic event:

 # cd /sys/kernel/tracing
 # echo 's:wake_lat char[] waker; char[] wakee; u64 delta;' >> dynamic_events
 # echo 'hist:keys=pid:comm=common_comm:ts=common_timestamp.usecs if !(common_flags & 0x18)' > events/sched/sched_waking/trigger
 # echo 'hist:keys=next_pid:wake_comm=$comm:delta=common_timestamp.usecs-$ts:onmatch(sched.sched_waking).trace(wake_lat,$wake_comm,next_comm,$delta)' > events/sched/sched_switch/trigger

The above will create a synthetic trace event that will save both the name
of the waker and the wakee but only if the wakeup did not happen in a hard
or soft interrupt context.

The "common_comm" is used to save the task->comm at the time of the
initial event and is passed via the "comm" variable to the second event,
and that is saved as the "waker" field in the "wake_lat" synthetic event.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250407154912.3c6c6246@gandalf.local.home
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-06 11:37:03 -04:00
Steven Rostedt
7ab0fc61ce tracing: Move histogram trigger variables from stack to per CPU structure
The histogram trigger has three somewhat large arrays on the kernel stack:

	unsigned long entries[HIST_STACKTRACE_DEPTH];
	u64 var_ref_vals[TRACING_MAP_VARS_MAX];
	char compound_key[HIST_KEY_SIZE_MAX];

Checking the function event_hist_trigger() stack frame size, it currently
uses 816 bytes for its stack frame due to these variables!

Instead, allocate a per CPU structure that holds these arrays for each
context level (normal, softirq, irq and NMI). That is, each CPU will have
4 of these structures. This will be allocated when the first histogram
trigger is enabled and freed when the last is disabled. When the
histogram callback triggers, it will request this structure. The request
will disable preemption, get the per CPU structure at the index of the
per CPU variable, and increment that variable.

The callback will use the arrays in this structure to perform its work and
then release the structure. That in turn will simply decrement the per CPU
index and enable preemption.

Moving the variables from the kernel stack to the per CPU structure brings
the stack frame of event_hist_trigger() down to just 112 bytes.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://lore.kernel.org/20250407123851.74ea8d58@gandalf.local.home
Fixes: 067fe038e7 ("tracing: Add variable reference handling to hist triggers")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-06 11:36:11 -04:00
Steven Rostedt
872a0d90c1 tracing: Always use memcpy() in histogram add_to_key()
The add_to_key() function tests if the key is a string or some data. If
it's a string it does some further calculations of the string size (still
truncating it to the max size it can be), and calls strncpy().

If the key isn't as string it calls memcpy(). The interesting point is
that both use the exact same parameters:

                strncpy(compound_key + key_field->offset, (char *)key, size);
        } else
                memcpy(compound_key + key_field->offset, key, size);

As strncpy() is being used simply as a memcpy() for a string, and since
strncpy() is deprecated, just call memcpy() for both memory and string
keys.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/20250403210637.1c477d4a@gandalf.local.home
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-06 11:35:34 -04:00
Steven Rostedt
3e4b37160b tracing: Show preempt and irq events callsites from the offsets in field print
When the "fields" option is set in a trace instance, it ignores the "print fmt"
portion of the trace event and just prints the raw fields defined by the
TP_STRUCT__entry() of the TRACE_EVENT() macro.

The preempt_disable/enable and irq_disable/enable events record only the
caller offset from _stext to save space in the ring buffer. Even though
the "fields" option only prints the fields, it also tries to print what
they represent too, which includes function names.

Add a check in the output of the event field printing to see if the field
name is "caller_offs" or "parent_offs" and then print the function at the
offset from _stext of that field.

Instead of just showing:

  irq_disable: caller_offs=0xba634d (12215117) parent_offs=0x39d10e2 (60625122)

Show:

  irq_disable: caller_offs=trace_hardirqs_off.part.0+0xad/0x130 0xba634d (12215117) parent_offs=_raw_spin_lock_irqsave+0x62/0x70 0x39d10e2 (60625122)

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250506105131.4b6089a9@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-06 11:34:52 -04:00
Steven Rostedt
dc6a49d4cd tracing: Adjust addresses for printing out fields
Add adjustments to the values of the "fields" output if the buffer is a
persistent ring buffer to adjust the addresses to both the kernel core and
kernel modules if they match a module in the persistent memory and that
module is also loaded.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250325185619.54b85587@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-06 11:32:29 -04:00
Steven Rostedt
00d872dd54 tracing: Only return an adjusted address if it matches the kernel address
The trace_adjust_address() will take a given address and examine the
persistent ring buffer to see if the address matches a module that is
listed there. If it does not, it will just adjust the value to the core
kernel delta. But if the address was for something that was not part of
the core kernel text or data it should not be adjusted.

Check the result of the adjustment and only return the adjustment if it
lands in the current kernel text or data. If not, return the original
address.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250506102300.0ba2f9e0@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-06 11:31:45 -04:00
Steven Rostedt
531ee10b43 tracing: Show function names when possible when listing fields
When the "fields" option is enabled, the "print fmt" of the trace event is
ignored and only the fields are printed. But some fields contain function
pointers. Instead of just showing the hex value in this case, show the
function name when possible:

Instead of having:

 # echo 1 > options/fields
 # cat trace
 [..]
  kmem_cache_free: call_site=0xffffffffa9afcf31 (-1448095951) ptr=0xffff888124452910 (-131386736039664) name=kmemleak_object

Have it output:

  kmem_cache_free: call_site=rcu_do_batch+0x3d1/0x14a0 (-1768960207) ptr=0xffff888132ea5ed0 (854220496) name=kmemleak_object

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250325213919.624181915@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-06 11:30:56 -04:00
Steven Rostedt
e3223e1e9a tracing: Update function trace addresses with module addresses
Now that module addresses are saved in the persistent ring buffer, their
addresses can be used to adjust the address in the persistent ring buffer
to the address of the module that is currently loaded.

Instead of blindly using the text_delta that only works for core kernel
code, call the trace_adjust_address() that will see if the address matches
an address saved in the persistent ring buffer, and then uses that against
the matching module if it is loaded.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250506111648.5df7f3ec@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-06 11:30:17 -04:00
Christoph Hellwig
eeadd68e2a block: remove bounce buffering support
The block layer bounce buffering support is unused now, remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250505081138.3435992-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-05 13:22:39 -06:00
Al Viro
006ff7498f saner calling conventions for ->d_automount()
Currently the calling conventions for ->d_automount() instances have
an odd wart - returned new mount to be attached is expected to have
refcount 2.

That kludge is intended to make sure that mark_mounts_for_expiry() called
before we get around to attaching that new mount to the tree won't decide
to take it out.  finish_automount() drops the extra reference after it's
done with attaching mount to the tree - or drops the reference twice in
case of error.  ->d_automount() instances have rather counterintuitive
boilerplate in them.

There's a much simpler approach: have mark_mounts_for_expiry() skip the
mounts that are yet to be mounted.  And to hell with grabbing/dropping
those extra references.  Makes for simpler correctness analysis, at that...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Acked-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-05 13:42:49 -04:00
Steven Rostedt
0a8f11f856 tracing: Do not take trace_event_sem in print_event_fields()
On some paths in print_event_fields() it takes the trace_event_sem for
read, even though it should always be held when the function is called.

Remove the taking of that mutex and add a lockdep_assert_held_read() to
make sure the trace_event_sem is held when print_event_fields() is called.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250501224128.0b1f0571@batman.local.home
Fixes: 80a76994b2 ("tracing: Add "fields" option to show raw trace event fields")
Reported-by: syzbot+441582c1592938fccf09@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6813ff5e.050a0220.14dd7d.001b.GAE@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-01 22:44:52 -04:00
Steven Rostedt
1be8e54a1e tracing: Fix trace_adjust_address() when there is no modules in scratch area
The function trace_adjust_address() is used to map addresses of modules
stored in the persistent memory and are also loaded in the current boot to
return the current address for the module.

If there's only one module entry, it will simply use that, otherwise it
performs a bsearch of the entry array to find the modules to offset with.

The issue is if there are no modules in the array. The code does not
account for that and ends up referencing the first element in the array
which does not exist and causes a crash.

If nr_entries is zero, exit out early as if this was a core kernel
address.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250501151909.65910359@gandalf.local.home
Fixes: 35a380ddbc ("tracing: Show last module text symbols in the stacktrace")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-01 16:06:55 -04:00
Colin Ian King
3c1d9cfa84 ftrace: Fix NULL memory allocation check
The check for a failed memory location is incorrectly checking
the wrong level of pointer indirection by checking !filter_hash
rather than !*filter_hash.  Fix this.

Cc: asami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250422221335.89896-1-colin.i.king@gmail.com
Fixes: 0ae6b8ce20 ("ftrace: Fix accounting of subop hashes")
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-01 15:46:19 -04:00
Jeongjun Park
f5178c41bb tracing: Fix oob write in trace_seq_to_buffer()
syzbot reported this bug:
==================================================================
BUG: KASAN: slab-out-of-bounds in trace_seq_to_buffer kernel/trace/trace.c:1830 [inline]
BUG: KASAN: slab-out-of-bounds in tracing_splice_read_pipe+0x6be/0xdd0 kernel/trace/trace.c:6822
Write of size 4507 at addr ffff888032b6b000 by task syz.2.320/7260

CPU: 1 UID: 0 PID: 7260 Comm: syz.2.320 Not tainted 6.15.0-rc1-syzkaller-00301-g3bde70a2c827 #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:408 [inline]
 print_report+0xc3/0x670 mm/kasan/report.c:521
 kasan_report+0xe0/0x110 mm/kasan/report.c:634
 check_region_inline mm/kasan/generic.c:183 [inline]
 kasan_check_range+0xef/0x1a0 mm/kasan/generic.c:189
 __asan_memcpy+0x3c/0x60 mm/kasan/shadow.c:106
 trace_seq_to_buffer kernel/trace/trace.c:1830 [inline]
 tracing_splice_read_pipe+0x6be/0xdd0 kernel/trace/trace.c:6822
 ....
==================================================================

It has been reported that trace_seq_to_buffer() tries to copy more data
than PAGE_SIZE to buf. Therefore, to prevent this, we should use the
smaller of trace_seq_used(&iter->seq) and PAGE_SIZE as an argument.

Link: https://lore.kernel.org/20250422113026.13308-1-aha310510@gmail.com
Reported-by: syzbot+c8cd2d2c412b868263fb@syzkaller.appspotmail.com
Fixes: 3c56819b14 ("tracing: splice support for tracing_pipe")
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-01 15:24:15 -04:00
Feng Yang
6aca583f90 bpf: Streamline allowed helpers between tracing and base sets
Many conditional checks in switch-case are redundant
with bpf_base_func_proto and should be removed.

Regarding the permission checks bpf_base_func_proto:
The permission checks in bpf_prog_load (as outlined below)
ensure that the trace has both CAP_BPF and CAP_PERFMON capabilities,
thus enabling the use of corresponding prototypes
in bpf_base_func_proto without adverse effects.
bpf_prog_load
	......
	bpf_cap = bpf_token_capable(token, CAP_BPF);
	......
	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
	    type != BPF_PROG_TYPE_CGROUP_SKB &&
	    !bpf_cap)
		goto put_token;
	......
	if (is_perfmon_prog_type(type) && !bpf_token_capable(token, CAP_PERFMON))
		goto put_token;
	......

Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20250423073151.297103-1-yangfeng59949@163.com
2025-04-23 10:52:16 -07:00
Alexei Starovoitov
5709be4c35 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc3
Cross-merge bpf and other fixes after downstream PRs.

No conflicts.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-21 08:04:38 -07:00
Steven Rostedt
a8c5b0ed89 tracing: Fix filter string testing
The filter string testing uses strncpy_from_kernel/user_nofault() to
retrieve the string to test the filter against. The if() statement was
incorrect as it considered 0 as a fault, when it is only negative that it
faulted.

Running the following commands:

  # cd /sys/kernel/tracing
  # echo "filename.ustring ~ \"/proc*\"" > events/syscalls/sys_enter_openat/filter
  # echo 1 > events/syscalls/sys_enter_openat/enable
  # ls /proc/$$/maps
  # cat trace

Would produce nothing, but with the fix it will produce something like:

      ls-1192    [007] .....  8169.828333: sys_openat(dfd: ffffffffffffff9c, filename: 7efc18359904, flags: 80000, mode: 0)

Link: https://lore.kernel.org/all/CAEf4BzbVPQ=BjWztmEwBPRKHUwNfKBkS3kce-Rzka6zvbQeVpg@mail.gmail.com/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250417183003.505835fb@gandalf.local.home
Fixes: 77360f9bbc ("tracing: Add test for user space strings when filtering on string pointers")
Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Reported-by: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-17 22:16:56 -04:00
Ilya Leoshkevich
3b4e87e6a5 ftrace: Fix type of ftrace_graph_ent_entry.depth
ftrace_graph_ent.depth is int, but ftrace_graph_ent_entry.depth is
unsigned long. This confuses trace-cmd on 64-bit big-endian systems and
makes it print a huge amount of spaces. Fix this by using unsigned int,
which has a matching size, instead.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250412221847.17310-2-iii@linux.ibm.com
Fixes: ff5c9c576e ("ftrace: Add support for function argument to graph tracer")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-17 15:19:15 -04:00
Menglong Dong
92f1d3b401 ftrace: fix incorrect hash size in register_ftrace_direct()
The maximum of the ftrace hash bits is made fls(32) in
register_ftrace_direct(), which seems illogical. So, we fix it by making
the max hash bits FTRACE_HASH_MAX_BITS instead.

Link: https://lore.kernel.org/20250413014444.36724-1-dongml2@chinatelecom.cn
Fixes: d05cb47066 ("ftrace: Fix modification of direct_function hash while in use")
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-17 15:16:51 -04:00
Steven Rostedt
c45c585dde ftrace: Free ftrace hashes after they are replaced in the subops code
The subops processing creates new hashes when adding and removing subops.
There were some places that the old hashes that were replaced were not
freed and this caused some memory leaks.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250417135939.245b128d@gandalf.local.home
Fixes: 0ae6b8ce20 ("ftrace: Fix accounting of subop hashes")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-17 15:16:07 -04:00
Steven Rostedt
08275e59a7 ftrace: Reinitialize hash to EMPTY_HASH after freeing
There's several locations that free a ftrace hash pointer but may be
referenced again. Reset them to EMPTY_HASH so that a u-a-f bug doesn't
happen.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250417110933.20ab718b@gandalf.local.home
Fixes: 0ae6b8ce20 ("ftrace: Fix accounting of subop hashes")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-17 15:15:28 -04:00
Steven Rostedt
31d1139956 ftrace: Initialize variables for ftrace_startup/shutdown_subops()
The reworking to fix and simplify the ftrace_startup_subops() and the
ftrace_shutdown_subops() made it possible for the filter_hash and
notrace_hash variables to be used uninitialized in a way that the compiler
did not catch it.

Initialize both filter_hash and notrace_hash to the EMPTY_HASH as that is
what they should be if they never are used.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250417104017.3aea66c2@gandalf.local.home
Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Fixes: 0ae6b8ce20 ("ftrace: Fix accounting of subop hashes")
Closes: https://lore.kernel.org/all/1db64a42-626d-4b3a-be08-c65e47333ce2@linux.ibm.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-17 15:15:05 -04:00
Linus Torvalds
7cdabafc00 tracing fixes for v6.15
- Hide get_vm_area() from MMUless builds
 
   The function get_vm_area() is not defined when CONFIG_MMU is not defined.
   Hide that function within #ifdef CONFIG_MMU.
 
 - Fix output of synthetic events when they have dynamic strings
 
   The print fmt of the synthetic event's format file use to have "%.*s" for
   dynamic size strings even though the user space exported arguments had
   only __get_str() macro that provided just a nul terminated string. This
   was fixed so that user space could parse this properly. But the reason
   that it had "%.*s" was because internally it provided the maximum size of
   the string as one of the arguments. The fix that replaced "%.*s" with "%s"
   caused the trace output (when the kernel reads the event) to write
   "(efault)" as it would now read the length of the string as "%s".
 
   As the string provided is always nul terminated, there's no reason for the
   internal code to use "%.*s" anyway. Just remove the length argument to
   match the "%s" that is now in the format.
 
 - Fix the ftrace subops hash logic of the manager ops hash
 
   The function_graph uses the ftrace subops code. The subops code is a way
   to have a single ftrace_ops registered with ftrace to determine what
   functions will call the ftrace_ops callback. More than one user of
   function graph can register a ftrace_ops with it. The function graph
   infrastructure will then add this ftrace_ops as a subops with the main
   ftrace_ops it registers with ftrace. This is because the functions will
   always call the function graph callback which in turn calls the subops
   ftrace_ops callbacks.
 
   The main ftrace_ops must add a callback to all the functions that the
   subops want a callback from. When a subops is registered, it will update
   the main ftrace_ops hash to include the functions it wants. This is the
   logic that was broken.
 
   The ftrace_ops hash has a "filter_hash" and a "notrace_hash" were all the
   functions in the filter_hash but not in the notrace_hash are attached by
   ftrace. The original logic would have the main ftrace_ops filter_hash be a
   union of all the subops filter_hashes and the main notrace_hash would be a
   intersect of all the subops filter hashes. But this was incorrect because
   the notrace hash depends on the filter_hash it is associated to and not
   the union of all filter_hashes.
 
   Instead, when a subops is added, just include all the functions of the
   subops hash that are in its filter_hash but not in its notrace_hash. The
   main subops hash should not use its notrace hash, unless all of its subops
   hashes have an empty filter_hash (which means to attach to all functions),
   and then, and only then, the main ftrace_ops notrace hash can be the
   intersect of all the subops hashes.
 
   This not only fixes the bug, but also simplifies the code.
 
 - Add a selftest to better test the subops filtering
 
   Add a selftest that would catch the bug fixed by the above change.
 
 - Fix extra newline printed in function tracing with retval
 
   The function parameter code changed the output logic slightly and called
   print_graph_retval() and also printed a newline. The print_graph_retval()
   also prints a newline which caused blank lines to be printed in the
   function graph tracer when retval was added. This caused one of the
   selftests to fail if retvals were enabled. Instead remove the new line
   output from print_graph_retval() and have the callers always print the
   new line so that it doesn't have to do special logic if it calls
   print_graph_retval() or not.
 
 - Fix out-of-bound memory access in the runtime verifier
 
   When rv_is_container_monitor() is called on the last entry on the link
   list it references the next entry, which is the list head and causes an
   out-of-bound memory access.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ/rXQxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qoj7AQC0C2awpJSUIRj91qjPtMYuNUE3AVpB
 EEZEkt19LfE//gEA1fOx3Cors/LrY9dthn/3LMKL23vo9c4i0ffhs2X+1gE=
 =XJL5
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Hide get_vm_area() from MMUless builds

   The function get_vm_area() is not defined when CONFIG_MMU is not
   defined. Hide that function within #ifdef CONFIG_MMU.

 - Fix output of synthetic events when they have dynamic strings

   The print fmt of the synthetic event's format file use to have "%.*s"
   for dynamic size strings even though the user space exported
   arguments had only __get_str() macro that provided just a nul
   terminated string. This was fixed so that user space could parse this
   properly.

   But the reason that it had "%.*s" was because internally it provided
   the maximum size of the string as one of the arguments. The fix that
   replaced "%.*s" with "%s" caused the trace output (when the kernel
   reads the event) to write "(efault)" as it would now read the length
   of the string as "%s".

   As the string provided is always nul terminated, there's no reason
   for the internal code to use "%.*s" anyway. Just remove the length
   argument to match the "%s" that is now in the format.

 - Fix the ftrace subops hash logic of the manager ops hash

   The function_graph uses the ftrace subops code. The subops code is a
   way to have a single ftrace_ops registered with ftrace to determine
   what functions will call the ftrace_ops callback. More than one user
   of function graph can register a ftrace_ops with it. The function
   graph infrastructure will then add this ftrace_ops as a subops with
   the main ftrace_ops it registers with ftrace. This is because the
   functions will always call the function graph callback which in turn
   calls the subops ftrace_ops callbacks.

   The main ftrace_ops must add a callback to all the functions that the
   subops want a callback from. When a subops is registered, it will
   update the main ftrace_ops hash to include the functions it wants.
   This is the logic that was broken.

   The ftrace_ops hash has a "filter_hash" and a "notrace_hash" where
   all the functions in the filter_hash but not in the notrace_hash are
   attached by ftrace. The original logic would have the main ftrace_ops
   filter_hash be a union of all the subops filter_hashes and the main
   notrace_hash would be a intersect of all the subops filter hashes.
   But this was incorrect because the notrace hash depends on the
   filter_hash it is associated to and not the union of all
   filter_hashes.

   Instead, when a subops is added, just include all the functions of
   the subops hash that are in its filter_hash but not in its
   notrace_hash. The main subops hash should not use its notrace hash,
   unless all of its subops hashes have an empty filter_hash (which
   means to attach to all functions), and then, and only then, the main
   ftrace_ops notrace hash can be the intersect of all the subops
   hashes.

   This not only fixes the bug, but also simplifies the code.

 - Add a selftest to better test the subops filtering

   Add a selftest that would catch the bug fixed by the above change.

 - Fix extra newline printed in function tracing with retval

   The function parameter code changed the output logic slightly and
   called print_graph_retval() and also printed a newline. The
   print_graph_retval() also prints a newline which caused blank lines
   to be printed in the function graph tracer when retval was added.
   This caused one of the selftests to fail if retvals were enabled.
   Instead remove the new line output from print_graph_retval() and have
   the callers always print the new line so that it doesn't have to do
   special logic if it calls print_graph_retval() or not.

 - Fix out-of-bound memory access in the runtime verifier

   When rv_is_container_monitor() is called on the last entry on the
   link list it references the next entry, which is the list head and
   causes an out-of-bound memory access.

* tag 'trace-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rv: Fix out-of-bound memory access in rv_is_container_monitor()
  ftrace: Do not have print_graph_retval() add a newline
  tracing/selftest: Add test to better test subops filtering of function graph
  ftrace: Fix accounting of subop hashes
  ftrace: Properly merge notrace hashes
  tracing: Do not add length to print format in synthetic events
  tracing: Hide get_vm_area() from MMUless builds
2025-04-12 15:37:40 -07:00
Nam Cao
8d7861ac50 rv: Fix out-of-bound memory access in rv_is_container_monitor()
When rv_is_container_monitor() is called on the last monitor in
rv_monitors_list, KASAN yells:

  BUG: KASAN: global-out-of-bounds in rv_is_container_monitor+0x101/0x110
  Read of size 8 at addr ffffffff97c7c798 by task setup/221

  The buggy address belongs to the variable:
   rv_monitors_list+0x18/0x40

This is due to list_next_entry() is called on the last entry in the list.
It wraps around to the first list_head, and the first list_head is not
embedded in struct rv_monitor_def.

Fix it by checking if the monitor is last in the list.

Cc: stable@vger.kernel.org
Cc: Gabriele Monaco <gmonaco@redhat.com>
Fixes: cb85c660fc ("rv: Add option for nested monitors and include sched")
Link: https://lore.kernel.org/e85b5eeb7228bfc23b8d7d4ab5411472c54ae91b.1744355018.git.namcao@linutronix.de
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-12 12:13:30 -04:00
Steven Rostedt
485acd207d ftrace: Do not have print_graph_retval() add a newline
The retval and retaddr options for function_graph tracer will add a
comment at the end of a function for both leaf and non leaf functions that
looks like:

               __wake_up_common(); /* ret=0x1 */

               } /* pick_next_task_fair ret=0x0 */

The function print_graph_retval() adds a newline after the "*/". But if
that's not called, the caller function needs to make sure there's a
newline added.

This is confusing and when the function parameters code was added, it
added a newline even when calling print_graph_retval() as the fact that
the print_graph_retval() function prints a newline isn't obvious.

This caused an extra newline to be printed and that made it fail the
selftests when the retval option was set, as the selftests were not
expecting blank lines being injected into the trace.

Instead of having print_graph_retval() print a newline, just have the
caller always print the newline regardless if it calls print_graph_retval()
or not. This not only fixes this bug, but it also simplifies the code.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250411133015.015ca393@gandalf.local.home
Reported-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Closes: https://lore.kernel.org/all/ccc40f2b-4b9e-4abd-8daf-d22fce2a86f0@sirena.org.uk/
Fixes: ff5c9c576e ("ftrace: Add support for function argument to graph tracer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-12 12:13:30 -04:00
Steven Rostedt
0ae6b8ce20 ftrace: Fix accounting of subop hashes
The function graph infrastructure uses ftrace to hook to functions. It has
a single ftrace_ops to manage all the users of function graph. Each
individual user (tracing, bpf, fprobes, etc) has its own ftrace_ops to
track the functions it will have its callback called from. These
ftrace_ops are "subops" to the main ftrace_ops of the function graph
infrastructure.

Each ftrace_ops has a filter_hash and a notrace_hash that is defined as:

  Only trace functions that are in the filter_hash but not in the
  notrace_hash.

If the filter_hash is empty, it means to trace all functions.
If the notrace_hash is empty, it means do not disable any function.

The function graph main ftrace_ops needs to be a superset containing all
the functions to be traced by all the subops it has. The algorithm to
perform this merge was incorrect.

When the first subops was added to the main ops, it simply made the main
ops a copy of the subops (same filter_hash and notrace_hash).

When a second ops was added, it joined the new subops filter_hash with the
main ops filter_hash as a union of the two sets. The intersect between the
new subops notrace_hash and the main ops notrace_hash was created as the
new notrace_hash of the main ops.

The issue here is that it would then start tracing functions than no
subops were tracing. For example if you had two subops that had:

subops 1:

  filter_hash = '*sched*' # trace all functions with "sched" in it
  notrace_hash = '*time*' # except do not trace functions with "time"

subops 2:

  filter_hash = '*lock*' # trace all functions with "lock" in it
  notrace_hash = '*clock*' # except do not trace functions with "clock"

The intersect of '*time*' functions with '*clock*' functions could be the
empty set. That means the main ops will be tracing all functions with
'*time*' and all "*clock*" in it!

Instead, modify the algorithm to be a bit simpler and correct.

First, when adding a new subops, even if it's the first one, do not add
the notrace_hash if the filter_hash is not empty. Instead, just add the
functions that are in the filter_hash of the subops but not in the
notrace_hash of the subops into the main ops filter_hash. There's no
reason to add anything to the main ops notrace_hash.

The notrace_hash of the main ops should only be non empty iff all subops
filter_hashes are empty (meaning to trace all functions) and all subops
notrace_hashes include the same functions.

That is, the main ops notrace_hash is empty if any subops filter_hash is
non empty.

The main ops notrace_hash only has content in it if all subops
filter_hashes are empty, and the content are only functions that intersect
all the subops notrace_hashes. If any subops notrace_hash is empty, then
so is the main ops notrace_hash.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Andy Chiu <andybnac@gmail.com>
Link: https://lore.kernel.org/20250409152720.216356767@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-11 16:02:08 -04:00
Andy Chiu
04a80a34c2 ftrace: Properly merge notrace hashes
The global notrace hash should be jointly decided by the intersection of
each subops's notrace hash, but not the filter hash.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250408160258.48563-1-andybnac@gmail.com
Fixes: 5fccc7552c ("ftrace: Add subops logic to allow one ops to manage many")
Signed-off-by: Andy Chiu <andybnac@gmail.com>
[ fixed removing of freeing of filter_hash ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-11 15:14:54 -04:00
Tao Chen
a76116f422 bpf: Check link_create.flags parameter for multi_uprobe
The link_create.flags are currently not used for multi-uprobes, so return
-EINVAL if it is set, same as for other attach APIs.

We allow target_fd to have an arbitrary value for multi-uprobe, though,
as there are existing users (libbpf) relying on this.

Fixes: 89ae89f53d ("bpf: Add multi uprobe link")
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250407035752.1108927-2-chen.dylane@linux.dev
2025-04-09 16:28:51 -07:00
Tao Chen
243911982a bpf: Check link_create.flags parameter for multi_kprobe
The link_create.flags are currently not used for multi-kprobes, so return
-EINVAL if it is set, same as for other attach APIs.

We allow target_fd, on the other hand, to have an arbitrary value for
multi-kprobe, as there are existing users (libbpf) relying on this.

Fixes: 0dcac27254 ("bpf: Add multi kprobe link")
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250407035752.1108927-1-chen.dylane@linux.dev
2025-04-09 16:28:22 -07:00
Steven Rostedt
e1a453a57b tracing: Do not add length to print format in synthetic events
The following causes a vsnprintf fault:

  # echo 's:wake_lat char[] wakee; u64 delta;' >> /sys/kernel/tracing/dynamic_events
  # echo 'hist:keys=pid:ts=common_timestamp.usecs if !(common_flags & 0x18)' > /sys/kernel/tracing/events/sched/sched_waking/trigger
  # echo 'hist:keys=next_pid:delta=common_timestamp.usecs-$ts:onmatch(sched.sched_waking).trace(wake_lat,next_comm,$delta)' > /sys/kernel/tracing/events/sched/sched_switch/trigger

Because the synthetic event's "wakee" field is created as a dynamic string
(even though the string copied is not). The print format to print the
dynamic string changed from "%*s" to "%s" because another location
(__set_synth_event_print_fmt()) exported this to user space, and user
space did not need that. But it is still used in print_synth_event(), and
the output looks like:

          <idle>-0       [001] d..5.   193.428167: wake_lat: wakee=(efault)sshd-sessiondelta=155
    sshd-session-879     [001] d..5.   193.811080: wake_lat: wakee=(efault)kworker/u34:5delta=58
          <idle>-0       [002] d..5.   193.811198: wake_lat: wakee=(efault)bashdelta=91
            bash-880     [002] d..5.   193.811371: wake_lat: wakee=(efault)kworker/u35:2delta=21
          <idle>-0       [001] d..5.   193.811516: wake_lat: wakee=(efault)sshd-sessiondelta=129
    sshd-session-879     [001] d..5.   193.967576: wake_lat: wakee=(efault)kworker/u34:5delta=50

The length isn't needed as the string is always nul terminated. Just print
the string and not add the length (which was hard coded to the max string
length anyway).

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/20250407154139.69955768@gandalf.local.home
Fixes: 4d38328eb4 ("tracing: Fix synth event printk format for str fields");
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-09 11:34:21 -04:00
Joel Granados
67049b53e0 stack_tracer: move sysctl registration to kernel/trace/trace_stack.c
Move stack_tracer_enabled into trace_stack_sysctl_table. This is part of
a greater effort to move ctl tables into their respective subsystems
which will reduce the merge conflicts in kernel/sysctl.c.

Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-04-09 13:32:16 +02:00
Joel Granados
dd293df639 tracing: Move trace sysctls into trace.c
Move trace ctl tables into their own const array in
kernel/trace/trace.c. The sysctl table register is called with
subsys_initcall placing if after its original place in proc_root_init.
This is part of a greater effort to move ctl tables into their
respective subsystems which will reduce the merge conflicts in
kernel/sysctl.c.

Signed-off-by: Joel Granados <joel.granados@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-09 13:32:16 +02:00
Linus Torvalds
bec7dcbc24 Probes fixes for v6.14:
- fprobe: Fix to remove fprobe_hlist_node when module unloading
 
   When a fprobe target module is removed, the fprobe_hlist_node
   should be removed from the fprobe's hash table to prevent reusing
   accidentally if another module is loaded at the same address.
 
 - fprobe: Fix to lock module while registering fprobe
 
  The module containing the function to be probeed is locked using a
   reference counter until the fprobe registration is complete, which
   prevents use after free.
 
 - fprobe-events: Fix possible UAF on modules
 
   Basically as same as above, but in the fprobe-events layer we also
   need to get module reference counter when we find the tracepoint
   in the module.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmf0kJ8bHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b+bEIALiFuYBn2y26OJnfaRnW
 rgSC2JupswEVg7HwsN5kA/x1ypXl9SPfJGjbiHLUTq9+4KGOBTmY+k5/OpVO+Qkh
 3nYKOkZxKRTglA7hRSTH0rxDV1eobps4nv/xkPjprugcjCGU54+4yb9Hq7Kyflpa
 o8p+VS/0VOJ9f3Iy9a9JRfu9qE7Qzz9USCj4N64WMgx/qczPe27twqFEaUpTf1VW
 Sw9twtKnqGs9hNE2QmhlzUBuq6gOZMXkjH6t1U4pMWBGB51JqZ5ZBhC4kL/5XEIZ
 bEau82El5qdieQC2B7c0RxldceKa4t4QUlJDalZGKpxvTXrCw9rFyv0dRe2cXnKm
 Yo0=
 =I+MO
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fixes from Masami Hiramatsu:

 - fprobe: remove fprobe_hlist_node when module unloading

   When a fprobe target module is removed, the fprobe_hlist_node should
   be removed from the fprobe's hash table to prevent reusing
   accidentally if another module is loaded at the same address.

 - fprobe: lock module while registering fprobe

   The module containing the function to be probeed is locked using a
   reference counter until the fprobe registration is complete, which
   prevents use after free.

 - fprobe-events: fix possible UAF on modules

   Basically as same as above, but in the fprobe-events layer we also
   need to get module reference counter when we find the tracepoint in
   the module.

* tag 'probes-fixes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: fprobe: Cleanup fprobe hash when module unloading
  tracing: fprobe events: Fix possible UAF on modules
  tracing: fprobe: Fix to lock module while registering fprobe
2025-04-08 12:51:34 -07:00
Masami Hiramatsu (Google)
a3dc2983ca tracing: fprobe: Cleanup fprobe hash when module unloading
Cleanup fprobe address hash table on module unloading because the
target symbols will be disappeared when unloading module and not
sure the same symbol is mapped on the same address.

Note that this is at least disables the fprobes if a part of target
symbols on the unloaded modules. Unlike kprobes, fprobe does not
re-enable the probe point by itself. To do that, the caller should
take care register/unregister fprobe when loading/unloading modules.
This simplifies the fprobe state managememt related to the module
loading/unloading.

Link: https://lore.kernel.org/all/174343534473.843280.13988101014957210732.stgit@devnote2/

Fixes: 4346ba1604 ("fprobe: Rewrite fprobe on function-graph tracer")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-04-08 08:46:25 +09:00
Steven Rostedt
4808595a99 tracing: Hide get_vm_area() from MMUless builds
The function get_vm_area() is not defined for non-MMU builds and causes a
build error if it is used. Hide the map_pages() function around a:

 #ifdef CONFIG_MMU

to keep it from being compiled when CONFIG_MMU is not set.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250407120111.2ccc9319@gandalf.local.home
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Closes: https://lore.kernel.org/all/4f8ece8b-8862-4f7c-8ede-febd28f8a9fe@roeck-us.net/
Fixes: 394f3f02de ("tracing: Use vmap_page_range() to map memmap ring buffer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-07 12:02:11 -04:00
Linus Torvalds
6cb0bd94c0 Persistent buffer cleanups and simplifications for v6.15:
It was mistaken that the physical memory returned from "reserve_mem" had to
 be vmap()'d to get to it from a virtual address. But reserve_mem already
 maps the memory to the virtual address of the kernel so a simple
 phys_to_virt() can be used to get to the virtual address from the physical
 memory returned by "reserve_mem". With this new found knowledge, the
 code can be cleaned up and simplified.
 
 - Enforce that the persistent memory is page aligned
 
   As the buffers using the persistent memory are all going to be
   mapped via pages, make sure that the memory given to the tracing
   infrastructure is page aligned. If it is not, it will print a warning
   and fail to map the buffer.
 
 - Use phys_to_virt() to get the virtual address from reserve_mem
 
   Instead of calling vmap() on the physical memory returned from
   "reserve_mem", use phys_to_virt() instead.
 
   As the memory returned by "memmap" or any other means where a physical
   address is given to the tracing infrastructure, it still needs to
   be vmap(). Since this memory can never be returned back to the buddy
   allocator nor should it ever be memmory mapped to user space, flag
   this buffer and up the ref count. The ref count will keep it from
   ever being freed, and the flag will prevent it from ever being memory
   mapped to user space.
 
 - Use vmap_page_range() for memmap virtual address mapping
 
   For the memmap buffer, instead of allocating an array of struct pages,
   assigning them to the contiguous phsycial memory and then passing that to
   vmap(), use vmap_page_range() instead
 
 - Replace flush_dcache_folio() with flush_kernel_vmap_range()
 
   Instead of calling virt_to_folio() and passing that to
   flush_dcache_folio(), just call flush_kernel_vmap_range() directly.
   This also fixes a bug where if a subbuffer was bigger than PAGE_SIZE
   only the PAGE_SIZE portion would be flushed.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ+6oZRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qhq6AP481KHAgaowQCg7zrKPkMlbYBIigYoU
 7aqoAg2rSLBRSQEAl8fViHZgZ9Q+O7xdozQWiIR7/KQW8VIaTcP/V7cHkAU=
 =+5JB
 -----END PGP SIGNATURE-----

Merge tag 'trace-ringbuffer-v6.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ring-buffer updates from Steven Rostedt:
 "Persistent buffer cleanups and simplifications.

  It was mistaken that the physical memory returned from "reserve_mem"
  had to be vmap()'d to get to it from a virtual address. But
  reserve_mem already maps the memory to the virtual address of the
  kernel so a simple phys_to_virt() can be used to get to the virtual
  address from the physical memory returned by "reserve_mem". With this
  new found knowledge, the code can be cleaned up and simplified.

   - Enforce that the persistent memory is page aligned

     As the buffers using the persistent memory are all going to be
     mapped via pages, make sure that the memory given to the tracing
     infrastructure is page aligned. If it is not, it will print a
     warning and fail to map the buffer.

   - Use phys_to_virt() to get the virtual address from reserve_mem

     Instead of calling vmap() on the physical memory returned from
     "reserve_mem", use phys_to_virt() instead.

     As the memory returned by "memmap" or any other means where a
     physical address is given to the tracing infrastructure, it still
     needs to be vmap(). Since this memory can never be returned back to
     the buddy allocator nor should it ever be memmory mapped to user
     space, flag this buffer and up the ref count. The ref count will
     keep it from ever being freed, and the flag will prevent it from
     ever being memory mapped to user space.

   - Use vmap_page_range() for memmap virtual address mapping

     For the memmap buffer, instead of allocating an array of struct
     pages, assigning them to the contiguous phsycial memory and then
     passing that to vmap(), use vmap_page_range() instead

   - Replace flush_dcache_folio() with flush_kernel_vmap_range()

     Instead of calling virt_to_folio() and passing that to
     flush_dcache_folio(), just call flush_kernel_vmap_range() directly.
     This also fixes a bug where if a subbuffer was bigger than
     PAGE_SIZE only the PAGE_SIZE portion would be flushed"

* tag 'trace-ringbuffer-v6.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Use flush_kernel_vmap_range() over flush_dcache_folio()
  tracing: Use vmap_page_range() to map memmap ring buffer
  tracing: Have reserve_mem use phys_to_virt() and separate from memmap buffer
  tracing: Enforce the persistent ring buffer to be page aligned
2025-04-03 16:09:29 -07:00
Linus Torvalds
41677970ad tracing fixes for 6.15
- Fix build error when CONFIG_PROBE_EVENTS_BTF_ARGS is not enabled
 
   The tracing of arguments in the function tracer depends on some
   functions that are only defined when PROBE_EVENTS_BTF_ARGS is enabled.
   In fact, PROBE_EVENTS_BTF_ARGS also depends on all the same configs
   as the function argument tracing requires. Just have the function
   argument tracing depend on PROBE_EVENTS_BTF_ARGS.
 
 - Free module_delta for persistent ring buffer instance
 
   When an instance holds the persistent ring buffer, it allocates
   a helper array to hold the deltas between where modules are loaded
   on the last boot and the current boot. This array needs to be freed
   when the instance is freed.
 
 - Add cond_resched() to loop in ftrace_graph_set_hash()
 
   The hash functions in ftrace loop over every function that can be
   enabled by ftrace. This can be 50,000 functions or more. This
   loop is known to trigger soft lockup warnings and requires a
   cond_resched(). The loop in ftrace_graph_set_hash() was missing it.
 
 - Fix the event format verifier to include "%*p.." arguments
 
   To prevent events from dereferencing stale pointers that can
   happen if a trace event uses a dereferece pointer to something
   that was not copied into the ring buffer and can be freed by the
   time the trace is read, a verifier is called. At boot or module
   load, the verifier scans the print format string for pointers
   that can be dereferenced and it checks the arguments to make sure
   they do not contain something that can be freed. The "%*p" was
   not handled, which would add another argument and cause the verifier
   to not only not verify this pointer, but it will look at the wrong
   argument for every pointer after that.
 
 - Fix mcount sorttable building for different endian type target
 
   When modifying the ELF file to sort the mcount_loc table in the
   sorttable.c code, the endianess of the file and the host is used
   to determine if the bytes need to be swapped when calculations are
   done. A change was made to the sorting of the mcount_loc that read
   the values from the ELF file into an array and the swap happened
   on the filling of the array. But one of the calculations of the
   array still did the swap when it did not need to. This caused building
   on a little endian machine for a big endian target to not find
   the mcount function in the 'nm' table and it zeroed it out, causing
   there to be no functions available to trace.
 
 - Add goto out_unlock jump to rv_register_monitor() on error path
 
   One of the error paths in rv_register_monitor() just returned the
   error when it should have jumped to the out_unlock label to release
   the mutex.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ+2tyBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qjPYAPwJDti6nHTqheFwIa1WzJ3yC2tKRYKt
 1E5PYW/2Ct5NmwEAqgg3TvJppXHymVdutLghhGFnlBnyTWMI+KIhparSBw8=
 =NFM5
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix build error when CONFIG_PROBE_EVENTS_BTF_ARGS is not enabled

   The tracing of arguments in the function tracer depends on some
   functions that are only defined when PROBE_EVENTS_BTF_ARGS is
   enabled. In fact, PROBE_EVENTS_BTF_ARGS also depends on all the same
   configs as the function argument tracing requires. Just have the
   function argument tracing depend on PROBE_EVENTS_BTF_ARGS.

 - Free module_delta for persistent ring buffer instance

   When an instance holds the persistent ring buffer, it allocates a
   helper array to hold the deltas between where modules are loaded on
   the last boot and the current boot. This array needs to be freed when
   the instance is freed.

 - Add cond_resched() to loop in ftrace_graph_set_hash()

   The hash functions in ftrace loop over every function that can be
   enabled by ftrace. This can be 50,000 functions or more. This loop is
   known to trigger soft lockup warnings and requires a cond_resched().
   The loop in ftrace_graph_set_hash() was missing it.

 - Fix the event format verifier to include "%*p.." arguments

   To prevent events from dereferencing stale pointers that can happen
   if a trace event uses a dereferece pointer to something that was not
   copied into the ring buffer and can be freed by the time the trace is
   read, a verifier is called. At boot or module load, the verifier
   scans the print format string for pointers that can be dereferenced
   and it checks the arguments to make sure they do not contain
   something that can be freed. The "%*p" was not handled, which would
   add another argument and cause the verifier to not only not verify
   this pointer, but it will look at the wrong argument for every
   pointer after that.

 - Fix mcount sorttable building for different endian type target

   When modifying the ELF file to sort the mcount_loc table in the
   sorttable.c code, the endianess of the file and the host is used to
   determine if the bytes need to be swapped when calculations are done.
   A change was made to the sorting of the mcount_loc that read the
   values from the ELF file into an array and the swap happened on the
   filling of the array. But one of the calculations of the array still
   did the swap when it did not need to. This caused building on a
   little endian machine for a big endian target to not find the mcount
   function in the 'nm' table and it zeroed it out, causing there to be
   no functions available to trace.

 - Add goto out_unlock jump to rv_register_monitor() on error path

   One of the error paths in rv_register_monitor() just returned the
   error when it should have jumped to the out_unlock label to release
   the mutex.

* tag 'trace-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rv: Fix missing unlock on double nested monitors return path
  scripts/sorttable: Fix endianness handling in build-time mcount sort
  tracing: Verify event formats that have "%*p.."
  ftrace: Add cond_resched() to ftrace_graph_set_hash()
  tracing: Free module_delta on freeing of persistent ring buffer
  ftrace: Have tracing function args depend on PROBE_EVENTS_BTF_ARGS
2025-04-03 09:52:44 -07:00
Linus Torvalds
af54a3a151 more printk changes for 6.15
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmftIx0ACgkQUqAMR0iA
 lPJKUxAArJaWC7EXtDtd8u8Rl/CYpIEaMdPd7V+XA5sqdUyjkSJI+jRswonpOsNX
 Zn9pbGMds1LNBXm1NO9039+2TrPSJFTCGK6OgeCJC17/O31wnnm0LhZiU+JElgfi
 iQI5fdTnc3sB37bsjkvEUr9HFizRxY2fHHMWZ8ngiLfkKfki4ET+1u/yf7CraRk1
 6+LK9mM/WyytP6gYaSlL5YYVYs9fNcR/ND6IQgpfIN15/fOAOXWbMB1jE2iDRzqt
 MQUD4+DTYQYmeS6jQ4ToZdx3Ql9NwcP2nJnA5fxXeqPFHc/SgRS6KqOPQgQUD4tV
 N4q6ozLPlzDFeHVHMhPz/PzlSEn0zC1ZX87xXCUAilnkJpbEujcPxf44R/3RHu3d
 y7kmCRj0RwgHpLIwzLH5POrF4il9/wVlyZFRaYBPMkj09l0WBwYvfMhlnzvAtCP8
 pRKqHkjJ1FOWQFJyn98ONqcCmm2pZ8XKW2enikAhISVXcptI/1lIQ6IIpRdTjte1
 r60CbiJ7UFL+TrVqsWBuqWQRi5u5HykPkZiCL/YYXzZmrl3zLO+0ti9YzEU8Yrzd
 K1VAB/1aK/MDrTgOI+VaqlPq79uJBwtbrflgFhFBKAKsqTpBcsZUv9/1KHthnqXV
 Y84SsY2XpoGtjn58mU6eEc+8lLTOTDVXs+ZZL4/M3maW7ygNiYY=
 =Biv4
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull more printk updates from Petr Mladek:

 - Silence warnings about candidates for ‘gnu_print’ format attribute

* tag 'printk-for-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  vsnprintf: Silence false positive GCC warning for va_format()
  vsnprintf: Drop unused const char fmt * in va_format()
  vsnprintf: Mark binary printing functions with __printf() attribute
  tracing: Mark binary printing functions with __printf() attribute
  seq_file: Mark binary printing functions with __printf() attribute
  seq_buf: Mark binary printing functions with __printf() attribute
2025-04-02 10:05:55 -07:00
Steven Rostedt
e4d4b8670c ring-buffer: Use flush_kernel_vmap_range() over flush_dcache_folio()
Some architectures do not have data cache coherency between user and
kernel space. For these architectures, the cache needs to be flushed on
both the kernel and user addresses so that user space can see the updates
the kernel has made.

Instead of using flush_dcache_folio() and playing with virt_to_folio()
within the call to that function, use flush_kernel_vmap_range() which
takes the virtual address and does the work for those architectures that
need it.

This also fixes a bug where the flush of the reader page only flushed one
page. If the sub-buffer order is 1 or more, where the sub-buffer size
would be greater than a page, it would miss the rest of the sub-buffer
content, as the "reader page" is not just a page, but the size of a
sub-buffer.

Link: https://lore.kernel.org/all/CAG48ez3w0my4Rwttbc5tEbNsme6tc0mrSN95thjXUFaJ3aQ6SA@mail.gmail.com/

Cc: stable@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Link: https://lore.kernel.org/20250402144953.920792197@goodmis.org
Fixes: 117c39200d ("ring-buffer: Introducing ring-buffer mapping functions");
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-02 11:02:27 -04:00
Steven Rostedt
394f3f02de tracing: Use vmap_page_range() to map memmap ring buffer
The code to map the physical memory retrieved by memmap currently
allocates an array of pages to cover the physical memory and then calls
vmap() to map it to a virtual address. Instead of using this temporary
array of struct page descriptors, simply use vmap_page_range() that can
directly map the contiguous physical memory to a virtual address.

Link: https://lore.kernel.org/all/CAHk-=whUOfVucfJRt7E0AH+GV41ELmS4wJqxHDnui6Giddfkzw@mail.gmail.com/

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/20250402144953.754618481@goodmis.org
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-02 11:02:27 -04:00
Steven Rostedt
34ea8fa084 tracing: Have reserve_mem use phys_to_virt() and separate from memmap buffer
The reserve_mem kernel command line option may pass back a physical
address, but the memory is still part of the normal memory just like
using memblock_alloc() would be. This means that the physical memory
returned by the reserve_mem command line option can be converted directly
to virtual memory by simply using phys_to_virt().

When freeing the buffer there's no need to call vunmap() anymore as the
memory allocated by reserve_mem is freed by the call to
reserve_mem_release_by_name().

Because the persistent ring buffer can also be allocated via the memmap
option, which *is* different than normal memory as it cannot be added back
to the buddy system, it must be treated differently. It still needs to be
virtually mapped to have access to it. It also can not be freed nor can it
ever be memory mapped to user space.

Create a new trace_array flag called TRACE_ARRAY_FL_MEMMAP which gets set
if the buffer is created by the memmap option, and this will prevent the
buffer from being memory mapped by user space.

Also increment the ref count for memmap'ed buffers so that they can never
be freed.

Link: https://lore.kernel.org/all/Z-wFszhJ_9o4dc8O@kernel.org/

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/20250402144953.583750106@goodmis.org
Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-02 11:02:26 -04:00
Steven Rostedt
c44a14f216 tracing: Enforce the persistent ring buffer to be page aligned
Enforce that the address and the size of the memory used by the persistent
ring buffer is page aligned. Also update the documentation to reflect this
requirement.

Link: https://lore.kernel.org/all/CAHk-=whUOfVucfJRt7E0AH+GV41ELmS4wJqxHDnui6Giddfkzw@mail.gmail.com/

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/20250402144953.412882844@goodmis.org
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-02 11:02:26 -04:00
Masami Hiramatsu (Google)
dd941507a9 tracing: fprobe events: Fix possible UAF on modules
Commit ac91052f0a ("tracing: tprobe-events: Fix leakage of module
refcount") moved try_module_get() from __find_tracepoint_module_cb()
to find_tracepoint() caller, but that introduced a possible UAF
because the module can be unloaded before try_module_get(). In this
case, the module object should be freed too. Thus, try_module_get()
does not only fail but may access to the freed object.

To avoid that, try_module_get() in __find_tracepoint_module_cb()
again.

Link: https://lore.kernel.org/all/174342990779.781946.9138388479067729366.stgit@devnote2/

Fixes: ac91052f0a ("tracing: tprobe-events: Fix leakage of module refcount")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-04-02 23:18:58 +09:00
Masami Hiramatsu (Google)
d24fa977ee tracing: fprobe: Fix to lock module while registering fprobe
Since register_fprobe() does not get the module reference count while
registering fgraph filter, if the target functions (symbols) are in
modules, those modules can be unloaded when registering fprobe to
fgraph.

To avoid this issue, get the reference counter of module for each
symbol, and put it after register the fprobe.

Link: https://lore.kernel.org/all/174330568792.459674.16874380163991113156.stgit@devnote2/

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Closes: https://lore.kernel.org/all/20250325130628.3a9e234c@gandalf.local.home/
Fixes: 4346ba1604 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-04-02 23:18:57 +09:00
Gabriele Monaco
fc0585c7fa rv: Fix missing unlock on double nested monitors return path
RV doesn't support nested monitors having children monitors themselves
and exits with the EINVAL code. However, it returns without unlocking
the rv_interface_lock.

Unlock the lock before returning from the initialisation function.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/20250402071351.19864-2-gmonaco@redhat.com
Fixes: cb85c660fc ("rv: Add option for nested monitors and include sched")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Closes: https://lore.kernel.org/r/202503310200.UBXGitB4-lkp@intel.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-02 09:51:26 -04:00
Steven Rostedt
ea8d7647f9 tracing: Verify event formats that have "%*p.."
The trace event verifier checks the formats of trace events to make sure
that they do not point at memory that is not in the trace event itself or
in data that will never be freed. If an event references data that was
allocated when the event triggered and that same data is freed before the
event is read, then the kernel can crash by reading freed memory.

The verifier runs at boot up (or module load) and scans the print formats
of the events and checks their arguments to make sure that dereferenced
pointers are safe. If the format uses "%*p.." the verifier will ignore it,
and that could be dangerous. Cover this case as well.

Also add to the sample code a use case of "%*pbl".

Link: https://lore.kernel.org/all/bcba4d76-2c3f-4d11-baf0-02905db953dd@oracle.com/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 5013f454a3 ("tracing: Add check of trace event print fmts for dereferencing pointers")
Link: https://lore.kernel.org/20250327195311.2d89ec66@gandalf.local.home
Reported-by: Libo Chen <libo.chen@oracle.com>
Reviewed-by: Libo Chen <libo.chen@oracle.com>
Tested-by: Libo Chen <libo.chen@oracle.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-02 09:51:26 -04:00
zhoumin
42ea22e754 ftrace: Add cond_resched() to ftrace_graph_set_hash()
When the kernel contains a large number of functions that can be traced,
the loop in ftrace_graph_set_hash() may take a lot of time to execute.
This may trigger the softlockup watchdog.

Add cond_resched() within the loop to allow the kernel to remain
responsive even when processing a large number of functions.

This matches the cond_resched() that is used in other locations of the
code that iterates over all functions that can be traced.

Cc: stable@vger.kernel.org
Fixes: b9b0c831be ("ftrace: Convert graph filter to use hash tables")
Link: https://lore.kernel.org/tencent_3E06CE338692017B5809534B9C5C03DA7705@qq.com
Signed-off-by: zhoumin <teczm@foxmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-02 09:51:25 -04:00
Steven Rostedt
2c9ee74a6d tracing: Free module_delta on freeing of persistent ring buffer
If a persistent ring buffer is created, a "module_delta" array is also
allocated to hold the module deltas of loaded modules that match modules
in the scratch area. If this buffer gets freed, the module_delta array is
not freed and causes a memory leak.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250401124525.1f9ac02a@gandalf.local.home
Fixes: 35a380ddbc ("tracing: Show last module text symbols in the stacktrace")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-02 09:51:25 -04:00
Steven Rostedt
b81ff11c21 ftrace: Have tracing function args depend on PROBE_EVENTS_BTF_ARGS
The option PROBE_EVENTS_BTF_ARGS enables the functions
btf_find_func_proto() and btf_get_func_param() which are used by the
function argument tracing code. The option FUNCTION_TRACE_ARGS was
dependent on the same configs that PROBE_EVENTS_BTF_ARGS was dependent on,
but it was also dependent on PROBE_EVENTS_BTF_ARGS. In fact, if
PROBE_EVENTS_BTF_ARGS is supported then FUNCTION_TRACE_ARGS is supported.

Just make FUNCTION_TRACE_ARGS depend on PROBE_EVENTS_BTF_ARGS.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20250401113601.17fa1129@gandalf.local.home
Fixes: 533c20b062 ("ftrace: Add print_function_args()")
Closes: https://lore.kernel.org/all/DB9PR08MB75820599801BAD118D123D7D93AD2@DB9PR08MB7582.eurprd08.prod.outlook.com/
Reported-by: Christian Loehle <Christian.Loehle@arm.com>
Tested-by: Christian Loehle <Christian.Loehle@arm.com>
Tested-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-02 09:50:56 -04:00
Linus Torvalds
46d29f23a7 ring-buffer updates for v6.15
- Restructure the persistent memory to have a "scratch" area
 
   Instead of hard coding the KASLR offset in the persistent memory
   by the ring buffer, push that work up to the callers of the persistent
   memory as they are the ones that need this information. The offsets
   and such is not important to the ring buffer logic and it should
   not be part of that.
 
   A scratch pad is now created when the caller allocates a ring buffer
   from persistent memory by stating how much memory it needs to save.
 
 - Allow where modules are loaded to be saved in the new scratch pad
 
   Save the addresses of modules when they are loaded into the persistent
   memory scratch pad.
 
 - A new module_for_each_mod() helper function was created
 
   With the acknowledgement of the module maintainers a new module helper
   function was created to iterate over all the currently loaded modules.
   This has a callback to be called for each module. This is needed for
   when tracing is started in the persistent buffer and the currently loaded
   modules need to be saved in the scratch area.
 
 - Expose the last boot information where the kernel and modules were loaded
 
   The last_boot_info file is updated to print out the addresses of where
   the kernel "_text" location was loaded from a previous boot, as well
   as where the modules are loaded. If the buffer is recording the current
   boot, it only prints "# Current" so that it does not expose the KASLR
   offset of the currently running kernel.
 
 - Allow the persistent ring buffer to be released (freed)
 
   To have this in production environments, where the kernel command line can
   not be changed easily, the ring buffer needs to be freed when it is not
   going to be used. The memory for the buffer will always be allocated at
   boot up, but if the system isn't going to enable tracing, the memory needs
   to be freed. Allow it to be freed and added back to the kernel memory
   pool.
 
 - Allow stack traces to print the function names in the persistent buffer
 
   Now that the modules are saved in the persistent ring buffer, if the same
   modules are loaded, the printing of the function names will examine the
   saved modules. If the module is found in the scratch area and is also
   loaded, then it will do the offset shift and use kallsyms to display the
   function name. If the address is not found, it simply displays the address
   from the previous boot in hex.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ+cUERQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qrAsAQCFt2nfzxoe3wtF5EqIT1VHp/8bQVjG
 gBe8B6ouboreogD/dS7yK8MRy24ZAmObGwYG0RbVicd50S7P8Rf7+823ng8=
 =OJKk
 -----END PGP SIGNATURE-----

Merge tag 'trace-ringbuffer-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ring-buffer updates from Steven Rostedt:

 - Restructure the persistent memory to have a "scratch" area

   Instead of hard coding the KASLR offset in the persistent memory by
   the ring buffer, push that work up to the callers of the persistent
   memory as they are the ones that need this information. The offsets
   and such is not important to the ring buffer logic and it should not
   be part of that.

   A scratch pad is now created when the caller allocates a ring buffer
   from persistent memory by stating how much memory it needs to save.

 - Allow where modules are loaded to be saved in the new scratch pad

   Save the addresses of modules when they are loaded into the
   persistent memory scratch pad.

 - A new module_for_each_mod() helper function was created

   With the acknowledgement of the module maintainers a new module
   helper function was created to iterate over all the currently loaded
   modules. This has a callback to be called for each module. This is
   needed for when tracing is started in the persistent buffer and the
   currently loaded modules need to be saved in the scratch area.

 - Expose the last boot information where the kernel and modules were
   loaded

   The last_boot_info file is updated to print out the addresses of
   where the kernel "_text" location was loaded from a previous boot, as
   well as where the modules are loaded. If the buffer is recording the
   current boot, it only prints "# Current" so that it does not expose
   the KASLR offset of the currently running kernel.

 - Allow the persistent ring buffer to be released (freed)

   To have this in production environments, where the kernel command
   line can not be changed easily, the ring buffer needs to be freed
   when it is not going to be used. The memory for the buffer will
   always be allocated at boot up, but if the system isn't going to
   enable tracing, the memory needs to be freed. Allow it to be freed
   and added back to the kernel memory pool.

 - Allow stack traces to print the function names in the persistent
   buffer

   Now that the modules are saved in the persistent ring buffer, if the
   same modules are loaded, the printing of the function names will
   examine the saved modules. If the module is found in the scratch area
   and is also loaded, then it will do the offset shift and use kallsyms
   to display the function name. If the address is not found, it simply
   displays the address from the previous boot in hex.

* tag 'trace-ringbuffer-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Use _text and the kernel offset in last_boot_info
  tracing: Show last module text symbols in the stacktrace
  ring-buffer: Remove the unused variable bmeta
  tracing: Skip update_last_data() if cleared and remove active check for save_mod()
  tracing: Initialize scratch_size to zero to prevent UB
  tracing: Fix a compilation error without CONFIG_MODULES
  tracing: Freeable reserved ring buffer
  mm/memblock: Add reserved memory release function
  tracing: Update modules to persistent instances when loaded
  tracing: Show module names and addresses of last boot
  tracing: Have persistent trace instances save module addresses
  module: Add module_for_each_mod() function
  tracing: Have persistent trace instances save KASLR offset
  ring-buffer: Add ring_buffer_meta_scratch()
  ring-buffer: Add buffer meta data for persistent ring buffer
  ring-buffer: Use kaslr address instead of text delta
  ring-buffer: Fix bytes_dropped calculation issue
2025-03-31 13:37:22 -07:00
Linus Torvalds
01d5b167dc Modules changes for 6.15-rc1
- Use RCU instead of RCU-sched
 
   The mix of rcu_read_lock(), rcu_read_lock_sched() and preempt_disable()
   in the module code and its users has been replaced with just
   rcu_read_lock().
 
 - The rest of changes are smaller fixes and updates.
 
 The changes have been on linux-next for at least 2 weeks, with the RCU
 cleanup present for 2 months. One performance problem was reported with the
 RCU change when KASAN + lockdep were enabled, but it was effectively
 addressed by the already merged ee57ab5a32 ("locking/lockdep: Disable
 KASAN instrumentation of lockdep.c").
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEEIduBR9MnFA82q/jtumpXJwqY6poFAmfmwrsUHHBldHIucGF2
 bHVAc3VzZS5jb20ACgkQumpXJwqY6prWxgf/S7Pvdywm10vJ6fooYa+GxXNMwhyh
 XRjZ4m9gjeTNf2KLwX0XHv0XZeFHOmHfjd3iI+pS6CXZnCFTN9J3XPLYsrTxXUb6
 U6zzLf8Zsz8TzeI4dgvSBsZln7oICSACkAgdJCq23hpNKeaeRo91dgiZaIwyZJG3
 FekqSFtP7pYhfFoNkrFKysqbgl1+RWWZ79L2qRJA0bPzVFlvRUuh6cOHQw+8RMqf
 BYLwnArjTkW8AcXpxIXSiwphDHVZ81B96xoplavyoprA5FDpv1W+8y4DtxdWFn+1
 QVWCs/ZV3KrwXWpZev625w3fIOOIXILqRINOzLfvXTw+1xFS3TzSQEpVeg==
 =4OKc
 -----END PGP SIGNATURE-----

Merge tag 'modules-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux

Pull modules updates from Petr Pavlu:

 - Use RCU instead of RCU-sched

   The mix of rcu_read_lock(), rcu_read_lock_sched() and
   preempt_disable() in the module code and its users has
   been replaced with just rcu_read_lock()

 - The rest of changes are smaller fixes and updates

* tag 'modules-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux: (32 commits)
  MAINTAINERS: Update the MODULE SUPPORT section
  module: Remove unnecessary size argument when calling strscpy()
  module: Replace deprecated strncpy() with strscpy()
  params: Annotate struct module_param_attrs with __counted_by()
  bug: Use RCU instead RCU-sched to protect module_bug_list.
  static_call: Use RCU in all users of __module_text_address().
  kprobes: Use RCU in all users of __module_text_address().
  bpf: Use RCU in all users of __module_text_address().
  jump_label: Use RCU in all users of __module_text_address().
  jump_label: Use RCU in all users of __module_address().
  x86: Use RCU in all users of __module_address().
  cfi: Use RCU while invoking __module_address().
  powerpc/ftrace: Use RCU in all users of __module_text_address().
  LoongArch: ftrace: Use RCU in all users of __module_text_address().
  LoongArch/orc: Use RCU in all users of __module_address().
  arm64: module: Use RCU in all users of __module_text_address().
  ARM: module: Use RCU in all users of __module_text_address().
  module: Use RCU in all users of __module_text_address().
  module: Use RCU in all users of __module_address().
  module: Use RCU in search_module_extables().
  ...
2025-03-30 15:44:36 -07:00
Linus Torvalds
fa593d0f96 bpf-next-6.15
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmfi6ZAACgkQ6rmadz2v
 bTpLOg/+J7xUddPMhlpFAUlifQEadE5hmw6v1tXpM3zyKHzUWJiv/qsx3j8/ckgD
 D+d4P8bqIbI9SSuIS4oZ0+D9pr/g7GYztnoYZmPiYJ7v2AijPuof5dsagFQE8E2y
 rhfbt9KHTMzzkdkTvaAZaITS/HWAoJ2YVRB6gfLex2ghcXYHcgmtKRZniQrbBiFZ
 MIXBN8Rg6HP+pUdIVllSXFcQCb3XIgjPONRAos4hr5tIm+3Ku7Jvkgk2H/9vUcoF
 bdXAcg8xygyH7eY+1l3e7nEPQlG0jUZEsL+tq+vpdoLRLqlIpAUYmwUvqcmq4dPS
 QGFjiUcpDbXlxsUFpzjXHIFto7fXCfND7HEICQPwAncdflIIfYaATSQUfkEexn0a
 wBCFlAChrEzAmg2vFl4EeEr0fdSe/3jswrgKx0m6ctKieMjgloBUeeH4fXOpfkhS
 9tvhuduVFuronlebM8ew4w9T/mBgbyxkE5KkvP4hNeB3ni3N0K6Mary5/u2HyN1e
 lqTlnZxRA4p6lrvxce/mDrR4VSwlKLcSeQVjxAL1afD5KRkuZJnUv7bUhS361vkG
 IjNrQX30EisDAz+X7tMn3ndBf9vVatwFT4+c3yaxlQRor1WofhDfT88HPiyB4QqQ
 Kdx2EHgbQxJp4vkzhp4/OXlTfkihsMEn8egzZuphdPEQ9Y+Jdwg=
 =aN/V
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:
 "For this merge window we're splitting BPF pull request into three for
  higher visibility: main changes, res_spin_lock, try_alloc_pages.

  These are the main BPF changes:

   - Add DFA-based live registers analysis to improve verification of
     programs with loops (Eduard Zingerman)

   - Introduce load_acquire and store_release BPF instructions and add
     x86, arm64 JIT support (Peilin Ye)

   - Fix loop detection logic in the verifier (Eduard Zingerman)

   - Drop unnecesary lock in bpf_map_inc_not_zero() (Eric Dumazet)

   - Add kfunc for populating cpumask bits (Emil Tsalapatis)

   - Convert various shell based tests to selftests/bpf/test_progs
     format (Bastien Curutchet)

   - Allow passing referenced kptrs into struct_ops callbacks (Amery
     Hung)

   - Add a flag to LSM bpf hook to facilitate bpf program signing
     (Blaise Boscaccy)

   - Track arena arguments in kfuncs (Ihor Solodrai)

   - Add copy_remote_vm_str() helper for reading strings from remote VM
     and bpf_copy_from_user_task_str() kfunc (Jordan Rome)

   - Add support for timed may_goto instruction (Kumar Kartikeya
     Dwivedi)

   - Allow bpf_get_netns_cookie() int cgroup_skb programs (Mahe Tardy)

   - Reduce bpf_cgrp_storage_busy false positives when accessing cgroup
     local storage (Martin KaFai Lau)

   - Introduce bpf_dynptr_copy() kfunc (Mykyta Yatsenko)

   - Allow retrieving BTF data with BTF token (Mykyta Yatsenko)

   - Add BPF kfuncs to set and get xattrs with 'security.bpf.' prefix
     (Song Liu)

   - Reject attaching programs to noreturn functions (Yafang Shao)

   - Introduce pre-order traversal of cgroup bpf programs (Yonghong
     Song)"

* tag 'bpf-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (186 commits)
  selftests/bpf: Add selftests for load-acquire/store-release when register number is invalid
  bpf: Fix out-of-bounds read in check_atomic_load/store()
  libbpf: Add namespace for errstr making it libbpf_errstr
  bpf: Add struct_ops context information to struct bpf_prog_aux
  selftests/bpf: Sanitize pointer prior fclose()
  selftests/bpf: Migrate test_xdp_vlan.sh into test_progs
  selftests/bpf: test_xdp_vlan: Rename BPF sections
  bpf: clarify a misleading verifier error message
  selftests/bpf: Add selftest for attaching fexit to __noreturn functions
  bpf: Reject attaching fexit/fmod_ret to __noreturn functions
  bpf: Only fails the busy counter check in bpf_cgrp_storage_get if it creates storage
  bpf: Make perf_event_read_output accessible in all program types.
  bpftool: Using the right format specifiers
  bpftool: Add -Wformat-signedness flag to detect format errors
  selftests/bpf: Test freplace from user namespace
  libbpf: Pass BPF token from find_prog_btf_id to BPF_BTF_GET_FD_BY_ID
  bpf: Return prog btf_id without capable check
  bpf: BPF token support for BPF_BTF_GET_FD_BY_ID
  bpf, x86: Fix objtool warning for timed may_goto
  bpf: Check map->record at the beginning of check_and_free_fields()
  ...
2025-03-30 12:43:03 -07:00
Steven Rostedt
028a58ec15 tracing: Use _text and the kernel offset in last_boot_info
Instead of using kaslr_offset() just record the location of "_text". This
makes it possible for user space to use either the System.map or
/proc/kallsyms as what to map all addresses to functions with.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250326220304.38dbedcd@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-28 08:39:56 -04:00
Masami Hiramatsu (Google)
35a380ddbc tracing: Show last module text symbols in the stacktrace
Since the previous boot trace buffer can include module text address in
the stacktrace. As same as the kernel text address, convert the module
text address using the module address information.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/174282689201.356346.17647540360450727687.stgit@mhiramat.tok.corp.google.com
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-28 08:39:56 -04:00
Jiapeng Chong
de48d7fff7 ring-buffer: Remove the unused variable bmeta
Variable bmeta is not effectively used, so delete it.

kernel/trace/ring_buffer.c:1952:27: warning: variable ‘bmeta’ set but not used.

Link: https://lore.kernel.org/20250317015524.3902-1-jiapeng.chong@linux.alibaba.com
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=19524
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-28 08:39:56 -04:00
Masami Hiramatsu (Google)
486fbcb380 tracing: Skip update_last_data() if cleared and remove active check for save_mod()
If the last boot data is already cleared, there is no reason to update it
again. Skip if the TRACE_ARRAY_FL_LAST_BOOT is cleared.
Also, for calling save_mod() when module loading, we don't need to check
the trace is active or not because any module address can be on the
stacktrace.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/174165660328.1173316.15529357882704817499.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-28 08:39:56 -04:00
Steven Rostedt
5dbeb56bb9 tracing: Initialize scratch_size to zero to prevent UB
In allocate_trace_buffer() the following code:

  buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0,
				      tr->range_addr_start,
				      tr->range_addr_size,
				      struct_size(tscratch, entries, 128));

  tscratch = ring_buffer_meta_scratch(buf->buffer, &scratch_size);
  setup_trace_scratch(tr, tscratch, scratch_size);

Has undefined behavior if ring_buffer_alloc_range() fails because
"scratch_size" is not initialize. If the allocation fails, then
buf->buffer will be NULL. The ring_buffer_meta_scratch() will return
NULL immediately if it is passed a NULL buffer and it will not update
scratch_size. Then setup_trace_scratch() will return immediately if
tscratch is NULL.

Although there's no real issue here, but it is considered undefined
behavior to pass an uninitialized variable to a function as input, and
UBSan may complain about it.

Just initialize scratch_size to zero to make the code defined behavior and
a little more robust.

Link: https://lore.kernel.org/all/44c5deaa-b094-4852-90f9-52f3fb10e67a@stanley.mountain/

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-28 08:39:56 -04:00
Masami Hiramatsu (Google)
f00c9201f9 tracing: Fix a compilation error without CONFIG_MODULES
There are some code which depends on CONFIG_MODULES. #ifdef
to enclose it.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/174230515367.2909896.8132122175220657625.stgit@mhiramat.tok.corp.google.com
Fixes: dca91c1c5468 ("tracing: Have persistent trace instances save module addresses")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-28 08:39:56 -04:00
Masami Hiramatsu (Google)
fb6d03238e tracing: Freeable reserved ring buffer
Make the ring buffer on reserved memory to be freeable. This allows us
to free the trace instance on the reserved memory without changing
cmdline and rebooting. Even if we can not change the kernel cmdline
for security reason, we can release the reserved memory for the ring
buffer as free (available) memory.

For example, boot kernel with reserved memory;
"reserve_mem=20M:2M:trace trace_instance=boot_mapped^traceoff@trace"

~ # free
              total        used        free      shared  buff/cache   available
Mem:        1995548       50544     1927568       14964       17436     1911480
Swap:             0           0           0
~ # rmdir /sys/kernel/tracing/instances/boot_mapped/
[   23.704023] Freeing reserve_mem:trace memory: 20476K
~ # free
              total        used        free      shared  buff/cache   available
Mem:        2016024       41844     1956740       14968       17440     1940572
Swap:             0           0           0

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Link: https://lore.kernel.org/173989134814.230693.18199312930337815629.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-28 08:39:28 -04:00
Steven Rostedt
5f3719f697 tracing: Update modules to persistent instances when loaded
When a module is loaded and a persistent buffer is actively tracing, add
it to the list of modules in the persistent memory.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250305164609.469844721@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-28 08:39:27 -04:00
Steven Rostedt
1bd25a6f71 tracing: Show module names and addresses of last boot
Add the last boot module's names and addresses to the last_boot_info file.
This only shows the module information from a previous boot. If the buffer
is started and is recording the current boot, this file still will only
show "current".

  ~# cat instances/boot_mapped/last_boot_info
  10c00000		[kernel]
  ffffffffc00ca000	usb_serial_simple
  ffffffffc00ae000	usbserial
  ffffffffc008b000	bfq

  ~# echo function > instances/boot_mapped/current_tracer
  ~# cat instances/boot_mapped/last_boot_info
  # Current

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250305164609.299186021@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-28 08:39:27 -04:00
Steven Rostedt
fd39e48fe8 tracing: Have persistent trace instances save module addresses
For trace instances that are mapped to persistent memory, have them use
the scratch area to save the currently loaded modules. This will allow
where the modules have been loaded on the next boot so that their
addresses can be deciphered by using where they were loaded previously.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250305164609.129741650@goodmis.org
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-28 08:39:27 -04:00
Steven Rostedt
b65334825f tracing: Have persistent trace instances save KASLR offset
There's no reason to save the KASLR offset for the ring buffer itself.
That is used by the tracer. Now that the tracer has a way to save data in
the persistent memory of the ring buffer, have the tracing infrastructure
take care of the saving of the KASLR offset.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250305164608.792722274@goodmis.org
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-28 08:39:27 -04:00
Steven Rostedt
4af0a9c518 ring-buffer: Add ring_buffer_meta_scratch()
Now that there's one meta data at the start of the persistent memory used by
the ring buffer, allow the caller to request some memory right after that
data that it can use as its own persistent memory.

Also fix some white space issues with ring_buffer_alloc().

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250305164608.619631731@goodmis.org
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-28 08:39:27 -04:00
Steven Rostedt
4009cc31e7 ring-buffer: Add buffer meta data for persistent ring buffer
Instead of just having a meta data at the first page of each sub buffer
that has duplicate data, add a new meta page to the entire block of memory
that holds the duplicate data and remove it from the sub buffer meta data.

This will open up the extra memory in this first page to be used by the
tracer for its own persistent data.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250305164608.446351513@goodmis.org
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-28 08:39:26 -04:00
Steven Rostedt
bcba8d4dbe ring-buffer: Use kaslr address instead of text delta
Instead of saving off the text and data pointers and using them to compare
with the current boot's text and data pointers, just save off the KASLR
offset. Then that can be used to figure out how to read the previous boots
buffer.

The last_boot_info will now show this offset, but only if it is for a
previous boot:

  ~# cat instances/boot_mapped/last_boot_info
  39000000	[kernel]

  ~# echo function > instances/boot_mapped/current_tracer
  ~# cat instances/boot_mapped/last_boot_info
  # Current

If the KASLR offset saved is for the current boot, the last_boot_info will
show the value of "current".

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250305164608.274956504@goodmis.org
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-28 08:39:26 -04:00
Feng Yang
c73f0b6964 ring-buffer: Fix bytes_dropped calculation issue
The calculation of bytes-dropped and bytes_dropped_nested is reversed.
Although it does not affect the final calculation of total_dropped,
it should still be modified.

Link: https://lore.kernel.org/20250223070106.6781-1-yangfeng59949@163.com
Fixes: 6c43e554a2 ("ring-buffer: Add ring buffer startup selftest")
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-28 08:39:26 -04:00
Andy Shevchenko
196a062641 tracing: Mark binary printing functions with __printf() attribute
Binary printing functions are using printf() type of format, and compiler
is not happy about them as is:

kernel/trace/trace.c:3292:9: error: function ‘trace_vbprintk’ might be a candidate for ‘gnu_printf’ format attribute [-Werror=suggest-attribute=format]
kernel/trace/trace_seq.c:182:9: error: function ‘trace_seq_bprintf’ might be a candidate for ‘gnu_printf’ format attribute [-Werror=suggest-attribute=format]

Fix the compilation errors by adding __printf() attribute.

While at it, move existing __printf() attributes from the implementations
to the declarations. IT also fixes incorrect attribute parameters that are
used for trace_array_printk().

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20250321144822.324050-4-andriy.shevchenko@linux.intel.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-03-28 13:37:11 +01:00
Linus Torvalds
a7e135fe59 Probes updates for v6.15:
- probe-events: Add comments about entry data storing code to clarify
   where and how the entry data is stored for function return events.
 
 - probe-events: Log error for exceeding the number of arguments to help
   user to identify error reason via tracefs/error_log file.
 
 - selftests/ftrace: Improve the ftracetest to add followngs.
   . Expand the tprobe event test to check if it can correctly find
     the wrong format tracepoint name.
   . Add new syntax error test to check whether error_log correctly
     indicates a wrong character in the tracepoint name.
   . Add a new dynamic events argument limitation test case which checks
     max number of probe arguments.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmflTlobHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8ba9UIALzzZQxzUhJYO/B/XaCz
 KTuSrU2594cLr3nCrmCfL3UHUCr5IcjXKCfUdrgzE9mckWF+nVRXWwbp29KpOQky
 fzU9Ardbr7ksGAYFk4My+P/BeYa7vh9LwofXzWlJibANVxWvq66+GfKnWnh1P8Bl
 /zjov61DAEQmDfXNGZ2oTmKnYYMoPkJU4voRvAEUgiP02SrF04UUa00uC7hmZJJV
 aI5b6TatE5FDEmAoHWh0cf04HoyevRyLVOQd5GSswH/oWpyhs90M7a9WENHamBx6
 RXEZ3xYyuH9zBSvYVeRn2H1eHnGCR4RMctZiRGXF8EdLWo7RXRQ1Hp9CgvwDrU8Z
 IL0=
 =vZ3e
 -----END PGP SIGNATURE-----

Merge tag 'probes-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes updates from Masami Hiramatsu:

 - probe-events: Add comments about entry data storing code to clarify
   where and how the entry data is stored for function return events.

 - probe-events: Log error for exceeding the number of arguments to help
   user to identify error reason via tracefs/error_log file.

 - Improve the ftracetest selftests:
    - Expand the tprobe event test to check if it can correctly find the
      wrong format tracepoint name.
    - Add new syntax error test to check whether error_log correctly
      indicates a wrong character in the tracepoint name.
    - Add a new dynamic events argument limitation test case which
      checks max number of probe arguments.

* tag 'probes-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: probe-events: Add comments about entry data storing code
  selftests/ftrace: Add dynamic events argument limitation test case
  selftests/ftrace: Add new syntax error test
  selftests/ftrace: Expand the tprobe event test to check wrong format
  tracing: probe-events: Log error for exceeding the number of arguments
2025-03-27 19:31:34 -07:00
Linus Torvalds
744fab2d9f tracing updates for v6.15:
- Add option traceoff_after_boot
 
   In order to debug kernel boot, it sometimes is helpful to enable tracing
   via the kernel command line. Unfortunately, by the time the login prompt
   appears, the trace is overwritten by the init process and other user space
   start up applications. Adding a "traceoff_after_boot" will disable tracing
   when the kernel passes control to init which will allow developers to be
   able to see the traces that occurred during boot.
 
 - Clean up the mmflags macros that display the GFP flags in trace events
 
   The macros to print the GFP flags for trace events had a bit of
   duplication. The code was restructured to remove duplication and in the
   process it also adds some flags that were missed before.
 
 - Removed some dead code and scripts/draw_functrace.py
 
   draw_functrace.py hasn't worked in years and as nobody complained
   about it, remove it.
 
 - Constify struct event_trigger_ops
 
   The event_trigger_ops is just a structure that has function pointers that
   are assigned when the variables are created. These variables should all be
   constants.
 
 - Other minor clean ups and fixes
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ+V9IhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qr4RAP9JhE3n69pGuOVaJTN/LGLr2Axl59n4
 KqZSZS1nUM76/gD6AxYpR7nxyxgJ7VjNkLptS9tSjJVdPDxGAl0v3eO04w4=
 =SU30
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

 - Add option traceoff_after_boot

   In order to debug kernel boot, it sometimes is helpful to enable
   tracing via the kernel command line. Unfortunately, by the time the
   login prompt appears, the trace is overwritten by the init process
   and other user space start up applications.

   Adding a "traceoff_after_boot" will disable tracing when the kernel
   passes control to init which will allow developers to be able to see
   the traces that occurred during boot.

 - Clean up the mmflags macros that display the GFP flags in trace
   events

   The macros to print the GFP flags for trace events had a bit of
   duplication. The code was restructured to remove duplication and in
   the process it also adds some flags that were missed before.

 - Removed some dead code and scripts/draw_functrace.py

   draw_functrace.py hasn't worked in years and as nobody complained
   about it, remove it.

 - Constify struct event_trigger_ops

   The event_trigger_ops is just a structure that has function pointers
   that are assigned when the variables are created. These variables
   should all be constants.

 - Other minor clean ups and fixes

* tag 'trace-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Replace strncpy with memcpy for fixed-length substring copy
  tracing: Fix synth event printk format for str fields
  tracing: Do not use PERF enums when perf is not defined
  tracing: Ensure module defining synth event cannot be unloaded while tracing
  tracing: fix return value in __ftrace_event_enable_disable for TRACE_REG_UNREGISTER
  tracing/osnoise: Fix possible recursive locking for cpus_read_lock()
  tracing: Align synth event print fmt
  tracing: gfp: vsprintf: Do not print "none" when using %pGg printf format
  tracepoint: Print the function symbol when tracepoint_debug is set
  tracing: Constify struct event_trigger_ops
  scripts/tracing: Remove scripts/tracing/draw_functrace.py
  tracing: Update MAINTAINERS file to include tracepoint.c
  tracing/user_events: Slightly simplify user_seq_show()
  tracing/user_events: Don't use %pK through printk
  tracing: gfp: Remove duplication of recording GFP flags
  tracing: Remove orphaned event_trace_printk
  ring-buffer: Fix typo in comment about header page pointer
  tracing: Add traceoff_after_boot option
2025-03-27 16:22:12 -07:00
Linus Torvalds
88221ac0d5 Latency tracing changes for v6.15:
- Add some trace events to osnoise and timerlat sample generation
 
   This adds more information to the osnoise and timerlat tracers as well as
   allows BPF programs to be attached to these locations to extract even more
   data.
 
 - Fix to DECLARE_TRACE_CONDITION() macro
 
   It wasn't used but now will be and it happened to be broken causing the
   build to fail.
 
 - Add scheduler specification monitors to runtime verifier (RV)
 
   This is a continuation of Daniel Bristot's work.
 
   RV allows monitors to run and react concurrently. Running the cumulative
   model is equivalent to running single components using the same
   reactors, with the advantage that it's easier to point out which
   specification failed in case of error.
 
   This update introduces nested monitors to RV, in short, the sysfs
   monitor folder will contain a monitor named sched, which is nothing but
   an empty container for other monitors. Controlling the sched monitor
   (enable, disable, set reactors) controls all nested monitors.
 
   The following scheduling monitors are added:
 
   * sco: scheduling context operations
       Monitor to ensure sched_set_state happens only in thread context
   * tss: task switch while scheduling
       Monitor to ensure sched_switch happens only in scheduling context
   * snroc: set non runnable on its own context
       Monitor to ensure set_state happens only in the respective task's context
   * scpd: schedule called with preemption disabled
       Monitor to ensure schedule is called with preemption disabled
   * snep: schedule does not enable preempt
       Monitor to ensure schedule does not enable preempt
   * sncid: schedule not called with interrupt disabled
       Monitor to ensure schedule is not called with interrupt disabled
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ+QhuxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qg62AP9bkeNDbiCuAqjZGddV09Hw26wC3yum
 kQoNSebD8G52rQEA9GDjK37xGzYwW/fJokhJVTV39qfub6inAJE5dS6WeQY=
 =8Ikv
 -----END PGP SIGNATURE-----

Merge tag 'trace-latency-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull latency tracing updates from Steven Rostedt:

 - Add some trace events to osnoise and timerlat sample generation

   This adds more information to the osnoise and timerlat tracers as
   well as allows BPF programs to be attached to these locations to
   extract even more data.

 - Fix to DECLARE_TRACE_CONDITION() macro

   It wasn't used but now will be and it happened to be broken causing
   the build to fail.

 - Add scheduler specification monitors to runtime verifier (RV)

   This is a continuation of Daniel Bristot's work.

   RV allows monitors to run and react concurrently. Running the
   cumulative model is equivalent to running single components using the
   same reactors, with the advantage that it's easier to point out which
   specification failed in case of error.

   This update introduces nested monitors to RV, in short, the sysfs
   monitor folder will contain a monitor named sched, which is nothing
   but an empty container for other monitors. Controlling the sched
   monitor (enable, disable, set reactors) controls all nested monitors.

   The following scheduling monitors are added:

     - sco: scheduling context operations
       Monitor to ensure sched_set_state happens only in thread context

     - tss: task switch while scheduling
       Monitor to ensure sched_switch happens only in scheduling context

     - snroc: set non runnable on its own context
       Monitor to ensure set_state happens only in the respective task's context

     - scpd: schedule called with preemption disabled
       Monitor to ensure schedule is called with preemption disabled

     - snep: schedule does not enable preempt
       Monitor to ensure schedule does not enable preempt

     - sncid: schedule not called with interrupt disabled
       Monitor to ensure schedule is not called with interrupt disabled

* tag 'trace-latency-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tools/rv: Allow rv list to filter for container
  Documentation/rv: Add docs for the sched monitors
  verification/dot2k: Add support for nested monitors
  tools/rv: Add support for nested monitors
  rv: Add scpd, snep and sncid per-cpu monitors
  rv: Add snroc per-task monitor
  rv: Add sco and tss per-cpu monitors
  rv: Add option for nested monitors and include sched
  sched: Add sched tracepoints for RV task model
  rv: Add license identifiers to monitor files
  tracing: Fix DECLARE_TRACE_CONDITION
  trace/osnoise: Add trace events for samples
2025-03-27 16:03:52 -07:00
Linus Torvalds
31eb415bf6 ftrace changes for v6.15:
- Record function parameters for function and function graph tracers
 
   An option has been added to function tracer (func-args) and the function
   graph tracer (funcgraph-args) that when set, the tracers will record the
   registers that hold the arguments into each function event. On reading of
   the trace, it will use BTF to print those arguments. Most archs support up
   to 6 arguments (depending on the complexity of the arguments) and those
   are printed. If a function has more arguments then what was recorded, the
   output will end with " ... )".
 
     Example of function graph tracer:
 
     6)              | dummy_xmit [dummy](skb = 0x8887c100, dev = 0x872ca000) {
     6)              |   consume_skb(skb = 0x8887c100) {
     6)              |     skb_release_head_state(skb = 0x8887c100) {
     6)  0.178 us    |       sock_wfree(skb = 0x8887c100)
     6)  0.627 us    |     }
 
 - The rest of the changes are minor clean ups and fixes
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ+M9vRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qrBGAP9NTn4Lpci69n8FJGfz+UKUEKbzOOeg
 QrkIZIxbm8N56gEA0RaUionWjt1znZXUdBTsE3h+en/Ik0//VZytQcmJBwM=
 =fxuG
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace updates from Steven Rostedt:

 - Record function parameters for function and function graph tracers

   An option has been added to function tracer (func-args) and the
   function graph tracer (funcgraph-args) that when set, the tracers
   will record the registers that hold the arguments into each function
   event. On reading of the trace, it will use BTF to print those
   arguments. Most archs support up to 6 arguments (depending on the
   complexity of the arguments) and those are printed.

   If a function has more arguments then what was recorded, the output
   will end with " ... )".

   Example of function graph tracer:

	6)              | dummy_xmit [dummy](skb = 0x8887c100, dev = 0x872ca000) {
	6)              |   consume_skb(skb = 0x8887c100) {
	6)              |     skb_release_head_state(skb = 0x8887c100) {
	6)  0.178 us    |       sock_wfree(skb = 0x8887c100)
	6)  0.627 us    |     }

 - The rest of the changes are minor clean ups and fixes

* tag 'ftrace-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Use hashtable.h for event_hash
  tracing: Fix use-after-free in print_graph_function_flags during tracer switching
  function_graph: Remove the unused variable func
  ftrace: Add arguments to function tracer
  ftrace: Have funcgraph-args take affect during tracing
  ftrace: Add support for function argument to graph tracer
  ftrace: Add print_function_args()
  ftrace: Have ftrace_free_filter() WARN and exit if ops is active
  fgraph: Correct typo in ftrace_return_to_handler comment
2025-03-27 15:57:29 -07:00
Linus Torvalds
dd161f74f8 tracing and sorttable updates for 6.15:
- Implement arm64 build time sorting of the mcount location table
 
   When gcc is used to build arm64, the mcount_loc section is all zeros in
   the vmlinux elf file. The addresses are stored in the Elf_Rela location.
   To sort at build time, an array is allocated and the addresses are added
   to it via the content of the mcount_loc section as well as he Elf_Rela
   data. After sorting, the information is put back into the Elf_Rela which
   now has the section sorted.
 
 - Make sorting of mcount location table for arm64 work with clang as well
 
   When clang is used, the mcount_loc section contains the addresses, unlike
   the gcc build. An array is still created and the sorting works for both
   methods.
 
 - Remove weak functions from the mcount_loc section
 
   Have the sorttable code pass in the data of functions defined via nm -S
   which shows the functions as well as their sizes. Using this information
   the sorttable code can determine if a function in the mcount_loc section
   was weak and overridden. If the function is not found, it is set to be
   zero. On boot, when the mcount_loc section is read and the ftrace table is
   created, if the address in the mcount_loc is not in the kernel core text
   then it is removed and not added to the ftrace_filter_functions (the
   functions that can be attached by ftrace callbacks).
 
 - Update and fix the reporting of how much data is used for ftrace functions
 
   On boot, a report of how many pages were used by the ftrace table as well
   as how they were grouped (the table holds a list of sections that are
   groups of pages that were able to be allocated). The removing of the weak
   functions required the accounting to be updated.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ+MnThQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qivsAQDhPOCaONai7rvHX9T1aOHGjdajZ7SI
 qoZgBOsc2ZUkoQD/U2M/m7Yof9aR4I+VFKtT5NsAwpfqPSOL/t/1j6UEOQ8=
 =45AV
 -----END PGP SIGNATURE-----

Merge tag 'trace-sorttable-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing / sorttable updates from Steven Rostedt:

 - Implement arm64 build time sorting of the mcount location table

   When gcc is used to build arm64, the mcount_loc section is all zeros
   in the vmlinux elf file. The addresses are stored in the Elf_Rela
   location.

   To sort at build time, an array is allocated and the addresses are
   added to it via the content of the mcount_loc section as well as he
   Elf_Rela data. After sorting, the information is put back into the
   Elf_Rela which now has the section sorted.

 - Make sorting of mcount location table for arm64 work with clang as
   well

   When clang is used, the mcount_loc section contains the addresses,
   unlike the gcc build. An array is still created and the sorting works
   for both methods.

 - Remove weak functions from the mcount_loc section

   Have the sorttable code pass in the data of functions defined via
   'nm -S' which shows the functions as well as their sizes. Using this
   information the sorttable code can determine if a function in the
   mcount_loc section was weak and overridden. If the function is not
   found, it is set to be zero. On boot, when the mcount_loc section is
   read and the ftrace table is created, if the address in the
   mcount_loc is not in the kernel core text then it is removed and not
   added to the ftrace_filter_functions (the functions that can be
   attached by ftrace callbacks).

 - Update and fix the reporting of how much data is used for ftrace
   functions

   On boot, a report of how many pages were used by the ftrace table as
   well as how they were grouped (the table holds a list of sections
   that are groups of pages that were able to be allocated). The
   removing of the weak functions required the accounting to be updated.

* tag 'trace-sorttable-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  scripts/sorttable: Allow matches to functions before function entry
  scripts/sorttable: Use normal sort if theres no relocs in the mcount section
  ftrace: Check against is_kernel_text() instead of kaslr_offset()
  ftrace: Test mcount_loc addr before calling ftrace_call_addr()
  ftrace: Have ftrace pages output reflect freed pages
  ftrace: Update the mcount_loc check of skipped entries
  scripts/sorttable: Zero out weak functions in mcount_loc table
  scripts/sorttable: Always use an array for the mcount_loc sorting
  scripts/sorttable: Have mcount rela sort use direct values
  arm64: scripts/sorttable: Implement sorting mcount_loc at boot for arm64
2025-03-27 15:44:34 -07:00
Masami Hiramatsu (Google)
bb9c6020f4 tracing: probe-events: Add comments about entry data storing code
Add comments about entry data storing code to __store_entry_arg() and
traceprobe_get_entry_data_size(). These are a bit complicated because of
building the entry data storing code and scanning it.

This just add comments, no behavior change.

Link: https://lore.kernel.org/all/174061715004.501424.333819546601401102.stgit@devnote2/

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Closes: https://lore.kernel.org/all/20250226102223.586d7119@gandalf.local.home/
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-03-27 21:19:54 +09:00
Masami Hiramatsu (Google)
57faaa0480 tracing: probe-events: Log error for exceeding the number of arguments
Add error message when the number of arguments exceeds the limitation.

Link: https://lore.kernel.org/all/174055075075.4079315.10916648136898316476.stgit@mhiramat.tok.corp.google.com/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-27 21:19:54 +09:00
Linus Torvalds
054570267d lsm/stable-6.15 PR 20250323
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmfgWgMUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNW5RAAvCDq5gBtY0aTNlULe637EVLSh+t8
 PkSzHzu/NlzU6BfjtwSm2fuML8welTGxSwUPxUzMCI91gPdkGeFktefavT3xa+QI
 BHWROn7fEJ/KmRZvngPeIkgLr5xhF5nBJmc/Jw71qem20zRzNgJnpzMX16d10Phx
 dxd2xOO1qM3bv6Z9RcIssZRGaN+PHngpWWg+0B69XuaBUso87S6NDyKNn1XPmvoz
 as96k+Wk/xAZGVEeCbs/+H5rBx6DLg+FfTRa06Oh4BFsqedpkDPxLrTgCJGJkA0H
 dsK6O/993zvjx0Jn4ZPoJ9n35S82BmkCsz4bGq1xVl6FYUiMcm3/8yO41wllS+w4
 j+RlTU/RIdB7n8EKyMMl1hj1stTvt3Bi9F5Cbf7ZEv0snfR00K4KVpi17jnFjUHv
 kpOiEtXZb/NGQip7UAuUq0PisfqbiO4jJurYHRetDgv1WCy6+C8ufM5t6I+cnvmG
 VG+dlxcW+rDIn6bLRVuGi9TJRsQ6eox9ipa+qEKNNiOXgftELcgT7m74nAS5m0uv
 n5rDa221nPXecEB0X7d6YUFk711lly90dbelNeLrmv1w6jl8L1PpS1oBaW+UzGu9
 46eGBd6pzu9otvK9WVyDEdotDOCrgH0sd7pTetqDhLJZ7KrGwyyqO2gD/JroUKcC
 lnxBQwPnat86iI8=
 =oxfV
 -----END PGP SIGNATURE-----

Merge tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull lsm updates from Paul Moore:

 - Various minor updates to the LSM Rust bindings

   Changes include marking trivial Rust bindings as inlines and comment
   tweaks to better reflect the LSM hooks.

 - Add LSM/SELinux access controls to io_uring_allowed()

   Similar to the io_uring_disabled sysctl, add a LSM hook to
   io_uring_allowed() to enable LSMs a simple way to enforce security
   policy on the use of io_uring. This pull request includes SELinux
   support for this new control using the io_uring/allowed permission.

 - Remove an unused parameter from the security_perf_event_open() hook

   The perf_event_attr struct parameter was not used by any currently
   supported LSMs, remove it from the hook.

 - Add an explicit MAINTAINERS entry for the credentials code

   We've seen problems in the past where patches to the credentials code
   sent by non-maintainers would often languish on the lists for
   multiple months as there was no one explicitly tasked with the
   responsibility of reviewing and/or merging credentials related code.

   Considering that most of the code under security/ has a vested
   interest in ensuring that the credentials code is well maintained,
   I'm volunteering to look after the credentials code and Serge Hallyn
   has also volunteered to step up as an official reviewer. I posted the
   MAINTAINERS update as a RFC to LKML in hopes that someone else would
   jump up with an "I'll do it!", but beyond Serge it was all crickets.

 - Update Stephen Smalley's old email address to prevent confusion

   This includes a corresponding update to the mailmap file.

* tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  mailmap: map Stephen Smalley's old email addresses
  lsm: remove old email address for Stephen Smalley
  MAINTAINERS: add Serge Hallyn as a credentials reviewer
  MAINTAINERS: add an explicit credentials entry
  cred,rust: mark Credential methods inline
  lsm,rust: reword "destroy" -> "release" in SecurityCtx
  lsm,rust: mark SecurityCtx methods inline
  perf: Remove unnecessary parameter of security check
  lsm: fix a missing security_uring_allowed() prototype
  io_uring,lsm,selinux: add LSM hooks for io_uring_setup()
  io_uring: refactor io_uring_allowed()
2025-03-25 15:44:19 -07:00
Siddarth G
e0344f9564 tracing: Replace strncpy with memcpy for fixed-length substring copy
checkpatch.pl reports the following warning:
WARNING: Prefer strscpy, strscpy_pad, or __nonstring over strncpy
(see: https://github.com/KSPP/linux/issues/90)

In synth_field_string_size(), replace strncpy() with memcpy() to copy 'len'
characters from 'start' to 'buf'. The code manually adds a NUL terminator
after the copy, making memcpy safe here.

Link: https://lore.kernel.org/20250325181232.38284-1-siddarthsgml@gmail.com
Signed-off-by: Siddarth G <siddarthsgml@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-25 17:44:32 -04:00
Douglas Raillard
4d38328eb4 tracing: Fix synth event printk format for str fields
The printk format for synth event uses "%.*s" to print string fields,
but then only passes the pointer part as var arg.

Replace %.*s with %s as the C string is guaranteed to be null-terminated.

The output in print fmt should never have been updated as __get_str()
handles the string limit because it can access the length of the string in
the string meta data that is saved in the ring buffer.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 8db4d6bfbb ("tracing: Change synthetic event string format to limit printed length")
Link: https://lore.kernel.org/20250325165202.541088-1-douglas.raillard@arm.com
Signed-off-by: Douglas Raillard <douglas.raillard@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-25 17:42:38 -04:00
Linus Torvalds
a50b4fe095 A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
   the callback pointer. This turned out to be suboptimal for the upcoming
   Rust integration and is obviously a silly implementation to begin with.
 
   This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
   with hrtimer_setup(T, cb);
 
   The conversion was done with Coccinelle and a few manual fixups.
 
   Once the conversion has completely landed in mainline, hrtimer_init()
   will be removed and the hrtimer::function becomes a private member.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8
 GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc
 rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv
 ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r
 d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8
 OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV
 Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE
 iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop
 Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn
 gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I
 Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0
 iJbvLJeisXr3GA==
 =bYAX
 -----END PGP SIGNATURE-----

Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer cleanups from Thomas Gleixner:
 "A treewide hrtimer timer cleanup

  hrtimers are initialized with hrtimer_init() and a subsequent store to
  the callback pointer. This turned out to be suboptimal for the
  upcoming Rust integration and is obviously a silly implementation to
  begin with.

  This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
  with hrtimer_setup(T, cb);

  The conversion was done with Coccinelle and a few manual fixups.

  Once the conversion has completely landed in mainline, hrtimer_init()
  will be removed and the hrtimer::function becomes a private member"

* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  wifi: rt2x00: Switch to use hrtimer_update_function()
  io_uring: Use helper function hrtimer_update_function()
  serial: xilinx_uartps: Use helper function hrtimer_update_function()
  ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
  RDMA: Switch to use hrtimer_setup()
  virtio: mem: Switch to use hrtimer_setup()
  drm/vmwgfx: Switch to use hrtimer_setup()
  drm/xe/oa: Switch to use hrtimer_setup()
  drm/vkms: Switch to use hrtimer_setup()
  drm/msm: Switch to use hrtimer_setup()
  drm/i915/request: Switch to use hrtimer_setup()
  drm/i915/uncore: Switch to use hrtimer_setup()
  drm/i915/pmu: Switch to use hrtimer_setup()
  drm/i915/perf: Switch to use hrtimer_setup()
  drm/i915/gvt: Switch to use hrtimer_setup()
  drm/i915/huc: Switch to use hrtimer_setup()
  drm/amdgpu: Switch to use hrtimer_setup()
  stm class: heartbeat: Switch to use hrtimer_setup()
  i2c: Switch to use hrtimer_setup()
  iio: Switch to use hrtimer_setup()
  ...
2025-03-25 10:54:15 -07:00
Linus Torvalds
e34c38057a [ Merge note: this pull request depends on you having merged
two locking commits in the locking tree,
 	      part of the locking-core-2025-03-22 pull request. ]
 
 x86 CPU features support:
   - Generate the <asm/cpufeaturemasks.h> header based on build config
     (H. Peter Anvin, Xin Li)
   - x86 CPUID parsing updates and fixes (Ahmed S. Darwish)
   - Introduce the 'setcpuid=' boot parameter (Brendan Jackman)
   - Enable modifying CPU bug flags with '{clear,set}puid='
     (Brendan Jackman)
   - Utilize CPU-type for CPU matching (Pawan Gupta)
   - Warn about unmet CPU feature dependencies (Sohil Mehta)
   - Prepare for new Intel Family numbers (Sohil Mehta)
 
 Percpu code:
   - Standardize & reorganize the x86 percpu layout and
     related cleanups (Brian Gerst)
   - Convert the stackprotector canary to a regular percpu
     variable (Brian Gerst)
   - Add a percpu subsection for cache hot data (Brian Gerst)
   - Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak)
   - Construct __percpu_seg_override from __percpu_seg (Uros Bizjak)
 
 MM:
   - Add support for broadcast TLB invalidation using AMD's INVLPGB instruction
     (Rik van Riel)
   - Rework ROX cache to avoid writable copy (Mike Rapoport)
   - PAT: restore large ROX pages after fragmentation
     (Kirill A. Shutemov, Mike Rapoport)
   - Make memremap(MEMREMAP_WB) map memory as encrypted by default
     (Kirill A. Shutemov)
   - Robustify page table initialization (Kirill A. Shutemov)
   - Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn)
   - Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW
     (Matthew Wilcox)
 
 KASLR:
   - x86/kaslr: Reduce KASLR entropy on most x86 systems,
     to support PCI BAR space beyond the 10TiB region
     (CONFIG_PCI_P2PDMA=y) (Balbir Singh)
 
 CPU bugs:
   - Implement FineIBT-BHI mitigation (Peter Zijlstra)
   - speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta)
   - speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan Gupta)
   - RFDS: Exclude P-only parts from the RFDS affected list (Pawan Gupta)
 
 System calls:
   - Break up entry/common.c (Brian Gerst)
   - Move sysctls into arch/x86 (Joel Granados)
 
 Intel LAM support updates: (Maciej Wieczor-Retman)
   - selftests/lam: Move cpu_has_la57() to use cpuinfo flag
   - selftests/lam: Skip test if LAM is disabled
   - selftests/lam: Test get_user() LAM pointer handling
 
 AMD SMN access updates:
   - Add SMN offsets to exclusive region access (Mario Limonciello)
   - Add support for debugfs access to SMN registers (Mario Limonciello)
   - Have HSMP use SMN through AMD_NODE (Yazen Ghannam)
 
 Power management updates: (Patryk Wlazlyn)
   - Allow calling mwait_play_dead with an arbitrary hint
   - ACPI/processor_idle: Add FFH state handling
   - intel_idle: Provide the default enter_dead() handler
   - Eliminate mwait_play_dead_cpuid_hint()
 
 Bootup:
 
 Build system:
   - Raise the minimum GCC version to 8.1 (Brian Gerst)
   - Raise the minimum LLVM version to 15.0.0
     (Nathan Chancellor)
 
 Kconfig: (Arnd Bergmann)
   - Add cmpxchg8b support back to Geode CPUs
   - Drop 32-bit "bigsmp" machine support
   - Rework CONFIG_GENERIC_CPU compiler flags
   - Drop configuration options for early 64-bit CPUs
   - Remove CONFIG_HIGHMEM64G support
   - Drop CONFIG_SWIOTLB for PAE
   - Drop support for CONFIG_HIGHPTE
   - Document CONFIG_X86_INTEL_MID as 64-bit-only
   - Remove old STA2x11 support
   - Only allow CONFIG_EISA for 32-bit
 
 Headers:
   - Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI headers
     (Thomas Huth)
 
 Assembly code & machine code patching:
   - x86/alternatives: Simplify alternative_call() interface (Josh Poimboeuf)
   - x86/alternatives: Simplify callthunk patching (Peter Zijlstra)
   - KVM: VMX: Use named operands in inline asm (Josh Poimboeuf)
   - x86/hyperv: Use named operands in inline asm (Josh Poimboeuf)
   - x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra)
   - x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h>
     (Uros Bizjak)
   - Use named operands in inline asm (Uros Bizjak)
   - Improve performance by using asm_inline() for atomic locking instructions
     (Uros Bizjak)
 
 Earlyprintk:
   - Harden early_serial (Peter Zijlstra)
 
 NMI handler:
   - Add an emergency handler in nmi_desc & use it in nmi_shootdown_cpus()
     (Waiman Long)
 
 Miscellaneous fixes and cleanups:
 
   - by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel,
     Artem Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst,
     Dan Carpenter, Dr. David Alan Gilbert, H. Peter Anvin,
     Ingo Molnar, Josh Poimboeuf, Kevin Brodsky, Mike Rapoport,
     Lukas Bulwahn, Maciej Wieczor-Retman, Max Grobecker,
     Patryk Wlazlyn, Pawan Gupta, Peter Zijlstra,
     Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner,
     Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak,
     Vitaly Kuznetsov, Xin Li, liuye.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfenkQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1g1FRAAi6OFTSn/5aeLMI0IMNBxJ6ddQiFc3imd
 7+C/vU5nul4CyDs8mKyj/+f/DDrbkG9lKz3VG631Yl237lXHjD8XWcVMeC/1z/q0
 3zInDIloE9/nBHRPkF6F7fARBLBZ0LFgaBsGrCo7mwpGybiQdqGcqcxllvTbtXaw
 OHta4q6ok+lBDNlfc0v6H4cRnzhmmlKu6Ng0j6UI3V7uFhi3vtxas32ltDQtzorq
 2+jbV6/+kbrrv+xPC+jlzOFhTEKRupNPQXmvyQteoQg6G3kqAKMDvBthGXd1rHuX
 Qa+BoDIifE/2NiVeRwNrhoqYH/pHCzUzDREW5IW8+ca+4XNKuzAC6EuC8CeCzyK1
 q8ZjZjooQW4zEeVFeJYllHONzJYfxfSH5CLsnbcuhq99yfGlrQhF1qL72/Omn1w/
 DfPJM8Zt5zyKvLqUg3Md+fkVCO2wyDNhB61QPzRgHF+yD+rvuDpoqvUWir+w7cSn
 fwEDVZGXlFx6dumtSrqRaTd1nvFt80s8yP2ll09DMvGQ8D/yruS7hndGAmmJVCSW
 NAfd8pSjq5v2+ux2UR92/Cc3VF3SjaUqHBOp/Nq9rESya18ZVa3cJpHhVYYtPIVf
 THW0h07RIkGVKs1uq+5ekLCr/8uAZg58UPIqmhTuW0ttymRHCNfohR45FQZzy+0M
 tJj1oc2TIZw=
 =Dcb3
 -----END PGP SIGNATURE-----

Merge tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core x86 updates from Ingo Molnar:
 "x86 CPU features support:
   - Generate the <asm/cpufeaturemasks.h> header based on build config
     (H. Peter Anvin, Xin Li)
   - x86 CPUID parsing updates and fixes (Ahmed S. Darwish)
   - Introduce the 'setcpuid=' boot parameter (Brendan Jackman)
   - Enable modifying CPU bug flags with '{clear,set}puid=' (Brendan
     Jackman)
   - Utilize CPU-type for CPU matching (Pawan Gupta)
   - Warn about unmet CPU feature dependencies (Sohil Mehta)
   - Prepare for new Intel Family numbers (Sohil Mehta)

  Percpu code:
   - Standardize & reorganize the x86 percpu layout and related cleanups
     (Brian Gerst)
   - Convert the stackprotector canary to a regular percpu variable
     (Brian Gerst)
   - Add a percpu subsection for cache hot data (Brian Gerst)
   - Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak)
   - Construct __percpu_seg_override from __percpu_seg (Uros Bizjak)

  MM:
   - Add support for broadcast TLB invalidation using AMD's INVLPGB
     instruction (Rik van Riel)
   - Rework ROX cache to avoid writable copy (Mike Rapoport)
   - PAT: restore large ROX pages after fragmentation (Kirill A.
     Shutemov, Mike Rapoport)
   - Make memremap(MEMREMAP_WB) map memory as encrypted by default
     (Kirill A. Shutemov)
   - Robustify page table initialization (Kirill A. Shutemov)
   - Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn)
   - Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW
     (Matthew Wilcox)

  KASLR:
   - x86/kaslr: Reduce KASLR entropy on most x86 systems, to support PCI
     BAR space beyond the 10TiB region (CONFIG_PCI_P2PDMA=y) (Balbir
     Singh)

  CPU bugs:
   - Implement FineIBT-BHI mitigation (Peter Zijlstra)
   - speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta)
   - speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan
     Gupta)
   - RFDS: Exclude P-only parts from the RFDS affected list (Pawan
     Gupta)

  System calls:
   - Break up entry/common.c (Brian Gerst)
   - Move sysctls into arch/x86 (Joel Granados)

  Intel LAM support updates: (Maciej Wieczor-Retman)
   - selftests/lam: Move cpu_has_la57() to use cpuinfo flag
   - selftests/lam: Skip test if LAM is disabled
   - selftests/lam: Test get_user() LAM pointer handling

  AMD SMN access updates:
   - Add SMN offsets to exclusive region access (Mario Limonciello)
   - Add support for debugfs access to SMN registers (Mario Limonciello)
   - Have HSMP use SMN through AMD_NODE (Yazen Ghannam)

  Power management updates: (Patryk Wlazlyn)
   - Allow calling mwait_play_dead with an arbitrary hint
   - ACPI/processor_idle: Add FFH state handling
   - intel_idle: Provide the default enter_dead() handler
   - Eliminate mwait_play_dead_cpuid_hint()

  Build system:
   - Raise the minimum GCC version to 8.1 (Brian Gerst)
   - Raise the minimum LLVM version to 15.0.0 (Nathan Chancellor)

  Kconfig: (Arnd Bergmann)
   - Add cmpxchg8b support back to Geode CPUs
   - Drop 32-bit "bigsmp" machine support
   - Rework CONFIG_GENERIC_CPU compiler flags
   - Drop configuration options for early 64-bit CPUs
   - Remove CONFIG_HIGHMEM64G support
   - Drop CONFIG_SWIOTLB for PAE
   - Drop support for CONFIG_HIGHPTE
   - Document CONFIG_X86_INTEL_MID as 64-bit-only
   - Remove old STA2x11 support
   - Only allow CONFIG_EISA for 32-bit

  Headers:
   - Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI
     headers (Thomas Huth)

  Assembly code & machine code patching:
   - x86/alternatives: Simplify alternative_call() interface (Josh
     Poimboeuf)
   - x86/alternatives: Simplify callthunk patching (Peter Zijlstra)
   - KVM: VMX: Use named operands in inline asm (Josh Poimboeuf)
   - x86/hyperv: Use named operands in inline asm (Josh Poimboeuf)
   - x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra)
   - x86/kexec: Merge x86_32 and x86_64 code using macros from
     <asm/asm.h> (Uros Bizjak)
   - Use named operands in inline asm (Uros Bizjak)
   - Improve performance by using asm_inline() for atomic locking
     instructions (Uros Bizjak)

  Earlyprintk:
   - Harden early_serial (Peter Zijlstra)

  NMI handler:
   - Add an emergency handler in nmi_desc & use it in
     nmi_shootdown_cpus() (Waiman Long)

  Miscellaneous fixes and cleanups:
   - by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel, Artem
     Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst, Dan
     Carpenter, Dr. David Alan Gilbert, H. Peter Anvin, Ingo Molnar,
     Josh Poimboeuf, Kevin Brodsky, Mike Rapoport, Lukas Bulwahn, Maciej
     Wieczor-Retman, Max Grobecker, Patryk Wlazlyn, Pawan Gupta, Peter
     Zijlstra, Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner,
     Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak, Vitaly
     Kuznetsov, Xin Li, liuye"

* tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (211 commits)
  zstd: Increase DYNAMIC_BMI2 GCC version cutoff from 4.8 to 11.0 to work around compiler segfault
  x86/asm: Make asm export of __ref_stack_chk_guard unconditional
  x86/mm: Only do broadcast flush from reclaim if pages were unmapped
  perf/x86/intel, x86/cpu: Replace Pentium 4 model checks with VFM ones
  perf/x86/intel, x86/cpu: Simplify Intel PMU initialization
  x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in non-UAPI headers
  x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI headers
  x86/locking/atomic: Improve performance by using asm_inline() for atomic locking instructions
  x86/asm: Use asm_inline() instead of asm() in clwb()
  x86/asm: Use CLFLUSHOPT and CLWB mnemonics in <asm/special_insns.h>
  x86/hweight: Use asm_inline() instead of asm()
  x86/hweight: Use ASM_CALL_CONSTRAINT in inline asm()
  x86/hweight: Use named operands in inline asm()
  x86/stackprotector/64: Only export __ref_stack_chk_guard on CONFIG_SMP
  x86/head/64: Avoid Clang < 17 stack protector in startup code
  x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h>
  x86/runtime-const: Add the RUNTIME_CONST_PTR assembly macro
  x86/cpu/intel: Limit the non-architectural constant_tsc model checks
  x86/mm/pat: Replace Intel x86_model checks with VFM ones
  x86/cpu/intel: Fix fast string initialization for extended Families
  ...
2025-03-24 22:06:11 -07:00
Linus Torvalds
32b22538be Scheduler updates for v6.15:
[ Merge note, these two commits are identical:
 
    - f3fa0e40df ("sched/clock: Don't define sched_clock_irqtime as static key")
    - b9f2b29b94 ("sched: Don't define sched_clock_irqtime as static key")
 
   The first one is a cherry-picked version of the second, and the first one
   is already upstream. ]
 
 Core & fair scheduler changes:
 
   - Cancel the slice protection of the idle entity (Zihan Zhou)
   - Reduce the default slice to avoid tasks getting an extra tick
     (Zihan Zhou)
   - Force propagating min_slice of cfs_rq when {en,de}queue tasks
     (Tianchen Ding)
   - Refactor can_migrate_task() to elimate looping (I Hsin Cheng)
   - Add unlikey branch hints to several system calls (Colin Ian King)
   - Optimize current_clr_polling() on certain architectures (Yujun Dong)
 
 Deadline scheduler: (Juri Lelli)
 
   - Remove redundant dl_clear_root_domain call
   - Move dl_rebuild_rd_accounting to cpuset.h
 
 Uclamp:
 
   - Use the uclamp_is_used() helper instead of open-coding it (Xuewen Yan)
   - Optimize sched_uclamp_used static key enabling (Xuewen Yan)
 
 Scheduler topology support: (Juri Lelli)
 
   - Ignore special tasks when rebuilding domains
   - Add wrappers for sched_domains_mutex
   - Generalize unique visiting of root domains
   - Rebuild root domain accounting after every update
   - Remove partition_and_rebuild_sched_domains
   - Stop exposing partition_sched_domains_locked
 
 RSEQ: (Michael Jeanson)
 
   - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y
   - Fix segfault on registration when rseq_cs is non-zero
   - selftests: Add rseq syscall errors test
   - selftests: Ensure the rseq ABI TLS is actually 1024 bytes
 
 Membarriers:
 
   - Fix redundant load of membarrier_state (Nysal Jan K.A.)
 
 Scheduler debugging:
 
   - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior)
   - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar)
 
 Fixes and cleanups:
 
   - Always save/restore x86 TSC sched_clock() on suspend/resume
    (Guilherme G. Piccoli)
 
   - Misc fixes and cleanups (Thorsten Blum, Juri Lelli,
     Sebastian Andrzej Siewior)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfejsoRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ivkhAAwBF2tYRBS1oIHcC/OKK3JJoHVDp2LFbU
 9sm5S3ZlGD/Ns2fbpY+9A8UFgUFfjYiTSV7hvf2B9Vge0XSxTmMNFu/MdxLBbo9r
 w6GSeNcNDQKpjEGLkrmPFsa2fiYI4dmH0IzDbS9V2cNPk470QBKjAKXNPaSER691
 n2wLnQq+m5o4gXnPjnSz6RrrzisRnm2GOWnDV/iqR47pZFNlX2wWlo3s5r7//Hw0
 a+QfEfpgKehhy/VSDXmSAgpqnNffjc78yBV6LNoVUddwahnOWiQMS3XViOqgy5VO
 jUGBrzW+sKkdRMBppxwJ/0XWgHGC27amIgnU0ZE5u+eiUEu8H9qWl1cRCFxyeB0O
 8+WNfwmkH+FPWUdsn84kdePhSsZy6HfM6h44Xe0hx1V7tQXEXfbPzK3TnQg8Ktt1
 Ky6ctbZt4cGpqGQuIqvba21A/racrD/DgvB7mHeZksnqZoKTDwxhT/nlQGpuwPoy
 SJYd1ynFVJvfC69SwMdwnaglimvEZx1GfT0o5XtCMslY5NkWCou5u+e65WX7ccU5
 94wBCwI1/+KiFMJZp6TlPw07Q/Hsj9dDryxLc3OunU3zMVt3++1GZnBS/eb5FN1A
 5TlAEpxgH9c5Q4/XJFCvx21DrTVHSuIrR6naK91bgqHo0cEfaHGtO/ejPtJGSfxe
 YIFnnu1dhRw=
 =SuQE
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core & fair scheduler changes:

   - Cancel the slice protection of the idle entity (Zihan Zhou)
   - Reduce the default slice to avoid tasks getting an extra tick
     (Zihan Zhou)
   - Force propagating min_slice of cfs_rq when {en,de}queue tasks
     (Tianchen Ding)
   - Refactor can_migrate_task() to elimate looping (I Hsin Cheng)
   - Add unlikey branch hints to several system calls (Colin Ian King)
   - Optimize current_clr_polling() on certain architectures (Yujun
     Dong)

  Deadline scheduler: (Juri Lelli)
   - Remove redundant dl_clear_root_domain call
   - Move dl_rebuild_rd_accounting to cpuset.h

  Uclamp:
   - Use the uclamp_is_used() helper instead of open-coding it (Xuewen
     Yan)
   - Optimize sched_uclamp_used static key enabling (Xuewen Yan)

  Scheduler topology support: (Juri Lelli)
   - Ignore special tasks when rebuilding domains
   - Add wrappers for sched_domains_mutex
   - Generalize unique visiting of root domains
   - Rebuild root domain accounting after every update
   - Remove partition_and_rebuild_sched_domains
   - Stop exposing partition_sched_domains_locked

  RSEQ: (Michael Jeanson)
   - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y
   - Fix segfault on registration when rseq_cs is non-zero
   - selftests: Add rseq syscall errors test
   - selftests: Ensure the rseq ABI TLS is actually 1024 bytes

  Membarriers:
   - Fix redundant load of membarrier_state (Nysal Jan K.A.)

  Scheduler debugging:
   - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior)
   - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar)

  Fixes and cleanups:
   - Always save/restore x86 TSC sched_clock() on suspend/resume
     (Guilherme G. Piccoli)
   - Misc fixes and cleanups (Thorsten Blum, Juri Lelli, Sebastian
     Andrzej Siewior)"

* tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  cpuidle, sched: Use smp_mb__after_atomic() in current_clr_polling()
  sched/debug: Remove CONFIG_SCHED_DEBUG
  sched/debug: Remove CONFIG_SCHED_DEBUG from self-test config files
  sched/debug, Documentation: Remove (most) CONFIG_SCHED_DEBUG references from documentation
  sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional
  sched/debug: Make 'const_debug' tunables unconditional __read_mostly
  sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()
  rseq/selftests: Fix namespace collision with rseq UAPI header
  include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.h
  sched/topology: Stop exposing partition_sched_domains_locked
  cgroup/cpuset: Remove partition_and_rebuild_sched_domains
  sched/topology: Remove redundant dl_clear_root_domain call
  sched/deadline: Rebuild root domain accounting after every update
  sched/deadline: Generalize unique visiting of root domains
  sched/topology: Wrappers for sched_domains_mutex
  sched/deadline: Ignore special tasks when rebuilding domains
  tracing: Use preempt_model_str()
  xtensa: Rely on generic printing of preemption model
  x86: Rely on generic printing of preemption model
  s390: Rely on generic printing of preemption model
  ...
2025-03-24 21:28:12 -07:00
Linus Torvalds
3ba7dfb8da RCU pull request for v6.15
This pull request contains the following branches:
 
 docs.2025.02.04a:
  - Add broken-timing possibility to stallwarn.rst.
  - Improve discussion of this_cpu_ptr(), add raw_cpu_ptr().
  - Document self-propagating callbacks.
  - Point call_srcu() to call_rcu() for detailed memory ordering.
  - Add CONFIG_RCU_LAZY delays to call_rcu() kernel-doc header.
  - Clarify RCU_LAZY and RCU_LAZY_DEFAULT_OFF help text.
  - Remove references to old grace-period-wait primitives.
 
 srcu.2025.02.05a:
  - Introduce srcu_read_{un,}lock_fast(), which is similar to
    srcu_read_{un,}lock_lite(): avoid smp_mb()s in lock and unlock at the
    cost of calling synchronize_rcu() in synchronize_srcu(). Moreover, by
    returning the percpu offset of the counter at srcu_read_lock_fast()
    time, srcu_read_unlock_fast() can save extra pointer dereferencing,
    which makes it faster than srcu_read_{un,}lock_lite().
    srcu_read_{un,}lock_fast() are intended to replace
    rcu_read_{un,}lock_trace() if possible.
 
 torture.2025.02.05a:
  - Add get_torture_init_jiffies() to return the start time of the test.
  - Add a test_boost_holdoff module parameter to allow delaying boosting
    tests when building rcutorture as built-in.
  - Add grace period sequence number logging at the beginning and end of
    failure/close-call results.
  - Switch to hexadecimal for the expedited grace period sequence number
    in the rcu_exp_grace_period trace point.
  - Make cur_ops->format_gp_seqs take buffer length.
  - Move RCU_TORTURE_TEST_{CHK_RDR_STATE,LOG_CPU} to bool.
  - Complain when invalid SRCU reader_flavor is specified.
  - Add FORCE_NEED_SRCU_NMI_SAFE Kconfig for testing, which forces SRCU
    uses atomics even when percpu ops are NMI safe, and use the Kconfig
    for SRCU lockdep testing.
 
 misc.2025.03.04a:
  - Split rcu_report_exp_cpu_mult() mask parameter and use for tracing.
  - Remove READ_ONCE() for rdp->gpwrap access in __note_gp_changes().
  - Fix get_state_synchronize_rcu_full() GP-start detection.
  - Move RCU Tasks self-tests to core_initcall().
  - Print segment lengths in show_rcu_nocb_gp_state().
  - Make RCU watch ct_kernel_exit_state() warning.
  - Flush console log from kernel_power_off().
  - rcutorture: Allow a negative value for nfakewriters.
  - rcu: Update TREE05.boot to test normal synchronize_rcu().
  - rcu: Use _full() API to debug synchronize_rcu().
 
 lazypreempt.2025.03.04a: Make RCU handle PREEMPT_LAZY better:
  - Fix header guard for rcu_all_qs().
  - rcu: Rename PREEMPT_AUTO to PREEMPT_LAZY.
  - Update __cond_resched comment about RCU quiescent states.
  - Handle unstable rdp in rcu_read_unlock_strict().
  - Handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y.
  - osnoise: Provide quiescent states.
  - Adjust rcutorture with possible PREEMPT_RCU=n && PREEMPT_COUNT=y
    combination.
  - Limit PREEMPT_RCU configurations.
  - Make rcutorture senario TREE07 and senario TREE10 use PREEMPT_LAZY=y.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEj5IosQTPz8XU1wRHSXnow7UH+rgFAmfeBLQACgkQSXnow7UH
 +rh11Qf/Rt6IZJ/YT/V9Sd+8hMx4O0BMh779pr9cD6mbAG+FDk2Yeva1m8vIdFOb
 qId6oc8K/ef2JfFjSn0oHMzQP2D3XUyiJWPNbBDHv/D8Os8GZgjzu8dkxVkSbdbY
 OxtvIflbcqFN1JDJfGKZnTEW0/YxGqfnS9b6R7iyyA7SOGQ/WubGOE5qNCqPufc9
 zJiP+qTUFYQzCIiPlEJul39o9KboPogbt3QAAQjWmi3utd77ehJnm/15FvAjyau4
 uhC2cnGfMY535rQaiaQeBQ/IHIowKripCq0JQFvcUNdyArZM3HOI2x79+2II6ft7
 mjHskNODOIJHfW2o1RzQ0yRYAywFIg==
 =J+mH
 -----END PGP SIGNATURE-----

Merge tag 'rcu-next-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux

Pull RCU updates from Boqun Feng:
 "Documentation:
   - Add broken-timing possibility to stallwarn.rst
   - Improve discussion of this_cpu_ptr(), add raw_cpu_ptr()
   - Document self-propagating callbacks
   - Point call_srcu() to call_rcu() for detailed memory ordering
   - Add CONFIG_RCU_LAZY delays to call_rcu() kernel-doc header
   - Clarify RCU_LAZY and RCU_LAZY_DEFAULT_OFF help text
   - Remove references to old grace-period-wait primitives

  srcu:
   - Introduce srcu_read_{un,}lock_fast(), which is similar to
     srcu_read_{un,}lock_lite(): avoid smp_mb()s in lock and unlock
     at the cost of calling synchronize_rcu() in synchronize_srcu()

     Moreover, by returning the percpu offset of the counter at
     srcu_read_lock_fast() time, srcu_read_unlock_fast() can avoid
     extra pointer dereferencing, which makes it faster than
     srcu_read_{un,}lock_lite()

     srcu_read_{un,}lock_fast() are intended to replace
     rcu_read_{un,}lock_trace() if possible

  RCU torture:
   - Add get_torture_init_jiffies() to return the start time of the test
   - Add a test_boost_holdoff module parameter to allow delaying
     boosting tests when building rcutorture as built-in
   - Add grace period sequence number logging at the beginning and end
     of failure/close-call results
   - Switch to hexadecimal for the expedited grace period sequence
     number in the rcu_exp_grace_period trace point
   - Make cur_ops->format_gp_seqs take buffer length
   - Move RCU_TORTURE_TEST_{CHK_RDR_STATE,LOG_CPU} to bool
   - Complain when invalid SRCU reader_flavor is specified
   - Add FORCE_NEED_SRCU_NMI_SAFE Kconfig for testing, which forces SRCU
     uses atomics even when percpu ops are NMI safe, and use the Kconfig
     for SRCU lockdep testing

  Misc:
   - Split rcu_report_exp_cpu_mult() mask parameter and use for tracing
   - Remove READ_ONCE() for rdp->gpwrap access in __note_gp_changes()
   - Fix get_state_synchronize_rcu_full() GP-start detection
   - Move RCU Tasks self-tests to core_initcall()
   - Print segment lengths in show_rcu_nocb_gp_state()
   - Make RCU watch ct_kernel_exit_state() warning
   - Flush console log from kernel_power_off()
   - rcutorture: Allow a negative value for nfakewriters
   - rcu: Update TREE05.boot to test normal synchronize_rcu()
   - rcu: Use _full() API to debug synchronize_rcu()

  Make RCU handle PREEMPT_LAZY better:
   - Fix header guard for rcu_all_qs()
   - rcu: Rename PREEMPT_AUTO to PREEMPT_LAZY
   - Update __cond_resched comment about RCU quiescent states
   - Handle unstable rdp in rcu_read_unlock_strict()
   - Handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y
   - osnoise: Provide quiescent states
   - Adjust rcutorture with possible PREEMPT_RCU=n && PREEMPT_COUNT=y
     combination
   - Limit PREEMPT_RCU configurations
   - Make rcutorture senario TREE07 and senario TREE10 use
     PREEMPT_LAZY=y"

* tag 'rcu-next-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (59 commits)
  rcutorture: Make scenario TREE07 build CONFIG_PREEMPT_LAZY=y
  rcutorture: Make scenario TREE10 build CONFIG_PREEMPT_LAZY=y
  rcu: limit PREEMPT_RCU configurations
  rcutorture: Update ->extendables check for lazy preemption
  rcutorture: Update rcutorture_one_extend_check() for lazy preemption
  osnoise: provide quiescent states
  rcu: Use _full() API to debug synchronize_rcu()
  rcu: Update TREE05.boot to test normal synchronize_rcu()
  rcutorture: Allow a negative value for nfakewriters
  Flush console log from kernel_power_off()
  context_tracking: Make RCU watch ct_kernel_exit_state() warning
  rcu/nocb: Print segment lengths in show_rcu_nocb_gp_state()
  rcu-tasks: Move RCU Tasks self-tests to core_initcall()
  rcu: Fix get_state_synchronize_rcu_full() GP-start detection
  torture: Make SRCU lockdep testing use srcu_read_lock_nmisafe()
  srcu: Add FORCE_NEED_SRCU_NMI_SAFE Kconfig for testing
  rcutorture: Complain when invalid SRCU reader_flavor is specified
  rcutorture: Move RCU_TORTURE_TEST_{CHK_RDR_STATE,LOG_CPU} to bool
  rcutorture: Make cur_ops->format_gp_seqs take buffer length
  rcutorture: Add ftrace-compatible timestamp to GP# failure/close-call output
  ...
2025-03-24 19:41:37 -07:00
Gabriele Monaco
fbe6c09b7e rv: Add scpd, snep and sncid per-cpu monitors
Add 3 per-cpu monitors as part of the sched model:

* scpd: schedule called with preemption disabled
    Monitor to ensure schedule is called with preemption disabled
* snep: schedule does not enable preempt
    Monitor to ensure schedule does not enable preempt
* sncid: schedule not called with interrupt disabled
    Monitor to ensure schedule is not called with interrupt disabled

To: Ingo Molnar <mingo@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Kacur <jkacur@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Link: https://lore.kernel.org/20250305140406.350227-6-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-24 17:27:39 -04:00
Gabriele Monaco
93bac9cf35 rv: Add snroc per-task monitor
Add a per-task monitor as part of the sched model:

* snroc: set non runnable on its own context
    Monitor to ensure set_state happens only in the respective task's context

To: Ingo Molnar <mingo@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Kacur <jkacur@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Link: https://lore.kernel.org/20250305140406.350227-5-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-24 17:27:39 -04:00
Gabriele Monaco
9fd420abc4 rv: Add sco and tss per-cpu monitors
Add 2 per-cpu monitors as part of the sched model:

* sco: scheduling context operations
    Monitor to ensure sched_set_state happens only in thread context
* tss: task switch while scheduling
    Monitor to ensure sched_switch happens only in scheduling context

To: Ingo Molnar <mingo@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Kacur <jkacur@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Link: https://lore.kernel.org/20250305140406.350227-4-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-24 17:27:39 -04:00
Gabriele Monaco
cb85c660fc rv: Add option for nested monitors and include sched
Monitors describing complex systems, such as the scheduler, can easily
grow to the point where they are just hard to understand because of the
many possible state transitions.
Often it is possible to break such descriptions into smaller monitors,
sharing some or all events. Enabling those smaller monitors concurrently
is, in fact, testing the system as if we had one single larger monitor.
Splitting models into multiple specification is not only easier to
understand, but gives some more clues when we see errors.

Add the possibility to create container monitors, whose only purpose is
to host other nested monitors. Enabling a container monitor enables all
nested ones, but it's still possible to enable nested monitors
independently.
Add the sched monitor as first container, for now empty.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/20250305140406.350227-3-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-24 17:27:39 -04:00
Steven Rostedt
8eb1518642 tracing: Do not use PERF enums when perf is not defined
An update was made to up the module ref count when a synthetic event is
registered for both trace and perf events. But if perf is not configured
in, the perf enums used will cause the kernel to fail to build.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Link: https://lore.kernel.org/20250323152151.528b5ced@batman.local.home
Fixes: 21581dd4e7 ("tracing: Ensure module defining synth event cannot be unloaded while tracing")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202503232230.TeREVy8R-lkp@intel.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-24 17:12:33 -04:00
Sasha Levin
391dda1bd7 tracing: Use hashtable.h for event_hash
Convert the event_hash array in trace_output.c to use the generic
hashtable implementation from hashtable.h instead of the manually
implemented hash table.

This simplifies the code and makes it more maintainable by using the
standard hashtable API defined in hashtable.h.

Rename EVENT_HASHSIZE to EVENT_HASH_BITS to properly reflect its new
meaning as the number of bits for the hashtable size.

Link: https://lore.kernel.org/20250323132800.3010783-1-sashal@kernel.org
Link: https://lore.kernel.org/20250319190545.3058319-1-sashal@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-23 15:26:14 -04:00
Douglas Raillard
21581dd4e7 tracing: Ensure module defining synth event cannot be unloaded while tracing
Currently, using synth_event_delete() will fail if the event is being
used (tracing in progress), but that is normally done in the module exit
function. At that stage, failing is problematic as returning a non-zero
status means the module will become locked (impossible to unload or
reload again).

Instead, ensure the module exit function does not get called in the
first place by increasing the module refcnt when the event is enabled.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 35ca5207c2 ("tracing: Add synthetic event command generation functions")
Link: https://lore.kernel.org/20250318180906.226841-1-douglas.raillard@arm.com
Signed-off-by: Douglas Raillard <douglas.raillard@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-23 08:34:31 -04:00
Gabriele Paoloni
0c588ac0ca tracing: fix return value in __ftrace_event_enable_disable for TRACE_REG_UNREGISTER
When __ftrace_event_enable_disable invokes the class callback to
unregister the event, the return value is not reported up to the
caller, hence leading to event unregister failures being silently
ignored.

This patch assigns the ret variable to the invocation of the
event unregister callback, so that its return value is stored
and reported to the caller, and it raises a warning in case
of error.

Link: https://lore.kernel.org/20250321170821.101403-1-gpaoloni@redhat.com
Signed-off-by: Gabriele Paoloni <gpaoloni@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-23 08:34:31 -04:00
Ran Xiaokai
7e6b3fcc9c tracing/osnoise: Fix possible recursive locking for cpus_read_lock()
Lockdep reports this deadlock log:

osnoise: could not start sampling thread
============================================
WARNING: possible recursive locking detected
--------------------------------------------
       CPU0
       ----
  lock(cpu_hotplug_lock);
  lock(cpu_hotplug_lock);

 Call Trace:
  <TASK>
  print_deadlock_bug+0x282/0x3c0
  __lock_acquire+0x1610/0x29a0
  lock_acquire+0xcb/0x2d0
  cpus_read_lock+0x49/0x120
  stop_per_cpu_kthreads+0x7/0x60
  start_kthread+0x103/0x120
  osnoise_hotplug_workfn+0x5e/0x90
  process_one_work+0x44f/0xb30
  worker_thread+0x33e/0x5e0
  kthread+0x206/0x3b0
  ret_from_fork+0x31/0x50
  ret_from_fork_asm+0x11/0x20
  </TASK>

This is the deadlock scenario:
osnoise_hotplug_workfn()
  guard(cpus_read_lock)();      // first lock call
  start_kthread(cpu)
    if (IS_ERR(kthread)) {
      stop_per_cpu_kthreads(); {
        cpus_read_lock();      // second lock call. Cause the AA deadlock
      }
    }

It is not necessary to call stop_per_cpu_kthreads() which stops osnoise
kthread for every other CPUs in the system if a failure occurs during
hotplug of a certain CPU.
For start_per_cpu_kthreads(), if the start_kthread() call fails,
this function calls stop_per_cpu_kthreads() to handle the error.
Therefore, similarly, there is no need to call stop_per_cpu_kthreads()
again within start_kthread().
So just remove stop_per_cpu_kthreads() from start_kthread to solve this issue.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250321095249.2739397-1-ranxiaokai627@163.com
Fixes: c8895e271f ("trace/osnoise: Support hotplug operations")
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-23 08:34:31 -04:00
Douglas Raillard
81c7a515b0 tracing: Align synth event print fmt
The vast majority of ftrace event print fmt consist of a space-separated
field=value pair. Synthetic event currently use a comma-separated
field=value pair, which sticks out from events created via more
classical means.

Align the format of synth events so they look just like any other event,
for better consistency and less headache when doing crude text-based
data processing.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250319215028.1680278-1-douglas.raillard@arm.com
Signed-off-by: Douglas Raillard <douglas.raillard@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-23 08:34:31 -04:00
Tengda Wu
7f81f27b10 tracing: Fix use-after-free in print_graph_function_flags during tracer switching
Kairui reported a UAF issue in print_graph_function_flags() during
ftrace stress testing [1]. This issue can be reproduced if puting a
'mdelay(10)' after 'mutex_unlock(&trace_types_lock)' in s_start(),
and executing the following script:

  $ echo function_graph > current_tracer
  $ cat trace > /dev/null &
  $ sleep 5  # Ensure the 'cat' reaches the 'mdelay(10)' point
  $ echo timerlat > current_tracer

The root cause lies in the two calls to print_graph_function_flags
within print_trace_line during each s_show():

  * One through 'iter->trace->print_line()';
  * Another through 'event->funcs->trace()', which is hidden in
    print_trace_fmt() before print_trace_line returns.

Tracer switching only updates the former, while the latter continues
to use the print_line function of the old tracer, which in the script
above is print_graph_function_flags.

Moreover, when switching from the 'function_graph' tracer to the
'timerlat' tracer, s_start only calls graph_trace_close of the
'function_graph' tracer to free 'iter->private', but does not set
it to NULL. This provides an opportunity for 'event->funcs->trace()'
to use an invalid 'iter->private'.

To fix this issue, set 'iter->private' to NULL immediately after
freeing it in graph_trace_close(), ensuring that an invalid pointer
is not passed to other tracers. Additionally, clean up the unnecessary
'iter->private = NULL' during each 'cat trace' when using wakeup and
irqsoff tracers.

 [1] https://lore.kernel.org/all/20231112150030.84609-1-ryncsn@gmail.com/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Link: https://lore.kernel.org/20250320122137.23635-1-wutengda@huaweicloud.com
Fixes: eecb91b9f9 ("tracing: Fix memleak due to race between current_tracer and trace")
Closes: https://lore.kernel.org/all/CAMgjq7BW79KDSCyp+tZHjShSzHsScSiJxn5ffskp-QzVM06fxw@mail.gmail.com/
Reported-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-22 05:42:42 -04:00
Christophe JAILLET
502d2e71a8 tracing: Constify struct event_trigger_ops
'event_trigger_ops mwifiex_if_ops' are not modified in these drivers.

Constifying these structures moves some data to a read-only section, so
increase overall security, especially when the structure holds some
function pointers.

On a x86_64, with allmodconfig, as an example:
Before:
======
   text	   data	    bss	    dec	    hex	filename
  31368	   9024	   6200	  46592	   b600	kernel/trace/trace_events_trigger.o

After:
=====
   text	   data	    bss	    dec	    hex	filename
  31752	   8608	   6200	  46560	   b5e0	kernel/trace/trace_events_trigger.o

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/66e8f990e649678e4be37d4d1a19158ca0dea2f4.1741521295.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-20 07:04:00 -04:00
Ingo Molnar
89771319e0 Linux 6.14-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmfXVtUeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGN/sH/i5423Gt/z51gDjA
 s4v5Z7GaBJ9zOGBahn2RWFe72zytTqKrEJmMnGfguirs0atD1DtQj4WAP7iFKP+e
 WyO663X6HF7i5y37ja0Yd4PZc31hwtqzKH8LjBf8f8tTy8UsEVqumdi5A4sS9KTM
 qm4kTyyVEY9D/s7oRY8ywjDlRJtO6nT0aKMp4kAqNEbrNUYbilT/a0hgXcgSmPyB
 uIjmjL2fZfutxGI5LgfbaSHCa1ElmhvTvivOMpaAmZSGCRVHCKGgT0CTNnHyn/7C
 dB145JkRO4ZOUqirCdO4PE/23id3ajq9fcixJGBzAv7c45y+B3JZ1r2kAfKalE8/
 qrOKLys=
 =8r7a
 -----END PGP SIGNATURE-----

Merge tag 'v6.14-rc7' into x86/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-19 11:03:06 +01:00
Emil Tsalapatis
ae0a457f5d bpf: Make perf_event_read_output accessible in all program types.
The perf_event_read_event_output helper is currently only available to
tracing protrams, but is useful for other BPF programs like sched_ext
schedulers. When the helper is available, provide its bpf_func_proto
directly from the bpf base_proto.

Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20250318030753.10949-1-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-18 10:21:59 -07:00
Linus Torvalds
47c7efa4f0 Probes fixes for v6.14-rc6:
- tprobe-events: Fix to clean up tprobe correctly when module unload
   tprobe (Tracepoint probe) event does not set TRACEPOINT_STUB to its
   'tpoint' pointer when unloading module, thus it is shown as a normal
   'fprobe' instead of 'tprobe' and never comes back.
 - tprobe-events: Fix leakage of module refcount
   When a tprobe's target module is loaded, it gets the module's refcount
   in the module notifier but forgot to put it after registering the probe
   on it. Fix to get the refcount only when registering tprobe.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmfYFpobHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bi0kH/3wSyMa7wP7pSPQ1Hq8i
 ZLeKa7pSS3X39IE6urvpYrTUBRw75tqSSozyF6g8jM3Tj8ISpF9rlZJ6ct775mTR
 9Ld9wPNgnj0pCZU29VGf9TPm3JCm5V0h0bhE9qZp8wZRiedbwg9e6uZ92LBLAmW+
 MisEmLsKrYcOUF1bwhLwoNRzlwidhHKCBAF6vUqVC62kktCe08XKKmiusslVfEZz
 KrU0XM1MPnncWd2Cp3qw9ywcC40ES61lBdhMjJcGM3yKqKqlJSQxIBa5rOl0TlJA
 D58bEpPhoyGT3sWS7UlqVjdkHy9NtpFx3nvcvPGCrDMLl9qwj4aSrwUBjK693RS7
 Dyo=
 =D3X0
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fixes from Masami Hiramatsu:

 - Clean up tprobe correctly when module unload

   Tracepoint probes do not set TRACEPOINT_STUB on the 'tpoint' pointer
   when unloading a module, thus they show as a normal 'fprobe' instead
   of 'tprobe' and never come back

 - Fix leakage of tprobe module refcount

   When a tprobe's target module is loaded, it gets the module's
   refcount in the module notifier but forgot to put it after
   registering the probe on it.

   Fix it by getting the refcount only when registering tprobe.

* tag 'probes-fixes-v6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: tprobe-events: Fix leakage of module refcount
  tracing: tprobe-events: Fix to clean up tprobe correctly when module unload
2025-03-17 14:30:31 -07:00
Sebastian Andrzej Siewior
3bffa47a02 tracing: Use preempt_model_str()
Use preempt_model_str() instead of manually conducting the preemption
model.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20250314160810.2373416-10-bigeasy@linutronix.de
2025-03-17 11:23:41 +01:00
Linus Torvalds
ad87a8d0c4 Fix ref count of trace_array in error path of histogram file open
Tracing instances have a ref count to keep them around while files within
 their directories are open. This prevents them from being deleted while
 they are used. The histogram code had some files that needed to take the
 ref count and that was added, but the error paths did not decrement the
 ref counts. This caused the instances from ever being removed if a
 histogram file failed to open due to some error.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ9cE8hQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qtm8AQDrdfwo63g/K0xpM0Uej9Ex5pBkHEq/
 WwCI2hUbPya0wAEAtEcfvgt7P0GuO0PNOYY3uzhAokfBiN04elatbLEetQ4=
 =i35g
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fix from Steven Rostedt:
 "Fix ref count of trace_array in error path of histogram file open

  Tracing instances have a ref count to keep them around while files
  within their directories are open. This prevents them from being
  deleted while they are used.

  The histogram code had some files that needed to take the ref count
  and that was added, but the error paths did not decrement the ref
  counts. This caused the instances from ever being removed if a
  histogram file failed to open due to some error"

* tag 'trace-v6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Correct the refcount if the hist/hist_debug file fails to open
2025-03-16 09:05:00 -10:00
Masami Hiramatsu (Google)
ac91052f0a tracing: tprobe-events: Fix leakage of module refcount
When enabling the tracepoint at loading module, the target module
refcount is incremented by find_tracepoint_in_module(). But it is
unnecessary because the module is not unloaded while processing
module loading callbacks.
Moreover, the refcount is not decremented in that function.
To be clear the module refcount handling, move the try_module_get()
callsite to trace_fprobe_create_internal(), where it is actually
required.

Link: https://lore.kernel.org/all/174182761071.83274.18334217580449925882.stgit@devnote2/

Fixes: 57a7e6de9e ("tracing/fprobe: Support raw tracepoints on future loaded modules")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
2025-03-15 08:37:47 +09:00
Masami Hiramatsu (Google)
0a8bb688aa tracing: tprobe-events: Fix to clean up tprobe correctly when module unload
When unloading module, the tprobe events are not correctly cleaned
up. Thus it becomes `fprobe-event` and never be enabled again even
if loading the same module again.

For example;

 # cd /sys/kernel/tracing
 # modprobe trace_events_sample
 # echo 't:my_tprobe foo_bar' >> dynamic_events
 # cat dynamic_events
t:tracepoints/my_tprobe foo_bar
 # rmmod trace_events_sample
 # cat dynamic_events
f:tracepoints/my_tprobe foo_bar

As you can see, the second time my_tprobe starts with 'f' instead
of 't'.

This unregisters the fprobe and tracepoint callback when module is
unloaded but marks the fprobe-event is tprobe-event.

Link: https://lore.kernel.org/all/174158724946.189309.15826571379395619524.stgit@mhiramat.tok.corp.google.com/

Fixes: 57a7e6de9e ("tracing/fprobe: Support raw tracepoints on future loaded modules")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-03-15 08:37:12 +09:00
Tengda Wu
0b4ffbe488 tracing: Correct the refcount if the hist/hist_debug file fails to open
The function event_{hist,hist_debug}_open() maintains the refcount of
'file->tr' and 'file' through tracing_open_file_tr(). However, it does
not roll back these counts on subsequent failure paths, resulting in a
refcount leak.

A very obvious case is that if the hist/hist_debug file belongs to a
specific instance, the refcount leak will prevent the deletion of that
instance, as it relies on the condition 'tr->ref == 1' within
__remove_instance().

Fix this by calling tracing_release_file_tr() on all failure paths in
event_{hist,hist_debug}_open() to correct the refcount.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Link: https://lore.kernel.org/20250314065335.1202817-1-wutengda@huaweicloud.com
Fixes: 1cc111b9cd ("tracing: Fix uaf issue when open the hist or hist_debug file")
Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-14 08:29:12 -04:00
Sebastian Andrzej Siewior
8c6eb7ca86 bpf: Use RCU in all users of __module_text_address().
__module_address() can be invoked within a RCU section, there is no
requirement to have preemption disabled.

Replace the preempt_disable() section around __module_address() with
RCU.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matt Bobrowski <mattbobrowski@google.com>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Cc: linux-trace-kernel@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20250129084751.tH6iidUO@linutronix.de
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-03-10 11:54:46 +01:00
Sebastian Andrzej Siewior
febaa65c94 module: Use RCU in find_module_all().
The modules list and module::kallsyms can be accessed under RCU
assumption.

Remove module_assert_mutex_or_preempt() from find_module_all() so it can
be used under RCU protection without warnings. Update its callers to use
RCU protection instead of preempt_disable().

Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-trace-kernel@vger.kernel.org
Cc: live-patching@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20250108090457.512198-7-bigeasy@linutronix.de
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-03-10 11:54:44 +01:00
Jiapeng Chong
5ba8f4a39e function_graph: Remove the unused variable func
Variable func is not effectively used, so delete it.

kernel/trace/trace_functions_graph.c:925:16: warning: variable ‘func’ set but not used.

This happened because the variable "func" which came from "call->func" was
replaced by "ret_func" coming from "graph_ret->func" but "func" wasn't
removed after the replacement.

Link: https://lore.kernel.org/20250307021412.119107-1-jiapeng.chong@linux.alibaba.com
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=19250
Fixes: ff5c9c576e ("ftrace: Add support for function argument to graph tracer")
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-07 12:34:07 -05:00
Christophe JAILLET
06889030f5 tracing/user_events: Slightly simplify user_seq_show()
2 seq_puts() calls can be merged.

It saves a few lines of code and a few cycles, should it matter.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/845caa94b74cea8d72c158bf1994fe250beee28c.1739979791.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-06 13:35:27 -05:00
Thomas Weißschuh
effd1059c4 tracing/user_events: Don't use %pK through printk
Restricted pointers ("%pK") are not meant to be used through printk().
It can unintentionally expose security sensitive, raw pointer values.

Use regular pointer formatting instead.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250217-restricted-pointers-trace-v1-1-bbe9ea279848@linutronix.de
Link: https://lore.kernel.org/lkml/20250113171731-dc10e3c1-da64-4af0-b767-7c7070468023@linutronix.de/
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-06 13:35:26 -05:00
Zhouyi Zhou
3ca4d7af35 ring-buffer: Fix typo in comment about header page pointer
Fix typo in comment about header page pointer in function
rb_get_reader_page.

Link: https://lore.kernel.org/20250118012352.3430519-1-zhouzhouyi@gmail.com
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-06 13:35:26 -05:00
Ankur Arora
9fd858cc5a osnoise: provide quiescent states
To reduce RCU noise for nohz_full configurations, osnoise depends
on cond_resched() providing quiescent states for PREEMPT_RCU=n
configurations. For PREEMPT_RCU=y configurations -- where
cond_resched() is a stub -- we do this by directly calling
rcu_momentary_eqs().

With (PREEMPT_LAZY=y, PREEMPT_DYNAMIC=n), however, we have a
configuration with (PREEMPTION=y, PREEMPT_RCU=n) where neither
of the above can help.

Handle that by providing an explicit quiescent state here for all
configurations.

As mentioned above this is not needed for non-stubbed cond_resched(),
but, providing a quiescent state here just pulls in one that a future
cond_resched() would provide, so doesn't cause any extra work for
this configuration.

Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-03-04 18:46:09 -08:00
Gabriele Monaco
41a4d2d3e3 rv: Add license identifiers to monitor files
Some monitor files like the main header and the Kconfig are missing the
license identifier.

Add it to those and make sure the automatic generation script includes
the line in newly created monitors.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/20250218123121.253551-3-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-04 12:11:08 -05:00
Sven Schnelle
76fe0337c2 ftrace: Add arguments to function tracer
Wire up the code to print function arguments in the function tracer.
This functionality can be enabled/disabled during runtime with
options/func-args.

        ping-689     [004] b....    77.170220: dummy_xmit(skb = 0x82904800, dev = 0x882d0000) <-dev_hard_start_xmit

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Guo Ren <guoren@kernel.org>
Cc: Donglin Peng <dolinux.peng@gmail.com>
Cc: Zheng Yejian <zhengyejian@huaweicloud.com>
Link: https://lore.kernel.org/20250227185823.154996172@goodmis.org
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-04 11:27:24 -05:00
Steven Rostedt
c7a60a733c ftrace: Have funcgraph-args take affect during tracing
Currently, when function_graph is started, it looks at the option
funcgraph-args, and if it is set, it will enable tracing of the arguments.

But if tracing is already running, and the user enables funcgraph-args, it
will have no effect. Instead, it should enable argument tracing when it is
enabled, even if it means disabling the function graph tracing for a short
time in order to do the transition.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Guo Ren <guoren@kernel.org>
Cc: Donglin Peng <dolinux.peng@gmail.com>
Cc: Zheng Yejian <zhengyejian@huaweicloud.com>
Link: https://lore.kernel.org/20250227185822.978998710@goodmis.org
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-04 11:27:23 -05:00
Sven Schnelle
ff5c9c576e ftrace: Add support for function argument to graph tracer
Wire up the code to print function arguments in the function graph
tracer. This functionality can be enabled/disabled during runtime with
options/funcgraph-args.

Example usage:

6)              | dummy_xmit [dummy](skb = 0x8887c100, dev = 0x872ca000) {
6)              |   consume_skb(skb = 0x8887c100) {
6)              |     skb_release_head_state(skb = 0x8887c100) {
6)  0.178 us    |       sock_wfree(skb = 0x8887c100)
6)  0.627 us    |     }

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Guo Ren <guoren@kernel.org>
Cc: Donglin Peng <dolinux.peng@gmail.com>
Cc: Zheng Yejian <zhengyejian@huaweicloud.com>
Link: https://lore.kernel.org/20250227185822.810321199@goodmis.org
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-04 11:27:23 -05:00
Sven Schnelle
533c20b062 ftrace: Add print_function_args()
Add a function to decode argument types with the help of BTF. Will
be used to display arguments in the function and function graph
tracer.

It can only handle simply arguments and up to FTRACE_REGS_MAX_ARGS number
of arguments. When it hits a max, it will print ", ...":

   page_to_skb(vi=0xffff8d53842dc980, rq=0xffff8d53843a0800, page=0xfffffc2e04337c00, offset=6160, len=64, truesize=1536, ...)

And if it hits an argument that is not recognized, it will print the raw
value and the type of argument it is:

   make_vfsuid(idmap=0xffffffff87f99db8, fs_userns=0xffffffff87e543c0, kuid=0x0 (STRUCT))
   __pti_set_user_pgtbl(pgdp=0xffff8d5384ab47f8, pgd=0x110e74067 (STRUCT))

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Guo Ren <guoren@kernel.org>
Cc: Donglin Peng <dolinux.peng@gmail.com>
Cc: Zheng Yejian <zhengyejian@huaweicloud.com>
Link: https://lore.kernel.org/20250227185822.639418500@goodmis.org
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-04 11:27:23 -05:00
Steven Rostedt
0c667775fe ftrace: Have ftrace_free_filter() WARN and exit if ops is active
The ftrace_free_filter() is used to reset the ops filters. But it must be
done if the ops is not currently active (tracing). If it is, it will mess
up the ftrace accounting of what functions are attached and what is not.

WARN and exit the ftrace_free_filter() if the ops is active when it is
called.

Currently, it doesn't seem if anything does this, but it may in the
future.

Link: https://lore.kernel.org/all/20250219095330.2e9f171c@gandalf.local.home/

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250219135040.3a9fbe00@gandalf.local.home
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-04 11:26:48 -05:00
Haiyue Wang
97d6a9c4b3 fgraph: Correct typo in ftrace_return_to_handler comment
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250218122052.58348-1-haiyuewa@163.com
Signed-off-by: Haiyue Wang <haiyuewa@163.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-04 11:15:50 -05:00
Ingo Molnar
1fff9f8730 Linux 6.14-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmfEtgQeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGUI8H/0TJm0ia8TOseHDT
 bV0alw4qlNdq7Thi7M5HCsInzyhKImdHDBd/cY3HI2QRZS927WoPHUExDvEd6vUE
 sETrrBrl/iGJ5nLvXPKkCxXC43ZD3nCI/chBPWwspH2DA/Nih8LAmMkESGVFC7fa
 TUOKqY1FsAWMYUJ64hP4s9Dwi5XECps7/Di46ypgtr7sVA15jpfF3ePi2mfR73Bm
 hSfF7E5Xa3E22IBE2NPxvO4fHiYJWbNJk8Vv2ewfHroE/zKsJ/zCMk9mOtFII0P/
 TODkBSImFqUx+cDZVc0bJAy8rwA2lNXo3LU28N0Ca4EAoqyyIIdTPQ2wYA82Z2Y2
 hE/BeN0=
 =a9Md
 -----END PGP SIGNATURE-----

Merge tag 'v6.14-rc5' into x86/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-03 21:05:45 +01:00
Linus Torvalds
26edad06d5 Probes fixes for v6.14-rc4:
- probe-events: Some issues are fixed.
  . probe-events: Remove unused MAX_ARG_BUF_LEN macro.
    MAX_ARG_BUF_LEN is not used so remove it.
  . fprobe-events: Log error for exceeding the number of entry args.
    Since the max number of entry args is limited, it should be checked
    and rejected when the parser detects it.
  . tprobe-events: Reject invalid tracepoint name
    User can specify an invalid tracepoint name e.g. including '/', then
    the new event is not defined correctly in the eventfs.
  . tprobe-events: Fix a memory leak when tprobe defined with $retval
    There is a memory leak if tprobe is defined with $retval.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmfFKkcbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b/F4H/10qmUSsec9+IbQseg0E
 MSRxAhJQ+xOcLfGsWhblW2zirkw9o4PghZYwBodkastu4Wgq2M5ASKd6KqUY2o7D
 CX+tCoXf80SDLEVd2go5m72Ml40rrGDEgLvS5YcEa4Iqr5nPZrvCJ7rl2tlqupQH
 W2ttOTkX9H28phAFDCsdl5ZJUCJRxlFc6fYG0yZYHsFdRub9J2LPiMTMwIlu56YS
 8HH3NxS+wxlKK2I4VfD8mFsOnrNh7MFDLOOwNMlKWvm2wSPbPmVho+eXLAc5xyTO
 d+vUpkp4Dp9WWCLuNdO/sqY0IKngO2sM++WbtL/YPP8YijqsrImep4PCR8/fvlN6
 Urs=
 =dyZm
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probe events fixes from Masami Hiramatsu:

 - probe-events: Remove unused MAX_ARG_BUF_LEN macro - it is not used

 - fprobe-events: Log error for exceeding the number of entry args.

   Since the max number of entry args is limited, it should be checked
   and rejected when the parser detects it.

 - tprobe-events: Reject invalid tracepoint name

   If a user specifies an invalid tracepoint name (e.g. including '/')
   then the new event is not defined correctly in the eventfs.

 - tprobe-events: Fix a memory leak when tprobe defined with $retval

   There is a memory leak if tprobe is defined with $retval.

* tag 'probes-fixes-v6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: probe-events: Remove unused MAX_ARG_BUF_LEN macro
  tracing: fprobe-events: Log error for exceeding the number of entry args
  tracing: tprobe-events: Reject invalid tracepoint name
  tracing: tprobe-events: Fix a memory leak when tprobe with $retval
2025-03-03 07:28:15 -10:00
Masami Hiramatsu (Google)
fd5ba38390 tracing: probe-events: Remove unused MAX_ARG_BUF_LEN macro
Commit 18b1e870a4 ("tracing/probes: Add $arg* meta argument for all
function args") introduced MAX_ARG_BUF_LEN but it is not used.
Remove it.

Link: https://lore.kernel.org/all/174055075876.4079315.8805416872155957588.stgit@mhiramat.tok.corp.google.com/

Fixes: 18b1e870a4 ("tracing/probes: Add $arg* meta argument for all function args")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-03 11:17:54 +09:00
Nikolay Kuratov
a1a7eb89ca ftrace: Avoid potential division by zero in function_stat_show()
Check whether denominator expression x * (x - 1) * 1000 mod {2^32, 2^64}
produce zero and skip stddev computation in that case.

For now don't care about rec->counter * rec->counter overflow because
rec->time * rec->time overflow will likely happen earlier.

Cc: stable@vger.kernel.org
Cc: Wen Yang <wenyang@linux.alibaba.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250206090156.1561783-1-kniv@yandex-team.ru
Fixes: e31f7939c1 ("ftrace: Avoid potential division by zero in function profiler")
Signed-off-by: Nikolay Kuratov <kniv@yandex-team.ru>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-27 21:02:10 -05:00
Steven Rostedt
6f86bdeab6 tracing: Fix bad hist from corrupting named_triggers list
The following commands causes a crash:

 ~# cd /sys/kernel/tracing/events/rcu/rcu_callback
 ~# echo 'hist:name=bad:keys=common_pid:onmax(bogus).save(common_pid)' > trigger
 bash: echo: write error: Invalid argument
 ~# echo 'hist:name=bad:keys=common_pid' > trigger

Because the following occurs:

event_trigger_write() {
  trigger_process_regex() {
    event_hist_trigger_parse() {

      data = event_trigger_alloc(..);

      event_trigger_register(.., data) {
        cmd_ops->reg(.., data, ..) [hist_register_trigger()] {
          data->ops->init() [event_hist_trigger_init()] {
            save_named_trigger(name, data) {
              list_add(&data->named_list, &named_triggers);
            }
          }
        }
      }

      ret = create_actions(); (return -EINVAL)
      if (ret)
        goto out_unreg;
[..]
      ret = hist_trigger_enable(data, ...) {
        list_add_tail_rcu(&data->list, &file->triggers); <<<---- SKIPPED!!! (this is important!)
[..]
 out_unreg:
      event_hist_unregister(.., data) {
        cmd_ops->unreg(.., data, ..) [hist_unregister_trigger()] {
          list_for_each_entry(iter, &file->triggers, list) {
            if (!hist_trigger_match(data, iter, named_data, false))   <- never matches
                continue;
            [..]
            test = iter;
          }
          if (test && test->ops->free) <<<-- test is NULL

            test->ops->free(test) [event_hist_trigger_free()] {
              [..]
              if (data->name)
                del_named_trigger(data) {
                  list_del(&data->named_list);  <<<<-- NEVER gets removed!
                }
              }
           }
         }

         [..]
         kfree(data); <<<-- frees item but it is still on list

The next time a hist with name is registered, it causes an u-a-f bug and
the kernel can crash.

Move the code around such that if event_trigger_register() succeeds, the
next thing called is hist_trigger_enable() which adds it to the list.

A bunch of actions is called if get_named_trigger_data() returns false.
But that doesn't need to be called after event_trigger_register(), so it
can be moved up, allowing event_trigger_register() to be called just
before hist_trigger_enable() keeping them together and allowing the
file->triggers to be properly populated.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250227163944.1c37f85f@gandalf.local.home
Fixes: 067fe038e7 ("tracing: Add variable reference handling to hist triggers")
Reported-by: Tomas Glozar <tglozar@redhat.com>
Tested-by: Tomas Glozar <tglozar@redhat.com>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Closes: https://lore.kernel.org/all/CAP4=nvTsxjckSBTz=Oe_UYh8keD9_sZC4i++4h72mJLic4_W4A@mail.gmail.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-27 21:01:34 -05:00
Tomas Glozar
a065bbf776 trace/osnoise: Add trace events for samples
Add trace events that fire at osnoise and timerlat sample generation, in
addition to the already existing noise and threshold events.

This allows processing the samples directly in the kernel, either with
ftrace triggers or with BPF.

Cc: John Kacur <jkacur@redhat.com>
Cc: Luis Goncalves <lgoncalv@redhat.com>
Link: https://lore.kernel.org/20250203090418.1458923-1-tglozar@redhat.com
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Tested-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-26 19:44:30 -05:00
Masami Hiramatsu (Google)
db5e228611 tracing: fprobe-events: Log error for exceeding the number of entry args
Add error message when the number of entry argument exceeds the
maximum size of entry data.
This is currently checked when registering fprobe, but in this case
no error message is shown in the error_log file.

Link: https://lore.kernel.org/all/174055074269.4079315.17809232650360988538.stgit@mhiramat.tok.corp.google.com/

Fixes: 25f00e40ce ("tracing/probes: Support $argN in return probe (kprobe and fprobe)")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-27 09:11:51 +09:00
Masami Hiramatsu (Google)
d0453655b6 tracing: tprobe-events: Reject invalid tracepoint name
Commit 57a7e6de9e ("tracing/fprobe: Support raw tracepoints on
future loaded modules") allows user to set a tprobe on non-exist
tracepoint but it does not check the tracepoint name is acceptable.
So it leads tprobe has a wrong character for events (e.g. with
subsystem prefix). In this case, the event is not shown in the
events directory.

Reject such invalid tracepoint name.

The tracepoint name must consist of alphabet or digit or '_'.

Link: https://lore.kernel.org/all/174055073461.4079315.15875502830565214255.stgit@mhiramat.tok.corp.google.com/

Fixes: 57a7e6de9e ("tracing/fprobe: Support raw tracepoints on future loaded modules")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: stable@vger.kernel.org
2025-02-27 09:10:58 +09:00
Masami Hiramatsu (Google)
ac965d7d88 tracing: tprobe-events: Fix a memory leak when tprobe with $retval
Fix a memory leak when a tprobe is defined with $retval. This
combination is not allowed, but the parse_symbol_and_return() does
not free the *symbol which should not be used if it returns the error.
Thus, it leaks the *symbol memory in that error path.

Link: https://lore.kernel.org/all/174055072650.4079315.3063014346697447838.stgit@mhiramat.tok.corp.google.com/

Fixes: ce51e6153f ("tracing: fprobe-event: Fix to check tracepoint event and return")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: stable@vger.kernel.org
2025-02-27 09:10:21 +09:00
Luo Gengkun
9ec84f79c5 perf: Remove unnecessary parameter of security check
It seems that the attr parameter was never been used in security
checks since it was first introduced by:

commit da97e18458 ("perf_event: Add support for LSM and SELinux checks")

so remove it.

Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-02-26 14:13:58 -05:00
Alexei Starovoitov
4580f4e0eb bpf: Fix deadlock between rcu_tasks_trace and event_mutex.
Fix the following deadlock:
CPU A
_free_event()
  perf_kprobe_destroy()
    mutex_lock(&event_mutex)
      perf_trace_event_unreg()
        synchronize_rcu_tasks_trace()

There are several paths where _free_event() grabs event_mutex
and calls sync_rcu_tasks_trace. Above is one such case.

CPU B
bpf_prog_test_run_syscall()
  rcu_read_lock_trace()
    bpf_prog_run_pin_on_cpu()
      bpf_prog_load()
        bpf_tracing_func_proto()
          trace_set_clr_event()
            mutex_lock(&event_mutex)

Delegate trace_set_clr_event() to workqueue to avoid
such lock dependency.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250224221637.4780-1-alexei.starovoitov@gmail.com
2025-02-26 08:48:40 -08:00
Steven Rostedt
937fbf111a tracing: Add traceoff_after_boot option
Sometimes tracing is used to debug issues during the boot process. Since
the trace buffer has a limited amount of storage, it may be prudent to
disable tracing after the boot is finished, otherwise the critical
information may be overwritten.  With this option, the main tracing buffer
will be turned off at the end of the boot process.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lore.kernel.org/20250208103017.48a7ec83@batman.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-25 13:46:40 -05:00
Steven Rostedt
da0f622b34 ftrace: Check against is_kernel_text() instead of kaslr_offset()
As kaslr_offset() is architecture dependent and also may not be defined by
all architectures, when zeroing out unused weak functions, do not check
against kaslr_offset(), but instead check if the address is within the
kernel text sections. If KASLR added a shift to the zeroed out function,
it would still not be located in the kernel text. This is a more robust
way to test if the text is valid or not.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: "Arnd Bergmann" <arnd@arndb.de>
Link: https://lore.kernel.org/20250225182054.471759017@goodmis.org
Fixes: ef378c3b82 ("scripts/sorttable: Zero out weak functions in mcount_loc table")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: Mark Brown <broonie@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/all/20250224180805.GA1536711@ax162/
Closes: https://lore.kernel.org/all/5225b07b-a9b2-4558-9d5f-aa60b19f6317@sirena.org.uk/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-25 13:25:13 -05:00
Steven Rostedt
6eeca746fa ftrace: Test mcount_loc addr before calling ftrace_call_addr()
The addresses in the mcount_loc can be zeroed and then moved by KASLR
making them invalid addresses. ftrace_call_addr() for ARM 64 expects a
valid address to kernel text. If the addr read from the mcount_loc section
is invalid, it must not call ftrace_call_addr(). Move the addr check
before calling ftrace_call_addr() in ftrace_process_locs().

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/20250225182054.290128736@goodmis.org
Fixes: ef378c3b82 ("scripts/sorttable: Zero out weak functions in mcount_loc table")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: "Arnd Bergmann" <arnd@arndb.de>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/all/20250225025631.GA271248@ax162/
Closes: https://lore.kernel.org/all/91523154-072b-437b-bbdc-0b70e9783fd0@app.fastmail.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-25 13:25:13 -05:00
Adrian Huang
2fa6a01345 tracing: Fix memory leak when reading set_event file
kmemleak reports the following memory leak after reading set_event file:

  # cat /sys/kernel/tracing/set_event

  # cat /sys/kernel/debug/kmemleak
  unreferenced object 0xff110001234449e0 (size 16):
  comm "cat", pid 13645, jiffies 4294981880
  hex dump (first 16 bytes):
    01 00 00 00 00 00 00 00 a8 71 e7 84 ff ff ff ff  .........q......
  backtrace (crc c43abbc):
    __kmalloc_cache_noprof+0x3ca/0x4b0
    s_start+0x72/0x2d0
    seq_read_iter+0x265/0x1080
    seq_read+0x2c9/0x420
    vfs_read+0x166/0xc30
    ksys_read+0xf4/0x1d0
    do_syscall_64+0x79/0x150
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

The issue can be reproduced regardless of whether set_event is empty or
not. Here is an example about the valid content of set_event.

  # cat /sys/kernel/tracing/set_event
  sched:sched_process_fork
  sched:sched_switch
  sched:sched_wakeup
  *:*:mod:trace_events_sample

The root cause is that s_next() returns NULL when nothing is found.
This results in s_stop() attempting to free a NULL pointer because its
parameter is NULL.

Fix the issue by freeing the memory appropriately when s_next() fails
to find anything.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250220031528.7373-1-ahuang12@lenovo.com
Fixes: b355247df1 ("tracing: Cache ":mod:" events for modules not loaded yet")
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21 09:36:12 -05:00
Sebastian Andrzej Siewior
57b76bedc5 ftrace: Correct preemption accounting for function tracing.
The function tracer should record the preemption level at the point when
the function is invoked. If the tracing subsystem decrement the
preemption counter it needs to correct this before feeding the data into
the trace buffer. This was broken in the commit cited below while
shifting the preempt-disabled section.

Use tracing_gen_ctx_dec() which properly subtracts one from the
preemption counter on a preemptible kernel.

Cc: stable@vger.kernel.org
Cc: Wander Lairson Costa <wander@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/20250220140749.pfw8qoNZ@linutronix.de
Fixes: ce5e48036c ("ftrace: disable preemption when recursion locked")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21 09:36:12 -05:00
Steven Rostedt
ca26554a14 fprobe: Fix accounting of when to unregister from function graph
When adding a new fprobe, it will update the function hash to the
functions the fprobe is attached to and register with function graph to
have it call the registered functions. The fprobe_graph_active variable
keeps track of the number of fprobes that are using function graph.

If two fprobes attach to the same function, it increments the
fprobe_graph_active for each of them. But when they are removed, the first
fprobe to be removed will see that the function it is attached to is also
used by another fprobe and it will not remove that function from
function_graph. The logic will skip decrementing the fprobe_graph_active
variable.

This causes the fprobe_graph_active variable to not go to zero when all
fprobes are removed, and in doing so it does not unregister from
function graph. As the fgraph ops hash will now be empty, and an empty
filter hash means all functions are enabled, this triggers function graph
to add a callback to the fprobe infrastructure for every function!

 # echo "f:myevent1 kernel_clone" >> /sys/kernel/tracing/dynamic_events
 # echo "f:myevent2 kernel_clone%return" >> /sys/kernel/tracing/dynamic_events
 # cat /sys/kernel/tracing/enabled_functions
kernel_clone (1)           	tramp: 0xffffffffc0024000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60

 # > /sys/kernel/tracing/dynamic_events
 # cat /sys/kernel/tracing/enabled_functions
trace_initcall_start_cb (1)             tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
run_init_process (1)            tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
try_to_run_init_process (1)             tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
x86_pmu_show_pmu_cap (1)                tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
cleanup_rapl_pmus (1)                   tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
uncore_free_pcibus_map (1)              tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
uncore_types_exit (1)                   tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
uncore_pci_exit.part.0 (1)              tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
kvm_shutdown (1)                tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
vmx_dump_msrs (1)               tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
[..]

 # cat /sys/kernel/tracing/enabled_functions | wc -l
54702

If a fprobe is being removed and all its functions are also traced by
other fprobes, still decrement the fprobe_graph_active counter.

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250220202055.565129766@goodmis.org
Fixes: 4346ba1604 ("fprobe: Rewrite fprobe on function-graph tracer")
Closes: https://lore.kernel.org/all/20250217114918.10397-A-hca@linux.ibm.com/
Reported-by: Heiko Carstens <hca@linux.ibm.com>
Tested-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21 09:36:12 -05:00
Steven Rostedt
ded9140622 fprobe: Always unregister fgraph function from ops
When the last fprobe is removed, it calls unregister_ftrace_graph() to
remove the graph_ops from function graph. The issue is when it does so, it
calls return before removing the function from its graph ops via
ftrace_set_filter_ips(). This leaves the last function lingering in the
fprobe's fgraph ops and if a probe is added it also enables that last
function (even though the callback will just drop it, it does add unneeded
overhead to make that call).

  # echo "f:myevent1 kernel_clone" >> /sys/kernel/tracing/dynamic_events
  # cat /sys/kernel/tracing/enabled_functions
kernel_clone (1)           	tramp: 0xffffffffc02f3000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60

  # echo "f:myevent2 schedule_timeout" >> /sys/kernel/tracing/dynamic_events
  # cat /sys/kernel/tracing/enabled_functions
kernel_clone (1)           	tramp: 0xffffffffc02f3000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
schedule_timeout (1)           	tramp: 0xffffffffc02f3000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60

  # > /sys/kernel/tracing/dynamic_events
  # cat /sys/kernel/tracing/enabled_functions

  # echo "f:myevent3 kmem_cache_free" >> /sys/kernel/tracing/dynamic_events
  # cat /sys/kernel/tracing/enabled_functions
kmem_cache_free (1)           	tramp: 0xffffffffc0219000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
schedule_timeout (1)           	tramp: 0xffffffffc0219000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60

The above enabled a fprobe on kernel_clone, and then on schedule_timeout.
The content of the enabled_functions shows the functions that have a
callback attached to them. The fprobe attached to those functions
properly. Then the fprobes were cleared, and enabled_functions was empty
after that. But after adding a fprobe on kmem_cache_free, the
enabled_functions shows that the schedule_timeout was attached again. This
is because it was still left in the fprobe ops that is used to tell
function graph what functions it wants callbacks from.

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250220202055.393254452@goodmis.org
Fixes: 4346ba1604 ("fprobe: Rewrite fprobe on function-graph tracer")
Tested-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21 09:36:12 -05:00
Steven Rostedt
8eb4b09e0b ftrace: Do not add duplicate entries in subops manager ops
Check if a function is already in the manager ops of a subops. A manager
ops contains multiple subops, and if two or more subops are tracing the
same function, the manager ops only needs a single entry in its hash.

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250220202055.226762894@goodmis.org
Fixes: 4f554e9556 ("ftrace: Add ftrace_set_filter_ips function")
Tested-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21 09:36:12 -05:00
Steven Rostedt
38b1406194 ftrace: Fix accounting of adding subops to a manager ops
Function graph uses a subops and manager ops mechanism to attach to
ftrace.  The manager ops connects to ftrace and the functions it connects
to is defined by a list of subops that it manages.

The function hash that defines what the above ops attaches to limits the
functions to attach if the hash has any content. If the hash is empty, it
means to trace all functions.

The creation of the manager ops hash is done by iterating over all the
subops hashes. If any of the subops hashes is empty, it means that the
manager ops hash must trace all functions as well.

The issue is in the creation of the manager ops. When a second subops is
attached, a new hash is created by starting it as NULL and adding the
subops one at a time. But the NULL ops is mistaken as an empty hash, and
once an empty hash is found, it stops the loop of subops and just enables
all functions.

  # echo "f:myevent1 kernel_clone" >> /sys/kernel/tracing/dynamic_events
  # cat /sys/kernel/tracing/enabled_functions
kernel_clone (1)           	tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60

  # echo "f:myevent2 schedule_timeout" >> /sys/kernel/tracing/dynamic_events
  # cat /sys/kernel/tracing/enabled_functions
trace_initcall_start_cb (1)             tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
run_init_process (1)            tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
try_to_run_init_process (1)             tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
x86_pmu_show_pmu_cap (1)                tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
cleanup_rapl_pmus (1)                   tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
uncore_free_pcibus_map (1)              tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
uncore_types_exit (1)                   tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
uncore_pci_exit.part.0 (1)              tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
kvm_shutdown (1)                tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
vmx_dump_msrs (1)               tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
vmx_cleanup_l1d_flush (1)               tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
[..]

Fix this by initializing the new hash to NULL and if the hash is NULL do
not treat it as an empty hash but instead allocate by copying the content
of the first sub ops. Then on subsequent iterations, the new hash will not
be NULL, but the content of the previous subops. If that first subops
attached to all functions, then new hash may assume that the manager ops
also needs to attach to all functions.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250220202055.060300046@goodmis.org
Fixes: 5fccc7552c ("ftrace: Add subops logic to allow one ops to manage many")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21 09:35:44 -05:00
Hou Tao
b4a8b5bba7 bpf: Use preempt_count() directly in bpf_send_signal_common()
bpf_send_signal_common() uses preemptible() to check whether or not the
current context is preemptible. If it is preemptible, it will use
irq_work to send the signal asynchronously instead of trying to hold a
spin-lock, because spin-lock is sleepable under PREEMPT_RT.

However, preemptible() depends on CONFIG_PREEMPT_COUNT. When
CONFIG_PREEMPT_COUNT is turned off (e.g., CONFIG_PREEMPT_VOLUNTARY=y),
!preemptible() will be evaluated as 1 and bpf_send_signal_common() will
use irq_work unconditionally.

Fix it by unfolding "!preemptible()" and using "preempt_count() != 0 ||
irqs_disabled()" instead.

Fixes: 87c544108b ("bpf: Send signals asynchronously if !preemptible")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250220042259.1583319-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-20 18:39:38 -08:00
Steven Rostedt
264143c4e5 ftrace: Have ftrace pages output reflect freed pages
The amount of memory that ftrace uses to save the descriptors to manage
the functions it can trace is shown at output. But if there are a lot of
functions that are skipped because they were weak or the architecture
added holes into the tables, then the extra pages that were allocated are
freed. But these freed pages are not reflected in the numbers shown, and
they can even be inconsistent with what is reported:

 ftrace: allocating 57482 entries in 225 pages
 ftrace: allocated 224 pages with 3 groups

The above shows the number of original entries that are in the mcount_loc
section and the pages needed to save them (225), but the second output
reflects the number of pages that were actually used. The two should be
consistent as:

 ftrace: allocating 56739 entries in 224 pages
 ftrace: allocated 224 pages with 3 groups

The above also shows the accurate number of entires that were actually
stored and does not include the entries that were removed.

Cc: bpf <bpf@vger.kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicolas Schier <nicolas@fjasle.eu>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Cc: Martin  Kelly <martin.kelly@crowdstrike.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250218200023.221100846@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-18 17:12:04 -05:00
Steven Rostedt
4a3efc6baf ftrace: Update the mcount_loc check of skipped entries
Now that weak functions turn into skipped entries, update the check to
make sure the amount that was allocated would fit both the entries that
were allocated as well as those that were skipped.

Cc: bpf <bpf@vger.kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicolas Schier <nicolas@fjasle.eu>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Cc: Martin  Kelly <martin.kelly@crowdstrike.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250218200023.055162048@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-18 17:12:03 -05:00
Steven Rostedt
ef378c3b82 scripts/sorttable: Zero out weak functions in mcount_loc table
When a function is annotated as "weak" and is overridden, the code is not
removed. If it is traced, the fentry/mcount location in the weak function
will be referenced by the "__mcount_loc" section. This will then be added
to the available_filter_functions list. Since only the address of the
functions are listed, to find the name to show, a search of kallsyms is
used.

Since kallsyms will return the function by simply finding the function
that the address is after but before the next function, an address of a
weak function will show up as the function before it. This is because
kallsyms does not save names of weak functions. This has caused issues in
the past, as now the traced weak function will be listed in
available_filter_functions with the name of the function before it.

At best, this will cause the previous function's name to be listed twice.
At worse, if the previous function was marked notrace, it will now show up
as a function that can be traced. Note that it only shows up that it can
be traced but will not be if enabled, which causes confusion.

 https://lore.kernel.org/all/20220412094923.0abe90955e5db486b7bca279@kernel.org/

The commit b39181f7c6 ("ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid
adding weak function") was a workaround to this by checking the function
address before printing its name. If the address was too far from the
function given by the name then instead of printing the name it would
print: __ftrace_invalid_address___<invalid-offset>

The real issue is that these invalid addresses are listed in the ftrace
table look up which available_filter_functions is derived from. A place
holder must be listed in that file because set_ftrace_filter may take a
series of indexes into that file instead of names to be able to do O(1)
lookups to enable filtering (many tools use this method).

Even if kallsyms saved the size of the function, it does not remove the
need of having these place holders. The real solution is to not add a weak
function into the ftrace table in the first place.

To solve this, the sorttable.c code that sorts the mcount regions during
the build is modified to take a "nm -S vmlinux" input, sort it, and any
function listed in the mcount_loc section that is not within a boundary of
the function list given by nm is considered a weak function and is zeroed
out.

Note, this does not mean they will remain zero when booting as KASLR
will still shift those addresses. To handle this, the entries in the
mcount_loc section will be ignored if they are zero or match the
kaslr_offset() value.

Before:

 ~# grep __ftrace_invalid_address___ /sys/kernel/tracing/available_filter_functions | wc -l
 551

After:

 ~# grep __ftrace_invalid_address___ /sys/kernel/tracing/available_filter_functions | wc -l
 0

Cc: bpf <bpf@vger.kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicolas Schier <nicolas@fjasle.eu>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Cc: Martin  Kelly <martin.kelly@crowdstrike.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250218200022.883095980@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-18 17:12:03 -05:00
Ingo Molnar
e8f925c320 Linux 6.14-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmeyYIQeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGNy0H/jWdgjddRaEHQ1RB
 e18Oi6MJcTQikHbCHKGZGlyxR4dYxdAONuMmWwgt+266K8qUJSZcNXePwqGEWjx2
 qkJ9Tu0Agr8KkfVDtGHGXyd4tuZRpx9Fco6+jKkKiMjjtif7nrUajUGGwRsqGoib
 YYzrhbjNZDl17/J58O1E4YZs3w7Lu26PwDR58RZMsSG0pygAfU2fogKcYmi1pTYV
 w86icn0LlO8b5Y7fsrY56rLrawnI1RGlxfylUTHzo4QkoIUGvQLB8c6XPMYsVf9R
 lvkphu+/fGVnSw577WlVy8DTBso+Pj2nWw4jUTiEAy9hYY6zMxrqrX3XowAwbxj1
 m6zP+F8=
 =ieVA
 -----END PGP SIGNATURE-----

Merge tag 'v6.14-rc3' into x86/core, to pick up fixes

Pick up upstream x86 fixes before applying new patches.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-02-18 11:07:15 +01:00
Nam Cao
19fec9c443 tracing/osnoise: Switch to use hrtimer_setup()
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.

Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.

Patch was created by using Coccinelle.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/all/ff8e6e11df5f928b2b97619ac847b4fa045376a1.1738746821.git.namcao@linutronix.de
2025-02-18 10:32:33 +01:00
Steven Rostedt
97937834ae ring-buffer: Update pages_touched to reflect persistent buffer content
The pages_touched field represents the number of subbuffers in the ring
buffer that have content that can be read. This is used in accounting of
"dirty_pages" and "buffer_percent" to allow the user to wait for the
buffer to be filled to a certain amount before it reads the buffer in
blocking mode.

The persistent buffer never updated this value so it was set to zero, and
this accounting would take it as it had no content. This would cause user
space to wait for content even though there's enough content in the ring
buffer that satisfies the buffer_percent.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250214123512.0631436e@gandalf.local.home
Fixes: 5f3b6e839f ("ring-buffer: Validate boot range memory events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-15 14:00:59 -05:00
Steven Rostedt
129fe71881 tracing: Do not allow mmap() of persistent ring buffer
When trying to mmap a trace instance buffer that is attached to
reserve_mem, it would crash:

 BUG: unable to handle page fault for address: ffffe97bd00025c8
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 2862f3067 P4D 2862f3067 PUD 0
 Oops: Oops: 0000 [#1] PREEMPT_RT SMP PTI
 CPU: 4 UID: 0 PID: 981 Comm: mmap-rb Not tainted 6.14.0-rc2-test-00003-g7f1a5e3fbf9e-dirty #233
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 RIP: 0010:validate_page_before_insert+0x5/0xb0
 Code: e2 01 89 d0 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 <48> 8b 46 08 a8 01 75 67 66 90 48 89 f0 8b 50 34 85 d2 74 76 48 89
 RSP: 0018:ffffb148c2f3f968 EFLAGS: 00010246
 RAX: ffff9fa5d3322000 RBX: ffff9fa5ccff9c08 RCX: 00000000b879ed29
 RDX: ffffe97bd00025c0 RSI: ffffe97bd00025c0 RDI: ffff9fa5ccff9c08
 RBP: ffffb148c2f3f9f0 R08: 0000000000000004 R09: 0000000000000004
 R10: 0000000000000000 R11: 0000000000000200 R12: 0000000000000000
 R13: 00007f16a18d5000 R14: ffff9fa5c48db6a8 R15: 0000000000000000
 FS:  00007f16a1b54740(0000) GS:ffff9fa73df00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffffe97bd00025c8 CR3: 00000001048c6006 CR4: 0000000000172ef0
 Call Trace:
  <TASK>
  ? __die_body.cold+0x19/0x1f
  ? __die+0x2e/0x40
  ? page_fault_oops+0x157/0x2b0
  ? search_module_extables+0x53/0x80
  ? validate_page_before_insert+0x5/0xb0
  ? kernelmode_fixup_or_oops.isra.0+0x5f/0x70
  ? __bad_area_nosemaphore+0x16e/0x1b0
  ? bad_area_nosemaphore+0x16/0x20
  ? do_kern_addr_fault+0x77/0x90
  ? exc_page_fault+0x22b/0x230
  ? asm_exc_page_fault+0x2b/0x30
  ? validate_page_before_insert+0x5/0xb0
  ? vm_insert_pages+0x151/0x400
  __rb_map_vma+0x21f/0x3f0
  ring_buffer_map+0x21b/0x2f0
  tracing_buffers_mmap+0x70/0xd0
  __mmap_region+0x6f0/0xbd0
  mmap_region+0x7f/0x130
  do_mmap+0x475/0x610
  vm_mmap_pgoff+0xf2/0x1d0
  ksys_mmap_pgoff+0x166/0x200
  __x64_sys_mmap+0x37/0x50
  x64_sys_call+0x1670/0x1d70
  do_syscall_64+0xbb/0x1d0
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

The reason was that the code that maps the ring buffer pages to user space
has:

	page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);

And uses that in:

	vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);

But virt_to_page() does not work with vmap()'d memory which is what the
persistent ring buffer has. It is rather trivial to allow this, but for
now just disable mmap() of instances that have their ring buffer from the
reserve_mem option.

If an mmap() is performed on a persistent buffer it will return -ENODEV
just like it would if the .mmap field wasn't defined in the
file_operations structure.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250214115547.0d7287d3@gandalf.local.home
Fixes: 9b7bdf6f6e ("tracing: Have trace_printk not use binary prints if boot buffer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-15 13:59:52 -05:00
Steven Rostedt
f5b95f1fa2 ring-buffer: Validate the persistent meta data subbuf array
The meta data for a mapped ring buffer contains an array of indexes of all
the subbuffers. The first entry is the reader page, and the rest of the
entries lay out the order of the subbuffers in how the ring buffer link
list is to be created.

The validator currently makes sure that all the entries are within the
range of 0 and nr_subbufs. But it does not check if there are any
duplicates.

While working on the ring buffer, I corrupted this array, where I added
duplicates. The validator did not catch it and created the ring buffer
link list on top of it. Luckily, the corruption was only that the reader
page was also in the writer path and only presented corrupted data but did
not crash the kernel. But if there were duplicates in the writer side,
then it could corrupt the ring buffer link list and cause a crash.

Create a bitmask array with the size of the number of subbuffers. Then
clear it. When walking through the subbuf array checking to see if the
entries are within the range, test if its bit is already set in the
subbuf_mask. If it is, then there is duplicates and fail the validation.
If not, set the corresponding bit and continue.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250214102820.7509ddea@gandalf.local.home
Fixes: c76883f18e ("ring-buffer: Add test if range of boot buffer is valid")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-14 12:50:51 -05:00
Steven Rostedt
60b8f71114 tracing: Have the error of __tracing_resize_ring_buffer() passed to user
Currently if __tracing_resize_ring_buffer() returns an error, the
tracing_resize_ringbuffer() returns -ENOMEM. But it may not be a memory
issue that caused the function to fail. If the ring buffer is memory
mapped, then the resizing of the ring buffer will be disabled. But if the
user tries to resize the buffer, it will get an -ENOMEM returned, which is
confusing because there is plenty of memory. The actual error returned was
-EBUSY, which would make much more sense to the user.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250213134132.7e4505d7@gandalf.local.home
Fixes: 117c39200d ("ring-buffer: Introducing ring-buffer mapping functions")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-02-14 12:50:27 -05:00
Steven Rostedt
9ba0e1755a ring-buffer: Unlock resize on mmap error
Memory mapping the tracing ring buffer will disable resizing the buffer.
But if there's an error in the memory mapping like an invalid parameter,
the function exits out without re-enabling the resizing of the ring
buffer, preventing the ring buffer from being resized after that.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250213131957.530ec3c5@gandalf.local.home
Fixes: 117c39200d ("ring-buffer: Introducing ring-buffer mapping functions")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-14 12:50:05 -05:00
Peter Zijlstra
72e213a7cc x86/ibt: Clean up is_endbr()
Pretty much every caller of is_endbr() actually wants to test something at an
address and ends up doing get_kernel_nofault(). Fold the lot into a more
convenient helper.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20250207122546.181367417@infradead.org
2025-02-14 10:32:04 +01:00
Steven Rostedt
c8c9b1d2d5 fgraph: Fix set_graph_notrace with setting TRACE_GRAPH_NOTRACE_BIT
The code was restructured where the function graph notrace code, that
would not trace a function and all its children is done by setting a
NOTRACE flag when the function that is not to be traced is hit.

There's a TRACE_GRAPH_NOTRACE_BIT which defines the bit in the flags and a
TRACE_GRAPH_NOTRACE which is the mask with that bit set. But the
restructuring used TRACE_GRAPH_NOTRACE_BIT when it should have used
TRACE_GRAPH_NOTRACE.

For example:

 # cd /sys/kernel/tracing
 # echo set_track_prepare stack_trace_save  > set_graph_notrace
 # echo function_graph > current_tracer
 # cat trace
[..]
 0)               |                          __slab_free() {
 0)               |                            free_to_partial_list() {
 0)               |                                  arch_stack_walk() {
 0)               |                                    __unwind_start() {
 0)   0.501 us    |                                      get_stack_info();

Where a non filter trace looks like:

 # echo > set_graph_notrace
 # cat trace
 0)               |                            free_to_partial_list() {
 0)               |                              set_track_prepare() {
 0)               |                                stack_trace_save() {
 0)               |                                  arch_stack_walk() {
 0)               |                                    __unwind_start() {

Where the filter should look like:

 # cat trace
 0)               |                            free_to_partial_list() {
 0)               |                              _raw_spin_lock_irqsave() {
 0)   0.350 us    |                                preempt_count_add();
 0)   0.351 us    |                                do_raw_spin_lock();
 0)   2.440 us    |                              }

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250208001511.535be150@batman.local.home
Fixes: b84214890a ("function_graph: Move graph notrace bit to shadow stack global var")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-08 08:36:45 -05:00
Joel Granados
1751f872cc treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.

Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25c ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.

Created this by running an spatch followed by a sed command:
Spatch:
    virtual patch

    @
    depends on !(file in "net")
    disable optional_qualifier
    @

    identifier table_name != {
      watchdog_hardlockup_sysctl,
      iwcm_ctl_table,
      ucma_ctl_table,
      memory_allocation_profiling_sysctls,
      loadpin_sysctl_table
    };
    @@

    + const
    struct ctl_table table_name [] = { ... };

sed:
    sed --in-place \
      -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
      kernel/utsname_sysctl.c

Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-01-28 13:48:37 +01:00
Linus Torvalds
40648d246f rv: tools/rtla: Updates for 6.14
- Add a test suite to test the tool
 
   Add a small test suite that can be used to test rtla's basic features to
   at least have something to test when applying changes.
 
 - Automate manual steps in monitor creation
 
   While creating a new monitor in RV, besides generating code from dot2k,
   there are a few manual steps which can be tedious and error prone, like
   adding the tracepoints, makefile lines and kconfig, or selecting events
   that start the monitor in the initial state.
 
   Updates were made to try and automate as much as possible among those steps to
   make creating a new RV monitor much quicker. It is still requires to
   select proper tracepoints, this step is harder to automate in a general
   way and, in several cases, would still need user intervention.
 
 - Have rtla timerlat hist and top set OSNOISE_WORKLOAD flag
 
   Have both rtla-timerlat-hist and rtla-timerlat-top set OSNOISE_WORKLOAD to
   the proper value ("on" when running with -k, "off" when running with -u)
   every time the option is available instead of setting it only when running
   with -u.
 
   This prevents rtla timerlat -k from giving no results when
   NO_OSNOISE_WORKLOAD is set, either manually or by an abnormally exited earlier
   run of rtla timerlat -u.
 
 -  Stop rtla timerlat on signal properly when overloaded
 
   There is an issue where if rtla is run on machines with a high number of
   CPUs (100+), timerlat can generate more samples than rtla is able to process
   via tracefs_iterate_raw_events. This is especially common when the interval
   is set to 100us (rteval and cyclictest default) as opposed to the rtla
   default of 1000us, but also happens with the rtla default.
 
   Currently, this leads to rtla hanging and having to be terminated with
   SIGTERM. SIGINT setting stop_tracing is not enough, since more and more
   events are coming and tracefs_iterate_raw_events never exits.
 
   To fix this: Stop the timerlat tracer on SIGINT/SIGALRM to ensure no more
   events are generated when rtla is supposed to exit.
 
   Also on receiving SIGINT/SIGALRM twice, abort iteration immediately with
   tracefs_iterate_stop, making rtla exit right away instead of waiting for all
   events to be processed.
 
 - Account for missed events
 
   Due to tracefs buffer overflow, it can happen that rtla misses events,
   making the tracing results inaccurate.
 
   Count both the number of missed events and the total number of processed
   events, and display missed events as well as their percentage. The numbers
   are displayed for both osnoise and timerlat, even though for the earlier,
   missed events are generally not expected.
 
   For hist, the number is displayed at the end of the run; for top, it is
   displayed on each printing of the top table.
 
 - Changes to make osnoise more robust
 
   There was a dependency in the code that the first field of the
   osnoise_tool structure was the trace field. If that that ever changed,
   then the code work break. Change the code to encapsulate this dependency
   where the code that uses the structure does not have this dependency.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ5UQ4BQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qktFAQD2px6MyoOVTssB5Iw3aTWGUfTFoDEc
 bfng5JsBxlVJkQEA+2UUvP8FJlLTOQvVEwJiscX7CCJxl5bYkV6GWuGRxQU=
 =h//9
 -----END PGP SIGNATURE-----

Merge tag 'trace-tools-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull rv and tools/rtla updates from Steven Rostedt:

 - Add a test suite to test the tool

   Add a small test suite that can be used to test rtla's basic features
   to at least have something to test when applying changes.

 - Automate manual steps in monitor creation

   While creating a new monitor in RV, besides generating code from
   dot2k, there are a few manual steps which can be tedious and error
   prone, like adding the tracepoints, makefile lines and kconfig, or
   selecting events that start the monitor in the initial state.

   Updates were made to try and automate as much as possible among those
   steps to make creating a new RV monitor much quicker. It is still
   requires to select proper tracepoints, this step is harder to
   automate in a general way and, in several cases, would still need
   user intervention.

 - Have rtla timerlat hist and top set OSNOISE_WORKLOAD flag

   Have both rtla-timerlat-hist and rtla-timerlat-top set
   OSNOISE_WORKLOAD to the proper value ("on" when running with -k,
   "off" when running with -u) every time the option is available
   instead of setting it only when running with -u.

   This prevents rtla timerlat -k from giving no results when
   NO_OSNOISE_WORKLOAD is set, either manually or by an abnormally
   exited earlier run of rtla timerlat -u.

 - Stop rtla timerlat on signal properly when overloaded

   There is an issue where if rtla is run on machines with a high number
   of CPUs (100+), timerlat can generate more samples than rtla is able
   to process via tracefs_iterate_raw_events. This is especially common
   when the interval is set to 100us (rteval and cyclictest default) as
   opposed to the rtla default of 1000us, but also happens with the rtla
   default.

   Currently, this leads to rtla hanging and having to be terminated
   with SIGTERM. SIGINT setting stop_tracing is not enough, since more
   and more events are coming and tracefs_iterate_raw_events never
   exits.

   To fix this: Stop the timerlat tracer on SIGINT/SIGALRM to ensure no
   more events are generated when rtla is supposed to exit.

   Also on receiving SIGINT/SIGALRM twice, abort iteration immediately
   with tracefs_iterate_stop, making rtla exit right away instead of
   waiting for all events to be processed.

 - Account for missed events

   Due to tracefs buffer overflow, it can happen that rtla misses
   events, making the tracing results inaccurate.

   Count both the number of missed events and the total number of
   processed events, and display missed events as well as their
   percentage. The numbers are displayed for both osnoise and timerlat,
   even though for the earlier, missed events are generally not
   expected.

   For hist, the number is displayed at the end of the run; for top, it
   is displayed on each printing of the top table.

 - Changes to make osnoise more robust

   There was a dependency in the code that the first field of the
   osnoise_tool structure was the trace field. If that that ever
   changed, then the code work break. Change the code to encapsulate
   this dependency where the code that uses the structure does not have
   this dependency.

* tag 'trace-tools-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (22 commits)
  rtla: Report missed event count
  rtla: Add function to report missed events
  rtla: Count all processed events
  rtla: Count missed trace events
  tools/rtla: Add osnoise_trace_is_off()
  rtla/timerlat_top: Set OSNOISE_WORKLOAD for kernel threads
  rtla/timerlat_hist: Set OSNOISE_WORKLOAD for kernel threads
  rtla/osnoise: Distinguish missing workload option
  rtla/timerlat_top: Abort event processing on second signal
  rtla/timerlat_hist: Abort event processing on second signal
  rtla/timerlat_top: Stop timerlat tracer on signal
  rtla/timerlat_hist: Stop timerlat tracer on signal
  rtla: Add trace_instance_stop
  tools/rtla: Add basic test suite
  verification/dot2k: Implement event type detection
  verification/dot2k: Auto patch current kernel source
  verification/dot2k: Simplify manual steps in monitor creation
  rv: Simplify manual steps in monitor creation
  verification/dot2k: Add support for name and description options
  verification/dot2k: More robust template variables
  ...
2025-01-26 14:25:58 -08:00
Linus Torvalds
90ab2117f4 Runtime verifier and osnoise fixes for 6.14:
- Reset idle tasks on reset for runtime verifier
 
   When the runtime verifier is reset, it resets the task's data that is being
   monitored. But it only iterates for_each_process() which does not include
   the idle tasks. As the idle tasks can be monitored, they need to be reset
   as well.
 
 - Fix the enabling and disabling of tracepoints in osnoise
 
   If timerlat is enabled and the WORKLOAD flag is not set, then the osnoise
   tracer will enable the migrate task tracepoint to monitor it for its own
   workload. The test to enable the tracepoint is done against user space
   modifiable parameters. On disabling of the tracer, those same parameters
   are used to determine if the tracepoint should be disabled. The problem is
   if user space were to modify the parameters after it enables the tracer
   then it may not disable the tracepoint.
 
   Instead, a static variable is used to keep track if the tracepoint was
   enabled or not. Then when the tracer shuts down, it will use this variable
   to decide to disable the tracepoint or not, instead of looking at the user
   space parameters.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ5UJ1BQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qo46AQDjr1yVTyUmzvxe1bMLDoDOq1xeRMRe
 4f8CdpOJRxbi0QEAwnl5Ey9Rvcuy8GFpt/USESr4VYWAN10fvsPx7pkphAs=
 =RYyn
 -----END PGP SIGNATURE-----

Merge tag 'trace-rv-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull runtime verifier and osnoise fixes from Steven Rostedt:

 - Reset idle tasks on reset for runtime verifier

   When the runtime verifier is reset, it resets the task's data that is
   being monitored. But it only iterates for_each_process() which does
   not include the idle tasks. As the idle tasks can be monitored, they
   need to be reset as well.

 - Fix the enabling and disabling of tracepoints in osnoise

   If timerlat is enabled and the WORKLOAD flag is not set, then the
   osnoise tracer will enable the migrate task tracepoint to monitor it
   for its own workload. The test to enable the tracepoint is done
   against user space modifiable parameters. On disabling of the tracer,
   those same parameters are used to determine if the tracepoint should
   be disabled. The problem is if user space were to modify the
   parameters after it enables the tracer then it may not disable the
   tracepoint.

   Instead, a static variable is used to keep track if the tracepoint
   was enabled or not. Then when the tracer shuts down, it will use this
   variable to decide to disable the tracepoint or not, instead of
   looking at the user space parameters.

* tag 'trace-rv-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/osnoise: Fix resetting of tracepoints
  rv: Reset per-task monitors also for idle tasks
2025-01-26 14:19:45 -08:00
Steven Rostedt
e3ff424592 tracing/osnoise: Fix resetting of tracepoints
If a timerlat tracer is started with the osnoise option OSNOISE_WORKLOAD
disabled, but then that option is enabled and timerlat is removed, the
tracepoints that were enabled on timerlat registration do not get
disabled. If the option is disabled again and timelat is started, then it
triggers a warning in the tracepoint code due to registering the
tracepoint again without ever disabling it.

Do not use the same user space defined options to know to disable the
tracepoints when timerlat is removed. Instead, set a global flag when it
is enabled and use that flag to know to disable the events.

 ~# echo NO_OSNOISE_WORKLOAD > /sys/kernel/tracing/osnoise/options
 ~# echo timerlat > /sys/kernel/tracing/current_tracer
 ~# echo OSNOISE_WORKLOAD > /sys/kernel/tracing/osnoise/options
 ~# echo nop > /sys/kernel/tracing/current_tracer
 ~# echo NO_OSNOISE_WORKLOAD > /sys/kernel/tracing/osnoise/options
 ~# echo timerlat > /sys/kernel/tracing/current_tracer

Triggers:

 ------------[ cut here ]------------
 WARNING: CPU: 6 PID: 1337 at kernel/tracepoint.c:294 tracepoint_add_func+0x3b6/0x3f0
 Modules linked in:
 CPU: 6 UID: 0 PID: 1337 Comm: rtla Not tainted 6.13.0-rc4-test-00018-ga867c441128e-dirty #73
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 RIP: 0010:tracepoint_add_func+0x3b6/0x3f0
 Code: 48 8b 53 28 48 8b 73 20 4c 89 04 24 e8 23 59 11 00 4c 8b 04 24 e9 36 fe ff ff 0f 0b b8 ea ff ff ff 45 84 e4 0f 84 68 fe ff ff <0f> 0b e9 61 fe ff ff 48 8b 7b 18 48 85 ff 0f 84 4f ff ff ff 49 8b
 RSP: 0018:ffffb9b003a87ca0 EFLAGS: 00010202
 RAX: 00000000ffffffef RBX: ffffffff92f30860 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffff9bf59e91ccd0 RDI: ffffffff913b6410
 RBP: 000000000000000a R08: 00000000000005c7 R09: 0000000000000002
 R10: ffffb9b003a87ce0 R11: 0000000000000002 R12: 0000000000000001
 R13: ffffb9b003a87ce0 R14: ffffffffffffffef R15: 0000000000000008
 FS:  00007fce81209240(0000) GS:ffff9bf6fdd00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000055e99b728000 CR3: 00000001277c0002 CR4: 0000000000172ef0
 Call Trace:
  <TASK>
  ? __warn.cold+0xb7/0x14d
  ? tracepoint_add_func+0x3b6/0x3f0
  ? report_bug+0xea/0x170
  ? handle_bug+0x58/0x90
  ? exc_invalid_op+0x17/0x70
  ? asm_exc_invalid_op+0x1a/0x20
  ? __pfx_trace_sched_migrate_callback+0x10/0x10
  ? tracepoint_add_func+0x3b6/0x3f0
  ? __pfx_trace_sched_migrate_callback+0x10/0x10
  ? __pfx_trace_sched_migrate_callback+0x10/0x10
  tracepoint_probe_register+0x78/0xb0
  ? __pfx_trace_sched_migrate_callback+0x10/0x10
  osnoise_workload_start+0x2b5/0x370
  timerlat_tracer_init+0x76/0x1b0
  tracing_set_tracer+0x244/0x400
  tracing_set_trace_write+0xa0/0xe0
  vfs_write+0xfc/0x570
  ? do_sys_openat2+0x9c/0xe0
  ksys_write+0x72/0xf0
  do_syscall_64+0x79/0x1c0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Gabriele Monaco <gmonaco@redhat.com>
Cc: Luis Goncalves <lgoncalv@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250123204159.4450c88e@gandalf.local.home
Fixes: e88ed227f6 ("tracing/timerlat: Add user-space interface")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-24 15:20:44 -05:00
Linus Torvalds
606489dbfa Fix atomic64 operations on some architectures for the tracing ring buffer:
- Have emulating atomic64 use arch_spin_locks instead of raw_spin_locks
 
   The tracing ring buffer events have a small timestamp that holds the
   delta between itself and the event before it. But this can be tricky
   to update when interrupts come in. It originally just set the deltas
   to zero for events that interrupted the adding of another event which
   made all the events in the interrupt have the same timestamp as the
   event it interrupted. This was not suitable for many tools, so it
   was eventually fixed. But that fix required adding an atomic64 cmpxchg
   on the timestamp in cases where an event was added while another
   event was in the process of being added.
 
   Originally, for 32 bit architectures, the manipulation of the 64 bit
   timestamp was done by a structure that held multiple 32bit words to hold
   parts of the timestamp and a counter. But as updates to the ring buffer
   were done, maintaining this became too complex and was replaced by the
   atomic64 generic operations which are now used by both 64bit and 32bit
   architectures.  Shortly after that, it was reported that riscv32 and
   other 32 bit architectures that just used the generic atomic64 were
   locking up. This was because the generic atomic64 operations defined in
   lib/atomic64.c uses a raw_spin_lock() to emulate an atomic64 operation.
   The problem here was that raw_spin_lock() can also be traced by the
   function tracer (which is commonly used for debugging raw spin locks).
   Since the function tracer uses the tracing ring buffer, which now is being
   traced internally, this was triggering a recursion and setting off a
   warning that the spin locks were recusing.
 
   There's no reason for the code that emulates atomic64 operations to be
   using raw_spin_locks which have a lot of debugging infrastructure attached
   to them (depending on the config options). Instead it should be using
   the arch_spin_lock() which does not have any infrastructure attached to
   them and is used by low level infrastructure like RCU locks, lockdep
   and of course tracing. Using arch_spin_lock()s fixes this issue.
 
 - Do not trace in NMI if the architecture uses emulated atomic64 operations
 
   Another issue with using the emulated atomic64 operations that uses
   spin locks to emulate the atomic64 operations is that they cannot be
   used in NMI context. As an NMI can trigger while holding the atomic64
   spin locks it can try to take the same lock and cause a deadlock.
 
   Have the ring buffer fail recording events if in NMI context and the
   architecture uses the emulated atomic64 operations.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ5Jr7RQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qg7cAPoD/H4BRsFa3UUDnxofTlBuj4A7neJd
 rk9ddD9HXH8KywEAhBn1Oujiw81Ayjx7E6s4ednAQX4rldTXBXDyFNuuGgU=
 =b13F
 -----END PGP SIGNATURE-----

Merge tag 'trace-ringbuffer-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull trace fing buffer fix from Steven Rostedt:
 "Fix atomic64 operations on some architectures for the tracing ring
  buffer:

   - Have emulating atomic64 use arch_spin_locks instead of
     raw_spin_locks

     The tracing ring buffer events have a small timestamp that holds
     the delta between itself and the event before it. But this can be
     tricky to update when interrupts come in. It originally just set
     the deltas to zero for events that interrupted the adding of
     another event which made all the events in the interrupt have the
     same timestamp as the event it interrupted. This was not suitable
     for many tools, so it was eventually fixed. But that fix required
     adding an atomic64 cmpxchg on the timestamp in cases where an event
     was added while another event was in the process of being added.

     Originally, for 32 bit architectures, the manipulation of the 64
     bit timestamp was done by a structure that held multiple 32bit
     words to hold parts of the timestamp and a counter. But as updates
     to the ring buffer were done, maintaining this became too complex
     and was replaced by the atomic64 generic operations which are now
     used by both 64bit and 32bit architectures. Shortly after that, it
     was reported that riscv32 and other 32 bit architectures that just
     used the generic atomic64 were locking up. This was because the
     generic atomic64 operations defined in lib/atomic64.c uses a
     raw_spin_lock() to emulate an atomic64 operation. The problem here
     was that raw_spin_lock() can also be traced by the function tracer
     (which is commonly used for debugging raw spin locks). Since the
     function tracer uses the tracing ring buffer, which now is being
     traced internally, this was triggering a recursion and setting off
     a warning that the spin locks were recusing.

     There's no reason for the code that emulates atomic64 operations to
     be using raw_spin_locks which have a lot of debugging
     infrastructure attached to them (depending on the config options).
     Instead it should be using the arch_spin_lock() which does not have
     any infrastructure attached to them and is used by low level
     infrastructure like RCU locks, lockdep and of course tracing. Using
     arch_spin_lock()s fixes this issue.

   - Do not trace in NMI if the architecture uses emulated atomic64
     operations

     Another issue with using the emulated atomic64 operations that uses
     spin locks to emulate the atomic64 operations is that they cannot
     be used in NMI context. As an NMI can trigger while holding the
     atomic64 spin locks it can try to take the same lock and cause a
     deadlock.

     Have the ring buffer fail recording events if in NMI context and
     the architecture uses the emulated atomic64 operations"

* tag 'trace-ringbuffer-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  atomic64: Use arch_spin_locks instead of raw_spin_locks
  ring-buffer: Do not allow events in NMI with generic atomic64 cmpxchg()
2025-01-23 18:02:55 -08:00
Linus Torvalds
7c1badb2a9 Remove calltime and rettime from fgraph infrastructure
The calltime and rettime were used by the function graph tracer to
 calculate the timings of functions where it traced their entry and exit.
 The calltime and rettime were stored in the generic structures that
 were used for the mechanisms to add an entry and exit callback.
 
 Now that function graph infrastructure is used by other subsystems than
 just the tracer, the calltime and rettime are not needed for them.
 Remove the calltime and rettime from the generic fgraph infrastructure
 and have the callers that require them handle them.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ5Jk2BQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qr3GAQCtzSX6PYTliZKm/Wa5zhebopwhGKEb
 Mv0VZejaxYu0hgEA5O57J676FAOaQ7BLzoMojKFhR8RZ6LadKR5r7A0Qzgc=
 =Bb/r
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull fgraph updates from Steven Rostedt:
 "Remove calltime and rettime from fgraph infrastructure

  The calltime and rettime were used by the function graph tracer to
  calculate the timings of functions where it traced their entry and
  exit. The calltime and rettime were stored in the generic structures
  that were used for the mechanisms to add an entry and exit callback.

  Now that function graph infrastructure is used by other subsystems
  than just the tracer, the calltime and rettime are not needed for
  them. Remove the calltime and rettime from the generic fgraph
  infrastructure and have the callers that require them handle them"

* tag 'ftrace-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  fgraph: Remove calltime and rettime from generic operations
2025-01-23 17:59:25 -08:00
Linus Torvalds
e8744fbc83 tracing updates for v6.14:
- Cleanup with guard() and free() helpers
 
   There were several places in the code that had a lot of "goto out" in the
   error paths to either unlock a lock or free some memory that was
   allocated. But this is error prone. Convert the code over to use the
   guard() and free() helpers that let the compiler unlock locks or free
   memory when the function exits.
 
 - Update the Rust tracepoint code to use the C code too
 
   There was some duplication of the tracepoint code for Rust that did the
   same logic as the C code. Add a helper that makes it possible for both
   algorithms to use the same logic in one place.
 
 - Add poll to trace event hist files
 
   It is useful to know when an event is triggered, or even with some
   filtering. Since hist files of events get updated when active and the
   event is triggered, allow applications to poll the hist file and wake up
   when an event is triggered. This will let the application know that the
   event it is waiting for happened.
 
 - Add :mod: command to enable events for current or future modules
 
   The function tracer already has a way to enable functions to be traced in
   modules by writing ":mod:<module>" into set_ftrace_filter. That will
   enable either all the functions for the module if it is loaded, or if it
   is not, it will cache that command, and when the module is loaded that
   matches <module>, its functions will be enabled. This also allows init
   functions to be traced. But currently events do not have that feature.
 
   Add the command where if ':mod:<module>' is written into set_event, then
   either all the modules events are enabled if it is loaded, or cache it so
   that the module's events are enabled when it is loaded. This also works
   from the kernel command line, where "trace_event=:mod:<module>", when the
   module is loaded at boot up, its events will be enabled then.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ5EbMxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qkZsAP9Amgx9frSbR1pn1t0I3wVnQx7khgOu
 s/b8Ro+vjTx1/QD/RN2AA7f+HK4F27w3Aqfrs0nKXAPtXWsJ9Epp8raG5w8=
 =Pg+4
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

 - Cleanup with guard() and free() helpers

   There were several places in the code that had a lot of "goto out" in
   the error paths to either unlock a lock or free some memory that was
   allocated. But this is error prone. Convert the code over to use the
   guard() and free() helpers that let the compiler unlock locks or free
   memory when the function exits.

 - Update the Rust tracepoint code to use the C code too

   There was some duplication of the tracepoint code for Rust that did
   the same logic as the C code. Add a helper that makes it possible for
   both algorithms to use the same logic in one place.

 - Add poll to trace event hist files

   It is useful to know when an event is triggered, or even with some
   filtering. Since hist files of events get updated when active and the
   event is triggered, allow applications to poll the hist file and wake
   up when an event is triggered. This will let the application know
   that the event it is waiting for happened.

 - Add :mod: command to enable events for current or future modules

   The function tracer already has a way to enable functions to be
   traced in modules by writing ":mod:<module>" into set_ftrace_filter.
   That will enable either all the functions for the module if it is
   loaded, or if it is not, it will cache that command, and when the
   module is loaded that matches <module>, its functions will be
   enabled. This also allows init functions to be traced. But currently
   events do not have that feature.

   Add the command where if ':mod:<module>' is written into set_event,
   then either all the modules events are enabled if it is loaded, or
   cache it so that the module's events are enabled when it is loaded.
   This also works from the kernel command line, where
   "trace_event=:mod:<module>", when the module is loaded at boot up,
   its events will be enabled then.

* tag 'trace-v6.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (26 commits)
  tracing: Fix output of set_event for some cached module events
  tracing: Fix allocation of printing set_event file content
  tracing: Rename update_cache() to update_mod_cache()
  tracing: Fix #if CONFIG_MODULES to #ifdef CONFIG_MODULES
  selftests/ftrace: Add test that tests event :mod: commands
  tracing: Cache ":mod:" events for modules not loaded yet
  tracing: Add :mod: command to enabled module events
  selftests/tracing: Add hist poll() support test
  tracing/hist: Support POLLPRI event for poll on histogram
  tracing/hist: Add poll(POLLIN) support on hist file
  tracing: Fix using ret variable in tracing_set_tracer()
  tracepoint: Reduce duplication of __DO_TRACE_CALL
  tracing/string: Create and use __free(argv_free) in trace_dynevent.c
  tracing: Switch trace_stat.c code over to use guard()
  tracing: Switch trace_stack.c code over to use guard()
  tracing: Switch trace_osnoise.c code over to use guard() and __free()
  tracing: Switch trace_events_synth.c code over to use guard()
  tracing: Switch trace_events_filter.c code over to use guard()
  tracing: Switch trace_events_trigger.c code over to use guard()
  tracing: Switch trace_events_hist.c code over to use guard()
  ...
2025-01-23 17:51:16 -08:00
Linus Torvalds
544521d621 Probes updates for v6.14:
- kprobes: Cleanups using guard() and __free(): Use cleanup.h macros
   to cleanup code and remove all gotos from kprobes code. This work
   includes below changes.
   . kprobes: Adopt guard() and scoped_guard()
   . jump_label: Define guard() for jump_label_lock
   . kprobes: Use guard() for external locks
   . kprobes: Use guard for rcu_read_lock
   . kprobes: Remove unneeded goto
   . kprobes: Remove remaining gotos
 
 - tracing/probes: Also cleanups tracing/*probe events code with guard()
   and __free(). These patches are just to simplify the parser codes.
   This work includes below changes.
   . tracing/kprobe: Adopt guard() and scoped_guard()
   . tracing/uprobe: Adopt guard() and scoped_guard()
   . tracing/eprobe: Adopt guard() and scoped_guard()
   . tracing: Use __free() in trace_probe for cleanup
   . tracing: Use __free() for kprobe events to cleanup
   . tracing/kprobes: Simplify __trace_kprobe_create() by removing gotos
 
 - kprobes: Reduce preempt disable scope in check_kprobe_access_safe()
   This reduces preempt disable time only when getting the module
   refcount in check_kprobe_access_safe(). Previously it disabled
   preempt needlessly for other checks including
   jump_label_text_reserved(), but it took long time because of
   the linear search.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmeNy8kbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b/UoIAMYX8pYtLMZjokKSlsnN
 Y3rMZecFJNKfXeA62YazpESO9n+BYyqj/Zj25Ws4YzdfnoT4kGtMuoUCS7d5U3lo
 RZO/a7i3TgrzQdWbRwBz7l75h+FnCAsrSoLGoH/sQxXUP+p/2C6GJG3O3A/0qMaS
 TPp7/nsR78P7MSf6Yr/ODZj0UWMSX6a6QNrzwloiMUjY6aPxz7K1on9cdzMF2QPE
 fgtJk+JbCn9nLOJi/SVlO+wu4EfdKrfT7sqz9taNX/eQYmx2O9fE3R92Qvhxh79r
 MrqcrMfcwfWtX+yD1kb9mf+jMNYy8Dhcf595B0TGSzCZq64ir9UukntR6YD3raAO
 Wm0=
 =qmrY
 -----END PGP SIGNATURE-----

Merge tag 'probes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes updates from Masami Hiramatsu:

 - kprobes: Cleanups using guard() and __free(): Use cleanup.h macros to
   cleanup code and remove all gotos from kprobes code.

 - tracing/probes: Also cleanups tracing/*probe events code with guard()
   and __free(). These patches are just to simplify the parser codes.

 - kprobes: Reduce preempt disable scope in check_kprobe_access_safe()

   This reduces preempt disable time to only when getting the module
   refcount in check_kprobe_access_safe().

   Previously it disabled preempt needlessly for other checks including
   jump_label_text_reserved(), which took a long time because of the
   linear search.

* tag 'probes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/kprobes: Simplify __trace_kprobe_create() by removing gotos
  tracing: Use __free() for kprobe events to cleanup
  tracing: Use __free() in trace_probe for cleanup
  kprobes: Remove remaining gotos
  kprobes: Remove unneeded goto
  kprobes: Use guard for rcu_read_lock
  kprobes: Use guard() for external locks
  jump_label: Define guard() for jump_label_lock
  tracing/eprobe: Adopt guard() and scoped_guard()
  tracing/uprobe: Adopt guard() and scoped_guard()
  tracing/kprobe: Adopt guard() and scoped_guard()
  kprobes: Adopt guard() and scoped_guard()
  kprobes: Reduce preempt disable scope in check_kprobe_access_safe()
2025-01-23 17:24:20 -08:00
Linus Torvalds
d0d106a2bd bpf-next-6.14
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmeOu1YACgkQ6rmadz2v
 bTrrHxAAn6eqEsluWnDlzhI0OGsPjvgS00sf+MOeqiXYeS2eJ8yJuKifp38+nIQZ
 lIplsWU2ReUY20eizPqLPnQ7TXZGvLgp08E8yHUoZ0siWanqr9iDRfbZCCNrDMNm
 lMqeR1SLapMws2R/UX9JbvPn2ajIJ6Lb4wxenTfdlW6q+0hAGM6Dt0k/jBod+quq
 /oo+xwG3L0q4APBovJfiAFN2z6IYN03b+zLiOrpIJtMACGewEXnl3m4mkL8ZM/FV
 nZGPIxIUPXCpKTGEkNqxfkrnHN2wZQ4ZSKEJ6lhEEp4jrgCVITaGZ/E7jlx6fZoj
 bbd4YMonIPo9Nhim8p1dt8yYBhKKiE5IXIq0GqlMv5+MvAN8ylrlydpsouW1fu66
 hZ1W1BxbxmrgyF0Bwo9JPOMhBHwMrmD6iH9LgiMpZf0ASeF+q9cJpoSOU5j5E9XB
 LpLIRf5jYTd4wZjhDmrQREReLo+Bng9DlCBu+jjh2+YTz6l6Qed+ETpENcd7lL5i
 IHZVbgD2RVPNJoUfdrd763HfYfDTk+50MF5FIMEyfKHz11if0E/LhBMzto22hm6b
 2f8ruj/8yvg8s2dxEP3ySQgcnynlwEnGxLenUVv7uEOYKeWri1rq+fvTK5ne1OLK
 oHnTlkViwQb74c0r8cFW+nkyfUYTfhhBAql14rl/fMjGDO2KZ10=
 =f2CA
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:
 "A smaller than usual release cycle.

  The main changes are:

   - Prepare selftest to run with GCC-BPF backend (Ihor Solodrai)

     In addition to LLVM-BPF runs the BPF CI now runs GCC-BPF in compile
     only mode. Half of the tests are failing, since support for
     btf_decl_tag is still WIP, but this is a great milestone.

   - Convert various samples/bpf to selftests/bpf/test_progs format
     (Alexis Lothoré and Bastien Curutchet)

   - Teach verifier to recognize that array lookup with constant
     in-range index will always succeed (Daniel Xu)

   - Cleanup migrate disable scope in BPF maps (Hou Tao)

   - Fix bpf_timer destroy path in PREEMPT_RT (Hou Tao)

   - Always use bpf_mem_alloc in bpf_local_storage in PREEMPT_RT (Martin
     KaFai Lau)

   - Refactor verifier lock support (Kumar Kartikeya Dwivedi)

     This is a prerequisite for upcoming resilient spin lock.

   - Remove excessive 'may_goto +0' instructions in the verifier that
     LLVM leaves when unrolls the loops (Yonghong Song)

   - Remove unhelpful bpf_probe_write_user() warning message (Marco
     Elver)

   - Add fd_array_cnt attribute for prog_load command (Anton Protopopov)

     This is a prerequisite for upcoming support for static_branch"

* tag 'bpf-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (125 commits)
  selftests/bpf: Add some tests related to 'may_goto 0' insns
  bpf: Remove 'may_goto 0' instruction in opt_remove_nops()
  bpf: Allow 'may_goto 0' instruction in verifier
  selftests/bpf: Add test case for the freeing of bpf_timer
  bpf: Cancel the running bpf_timer through kworker for PREEMPT_RT
  bpf: Free element after unlock in __htab_map_lookup_and_delete_elem()
  bpf: Bail out early in __htab_map_lookup_and_delete_elem()
  bpf: Free special fields after unlock in htab_lru_map_delete_node()
  tools: Sync if_xdp.h uapi tooling header
  libbpf: Work around kernel inconsistently stripping '.llvm.' suffix
  bpf: selftests: verifier: Add nullness elision tests
  bpf: verifier: Support eliding map lookup nullness
  bpf: verifier: Refactor helper access type tracking
  bpf: tcp: Mark bpf_load_hdr_opt() arg2 as read-write
  bpf: verifier: Add missing newline on verbose() call
  selftests/bpf: Add distilled BTF test about marking BTF_IS_EMBEDDED
  libbpf: Fix incorrect traversal end type ID when marking BTF_IS_EMBEDDED
  libbpf: Fix return zero when elf_begin failed
  selftests/bpf: Fix btf leak on new btf alloc failure in btf_distill test
  veristat: Load struct_ops programs only once
  ...
2025-01-23 08:04:07 -08:00
Steven Rostedt
66611c0475 fgraph: Remove calltime and rettime from generic operations
The function graph infrastructure is now generic so that kretprobes,
fprobes and BPF can use it. But there is still some leftover logic that
only the function graph tracer itself uses. This is the calculation of the
calltime and return time of the functions. The calculation of the calltime
has been moved into the function graph tracer and those users that need it
so that it doesn't cause overhead to the other users. But the return
function timestamp was still called.

Instead of just moving the taking of the timestamp into the function graph
trace remove the calltime and rettime completely from the ftrace_graph_ret
structure. Instead, move it into the function graph return entry event
structure and this also moves all the calltime and rettime logic out of
the generic fgraph.c code and into the tracing code that uses it.

This has been reported to decrease the overhead by ~27%.

Link: https://lore.kernel.org/all/Z3aSuql3fnXMVMoM@krava/
Link: https://lore.kernel.org/all/173665959558.1629214.16724136597211810729.stgit@devnote2/

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250121194436.15bdf71a@gandalf.local.home
Reported-by: Jiri Olsa <olsajiri@gmail.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-21 21:55:49 -05:00
Linus Torvalds
2e04247f7c ftrace updates for v6.14:
- Have fprobes built on top of function graph infrastructure
 
   The fprobe logic is an optimized kprobe that uses ftrace to attach to
   functions when a probe is needed at the start or end of the function. The
   fprobe and kretprobe logic implements a similar method as the function
   graph tracer to trace the end of the function. That is to hijack the
   return address and jump to a trampoline to do the trace when the function
   exits. To do this, a shadow stack needs to be created to store the
   original return address.  Fprobes and function graph do this slightly
   differently. Fprobes (and kretprobes) has slots per callsite that are
   reserved to save the return address. This is fine when just a few points
   are traced. But users of fprobes, such as BPF programs, are starting to add
   many more locations, and this method does not scale.
 
   The function graph tracer was created to trace all functions in the
   kernel. In order to do this, when function graph tracing is started, every
   task gets its own shadow stack to hold the return address that is going to
   be traced. The function graph tracer has been updated to allow multiple
   users to use its infrastructure. Now have fprobes be one of those users.
   This will also allow for the fprobe and kretprobe methods to trace the
   return address to become obsolete. With new technologies like CFI that
   need to know about these methods of hijacking the return address, going
   toward a solution that has only one method of doing this will make the
   kernel less complex.
 
 - Cleanup with guard() and free() helpers
 
   There were several places in the code that had a lot of "goto out" in the
   error paths to either unlock a lock or free some memory that was
   allocated. But this is error prone. Convert the code over to use the
   guard() and free() helpers that let the compiler unlock locks or free
   memory when the function exits.
 
 - Remove disabling of interrupts in the function graph tracer
 
   When function graph tracer was first introduced, it could race with
   interrupts and NMIs. To prevent that race, it would disable interrupts and
   not trace NMIs. But the code has changed to allow NMIs and also
   interrupts. This change was done a long time ago, but the disabling of
   interrupts was never removed. Remove the disabling of interrupts in the
   function graph tracer is it is not needed. This greatly improves its
   performance.
 
 - Allow the :mod: command to enable tracing module functions on the kernel
   command line.
 
   The function tracer already has a way to enable functions to be traced in
   modules by writing ":mod:<module>" into set_ftrace_filter. That will
   enable either all the functions for the module if it is loaded, or if it
   is not, it will cache that command, and when the module is loaded that
   matches <module>, its functions will be enabled. This also allows init
   functions to be traced. But currently events do not have that feature.
 
   Because enabling function tracing can be done very early at boot up
   (before scheduling is enabled), the commands that can be done when
   function tracing is started is limited. Having the ":mod:" command to
   trace module functions as they are loaded is very useful. Update the
   kernel command line function filtering to allow it.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ42E2RQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qqXSAPwOMxuhye8tb1GYG62QD9+w7e6nOmlC
 2GCPj4detnEM2QD/ciivkhespVKhHpZHRewAuSnJgHPSM45NQ3EVESzjWQ4=
 =snbx
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace updates from Steven Rostedt:

 - Have fprobes built on top of function graph infrastructure

   The fprobe logic is an optimized kprobe that uses ftrace to attach to
   functions when a probe is needed at the start or end of the function.
   The fprobe and kretprobe logic implements a similar method as the
   function graph tracer to trace the end of the function. That is to
   hijack the return address and jump to a trampoline to do the trace
   when the function exits. To do this, a shadow stack needs to be
   created to store the original return address. Fprobes and function
   graph do this slightly differently. Fprobes (and kretprobes) has
   slots per callsite that are reserved to save the return address. This
   is fine when just a few points are traced. But users of fprobes, such
   as BPF programs, are starting to add many more locations, and this
   method does not scale.

   The function graph tracer was created to trace all functions in the
   kernel. In order to do this, when function graph tracing is started,
   every task gets its own shadow stack to hold the return address that
   is going to be traced. The function graph tracer has been updated to
   allow multiple users to use its infrastructure. Now have fprobes be
   one of those users. This will also allow for the fprobe and kretprobe
   methods to trace the return address to become obsolete. With new
   technologies like CFI that need to know about these methods of
   hijacking the return address, going toward a solution that has only
   one method of doing this will make the kernel less complex.

 - Cleanup with guard() and free() helpers

   There were several places in the code that had a lot of "goto out" in
   the error paths to either unlock a lock or free some memory that was
   allocated. But this is error prone. Convert the code over to use the
   guard() and free() helpers that let the compiler unlock locks or free
   memory when the function exits.

 - Remove disabling of interrupts in the function graph tracer

   When function graph tracer was first introduced, it could race with
   interrupts and NMIs. To prevent that race, it would disable
   interrupts and not trace NMIs. But the code has changed to allow NMIs
   and also interrupts. This change was done a long time ago, but the
   disabling of interrupts was never removed. Remove the disabling of
   interrupts in the function graph tracer is it is not needed. This
   greatly improves its performance.

 - Allow the :mod: command to enable tracing module functions on the
   kernel command line.

   The function tracer already has a way to enable functions to be
   traced in modules by writing ":mod:<module>" into set_ftrace_filter.
   That will enable either all the functions for the module if it is
   loaded, or if it is not, it will cache that command, and when the
   module is loaded that matches <module>, its functions will be
   enabled. This also allows init functions to be traced. But currently
   events do not have that feature.

   Because enabling function tracing can be done very early at boot up
   (before scheduling is enabled), the commands that can be done when
   function tracing is started is limited. Having the ":mod:" command to
   trace module functions as they are loaded is very useful. Update the
   kernel command line function filtering to allow it.

* tag 'ftrace-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (26 commits)
  ftrace: Implement :mod: cache filtering on kernel command line
  tracing: Adopt __free() and guard() for trace_fprobe.c
  bpf: Use ftrace_get_symaddr() for kprobe_multi probes
  ftrace: Add ftrace_get_symaddr to convert fentry_ip to symaddr
  Documentation: probes: Update fprobe on function-graph tracer
  selftests/ftrace: Add a test case for repeating register/unregister fprobe
  selftests: ftrace: Remove obsolate maxactive syntax check
  tracing/fprobe: Remove nr_maxactive from fprobe
  fprobe: Add fprobe_header encoding feature
  fprobe: Rewrite fprobe on function-graph tracer
  s390/tracing: Enable HAVE_FTRACE_GRAPH_FUNC
  ftrace: Add CONFIG_HAVE_FTRACE_GRAPH_FUNC
  bpf: Enable kprobe_multi feature if CONFIG_FPROBE is enabled
  tracing/fprobe: Enable fprobe events with CONFIG_DYNAMIC_FTRACE_WITH_ARGS
  tracing: Add ftrace_fill_perf_regs() for perf event
  tracing: Add ftrace_partial_regs() for converting ftrace_regs to pt_regs
  fprobe: Use ftrace_regs in fprobe exit handler
  fprobe: Use ftrace_regs in fprobe entry handler
  fgraph: Pass ftrace_regs to retfunc
  fgraph: Replace fgraph_ret_regs with ftrace_regs
  ...
2025-01-21 15:15:28 -08:00
Linus Torvalds
0074adea39 ring-buffer changes for v6.14
- Clean up the __rb_map_vma() logic
 
   The logic of __rb_map_vma() has a error check with WARN_ON() that makes
   sure that the index does not go past the end of the array of buffers. The
   test in the loop pretty much guarantees that it will never happen, but
   since the relation of the variables used is a little complex, the
   WARN_ON() check was added. It was noticed that the array was dereferenced
   before this check and if the logic does break and for some reason the
   logic goes past the array, there will be an out of bounds access here.
   Move the access to after the WARN_ON().
 
 - Consolidate how the ring buffer is determined to be empty
 
   Currently there's two ways that are used to determine if the ring buffer
   is empty. One relies on the status of the commit and reader pages and what
   was read, and the other is on what was written vs what was read. By using
   the number of entries (written) method, it can be used for reading events
   that are out of the kernel's control (what pKVM will use). Move to this
   method to make it easier to implement a pKVM ring buffer that the kernel
   can read.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ42XuRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qn3GAQCOQ94vr88FSXb/azC9281iDGYC/KbJ
 7J4dGv2rXHpoVAEAtXRXSXpG0mTIJ6TtgVKgMrIFAuT/AVo4EIUr2q/CsgA=
 =2G7c
 -----END PGP SIGNATURE-----

Merge tag 'trace-ringbuffer-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull trace ring-buffer updates from Steven Rostedt:

 - Clean up the __rb_map_vma() logic

   The logic of __rb_map_vma() has a error check with WARN_ON() that
   makes sure that the index does not go past the end of the array of
   buffers. The test in the loop pretty much guarantees that it will
   never happen, but since the relation of the variables used is a
   little complex, the WARN_ON() check was added. It was noticed that
   the array was dereferenced before this check and if the logic does
   break and for some reason the logic goes past the array, there will
   be an out of bounds access here. Move the access to after the
   WARN_ON().

 - Consolidate how the ring buffer is determined to be empty

   Currently there's two ways that are used to determine if the ring
   buffer is empty. One relies on the status of the commit and reader
   pages and what was read, and the other is on what was written vs what
   was read. By using the number of entries (written) method, it can be
   used for reading events that are out of the kernel's control (what
   pKVM will use). Move to this method to make it easier to implement a
   pKVM ring buffer that the kernel can read.

* tag 'trace-ringbuffer-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Make reading page consistent with the code logic
  ring-buffer: Check for empty ring-buffer with rb_num_of_entries()
2025-01-21 15:11:54 -08:00
Steven Rostedt
8f21943e10 tracing: Fix output of set_event for some cached module events
The following works fine:

 ~# echo ':mod:trace_events_sample' > /sys/kernel/tracing/set_event
 ~# cat /sys/kernel/tracing/set_event
 *:*:mod:trace_events_sample
 ~#

But if a name is given without a ':' where it can match an event name or
system name, the output of the cached events does not include a new line:

 ~# echo 'foo_bar:mod:trace_events_sample' > /sys/kernel/tracing/set_event
 ~# cat /sys/kernel/tracing/set_event
 foo_bar:mod:trace_events_sample~#

Add the '\n' to that as well.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250121151336.6c491844@gandalf.local.home
Fixes: b355247df1 ("tracing: Cache ":mod:" events for modules not loaded yet")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-21 15:23:07 -05:00
Steven Rostedt
f95ee54294 tracing: Fix allocation of printing set_event file content
The adding of cached events for modules not loaded yet required a
descriptor to separate the iteration of events with the iteration of
cached events for a module. But the allocation used the size of the
pointer and not the size of the contents to allocate its data and caused a
slab-out-of-bounds.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20250121151236.47fcf433@gandalf.local.home
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/all/Z4_OHKESRSiJcr-b@lappy/
Fixes: b355247df1 ("tracing: Cache ":mod:" events for modules not loaded yet")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-21 15:22:41 -05:00
Steven Rostedt
cd2375a356 ring-buffer: Do not allow events in NMI with generic atomic64 cmpxchg()
Some architectures can not safely do atomic64 operations in NMI context.
Since the ring buffer relies on atomic64 operations to do its time
keeping, if an event is requested in NMI context, reject it for these
architectures.

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andreas Larsson <andreas@gaisler.com>
Link: https://lore.kernel.org/20250120235721.407068250@goodmis.org
Fixes: c84897c0ff ("ring-buffer: Remove 32bit timestamp logic")
Closes: https://lore.kernel.org/all/86fb4f86-a0e4-45a2-a2df-3154acc4f086@gaisler.com/
Reported-by: Ludwig Rydberg <ludwig.rydberg@gaisler.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-21 15:19:00 -05:00
Linus Torvalds
6c4aa896eb Performance events changes for v6.14:
- Seqlock optimizations that arose in a perf context and were
    merged into the perf tree:
 
    - seqlock: Add raw_seqcount_try_begin (Suren Baghdasaryan)
    - mm: Convert mm_lock_seq to a proper seqcount ((Suren Baghdasaryan)
    - mm: Introduce mmap_lock_speculate_{try_begin|retry} (Suren Baghdasaryan)
    - mm/gup: Use raw_seqcount_try_begin() (Peter Zijlstra)
 
  - Core perf enhancements:
 
    - Reduce 'struct page' footprint of perf by mapping pages
      in advance (Lorenzo Stoakes)
    - Save raw sample data conditionally based on sample type (Yabin Cui)
    - Reduce sampling overhead by checking sample_type in
      perf_sample_save_callchain() and perf_sample_save_brstack() (Yabin Cui)
    - Export perf_exclude_event() (Namhyung Kim)
 
  - Uprobes scalability enhancements: (Andrii Nakryiko)
 
    - Simplify find_active_uprobe_rcu() VMA checks
    - Add speculative lockless VMA-to-inode-to-uprobe resolution
    - Simplify session consumer tracking
    - Decouple return_instance list traversal and freeing
    - Ensure return_instance is detached from the list before freeing
    - Reuse return_instances between multiple uretprobes within task
    - Guard against kmemdup() failing in dup_return_instance()
 
  - AMD core PMU driver enhancements:
 
    - Relax privilege filter restriction on AMD IBS (Namhyung Kim)
 
  - AMD RAPL energy counters support: (Dhananjay Ugwekar)
 
    - Introduce topology_logical_core_id() (K Prateek Nayak)
 
    - Remove the unused get_rapl_pmu_cpumask() function
    - Remove the cpu_to_rapl_pmu() function
    - Rename rapl_pmu variables
    - Make rapl_model struct global
    - Add arguments to the init and cleanup functions
    - Modify the generic variable names to *_pkg*
    - Remove the global variable rapl_msrs
    - Move the cntr_mask to rapl_pmus struct
    - Add core energy counter support for AMD CPUs
 
  - Intel core PMU driver enhancements:
 
    - Support RDPMC 'metrics clear mode' feature (Kan Liang)
    - Clarify adaptive PEBS processing (Kan Liang)
    - Factor out functions for PEBS records processing (Kan Liang)
    - Simplify the PEBS records processing for adaptive PEBS (Kan Liang)
 
  - Intel uncore driver enhancements: (Kan Liang)
 
    - Convert buggy pmu->func_id use to pmu->registered
    - Support more units on Granite Rapids
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmeOJdQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i2yQ/+MXl7yfJOgdbwjBpgGGzH4burEO7ppak+
 ktzz+YjpNgjODe/xMAJGjjblouuYArCnRolc1UPvPm6M7jSY76wi42Y6c4dRtFoB
 2ReSrRqnreLOcrRS9nsTjvWRHfJHqJDVSd9TfHX6ILfzbaizCZOGYk558ZxAKRqu
 Lw7FOvLEe/Y3tg4z8dDg083jsasalKySP9wIPc0BkSqQTOfusd3KXju/Fux/9wkn
 hZcUgF4ds+0bH7xtO1/G9ILqGyeq97X1McIR9bAjln5Mxykclen4hSjRaWWHHo9O
 mzBKmd/blIATisfuuW+QLDQow3M1k3688cz7e9QOeWHHd/dJiMb9RLV90jdND/T/
 uLINC5vNemzyWEfnNiYQ31LjhG3SeuDiKWzRp36MbQcCh6EBdRXWLBgtmxq1L/3o
 ZCaCdtFu5+6epycdyOVZEpWDnjdx4GmLXMZi5WJfZ7fZ/IFjNkjk4OdzI1iRQ+i3
 Sbi75ep59ayTUhm5AB7gCJsP3R7EsZsiPHUenQdA2n9Sj6xE+IuhlS/QDQ9g5mdY
 Ijs0jHeVCGmhYoOD1xWnCZSzlnkEVU3zwfypAK+MC7pgtFMwDy5/Bu1USGxXXDy+
 aKsrJRSgHbtZ1gwoHstqkV+DeCTfElCLYkvigzI5Nmyib5Zp4vkwy2ZLWQjaNjm7
 mqRI7PugUkU=
 =c8XB
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull performance events updates from Ingo Molnar:
 "Seqlock optimizations that arose in a perf context and were merged
  into the perf tree:

   - seqlock: Add raw_seqcount_try_begin (Suren Baghdasaryan)
   - mm: Convert mm_lock_seq to a proper seqcount (Suren Baghdasaryan)
   - mm: Introduce mmap_lock_speculate_{try_begin|retry} (Suren
     Baghdasaryan)
   - mm/gup: Use raw_seqcount_try_begin() (Peter Zijlstra)

  Core perf enhancements:

   - Reduce 'struct page' footprint of perf by mapping pages in advance
     (Lorenzo Stoakes)
   - Save raw sample data conditionally based on sample type (Yabin Cui)
   - Reduce sampling overhead by checking sample_type in
     perf_sample_save_callchain() and perf_sample_save_brstack() (Yabin
     Cui)
   - Export perf_exclude_event() (Namhyung Kim)

  Uprobes scalability enhancements: (Andrii Nakryiko)

   - Simplify find_active_uprobe_rcu() VMA checks
   - Add speculative lockless VMA-to-inode-to-uprobe resolution
   - Simplify session consumer tracking
   - Decouple return_instance list traversal and freeing
   - Ensure return_instance is detached from the list before freeing
   - Reuse return_instances between multiple uretprobes within task
   - Guard against kmemdup() failing in dup_return_instance()

  AMD core PMU driver enhancements:

   - Relax privilege filter restriction on AMD IBS (Namhyung Kim)

  AMD RAPL energy counters support: (Dhananjay Ugwekar)

   - Introduce topology_logical_core_id() (K Prateek Nayak)
   - Remove the unused get_rapl_pmu_cpumask() function
   - Remove the cpu_to_rapl_pmu() function
   - Rename rapl_pmu variables
   - Make rapl_model struct global
   - Add arguments to the init and cleanup functions
   - Modify the generic variable names to *_pkg*
   - Remove the global variable rapl_msrs
   - Move the cntr_mask to rapl_pmus struct
   - Add core energy counter support for AMD CPUs

  Intel core PMU driver enhancements:

   - Support RDPMC 'metrics clear mode' feature (Kan Liang)
   - Clarify adaptive PEBS processing (Kan Liang)
   - Factor out functions for PEBS records processing (Kan Liang)
   - Simplify the PEBS records processing for adaptive PEBS (Kan Liang)

  Intel uncore driver enhancements: (Kan Liang)

   - Convert buggy pmu->func_id use to pmu->registered
   - Support more units on Granite Rapids"

* tag 'perf-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  perf: map pages in advance
  perf/x86/intel/uncore: Support more units on Granite Rapids
  perf/x86/intel/uncore: Clean up func_id
  perf/x86/intel: Support RDPMC metrics clear mode
  uprobes: Guard against kmemdup() failing in dup_return_instance()
  perf/x86: Relax privilege filter restriction on AMD IBS
  perf/core: Export perf_exclude_event()
  uprobes: Reuse return_instances between multiple uretprobes within task
  uprobes: Ensure return_instance is detached from the list before freeing
  uprobes: Decouple return_instance list traversal and freeing
  uprobes: Simplify session consumer tracking
  uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  uprobes: simplify find_active_uprobe_rcu() VMA checks
  mm: introduce mmap_lock_speculate_{try_begin|retry}
  mm: convert mm_lock_seq to a proper seqcount
  mm/gup: Use raw_seqcount_try_begin()
  seqlock: add raw_seqcount_try_begin
  perf/x86/rapl: Add core energy counter support for AMD CPUs
  perf/x86/rapl: Move the cntr_mask to rapl_pmus struct
  perf/x86/rapl: Remove the global variable rapl_msrs
  ...
2025-01-21 10:52:03 -08:00
Linus Torvalds
1cbfb828e0 for-6.14/block-20250118
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeL6hoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppw2EADQV8nDgLRggZR+il4U03yKHXcQEdAX1GrB
 Erowx+dasIJuh6kp3n6qRe9QD/pRqt1DKyLvXoWF8Qfuwq85j7oDnDDYxutNYT27
 hDgrLJriJ3VeKYtTu+andHWt8P29b5h57UayInDOUJurEPA6rXyFZ5YVIti8n21K
 uDOrQXiACG3qRWS2+p2f3UNhX0MkFNFdN/lxi13WMIJtRWF5bXAP+JOgIWCID4Ze
 QuSY6rQD4dp4Q6M2erpX6tn0YZb7Hvw3rPjsd91n6jvYfTUVLH375zg8jCBpi6Wi
 Syufbb8xcTtriVPTDRNu0ekjebkc8wD8ax/h86g0z9v3Ua4DlNmsx9eXrtv6r5nu
 YXqDODOad6stI0+owFquW2vas0gHmfNSfyfGdlk2g24PMtP5Yx0V6FIEvwIeqnje
 ghgxQvBuKUsdhqakByfNnc+XvXi3+RUJek8kvMeUSUQWT1IyMQqPOOk0yp9WdyWD
 bY1f2ECP5BR1b37zYOyawewsI5xTupHUswn5a4r4qtGn3O15rGDkX98Nab5aLCnR
 rW/DvX7+wT6gW9EwrRHiwjwfNDZbsJ9Ggu3lMhtUl5GUWdk58yTiVgKaHJLnlX9/
 CKFKfyyIR1Vl8+gYIpemyFhhcoN+dCSf06ISkrg0jeS0/tYwydaAaCBPL5J4kxZA
 h3Rtbh+Pgg==
 =EXYs
 -----END PGP SIGNATURE-----

Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull requests via Keith:
      - Target support for PCI-Endpoint transport (Damien)
      - TCP IO queue spreading fixes (Sagi, Chaitanya)
      - Target handling for "limited retry" flags (Guixen)
      - Poll type fix (Yongsoo)
      - Xarray storage error handling (Keisuke)
      - Host memory buffer free size fix on error (Francis)

 - MD pull requests via Song:
      - Reintroduce md-linear (Yu Kuai)
      - md-bitmap refactor and fix (Yu Kuai)
      - Replace kmap_atomic with kmap_local_page (David Reaver)

 - Quite a few queue freeze and debugfs deadlock fixes

   Ming introduced lockdep support for this in the 6.13 kernel, and it
   has (unsurprisingly) uncovered quite a few issues

 - Use const attributes for IO schedulers

 - Remove bio ioprio wrappers

 - Fixes for stacked device atomic write support

 - Refactor queue affinity helpers, in preparation for better supporting
   isolated CPUs

 - Cleanups of loop O_DIRECT handling

 - Cleanup of BLK_MQ_F_* flags

 - Add rotational support for null_blk

 - Various fixes and cleanups

* tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits)
  block: Don't trim an atomic write
  block: Add common atomic writes enable flag
  md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()
  block: limit disk max sectors to (LLONG_MAX >> 9)
  block: Change blk_stack_atomic_writes_limits() unit_min check
  block: Ensure start sector is aligned for stacking atomic writes
  blk-mq: Move more error handling into blk_mq_submit_bio()
  block: Reorder the request allocation code in blk_mq_submit_bio()
  nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
  md/md-bitmap: move bitmap_{start, end}write to md upper layer
  md/raid5: implement pers->bitmap_sector()
  md: add a new callback pers->bitmap_sector()
  md/md-bitmap: remove the last parameter for bimtap_ops->endwrite()
  md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()
  md: Replace deprecated kmap_atomic() with kmap_local_page()
  md: reintroduce md-linear
  partitions: ldm: remove the initial kernel-doc notation
  blk-cgroup: rwstat: fix kernel-doc warnings in header file
  blk-cgroup: fix kernel-doc warnings in header file
  nbd: fix partial sending
  ...
2025-01-20 19:38:46 -08:00
Steven Rostedt
22412b72ca tracing: Rename update_cache() to update_mod_cache()
The static function in trace_events.c called update_cache() is too generic
and conflicts with the function defined in arch/openrisc/include/asm/pgtable.h

Rename it to update_mod_cache() to make it less generic.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250120172756.4ecfb43f@batman.local.home
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501210550.Ufrj5CRn-lkp@intel.com/
Fixes: b355247df1 ("tracing: Cache ":mod:" events for modules not loaded yet")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-20 19:09:16 -05:00
Linus Torvalds
1a89a6924b kernel-6.14-rc1.pid
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ4pR0wAKCRCRxhvAZXjc
 ojb2AQD5QfpTEX/ju1TkenTvoNl+JfnIjaVSY40Lm9DWYzmCMAEAuRvf5WRIV713
 00/RVOrUvsLobzhmnk0yw53EQ5A+pA0=
 =2NDA
 -----END PGP SIGNATURE-----

Merge tag 'kernel-6.14-rc1.pid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull pid_max namespacing update from Christian Brauner:
 "The pid_max sysctl is a global value. For a long time the default
  value has been 65535 and during the pidfd dicussions Linus proposed to
  bump pid_max by default. Based on this discussion systemd started
  bumping pid_max to 2^22. So all new systems now run with a very high
  pid_max limit with some distros having also backported that change.

  The decision to bump pid_max is obviously correct. It just doesn't
  make a lot of sense nowadays to enforce such a low pid number. There's
  sufficient tooling to make selecting specific processes without typing
  really large pid numbers available.

  In any case, there are workloads that have expections about how large
  pid numbers they accept. Either for historical reasons or
  architectural reasons. One concreate example is the 32-bit version of
  Android's bionic libc which requires pid numbers less than 65536.
  There are workloads where it is run in a 32-bit container on a 64-bit
  kernel. If the host has a pid_max value greater than 65535 the libc
  will abort thread creation because of size assumptions of
  pthread_mutex_t.

  That's a fairly specific use-case however, in general specific
  workloads that are moved into containers running on a host with a new
  kernel and a new systemd can run into issues with large pid_max
  values. Obviously making assumptions about the size of the allocated
  pid is suboptimal but we have userspace that does it.

  Of course, giving containers the ability to restrict the number of
  processes in their respective pid namespace indepent of the global
  limit through pid_max is something desirable in itself and comes in
  handy in general.

  Independent of motivating use-cases the existence of pid namespaces
  makes this also a good semantical extension and there have been prior
  proposals pushing in a similar direction. The trick here is to
  minimize the risk of regressions which I think is doable. The fact
  that pid namespaces are hierarchical will help us here.

  What we mostly care about is that when the host sets a low pid_max
  limit, say (crazy number) 100 that no descendant pid namespace can
  allocate a higher pid number in its namespace. Since pid allocation is
  hierarchial this can be ensured by checking each pid allocation
  against the pid namespace's pid_max limit. This means if the
  allocation in the descendant pid namespace succeeds, the ancestor pid
  namespace can reject it. If the ancestor pid namespace has a higher
  limit than the descendant pid namespace the descendant pid namespace
  will reject the pid allocation. The ancestor pid namespace will
  obviously not care about this.

  All in all this means pid_max continues to enforce a system wide limit
  on the number of processes but allows pid namespaces sufficient leeway
  in handling workloads with assumptions about pid values and allows
  containers to restrict the number of processes in a pid namespace
  through the pid_max interface"

* tag 'kernel-6.14-rc1.pid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  tests/pid_namespace: add pid_max tests
  pid: allow pid_max to be set per pid namespace
2025-01-20 10:29:11 -08:00
Steven Rostedt
a925df6f50 tracing: Fix #if CONFIG_MODULES to #ifdef CONFIG_MODULES
A typo was introduced when adding the ":mod:" command that did
a "#if CONFIG_MODULES" instead of a "#ifdef CONFIG_MODULES".
Fix it.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20250120125745.4ac90ca6@gandalf.local.home
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501190121.E2CIJuUj-lkp@intel.com/
Fixes: b355247df1 ("tracing: Cache ":mod:" events for modules not loaded yet")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-20 13:03:23 -05:00
Steven Rostedt
31f505dc70 ftrace: Implement :mod: cache filtering on kernel command line
Module functions can be set to set_ftrace_filter before the module is
loaded.

  # echo :mod:snd_hda_intel > set_ftrace_filter

This will enable all the functions for the module snd_hda_intel. If that
module is not loaded, it is "cached" in the trace array for when the
module is loaded, its functions will be traced.

But this is not implemented in the kernel command line. That's because the
kernel command line filtering is added very early in boot up as it is
needed to be done before boot time function tracing can start, which is
also available very early in boot up. The code used by the
"set_ftrace_filter" file can not be used that early as it depends on some
other initialization to occur first. But some of the functions can.

Implement the ":mod:" feature of "set_ftrace_filter" in the kernel command
line parsing. Now function tracing on just a single module that is loaded
at boot up can be done.

Adding:

 ftrace=function ftrace_filter=:mod:sna_hda_intel

To the kernel command line will only enable the sna_hda_intel module
functions when the module is loaded, and it will start tracing.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250116175832.34e39779@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-16 21:27:10 -05:00
Masami Hiramatsu (Google)
8275637215 tracing: Adopt __free() and guard() for trace_fprobe.c
Adopt __free() and guard() for trace_fprobe.c to remove gotos.

Link: https://lore.kernel.org/173708043449.319651.12242878905778792182.stgit@mhiramat.roam.corp.google.com
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-16 21:27:07 -05:00
Steven Rostedt
b355247df1 tracing: Cache ":mod:" events for modules not loaded yet
When the :mod: command is written into /sys/kernel/tracing/set_event (or
that file within an instance), if the module specified after the ":mod:"
is not yet loaded, it will store that string internally. When the module
is loaded, it will enable the events as if the module was loaded when the
string was written into the set_event file.

This can also be useful to enable events that are in the init section of
the module, as the events are enabled before the init section is executed.

This also works on the kernel command line:

 trace_event=:mod:<module>

Will enable the events for <module> when it is loaded.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250116143533.514730995@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-16 09:41:08 -05:00
Steven Rostedt
4c86bc531e tracing: Add :mod: command to enabled module events
Add a :mod: command to enable only events from a given module from the
set_events file.

  echo '*:mod:<module>' > set_events

Or

  echo ':mod:<module>' > set_events

Will enable all events for that module. Specific events can also be
enabled via:

  echo '<event>:mod:<module>' > set_events

Or

  echo '<system>:<event>:mod:<module>' > set_events

Or

  echo '*:<event>:mod:<module>' > set_events

The ":mod:" keyword is consistent with the function tracing filter to
enable functions from a given module.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250116143533.214496360@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-16 09:41:07 -05:00
Puranjay Mohan
87c544108b bpf: Send signals asynchronously if !preemptible
BPF programs can execute in all kinds of contexts and when a program
running in a non-preemptible context uses the bpf_send_signal() kfunc,
it will cause issues because this kfunc can sleep.
Change `irqs_disabled()` to `!preemptible()`.

Reported-by: syzbot+97da3d7e0112d59971de@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67486b09.050a0220.253251.0084.GAE@google.com/
Fixes: 1bc7896e9e ("bpf: Fix deadlock with rq_lock in bpf_send_signal()")
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250115103647.38487-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-15 13:44:08 -08:00
Shrikanth Hegde
24e0e61040 tracing: Print lazy preemption model
Print lazy preemption model in ftrace header when latency-format=1.

 # cat /sys/kernel/debug/sched/preempt
 none voluntary full (lazy)

Without patch:
  latency: 0 us, #232946/232946, CPU#40 | (M:unknown VP:0, KP:0, SP:0 HP:0 #P:80)
                                             ^^^^^^^

With Patch:
  latency: 0 us, #1897938/25566788, CPU#16 | (M:lazy VP:0, KP:0, SP:0 HP:0 #P:80)
                                                ^^^^

Now that lazy preemption is part of the kernel, make sure the tracing
infrastructure reflects that.

Link: https://lore.kernel.org/20250103093647.575919-1-sshegde@linux.ibm.com
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-14 09:44:33 -05:00
Steven Rostedt
a485ea9e3e tracing: Fix irqsoff and wakeup latency tracers when using function graph
The function graph tracer has become generic so that kretprobes and BPF
can use it along with function graph tracing itself. Some of the
infrastructure was specific for function graph tracing such as recording
the calltime and return time of the functions. Calling the clock code on a
high volume function does add overhead. The calculation of the calltime
was removed from the generic code and placed into the function graph
tracer itself so that the other users did not incur this overhead as they
did not need that timestamp.

The calltime field was still kept in the generic return entry structure
and the function graph return entry callback filled it as that structure
was passed to other code.

But this broke both irqsoff and wakeup latency tracer as they still
depended on the trace structure containing the calltime when the option
display-graph is set as it used some of those same functions that the
function graph tracer used. But now the calltime was not set and was just
zero. This caused the calculation of the function time to be the absolute
value of the return timestamp and not the length of the function.

 # cd /sys/kernel/tracing
 # echo 1 > options/display-graph
 # echo irqsoff > current_tracer

The tracers went from:

 #   REL TIME      CPU  TASK/PID       ||||     DURATION                  FUNCTION CALLS
 #      |          |     |    |        ||||      |   |                     |   |   |   |
        0 us |   4)    <idle>-0    |  d..1. |   0.000 us    |  irqentry_enter();
        3 us |   4)    <idle>-0    |  d..2. |               |  irq_enter_rcu() {
        4 us |   4)    <idle>-0    |  d..2. |   0.431 us    |    preempt_count_add();
        5 us |   4)    <idle>-0    |  d.h2. |               |    tick_irq_enter() {
        5 us |   4)    <idle>-0    |  d.h2. |   0.433 us    |      tick_check_oneshot_broadcast_this_cpu();
        6 us |   4)    <idle>-0    |  d.h2. |   2.426 us    |      ktime_get();
        9 us |   4)    <idle>-0    |  d.h2. |               |      tick_nohz_stop_idle() {
       10 us |   4)    <idle>-0    |  d.h2. |   0.398 us    |        nr_iowait_cpu();
       11 us |   4)    <idle>-0    |  d.h1. |   1.903 us    |      }
       11 us |   4)    <idle>-0    |  d.h2. |               |      tick_do_update_jiffies64() {
       12 us |   4)    <idle>-0    |  d.h2. |               |        _raw_spin_lock() {
       12 us |   4)    <idle>-0    |  d.h2. |   0.360 us    |          preempt_count_add();
       13 us |   4)    <idle>-0    |  d.h3. |   0.354 us    |          do_raw_spin_lock();
       14 us |   4)    <idle>-0    |  d.h2. |   2.207 us    |        }
       15 us |   4)    <idle>-0    |  d.h3. |   0.428 us    |        calc_global_load();
       16 us |   4)    <idle>-0    |  d.h3. |               |        _raw_spin_unlock() {
       16 us |   4)    <idle>-0    |  d.h3. |   0.380 us    |          do_raw_spin_unlock();
       17 us |   4)    <idle>-0    |  d.h3. |   0.334 us    |          preempt_count_sub();
       18 us |   4)    <idle>-0    |  d.h1. |   1.768 us    |        }
       18 us |   4)    <idle>-0    |  d.h2. |               |        update_wall_time() {
      [..]

To:

 #   REL TIME      CPU  TASK/PID       ||||     DURATION                  FUNCTION CALLS
 #      |          |     |    |        ||||      |   |                     |   |   |   |
        0 us |   5)    <idle>-0    |  d.s2. |   0.000 us    |  _raw_spin_lock_irqsave();
        0 us |   5)    <idle>-0    |  d.s3. |   312159583 us |      preempt_count_add();
        2 us |   5)    <idle>-0    |  d.s4. |   312159585 us |      do_raw_spin_lock();
        3 us |   5)    <idle>-0    |  d.s4. |               |      _raw_spin_unlock() {
        3 us |   5)    <idle>-0    |  d.s4. |   312159586 us |        do_raw_spin_unlock();
        4 us |   5)    <idle>-0    |  d.s4. |   312159587 us |        preempt_count_sub();
        4 us |   5)    <idle>-0    |  d.s2. |   312159587 us |      }
        5 us |   5)    <idle>-0    |  d.s3. |               |      _raw_spin_lock() {
        5 us |   5)    <idle>-0    |  d.s3. |   312159588 us |        preempt_count_add();
        6 us |   5)    <idle>-0    |  d.s4. |   312159589 us |        do_raw_spin_lock();
        7 us |   5)    <idle>-0    |  d.s3. |   312159590 us |      }
        8 us |   5)    <idle>-0    |  d.s4. |   312159591 us |      calc_wheel_index();
        9 us |   5)    <idle>-0    |  d.s4. |               |      enqueue_timer() {
        9 us |   5)    <idle>-0    |  d.s4. |               |        wake_up_nohz_cpu() {
       11 us |   5)    <idle>-0    |  d.s4. |               |          native_smp_send_reschedule() {
       11 us |   5)    <idle>-0    |  d.s4. |   312171987 us |            default_send_IPI_single_phys();
    12408 us |   5)    <idle>-0    |  d.s3. |   312171990 us |          }
    12408 us |   5)    <idle>-0    |  d.s3. |   312171991 us |        }
    12409 us |   5)    <idle>-0    |  d.s3. |   312171991 us |      }

Where the calculation of the time for each function was the return time
minus zero and not the time of when the function returned.

Have these tracers also save the calltime in the fgraph data section and
retrieve it again on the return to get the correct timings again.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20250113183124.61767419@gandalf.local.home
Fixes: f1f36e22be ("ftrace: Have calltime be saved in the fgraph storage")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-14 09:38:09 -05:00
Jeongjun Park
6e31b759b0 ring-buffer: Make reading page consistent with the code logic
In the loop of __rb_map_vma(), the 's' variable is calculated from the
same logic that nr_pages is and they both come from nr_subbufs. But the
relationship is not obvious and there's a WARN_ON_ONCE() around the 's'
variable to make sure it never becomes equal to nr_subbufs within the
loop. If that happens, then the code is buggy and needs to be fixed.

The 'page' variable is calculated from cpu_buffer->subbuf_ids[s] which is
an array of 'nr_subbufs' entries. If the code becomes buggy and 's'
becomes equal to or greater than 'nr_subbufs' then this will be an out of
bounds hit before the WARN_ON() is triggered and the code exiting safely.

Make the 'page' initialization consistent with the code logic and assign
it after the out of bounds check.

Link: https://lore.kernel.org/20250110162612.13983-1-aha310510@gmail.com
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
[ sdr: rewrote change log ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-13 16:05:43 -05:00
Vincent Donnefort
0568c6ebf0 ring-buffer: Check for empty ring-buffer with rb_num_of_entries()
Currently there are two ways of identifying an empty ring-buffer. One
relying on the current status of the commit / reader page
(rb_per_cpu_empty()) and the other on the write and read counters
(rb_num_of_entries() used in rb_get_reader_page()).

with rb_num_of_entries(). This intends to ease later
introduction of ring-buffer writers which are out of the kernel control
and with whom, the only information available is through the meta-page
counters.

Link: https://lore.kernel.org/20250108114536.627715-2-vdonnefort@google.com
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-13 15:39:50 -05:00
Masami Hiramatsu (Google)
4f7caaa2f9 bpf: Use ftrace_get_symaddr() for kprobe_multi probes
Add ftrace_get_entry_ip() which is only for ftrace based probes, and use
it for kprobe multi probes because they are based on fprobe which uses
ftrace instead of kprobes.

Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf <bpf@vger.kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alan Maguire <alan.maguire@oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/173566081414.878879.10631096557346094362.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-13 15:04:30 -05:00
Masami Hiramatsu (Google)
927054606d tracing/kprobes: Simplify __trace_kprobe_create() by removing gotos
Simplify __trace_kprobe_create() by removing gotos.

Link: https://lore.kernel.org/all/173643301102.1514810.6149004416601259466.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-10 09:01:14 +09:00
Masami Hiramatsu (Google)
7dcc352078 tracing: Use __free() for kprobe events to cleanup
Use __free() in trace_kprobe.c to cleanup code.

Link: https://lore.kernel.org/all/173643299989.1514810.2924926552980462072.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-10 09:01:01 +09:00
Masami Hiramatsu (Google)
4af0532a0f tracing: Use __free() in trace_probe for cleanup
Use __free() in trace_probe to cleanup some gotos.

Link: https://lore.kernel.org/all/173643298860.1514810.7267350121047606213.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-10 09:00:13 +09:00
Masami Hiramatsu (Google)
4e83017e4c tracing/eprobe: Adopt guard() and scoped_guard()
Use guard() or scoped_guard() in eprobe events for critical sections
rather than discrete lock/unlock pairs.

Link: https://lore.kernel.org/all/173289890996.73724.17421347964110362029.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-01-10 09:00:12 +09:00
Masami Hiramatsu (Google)
f8821732dc tracing/uprobe: Adopt guard() and scoped_guard()
Use guard() or scoped_guard() in uprobe events for critical sections
rather than discrete lock/unlock pairs.

Link: https://lore.kernel.org/all/173289889911.73724.12457932738419630525.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-01-10 09:00:12 +09:00
Masami Hiramatsu (Google)
2cba0070cd tracing/kprobe: Adopt guard() and scoped_guard()
Use guard() or scoped_guard() in kprobe events for critical sections
rather than discrete lock/unlock pairs.

Link: https://lore.kernel.org/all/173289888883.73724.6586200652276577583.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-01-10 09:00:12 +09:00
Masami Hiramatsu (Google)
30c8fd31c5 tracing/kprobes: Fix to free objects when failed to copy a symbol
In __trace_kprobe_create(), if something fails it must goto error block
to free objects. But when strdup() a symbol, it returns without that.
Fix it to goto the error block to free objects correctly.

Link: https://lore.kernel.org/all/173643297743.1514810.2408159540454241947.stgit@devnote2/

Fixes: 6212dd2968 ("tracing/kprobes: Use dyn_event framework for kprobe events")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-10 08:57:18 +09:00
Jiri Olsa
2ebadb60cb bpf: Return error for missed kprobe multi bpf program execution
When kprobe multi bpf program can't be executed due to recursion check,
we currently return 0 (success) to fprobe layer where it's ignored for
standard kprobe multi probes.

For kprobe session the success return value will make fprobe layer to
install return probe and try to execute it as well.

But the return session probe should not get executed, because the entry
part did not run. FWIW the return probe bpf program most likely won't get
executed, because its recursion check will likely fail as well, but we
don't need to run it in the first place.. also we can make this clear
and obvious.

It also affects missed counts for kprobe session program execution, which
are now doubled (extra count for not executed return probe).

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20250106175048.1443905-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08 09:39:58 -08:00
Pu Lehui
ca3c4f646a bpf: Move out synchronize_rcu_tasks_trace from mutex CS
Commit ef1b808e3b ("bpf: Fix UAF via mismatching bpf_prog/attachment
RCU flavors") resolved a possible UAF issue in uprobes that attach
non-sleepable bpf prog by explicitly waiting for a tasks-trace-RCU grace
period. But, in the current implementation, synchronize_rcu_tasks_trace
is included within the mutex critical section, which increases the
length of the critical section and may affect performance. So let's move
out synchronize_rcu_tasks_trace from mutex CS.

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20250104013946.1111785-1-pulehui@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08 09:38:41 -08:00
Masami Hiramatsu (Google)
66fc6f521a tracing/hist: Support POLLPRI event for poll on histogram
Since POLLIN will not be flushed until the hist file is read, the user
needs to repeatedly read() and poll() on the hist file for monitoring the
event continuously. But the read() is somewhat redundant when the user is
only monitoring for event updates.

Add POLLPRI poll event on the hist file so the event returns when a
histogram is updated after open(), poll() or read(). Thus it is possible
to wait for the next event without having to issue a read().

Cc: Shuah Khan <shuah@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/173527248770.464571.2536902137325258133.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-07 11:46:32 -05:00
Masami Hiramatsu (Google)
1bd13edbbe tracing/hist: Add poll(POLLIN) support on hist file
Add poll syscall support on the `hist` file. The Waiter will be waken
up when the histogram is updated with POLLIN.

Currently, there is no way to wait for a specific event in userspace.
So user needs to peek the `trace` periodicaly, or wait on `trace_pipe`.
But it is not a good idea to peek at the `trace` for an event that
randomly happens. And `trace_pipe` is not coming back until a page is
filled with events.

This allows a user to wait for a specific event on the `hist` file. User
can set a histogram trigger on the event which they want to monitor
and poll() on its `hist` file. Since this poll() returns POLLIN, the next
poll() will return soon unless a read() happens on that hist file.

NOTE: To read the hist file again, you must set the file offset to 0,
but just for monitoring the event, you may not need to read the
histogram.

Cc: Shuah Khan <shuah@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/173527247756.464571.14236296701625509931.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-07 11:44:49 -05:00
Steven Rostedt
22bec11a56 tracing: Fix using ret variable in tracing_set_tracer()
When the function tracing_set_tracer() switched over to using the guard()
infrastructure, it did not need to save the 'ret' variable and would just
return the value when an error arised, instead of setting ret and jumping
to an out label.

When CONFIG_TRACER_SNAPSHOT is enabled, it had code that expected the
"ret" variable to be initialized to zero and had set 'ret' while holding
an arch_spin_lock() (not used by guard), and then upon releasing the lock
it would check 'ret' and exit if set. But because ret was only set when an
error occurred while holding the locks, 'ret' would be used uninitialized
if there was no error. The code in the CONFIG_TRACER_SNAPSHOT block should
be self contain. Make sure 'ret' is also set when no error occurred.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250106111143.2f90ff65@gandalf.local.home
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202412271654.nJVBuwmF-lkp@intel.com/
Fixes: d33b10c0c7 ("tracing: Switch trace.c code over to use guard()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-01-07 11:39:46 -05:00
Linus Torvalds
e30dd219c7 Fixes for ftrace in v6.13:
- Add needed READ_ONCE() around access to the fgraph array element
 
   The updates to the fgraph array can happen when callbacks are registered
   and unregistered. The __ftrace_return_to_handler() can handle reading
   either the old value or the new value. But once it reads that value
   it must stay consistent otherwise the check that looks to see if the
   value is a stub may show false, but if the compiler decides to re-read
   after that check, it can be true which can cause the code to crash
   later on.
 
 - Make function profiler use the top level ops for filtering again
 
   When function graph became available for instances, its filter ops became
   independent from the top level set_ftrace_filter. In the process the
   function profiler received its own filter ops as well. But the function
   profiler uses the top level set_ftrace_filter file and does not have one
   of its own. In giving it its own filter ops, it lost any user interface
   it once had. Make it use the top level set_ftrace_filter file again.
   This fixes a regression.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ3cR4RQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qjxfAQCPhNztdmGmEYmuBtONPHwejidWnuJ6
 Rl2mQxEbp40OUgD+JvSWofhRsvtXWlymqZ9j+dKMegLqMeq834hB0LK4NAg=
 =+KqV
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v6.13-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace fixes from Steven Rostedt:

 - Add needed READ_ONCE() around access to the fgraph array element

   The updates to the fgraph array can happen when callbacks are
   registered and unregistered. The __ftrace_return_to_handler() can
   handle reading either the old value or the new value. But once it
   reads that value it must stay consistent otherwise the check that
   looks to see if the value is a stub may show false, but if the
   compiler decides to re-read after that check, it can be true which
   can cause the code to crash later on.

 - Make function profiler use the top level ops for filtering again

   When function graph became available for instances, its filter ops
   became independent from the top level set_ftrace_filter. In the
   process the function profiler received its own filter ops as well.
   But the function profiler uses the top level set_ftrace_filter file
   and does not have one of its own. In giving it its own filter ops, it
   lost any user interface it once had. Make it use the top level
   set_ftrace_filter file again. This fixes a regression.

* tag 'ftrace-v6.13-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ftrace: Fix function profiler's filtering functionality
  fgraph: Add READ_ONCE() when accessing fgraph_array[]
2025-01-03 10:04:43 -08:00
Kohei Enju
789a8cff8d ftrace: Fix function profiler's filtering functionality
Commit c132be2c4f ("function_graph: Have the instances use their own
ftrace_ops for filtering"), function profiler (enabled via
function_profile_enabled) has been showing statistics for all functions,
ignoring set_ftrace_filter settings.

While tracers are instantiated, the function profiler is not. Therefore, it
should use the global set_ftrace_filter for consistency.  This patch
modifies the function profiler to use the global filter, fixing the
filtering functionality.

Before (filtering not working):
```
root@localhost:~# echo 'vfs*' > /sys/kernel/tracing/set_ftrace_filter
root@localhost:~# echo 1 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# sleep 1
root@localhost:~# echo 0 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# head /sys/kernel/tracing/trace_stat/*
  Function                               Hit    Time            Avg
     s^2
  --------                               ---    ----            ---
     ---
  schedule                               314    22290594 us     70989.15 us
     40372231 us
  x64_sys_call                          1527    8762510 us      5738.382 us
     3414354 us
  schedule_hrtimeout_range               176    8665356 us      49234.98 us
     405618876 us
  __x64_sys_ppoll                        324    5656635 us      17458.75 us
     19203976 us
  do_sys_poll                            324    5653747 us      17449.83 us
     19214945 us
  schedule_timeout                        67    5531396 us      82558.15 us
     2136740827 us
  __x64_sys_pselect6                      12    3029540 us      252461.7 us
     63296940171 us
  do_pselect.constprop.0                  12    3029532 us      252461.0 us
     63296952931 us
```

After (filtering working):
```
root@localhost:~# echo 'vfs*' > /sys/kernel/tracing/set_ftrace_filter
root@localhost:~# echo 1 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# sleep 1
root@localhost:~# echo 0 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# head /sys/kernel/tracing/trace_stat/*
  Function                               Hit    Time            Avg
     s^2
  --------                               ---    ----            ---
     ---
  vfs_write                              462    68476.43 us     148.217 us
     25874.48 us
  vfs_read                               641    9611.356 us     14.994 us
     28868.07 us
  vfs_fstat                              890    878.094 us      0.986 us
     1.667 us
  vfs_fstatat                            227    757.176 us      3.335 us
     18.928 us
  vfs_statx                              226    610.610 us      2.701 us
     17.749 us
  vfs_getattr_nosec                     1187    460.919 us      0.388 us
     0.326 us
  vfs_statx_path                         297    343.287 us      1.155 us
     11.116 us
  vfs_rename                               6    291.575 us      48.595 us
     9889.236 us
```

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250101190820.72534-1-enjuk@amazon.com
Fixes: c132be2c4f ("function_graph: Have the instances use their own ftrace_ops for filtering")
Signed-off-by: Kohei Enju <enjuk@amazon.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-02 17:21:33 -05:00