As described in the old comment dating back to
commit 6610e0893b ("RTC: Rework RTC code to use timerqueue for events")
from 2010, we have been living with a race window when setting an alarm
with an expiry in the near future (i.e. the next second).
With 1 second resolution, it can happen that the second ticks after the
check for the timer having expired, but before the alarm is actually set.
When this happens, no alarm IRQ is generated, at least not with some RTC
chips (isl12022 is an example of this).
With UIE RTC timer being implemented on top of alarm irq, being re-armed
every second, UIE will occasionally fail to work, as an alarm irq lost
due to this race will stop the re-arming loop.
For now, I have limited the additional expiry check to alarms set to
expire within the next second. I expect this to be good enough, although I
don't know for sure whether heavily loaded systems could hit the same
problem for alarms set 2 seconds or more into the future.
I haven't been able to reproduce the problem with this check in place.
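The race can be illustrated with a small simulation. This is only a sketch,
not the driver code: FakeRtc, set_alarm_racy() and set_alarm_checked() are
hypothetical names, and the model assumes a chip that never raises an IRQ for
an alarm whose time has already passed.

```python
class FakeRtc:
    """Toy RTC with 1 second resolution (hypothetical model, not a real driver)."""

    def __init__(self, now):
        self.now = now          # current time in whole seconds
        self.alarm = None       # armed alarm time, or None

    def tick(self):
        """Advance one second; return True if the alarm IRQ fires."""
        self.now += 1
        if self.alarm is not None and self.alarm == self.now:
            self.alarm = None
            return True
        return False

    def program_alarm(self, when):
        # Assumption: an alarm programmed for "now" or the past never fires.
        self.alarm = when


def set_alarm_racy(rtc, when, tick_in_window=False):
    """Check for expiry, then program the alarm, with no re-check afterwards."""
    if when <= rtc.now:
        return "expired"        # caller handles the event immediately
    if tick_in_window:
        rtc.tick()              # the second ticks between check and programming
    rtc.program_alarm(when)
    return "armed"


def set_alarm_checked(rtc, when, tick_in_window=False):
    """Same as above, plus a second expiry check for alarms due the next second."""
    start = rtc.now
    result = set_alarm_racy(rtc, when, tick_in_window)
    if result == "armed" and when - start <= 1 and when <= rtc.now:
        rtc.alarm = None
        return "expired"        # the alarm was missed; report it as expired
    return result
```

In the racy variant a tick inside the window leaves an alarm armed that can
never fire, which is what would stop a re-arming loop such as UIE.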
Cc: stable@vger.kernel.org
Signed-off-by: Esben Haabendal <esben@geanix.com>
Link: https://lore.kernel.org/r/20250516-rtc-uie-irq-fixes-v2-1-3de8e530a39e@geanix.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Merge tag 'trace-v6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing clean up and fixes from Steven Rostedt:
- Have osnoise tracer use memdup_user_nul()
The function osnoise_cpus_write() open codes a kmalloc() and then a
copy_from_user() and then adds a nul byte at the end which is the
same as simply using memdup_user_nul().
- Fix wakeup and irq tracers when failing to acquire calltime
When the wakeup and irq tracers use the function graph tracer for
tracing function times, it saves a timestamp into the fgraph shadow
stack. It is possible that this could fail to be stored. If that
happens, it exits the routine early. These functions also disable
nesting of the operations by incrementing the data "disable" counter.
But if the calltime exits out early, it never increments the counter
back to what it needs to be.
Since there's only a couple of lines of code that does work after
acquiring the calltime, instead of exiting out early, reverse the if
statement to be true if calltime is acquired, and place the code that
is to be done within that if block. The clean up will always be done
after that.
- Fix ring_buffer_map() return value on failure of __rb_map_vma()
If __rb_map_vma() fails in ring_buffer_map(), it does not return an
error. This means the caller will be working against a bad vma
mapping. Have ring_buffer_map() return an error when __rb_map_vma()
fails.
- Fix regression of writing to the trace_marker file
A bug fix was made to change __copy_from_user_inatomic() to
copy_from_user_nofault() in the trace_marker write function. The
trace_marker file is used by applications to write into it (usually
with a file descriptor opened at the start of the program) to record
into the tracing system. It's usually used in critical sections so
the write to trace_marker is highly optimized.
The reason for copying in an atomic section is that the write
reserves space on the ring buffer and then writes directly into it.
After it writes, it commits the event. The time between reserve and
commit must have preemption disabled.
The trace marker write does not have any locking nor can it allocate
due to the nature of it being a critical path.
Unfortunately, converting __copy_from_user_inatomic() to
copy_from_user_nofault() caused a regression in Android. Now all the
writes from its applications trigger the fault that is rejected by
the _nofault() version that wasn't rejected by the _inatomic()
version. Instead of getting data, it now just gets a trace buffer
filled with:
tracing_mark_write: <faulted>
To fix this, on opening of the trace_marker file, allocate per CPU
buffers that can be used by the write call. Then when entering the
write call, do the following:
preempt_disable();
cpu = smp_processor_id();
buffer = per_cpu_ptr(cpu_buffers, cpu);
do {
	cnt = nr_context_switches_cpu(cpu);
	migrate_disable();
	preempt_enable();
	ret = copy_from_user(buffer, ptr, size);
	preempt_disable();
	migrate_enable();
} while (!ret && cnt != nr_context_switches_cpu(cpu));
if (!ret)
	ring_buffer_write(buffer);
preempt_enable();
This works similarly to seqcount. As it must enable preemption to do
a copy_from_user() into a per CPU buffer, if it gets preempted, the
buffer could be corrupted by another task.
To handle this, read the number of context switches of the current
CPU, disable migration, enable preemption, copy the data from user
space, then immediately disable preemption again. If the number of
context switches is the same, the buffer is still valid. Otherwise it
must be assumed that the buffer may have been corrupted and it needs
to try again.
Now the trace_marker write can get the user data even if it has to
fault it in, and still not grab any locks of its own.
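The retry scheme described for the trace_marker write can be modelled in user
space. The following is a sketch under stated assumptions, not the kernel
code: Cpu, run_other_task() and marker_write() are hypothetical names, and
preemption is simulated by naming the attempts during which another task
overwrites the shared buffer.

```python
class Cpu:
    """Toy model of one CPU with a shared scratch buffer (hypothetical)."""

    def __init__(self):
        self.context_switches = 0
        self.buffer = b""

    def run_other_task(self, data):
        """Another task is scheduled in and reuses the per-CPU buffer."""
        self.context_switches += 1
        self.buffer = data


def marker_write(cpu, user_data, preempt_on_attempts=()):
    """Copy user_data into the per-CPU buffer, retrying if we were preempted.

    preempt_on_attempts lists the attempt numbers (0-based) during which
    another task clobbers the buffer, standing in for real preemption.
    """
    attempt = 0
    while True:
        cnt = cpu.context_switches          # snapshot before the copy
        cpu.buffer = user_data              # the copy, done with preemption enabled
        if attempt in preempt_on_attempts:
            cpu.run_other_task(b"garbage")  # preempted: buffer may be corrupted
        attempt += 1
        if cnt == cpu.context_switches:     # no switch happened: buffer is valid
            return cpu.buffer, attempt
```

The counter comparison plays the role of the seqcount read-side check: the
copy is only trusted if no context switch was observed around it.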
* tag 'trace-v6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Have trace_marker use per-cpu data to read user space
ring buffer: Propagate __rb_map_vma return value to caller
tracing: Fix irqoff tracers on failure of acquiring calltime
tracing: Fix wakeup tracers on failure of acquiring calltime
tracing/osnoise: Replace kmalloc + copy_from_user with memdup_user_nul
Merge tag '9p-for-6.18-rc1' of https://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
"A bunch of unrelated fixes:
- polling fix for trans fd that ought to have been fixed otherwise
back in March, but apparently came back somewhere else...
- USB transport buffer overflow fix
- Some dentry lifetime rework to handle metadata update for currently
opened files in uncached mode, or inode type change in cached mode
- a double-put on invalid flush found by syzbot
- and finally /sys/fs/9p/caches not advancing buffer and overwriting
itself for large contents
Thanks to everyone involved!"
* tag '9p-for-6.18-rc1' of https://github.com/martinetd/linux:
9p: sysfs_init: don't hardcode error to ENOMEM
9p: fix /sys/fs/9p/caches overwriting itself
9p: clean up comment typos
9p/trans_fd: p9_fd_request: kick rx thread if EPOLLIN
net/9p: fix double req put in p9_fd_cancelled
net/9p: Fix buffer overflow in USB transport layer
fs/9p: Add p9_debug(VFS) in d_revalidate
fs/9p: Invalidate dentry if inode type change detected in cached mode
fs/9p: Refresh metadata in d_revalidate for uncached mode too
Merge tag 'net-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter.
Current release - regressions:
- mlx5: fix pre-2.40 binutils assembler error
Current release - new code bugs:
- net: psp: don't assume reply skbs will have a socket
- eth: fbnic: fix missing programming of the default descriptor
Previous releases - regressions:
- page_pool: fix PP_MAGIC_MASK to avoid crashing on some 32-bit arches
- tcp:
- take care of zero tp->window_clamp in tcp_set_rcvlowat()
- don't call reqsk_fastopen_remove() in tcp_conn_request()
- eth:
- ice: release xa entry on adapter allocation failure
- usb: asix: hold PM usage ref to avoid PM/MDIO + RTNL deadlock
Previous releases - always broken:
- netfilter: validate objref and objrefmap expressions
- sctp: fix a null dereference in sctp_disposition sctp_sf_do_5_1D_ce()
- eth:
- mlx4: prevent potential use after free in mlx4_en_do_uc_filter()
- mlx5: prevent tunnel mode conflicts between FDB and NIC IPsec tables
- ocelot: fix use-after-free caused by cyclic delayed work
Misc:
- add support for MediaTek PCIe 5G HP DRMR-H01"
* tag 'net-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits)
net: airoha: Fix loopback mode configuration for GDM2 port
selftests: drv-net: pp_alloc_fail: add necessary optoins to config
selftests: drv-net: pp_alloc_fail: lower traffic expectations
selftests: drv-net: fix linter warnings in pp_alloc_fail
eth: fbnic: fix reporting of alloc_failed qstats
selftests: drv-net: xdp: add test for interface level qstats
selftests: drv-net: xdp: rename netnl to ethnl
eth: fbnic: fix saving stats from XDP_TX rings on close
eth: fbnic: fix accounting of XDP packets
eth: fbnic: fix missing programming of the default descriptor
selftests: netfilter: query conntrack state to check for port clash resolution
selftests: netfilter: nft_fib.sh: fix spurious test failures
bridge: br_vlan_fill_forward_path_pvid: use br_vlan_group_rcu()
netfilter: nft_objref: validate objref and objrefmap expressions
net: pse-pd: tps23881: Fix current measurement scaling
net/mlx5: fix pre-2.40 binutils assembler error
net/mlx5e: Do not fail PSP init on missing caps
net/mlx5e: Prevent tunnel reformat when tunnel mode not allowed
net/mlx5: Prevent tunnel mode conflicts between FDB and NIC IPsec tables
net: usb: asix: hold PM usage ref to avoid PM/MDIO + RTNL deadlock
...
Merge tag 's390-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Alexander Gordeev:
- Compile the decompressor with -Wno-pointer-sign flag to avoid a clang
warning
- Fix incomplete conversion to flag output macros in __xsch(), to avoid
always zero return value instead of the expected condition code
- Remove superfluous newlines from inline assemblies to improve
compiler inlining decisions
- Expose firmware provided UID Checking state in sysfs regardless of
the device presence or state
- CIO does not unregister subchannels when the attached device is
invalid or unavailable. Update the purge function to remove I/O
subchannels if the device number is found on cio_ignore list
- Consolidate PAI crypto allocation and cleanup paths
- The uv_get_secret_metadata() function was removed a few months ago;
also remove the mention of it in a comment
* tag 's390-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/uv: Fix comment of uv_find_secret() function
s390/pai_crypto: Consolidate PAI crypto allocation and cleanup paths
s390/cio: Update purge function to unregister the unused subchannels
s390/pci: Expose firmware provided UID Checking state in sysfs
s390: Remove superfluous newlines from inline assemblies
s390/cio/ioasm: Fix __xsch() condition code handling
s390: Add -Wno-pointer-sign to KBUILD_CFLAGS_DECOMPRESSOR
Merge tag 'slab-for-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fixes from Vlastimil Babka:
- Fixes for several corner cases in error paths and debugging options,
related to the new kmalloc_nolock() functionality (Kuniyuki Iwashima,
Ran Xiaokai)
* tag 'slab-for-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
slub: Don't call lockdep_unregister_key() for immature kmem_cache.
slab: Fix using this_cpu_ptr() in preemptible context
slab: Add allow_spin check to eliminate kmemleak warnings
We can do the same cleanup on laundromat.
On invalidate_all_cached_dirs(), run laundromat worker with 0 timeout
and flush it for immediate + sync cleanup.
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
Remove redundant assignment of @rc as it will be overwritten by the
following cifs_file_flush() call.
Reported-by: Steve French <stfrench@microsoft.com>
Addresses-Coverity: 1665925
Fixes: 210627b0aca9 ("smb: client: fix missing timestamp updates with O_TRUNC")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
AIO+DIO may extend the file size, hence we need to make sure ->i_size
is stable across the entire fallocate(2) operation. Otherwise the
operation would act as a truncate, with the inode size reduced back down
when it finishes.
Fix this by calling netfs_wait_for_outstanding_io() right after
acquiring ->i_rwsem exclusively in cifs_fallocate() and then guarantee
a stable ->i_size across fallocate(2).
Also call netfs_wait_for_outstanding_io() after truncating pagecache
to avoid any potential races with writeback.
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Fixes: 210627b0aca9 ("smb: client: fix missing timestamp updates with O_TRUNC")
Cc: Frank Sorenson <sorenson@redhat.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Don't reuse open handle when changing timestamps to prevent the server
from disabling automatic timestamp updates as per MS-FSA 2.1.4.17.
---8<---
import os
import time
filename = '/mnt/foo'
def print_stat(prefix):
    st = os.stat(filename)
    print(prefix, ': ', time.ctime(st.st_atime), time.ctime(st.st_ctime))
fd = os.open(filename, os.O_CREAT|os.O_TRUNC|os.O_WRONLY, 0o644)
print_stat('old')
os.utime(fd, None)
time.sleep(2)
os.write(fd, b'foo')
os.close(fd)
time.sleep(2)
print_stat('new')
---8<---
Before patch:
$ mount.cifs //srv/share /mnt -o ...
$ python3 run.py
old : Fri Oct 3 14:01:21 2025 Fri Oct 3 14:01:21 2025
new : Fri Oct 3 14:01:21 2025 Fri Oct 3 14:01:21 2025
After patch:
$ mount.cifs //srv/share /mnt -o ...
$ python3 run.py
old : Fri Oct 3 17:03:34 2025 Fri Oct 3 17:03:34 2025
new : Fri Oct 3 17:03:36 2025 Fri Oct 3 17:03:36 2025
Fixes: b6f2a0f89d ("cifs: for compound requests, use open handle if possible")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: Frank Sorenson <sorenson@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Mask off ATTR_MTIME|ATTR_CTIME bits on ATTR_SIZE (e.g. ftruncate(2))
to prevent the client from sending set info calls and then disabling
automatic timestamp updates on server side as per MS-FSA 2.1.4.17.
---8<---
import os
import time
filename = '/mnt/foo'
def print_stat(prefix):
    st = os.stat(filename)
    print(prefix, ': ', time.ctime(st.st_atime), time.ctime(st.st_ctime))
fd = os.open(filename, os.O_CREAT|os.O_TRUNC|os.O_WRONLY, 0o644)
print_stat('old')
os.ftruncate(fd, 10)
time.sleep(2)
os.write(fd, b'foo')
os.close(fd)
time.sleep(2)
print_stat('new')
---8<---
Before patch:
$ mount.cifs //srv/share /mnt -o ...
$ python3 run.py
old : Fri Oct 3 13:47:03 2025 Fri Oct 3 13:47:03 2025
new : Fri Oct 3 13:47:00 2025 Fri Oct 3 13:47:03 2025
After patch:
$ mount.cifs //srv/share /mnt -o ...
$ python3 run.py
old : Fri Oct 3 13:48:39 2025 Fri Oct 3 13:48:39 2025
new : Fri Oct 3 13:48:41 2025 Fri Oct 3 13:48:41 2025
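The masking itself is a plain bit operation. A minimal sketch, assuming
illustrative bit values for the ATTR_* flags (the kernel headers hold the
authoritative definitions) and a hypothetical helper name:

```python
# Illustrative bit values; the kernel defines its own ATTR_* constants.
ATTR_SIZE = 1 << 3
ATTR_MTIME = 1 << 5
ATTR_CTIME = 1 << 6


def mask_timestamps_on_truncate(ia_valid):
    """Drop the mtime/ctime bits when the size is being changed."""
    if ia_valid & ATTR_SIZE:
        ia_valid &= ~(ATTR_MTIME | ATTR_CTIME)
    return ia_valid
```

Requests that do not change the size pass through untouched, so explicit
timestamp updates still reach the server.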
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: Frank Sorenson <sorenson@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Don't call ->set_file_info() on open handle to prevent the server from
stopping [cm]time updates automatically as per MS-FSA 2.1.4.17.
Fix this by checking for ATTR_OPEN bit earlier in cifs_setattr() to
prevent ->set_file_info() from being called when opening a file with
O_TRUNC. Do the truncation in ->open() instead.
This also saves two roundtrips when opening a file with O_TRUNC and
there are currently no open handles to be reused.
Before patch:
$ mount.cifs //srv/share /mnt -o ...
$ cd /mnt
$ exec 3>foo; stat -c 'old: %z %y' foo; sleep 2; echo test >&3; exec 3>&-; sleep 2; stat -c 'new: %z %y' foo
old: 2025-10-03 13:26:23.151030500 -0300 2025-10-03 13:26:23.151030500 -0300
new: 2025-10-03 13:26:23.151030500 -0300 2025-10-03 13:26:23.151030500 -0300
After patch:
$ mount.cifs //srv/share /mnt -o ...
$ cd /mnt
$ exec 3>foo; stat -c 'old: %z %y' foo; sleep 2; echo test >&3; exec 3>&-; sleep 2; stat -c 'new: %z %y' foo
old: 2025-10-03 13:28:13.911933800 -0300 2025-10-03 13:28:13.911933800 -0300
new: 2025-10-03 13:28:26.647492700 -0300 2025-10-03 13:28:26.647492700 -0300
Reported-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
The return value of copy_to_iter() function will never be negative,
it is the number of bytes copied, or zero if nothing was copied.
Update the check to treat 0 as an error, and return -1 in that case.
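The shape of the corrected check, as a toy model rather than the actual
client code. Here copy_to_iter() is a stand-in that only mimics the "returns
a byte count, never negative" contract, and read_into() is a hypothetical
caller:

```python
def copy_to_iter(data, dest, room):
    """Toy stand-in: copy up to `room` bytes into dest, return the count."""
    n = min(len(data), room)
    dest.extend(data[:n])
    return n


def read_into(data, dest, room):
    """Return the byte count, or -1 if nothing could be copied."""
    copied = copy_to_iter(data, dest, room)
    if copied == 0:          # a `copied < 0` check here could never be true
        return -1
    return copied
```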
Fixes: d08089f649 ("cifs: Change the I/O paths to use an iterator rather than a page list")
Acked-by: Tom Talpey <tom@talpey.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Fushuai Wang <wangfushuai@baidu.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
smb2_copychunk_range() used to send a single SRV_COPYCHUNK per
SRV_COPYCHUNK_COPY IOCTL.
Implement variable Chunks[] array in struct copychunk_ioctl and fill it
with struct copychunk (MS-SMB2 2.2.31.1.1), bounded by server-advertised
limits.
This reduces the number of IOCTL requests for large copies.
While we are at it, rename a couple variables to follow the terminology
used in the specification.
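How filling several chunks per IOCTL reduces the request count can be
sketched as follows. This is an illustration only: plan_copychunks() and its
three limit parameters are hypothetical names standing in for the
server-advertised limits, not the identifiers used in the patch.

```python
def plan_copychunks(src_off, dst_off, length, max_chunks, max_chunk_size, max_total):
    """Split a copy into IOCTLs, each a list of (src, dst, len) chunks.

    max_chunks, max_chunk_size and max_total model per-request limits;
    the names are assumptions made for this sketch.
    """
    if min(max_chunks, max_chunk_size, max_total) < 1:
        raise ValueError("limits must be positive")
    ioctls = []
    done = 0
    while done < length:
        chunks, total = [], 0
        # Keep adding chunks until one of the per-request limits is reached.
        while done < length and len(chunks) < max_chunks and total < max_total:
            n = min(max_chunk_size, max_total - total, length - done)
            chunks.append((src_off + done, dst_off + done, n))
            total += n
            done += n
        ioctls.append(chunks)
    return ioctls
```

With max_chunks set to 1 the plan degenerates to one chunk per request, which
corresponds to the old behaviour described above.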
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Statements from an if branch and the end of this function implementation
were equivalent.
Thus delete duplicate source code.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Steve French <stfrench@microsoft.com>
Convert the Devicetree binding documentation for hisilicon,hix5hd2-i2c
from plain text to DT binding schema.
Signed-off-by: Kael D'Alcamo <dev@kael-k.io>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Add "INTC10D1" ACPI device-id for MTL-CVF devices, like the Dell Latitude
7450.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2368506
Signed-off-by: Hans de Goede <hansg@kernel.org>
Acked-by: Israel Cepeda <israel.a.cepeda.lopez@intel.com>
Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Add missing configuration for loopback mode in airoha_set_gdm2_loopback
routine.
Fixes: 9cd451d414 ("net: airoha: Add loopback support for GDM2")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20251008-airoha-loopback-mode-fix-v2-1-045694fe7f60@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski says:
====================
eth: fbnic: fix XDP_TX and XDP vs qstats
Fix XDP_TX hangs and adjust the XDP statistics to match the definition
of qstats. The three problems are somewhat distinct.
XDP_TX hangs is a simple coding bug (patch 1).
The accounting of XDP packets is all over the place. Fix it to obey
qstat rules (packets seen by XDP always counted as Rx packets).
Patch 2 fixes the basic accounting, patch 3 touches up saving
the stats when rings are freed.
Patch 6 corrects reporting of alloc_fail stats which prevented
the pp_alloc_fail test from passing.
Patches 4, 5, 7, 8, 9 add or fix related test cases.
v2:
- [patch 2] remove now unnecessary byte adjustment
- [patch 8] use seen_fails more
v1: https://lore.kernel.org/20251003233025.1157158-1-kuba@kernel.org
Testing on fbnic below:
$ ./tools/testing/selftests/drivers/net/hw/pp_alloc_fail.py
TAP version 13
1..1
fbnic-err: bad MMIO read address 0x80074
fbnic-err: bad MMIO read address 0x80074
# Seen: pkts:20605 fails:40 (pass thrs:12)
# ethtool -G change retval: success
ok 1 pp_alloc_fail.test_pp_alloc
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
$ ./tools/testing/selftests/drivers/net/xdp.py
TAP version 13
1..13
ok 1 xdp.test_xdp_native_pass_sb
ok 2 xdp.test_xdp_native_pass_mb
ok 3 xdp.test_xdp_native_drop_sb
ok 4 xdp.test_xdp_native_drop_mb
ok 5 xdp.test_xdp_native_tx_sb
ok 6 xdp.test_xdp_native_tx_mb
# Failed run: pkt_sz 2048, offset 1. Last successful run: pkt_sz 1024, offset 256. Reason: Adjustment failed
ok 7 xdp.test_xdp_native_adjst_tail_grow_data
ok 8 xdp.test_xdp_native_adjst_tail_shrnk_data
# Failed run: pkt_sz 512, offset -256. Last successful run: pkt_sz 512, offset -128. Reason: Adjustment failed
ok 9 xdp.test_xdp_native_adjst_head_grow_data
# Failed run: pkt_sz (2048) > HDS threshold (1536) and offset 64 > 48
ok 10 xdp.test_xdp_native_adjst_head_shrnk_data
ok 11 xdp.test_xdp_native_qstats_pass
ok 12 xdp.test_xdp_native_qstats_drop
ok 13 xdp.test_xdp_native_qstats_tx
# Totals: pass:13 fail:0 xfail:0 xpass:0 skip:0 error:0
====================
Link: https://patch.msgid.link/20251007232653.2099376-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add kernel config for error injection as needed by pp_alloc_fail.py
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 9da271f825 ("selftests: drv-net-hw: add test for memory allocation failures with page pool")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20251007232653.2099376-10-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Lower the expected level of traffic in the pp_alloc_fail test
and calculate failure counter thresholds based on the traffic
rather than using a fixed constant.
We only have "QEMU HW" in NIPA right now, and the test (due to
debug dependencies) only works on debug kernels in the first place.
We need some place for it to pass otherwise it seems to be bit
rotting. So lower the traffic threshold so that it passes on QEMU
and with a debug kernel...
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20251007232653.2099376-9-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Fix linter warnings, it's a bit hard to check for new ones otherwise.
W0311: Bad indentation. Found 16 spaces, expected 12 (bad-indentation)
C0114: Missing module docstring (missing-module-docstring)
W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
C0116: Missing function or method docstring (missing-function-docstring)
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20251007232653.2099376-8-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Rx processing under normal circumstances has 3 rings - 2 buffer
rings (heads, payloads) and a completion ring. All the rings
have a struct fbnic_ring. Make sure we expose the alloc_failed
counter from the buffer rings. Previously only the alloc_failed
from the completion ring was reported, even though all ring types
may increment this counter (buffer rings in __fbnic_fill_bdq()).
This makes the pp_alloc_fail.py test pass, it expects the qstat
to be incrementing as page pool injections happen.
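The reporting change amounts to summing the counter over all three rings
instead of reading it from one. A minimal sketch with a toy Ring type and a
hypothetical helper, neither of which is the driver's own structure or
function:

```python
from dataclasses import dataclass


@dataclass
class Ring:
    """Toy ring with just the counter of interest (not the driver's struct)."""
    alloc_failed: int = 0


def rx_alloc_failed(head_ring, payload_ring, completion_ring):
    """Report allocation failures from all three rings of an Rx queue."""
    return sum(r.alloc_failed for r in (head_ring, payload_ring, completion_ring))
```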
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 67dc4eb5fc ("eth: fbnic: report software Rx queue stats")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20251007232653.2099376-7-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Send a non-trivial number of packets and make sure that they
are counted correctly in qstats. Per the qstats specification,
XDP is the first layer of the stack so we should see Rx and Tx
counters go up for packets which went through XDP.
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20251007232653.2099376-6-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Test uses "netnl" for the ethtool family which is quite confusing
(one would expect netdev family would use this name).
No functional changes.
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20251007232653.2099376-5-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When rings are freed, stats get added to the device level stat
structs. Save the stats from the XDP_TX ring just as Tx stats.
Previously they would be saved to Rx and Tx stats. So we'd not
see XDP_TX packets as Rx during runtime but after a down/up cycle
the packets would appear in stats.
Correct the helper used by ethtool code which does a runtime
config switch.
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 5213ff0863 ("eth: fbnic: Collect packet statistics for XDP")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20251007232653.2099376-4-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Make XDP-handled packets appear in the Rx stats. The driver has been
counting XDP_TX packets on the Tx ring, but there wasn't much accounting
on the Rx side (the Rx bytes appear to be incremented on XDP_TX but
XDP_DROP / XDP_ABORT are only counted as Rx drops).
Counting XDP_TX packets (not just bytes) in Rx stats looks like
a simple bug of omission.
The XDP_DROP handling appears to be intentional. Whether XDP_DROP
packets should be counted in interface-level Rx stats is a bit
unclear historically. When we were defining qstats, however,
we clarified based on operational experience that in this context:
  name: rx-packets
  doc: |
    Number of wire packets successfully received and passed to the stack.
    For drivers supporting XDP, XDP is considered the first layer
    of the stack, so packets consumed by XDP are still counted here.
fbnic does not obey this requirement. Since XDP support was added
in the current release cycle, instead of splitting interface and qstat
handling, make them both follow the qstat definition.
Another small tweak here is that we count bytes as received on the wire
rather than post-XDP bytes (xdp_get_buff_len() vs skb->len).
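The counting rule quoted from the qstat definition can be illustrated with a
minimal user-space sketch. All names here (sketch_rx_stats, sketch_verdict,
sketch_rx_account) are hypothetical and are not taken from the fbnic driver.
The sketch only shows the intent: every wire packet bumps the Rx counters
before the XDP verdict is considered, and bytes are counted as received on
the wire.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical verdicts, loosely mirroring XDP_PASS / XDP_TX / XDP_DROP. */
enum sketch_verdict { SKETCH_PASS, SKETCH_TX, SKETCH_DROP };

/* Hypothetical per-queue counters; not the fbnic structures. */
struct sketch_rx_stats {
	uint64_t rx_packets;
	uint64_t rx_bytes;
	uint64_t tx_packets;	/* XDP_TX frames are also counted on the Tx side */
};

static void sketch_rx_account(struct sketch_rx_stats *s,
			      enum sketch_verdict v, uint32_t wire_len)
{
	assert(s != NULL);

	/* XDP is the first layer of the stack: count before the verdict. */
	s->rx_packets++;
	s->rx_bytes += wire_len;	/* length as received on the wire */

	if (v == SKETCH_TX)
		s->tx_packets++;
}
```

With this rule a dropped or retransmitted frame still appears in rx-packets,
which is what the qstat definition asks for.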
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 5213ff0863 ("eth: fbnic: Collect packet statistics for XDP")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20251007232653.2099376-3-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
XDP_TX typically uses no offloads. To optimize XDP we added a "default
descriptor" feature to the chip, which allows us to send XDP frames with
just the buffer descriptors (DMA address + length). All the metadata
descriptors are derived from the queue config.
The commit under Fixes missed setting up the defaults when transplanting
the code from the prototype driver. Importantly, after reset the "request
completion" bit is not set. Packets still get sent but there's no
completion, so the ring is not cleaned up. We can send one ring's worth
of packets and then will start dropping all frames that got the XDP_TX
action from the XDP prog.
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 168deb7b31 ("eth: fbnic: Add support for XDP_TX action")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20251007232653.2099376-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Merge tag 'nf-25-10-08' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Florian Westphal says:
====================
netfilter: updates for net
The following patchset contains Netfilter fixes for *net*:
1) Fix crash (call recursion) when nftables synproxy extension is used
in an object map. When this feature was added in v5.4 the required
hook call validation was forgotten.
Fix from Fernando Fernandez Mancera.
2) bridge br_vlan_fill_forward_path_pvid uses incorrect
rcu_dereference_protected(); we only have rcu read lock but not
RTNL. Fix from Eric Woudstra.
Last two patches address flakes in two existing selftests.
netfilter pull request nf-25-10-08
* tag 'nf-25-10-08' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
selftests: netfilter: query conntrack state to check for port clash resolution
selftests: netfilter: nft_fib.sh: fix spurious test failures
bridge: br_vlan_fill_forward_path_pvid: use br_vlan_group_rcu()
netfilter: nft_objref: validate objref and objrefmap expressions
====================
Link: https://patch.msgid.link/20251008125942.25056-1-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Move the ssize check to the start in essiv_aead_crypt so that
it's also checked for decryption and in-place encryption.
Reported-by: Muhammad Alifa Ramdhan <ramdhan@starlabs.sg>
Fixes: be1eb7f78a ("crypto: essiv - create wrapper template for ESSIV generation")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
- Extended 'perf annotate' with DWARF type information (--code-with-type)
integration in the TUI, including a 'T' hotkey to toggle it.
- Enhanced 'perf bench mem' with new mmap() workloads and control over
page/chunk sizes.
- Fix 'perf stat' error handling to correctly display unsupported events.
- Improved support for Clang cross-compilation.
- Refactored LLVM and Capstone disasm for modularity.
- Introduced the :X modifier to exclude an event from automatic regrouping.
- Adjusted KVM sampling defaults to use the "cycles" event to prevent failures.
- Added comprehensive support for decoding PowerPC Dispatch Trace Log (DTL).
- Updated Arm SPE tracing logic for better analysis of memory and snoop
details.
- Synchronized Intel PMU events and metrics with TMA 5.1 across multiple
processor generations.
- Converted dependencies like libperl and libtracefs to be opt-in.
- Handle more Rust symbols in kallsyms ('N', debugging).
- Improve the python binding to allow for python based tools to use more
of the libraries, add a 'ilist' utility to test those new bindings.
- Various 'perf test' fixes.
- Kan Liang no longer a perf tools reviewer.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Merge tag 'perf-tools-for-v6.18-1-2025-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf tools updates from Arnaldo Carvalho de Melo:
- Extended 'perf annotate' with DWARF type information
(--code-with-type) integration in the TUI, including a 'T'
hotkey to toggle it
- Enhanced 'perf bench mem' with new mmap() workloads and control
over page/chunk sizes
- Fix 'perf stat' error handling to correctly display unsupported
events
- Improved support for Clang cross-compilation
- Refactored LLVM and Capstone disasm for modularity
- Introduced the :X modifier to exclude an event from automatic
regrouping
- Adjusted KVM sampling defaults to use the "cycles" event to prevent
failures
- Added comprehensive support for decoding PowerPC Dispatch Trace Log
(DTL)
- Updated Arm SPE tracing logic for better analysis of memory and snoop
details
- Synchronized Intel PMU events and metrics with TMA 5.1 across
multiple processor generations
- Converted dependencies like libperl and libtracefs to be opt-in
- Handle more Rust symbols in kallsyms ('N', debugging)
- Improve the python binding to allow for python based tools to use
more of the libraries, add a 'ilist' utility to test those new
bindings
- Various 'perf test' fixes
- Kan Liang no longer a perf tools reviewer
* tag 'perf-tools-for-v6.18-1-2025-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (192 commits)
perf tools: Fix arm64 libjvmti build by generating unistd_64.h
perf tests: Don't retest sections in "Object code reading"
perf docs: Document building with Clang
perf build: Support build with clang
perf test coresight: Dismiss clang warning for unroll loop thread
perf test coresight: Dismiss clang warning for thread loop
perf test coresight: Dismiss clang warning for memcpy thread
perf build: Disable thread safety analysis for perl header
perf build: Correct CROSS_ARCH for clang
perf python: split Clang options when invoking Popen
tools build: Align warning options with perf
perf disasm: Remove unused evsel from 'struct annotate_args'
perf srcline: Fallback between addr2line implementations
perf disasm: Make ins__scnprintf() and ins__is_nop() static
perf dso: Clean up read_symbol() error handling
perf dso: Support BPF programs in dso__read_symbol()
perf dso: Move read_symbol() from llvm/capstone to dso
perf llvm: Reduce LLVM initialization
perf check: Add libLLVM feature
perf parse-events: Fix parsing of >30kb event strings
...
It was reported that using __copy_from_user_inatomic() can actually
schedule, which is bad when preemption is disabled. There is logic to
check that in_atomic() is set, but this is a nop when the kernel is
configured with PREEMPT_NONE. The copy can page fault, so the code
could schedule with preemption disabled.
Link: https://lore.kernel.org/all/20250819105152.2766363-1-luogengkun@huaweicloud.com/
The solution was to change the __copy_from_user_inatomic() to
copy_from_user_nofault(). But then it was reported that this caused a
regression in Android. There are several applications writing into
trace_marker() in Android, but now instead of showing the expected data,
it is showing:
tracing_mark_write: <faulted>
After reverting the conversion to copy_from_user_nofault(), Android was
able to get the data again.
Writing to trace_marker is a way to efficiently and quickly enter data
into the Linux tracing buffer. It takes no locks and was designed to be as
non-intrusive as possible. This means it cannot allocate memory, and must
use pre-allocated data.
A method that is actively being worked on to have faultable system call
tracepoints read user space data is to allocate per CPU buffers, and use
them in the callback. The method uses a technique similar to seqcount.
That is something like this:
    preempt_disable();
    cpu = smp_processor_id();
    buffer = per_cpu_ptr(&pre_allocated_cpu_buffers, cpu);
    do {
        cnt = nr_context_switches_cpu(cpu);
        migrate_disable();
        preempt_enable();
        /* Preemptible here, so the copy may fault and schedule */
        ret = copy_from_user(buffer, ptr, size);
        preempt_disable();
        migrate_enable();
        /* Retry if another task was scheduled on this CPU meanwhile */
    } while (!ret && cnt != nr_context_switches_cpu(cpu));
    if (!ret)
        ring_buffer_write(buffer);
    preempt_enable();
It's a little more involved than that, but the above is the basic logic.
The idea is to acquire the current CPU buffer, disable migration, and then
enable preemption. At this moment, it can safely use copy_from_user().
After reading the data from user space, it disables preemption again. It
then checks to see if there was any new scheduling on this CPU. If there
was, it must assume that the buffer was corrupted by another task. If
there wasn't, then the buffer is still valid as only tasks in preemptable
context can write to this buffer and only those that are running on the
CPU.
By using this method, where trace_marker open allocates the per CPU
buffers, trace_marker writes can access user space and even fault it in,
without having to allocate or take any locks of its own.
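The retry logic can also be tried out in user space. The following is only a
sketch under stated assumptions: sketch_buffer stands in for the per CPU
buffer, sketch_switch_count for the per CPU context switch counter, and
sketch_copy() for copy_from_user(); sketch_pending_preemptions is a
hypothetical knob that makes the next copies look as if another task ran and
scribbled over the buffer. None of these names exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BUF_SIZE 64

/* Stand-ins for the per CPU buffer and the per CPU context switch count. */
static char sketch_buffer[SKETCH_BUF_SIZE];
static unsigned long sketch_switch_count;

/* Hypothetical knob: how many upcoming copies get "preempted". */
static int sketch_pending_preemptions;

/* Simulated copy_from_user(): returns 0 on success. */
static int sketch_copy(char *dst, const char *src, size_t size)
{
	memcpy(dst, src, size);
	if (sketch_pending_preemptions > 0) {
		/* Another task ran on this CPU and overwrote the buffer. */
		sketch_pending_preemptions--;
		sketch_switch_count++;
		memset(dst, 'X', size);
	}
	return 0;
}

/* Returns the number of attempts needed, or -1 if the copy failed. */
static int sketch_read_with_retry(const char *src, size_t size)
{
	unsigned long cnt;
	int ret, attempts = 0;

	assert(size <= SKETCH_BUF_SIZE);
	do {
		cnt = sketch_switch_count;
		ret = sketch_copy(sketch_buffer, src, size);
		attempts++;
		/* A changed count means the buffer may be corrupted: retry. */
	} while (!ret && cnt != sketch_switch_count);

	return ret ? -1 : attempts;
}
```

The loop only trusts the buffer once a copy completes with the counter
unchanged, which mirrors the seqcount-like check described above.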
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Luo Gengkun <luogengkun@huaweicloud.com>
Cc: Wattson CI <wattson-external@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20251008124510.6dba541a@gandalf.local.home
Fixes: 3d62ab32df ("tracing: Fix tracing_marker may trigger page fault during preempt_disable")
Reported-by: Runping Lai <runpinglai@google.com>
Tested-by: Runping Lai <runpinglai@google.com>
Closes: https://lore.kernel.org/linux-trace-kernel/20251007003417.3470979-2-runpinglai@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
This patch adds information about the Ceph bug tracking system.
[ idryomov: add the same for RBD, don't mention include/linux/ceph/
again ]
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The refactoring in 4292a1e45f ("PCI: Refactor distributing available
memory to use loops") switched pci_bus_distribute_available_resources() to
operate on an array of bridge windows. That accidentally looked up bus
resources via pci_bus_resource_n() and then passed those pointers to helper
routines that expect the resource to belong to the device. As soon as we
execute that code, pci_resource_num() warned because the resource wasn't in
the bridge's resource array.
This happens on my AMD Strix Halo machine with a Thunderbolt device; the
error message is shown below:
  WARNING: CPU: 6 PID: 272 at drivers/pci/pci.h:471 pci_bus_distribute_available_resources+0x6ad/0x6d0
  CPU: 6 UID: 0 PID: 272 Comm: irq/33-pciehp Not tainted 6.17.0+ #1 PREEMPT(voluntary)
  Hardware name: PELADN YO Series/YO1, BIOS 1.04 05/15/2025
  RIP: 0010:pci_bus_distribute_available_resources+0x6ad/0x6d0
  Call Trace:
   pci_bus_distribute_available_resources+0x590/0x6d0
   pci_bridge_distribute_available_resources+0x62/0xb0
   pci_assign_unassigned_bridge_resources+0x65/0x1b0
   pciehp_configure_device+0x92/0x160
   pciehp_handle_presence_or_link_change+0x1b5/0x350
   pciehp_ist+0x147/0x1c0
Fix the regression by always fetching the resource directly from the bridge
with pci_resource_n(bridge, PCI_BRIDGE_RESOURCES + i). This restores the
original behaviour while keeping the refactored structure. Then we can
successfully assign resources to the Thunderbolt device.
Fixes: 4292a1e45f ("PCI: Refactor distributing available memory to use loops")
Reported-by: Kenneth R. Crudup <kenny@panix.com>
Closes: https://lore.kernel.org/r/dd551b81-9e81-480b-aab3-7cf8b8bbc1d0@panix.com
Signed-off-by: Yangyu Chen <cyy@cyyself.name>
[bhelgaas: trim timestamps, etc from commit log]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-By: Kenneth R. Crudup <kenny@panix.com>
Link: https://lore.kernel.org/r/F833CC81-7C60-48FC-A31C-B9999DCC6FA2@icloud.com
Link: https://patch.msgid.link/tencent_8C54420E1B0FF8D804C1B4651DF970716309@qq.com
The mds auth caps check should also validate the
fsname along with the associated caps. Not doing
so would result in applying the mds auth caps of
one fs on to the other fs in a multifs ceph cluster.
The bug causes multiple issues with user authentication;
the following is one such example.
Steps to Reproduce (on vstart cluster):
1. Create two file systems in a cluster, say 'fsname1' and 'fsname2'
2. Authorize read only permission to the user 'client.usr' on fs 'fsname1'
$ceph fs authorize fsname1 client.usr / r
3. Authorize read and write permission to the same user 'client.usr' on fs 'fsname2'
$ceph fs authorize fsname2 client.usr / rw
4. Update the keyring
$ceph auth get client.usr >> ./keyring
With the above permissions for the user 'client.usr', the following is the
expectation.
a. The 'client.usr' should be able to only read the contents
and not allowed to create or delete files on file system 'fsname1'.
b. The 'client.usr' should be able to read/write on file system 'fsname2'.
But, with this bug, the 'client.usr' is allowed to read/write on file
system 'fsname1'. See below.
5. Mount the file system 'fsname1' with the user 'client.usr'
$sudo bin/mount.ceph usr@.fsname1=/ /kmnt_fsname1_usr/
6. Try creating a file on file system 'fsname1' with user 'client.usr'. This
should fail but passes with this bug.
$touch /kmnt_fsname1_usr/file1
7. Mount the file system 'fsname1' with the user 'client.admin' and create a
file.
$sudo bin/mount.ceph admin@.fsname1=/ /kmnt_fsname1_admin
$echo "data" > /kmnt_fsname1_admin/admin_file1
8. Try removing an existing file on file system 'fsname1' with the user
'client.usr'. This shouldn't succeed but succeeds with the bug.
$rm -f /kmnt_fsname1_usr/admin_file1
For more information, please take a look at the corresponding mds/fuse patch
and tests added by looking into the tracker mentioned below.
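The idea of the check can be sketched in user space as follows. This is only
an illustration under assumptions: struct sketch_cap and sketch_may_write()
are hypothetical and much simpler than the real mds auth caps handling. The
point is that a cap is only considered when its fsname matches the mounted
file system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified cap entry; not the kernel's data structure. */
struct sketch_cap {
	const char *fsname;	/* file system the cap was granted on */
	bool readable;
	bool writeable;
};

/* Return true only if a cap for this file system grants write access. */
static bool sketch_may_write(const struct sketch_cap *caps, size_t n,
			     const char *mounted_fsname)
{
	assert(caps != NULL && mounted_fsname != NULL);

	for (size_t i = 0; i < n; i++) {
		/* fsname mismatch: this cap belongs to another file system */
		if (strcmp(caps[i].fsname, mounted_fsname) != 0)
			continue;
		if (caps[i].writeable)
			return true;
	}
	return false;
}
```

With the caps from the reproduction steps ('fsname1' read only, 'fsname2'
read and write), a write on 'fsname1' is refused because the 'fsname2' cap
is skipped.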
v2: Fix a possible null dereference in doutc
v3: Don't store fsname from mdsmap, validate against
ceph_mount_options's fsname and use it
v4: Code refactor, better warning message and
fix possible compiler warning
[ Slava.Dubeyko: "fsname check failed" -> "fsname mismatch" ]
Link: https://tracker.ceph.com/issues/72167
Signed-off-by: Kotresh HR <khiremat@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The Coverity Scan service has reported a potential issue
in ceph_alloc_readdir_reply_buffer() [1]. If order could
be -1, Coverity expects an issue in this logic:
    num_entries = (PAGE_SIZE << order) / size;
Technically speaking, this check [2] should prevent the
order variable from being negative at that point:
    if (!rinfo->dir_entries)
        return -ENOMEM;
However, the allocation logic requires some cleanup.
This patch makes sure that the calculated byte count
never exceeds ULONG_MAX before the get_order()
calculation. It also adds a check for a negative order
value to guarantee that the second half of the function
never operates on a negative order, even if something
goes wrong or is changed in the first half of
the function's logic.
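The two guards can be illustrated with a simplified user-space sketch. It is
not the kernel code: sketch_get_order() is a stand-in for get_order() with an
assumed 4 KiB page size, and sketch_entries() is a hypothetical helper that
only keeps the arithmetic (no allocation, no retry with a smaller order).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Smallest order with (SKETCH_PAGE_SIZE << order) >= bytes, or -1. */
static int sketch_get_order(unsigned long bytes)
{
	const int max_order =
		(int)(sizeof(unsigned long) * CHAR_BIT) - SKETCH_PAGE_SHIFT - 1;

	for (int order = 0; order <= max_order; order++)
		if ((SKETCH_PAGE_SIZE << order) >= bytes)
			return order;
	return -1;	/* too large to represent */
}

/* Stores how many entries fit in *out; returns 0 on success, -1 on error. */
static int sketch_entries(unsigned long num_entries, unsigned long size,
			  unsigned long *out)
{
	int order;

	assert(out != NULL);
	if (size == 0 || num_entries == 0)
		return -1;
	/* Guard 1: the byte count must not exceed ULONG_MAX. */
	if (num_entries > ULONG_MAX / size)
		return -1;

	order = sketch_get_order(num_entries * size);
	/* Guard 2: never shift by a negative order. */
	if (order < 0)
		return -1;

	*out = (SKETCH_PAGE_SIZE << order) / size;
	return 0;
}
```

For example, 10 entries of 100 bytes fit in one page (order 0), which has
room for 40 such entries.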
v2
Alex Markuze suggested adding the unlikely() macro
to the introduced condition checks.
[1] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1198252
[2] https://elixir.bootlin.com/linux/v6.17-rc3/source/fs/ceph/mds_client.c#L2553
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The Coverity Scan service has detected a potential dereference of
an explicit NULL value in ceph_fill_trace() [1].
The variable in is declared at the beginning of
ceph_fill_trace() [2]:
    struct inode *in = NULL;
However, the variable is only assigned under
a condition [3]:
    if (rinfo->head->is_target) {
        <skipped>
        in = req->r_target_inode;
        <skipped>
    }
If rinfo->head->is_target == FALSE, the
in variable remains NULL, and a NULL dereference
could happen later in the ceph_fill_trace() logic [4,5]:
    else if ((req->r_op == CEPH_MDS_OP_LOOKUPSNAP ||
              req->r_op == CEPH_MDS_OP_MKSNAP) &&
             test_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags) &&
             !test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags)) {
        <skipped>
        ihold(in);
        err = splice_dentry(&req->r_dentry, in);
        if (err < 0)
            goto done;
    }
This patch adds a check of the in variable for NULL
and returns the -EINVAL error code if it is NULL.
v2
Alex Markuze suggested adding the unlikely macro
to the checking condition.
[1] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1141197
[2] https://elixir.bootlin.com/linux/v6.17-rc3/source/fs/ceph/inode.c#L1522
[3] https://elixir.bootlin.com/linux/v6.17-rc3/source/fs/ceph/inode.c#L1629
[4] https://elixir.bootlin.com/linux/v6.17-rc3/source/fs/ceph/inode.c#L1745
[5] https://elixir.bootlin.com/linux/v6.17-rc3/source/fs/ceph/inode.c#L1777
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This moves the list_empty() checks from the two callers (v1 and v2)
into the base messenger.c library. Now the v1/v2 specializations do
not need to know about con->out_queue; that implementation detail is
now hidden behind the ceph_con_get_out_msg() function.
[ idryomov: instead of changing prepare_write_message() to return
a bool, move ceph_con_get_out_msg() call out to arrive to the same
pattern as in messenger_v2.c ]
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This pointer is in a register anyway, so let's use that instead of
reloading from memory everywhere.
[ idryomov: formatting ]
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The caller in messenger_v1.c loads it anyway, so let's keep the
pointer in the register instead of reloading it from memory. This
eliminates a tiny bit of unnecessary overhead.
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The Coverity Scan service has detected potential
race conditions in ceph_block_o_direct(), ceph_start_io_read(),
ceph_block_buffered(), and ceph_start_io_direct() [1 - 4].
CIDs 1590942, 1590665, 1589664, and 1590377 contain the explanation:
"The value of the shared data will be determined by
the interleaving of thread execution. Thread shared data is accessed
without holding an appropriate lock, possibly causing
a race condition (CWE-366)".
This patch reworks how the CEPH_I_ODIRECT flag is accessed and
modified by adding smp_mb__before_atomic() before reading
the status of the CEPH_I_ODIRECT flag and
smp_mb__after_atomic() after setting or clearing this flag.
It also reworks how ci->i_ceph_lock is used
in the ceph_block_o_direct(), ceph_start_io_read(),
ceph_block_buffered(), and ceph_start_io_direct() methods.
[1] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1590942
[2] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1590665
[3] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1589664
[4] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1590377
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
wake_up_bit() is called in ceph_async_unlink_cb(),
wake_async_create_waiters(), and ceph_finish_async_create().
It makes sense to switch to the clear_bit() function, because
it makes the code much cleaner and easier to understand.
A more important change is adding the smp_mb__after_atomic()
memory barrier after the bit modification and before the
wake_up_bit() call. It can prevent a potential race condition
when other threads access the modified bit. Luckily,
clear_and_wake_up_bit() already implements the required
pattern:
    static inline void clear_and_wake_up_bit(int bit, unsigned long *word)
    {
        clear_bit_unlock(bit, word);
        /* See wake_up_bit() for which memory barrier you need to use. */
        smp_mb__after_atomic();
        wake_up_bit(word, bit);
    }
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The Coverity Scan service has detected a potential
race condition in ceph_ioctl_lazyio() [1].
CID 1591046 contains the explanation: "Check of thread-shared
field evades lock acquisition (LOCK_EVASION). Thread1 sets
fmode to a new value. Now the two threads have an inconsistent
view of fmode and updates to fields correlated with fmode
may be lost. The data guarded by this critical section may
be read while in an inconsistent state or modified by multiple
racing threads. In ceph_ioctl_lazyio: Checking the value of
a thread-shared field outside of a locked region to determine
if a locked operation involving that thread shared field
has completed. (CWE-543)".
The patch places the fi->fmode field access under ci->i_ceph_lock
protection. It also introduces the is_file_already_lazy
variable, which is set under the lock and checked later
outside the critical section.
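The pattern can be sketched in user space with a pthread mutex standing in
for ci->i_ceph_lock. Everything here is hypothetical (struct sketch_file,
SKETCH_FMODE_LAZY, sketch_set_lazy()); it only shows the shape of the fix:
the shared field is tested and updated inside one critical section, and the
decision is carried out of it in a local variable.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_FMODE_LAZY 0x1	/* hypothetical flag bit */

struct sketch_file {
	pthread_mutex_t lock;	/* stands in for ci->i_ceph_lock */
	unsigned int fmode;	/* stands in for fi->fmode */
};

/* Returns true if this call switched the file to lazy mode. */
static bool sketch_set_lazy(struct sketch_file *f)
{
	bool is_file_already_lazy;

	assert(f != NULL);

	pthread_mutex_lock(&f->lock);
	/* Check and update happen under the same lock. */
	is_file_already_lazy = (f->fmode & SKETCH_FMODE_LAZY) != 0;
	if (!is_file_already_lazy)
		f->fmode |= SKETCH_FMODE_LAZY;
	pthread_mutex_unlock(&f->lock);

	/* Callers act on the snapshot, not on a fresh unlocked read. */
	return !is_file_already_lazy;
}
```

Only the first caller sees true, so any follow-up work outside the lock runs
once even if several threads race on the same file.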
[1] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1591046
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The Coverity Scan service has detected an overflowed constant
issue in ceph_do_objects_copy() [1]. The CID 1624308
defect contains the explanation: "The overflowed value due to
arithmetic on constants is too small or unexpectedly
negative, causing incorrect computations. Expression bytes,
which is equal to -95, where ret is known to be equal to -95,
underflows the type that receives it, an unsigned integer
64 bits wide. In ceph_do_objects_copy: Integer overflow occurs
in arithmetic on constant operands (CWE-190)".
The patch changes the type of the bytes variable from size_t
to ssize_t so that it can hold
negative values.
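The effect of the type change can be seen in a small user-space sketch. The
helper names are hypothetical; -95 is the value mentioned in the Coverity
report. Stored in a size_t, the error code wraps to a huge positive count,
while an ssize_t keeps it negative so that a < 0 check still works.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>	/* ssize_t (POSIX) */

/* With size_t, a negative error code wraps to a huge positive value. */
static size_t sketch_as_unsigned(int ret)
{
	return (size_t)ret;
}

/* With ssize_t, the error code stays negative. */
static ssize_t sketch_as_signed(int ret)
{
	return (ssize_t)ret;
}

/* Sum per-call results; stop and return the first negative error code. */
static ssize_t sketch_total(const int *results, size_t n)
{
	ssize_t bytes = 0;

	assert(results != NULL);
	for (size_t i = 0; i < n; i++) {
		if (results[i] < 0)
			return results[i];
		bytes += results[i];
	}
	return bytes;
}
```

A caller of sketch_total() can tell an error from a byte count with a plain
sign test, which is not possible once the value has gone through size_t.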
[1] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1624308
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The Coverity Scan service has detected a wrong sizeof
argument in register_session() [1]. The CID 1598909 defect
contains the explanation: "The wrong sizeof value is used in
an expression or as argument to a function. The result is
an incorrect value that may cause unexpected program behaviors.
In register_session: The sizeof operator is invoked on
the wrong argument (CWE-569)".
The patch introduces a ptr_size variable that is initialized
with sizeof(struct ceph_mds_session *). This variable is used
instead of sizeof(void *) in the code.
[1] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1598909
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The Coverity Scan service has detected a call to
wait_for_completion_killable() without checking the return
value in ceph_lock_wait_for_completion() [1]. The CID 1636232
defect contains the explanation: "If the function returns an error
value, the error value may be mistaken for a normal value.
In ceph_lock_wait_for_completion(): Value returned from
a function is not checked for errors before being used. (CWE-252)".
The patch adds a check of the wait_for_completion_killable()
return value and returns the error code from
ceph_lock_wait_for_completion().
[1] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1636232
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This allows killing processes that wait for a lock when one process is
stuck waiting for the Ceph server. This is similar to the NFS commit
38a125b315 ("fs/nfs/io: make nfs_start_io_*() killable").
[ idryomov: drop comment on include, formatting ]
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>