Commit Graph

2912 Commits

Author SHA1 Message Date
Petr Tesarik
ff56a3e2a8 timers/migration: Clean up the loop in tmigr_quick_check()
Make the logic easier to follow:

  - Remove the final return statement, which is never reached, and move the
    actual walk-terminating return statement out of the do-while loop.

  - Remove the else-clause to reduce indentation. If a non-lonely group is
    encountered during the walk, the loop is immediately terminated with a
    return statement anyway; no need for an else.

Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250606124818.455560-1-ptesarik@suse.com
2025-06-12 21:03:45 +02:00
Ingo Molnar
41cb08555c treewide, timers: Rename from_timer() to timer_container_of()
Move this API to the canonical timer_*() namespace.

[ tglx: Redone against pre rc1 ]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-08 09:07:37 +02:00
Linus Torvalds
b1456f6dc1 Updates for the time/timer core code:
- Rework the initialization of the posix-timer kmem_cache and move the
     cache pointer into the timer_data structure to prevent false sharing.
 
   - Switch the alarmtimer code to lock guards.
 
   - Improve the CPU selection criteria in the per CPU validation of the
     clocksource watchdog to avoid arbitrary selections (or omissions) on
     systems with a small number of CPUs.
 
   - The usual cleanups and improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmgzgwkTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYofL9D/9aiPT3UNkEVJzzQMIjeghi2foKcyqW
 ut+0XH0+8bsFrzNKPs/PL9chIcblamm63FjwKmVxKaTFakP9omGUEMKAXIcy7L10
 UWoQ7kSLvN/+3RB4JwOavtkNtdxkcjhDso+pd1VP0t7BQ5EsFRg4zkGHx1+PO/8C
 H1URzpfmYLZWBPvIHfvgwFy5PAwwppehDynbxrR8uatg8kLvXUUGQRu/yrOYrqx8
 7a/4jFkh75QdsezYOrS6yMjCS0qEeg6l37AW1WLQplZqHxJ4Mmwx9aL890KTQkXO
 MZhtcZ1Iqa/7KdDNw1yzaW9T9t5RzND5IwEbBrLVBoeQft+P/Y3Grax5pHh+Gt8u
 Sj4+4OiyhxQbhOcKGKjTr5pnHc//+jlm1QyLd3Ri6GL+mZB0JTnfmsFDRkhcWORN
 05NcPxfganbJdENStYXZYuEIMXKnp1si8JaEbyI0AfgN8hpWITbVLnMZ65ngdOZ9
 ym2HY2+3V5uCPPxegV2HdYX8VfaIRekiJAJ+Ttt3shG3+QflUNVSL4C6tv+lNNqa
 PPV543ojVmlYWximdhc9KhT12yevhUWiuoFK50lndgbL9vDs471Ua/KXqrf/um0h
 j8n49t8Ioq1Ht7g+olN0P3L+w0qSPM2oMWFnHh6bKTQniBrnkRDOtxQC6FC1pwEI
 JVnpRMAAuSmjVQ==
 =0+bf
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer core updates from Thomas Gleixner:
 "Updates for the time/timer core code:

   - Rework the initialization of the posix-timer kmem_cache and move
     the cache pointer into the timer_data structure to prevent false
     sharing

   - Switch the alarmtimer code to lock guards

   - Improve the CPU selection criteria in the per CPU validation of the
     clocksource watchdog to avoid arbitrary selections (or omissions)
     on systems with a small number of CPUs

   - The usual cleanups and improvements"

* tag 'timers-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/nohz: Remove unused tick_nohz_full_add_cpus_to()
  clocksource: Fix the CPUs' choice in the watchdog per CPU verification
  alarmtimer: Switch spin_{lock,unlock}_irqsave() to guards
  alarmtimer: Remove dead return value in clock2alarm()
  time/jiffies: Change register_refined_jiffies() to void __init
  timers: Remove unused __round_jiffies(_up)
  posix-timers: Initialize cache early and move pointer into __timer_data
2025-05-27 09:04:15 -07:00
Linus Torvalds
5e8bbb2caa Another set of timer API cleanups:
- Convert init_timer*(), try_to_del_timer_sync() and
    destroy_timer_on_stack() over to the canonical timer_*() namespace
    convention.
 
 There are is another large converstion pending, which has not been included
 because it would have caused a gazillion of merge conflicts in next. The
 conversion scripts will be run towards the end of the merge window and a
 pull request sent once all conflict dependencies have been merged.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmgzgTkTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYodwVD/97rF1Juqm1JZNIZPN/vMqwCxRoUkc6
 tsK0+UC7UXusuJadxJ+Bsv25iPF+qejnThMU+SQ5yTVj/PNfxOe0WPdCEGGiL8Ye
 2JCk6GqSOB/360SlLmtR1B1xHDwsuuUcQTz0w57CH66HRV5vpoWSMSwj/ypy+8nU
 PlgjItaxdCKa9NJ+SUJZPWIxRkt/PsA1kwlV1OcxkgB++IiIHQEbPxECq9mlzWXF
 b4Sq/Sdf2OmEePN+DYoey4fneRwJnkjkeX/o+CqosCPHRIiWUlSu5W/lU5IYojM3
 s3XpMNNg/z8PMXR4JA2VaPYWLUZyBOs+3dM7Y6Am+z55EoxMxfzg6pGx2tfM4ftl
 vF8wG3Z1c9MmpLk+P9LatNvfHeVLNve8KgOLa5phMDQ/El/a8KqLu6HmRDPONvKp
 d6iXdPq1CP8P6jOtlFfzLmKPShgEcp+Zz9W3CaQR/0ZJEsEqrpKOLzdT86hJhBV0
 mBCdzixmGtKAh0BdPdmg2FCLScqER3HKIJhZSdV8I+jSETIHCuMiIfbMXR7iwm/H
 R1/ayvxrbc1mPseo28scqvo7m6cn5BFBxIUf4Sokp52ZCapz1v2aWzo4vHI0cTgT
 ZOjlTrf+fgYLn1dqdD45TJiQPnmRrw4dU+WWSFRFJY2qjfyucj80vdqdkE5zkp5b
 UPomlVimG4ccPg==
 =FHGU
 -----END PGP SIGNATURE-----

Merge tag 'timers-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer cleanups from Thomas Gleixner:
 "Another set of timer API cleanups:

    - Convert init_timer*(), try_to_del_timer_sync() and
      destroy_timer_on_stack() over to the canonical timer_*()
      namespace convention.

  There is another large conversion pending, which has not been included
  because it would have caused a gazillion of merge conflicts in next.
  The conversion scripts will be run towards the end of the merge window
  and a pull request sent once all conflict dependencies have been
  merged"

* tag 'timers-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  treewide, timers: Rename destroy_timer_on_stack() as timer_destroy_on_stack()
  treewide, timers: Rename try_to_del_timer_sync() as timer_delete_sync_try()
  timers: Rename init_timers() as timers_init()
  timers: Rename NEXT_TIMER_MAX_DELTA as TIMER_NEXT_MAX_DELTA
  timers: Rename __init_timer_on_stack() as __timer_init_on_stack()
  timers: Rename __init_timer() as __timer_init()
  timers: Rename init_timer_on_stack_key() as timer_init_key_on_stack()
  timers: Rename init_timer_key() as timer_init_key()
2025-05-27 08:31:21 -07:00
Guilherme G. Piccoli
08d7becc1a clocksource: Fix the CPUs' choice in the watchdog per CPU verification
Right now, if the clocksource watchdog detects a clocksource skew, it might
perform a per CPU check, for example in the TSC case on x86.  In other
words: supposing TSC is detected as unstable by the clocksource watchdog
running at CPU1, as part of marking TSC unstable the kernel will also run a
check of TSC readings on some CPUs to be sure it is synced between them
all.

But that check happens only on some CPUs, not all of them; this choice is
based on the parameter "verify_n_cpus" and in some random cpumask
calculation. So, the watchdog runs such per CPU checks on up to
"verify_n_cpus" random CPUs among all online CPUs, with the risk of
repeating CPUs (that aren't double checked) in the cpumask random
calculation.

But if "verify_n_cpus" > num_online_cpus(), it should skip the random
calculation and just go ahead and check the clocksource sync between
all online CPUs, without the risk of skipping some CPUs due to
duplicity in the random cpumask calculation.

Tests in a 4 CPU laptop with TSC skew detected led to some cases of the per
CPU verification skipping some CPU even with verify_n_cpus=8, due to the
duplicity on random cpumask generation. Skipping the randomization when the
number of online CPUs is smaller than verify_n_cpus, solves that.

Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/all/20250323173857.372390-1-gpiccoli@igalia.com
2025-05-13 15:38:55 +02:00
Ingo Molnar
aad823aa3a treewide, timers: Rename destroy_timer_on_stack() as timer_destroy_on_stack()
Move this API to the canonical timer_*() namespace.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250507175338.672442-10-mingo@kernel.org
2025-05-08 19:49:33 +02:00
Ingo Molnar
367ed4e357 treewide, timers: Rename try_to_del_timer_sync() as timer_delete_sync_try()
Move this API to the canonical timer_*() namespace.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250507175338.672442-9-mingo@kernel.org
2025-05-08 19:49:33 +02:00
Ingo Molnar
751e6a394c timers: Rename init_timers() as timers_init()
Move this API to the canonical timers_*() namespace.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250507175338.672442-8-mingo@kernel.org
2025-05-08 19:49:33 +02:00
Ingo Molnar
220beffd36 timers: Rename NEXT_TIMER_MAX_DELTA as TIMER_NEXT_MAX_DELTA
Move this macro to the canonical TIMER_* namespace.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250507175338.672442-7-mingo@kernel.org
2025-05-08 19:49:33 +02:00
Ingo Molnar
7879d10de3 timers: Rename init_timer_on_stack_key() as timer_init_key_on_stack()
Move this API to the canonical timer_*() namespace.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250507175338.672442-4-mingo@kernel.org
2025-05-08 19:49:32 +02:00
Ingo Molnar
e86e43907f timers: Rename init_timer_key() as timer_init_key()
Move this API to the canonical timer_*() namespace.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250507175338.672442-3-mingo@kernel.org
2025-05-08 19:49:32 +02:00
Su Hui
2117c1d503 alarmtimer: Switch spin_{lock,unlock}_irqsave() to guards
Using guard/scoped_guard() to simplify code. Using guard() to remove
'goto unlock' label is neater especially.

[ tglx: Brought back the scoped_guard()'s which were dropped in v2 and
  	simplified alarmtimer_rtc_add_device() ]

Signed-off-by: Su Hui <suhui@nfschina.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250430032734.2079290-4-suhui@nfschina.com
2025-04-30 09:06:23 +02:00
Su Hui
d8ca84d48a alarmtimer: Remove dead return value in clock2alarm()
'clockid' can only be ALARM_REALTIME and ALARM_BOOTTIME. It's impossible to
return -1 and callers never check the return value.

Only alarm_clock_get_timespec(), alarm_clock_get_ktime(),
alarm_timer_create() and alarm_timer_nsleep() call clock2alarm(). These
callers use clockid_to_kclock() to get 'struct k_clock', which ensures
that clock2alarm() never returns -1.

Remove the impossible -1 return value, and add a warning to notify about any
future misuse of this function.

Signed-off-by: Su Hui <suhui@nfschina.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250430032734.2079290-3-suhui@nfschina.com
2025-04-30 09:06:23 +02:00
Su Hui
007c07168a time/jiffies: Change register_refined_jiffies() to void __init
register_refined_jiffies() is only used in setup code and always returns 0.
Mark it as __init to save some bytes and change it to void.

Signed-off-by: Su Hui <suhui@nfschina.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250430032734.2079290-2-suhui@nfschina.com
2025-04-30 09:06:23 +02:00
Thomas Gleixner
b71f9804f6 timekeeping: Prevent coarse clocks going backwards
Lei Chen raised an issue with CLOCK_MONOTONIC_COARSE seeing time
inconsistencies. Lei tracked down that this was being caused by the
adjustment:

    tk->tkr_mono.xtime_nsec -= offset;

which is made to compensate for the unaccumulated cycles in offset when the
multiplicator is adjusted forward, so that the non-_COARSE clockids don't
see inconsistencies.

However, the _COARSE clockid getter functions use the adjusted xtime_nsec
value directly and do not compensate the negative offset via the
clocksource delta multiplied with the new multiplicator. In that case the
caller can observe time going backwards in consecutive calls.

By design, this negative adjustment should be fine, because the logic run
from timekeeping_adjust() is done after it accumulated approximately

     multiplicator * interval_cycles

into xtime_nsec.  The accumulated value is always larger then the

     mult_adj * offset

value, which is subtracted from xtime_nsec. Both operations are done
together under the tk_core.lock, so the net change to xtime_nsec is always
always be positive.

However, do_adjtimex() calls into timekeeping_advance() as well, to
apply the NTP frequency adjustment immediately. In this case,
timekeeping_advance() does not return early when the offset is smaller
then interval_cycles. In that case there is no time accumulated into
xtime_nsec. But the subsequent call into timekeeping_adjust(), which
modifies the multiplicator, subtracts from xtime_nsec to correct for the
new multiplicator.

Here because there was no accumulation, xtime_nsec becomes smaller than
before, which opens a window up to the next accumulation, where the
_COARSE clockid getters, which don't compensate for the offset, can
observe the inconsistency.

This has been tried to be fixed by forwarding the timekeeper in the case
that adjtimex() adjusts the multiplier, which resets the offset to zero:

  757b000f7b ("timekeeping: Fix possible inconsistencies in _COARSE clockids")

That works correctly, but unfortunately causes a regression on the
adjtimex() side. There are two issues:

   1) The forwarding of the base time moves the update out of the original
      period and establishes a new one.

   2) The clearing of the accumulated NTP error is changing the behaviour as
      well.

User-space expects that multiplier/frequency updates are in effect, when the
syscall returns, so delaying the update to the next tick is not solving the
problem either.

Commit 757b000f7b was reverted so that the established expectations of
user space implementations (ntpd, chronyd) are restored, but that obviously
brought the inconsistencies back.

One of the initial approaches to fix this was to establish a separate
storage for the coarse time getter nanoseconds part by calculating it from
the offset. That was dropped on the floor because not having yet another
state to maintain was simpler. But given the result of the above exercise,
this solution turns out to be the right one. Bring it back in a slightly
modified form.

Thus introduce timekeeper::coarse_nsec and store that nanoseconds part in
it, switch the time getter functions and the VDSO update to use that value.
coarse_nsec is set on operations which forward or initialize the timekeeper
and after time was accumulated during a tick. If there is no accumulation
the timestamp is unchanged.

This leaves the adjtimex() behaviour unmodified and prevents coarse time
from going backwards.

[ jstultz: Simplified the coarse_nsec calculation and kept behavior so
  	   coarse clockids aren't adjusted on each inter-tick adjtimex
  	   call, slightly reworked the comments and commit message ]

Fixes: da15cfdae0 ("time: Introduce CLOCK_REALTIME_COARSE")
Reported-by: Lei Chen <lei.chen@smartx.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/20250419054706.2319105-1-jstultz@google.com
Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/
2025-04-28 11:17:29 +02:00
Dr. David Alan Gilbert
49916e22d9 timers: Remove unused __round_jiffies(_up)
Remove two trivial but long unused functions.

__round_jiffies() has been unused since 2008's
commit 9c133c469d ("Add round_jiffies_up and related routines")

__round_jiffies_up() has been unused since 2019's
commit 7ae3f6e130 ("powerpc/watchdog: Use hrtimers for per-CPU
heartbeat")

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250418200803.427911-1-linux@treblig.org
2025-04-24 14:31:35 +02:00
Sebastian Andrzej Siewior
92e250c624 timekeeping: Add a lockdep override in tick_freeze()
tick_freeze() acquires a raw spinlock (tick_freeze_lock). Later in the
callchain (timekeeping_suspend() -> mc146818_avoid_UIP()) the RTC driver
acquires a spinlock which becomes a sleeping lock on PREEMPT_RT.  Lockdep
complains about this lock nesting.

Add a lockdep override for this special case and a comment explaining
why it is okay.

Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250404133429.pnAzf-eF@linutronix.de
Closes: https://lore.kernel.org/all/20250330113202.GAZ-krsjAnurOlTcp-@fat_crate.local/
Closes: https://lore.kernel.org/all/CAP-bSRZ0CWyZZsMtx046YV8L28LhY0fson2g4EqcwRAVN1Jk+Q@mail.gmail.com/
2025-04-09 22:30:39 +02:00
Eric Dumazet
0df6db767a posix-timers: Initialize cache early and move pointer into __timer_data
Move posix_timers_cache initialization to posixtimer_init(). At that point
the memory subsystem is already up and running.

Also move the cache pointer to the __timer_data variable to avoid
potential false sharing, since it never was marked as __ro_after_init.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250402133114.253901-1-edumazet@google.com
2025-04-09 21:21:36 +02:00
Nam Cao
2424e146be hrtimer: Add missing ACCESS_PRIVATE() for hrtimer::function
The "function" field of struct hrtimer has been changed to private, but
two instances have not been converted to use ACCESS_PRIVATE().

Convert them to use ACCESS_PRIVATE().

Fixes: 04257da0c9 ("hrtimers: Make callback function pointer private")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250408103854.1851093-1-namcao@linutronix.de
Closes: https://lore.kernel.org/oe-kbuild-all/202504071931.vOVl13tt-lkp@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202504072155.5UAZjYGU-lkp@intel.com/
2025-04-09 21:00:42 +02:00
Linus Torvalds
16cd1c2657 A set of final cleanups for the timer subsystem:
1) Convert all del_timer[_sync]() instances over to the new
      timer_delete[_sync]() API and remove the legacy wrappers.
 
      Conversion was done with coccinelle plus some manual fixups as
      coccinelle chokes on scoped_guard().
 
   2) The final cleanup of the hrtimer_init() to hrtimer_setup() conversion.
 
      This has been delayed to the end of the merge window, so that all
      patches which have been merged through other trees are in mainline and
      all new users are catched.
 
 Doing this right before rc1 ensures that new code which is merged post rc1
 is not introducing new instances of the original functionality.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmfyXi0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoYzlD/4ykDZbUzgTreYOxEQpBJ9elPwBhxfL
 1v8OwDjRWlNrmLup8RiUfKrlbmztGl1J/u9ld0qhjcqkywCCBC1N5S+DhCjYetyP
 MPWLbi2Dc35cFA+M7i8fMgxI2K9MLz2Zj1UKxz1MdsSuNHm07N3mul/3T11Ye4Rz
 nPlzeQBTBDFCKTEGKjr8zjuoD15Wl48sObM0AjV35BPuQR1jfY4CE6VXo2h78+0c
 jYwpJpDmcd+o1bDrfFhWUME2DzABEkHhn4wNSETnM4E5RXZRMUbi4UiigzInibQr
 JOUTKwPJXTMX/Erd0XyXErrYf2qy1X9BQy6NlyDDOv+8kLEVRsC9Efplx9uoEtfi
 QvVT/UmgmhZFJBfIT3/B8OvasrfwOropaYoG4L0zbDpp1b09VY47N5lCLlNr/mZf
 jb2TwIln8Szy2EfIT2RSd0ZNupyU8V4aH/mYNpSlbUJ6mfvfIAttBSS/YH+Zeqku
 7zOJkoCusaySOCZCOQkeikL3ZBN+FHtNteXxmGnp34ed/tsfgGZj1lsbmkM2rrWo
 f2mQsYAclUA4KQeY9z/Xf7/c5wJUkME69PxOaaN23dOpBR7GA58Cvb0PQTnPlAiT
 KnH/JRweBHtcv4KEHMi2f5no4cxcmXyKTj7/TLyYNjc8LATL9Eo/nxG36PLxy4lN
 QPOWz11zEBLjQQ==
 =8Ftq
 -----END PGP SIGNATURE-----

Merge tag 'timers-cleanups-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer cleanups from Thomas Gleixner:
 "A set of final cleanups for the timer subsystem:

   - Convert all del_timer[_sync]() instances over to the new
     timer_delete[_sync]() API and remove the legacy wrappers.

     Conversion was done with coccinelle plus some manual fixups as
     coccinelle chokes on scoped_guard().

   - The final cleanup of the hrtimer_init() to hrtimer_setup()
     conversion.

     This has been delayed to the end of the merge window, so that all
     patches which have been merged through other trees are in mainline
     and all new users are catched.

  Doing this right before rc1 ensures that new code which is merged post
  rc1 is not introducing new instances of the original functionality"

* tag 'timers-cleanups-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tracing/timers: Rename the hrtimer_init event to hrtimer_setup
  hrtimers: Rename debug_init_on_stack() to debug_setup_on_stack()
  hrtimers: Rename debug_init() to debug_setup()
  hrtimers: Rename __hrtimer_init_sleeper() to __hrtimer_setup_sleeper()
  hrtimers: Remove unnecessary NULL check in hrtimer_start_range_ns()
  hrtimers: Make callback function pointer private
  hrtimers: Merge __hrtimer_init() into __hrtimer_setup()
  hrtimers: Switch to use __htimer_setup()
  hrtimers: Delete hrtimer_init()
  treewide: Convert new and leftover hrtimer_init() users
  treewide: Switch/rename to timer_delete[_sync]()
2025-04-06 08:35:37 -07:00
Nam Cao
244132c4e5 tracing/timers: Rename the hrtimer_init event to hrtimer_setup
The function hrtimer_init() doesn't exist anymore. It was replaced by
hrtimer_setup().

Thus, rename the hrtimer_init trace event to hrtimer_setup to keep it
consistent.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/cba84c3d853c5258aa3a262363a6eac08e2c7afc.1738746927.git.namcao@linutronix.de
2025-04-05 10:30:17 +02:00
Nam Cao
59c9edafc0 hrtimers: Rename debug_init_on_stack() to debug_setup_on_stack()
All the hrtimer_init*() functions have been renamed to hrtimer_setup*().
Rename debug_init_on_stack() to debug_setup_on_stack() as well, to keep the
names consistent.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/073cf6162779a2f5b12624677d4c49ee7eccc1ed.1738746927.git.namcao@linutronix.de
2025-04-05 10:30:17 +02:00
Nam Cao
e9ef2093ad hrtimers: Rename debug_init() to debug_setup()
All the hrtimer_init*() functions have been renamed to hrtimer_setup*().
Rename debug_init() to debug_setup() as well, to keep the names consistent.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/4b730c1f79648b16a1c5413f928fdc2e138dfc43.1738746927.git.namcao@linutronix.de
2025-04-05 10:30:17 +02:00
Nam Cao
fcea1ccf24 hrtimers: Rename __hrtimer_init_sleeper() to __hrtimer_setup_sleeper()
All the hrtimer_init*() functions have been renamed to hrtimer_setup*().
Rename __hrtimer_init_sleeper() to __hrtimer_setup_sleeper() as well, to
keep the names consistent.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/807694aedad9353421c4a7347629a30c5c31026f.1738746927.git.namcao@linutronix.de
2025-04-05 10:30:17 +02:00
Nam Cao
1cc24f2e76 hrtimers: Remove unnecessary NULL check in hrtimer_start_range_ns()
The struct hrtimer::function field can only be changed using
hrtimer_setup*() or hrtimer_update_function(), and both already null-check
'function'. Therefore, null-checking 'function' in hrtimer_start_range_ns()
is not necessary.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/4661c571ee87980c340ccc318fc1a473c0c8f6bc.1738746927.git.namcao@linutronix.de
2025-04-05 10:30:17 +02:00
Nam Cao
04257da0c9 hrtimers: Make callback function pointer private
Make the struct hrtimer::function field private, to prevent users from
changing this field in an unsafe way. hrtimer_update_function() should be
used if the callback function needs to be changed.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/7d0e6e0c5c59a64a9bea940051aac05d750bc0c2.1738746927.git.namcao@linutronix.de
2025-04-05 10:30:17 +02:00
Nam Cao
87d82cff38 hrtimers: Merge __hrtimer_init() into __hrtimer_setup()
__hrtimer_init() is only called by __hrtimer_setup(). Simplify by merging
__hrtimer_init() into __hrtimer_setup().

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/8a0a847a35f711f66b2d05b57255aa44e7e61279.1738746927.git.namcao@linutronix.de
2025-04-05 10:30:17 +02:00
Nam Cao
50177a8b2e hrtimers: Switch to use __htimer_setup()
__hrtimer_init_sleeper() calls __hrtimer_init() and also sets up the
callback function. But there is already __hrtimer_setup() which does both
actions.

Switch to use __hrtimer_setup() to simplify the code.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/d9a45a51b6a8aa0045310d63f73753bf6b33f385.1738746927.git.namcao@linutronix.de
2025-04-05 10:30:17 +02:00
Nam Cao
9779489a31 hrtimers: Delete hrtimer_init()
hrtimer_init() is now unused. Delete it.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/003722f60c7a2a4f8d4ed24fb741aa313b7e5136.1738746927.git.namcao@linutronix.de
2025-04-05 10:30:17 +02:00
Thomas Gleixner
8fa7292fee treewide: Switch/rename to timer_delete[_sync]()
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.

Conversion was done with coccinelle plus manual fixups where necessary.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-05 10:30:12 +02:00
Thomas Gleixner
324a2219ba Revert "timekeeping: Fix possible inconsistencies in _COARSE clockids"
This reverts commit 757b000f7b.

Miroslav reported that the changes for handling the inconsistencies in the
coarse time getters result in a regression on the adjtimex() side.

There are two issues:

  1) The forwarding of the base time moves the update out of the original
     period and establishes a new one.

  2) The clearing of the accumulated NTP error is changing the behaviour as
     well.

Userspace expects that multiplier/frequency updates are in effect, when the
syscall returns, so delaying the update to the next tick is not solving the
problem either.

Revert the change, so that the established expectations of user space
implementations (ntpd, chronyd) are restored. The re-introduced
inconsistency of the coarse time getters will be addressed in a subsequent
fix.

Fixes: 757b000f7b ("timekeeping: Fix possible inconsistencies in _COARSE clockids")
Reported-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/Z-qsg6iDGlcIJulJ@localhost
2025-04-04 19:10:00 +02:00
Linus Torvalds
1a9239bb42 Networking changes for 6.15.
Core & protocols
 ----------------
 
  - Continue Netlink conversions to per-namespace RTNL lock
    (IPv4 routing, routing rules, routing next hops, ARP ioctls).
 
  - Continue extending the use of netdev instance locks. As a driver
    opt-in protect queue operations and (in due course) ethtool
    operations with the instance lock and not RTNL lock.
 
  - Support collecting TCP timestamps (data submitted, sent, acked)
    in BPF, allowing for transparent (to the application) and lower
    overhead tracking of TCP RPC performance.
 
  - Tweak existing networking Rx zero-copy infra to support zero-copy
    Rx via io_uring.
 
  - Optimize MPTCP performance in single subflow mode by 29%.
 
  - Enable GRO on packets which went thru XDP CPU redirect (were queued
    for processing on a different CPU). Improving TCP stream performance
    up to 2x.
 
  - Improve performance of contended connect() by 200% by searching
    for an available 4-tuple under RCU rather than a spin lock.
    Bring an additional 229% improvement by tweaking hash distribution.
 
  - Avoid unconditionally touching sk_tsflags on RX, improving
    performance under UDP flood by as much as 10%.
 
  - Avoid skb_clone() dance in ping_rcv() to improve performance under
    ping flood.
 
  - Avoid FIB lookup in netfilter if socket is available, 20% perf win.
 
  - Rework network device creation (in-kernel) API to more clearly
    identify network namespaces and their roles.
    There are up to 4 namespace roles but we used to have just 2 netns
    pointer arguments, interpreted differently based on context.
 
  - Use sysfs_break_active_protection() instead of trylock to avoid
    deadlocks between unregistering objects and sysfs access.
 
  - Add a new sysctl and sockopt for capping max retransmit timeout
    in TCP.
 
  - Support masking port and DSCP in routing rule matches.
 
  - Support dumping IPv4 multicast addresses with RTM_GETMULTICAST.
 
  - Support specifying at what time packet should be sent on AF_XDP
    sockets.
 
  - Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin users.
 
  - Add Netlink YAML spec for WiFi (nl80211) and conntrack.
 
  - Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols
    which only need to be exported when IPv6 support is built as a module.
 
  - Age FDB entries based on Rx not Tx traffic in VxLAN, similar
    to normal bridging.
 
  - Allow users to specify source port range for GENEVE tunnels.
 
  - netconsole: allow attaching kernel release, CPU ID and task name
    to messages as metadata
 
 Driver API
 ----------
 
  - Continue rework / fixing of Energy Efficient Ethernet (EEE) across
    the SW layers. Delegate the responsibilities to phylink where possible.
    Improve its handling in phylib.
 
  - Support symmetric OR-XOR RSS hashing algorithm.
 
  - Support tracking and preserving IRQ affinity by NAPI itself.
 
  - Support loopback mode speed selection for interface selftests.
 
 Device drivers
 --------------
 
  - Remove the IBM LCS driver for s390.
 
  - Remove the sb1000 cable modem driver.
 
  - Add support for SFP module access over SMBus.
 
  - Add MCTP transport driver for MCTP-over-USB.
 
  - Enable XDP metadata support in multiple drivers.
 
  - Ethernet high-speed NICs:
    - Broadcom (bnxt):
      - add PCIe TLP Processing Hints (TPH) support for new AMD platforms
      - support dumping RoCE queue state for debug
      - opt into instance locking
    - Intel (100G, ice, idpf):
      - ice: rework MSI-X IRQ management and distribution
      - ice: support for E830 devices
      - iavf: add support for Rx timestamping
      - iavf: opt into instance locking
    - nVidia/Mellanox:
      - mlx4: use page pool memory allocator for Rx
      - mlx5: support for one PTP device per hardware clock
      - mlx5: support for 200Gbps per-lane link modes
      - mlx5: move IPSec policy check after decryption
    - AMD/Solarflare:
      - support FW flashing via devlink
    - Cisco (enic):
      - use page pool memory allocator for Rx
      - enable 32, 64 byte CQEs
      - get max rx/tx ring size from the device
    - Meta (fbnic):
      - support flow steering and RSS configuration
      - report queue stats
      - support TCP segmentation
      - support IRQ coalescing
      - support ring size configuration
    - Marvell/Cavium:
      - support AF_XDP
    - Wangxun:
      - support for PTP clock and timestamping
    - Huawei (hibmcge):
      - checksum offload
      - add more statistics
 
  - Ethernet virtual:
    - VirtIO net:
      - aggressively suppress Tx completions, improve perf by 96% with
        1 CPU and 55% with 2 CPUs
      - expose NAPI to IRQ mapping and persist NAPI settings
    - Google (gve):
      - support XDP in DQO RDA Queue Format
      - opt into instance locking
    - Microsoft vNIC:
      - support BIG TCP
 
  - Ethernet NICs consumer, and embedded:
    - Synopsys (stmmac):
      - cleanup Tx and Tx clock setting and other link-focused cleanups
      - enable SGMII and 2500BASEX mode switching for Intel platforms
      - support Sophgo SG2044
    - Broadcom switches (b53):
      - support for BCM53101
    - TI:
      - iep: add perout configuration support
      - icssg: support XDP
    - Cadence (macb):
      - implement BQL
    - Xilinx (axinet):
      - support dynamic IRQ moderation and changing coalescing at runtime
      - implement BQL
      - report standard stats
    - MediaTek:
      - support phylink managed EEE
    - Intel:
      - igc: don't restart the interface on every XDP program change
    - RealTek (r8169):
      - support reading registers of internal PHYs directly
      - increase max jumbo packet size on RTL8125/RTL8126
    - Airoha:
      - support for RISC-V NPU packet processing unit
      - enable scatter-gather and support MTU up to 9kB
    - Tehuti (tn40xx):
      - support cards with TN4010 MAC and an Aquantia AQR105 PHY
 
  - Ethernet PHYs:
    - support for TJA1102S, TJA1121
    - dp83tg720: add randomized polling intervals for link detection
    - dp83822: support changing the transmit amplitude voltage
    - support for LEDs on 88q2xxx
 
  - CAN:
    - canxl: support Remote Request Substitution bit access
    - flexcan: add S32G2/S32G3 SoC
 
  - WiFi:
    - remove cooked monitor support
    - strict mode for better AP testing
    - basic EPCS support
    - OMI RX bandwidth reduction support
    - batman-adv: add support for jumbo frames
 
  - WiFi drivers:
    - RealTek (rtw88):
      - support RTL8814AE and RTL8814AU
    - RealTek (rtw89):
      - switch using wiphy_lock and wiphy_work
      - add BB context to manipulate two PHY as preparation of MLO
      - improve BT-coexistence mechanism to play A2DP smoothly
    - Intel (iwlwifi):
      - add new iwlmld sub-driver for latest HW/FW combinations
    - MediaTek (mt76):
      - preparation for mt7996 Multi-Link Operation (MLO) support
    - Qualcomm/Atheros (ath12k):
      - continued work on MLO
    - Silabs (wfx):
      - Wake-on-WLAN support
 
  - Bluetooth:
    - add support for skb TX SND/COMPLETION timestamping
    - hci_core: enable buffer flow control for SCO/eSCO
    - coredump: log devcd dumps into the monitor
 
  - Bluetooth drivers:
    - intel: add support to configure TX power
    - nxp: handle bootloader error during cmd5 and cmd7
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmfkLC8ACgkQMUZtbf5S
 Irsb5g/+L7oKOf0ALbaV9kxFsoz8AymZfAW9i/27F07omGJGpks8oX6j6rQLgIRO
 OQOFcp7XEdDh1+jh82gHVuPrw2/6lchLtW8ARtzdiQKFr5DRjrsbtua6GRc8iBqA
 DIRCBFoV2HuMkF39Vr09HMa9AZAT7QR2RLsRGpSq8E8Z8xxKz0X7oujs10PFpMTE
 IVKhTrVrk+NDot/IU2hzVpnpup+0ld+T2/ZaBklJGcU8uDffImsqNepHRyCG5UC3
 xz74Ju23MAj24Gct+og0yFUooF+lUltKyVm0FYCDCY3bASTwgY01NR3kEH/0NQvM
 cywLzd/ngHm/SMD2ggVAHkjZUieiIVHdaZ53dgjDeBOQoVP6p0dgUK7EumXX8Mx4
 8ReR2UiGoYRPaq9c4o+IjG4K027MwVK2p+mF1a6MLa+20XcyMbev8FIRbbHtC/V4
 z5/FsOAxcuICWkA1hU9bODrrGzIqemmdRgKG8sGuTJCt/kYGAn72/TCATGNSaCJ0
 00n2jN1aepa7wtywHJ5MhVzxN9iQX7+geUHXz0BI+lK4e1Pmk+vjGksymb9ai2fk
 eQAUV9ekub6q68/J16scD7XeOUM37bTLiMBQeIF8UtZBOJscKiS71zn9QP9Twwxv
 P2pm01RDZUI+z5ZX3hc12Pm1vjRHaAh9S1JpAw/pTOVlQ+mAJEM=
 =XY0S
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Continue Netlink conversions to per-namespace RTNL lock
     (IPv4 routing, routing rules, routing next hops, ARP ioctls)

   - Continue extending the use of netdev instance locks. As a driver
     opt-in protect queue operations and (in due course) ethtool
     operations with the instance lock and not RTNL lock.

   - Support collecting TCP timestamps (data submitted, sent, acked) in
     BPF, allowing for transparent (to the application) and lower
     overhead tracking of TCP RPC performance.

   - Tweak existing networking Rx zero-copy infra to support zero-copy
     Rx via io_uring.

   - Optimize MPTCP performance in single subflow mode by 29%.

   - Enable GRO on packets which went thru XDP CPU redirect (were queued
     for processing on a different CPU). Improving TCP stream
     performance up to 2x.

   - Improve performance of contended connect() by 200% by searching for
     an available 4-tuple under RCU rather than a spin lock. Bring an
     additional 229% improvement by tweaking hash distribution.

   - Avoid unconditionally touching sk_tsflags on RX, improving
     performance under UDP flood by as much as 10%.

   - Avoid skb_clone() dance in ping_rcv() to improve performance under
     ping flood.

   - Avoid FIB lookup in netfilter if socket is available, 20% perf win.

   - Rework network device creation (in-kernel) API to more clearly
     identify network namespaces and their roles. There are up to 4
     namespace roles but we used to have just 2 netns pointer arguments,
     interpreted differently based on context.

   - Use sysfs_break_active_protection() instead of trylock to avoid
     deadlocks between unregistering objects and sysfs access.

   - Add a new sysctl and sockopt for capping max retransmit timeout in
     TCP.

   - Support masking port and DSCP in routing rule matches.

   - Support dumping IPv4 multicast addresses with RTM_GETMULTICAST.

   - Support specifying at what time packet should be sent on AF_XDP
     sockets.

   - Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin
     users.

   - Add Netlink YAML spec for WiFi (nl80211) and conntrack.

   - Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols
     which only need to be exported when IPv6 support is built as a
     module.

   - Age FDB entries based on Rx not Tx traffic in VxLAN, similar to
     normal bridging.

   - Allow users to specify source port range for GENEVE tunnels.

   - netconsole: allow attaching kernel release, CPU ID and task name to
     messages as metadata

  Driver API:

   - Continue rework / fixing of Energy Efficient Ethernet (EEE) across
     the SW layers. Delegate the responsibilities to phylink where
     possible. Improve its handling in phylib.

   - Support symmetric OR-XOR RSS hashing algorithm.

   - Support tracking and preserving IRQ affinity by NAPI itself.

   - Support loopback mode speed selection for interface selftests.

  Device drivers:

   - Remove the IBM LCS driver for s390

   - Remove the sb1000 cable modem driver

   - Add support for SFP module access over SMBus

   - Add MCTP transport driver for MCTP-over-USB

   - Enable XDP metadata support in multiple drivers

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - add PCIe TLP Processing Hints (TPH) support for new AMD
           platforms
         - support dumping RoCE queue state for debug
         - opt into instance locking
      - Intel (100G, ice, idpf):
         - ice: rework MSI-X IRQ management and distribution
         - ice: support for E830 devices
         - iavf: add support for Rx timestamping
         - iavf: opt into instance locking
      - nVidia/Mellanox:
         - mlx4: use page pool memory allocator for Rx
         - mlx5: support for one PTP device per hardware clock
         - mlx5: support for 200Gbps per-lane link modes
         - mlx5: move IPSec policy check after decryption
      - AMD/Solarflare:
         - support FW flashing via devlink
      - Cisco (enic):
         - use page pool memory allocator for Rx
         - enable 32, 64 byte CQEs
         - get max rx/tx ring size from the device
      - Meta (fbnic):
         - support flow steering and RSS configuration
         - report queue stats
         - support TCP segmentation
         - support IRQ coalescing
         - support ring size configuration
      - Marvell/Cavium:
         - support AF_XDP
      - Wangxun:
         - support for PTP clock and timestamping
      - Huawei (hibmcge):
         - checksum offload
         - add more statistics

   - Ethernet virtual:
      - VirtIO net:
         - aggressively suppress Tx completions, improve perf by 96%
           with 1 CPU and 55% with 2 CPUs
         - expose NAPI to IRQ mapping and persist NAPI settings
      - Google (gve):
         - support XDP in DQO RDA Queue Format
         - opt into instance locking
      - Microsoft vNIC:
         - support BIG TCP

   - Ethernet NICs consumer, and embedded:
      - Synopsys (stmmac):
         - cleanup Tx and Tx clock setting and other link-focused
           cleanups
         - enable SGMII and 2500BASEX mode switching for Intel platforms
         - support Sophgo SG2044
      - Broadcom switches (b53):
         - support for BCM53101
      - TI:
         - iep: add perout configuration support
         - icssg: support XDP
      - Cadence (macb):
         - implement BQL
      - Xilinx (axinet):
         - support dynamic IRQ moderation and changing coalescing at
           runtime
         - implement BQL
         - report standard stats
      - MediaTek:
         - support phylink managed EEE
      - Intel:
         - igc: don't restart the interface on every XDP program change
      - RealTek (r8169):
         - support reading registers of internal PHYs directly
         - increase max jumbo packet size on RTL8125/RTL8126
      - Airoha:
         - support for RISC-V NPU packet processing unit
         - enable scatter-gather and support MTU up to 9kB
      - Tehuti (tn40xx):
         - support cards with TN4010 MAC and an Aquantia AQR105 PHY

   - Ethernet PHYs:
      - support for TJA1102S, TJA1121
      - dp83tg720: add randomized polling intervals for link detection
      - dp83822: support changing the transmit amplitude voltage
      - support for LEDs on 88q2xxx

   - CAN:
      - canxl: support Remote Request Substitution bit access
      - flexcan: add S32G2/S32G3 SoC

   - WiFi:
      - remove cooked monitor support
      - strict mode for better AP testing
      - basic EPCS support
      - OMI RX bandwidth reduction support
      - batman-adv: add support for jumbo frames

   - WiFi drivers:
      - RealTek (rtw88):
         - support RTL8814AE and RTL8814AU
      - RealTek (rtw89):
         - switch using wiphy_lock and wiphy_work
         - add BB context to manipulate two PHY as preparation of MLO
         - improve BT-coexistence mechanism to play A2DP smoothly
      - Intel (iwlwifi):
         - add new iwlmld sub-driver for latest HW/FW combinations
      - MediaTek (mt76):
         - preparation for mt7996 Multi-Link Operation (MLO) support
      - Qualcomm/Atheros (ath12k):
         - continued work on MLO
      - Silabs (wfx):
         - Wake-on-WLAN support

   - Bluetooth:
      - add support for skb TX SND/COMPLETION timestamping
      - hci_core: enable buffer flow control for SCO/eSCO
      - coredump: log devcd dumps into the monitor

   - Bluetooth drivers:
      - intel: add support to configure TX power
      - nxp: handle bootloader error during cmd5 and cmd7"

* tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1681 commits)
  unix: fix up for "apparmor: add fine grained af_unix mediation"
  mctp: Fix incorrect tx flow invalidation condition in mctp-i2c
  net: usb: asix: ax88772: Increase phy_name size
  net: phy: Introduce PHY_ID_SIZE — minimum size for PHY ID string
  net: libwx: fix Tx L4 checksum
  net: libwx: fix Tx descriptor content for some tunnel packets
  atm: Fix NULL pointer dereference
  net: tn40xx: add pci-id of the aqr105-based Tehuti TN4010 cards
  net: tn40xx: prepare tn40xx driver to find phy of the TN9510 card
  net: tn40xx: create swnode for mdio and aqr105 phy and add to mdiobus
  net: phy: aquantia: add essential functions to aqr105 driver
  net: phy: aquantia: search for firmware-name in fwnode
  net: phy: aquantia: add probe function to aqr105 for firmware loading
  net: phy: Add swnode support to mdiobus_scan
  gve: add XDP DROP and PASS support for DQ
  gve: update XDP allocation path support RX buffer posting
  gve: merge packet buffer size fields
  gve: update GQ RX to use buf_size
  gve: introduce config-based allocation for XDP
  gve: remove xdp_xsk_done and xdp_xsk_wakeup statistics
  ...
2025-03-26 21:48:21 -07:00
Linus Torvalds
317a76a996 Updates for the VDSO infrastructure:
- Consolidate the VDSO storage
 
     The VDSO data storage and data layout has been largely architecture
     specific for historical reasons. That increases the maintenance effort
     and causes inconsistencies over and over.
 
     There is no real technical reason for architecture specific layouts and
     implementations. The architecture specific details can easily be
     integrated into a generic layout, which also reduces the amount of
     duplicated code for managing the mappings.
 
     Convert all architectures over to a unified layout and common mapping
     infrastructure. This splits the VDSO data layout into subsystem
     specific blocks, timekeeping, random and architecture parts, which
     provides a better structure and allows to improve and update the
     functionalities without conflict and interaction.
 
   - Rework the timekeeping data storage
 
     The current implementation is designed for exposing system timekeeping
     accessors, which was good enough at the time when it was designed.
 
     PTP and Time Sensitive Networking (TSN) change that as there are
     requirements to expose independent PTP clocks, which are not related to
     system timekeeping.
 
     Replace the monolithic data storage by a structured layout, which
     allows to add support for independent PTP clocks on top while reusing
     both the data structures and the time accessor implementations.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmfgSWUTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoYGED/0f/M8YyacAyErDYW4ufW+zh2sUidSf
 GVlK0Jn5BMljOoye+y2XfTxuvvXxEDjJNYiJm2uKGPdV29tjNXreGK39XyNqXPu5
 jwR4f/IN/QVSM2nCO6jyydMz8ympJ2k6M4RewwmxXBL2KsUzzJWSKTgRNqM5Tdjs
 1RhJMjkQVTiiSYerBpHXYCeZLM7/VEfZ120uuzVAYPXo0/R6zuyF7IBgIao9hbfO
 IQeCMLLfpDQHQhwquTA8ZbWqQusiEoSYHT+kTDa3eXDDbE/2UklAUs9gaatI979x
 73zs0Yqxyx2iIGaghACWOAbKdcBWBeCYDw5fFwYVKn4VMQi1+wcxbtOYL767jp9o
 vfkLXGilXcVkvDjv4fH+e1NoJXXBxq1Ug1silKdOeJzenQF8Q1i3tavkWUVCNfwH
 qyOIM72NiCEWbYBDcz0lwBxEAyO4o0E6NP1bDc4y50VedEYIbXwSh0QGrdev1abn
 rjY9vsuUR9oznmZ6BRPPxMTY87gOSHoKvqydgSZUACEgLV9346f5qZf341OReYai
 MXUmXOM4+LdyaM1+Mec8ppvjMbLw+736NZyZtT2InusEBE+Ddp25L3hYiWnklJu8
 2uwv0AoyrwaJ8y6ADOX4thcLZq0gND0Z/Ayz/XvpeI30eftsGUCt5KOVlqwfwOkI
 4EQKvk2fAixPxg==
 =rwei
 -----END PGP SIGNATURE-----

Merge tag 'timers-vdso-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull VDSO infrastructure updates from Thomas Gleixner:

 - Consolidate the VDSO storage

   The VDSO data storage and data layout has been largely architecture
   specific for historical reasons. That increases the maintenance
   effort and causes inconsistencies over and over.

   There is no real technical reason for architecture specific layouts
   and implementations. The architecture specific details can easily be
   integrated into a generic layout, which also reduces the amount of
   duplicated code for managing the mappings.

   Convert all architectures over to a unified layout and common mapping
   infrastructure. This splits the VDSO data layout into subsystem
   specific blocks, timekeeping, random and architecture parts, which
   provides a better structure and allows to improve and update the
   functionalities without conflict and interaction.

 - Rework the timekeeping data storage

   The current implementation is designed for exposing system
   timekeeping accessors, which was good enough at the time when it was
   designed.

   PTP and Time Sensitive Networking (TSN) change that as there are
   requirements to expose independent PTP clocks, which are not related
   to system timekeeping.

   Replace the monolithic data storage by a structured layout, which
   allows to add support for independent PTP clocks on top while reusing
   both the data structures and the time accessor implementations.

* tag 'timers-vdso-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
  sparc/vdso: Always reject undefined references during linking
  x86/vdso: Always reject undefined references during linking
  vdso: Rework struct vdso_time_data and introduce struct vdso_clock
  vdso: Move architecture related data before basetime data
  powerpc/vdso: Prepare introduction of struct vdso_clock
  arm64/vdso: Prepare introduction of struct vdso_clock
  x86/vdso: Prepare introduction of struct vdso_clock
  time/namespace: Prepare introduction of struct vdso_clock
  vdso/namespace: Rename timens_setup_vdso_data() to reflect new vdso_clock struct
  vdso/vsyscall: Prepare introduction of struct vdso_clock
  vdso/gettimeofday: Prepare helper functions for introduction of struct vdso_clock
  vdso/gettimeofday: Prepare do_coarse_timens() for introduction of struct vdso_clock
  vdso/gettimeofday: Prepare do_coarse() for introduction of struct vdso_clock
  vdso/gettimeofday: Prepare do_hres_timens() for introduction of struct vdso_clock
  vdso/gettimeofday: Prepare do_hres() for introduction of struct vdso_clock
  vdso/gettimeofday: Prepare introduction of struct vdso_clock
  vdso/helpers: Prepare introduction of struct vdso_clock
  vdso/datapage: Define vdso_clock to prepare for multiple PTP clocks
  vdso: Make vdso_time_data cacheline aligned
  arm64: Make asm/cache.h compatible with vDSO
  ...
2025-03-25 11:30:42 -07:00
Linus Torvalds
a50b4fe095 A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
   the callback pointer. This turned out to be suboptimal for the upcoming
   Rust integration and is obviously a silly implementation to begin with.
 
   This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
   with hrtimer_setup(T, cb);
 
   The conversion was done with Coccinelle and a few manual fixups.
 
   Once the conversion has completely landed in mainline, hrtimer_init()
   will be removed and the hrtimer::function becomes a private member.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8
 GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc
 rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv
 ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r
 d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8
 OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV
 Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE
 iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop
 Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn
 gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I
 Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0
 iJbvLJeisXr3GA==
 =bYAX
 -----END PGP SIGNATURE-----

Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer cleanups from Thomas Gleixner:
 "A treewide hrtimer timer cleanup

  hrtimers are initialized with hrtimer_init() and a subsequent store to
  the callback pointer. This turned out to be suboptimal for the
  upcoming Rust integration and is obviously a silly implementation to
  begin with.

  This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
  with hrtimer_setup(T, cb);

  The conversion was done with Coccinelle and a few manual fixups.

  Once the conversion has completely landed in mainline, hrtimer_init()
  will be removed and the hrtimer::function becomes a private member"

* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  wifi: rt2x00: Switch to use hrtimer_update_function()
  io_uring: Use helper function hrtimer_update_function()
  serial: xilinx_uartps: Use helper function hrtimer_update_function()
  ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
  RDMA: Switch to use hrtimer_setup()
  virtio: mem: Switch to use hrtimer_setup()
  drm/vmwgfx: Switch to use hrtimer_setup()
  drm/xe/oa: Switch to use hrtimer_setup()
  drm/vkms: Switch to use hrtimer_setup()
  drm/msm: Switch to use hrtimer_setup()
  drm/i915/request: Switch to use hrtimer_setup()
  drm/i915/uncore: Switch to use hrtimer_setup()
  drm/i915/pmu: Switch to use hrtimer_setup()
  drm/i915/perf: Switch to use hrtimer_setup()
  drm/i915/gvt: Switch to use hrtimer_setup()
  drm/i915/huc: Switch to use hrtimer_setup()
  drm/amdgpu: Switch to use hrtimer_setup()
  stm class: heartbeat: Switch to use hrtimer_setup()
  i2c: Switch to use hrtimer_setup()
  iio: Switch to use hrtimer_setup()
  ...
2025-03-25 10:54:15 -07:00
Linus Torvalds
d5048d1176 Updates for the core time/timer subsystem:
- Fix a memory ordering issue in posix-timers
 
     Posix-timer lookup is lockless and reevaluates the timer validity under
     the timer lock, but the update which validates the timer is not
     protected by the timer lock. That allows the store to be reordered
     against the initialization stores, so that the lookup side can observe
     a partially initialized timer. That's mostly a theoretical problem, but
     incorrect nevertheless.
 
   - Fix a long standing inconsistency of the coarse time getters
 
     The coarse time getters read the base time of the current update cycle
     without reading the actual hardware clock. NTP frequency adjustment can
     set the base time backwards. The fine grained interfaces compensate
     this by reading the clock and applying the new conversion factor, but
     the coarse grained time getters use the base time directly. That allows
     the user to observe time going backwards.
 
     Cure it by always forwarding base time, when NTP changes the frequency
     with an immediate step.
 
   - Rework of posix-timer hashing
 
     The posix-timer hash is not scalable and due to the CRIU timer restore
     mechanism prone to massive contention on the global hash bucket lock.
 
     Replace the global hash lock with a fine grained per bucket locking
     scheme to address that.
 
   - Rework the proc/$PID/timers interface.
 
     /proc/$PID/timers is provided for CRIU to be able to restore a
     timer. The printout happens with sighand lock held and interrupts
     disabled. That's not required as this can be done with RCU protection
     as well.
 
   - Provide a sane mechanism for CRIU to restore a timer ID
 
     CRIU restores timers by creating and deleting them until the kernel
     internal per process ID counter reached the requested ID. That's
     horribly slow for sparse timer IDs.
 
     Provide a prctl() which allows CRIU to restore a timer with a given
     ID. When enabled the ID pointer is used as input pointer to read the
     requested ID from user space. When disabled, the normal allocation
     scheme (next ID) is active as before. This is backwards compatible for
     both kernel and user space.
 
   - Make hrtimer_update_function() less expensive.
 
     The sanity checks are valuable, but expensive for high frequency usage
     in io/uring. Make the debug checks conditional and enable them only
     when lockdep is enabled.
 
   - Small updates, cleanups and improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmfgQ6wTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoeQzD/9p+EuUGrMbSNaLVMCYFULBbR0lersJ
 hrGGoKUsNt5T+f6hEEbSLBnkjZcMIj0J+mdIEUiRa73ryw1KmwLk/8MBu0c6u6q3
 musDvJqt3dLTG98yN0YeWK3tJDxhSjxIpwcAXusPQ04j16I2fVXFzDQ/kGPq6MTI
 tdMYzsS3wjuWpi+CbgRSP2HEwu08fIDVsQ7Grynh4Kmd31apne4ZgF2UVp6UiZyp
 8yJHZgVzJcFs7Y3MS6XTgezHnuADxMY1irzbXmok19941X8mZz2QRIpGQX+oMh6o
 g7SG2lj9i8YbLqU9/5RbC5ppjRcWfogDpW0Lk+OmdOpr0RiXTmx5Lz8Egxex9wG5
 pUJszeTY+bLw7mmYmkGZyBz+PNoGgVM5KFZRe5ENvYM8Gy8LUW5DA9zvxeHqDDz1
 FiMmKdYrwr8VCKqx+8hJQdzlzRbepxq9sNzDdMKVOUcFdGUVWekfG6ZFkfLKxwzA
 XDTKJilzXbAAj4r57vEvOCYLUZH/ZsFK4yyg0O53fEg6fj87EbTDb5+YUGazb3+C
 yNTEOQIT8LtutzLR9+xeLi92k+6zlJ4c1PfqBx5Kv/TwBrIfV1P8N2c6TCOWDoRM
 AOvo2SXEA/jEPix2GjT5jalSV1mROEXo2T9/G7kz4H7K+DkI/dGgS9mXyUDO2mMd
 ouOxYN0GohVqTQ==
 =XUGH
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer core updates from Thomas Gleixner:

 - Fix a memory ordering issue in posix-timers

   Posix-timer lookup is lockless and reevaluates the timer validity
   under the timer lock, but the update which validates the timer is not
   protected by the timer lock. That allows the store to be reordered
   against the initialization stores, so that the lookup side can
   observe a partially initialized timer. That's mostly a theoretical
   problem, but incorrect nevertheless.

 - Fix a long standing inconsistency of the coarse time getters

   The coarse time getters read the base time of the current update
   cycle without reading the actual hardware clock. NTP frequency
   adjustment can set the base time backwards. The fine grained
   interfaces compensate this by reading the clock and applying the new
   conversion factor, but the coarse grained time getters use the base
   time directly. That allows the user to observe time going backwards.

   Cure it by always forwarding base time, when NTP changes the
   frequency with an immediate step.

 - Rework of posix-timer hashing

   The posix-timer hash is not scalable and due to the CRIU timer
   restore mechanism prone to massive contention on the global hash
   bucket lock.

   Replace the global hash lock with a fine grained per bucket locking
   scheme to address that.

 - Rework the proc/$PID/timers interface.

   /proc/$PID/timers is provided for CRIU to be able to restore a timer.
   The printout happens with sighand lock held and interrupts disabled.
   That's not required as this can be done with RCU protection as well.

 - Provide a sane mechanism for CRIU to restore a timer ID

   CRIU restores timers by creating and deleting them until the kernel
   internal per process ID counter reached the requested ID. That's
   horribly slow for sparse timer IDs.

   Provide a prctl() which allows CRIU to restore a timer with a given
   ID. When enabled the ID pointer is used as input pointer to read the
   requested ID from user space. When disabled, the normal allocation
   scheme (next ID) is active as before. This is backwards compatible
   for both kernel and user space.

 - Make hrtimer_update_function() less expensive.

   The sanity checks are valuable, but expensive for high frequency
   usage in io/uring. Make the debug checks conditional and enable them
   only when lockdep is enabled.

 - Small updates, cleanups and improvements

* tag 'timers-core-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  selftests/timers: Improve skew_consistency by testing with other clockids
  timekeeping: Fix possible inconsistencies in _COARSE clockids
  posix-timers: Drop redundant memset() invocation
  selftests/timers/posix-timers: Add a test for exact allocation mode
  posix-timers: Provide a mechanism to allocate a given timer ID
  posix-timers: Dont iterate /proc/$PID/timers with sighand:: Siglock held
  posix-timers: Make per process list RCU safe
  posix-timers: Avoid false cacheline sharing
  posix-timers: Switch to jhash32()
  posix-timers: Improve hash table performance
  posix-timers: Make signal_struct:: Next_posix_timer_id an atomic_t
  posix-timers: Make lock_timer() use guard()
  posix-timers: Rework timer removal
  posix-timers: Simplify lock/unlock_timer()
  posix-timers: Use guards in a few places
  posix-timers: Remove SLAB_PANIC from kmem cache
  posix-timers: Remove a few paranoid warnings
  posix-timers: Cleanup includes
  posix-timers: Add cond_resched() to posix_timer_add() search loop
  posix-timers: Initialise timer before adding it to the hash table
  ...
2025-03-25 10:33:23 -07:00
Josh Poimboeuf
2cbb20b008 tracing: Disable branch profiling in noinstr code
CONFIG_TRACE_BRANCH_PROFILING inserts a call to ftrace_likely_update()
for each use of likely() or unlikely().  That breaks noinstr rules if
the affected function is annotated as noinstr.

Disable branch profiling for files with noinstr functions.  In addition
to some individual files, this also includes the entire arch/x86
subtree, as well as the kernel/entry, drivers/cpuidle, and drivers/idle
directories, all of which are noinstr-heavy.

Due to the nature of how sched binaries are built by combining multiple
.c files into one, branch profiling is disabled more broadly across the
sched code than would otherwise be needed.

This fixes many warnings like the following:

  vmlinux.o: warning: objtool: do_syscall_64+0x40: call to ftrace_likely_update() leaves .noinstr.text section
  vmlinux.o: warning: objtool: __rdgsbase_inactive+0x33: call to ftrace_likely_update() leaves .noinstr.text section
  vmlinux.o: warning: objtool: handle_bug.isra.0+0x198: call to ftrace_likely_update() leaves .noinstr.text section
  ...

Reported-by: Ingo Molnar <mingo@kernel.org>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/fb94fc9303d48a5ed370498f54500cc4c338eb6d.1742586676.git.jpoimboe@kernel.org
2025-03-22 09:49:26 +01:00
John Stultz
757b000f7b timekeeping: Fix possible inconsistencies in _COARSE clockids
Lei Chen raised an issue with CLOCK_MONOTONIC_COARSE seeing time
inconsistencies.

Lei tracked down that this was being caused by the adjustment

    tk->tkr_mono.xtime_nsec -= offset;

which is made to compensate for the unaccumulated cycles in offset when the
multiplicator is adjusted forward, so that the non-_COARSE clockids don't
see inconsistencies.

However, the _COARSE clockid getter functions use the adjusted xtime_nsec
value directly and do not compensate the negative offset via the
clocksource delta multiplied with the new multiplicator. In that case the
caller can observe time going backwards in consecutive calls.

By design, this negative adjustment should be fine, because the logic run
from timekeeping_adjust() is done after it accumulated approximately

     multiplicator * interval_cycles

into xtime_nsec.  The accumulated value is always larger then the

     mult_adj * offset

value, which is subtracted from xtime_nsec. Both operations are done
together under the tk_core.lock, so the net change to xtime_nsec is always
always be positive.

However, do_adjtimex() calls into timekeeping_advance() as well, to to
apply the NTP frequency adjustment immediately. In this case,
timekeeping_advance() does not return early when the offset is smaller then
interval_cycles. In that case there is no time accumulated into
xtime_nsec. But the subsequent call into timekeeping_adjust(), which
modifies the multiplicator, subtracts from xtime_nsec to correct
for the new multiplicator.

Here because there was no accumulation, xtime_nsec becomes smaller than
before, which opens a window up to the next accumulation, where the _COARSE
clockid getters, which don't compensate for the offset, can observe the
inconsistency.

To fix this, rework the timekeeping_advance() logic so that when invoked
from do_adjtimex(), the time is immediately forwarded to accumulate also
the sub-interval portion into xtime. That means the remaining offset
becomes zero and the subsequent multiplier adjustment therefore does not
modify xtime_nsec.

There is another related inconsistency. If xtime is forwarded due to the
instantaneous multiplier adjustment, the NTP error, which was accumulated
with the previous setting, becomes meaningless.

Therefore clear the NTP error as well, after forwarding the clock for the
instantaneous multiplier update.

Fixes: da15cfdae0 ("time: Introduce CLOCK_REALTIME_COARSE")
Reported-by: Lei Chen <lei.chen@smartx.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250320200306.1712599-1-jstultz@google.com
Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/
2025-03-21 19:16:18 +01:00
Cyrill Gorcunov
d1c3a3f1c9 posix-timers: Drop redundant memset() invocation
Initially in commit 6891c4509c memset() was required to clear a variable
allocated on stack. Commit 2482097c6c removed the on stack variable and
retained the memset() despite the fact that the memory is allocated via
kmem_cache_zalloc() and therefore zereoed already.

Drop the redundant memset().

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/Z9ctVxwaYOV4A2g4@grain
2025-03-17 10:38:49 +01:00
Thomas Gleixner
ec2d0c0462 posix-timers: Provide a mechanism to allocate a given timer ID
Checkpoint/Restore in Userspace (CRIU) requires to reconstruct posix timers
with the same timer ID on restore. It uses sys_timer_create() and relies on
the monotonic increasing timer ID provided by this syscall. It creates and
deletes timers until the desired ID is reached. This is can loop for a long
time, when the checkpointed process had a very sparse timer ID range.

It has been debated to implement a new syscall to allow the creation of
timers with a given timer ID, but that's tideous due to the 32/64bit compat
issues of sigevent_t and of dubious value.

The restore mechanism of CRIU creates the timers in a state where all
threads of the restored process are held on a barrier and cannot issue
syscalls. That means the restorer task has exclusive control.

This allows to address this issue with a prctl() so that the restorer
thread can do:

   if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON))
      goto linear_mode;
   create_timers_with_explicit_ids();
   prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF);
   
This is backwards compatible because the prctl() fails on older kernels and
CRIU can fall back to the linear timer ID mechanism. CRIU versions which do
not know about the prctl() just work as before.

Implement the prctl() and modify timer_create() so that it copies the
requested timer ID from userspace by utilizing the existing timer_t
pointer, which is used to copy out the allocated timer ID on success.

If the prctl() is disabled, which it is by default, timer_create() works as
before and does not try to read from the userspace pointer.

There is no problem when a broken or rogue user space application enables
the prctl(). If the user space pointer does not contain a valid ID, then
timer_create() fails. If the data is not initialized, but constains a
random valid ID, timer_create() will create that random timer ID or fail if
the ID is already given out. 
 
As CRIU must use the raw syscall to avoid manipulating the internal state
of the restored process, this has no library dependencies and can be
adopted by CRIU right away.

Recreating two timers with IDs 1000000 and 2000000 takes 1.5 seconds with
the create/delete method. With the prctl() it takes 3 microseconds.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Tested-by: Cyrill Gorcunov <gorcunov@gmail.com>
Link: https://lore.kernel.org/all/87jz8vz0en.ffs@tglx
2025-03-13 12:07:18 +01:00
Thomas Gleixner
451898ea42 posix-timers: Make per process list RCU safe
Preparatory change to remove the sighand locking from the /proc/$PID/timers
iterator.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155624.403223080@linutronix.de
2025-03-13 12:07:18 +01:00
Thomas Gleixner
5fa75a432f posix-timers: Avoid false cacheline sharing
struct k_itimer has the hlist_node, which is used for lookup in the hash
bucket, and the timer lock in the same cache line.

That's obviously bad, if one CPU fiddles with a timer and the other is
walking the hash bucket on which that timer is queued.

Avoid this by restructuring struct k_itimer, so that the read mostly (only
modified during setup and teardown) fields are in the first cache line and
the lock and the rest of the fields which get written to are in cacheline
2-N.

Reduces cacheline contention in a test case of 64 processes creating and
accessing 20000 timers each by almost 30% according to perf.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155624.341108067@linutronix.de
2025-03-13 12:07:18 +01:00
Thomas Gleixner
781764e0b4 posix-timers: Switch to jhash32()
The hash distribution of hash_32() is suboptimal. jhash32() provides a way
better distribution, which evens out the length of the hash bucket lists,
which in turn avoids large outliers in list walk times.

Due to the sparse ID space (thanks CRIU) there is no guarantee that the
timers will be fully evenly distributed over the hash buckets, but the
behaviour is way better than with hash_32() even for randomly sparse ID
spaces.

For a pathological test case with 64 processes creating and accessing
20000 timers each, this results in a runtime reduction of ~10% and a
significantly reduced runtime variation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250308155624.279080328@linutronix.de
2025-03-13 12:07:17 +01:00
Thomas Gleixner
1535cb8028 posix-timers: Improve hash table performance
Eric and Ben reported a significant performance bottleneck on the global
hash, which is used to store posix timers for lookup.

Eric tried to do a lockless validation of a new timer ID before trying to
insert the timer, but that does not solve the problem.

For the non-contended case this is a pointless exercise and for the
contended case this extra lookup just creates enough interleaving that all
tasks can make progress.

There are actually two real solutions to the problem:

  1) Provide a per process (signal struct) xarray storage

  2) Implement a smarter hash like the one in the futex code

#1 works perfectly fine for most cases, but the fact that CRIU enforced a
   linear increasing timer ID to restore timers makes this problematic.

   It's easy enough to create a sparse timer ID space, which amounts very
   fast to a large junk of memory consumed for the xarray. 2048 timers with
   a ID offset of 512 consume more than one megabyte of memory for the
   xarray storage.

#2 The main advantage of the futex hash is that it uses per hash bucket
   locks instead of a global hash lock. Aside of that it is scaled
   according to the number of CPUs at boot time.

Experiments with artifical benchmarks have shown that a scaled hash with
per bucket locks comes pretty close to the xarray performance and in some
scenarios it performes better.

Test 1:

     A single process creates 20000 timers and afterwards invokes
     timer_getoverrun(2) on each of them:

            mainline        Eric   newhash   xarray
create         23 ms       23 ms      9 ms     8 ms
getoverrun     14 ms       14 ms      5 ms     4 ms

Test 2:

     A single process creates 50000 timers and afterwards invokes
     timer_getoverrun(2) on each of them:

            mainline        Eric   newhash   xarray
create         98 ms      219 ms     20 ms    18 ms
getoverrun     62 ms       62 ms     10 ms     9 ms

Test 3:

     A single process creates 100000 timers and afterwards invokes
     timer_getoverrun(2) on each of them:

            mainline        Eric   newhash   xarray
create        313 ms      750 ms     48 ms    33 ms
getoverrun    261 ms      260 ms     20 ms    14 ms

Erics changes create quite some overhead in the create() path due to the
double list walk, as the main issue according to perf is the list walk
itself. With 100k timers each hash bucket contains ~200 timers, which in
the worst case need to be all inspected. The same problem applies for
getoverrun() where the lookup has to walk through the hash buckets to find
the timer it is looking for.

The scaled hash obviously reduces hash collisions and lock contention
significantly. This becomes more prominent with concurrency.

Test 4:

     A process creates 63 threads and all threads wait on a barrier before
     each instance creates 20000 timers and afterwards invokes
     timer_getoverrun(2) on each of them. The threads are pinned on
     seperate CPUs to achive maximum concurrency. The numbers are the
     average times per thread:

            mainline        Eric   newhash   xarray
create     180239 ms    38599 ms    579 ms   813 ms
getoverrun   2645 ms     2642 ms     32 ms     7 ms

Test 5:

     A process forks 63 times and all forks wait on a barrier before each
     instance creates 20000 timers and afterwards invokes
     timer_getoverrun(2) on each of them. The processes are pinned on
     seperate CPUs to achive maximum concurrency. The numbers are the
     average times per process:

            mainline        eric   newhash   xarray
create     157253 ms    40008 ms     83 ms    60 ms
getoverrun   2611 ms     2614 ms     40 ms     4 ms

So clearly the reduction of lock contention with Eric's changes makes a
significant difference for the create() loop, but it does not mitigate the
problem of long list walks, which is clearly visible on the getoverrun()
side because that is purely dominated by the lookup itself. Once the timer
is found, the syscall just reads from the timer structure with no other
locks or code paths involved and returns.

The reason for the difference between the thread and the fork case for the
new hash and the xarray is that both suffer from contention on
sighand::siglock and the xarray suffers additionally from contention on the
xarray lock on insertion.

The only case where the reworked hash slighly outperforms the xarray is a
tight loop which creates and deletes timers.

Test 4:

     A process creates 63 threads and all threads wait on a barrier before
     each instance runs a loop which creates and deletes a timer 100000
     times in a row. The threads are pinned on seperate CPUs to achive
     maximum concurrency. The numbers are the average times per thread:

            mainline        Eric   newhash   xarray
loop	    5917  ms	 5897 ms   5473 ms  7846 ms

Test 5:

     A process forks 63 times and all forks wait on a barrier before each
     each instance runs a loop which creates and deletes a timer 100000
     times in a row. The processes are pinned on seperate CPUs to achive
     maximum concurrency. The numbers are the average times per process:

            mainline        Eric   newhash   xarray
loop	     5137 ms	 7828 ms    891 ms   872 ms

In both test there is not much contention on the hash, but the ucount
accounting for the signal and in the thread case the sighand::siglock
contention (plus the xarray locking) contribute dominantly to the overhead.

As the memory consumption of the xarray in the sparse ID case is
significant, the scaled hash with per bucket locks seems to be the better
overall option. While the xarray has faster lookup times for a large number
of timers, the actual syscall usage, which requires the lookup is not an
extreme hotpath. Most applications utilize signal delivery and all syscalls
except timer_getoverrun(2) are all but cheap.

So implement a scaled hash with per bucket locks, which offers the best
tradeoff between performance and memory consumption.

Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: Benjamin Segall <bsegall@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155624.216091571@linutronix.de
2025-03-13 12:07:17 +01:00
Eric Dumazet
feb864ee99 posix-timers: Make signal_struct:: Next_posix_timer_id an atomic_t
The global hash_lock protecting the posix timer hash table can be heavily
contended especially when there is an extensive linear search for a timer
ID.

Timer IDs are handed out by monotonically increasing next_posix_timer_id
and then validating that there is no timer with the same ID in the hash
table. Both operations happen with the global hash lock held.

To reduce the hash lock contention the hash will be reworked to a scaled
hash with per bucket locks, which requires to handle the ID counter
lockless.

Prepare for this by making next_posix_timer_id an atomic_t, which can be
used lockless with atomic_inc_return().

[ tglx: Adopted from Eric's series, massaged change log and simplified it ]

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250219125522.2535263-2-edumazet@google.com
Link: https://lore.kernel.org/all/20250308155624.151545978@linutronix.de
2025-03-13 12:07:17 +01:00
Peter Zijlstra
538d710ec7 posix-timers: Make lock_timer() use guard()
The lookup and locking of posix timers requires the same repeating pattern
at all usage sites:

   tmr = lock_timer(tiner_id);
   if (!tmr)
   	return -EINVAL;
   ....
   unlock_timer(tmr);

Solve this with a guard implementation, which works in most places out of
the box except for those, which need to unlock the timer inside the guard
scope.

Though the only places where this matters are timer_delete() and
timer_settime(). In both cases the timer pointer needs to be preserved
across the end of the scope, which is solved by storing the pointer in a
variable outside of the scope.

timer_settime() also has to protect the timer with RCU before unlocking,
which obviously can't use guard(rcu) before leaving the guard scope as that
guard is cleaned up before the unlock. Solve this by providing the RCU
protection open coded.

[ tglx: Made it work and added change log ]

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250224162103.GD11590@noisy.programming.kicks-ass.net
Link: https://lore.kernel.org/all/20250308155624.087465658@linutronix.de
2025-03-13 12:07:17 +01:00
Thomas Gleixner
1d25bdd3f3 posix-timers: Rework timer removal
sys_timer_delete() and the do_exit() cleanup function itimer_delete() are
doing the same thing, but have needlessly different implementations instead
of sharing the code.

The other oddity of timer deletion is the fact that the timer is not
invalidated before the actual deletion happens, which allows concurrent
lookups to succeed.

That's wrong because a timer which is in the process of being deleted
should not be visible and any actions like signal queueing, delivery and
rearming should not happen once the task, which invoked timer_delete(), has
the timer locked.

Rework the code so that:

   1) The signal queueing and delivery code ignore timers which are marked
      invalid

   2) The deletion implementation between sys_timer_delete() and
      itimer_delete() is shared

   3) The timer is invalidated and removed from the linked lists before
      the deletion callback of the relevant clock is invoked.

      That requires to rework timer_wait_running() as it does a lookup of
      the timer when relocking it at the end. In case of deletion this
      lookup would fail due to the preceding invalidation and the wait loop
      would terminate prematurely.

      But due to the preceding invalidation the timer cannot be accessed by
      other tasks anymore, so there is no way that the timer has been freed
      after the timer lock has been dropped.

      Move the re-validation out of timer_wait_running() and handle it at
      the only other usage site, timer_settime().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/87zfht1exf.ffs@tglx
2025-03-13 12:07:17 +01:00
Thomas Gleixner
50f53b23f1 posix-timers: Simplify lock/unlock_timer()
Since the integration of sigqueue into the timer struct, lock_timer() is
only used in task context. So taking the lock with irqsave() is not longer
required.

Convert it to use spin_[un]lock_irq().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155623.959825668@linutronix.de
2025-03-13 12:07:17 +01:00
Thomas Gleixner
a31a300c4d posix-timers: Use guards in a few places
Switch locking and RCU to guards where applicable.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155623.892762130@linutronix.de
2025-03-13 12:07:17 +01:00
Thomas Gleixner
f6d0c3d2eb posix-timers: Remove SLAB_PANIC from kmem cache
There is no need to panic when the posix-timer kmem_cache can't be
created. timer_create() will fail with -ENOMEM and that's it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155623.829215801@linutronix.de
2025-03-13 12:07:16 +01:00
Thomas Gleixner
4c5cd058be posix-timers: Remove a few paranoid warnings
Warnings about a non-initialized timer or non-existing callbacks are just
useful for implementing new posix clocks, but there a NULL pointer
dereference is expected anyway. :)

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155623.765462334@linutronix.de
2025-03-13 12:07:16 +01:00
Thomas Gleixner
6ad9c3380a posix-timers: Cleanup includes
Remove pointless includes and sort the remaining ones alphabetically.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155623.701301552@linutronix.de
2025-03-13 12:07:16 +01:00
Eric Dumazet
5f2909c6cd posix-timers: Add cond_resched() to posix_timer_add() search loop
With a large number of POSIX timers the search for a valid ID might cause a
soft lockup on PREEMPT_NONE/VOLUNTARY kernels.

Add cond_resched() to the loop to prevent that.

[ tglx: Split out from Eric's series ]

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250214135911.2037402-2-edumazet@google.com
Link: https://lore.kernel.org/all/20250308155623.635612865@linutronix.de
2025-03-13 12:07:16 +01:00
Eric Dumazet
45ece9933d posix-timers: Initialise timer before adding it to the hash table
A timer is only valid in the hashtable when both timer::it_signal and
timer::it_id are set to their final values, but timers are added without
those values being set.

The timer ID is allocated when the timer is added to the hash in invalid
state. The ID is taken from a monotonically increasing per process counter
which wraps around after reaching INT_MAX. The hash insertion validates
that there is no timer with the allocated ID in the hash table which
belongs to the same process. That opens a mostly theoretical race condition:

If other threads of the same process manage to create/delete timers in
rapid succession before the newly created timer is fully initialized and
wrap around to the timer ID which was handed out, then a duplicate timer ID
will be inserted into the hash table.

Prevent this by:

  1) Setting timer::it_id before inserting the timer into the hashtable.
 
  2) Storing the signal pointer in timer::it_signal with bit 0 set before
     inserting it into the hashtable.

     Bit 0 acts as a invalid bit, which means that the regular lookup for
     sys_timer_*() will fail the comparison with the signal pointer.

     But the lookup on insertion masks out bit 0 and can therefore detect a
     timer which is not yet valid, but allocated in the hash table.  Bit 0
     in the pointer is cleared once the initialization of the timer
     completed.

[ tglx: Fold ID and signal iniitializaion into one patch and massage change
  	log and comments. ]

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250219125522.2535263-3-edumazet@google.com
Link: https://lore.kernel.org/all/20250308155623.572035178@linutronix.de
2025-03-13 12:07:16 +01:00
Thomas Gleixner
2389c6efd3 posix-timers: Ensure that timer initialization is fully visible
Frederic pointed out that the memory operations to initialize the timer are
not guaranteed to be visible, when __lock_timer() observes timer::it_signal
valid under timer::it_lock:

  T0                                      T1
  ---------                               -----------
  do_timer_create()
      // A
      new_timer->.... = ....
      spin_lock(current->sighand)
      // B
      WRITE_ONCE(new_timer->it_signal, current->signal)
      spin_unlock(current->sighand)
					sys_timer_*()
					   t =  __lock_timer()
						  spin_lock(&timr->it_lock)
						  // observes B
						  if (timr->it_signal == current->signal)
						    return timr;
			                   if (!t)
					       return;
					// Is not guaranteed to observe A

Protect the write of timer::it_signal, which makes the timer valid, with
timer::it_lock as well. This guarantees that T1 must observe the
initialization A completely, when it observes the valid signal pointer
under timer::it_lock. sighand::siglock must still be taken to protect the
signal::posix_timers list.

Reported-by: Frederic Weisbecker <frederic@kernel.org>
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155623.507944489@linutronix.de
2025-03-13 12:07:16 +01:00
Thorsten Blum
fc661d0a78 clocksource: Remove unnecessary strscpy() size argument
The size argument of strscpy() is only required when the destination
pointer is not a fixed sized array or when the copy needs to be smaller
than the size of the fixed sized destination array.

For fixed sized destination arrays and full copies, strscpy() automatically
determines the length of the destination buffer if the size argument is
omitted.

This makes the explicit sizeof() unnecessary. Remove it.

[ tglx: Massaged change log ]

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250311110624.495718-2-thorsten.blum@linux.dev
2025-03-13 11:37:44 +01:00
Thomas Weißschuh
a52067c24c timer_list: Don't use %pK through printk()
This reverts commit f590308536 ("timer debug: Hide kernel addresses via
%pK in /proc/timer_list")

The timer list helper SEQ_printf() uses either the real seq_printf() for
procfs output or vprintk() to print to the kernel log, when invoked from
SysRq-q. It uses %pK for printing pointers.

In the past %pK was prefered over %p as it would not leak raw pointer
values into the kernel log. Since commit ad67b74d24 ("printk: hash
addresses printed with %p") the regular %p has been improved to avoid this
issue.

Furthermore, restricted pointers ("%pK") were never meant to be used
through printk(). They can still unintentionally leak raw pointers or
acquire sleeping looks in atomic contexts.

Switch to the regular pointer formatting which is safer, easier to reason
about and sufficient here.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20250113171731-dc10e3c1-da64-4af0-b767-7c7070468023@linutronix.de/
Link: https://lore.kernel.org/all/20250311-restricted-pointers-timer-v1-1-6626b91e54ab@linutronix.de
2025-03-13 08:19:19 +01:00
Anna-Maria Behnsen
886653e366 vdso: Rework struct vdso_time_data and introduce struct vdso_clock
To support multiple PTP clocks, the VDSO data structure needs to be
reworked. All clock specific data will end up in struct vdso_clock and in
struct vdso_time_data there will be an array of VDSO clocks.

Now that all preparatory changes are in place:

Split the clock related struct members into a separate struct
vdso_clock. Make sure all users are aware, that vdso_time_data is no longer
initialized as an array and vdso_clock is now the array inside
vdso_data. Remove the vdso_clock define, which mapped it to vdso_time_data
for the transition.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250303-vdso-clock-v1-19-c1b5c69a166f@linutronix.de
2025-03-08 14:37:41 +01:00
Anna-Maria Behnsen
5911e16cad time/namespace: Prepare introduction of struct vdso_clock
To support multiple PTP clocks, the VDSO data structure needs to be
reworked. All clock specific data will end up in struct vdso_clock and in
struct vdso_time_data there will be array of VDSO clocks. At the moment,
vdso_clock is simply a define which maps vdso_clock to vdso_time_data.

To prepare for the rework of the data structures, replace the struct
vdso_time_data pointer with a struct vdso_clock pointer where applicable.

No functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250303-vdso-clock-v1-14-c1b5c69a166f@linutronix.de
2025-03-08 14:37:41 +01:00
Anna-Maria Behnsen
0235220807 vdso/namespace: Rename timens_setup_vdso_data() to reflect new vdso_clock struct
To support multiple PTP clocks, the VDSO data structure needs to be
reworked. All clock specific data will end up in struct vdso_clock and in
struct vdso_time_data there will be array of VDSO clocks. At the moment,
vdso_clock is simply a define which maps vdso_clock to vdso_time_data.

For time namespaces, vdso_time_data needs to be set up. But only the clock
related part of the vdso_data thats requires this setup. To reflect the
future struct vdso_clock, rename timens_setup_vdso_data() to
timns_setup_vdso_clock_data().

No functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250303-vdso-clock-v1-13-c1b5c69a166f@linutronix.de
2025-03-08 14:37:41 +01:00
Anna-Maria Behnsen
b5afbc106d vdso/vsyscall: Prepare introduction of struct vdso_clock
To support multiple PTP clocks, the VDSO data structure needs to be
reworked. All clock specific data will end up in struct vdso_clock and in
struct vdso_time_data there will be array of VDSO clocks. At the moment,
vdso_clock is simply a define which maps vdso_clock to vdso_time_data.

To prepare for the rework of the data structures, replace the struct
vdso_time_data pointer with a struct vdso_clock pointer where applicable.

No functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250303-vdso-clock-v1-12-c1b5c69a166f@linutronix.de
2025-03-08 14:37:41 +01:00
Wojtek Wasko
b4e53b15c0 ptp: Add PHC file mode checks. Allow RO adjtime() without FMODE_WRITE.
Many devices implement highly accurate clocks, which the kernel manages
as PTP Hardware Clocks (PHCs). Userspace applications rely on these
clocks to timestamp events, trace workload execution, correlate
timescales across devices, and keep various clocks in sync.

The kernel’s current implementation of PTP clocks does not enforce file
permissions checks for most device operations except for POSIX clock
operations, where file mode is verified in the POSIX layer before
forwarding the call to the PTP subsystem. Consequently, it is common
practice to not give unprivileged userspace applications any access to
PTP clocks whatsoever by giving the PTP chardevs 600 permissions. An
example of users running into this limitation is documented in [1].
Additionally, POSIX layer requires WRITE permission even for readonly
adjtime() calls which are used in PTP layer to return current frequency
offset applied to the PHC.

Add permission checks for functions that modify the state of a PTP
device. Continue enforcing permission checks for POSIX clock operations
(settime, adjtime) in the POSIX layer. Only require WRITE access for
dynamic clocks adjtime() if any flags are set in the modes field.

[1] https://lists.nwtime.org/sympa/arc/linuxptp-users/2024-01/msg00036.html

Changes in v4:
- Require FMODE_WRITE in ajtime() only for calls modifying the clock in
  any way.

Acked-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Wojtek Wasko <wwasko@nvidia.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-05 12:43:54 +00:00
Wojtek Wasko
e859d375d1 posix-clock: Store file pointer in struct posix_clock_context
File descriptor based pc_clock_*() operations of dynamic posix clocks
have access to the file pointer and implement permission checks in the
generic code before invoking the relevant dynamic clock callback.

Character device operations (open, read, poll, ioctl) do not implement a
generic permission control and the dynamic clock callbacks have no
access to the file pointer to implement them.

Extend struct posix_clock_context with a struct file pointer and
initialize it in posix_clock_open(), so that all dynamic clock callbacks
can access it.

Acked-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Wojtek Wasko <wwasko@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-05 12:43:54 +00:00
Thomas Weißschuh
7a6b158e00 posix-clock: Remove duplicate compat ioctl() handler
The normal and compat ioctl handlers are identical,
which is fine as compat ioctls are detected and handled dynamically
inside the underlying clock implementation.
The duplicate definition however is unnecessary.

Just reuse the regular ioctl handler also for compat ioctls.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Link: https://lore.kernel.org/all/20250225-posix-clock-compat-cleanup-v2-1-30de86457a2b@weissschuh.net
2025-02-26 16:53:58 +01:00
Thomas Weißschuh
ac1a42f4e4 vdso: Remove remnants of architecture-specific time storage
All users of the time releated parts of the vDSO are now using the generic
storage implementation. Remove the therefore unnecessary compatibility
accessor functions and symbols.

Co-developed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250204-vdso-store-rng-v3-18-13a4669dfc8c@linutronix.de
2025-02-21 09:54:03 +01:00
Thomas Weißschuh
df7fcbefa7 vdso: Add generic time data storage
Historically each architecture defined their own way to store the vDSO
data page. Add a generic mechanism to provide storage for that page.

Furthermore this generic storage will be extended to also provide
uniform storage for *non*-time-related data, like the random state or
architecture-specific data. These will have their own pages and data
structures, so rename 'vdso_data' into 'vdso_time_data' to make that
split clear from the name.

Also introduce a new consistent naming scheme for the symbols related to
the vDSO, which makes it clear if the symbol is accessible from
userspace or kernel space and the type of data behind the symbol.

The generic fault handler contains an optimization to prefault the vvar
page when the timens page is accessed. This was lifted from s390 and x86.

Co-developed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250204-vdso-store-rng-v3-5-13a4669dfc8c@linutronix.de
2025-02-21 09:54:01 +01:00
Nam Cao
806e32248e can: Switch to use hrtimer_setup()
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.

Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.

Most of this patch is generated by Coccinelle. Except for the TX thrtimer
in bcm_tx_setup() because this timer is not used and the callback function
is never set. For this particular case, set the callback to
hrtimer_dummy_timeout()

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/all/a3a6be42c818722ad41758457408a32163bfd9a0.1738746872.git.namcao@linutronix.de
2025-02-18 10:35:45 +01:00
Nam Cao
f66b0acf39 time: Switch to hrtimer_setup()
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.

Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/170bb691a0d59917c8268a98c80b607128fc9f7f.1738746821.git.namcao@linutronix.de
2025-02-18 10:32:33 +01:00
Benjamin Segall
f99c5bb396 posix-timers: Invoke cond_resched() during exit_itimers()
exit_itimers() loops through every timer in the process to delete it.  This
requires taking the system-wide hash_lock for each of these timers, and
contends with other processes trying to create or delete timers.

When a process creates hundreds of thousands of timers, and then exits
while other processes contend with it, this can trigger softlockups on
CONFIG_PREEMPT=n.

Add a cond_resched() invocation into the loop to allow the system to make
progress.

Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/xm2634gg2n23.fsf@google.com
2025-02-18 10:12:49 +01:00
Andy Shevchenko
4441b976df hrtimers: Replace hrtimer_clock_to_base_table with switch-case
Clang and GCC complain about overlapped initialisers in the
hrtimer_clock_to_base_table definition. With `make W=1` and CONFIG_WERROR=y
(which is default nowadays) this breaks the build:

  CC      kernel/time/hrtimer.o
kernel/time/hrtimer.c:124:21: error: initializer overrides prior initialization of this subobject [-Werror,-Winitializer-overrides]
  124 |         [CLOCK_REALTIME]        = HRTIMER_BASE_REALTIME,

kernel/time/hrtimer.c:122:27: note: previous initialization is here
  122 |         [0 ... MAX_CLOCKS - 1]  = HRTIMER_MAX_CLOCK_BASES,

(and similar for CLOCK_MONOTONIC, CLOCK_BOOTTIME, and CLOCK_TAI).

hrtimer_clockid_to_base(), which uses the table, is only used in
__hrtimer_init(), which is not a hotpath.

Therefore replace the table lookup with a switch case in
hrtimer_clockid_to_base() to avoid this warning.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250214134424.3367619-1-andriy.shevchenko@linux.intel.com
2025-02-18 10:12:49 +01:00
Linus Torvalds
3a0562d733 Fix a PREEMPT_RT bug in the clocksource verification code that
caused false positive warnings.
 
 Also fix a timer migration setup bug when new CPUs are added.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmenI0MRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jUGA/+MfsjIC+WolYPCKwLXCRXOXc4Qx3kKdTP
 kcJeL59SDoaKRKmgyhCLxpAdDORhK5vA8u05328Cr5JCtPrlDY22pBgi984CLUBL
 AJdu5oBMPZlLiZ735PPhicCffrV33dKLyBbuqzhtlhs+9cYdEgcbn6FfNdWawYxA
 MjreFnAQGJ3/M6il2An58GfofrKd6y8QTufTOBSSVNmVAh/QABhYu1N0ytiwjvaX
 m9HxGy0l4xH/KF0pICWTJjLPbBpSWTNqIfK1WBConpQHesp6PXwakgWQj5/Np0ot
 wMkAUwPnLldvQTm664xlTAzoZv9N4jlXORvJ/xvPWgTDcYiDnsHE/44DAEc4wHh1
 2nvOrDu9EAhpTrMWRDct7h7BhShQUNFl+L2rF6kOgUZfCQ8OHL1U3IO9HxcO31Zg
 ZLnNfF6tz6D05y2EBJWS3st1CSZKfHTxlb8p4QFMZ9dyTMRDfTYSrEO2C6fmdJcg
 GMS/rL8MC4/N4kI3BkOv144ImcZIoiEzzPC8SnR73KeEg5LRM5IwJZ8cSP9ZUz9W
 P5VQIoBsHBbtROePRmurUqFgdmWzC0qyAQLPrWvNVUiweRcGF6Au7AqE4yjoVYAz
 Aa+z+pUu6EZLlVX3+yWa/fn2ExBWCApaVJS1ctoplNUjJY5EXVgaoWpS/9/B0du9
 KlNU3DhCaYA=
 =sKCk
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Ingo Molnar:
 "Fix a PREEMPT_RT bug in the clocksource verification code that caused
  false positive warnings.

  Also fix a timer migration setup bug when new CPUs are added"

* tag 'timers-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/migration: Fix off-by-one root mis-connection
  clocksource: Use migrate_disable() to avoid calling get_random_u32() in atomic context
2025-02-08 11:55:03 -08:00
Frederic Weisbecker
868c9037df timers/migration: Fix off-by-one root mis-connection
Before attaching a new root to the old root, the children counter of the
new root is checked to verify that only the upcoming CPU's top group have
been connected to it. However since the recently added commit b729cc1ec2
("timers/migration: Fix another race between hotplug and idle entry/exit")
this check is not valid anymore because the old root is pre-accounted
as a child to the new root. Therefore after connecting the upcoming
CPU's top group to the new root, the children count to be expected must
be 2 and not 1 anymore.

This omission results in the old root to not be connected to the new
root. Then eventually the system may run with more than one top level,
which defeats the purpose of a single idle migrator.

Also the old root is pre-accounted but not connected upon the new root
creation. But it can be connected to the new root later on. Therefore
the old root may be accounted twice to the new root. The propagation of
such overcommit can end up creating a double final top-level root with a
groupmask incorrectly initialized. Although harmless given that the final
top level roots will never have a parent to walk up to, this oddity
opportunistically reported the core issue:

  WARNING: CPU: 8 PID: 0 at kernel/time/timer_migration.c:543 tmigr_requires_handle_remote
  CPU: 8 UID: 0 PID: 0 Comm: swapper/8
  RIP: 0010:tmigr_requires_handle_remote
  Call Trace:
   <IRQ>
   ? tmigr_requires_handle_remote
   ? hrtimer_run_queues
   update_process_times
   tick_periodic
   tick_handle_periodic
   __sysvec_apic_timer_interrupt
   sysvec_apic_timer_interrupt
  </IRQ>

Fix the problem by taking the old root into account in the children count
of the new root so the connection is not omitted.

Also warn when more than one top level group exists to better detect
similar issues in the future.

Fixes: b729cc1ec2 ("timers/migration: Fix another race between hotplug and idle entry/exit")
Reported-by: Matt Fleming <mfleming@cloudflare.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250205160220.39467-1-frederic@kernel.org
2025-02-07 09:02:16 +01:00
Linus Torvalds
f286757b64 Updates for timers and timekeeping:
- Properly cast the input to secs_to_jiffies() to unsigned long as
    otherwise the result uses the data type of the input variable, which
    causes result range checks to fail if the input data type is signed and
    smaller than unsigned long.
 
  - Handle late armed hrtimers gracefully on CPU hotplug
 
    There are legitimate cases where a hrtimer is (re)armed on an outgoing
    CPU after the timers have been migrated away. This triggers warnings and
    caused people to implement horrible workarounds in RCU. But those work
    arounds are incomplete and do not cover e.g. the scheduler hrtimers.
 
    Stop this by force moving timer which are enqueued on the current CPU
    after timer migration to be queued on a remote online CPU.
 
    This allows to undo the workarounds in a seperate step.
 
  - Demote a warning level printk() to info level in the clocksource
    watchdog code as there is no point to emit a warning level message for a
    purely informational message.
 
  - Mark a helper function __always_inline and move it into the existing
    #ifdef block to avoid 'unused function' warnings from CLANG
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmegv8QTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWA7EADF7/GBufaTAYr1ZQKs2oK+xD+Vhs8M
 4CHgG0zlnl0HkPk1CE2VNBJ9PP8C5bKfMQJyYdtsxELVBFiJJEPEqbgpGFJQljD7
 lG/bJSc5MctOauSkbURZyFKtzOwre+q4tWqZ2xvth0LTtaY3SycsImIWCKr4cvKv
 95IQlXLMUkHZsTR4sXLSwaE1Kt9uyHOPa00pkvsQJ3CaWT7BAc+bdbZ83OdM7BTk
 2XnLvH3zlwijp/o4sS8HCpdX24HQlsKm7TF5igxGmwNophRwNzP3Imd3yh6onpL3
 9BrEYPyptKl7hB9N0y3mMu7JRljphfVBmfmzcGYLfkjuGmX7KkOOj5tpD9PwqIFl
 Mu8fDff1wTMxpDcoMzW2M540xYq3Pm9kheuwGQFH3XRoq28IKxo6MufXWgpgJkz4
 JxpQ8h+9wT4LodVthcaotqHxe1yTsOzot0ggejtCDMptVlLXudZu4J/QvBqOiygg
 +3ehX7G+AY+yTbqyncPY/jLKd2lnp3PArmJ0zqvYGiGsnr07F7zFZmCGhgUr96w7
 ZQWc9D9AvHREPoXiXb6AxbYAOImjVbYW2+Y1eB3eWK6GlhALoQ82YGldW4y4rfeu
 +9OaU7WgRTjlfBdaid7DmqBaLvpCSQKj/spzwKkq7eAiO9RSNdy91eaug99Ezjgn
 NySGwUM4t0Wkfw==
 =lUGt
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2025-02-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Thomas Gleixner:

 - Properly cast the input to secs_to_jiffies() to unsigned long as
   otherwise the result uses the data type of the input variable, which
   causes result range checks to fail if the input data type is signed
   and smaller than unsigned long.

 - Handle late armed hrtimers gracefully on CPU hotplug

   There are legitimate cases where a hrtimer is (re)armed on an
   outgoing CPU after the timers have been migrated away. This triggers
   warnings and caused people to implement horrible workarounds in RCU.
   But those workarounds are incomplete and do not cover e.g. the
   scheduler hrtimers.

   Stop this by force moving timer which are enqueued on the current CPU
   after timer migration to be queued on a remote online CPU.

   This allows to undo the workarounds in a seperate step.

 - Demote a warning level printk() to info level in the clocksource
   watchdog code as there is no point to emit a warning level message
   for a purely informational message.

 - Mark a helper function __always_inline and move it into the existing
   #ifdef block to avoid 'unused function' warnings from CLANG

* tag 'timers-urgent-2025-02-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  jiffies: Cast to unsigned long in secs_to_jiffies() conversion
  clocksource: Use pr_info() for "Checking clocksource synchronization" message
  hrtimers: Force migrate away hrtimers queued after CPUHP_AP_HRTIMERS_DYING
  hrtimers: Mark is_migration_base() with __always_inline
2025-02-03 09:10:56 -08:00
Waiman Long
6bb05a3333 clocksource: Use migrate_disable() to avoid calling get_random_u32() in atomic context
The following bug report happened with a PREEMPT_RT kernel:

  BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
  in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 2012, name: kwatchdog
  preempt_count: 1, expected: 0
  RCU nest depth: 0, expected: 0
  get_random_u32+0x4f/0x110
  clocksource_verify_choose_cpus+0xab/0x1a0
  clocksource_verify_percpu.part.0+0x6b/0x330
  clocksource_watchdog_kthread+0x193/0x1a0

It is due to the fact that clocksource_verify_choose_cpus() is invoked with
preemption disabled.  This function invokes get_random_u32() to obtain
random numbers for choosing CPUs.  The batched_entropy_32 local lock and/or
the base_crng.lock spinlock in driver/char/random.c will be acquired during
the call. In PREEMPT_RT kernel, they are both sleeping locks and so cannot
be acquired in atomic context.

Fix this problem by using migrate_disable() to allow smp_processor_id() to
be reliably used without introducing atomic context. preempt_disable() is
then called after clocksource_verify_choose_cpus() but before the
clocksource measurement is being run to avoid introducing unexpected
latency.

Fixes: 7560c02bdf ("clocksource: Check per-CPU clock synchronization when marked unstable")
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/all/20250131173323.891943-2-longman@redhat.com
2025-02-03 16:18:56 +01:00
Joel Granados
1751f872cc treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.

Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25c ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.

Created this by running an spatch followed by a sed command:
Spatch:
    virtual patch

    @
    depends on !(file in "net")
    disable optional_qualifier
    @

    identifier table_name != {
      watchdog_hardlockup_sysctl,
      iwcm_ctl_table,
      ucma_ctl_table,
      memory_allocation_profiling_sysctls,
      loadpin_sysctl_table
    };
    @@

    + const
    struct ctl_table table_name [] = { ... };

sed:
    sed --in-place \
      -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
      kernel/utsname_sysctl.c

Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-01-28 13:48:37 +01:00
Waiman Long
1f566840a8 clocksource: Use pr_info() for "Checking clocksource synchronization" message
The "Checking clocksource synchronization" message is normally printed
when clocksource_verify_percpu() is called for a given clocksource if
both the CLOCK_SOURCE_UNSTABLE and CLOCK_SOURCE_VERIFY_PERCPU flags
are set.

It is an informational message and so pr_info() is the correct choice.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250125015442.3740588-1-longman@redhat.com
2025-01-27 10:30:59 +01:00
Frederic Weisbecker
53dac34539 hrtimers: Force migrate away hrtimers queued after CPUHP_AP_HRTIMERS_DYING
hrtimers are migrated away from the dying CPU to any online target at
the CPUHP_AP_HRTIMERS_DYING stage in order not to delay bandwidth timers
handling tasks involved in the CPU hotplug forward progress.

However wakeups can still be performed by the outgoing CPU after
CPUHP_AP_HRTIMERS_DYING. Those can result again in bandwidth timers being
armed. Depending on several considerations (crystal ball power management
based election, earliest timer already enqueued, timer migration enabled or
not), the target may eventually be the current CPU even if offline. If that
happens, the timer is eventually ignored.

The most notable example is RCU which had to deal with each and every of
those wake-ups by deferring them to an online CPU, along with related
workarounds:

_ e787644caf (rcu: Defer RCU kthreads wakeup when CPU is dying)
_ 9139f93209 (rcu/nocb: Fix RT throttling hrtimer armed from offline CPU)
_ f7345ccc62 (rcu/nocb: Fix rcuog wake-up from offline softirq)

The problem isn't confined to RCU though as the stop machine kthread
(which runs CPUHP_AP_HRTIMERS_DYING) reports its completion at the end
of its work through cpu_stop_signal_done() and performs a wake up that
eventually arms the deadline server timer:

   WARNING: CPU: 94 PID: 588 at kernel/time/hrtimer.c:1086 hrtimer_start_range_ns+0x289/0x2d0
   CPU: 94 UID: 0 PID: 588 Comm: migration/94 Not tainted
   Stopper: multi_cpu_stop+0x0/0x120 <- stop_machine_cpuslocked+0x66/0xc0
   RIP: 0010:hrtimer_start_range_ns+0x289/0x2d0
   Call Trace:
   <TASK>
     start_dl_timer
     enqueue_dl_entity
     dl_server_start
     enqueue_task_fair
     enqueue_task
     ttwu_do_activate
     try_to_wake_up
     complete
     cpu_stopper_thread

Instead of providing yet another bandaid to work around the situation, fix
it in the hrtimers infrastructure instead: always migrate away a timer to
an online target whenever it is enqueued from an offline CPU.

This will also allow to revert all the above RCU disgraceful hacks.

Fixes: 5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Reported-by: Vlad Poenaru <vlad.wing@gmail.com>
Reported-by: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/all/20250117232433.24027-1-frederic@kernel.org
Closes: 20241213203739.1519801-1-usamaarif642@gmail.com
2025-01-23 20:06:35 +01:00
Andy Shevchenko
27af31e449 hrtimers: Mark is_migration_base() with __always_inline
When is_migration_base() is unused, it prevents kernel builds
with clang, `make W=1` and CONFIG_WERROR=y:

kernel/time/hrtimer.c:156:20: error: unused function 'is_migration_base' [-Werror,-Wunused-function]
  156 | static inline bool is_migration_base(struct hrtimer_clock_base *base)
      |                    ^~~~~~~~~~~~~~~~~

Fix this by marking it with __always_inline.

[ tglx: Use __always_inline instead of __maybe_unused and move it into the
  	usage sites conditional ]

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250116160745.243358-1-andriy.shevchenko@linux.intel.com
2025-01-23 20:06:35 +01:00
Linus Torvalds
f200c315da Updates for timers and timekeeping:
- Just boring cleanups, typo and comment fixes and trivial optimizations
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmePk4QTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYodwdD/47AXDT4nkka0mAnWLgv9B8Lult71EC
 NVfZnqg6hWh/ru1a5Wmld1p8nmJc4524F9CrggMIVSp2u1q1n2iBTjU5wKSbKv5x
 Se4crYf2D+iJInXE8zpnAFouUL8ws4XaUls3Nw5BM2mrcOAPeYWpJSHroOSxFIwi
 yNLrGqW0rFczNQTS0hXki3GBjXrK2KdCVFetuu9RrUNGPvLspCUyN2A0TzXSupYP
 Tw7KC2i6lI15N3VTe0MQS9SXXeB7cJBIFK2r6KfNDjcdLrgtACs8eIg8rKqck+QH
 UcxW+bNYIvzt/Iw8x+pWvE5CMxEm+2FsbdXM77SFmRyBZ1UQ+QchI8ZKQ/fF0VnN
 48jwUUmsUetl2nCM77cqP8FMWGmZUUlvBw/mUXDaJLdBkLRRyQWqQw7FMgQb6kGg
 J0XZN8iFRNkSmY8sdNIRR9ELFbbofb+O3dz0fZ1406zDQFvBfxUOB+r4hZot1zVO
 uz+mcScbNHp89GJnJmaClA9NQkItKH2KohAo5rLXtG1GBTqauobAuqG6dx/0JXPF
 FgEPqnsEVWKahBwASxsxdlNA7IhK+vmvBVQVpRnvS+RM/TPd88Da5dhqbQD3ZJ1k
 UwiFwvhVuci1XS+5IIchRiNFy/ZSm5w1N3PFKDOQe4L8FreTDuO7mlrAQMUy2Jk3
 mXF5HwGON7a76A==
 =R/xW
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer and timekeeping updates from Thomas Gleixner:

 - Just boring cleanups, typo and comment fixes and trivial optimizations

* tag 'timers-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/migration: Simplify top level detection on group setup
  timers: Optimize get_timer_[this_]cpu_base()
  timekeeping: Remove unused ktime_get_fast_timestamps()
  timer/migration: Fix kernel-doc warnings for union tmigr_state
  tick/broadcast: Add kernel-doc for function parameters
  hrtimers: Update the return type of enqueue_hrtimer()
  clocksource/wdtest: Print time values for short udelay(1)
  posix-timers: Fix typo in __lock_timer()
  vdso: Correct typo in PAGE_SHIFT comment
2025-01-21 13:16:00 -08:00
Frederic Weisbecker
dcf6230555 timers/migration: Simplify top level detection on group setup
Having a single group on a given level is enough to know this is the
top level, because a root has to have at least two children, unless that
root is the only group and the children are actual CPUs.

Simplify the test in tmigr_setup_groups() accordingly.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250114231507.21672-5-frederic@kernel.org
2025-01-16 14:01:09 +01:00
Koichiro Den
2f8dea1692 hrtimers: Handle CPU state correctly on hotplug
Consider a scenario where a CPU transitions from CPUHP_ONLINE to halfway
through a CPU hotunplug down to CPUHP_HRTIMERS_PREPARE, and then back to
CPUHP_ONLINE:

Since hrtimers_prepare_cpu() does not run, cpu_base.hres_active remains set
to 1 throughout. However, during a CPU unplug operation, the tick and the
clockevents are shut down at CPUHP_AP_TICK_DYING. On return to the online
state, for instance CFS incorrectly assumes that the hrtick is already
active, and the chance of the clockevent device to transition to oneshot
mode is also lost forever for the CPU, unless it goes back to a lower state
than CPUHP_HRTIMERS_PREPARE once.

This round-trip reveals another issue; cpu_base.online is not set to 1
after the transition, which appears as a WARN_ON_ONCE in enqueue_hrtimer().

Aside of that, the bulk of the per CPU state is not reset either, which
means there are dangling pointers in the worst case.

Address this by adding a corresponding startup() callback, which resets the
stale per CPU state and sets the online flag.

[ tglx: Make the new callback unconditionally available, remove the online
  	modification in the prepare() callback and clear the remaining
  	state in the starting callback instead of the prepare callback ]

Fixes: 5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20241220134421.3809834-1-koichiro.den@canonical.com
2025-01-16 13:06:14 +01:00
Frederic Weisbecker
922efd298b timers/migration: Annotate accesses to ignore flag
The group's ignore flag is:

_ read under the group's lock (idle entry, remote expiry)
_ turned on/off under the group's lock (idle entry, remote expiry)
_ turned on locklessly on idle exit

When idle entry or remote expiry clear the "ignore" flag of a group, the
operation must be synchronized against other concurrent idle entry or
remote expiry to make sure the related group timer is never missed. To
enforce this synchronization, both "ignore" clear and read are
performed under the group lock.

On the contrary, whether idle entry or remote expiry manage to observe
the "ignore" flag turned on by a CPU exiting idle is a matter of
optimization. If that flag set is missed or cleared concurrently, the
worst outcome is a migrator wasting time remotely handling a "ghost"
timer. This is why the ignore flag can be set locklessly.

Unfortunately, the related lockless accesses are bare and miss
appropriate annotations. KCSAN rightfully complains:

		 BUG: KCSAN: data-race in __tmigr_cpu_activate / print_report

		 write to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 0:
		 __tmigr_cpu_activate
		 tmigr_cpu_activate
		 timer_clear_idle
		 tick_nohz_restart_sched_tick
		 tick_nohz_idle_exit
		 do_idle
		 cpu_startup_entry
		 kernel_init
		 do_initcalls
		 clear_bss
		 reserve_bios_regions
		 common_startup_64

		 read to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 1:
		 print_report
		 kcsan_report_known_origin
		 kcsan_setup_watchpoint
		 tmigr_next_groupevt
		 tmigr_update_events
		 tmigr_inactive_up
		 __walk_groups+0x50/0x77
		 walk_groups
		 __tmigr_cpu_deactivate
		 tmigr_cpu_deactivate
		 __get_next_timer_interrupt
		 timer_base_try_to_set_idle
		 tick_nohz_stop_tick
		 tick_nohz_idle_stop_tick
		 cpuidle_idle_call
		 do_idle

Although the relevant accesses could be marked as data_race(), the
"ignore" flag being read several times within the same
tmigr_update_events() function is confusing and error prone. Prefer
reading it once in that function and make use of similar/paired accesses
elsewhere with appropriate comments when necessary.

Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250114231507.21672-4-frederic@kernel.org
Closes: https://lore.kernel.org/oe-lkp/202501031612.62e0c498-lkp@intel.com
2025-01-16 12:47:11 +01:00
Frederic Weisbecker
de3ced72a7 timers/migration: Enforce group initialization visibility to tree walkers
Commit 2522c84db513 ("timers/migration: Fix another race between hotplug
and idle entry/exit") fixed yet another race between idle exit and CPU
hotplug up leading to a wrong "0" value migrator assigned to the top
level. However there is yet another situation that remains unhandled:

         [GRP0:0]
      migrator  = TMIGR_NONE
      active    = NONE
      groupmask = 1
      /     \      \
     0       1     2..7
   idle      idle   idle

0) The system is fully idle.

         [GRP0:0]
      migrator  = CPU 0
      active    = CPU 0
      groupmask = 1
      /     \      \
     0       1     2..7
   active   idle   idle

1) CPU 0 is activating. It has done the cmpxchg on the top's ->migr_state
but it hasn't yet returned to __walk_groups().

         [GRP0:0]
      migrator  = CPU 0
      active    = CPU 0, CPU 1
      groupmask = 1
      /     \      \
     0       1     2..7
   active  active  idle

2) CPU 1 is activating. CPU 0 stays the migrator (still stuck in
__walk_groups(), delayed by #VMEXIT for example).

                    [GRP1:0]
                migrator = TMIGR_NONE
                active   = NONE
                groupmask = 1
             /                   \
         [GRP0:0]                  [GRP0:1]
      migrator  = CPU 0           migrator = TMIGR_NONE
      active    = CPU 0, CPU1     active   = NONE
      groupmask = 1               groupmask = 2
      /     \      \
     0       1     2..7                   8
   active  active  idle                !online

3) CPU 8 is preparing to boot. CPUHP_TMIGR_PREPARE is being ran by CPU 1
which has created the GRP0:1 and the new top GRP1:0 connected to GRP0:1
and GRP0:0. CPU 1 hasn't yet propagated its activation up to GRP1:0.

                    [GRP1:0]
               migrator = GRP0:0
               active   = GRP0:0
               groupmask = 1
             /                   \
         [GRP0:0]                  [GRP0:1]
     migrator  = CPU 0           migrator = TMIGR_NONE
     active    = CPU 0, CPU1     active   = NONE
     groupmask = 1               groupmask = 2
     /     \      \
    0       1     2..7                   8
  active  active  idle                !online

4) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups()
returning from tmigr_cpu_active(). The new top GRP1:0 is visible and
fetched and the pre-initialized groupmask of GRP0:0 is also visible.
As a result tmigr_active_up() is called to GRP1:0 with GRP0:0 as active
and migrator. CPU 0 is returning to __walk_groups() but suffers again
a #VMEXIT.

                    [GRP1:0]
               migrator = GRP0:0
               active   = GRP0:0
               groupmask = 1
             /                   \
         [GRP0:0]                  [GRP0:1]
     migrator  = CPU 0           migrator = TMIGR_NONE
     active    = CPU 0, CPU1     active   = NONE
     groupmask = 1               groupmask = 2
     /     \      \
    0       1     2..7                   8
  active  active  idle                 !online

5) CPU 1 propagates its activation of GRP0:0 to GRP1:0. This has no
   effect since CPU 0 did it already.

                    [GRP1:0]
               migrator = GRP0:0
               active   = GRP0:0, GRP0:1
               groupmask = 1
             /                   \
         [GRP0:0]                  [GRP0:1]
     migrator  = CPU 0           migrator = CPU 8
     active    = CPU 0, CPU1     active   = CPU 8
     groupmask = 1               groupmask = 2
     /     \      \                     \
    0       1     2..7                   8
  active  active  idle                 active

6) CPU 1 links CPU 8 to its group. CPU 8 boots and goes through
   CPUHP_AP_TMIGR_ONLINE which propagates activation.

                                   [GRP2:0]
                              migrator = TMIGR_NONE
                              active   = NONE
                              groupmask = 1
                             /                \
                    [GRP1:0]                    [GRP1:1]
               migrator = GRP0:0              migrator = TMIGR_NONE
               active   = GRP0:0, GRP0:1      active   = NONE
               groupmask = 1                  groupmask = 2
             /                   \
         [GRP0:0]                  [GRP0:1]                [GRP0:2]
     migrator  = CPU 0           migrator = CPU 8        migrator = TMIGR_NONE
     active    = CPU 0, CPU1     active   = CPU 8        active   = NONE
     groupmask = 1               groupmask = 2           groupmask = 0
     /     \      \                     \
    0       1     2..7                   8                  64
  active  active  idle                 active             !online

7) CPU 64 is booting. CPUHP_TMIGR_PREPARE is being ran by CPU 1
which has created the GRP1:1, GRP0:2 and the new top GRP2:0 connected to
GRP1:1 and GRP1:0. CPU 1 hasn't yet propagated its activation up to
GRP2:0.

                                   [GRP2:0]
                              migrator = 0 (!!!)
                              active   = NONE
                              groupmask = 1
                             /                \
                    [GRP1:0]                    [GRP1:1]
               migrator = GRP0:0              migrator = TMIGR_NONE
               active   = GRP0:0, GRP0:1      active   = NONE
               groupmask = 1                  groupmask = 2
             /                   \
         [GRP0:0]                  [GRP0:1]                [GRP0:2]
     migrator  = CPU 0           migrator = CPU 8        migrator = TMIGR_NONE
     active    = CPU 0, CPU1     active   = CPU 8        active   = NONE
     groupmask = 1               groupmask = 2           groupmask = 0
     /     \      \                     \
    0       1     2..7                   8                  64
  active  active  idle                 active             !online

8) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups()
returning from tmigr_cpu_active(). The new top GRP2:0 is visible and
fetched but the pre-initialized groupmask of GRP1:0 is not because no
ordering made its initialization visible. As a result tmigr_active_up()
may be called to GRP2:0 with a "0" child's groumask. Leaving the timers
ignored for ever when the system is fully idle.

The race is highly theoretical and perhaps impossible in practice but
the groupmask of the child is not the only concern here as the whole
initialization of the child is not guaranteed to be visible to any
tree walker racing against hotplug (idle entry/exit, remote handling,
etc...). Although the current code layout seem to be resilient to such
hazards, this doesn't tell much about the future.

Fix this with enforcing address dependency between group initialization
and the write/read to the group's parent's pointer. Fortunately that
doesn't involve any barrier addition in the fast paths.

Fixes: 10a0e6f3d3 ("timers/migration: Move hierarchy setup into cpuhotplug prepare callback")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250114231507.21672-3-frederic@kernel.org
2025-01-16 12:47:11 +01:00
Frederic Weisbecker
b729cc1ec2 timers/migration: Fix another race between hotplug and idle entry/exit
Commit 10a0e6f3d3 ("timers/migration: Move hierarchy setup into
cpuhotplug prepare callback") fixed a race between idle exit and CPU
hotplug up leading to a wrong "0" value migrator assigned to the top
level. However there is still a situation that remains unhandled:

         [GRP0:0]
        migrator  = TMIGR_NONE
        active    = NONE
        groupmask = 0
        /     \      \
       0       1     2..7
     idle      idle   idle

0) The system is fully idle.

         [GRP0:0]
        migrator  = CPU 0
        active    = CPU 0
        groupmask = 0
        /     \      \
       0       1     2..7
     active   idle   idle

1) CPU 0 is activating. It has done the cmpxchg on the top's ->migr_state
but it hasn't yet returned to __walk_groups().

         [GRP0:0]
        migrator  = CPU 0
        active    = CPU 0, CPU 1
        groupmask = 0
        /     \      \
       0       1     2..7
     active  active  idle

2) CPU 1 is activating. CPU 0 stays the migrator (still stuck in
__walk_groups(), delayed by #VMEXIT for example).

                 [GRP1:0]
              migrator = TMIGR_NONE
              active   = NONE
              groupmask = 0
              /                  \
        [GRP0:0]                      [GRP0:1]
       migrator  = CPU 0           migrator = TMIGR_NONE
       active    = CPU 0, CPU1     active   = NONE
       groupmask = 2               groupmask = 1
       /     \      \
      0       1     2..7                   8
    active  active  idle              !online

3) CPU 8 is preparing to boot. CPUHP_TMIGR_PREPARE is being ran by CPU 1
which has created the GRP0:1 and the new top GRP1:0 connected to GRP0:1
and GRP0:0. The groupmask of GRP0:0 is now 2. CPU 1 hasn't yet
propagated its activation up to GRP1:0.

                 [GRP1:0]
              migrator = 0 (!!!)
              active   = NONE
              groupmask = 0
              /                  \
        [GRP0:0]                  [GRP0:1]
       migrator  = CPU 0           migrator = TMIGR_NONE
       active    = CPU 0, CPU1     active   = NONE
       groupmask = 2               groupmask = 1
       /     \      \
      0       1     2..7                   8
    active  active  idle                !online

4) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups()
returning from tmigr_cpu_active(). The new top GRP1:0 is visible and
fetched but the freshly updated groupmask of GRP0:0 may not be visible
due to lack of ordering! As a result tmigr_active_up() is called to
GRP0:0 with a child's groupmask of "0". This buggy "0" groupmask then
becomes the migrator for GRP1:0 forever. As a result, timers on a fully
idle system get ignored.

One possible fix would be to define TMIGR_NONE as "0" so that such a
race would have no effect. And after all TMIGR_NONE doesn't need to be
anything else. However this would leave an uncomfortable state machine
where gears happen not to break by chance but are vulnerable to future
modifications.

Keep TMIGR_NONE as is instead and pre-initialize to "1" the groupmask of
any newly created top level. This groupmask is guaranteed to be visible
upon fetching the corresponding group for the 1st time:

_ By the upcoming CPU thanks to CPU hotplug synchronization between the
  control CPU (BP) and the booting one (AP).

_ By the control CPU since the groupmask and parent pointers are
  initialized locally.

_ By all CPUs belonging to the same group than the control CPU because
  they must wait for it to ever become idle before needing to walk to
  the new top. The cmpcxhg() on ->migr_state then makes sure its
  groupmask is visible.

With this pre-initialization, it is guaranteed that if a future top level
is linked to an old one, it is walked through with a valid groupmask.

Fixes: 10a0e6f3d3 ("timers/migration: Move hierarchy setup into cpuhotplug prepare callback")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250114231507.21672-2-frederic@kernel.org
2025-01-16 12:47:11 +01:00
Zhongqiu Han
3ec955713d timers: Optimize get_timer_[this_]cpu_base()
If a timer is deferrable and NO_HZ_COMMON is enabled, get_timer_cpu_base()
and get_timer_this_cpu_base() invoke per_cpu_ptr() and this_cpu_ptr()
twice.

While this seems to be cheap, get_timer_cpu_base() can be called in a loop
in lock_timer_base().

Optimize the functions by updating the base index for deferrable timers and
retrieving the actual base pointer once.

In both cases the resulting assembly code of those helpers becomes smaller,
which results in a ~30% execution time reduction for a lock_timer_base()
micro bench mark.

Signed-off-by: Zhongqiu Han <quic_zhonhan@quicinc.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241231150115.1978342-1-quic_zhonhan@quicinc.com
2025-01-16 09:04:23 +01:00
Dr. David Alan Gilbert
2d2a46cf23 timekeeping: Remove unused ktime_get_fast_timestamps()
ktime_get_fast_timestamps() was added in 2020 by commit e2d977c9f1
("timekeeping: Provide multi-timestamp accessor to NMI safe timekeeper")
but has remained unused.

Remove it.

[ tglx: Fold the inline as David suggested in the submission ]

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250112160132.450209-1-linux@treblig.org
2025-01-15 19:49:14 +01:00
Randy Dunlap
4477b06014 timer/migration: Fix kernel-doc warnings for union tmigr_state
Use the correct kernel-doc notation for nested structs/unions to
eliminate warnings:

timer_migration.h:119: warning: Incorrect use of kernel-doc format:          * struct - split state of tmigr_group
timer_migration.h:134: warning: Function parameter or struct member 'active' not described in 'tmigr_state'
timer_migration.h:134: warning: Function parameter or struct member 'migrator' not described in 'tmigr_state'
timer_migration.h:134: warning: Function parameter or struct member 'seq' not described in 'tmigr_state'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250111063156.910903-1-rdunlap@infradead.org
2025-01-15 19:49:14 +01:00
Randy Dunlap
4903e1ba79 tick/broadcast: Add kernel-doc for function parameters
Add kernel-doc comments for two parameters to eliminate kernel-doc warnings:

tick-broadcast.c:1026: warning: Function parameter or struct member 'bc' not described in 'tick_broadcast_setup_oneshot'
tick-broadcast.c:1026: warning: Function parameter or struct member 'from_periodic' not described in 'tick_broadcast_setup_oneshot'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250111063148.910887-1-rdunlap@infradead.org
2025-01-15 19:49:14 +01:00
Richard Clark
da7100d3bf hrtimers: Update the return type of enqueue_hrtimer()
The return type should be 'bool' instead of 'int' according to the calling
context in the kernel, and its internal implementation, i.e. :

	return timerqueue_add();

which is a bool-return function.

[ tglx: Adjust function arguments ]

Signed-off-by: Richard Clark <richard.xnu.clark@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/Z2ppT7me13dtxm1a@MBC02GN1V4Q05P
2025-01-15 19:49:14 +01:00
Paul E. McKenney
776b194116 clocksource/wdtest: Print time values for short udelay(1)
When a pair of clocksource reads separated by a udelay(1) claim less than a
full microsecond of elapsed time, print the measured delay as part of the
splat.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/717a2ddf-a80f-490b-aa3a-4e4b74fa56ca@paulmck-laptop
2025-01-15 19:49:13 +01:00
Zhu Jun
9f38e83a88 posix-timers: Fix typo in __lock_timer()
The word 'accross' is wrong, so fix it.

Signed-off-by: Zhu Jun <zhujun2@cmss.chinamobile.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241204080907.11989-1-zhujun2@cmss.chinamobile.com
2025-01-15 19:49:13 +01:00
Thomas Gleixner
76031d9536 clocksource: Make negative motion detection more robust
Guenter reported boot stalls on a emulated ARM 32-bit platform, which has a
24-bit wide clocksource.

It turns out that the calculated maximal idle time, which limits idle
sleeps to prevent clocksource wrap arounds, is close to the point where the
negative motion detection triggers.

  max_idle_ns:                    597268854 ns
  negative motion tripping point: 671088640 ns

If the idle wakeup is delayed beyond that point, the clocksource
advances far enough to trigger the negative motion detection. This
prevents the clock to advance and in the worst case the system stalls
completely if the consecutive sleeps based on the stale clock are
delayed as well.

Cure this by calculating a more robust cut-off value for negative motion,
which covers 87.5% of the actual clocksource counter width. Compare the
delta against this value to catch negative motion. This is specifically for
clock sources with a small counter width as their wrap around time is close
to the half counter width. For clock sources with wide counters this is not
a problem because the maximum idle time is far from the half counter width
due to the math overflow protection constraints.

For the case at hand this results in a tripping point of 1174405120ns.

Note, that this cannot prevent issues when the delay exceeds the 87.5%
margin, but that's not different from the previous unchecked version which
allowed arbitrary time jumps.

Systems with small counter width are prone to invalid results, but this
problem is unlikely to be seen on real hardware. If such a system
completely stalls for more than half a second, then there are other more
urgent problems than the counter wrapping around.

Fixes: c163e40af9 ("timekeeping: Always check for negative motion")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/all/8734j5ul4x.ffs@tglx
Closes: https://lore.kernel.org/all/387b120b-d68a-45e8-b6ab-768cd95d11c2@roeck-us.net
2024-12-05 16:03:24 +01:00
Marcelo Dalmas
f5807b0606 ntp: Remove invalid cast in time offset math
Due to an unsigned cast, adjtimex() returns the wrong offest when using
ADJ_MICRO and the offset is negative. In this case a small negative offset
returns approximately 4.29 seconds (~ 2^32/1000 milliseconds) due to the
unsigned cast of the negative offset.

This cast was added when the kernel internal struct timex was changed to
use type long long for the time offset value to address the problem of a
64bit/32bit division on 32bit systems.

The correct cast would have been (s32), which is correct as time_offset can
only be in the range of [INT_MIN..INT_MAX] because the shift constant used
for calculating it is 32. But that's non-obvious.

Remove the cast and use div_s64() to cure the issue.

[ tglx: Fix white space damage, use div_s64() and amend the change log ]

Fixes: ead25417f8 ("timex: use __kernel_timex internally")
Signed-off-by: Marcelo Dalmas <marcelo.dalmas@ge.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/SJ0P101MB03687BF7D5A10FD3C49C51E5F42E2@SJ0P101MB0368.NAMP101.PROD.OUTLOOK.COM
2024-11-28 12:02:38 +01:00
Linus Torvalds
bf9aa14fc5 A rather large update for timekeeping and timers:
- The final step to get rid of auto-rearming posix-timers
 
     posix-timers are currently auto-rearmed by the kernel when the signal
     of the timer is ignored so that the timer signal can be delivered once
     the corresponding signal is unignored.
 
     This requires to throttle the timer to prevent a DoS by small intervals
     and keeps the system pointlessly out of low power states for no value.
     This is a long standing non-trivial problem due to the lock order of
     posix-timer lock and the sighand lock along with life time issues as
     the timer and the sigqueue have different life time rules.
 
     Cure this by:
 
      * Embedding the sigqueue into the timer struct to have the same life
        time rules. Aside of that this also avoids the lookup of the timer
        in the signal delivery and rearm path as it's just a always valid
        container_of() now.
 
      * Queuing ignored timer signals onto a seperate ignored list.
 
      * Moving queued timer signals onto the ignored list when the signal is
        switched to SIG_IGN before it could be delivered.
 
      * Walking the ignored list when SIG_IGN is lifted and requeue the
        signals to the actual signal lists. This allows the signal delivery
        code to rearm the timer.
 
     This also required to consolidate the signal delivery rules so they are
     consistent across all situations. With that all self test scenarios
     finally succeed.
 
   - Core infrastructure for VFS multigrain timestamping
 
     This is required to allow the kernel to use coarse grained time stamps
     by default and switch to fine grained time stamps when inode attributes
     are actively observed via getattr().
 
     These changes have been provided to the VFS tree as well, so that the
     VFS specific infrastructure could be built on top.
 
   - Cleanup and consolidation of the sleep() infrastructure
 
     * Move all sleep and timeout functions into one file
 
     * Rework udelay() and ndelay() into proper documented inline functions
       and replace the hardcoded magic numbers by proper defines.
 
     * Rework the fsleep() implementation to take the reality of the timer
       wheel granularity on different HZ values into account. Right now the
       boundaries are hard coded time ranges which fail to provide the
       requested accuracy on different HZ settings.
 
     * Update documentation for all sleep/timeout related functions and fix
       up stale documentation links all over the place
 
     * Fixup a few usage sites
 
   - Rework of timekeeping and adjtimex(2) to prepare for multiple PTP clocks
 
     A system can have multiple PTP clocks which are participating in
     seperate and independent PTP clock domains. So far the kernel only
     considers the PTP clock which is based on CLOCK TAI relevant as that's
     the clock which drives the timekeeping adjustments via the various user
     space daemons through adjtimex(2).
 
     The non TAI based clock domains are accessible via the file descriptor
     based posix clocks, but their usability is very limited. They can't be
     accessed fast as they always go all the way out to the hardware and
     they cannot be utilized in the kernel itself.
 
     As Time Sensitive Networking (TSN) gains traction it is required to
     provide fast user and kernel space access to these clocks.
 
     The approach taken is to utilize the timekeeping and adjtimex(2)
     infrastructure to provide this access in a similar way how the kernel
     provides access to clock MONOTONIC, REALTIME etc.
 
     Instead of creating a duplicated infrastructure this rework converts
     timekeeping and adjtimex(2) into generic functionality which operates
     on pointers to data structures instead of using static variables.
 
     This allows to provide time accessors and adjtimex(2) functionality for
     the independent PTP clocks in a subsequent step.
 
   - Consolidate hrtimer initialization
 
     hrtimers are set up by initializing the data structure and then
     seperately setting the callback function for historical reasons.
 
     That's an extra unnecessary step and makes Rust support less straight
     forward than it should be.
 
     Provide a new set of hrtimer_setup*() functions and convert the core
     code and a few usage sites of the less frequently used interfaces over.
 
     The bulk of the htimer_init() to hrtimer_setup() conversion is already
     prepared and scheduled for the next merge window.
 
   - Drivers:
 
     * Ensure that the global timekeeping clocksource is utilizing the
       cluster 0 timer on MIPS multi-cluster systems.
 
       Otherwise CPUs on different clusters use their cluster specific
       clocksource which is not guaranteed to be synchronized with other
       clusters.
 
     * Mostly boring cleanups, fixes, improvements and code movement
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmc7kPITHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoZKkD/9OUL6fOJrDUmOYBa4QVeMyfTef4EaL
 tvwIMM/29XQFeiq3xxCIn+EMnHjXn2lvIhYGQ7GKsbKYwvJ7ZBDpQb+UMhZ2nKI9
 6D6BP6WomZohKeH2fZbJQAdqOi3KRYdvQdIsVZUexkqiaVPphRvOH9wOr45gHtZM
 EyMRSotPlQTDqcrbUejDMEO94GyjDCYXRsyATLxjmTzL/N4xD4NRIiotjM2vL/a9
 8MuCgIhrKUEyYlFoOxxeokBsF3kk3/ez2jlG9b/N8VLH3SYIc2zgL58FBgWxlmgG
 bY71nVG3nUgEjxBd2dcXAVVqvb+5widk8p6O7xxOAQKTLMcJ4H0tQDkMnzBtUzvB
 DGAJDHAmAr0g+ja9O35Pkhunkh4HYFIbq0Il4d1HMKObhJV0JumcKuQVxrXycdm3
 UZfq3seqHsZJQbPgCAhlFU0/2WWScocbee9bNebGT33KVwSp5FoVv89C/6Vjb+vV
 Gusc3thqrQuMAZW5zV8g4UcBAA/xH4PB0I+vHib+9XPZ4UQ7/6xKl2jE0kd5hX7n
 AAUeZvFNFqIsY+B6vz+Jx/yzyM7u5cuXq87pof5EHVFzv56lyTp4ToGcOGYRgKH5
 JXeYV1OxGziSDrd5vbf9CzdWMzqMvTefXrHbWrjkjhNOe8E1A8O88RZ5uRKZhmSw
 hZZ4hdM9+3T7cg==
 =2VC6
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "A rather large update for timekeeping and timers:

   - The final step to get rid of auto-rearming posix-timers

     posix-timers are currently auto-rearmed by the kernel when the
     signal of the timer is ignored so that the timer signal can be
     delivered once the corresponding signal is unignored.

     This requires to throttle the timer to prevent a DoS by small
     intervals and keeps the system pointlessly out of low power states
     for no value. This is a long standing non-trivial problem due to
     the lock order of posix-timer lock and the sighand lock along with
     life time issues as the timer and the sigqueue have different life
     time rules.

     Cure this by:

       - Embedding the sigqueue into the timer struct to have the same
         life time rules. Aside of that this also avoids the lookup of
         the timer in the signal delivery and rearm path as it's just a
         always valid container_of() now.

       - Queuing ignored timer signals onto a seperate ignored list.

       - Moving queued timer signals onto the ignored list when the
         signal is switched to SIG_IGN before it could be delivered.

       - Walking the ignored list when SIG_IGN is lifted and requeue the
         signals to the actual signal lists. This allows the signal
         delivery code to rearm the timer.

     This also required to consolidate the signal delivery rules so they
     are consistent across all situations. With that all self test
     scenarios finally succeed.

   - Core infrastructure for VFS multigrain timestamping

     This is required to allow the kernel to use coarse grained time
     stamps by default and switch to fine grained time stamps when inode
     attributes are actively observed via getattr().

     These changes have been provided to the VFS tree as well, so that
     the VFS specific infrastructure could be built on top.

   - Cleanup and consolidation of the sleep() infrastructure

       - Move all sleep and timeout functions into one file

       - Rework udelay() and ndelay() into proper documented inline
         functions and replace the hardcoded magic numbers by proper
         defines.

       - Rework the fsleep() implementation to take the reality of the
         timer wheel granularity on different HZ values into account.
         Right now the boundaries are hard coded time ranges which fail
         to provide the requested accuracy on different HZ settings.

       - Update documentation for all sleep/timeout related functions
         and fix up stale documentation links all over the place

       - Fixup a few usage sites

   - Rework of timekeeping and adjtimex(2) to prepare for multiple PTP
     clocks

     A system can have multiple PTP clocks which are participating in
     seperate and independent PTP clock domains. So far the kernel only
     considers the PTP clock which is based on CLOCK TAI relevant as
     that's the clock which drives the timekeeping adjustments via the
     various user space daemons through adjtimex(2).

     The non TAI based clock domains are accessible via the file
     descriptor based posix clocks, but their usability is very limited.
     They can't be accessed fast as they always go all the way out to
     the hardware and they cannot be utilized in the kernel itself.

     As Time Sensitive Networking (TSN) gains traction it is required to
     provide fast user and kernel space access to these clocks.

     The approach taken is to utilize the timekeeping and adjtimex(2)
     infrastructure to provide this access in a similar way how the
     kernel provides access to clock MONOTONIC, REALTIME etc.

     Instead of creating a duplicated infrastructure this rework
     converts timekeeping and adjtimex(2) into generic functionality
     which operates on pointers to data structures instead of using
     static variables.

     This allows to provide time accessors and adjtimex(2) functionality
     for the independent PTP clocks in a subsequent step.

   - Consolidate hrtimer initialization

     hrtimers are set up by initializing the data structure and then
     seperately setting the callback function for historical reasons.

     That's an extra unnecessary step and makes Rust support less
     straight forward than it should be.

     Provide a new set of hrtimer_setup*() functions and convert the
     core code and a few usage sites of the less frequently used
     interfaces over.

     The bulk of the htimer_init() to hrtimer_setup() conversion is
     already prepared and scheduled for the next merge window.

   - Drivers:

       - Ensure that the global timekeeping clocksource is utilizing the
         cluster 0 timer on MIPS multi-cluster systems.

         Otherwise CPUs on different clusters use their cluster specific
         clocksource which is not guaranteed to be synchronized with
         other clusters.

       - Mostly boring cleanups, fixes, improvements and code movement"

* tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits)
  posix-timers: Fix spurious warning on double enqueue versus do_exit()
  clocksource/drivers/arm_arch_timer: Use of_property_present() for non-boolean properties
  clocksource/drivers/gpx: Remove redundant casts
  clocksource/drivers/timer-ti-dm: Fix child node refcount handling
  dt-bindings: timer: actions,owl-timer: convert to YAML
  clocksource/drivers/ralink: Add Ralink System Tick Counter driver
  clocksource/drivers/mips-gic-timer: Always use cluster 0 counter as clocksource
  clocksource/drivers/timer-ti-dm: Don't fail probe if int not found
  clocksource/drivers:sp804: Make user selectable
  clocksource/drivers/dw_apb: Remove unused dw_apb_clockevent functions
  hrtimers: Delete hrtimer_init_on_stack()
  alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack()
  io_uring: Switch to use hrtimer_setup_on_stack()
  sched/idle: Switch to use hrtimer_setup_on_stack()
  hrtimers: Delete hrtimer_init_sleeper_on_stack()
  wait: Switch to use hrtimer_setup_sleeper_on_stack()
  timers: Switch to use hrtimer_setup_sleeper_on_stack()
  net: pktgen: Switch to use hrtimer_setup_sleeper_on_stack()
  futex: Switch to use hrtimer_setup_sleeper_on_stack()
  fs/aio: Switch to use hrtimer_setup_sleeper_on_stack()
  ...
2024-11-19 16:35:06 -08:00
Linus Torvalds
0352387523 First step of consolidating the VDSO data page handling:
The VDSO data page handling is architecture specific for historical
   reasons, but there is no real technical reason to do so.
 
   Aside of that VDSO data has become a dump ground for various mechanisms
   and fail to provide a clear separation of the functionalities.
 
   Clean this up by:
 
     * consolidating the VDSO page data by getting rid of architecture
       specific warts especially in x86 and PowerPC.
 
     * removing the last includes of header files which are pulling in other
       headers outside of the VDSO namespace.
 
     * seperating timekeeping and other VDSO data accordingly.
 
   Further consolidation of the VDSO page handling is done in subsequent
   changes scheduled for the next merge window.
 
   This also lays the ground for expanding the VDSO time getters for
   independent PTP clocks in a generic way without making every architecture
   add support seperately.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmc7kyoTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVBjD/9awdN2YeCGIM9rlHIktUdNRmRSL2SL
 6av1CPffN5DenONYTXWrDYPkC4yfjUwIs8H57uzFo10yA7RQ/Qfq+O68k5GnuFew
 jvpmmYSZ6TT21AmAaCIhn+kdl9YbEJFvN2AWH85Bl29k9FGB04VzJlQMMjfEZ1a5
 Mhwv+cfYNuPSZmU570jcxW2XgbyTWlLZBByXX/Tuz9bwpmtszba507bvo45x6gIP
 twaWNzrsyJpdXfMrfUnRiChN8jHlDN7I6fgQvpsoRH5FOiVwIFo0Ip2rKbk+ONfD
 W/rcU5oeqRIxRVDHzf2Sv8WPHMCLRv01ZHBcbJOtgvZC3YiKgKYoeEKabu9ZL1BH
 6VmrxjYOBBFQHOYAKPqBuS7BgH5PmtMbDdSZXDfRaAKaCzhCRysdlWW7z48r2R//
 zPufb7J6Tle23AkuZWhFjvlGgSBl4zxnTFn31HYOyQps3TMI4y50Z2DhE/EeU8a6
 DRl8/k1KQVDUZ6udJogS5kOr1J8pFtUPrA2uhR8UyLdx7YKiCzcdO1qWAjtXlVe8
 oNpzinU+H9bQqGe9IyS7kCG9xNaCRZNkln5Q1WfnkTzg5f6ihfaCvIku3l4bgVpw
 3HmcxYiC6RxQB+ozwN7hzCCKT4L9aMhr/457TNOqRkj2Elw3nvJ02L4aI86XAKLE
 jwO9Fkp9qcCxCw==
 =q5eD
 -----END PGP SIGNATURE-----

Merge tag 'timers-vdso-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull vdso data page handling updates from Thomas Gleixner:
 "First steps of consolidating the VDSO data page handling.

  The VDSO data page handling is architecture specific for historical
  reasons, but there is no real technical reason to do so.

  Aside of that VDSO data has become a dump ground for various
  mechanisms and fail to provide a clear separation of the
  functionalities.

  Clean this up by:

   - consolidating the VDSO page data by getting rid of architecture
     specific warts especially in x86 and PowerPC.

   - removing the last includes of header files which are pulling in
     other headers outside of the VDSO namespace.

   - seperating timekeeping and other VDSO data accordingly.

  Further consolidation of the VDSO page handling is done in subsequent
  changes scheduled for the next merge window.

  This also lays the ground for expanding the VDSO time getters for
  independent PTP clocks in a generic way without making every
  architecture add support seperately"

* tag 'timers-vdso-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
  x86/vdso: Add missing brackets in switch case
  vdso: Rename struct arch_vdso_data to arch_vdso_time_data
  powerpc: Split systemcfg struct definitions out from vdso
  powerpc: Split systemcfg data out of vdso data page
  powerpc: Add kconfig option for the systemcfg page
  powerpc/pseries/lparcfg: Use num_possible_cpus() for potential processors
  powerpc/pseries/lparcfg: Fix printing of system_active_processors
  powerpc/procfs: Propagate error of remap_pfn_range()
  powerpc/vdso: Remove offset comment from 32bit vdso_arch_data
  x86/vdso: Split virtual clock pages into dedicated mapping
  x86/vdso: Delete vvar.h
  x86/vdso: Access vdso data without vvar.h
  x86/vdso: Move the rng offset to vsyscall.h
  x86/vdso: Access rng vdso data without vvar.h
  x86/vdso: Access timens vdso data without vvar.h
  x86/vdso: Allocate vvar page from C code
  x86/vdso: Access rng data from kernel without vvar
  x86/vdso: Place vdso_data at beginning of vvar page
  x86/vdso: Use __arch_get_vdso_data() to access vdso data
  x86/mm/mmap: Remove arch_vma_name()
  ...
2024-11-19 16:09:13 -08:00
Linus Torvalds
5c2b050848 A set of updates for the interrupt subsystem:
- Tree wide:
 
     * Make nr_irqs static to the core code and provide accessor functions
       to remove existing and prevent future aliasing problems with local
       variables or function arguments of the same name.
 
   - Core code:
 
     * Prevent freeing an interrupt in the devres code which is not managed
       by devres in the first place.
 
     * Use seq_put_decimal_ull_width() for decimal values output in
       /proc/interrupts which increases performance significantly as it
       avoids parsing the format strings over and over.
 
     * Optimize raising the timer and hrtimer soft interrupts by using the
       'set bit only' variants instead of the combined version which checks
       whether ksoftirqd should be woken up. The latter is a pointless
       exercise as both soft interrupts are raised in the context of the
       timer interrupt and therefore never wake up ksoftirqd.
 
     * Delegate timer/hrtimer soft interrupt processing to a dedicated thread
       on RT.
 
       Timer and hrtimer soft interrupts are always processed in ksoftirqd
       on RT enabled kernels. This can lead to high latencies when other
       soft interrupts are delegated to ksoftirqd as well.
 
       The separate thread allows to run them seperately under a RT
       scheduling policy to reduce the latency overhead.
 
   - Drivers:
 
     * New drivers or extensions of existing drivers to support Renesas
       RZ/V2H(P), Aspeed AST27XX, T-HEAD C900 and ATMEL sam9x7 interrupt
       chips
 
     * Support for multi-cluster GICs on MIPS.
 
       MIPS CPUs can come with multiple CPU clusters, where each CPU cluster
       has its own GIC (Generic Interrupt Controller). This requires to
       access the GIC of a remote cluster through a redirect register block.
 
       This is encapsulated into a set of helper functions to keep the
       complexity out of the actual code paths which handle the GIC details.
 
     * Support for encrypted guests in the ARM GICV3 ITS driver
 
       The ITS page needs to be shared with the hypervisor and therefore
       must be decrypted.
 
     * Small cleanups and fixes all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmc7ggcTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaf7D/9G6FgJXx/60zqnpnOr9Yx0hxjaI47x
 PFyCd3P05qyVMBYXfI99vrSKuVdMZXJ/fH5L83y+sOaTASyLTzg37igZycIDJzLI
 FnHh/m/+UA8k2aIC5VUiNAjne2RLaTZiRN15uEHFVjByC5Y+YTlCNUE4BBhg5RfQ
 hKmskeffWdtui3ou13CSNvbFn+pmqi4g6n1ysUuLhiwM2E5b1rZMprcCOnun/cGP
 IdUQsODNWTTv9eqPJez985M6A1x2SCGNv7Z73h58B9N0pBRPEC1xnhUnCJ1sA0cJ
 pnfde2C1lztEjYbwDngy0wgq0P6LINjQ5Ma2YY2F2hTMsXGJxGPDZm24/u5uR46x
 N/gsOQMXqw6f5yvbiS7Asx9WzR6ry8rJl70QRgTyozz7xxJTaiNm2HqVFe2wc+et
 Q/BzaKdhmUJj1GMZmqD2rrgwYeDcb4wWYNtwjM4PVHHxYlJVq0mEF1kLLS8YDyjf
 HuGPVqtSkt3E0+Br3FKcv5ltUQP8clXbudc6L1u98YBfNK12hW8L+c3YSvIiFoYM
 ZOAeANPM7VtQbP2Jg2q81Dd3CShImt5jqL2um+l8g7+mUE7l9gyuO/w/a5dQ57+b
 kx7mHHIW2zCeHrkZZbRUYzI2BJfMCCOVN4Ax5OZxTLnLsL9VEehy8NM8QYT4TS8R
 XmTOYW3U9XR3gw==
 =JqxC
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull interrupt subsystem updates from Thomas Gleixner:
 "Tree wide:

   - Make nr_irqs static to the core code and provide accessor functions
     to remove existing and prevent future aliasing problems with local
     variables or function arguments of the same name.

  Core code:

   - Prevent freeing an interrupt in the devres code which is not
     managed by devres in the first place.

   - Use seq_put_decimal_ull_width() for decimal values output in
     /proc/interrupts which increases performance significantly as it
     avoids parsing the format strings over and over.

   - Optimize raising the timer and hrtimer soft interrupts by using the
     'set bit only' variants instead of the combined version which
     checks whether ksoftirqd should be woken up. The latter is a
     pointless exercise as both soft interrupts are raised in the
     context of the timer interrupt and therefore never wake up
     ksoftirqd.

   - Delegate timer/hrtimer soft interrupt processing to a dedicated
     thread on RT.

     Timer and hrtimer soft interrupts are always processed in ksoftirqd
     on RT enabled kernels. This can lead to high latencies when other
     soft interrupts are delegated to ksoftirqd as well.

     The separate thread allows to run them seperately under a RT
     scheduling policy to reduce the latency overhead.

  Drivers:

   - New drivers or extensions of existing drivers to support Renesas
     RZ/V2H(P), Aspeed AST27XX, T-HEAD C900 and ATMEL sam9x7 interrupt
     chips

   - Support for multi-cluster GICs on MIPS.

     MIPS CPUs can come with multiple CPU clusters, where each CPU
     cluster has its own GIC (Generic Interrupt Controller). This
     requires to access the GIC of a remote cluster through a redirect
     register block.

     This is encapsulated into a set of helper functions to keep the
     complexity out of the actual code paths which handle the GIC
     details.

   - Support for encrypted guests in the ARM GICV3 ITS driver

     The ITS page needs to be shared with the hypervisor and therefore
     must be decrypted.

   - Small cleanups and fixes all over the place"

* tag 'irq-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
  irqchip/riscv-aplic: Prevent crash when MSI domain is missing
  genirq/proc: Use seq_put_decimal_ull_width() for decimal values
  softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.
  timers: Use __raise_softirq_irqoff() to raise the softirq.
  hrtimer: Use __raise_softirq_irqoff() to raise the softirq
  riscv: defconfig: Enable T-HEAD C900 ACLINT SSWI drivers
  irqchip: Add T-HEAD C900 ACLINT SSWI driver
  dt-bindings: interrupt-controller: Add T-HEAD C900 ACLINT SSWI device
  irqchip/stm32mp-exti: Use of_property_present() for non-boolean properties
  irqchip/mips-gic: Fix selection of GENERIC_IRQ_EFFECTIVE_AFF_MASK
  irqchip/mips-gic: Prevent indirect access to clusters without CPU cores
  irqchip/mips-gic: Multi-cluster support
  irqchip/mips-gic: Setup defaults in each cluster
  irqchip/mips-gic: Support multi-cluster in for_each_online_cpu_gic()
  irqchip/mips-gic: Replace open coded online CPU iterations
  genirq/irqdesc: Use str_enabled_disabled() helper in wakeup_show()
  genirq/devres: Don't free interrupt which is not managed by devres
  irqchip/gic-v3-its: Fix over allocation in itt_alloc_pool()
  irqchip/aspeed-intc: Add AST27XX INTC support
  dt-bindings: interrupt-controller: Add support for ASPEED AST27XX INTC
  ...
2024-11-19 15:54:19 -08:00
Linus Torvalds
364eeb79a2 Locking changes for v6.13 are:
- lockdep:
     - Enable PROVE_RAW_LOCK_NESTING with PROVE_LOCKING (Sebastian Andrzej Siewior)
     - Add lockdep_cleanup_dead_cpu() (David Woodhouse)
 
  - futexes:
     - Use atomic64_inc_return() in get_inode_sequence_number() (Uros Bizjak)
     - Use atomic64_try_cmpxchg_relaxed() in get_inode_sequence_number() (Uros Bizjak)
 
  - RT locking:
     - Add sparse annotation PREEMPT_RT's locking (Sebastian Andrzej Siewior)
 
  - spinlocks:
     - Use atomic_try_cmpxchg_release() in osq_unlock() (Uros Bizjak)
 
  - atomics:
     - x86: Use ALT_OUTPUT_SP() for __alternative_atomic64() (Uros Bizjak)
     - x86: Use ALT_OUTPUT_SP() for __arch_{,try_}cmpxchg64_emu() (Uros Bizjak)
 
  - KCSAN, seqlocks:
     - Support seqcount_latch_t (Marco Elver)
 
  - <linux/cleanup.h>:
     - Add if_not_cond_guard() conditional guard helper (David Lechner)
     - Adjust scoped_guard() macros to avoid potential warning (Przemek Kitszel)
     - Remove address space of returned pointer (Uros Bizjak)
 
  - WW mutexes:
     - locking/ww_mutex: Adjust to lockdep nest_lock requirements (Thomas Hellström)
 
  - Rust integration:
     - Fix raw_spin_lock initialization on PREEMPT_RT (Eder Zulian)
 
  - miscellaneous cleanups & fixes:
     - lockdep: Fix wait-type check related warnings (Ahmed Ehab)
     - lockdep: Use info level for initial info messages (Jiri Slaby)
     - spinlocks: Make __raw_* lock ops static (Geert Uytterhoeven)
     - pvqspinlock: Convert fields of 'enum vcpu_state' to uppercase (Qiuxu Zhuo)
     - iio: magnetometer: Fix if () scoped_guard() formatting (Stephen Rothwell)
     - rtmutex: Fix misleading comment (Peter Zijlstra)
     - percpu-rw-semaphores: Fix grammar in percpu-rw-semaphore.rst (Xiu Jianfeng)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmc7AkQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hGqQ/+KWR5arkoJjH/Nf5IyezYitOwqK7YAdJk
 mrWoZcez0DRopNTf8yZMv1m8jyx7W9KUQumEO/ghqJRlBW+AbxZ1t99kmqWI5Aw0
 +zmhpyo06JHeMYQAfKJXX3iRt2Rt59BPHtGzoop6b0e2i55+uPE+DZTNm2+FwCV9
 4vxmfpYyg5/sJB9/v5b0N9TTDe9a8caOHXU5F+HA1yWuxMmqFuDFIcpKrgS/sUeP
 NelOLbh2L3UOPWP6tRRfpajxCQTmRoeZOQQv0L9dd3jYpyQOCesgKqOhqNTCU8KK
 qamTPig2N00smSLp6I/OVyJ96vFYZrbhyq0kwMayaafAU7mB8lzcfUj+8qP0c90k
 1PROtD1XpF3Nobp1F+YUp3sQxEGdCgs+9VeLWWObv2b/Vt3MDZijdEiC/3OkRAUh
 LPCfl/ky41BmT8AlaxRDjkyrN7hH4oUOkGUdVx6yR389J0OR9MSwEX9qNaMw8bBg
 1ALvv9+OR3QhTWyG30PGqUf3Um230oIdWuWxwFrhaoMmDVEVMRZQMtvQahi5hDYq
 zyX79DKWtExEe/f2hY1m/6eNm6st5HE7X7scOba3TamQzvOzJkjzo7XoS2yeUAjb
 eByO2G0PvTrA0TFls6Hyrl6db5OW5KjQnVWr6W3fiWL5YIdh0SQMkWeaGVvGyfy8
 Q3vhk7POaZo=
 =BvPn
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:
 "Lockdep:
   - Enable PROVE_RAW_LOCK_NESTING with PROVE_LOCKING (Sebastian Andrzej
     Siewior)
   - Add lockdep_cleanup_dead_cpu() (David Woodhouse)

  futexes:
   - Use atomic64_inc_return() in get_inode_sequence_number() (Uros
     Bizjak)
   - Use atomic64_try_cmpxchg_relaxed() in get_inode_sequence_number()
     (Uros Bizjak)

  RT locking:
   - Add sparse annotation PREEMPT_RT's locking (Sebastian Andrzej
     Siewior)

  spinlocks:
   - Use atomic_try_cmpxchg_release() in osq_unlock() (Uros Bizjak)

  atomics:
   - x86: Use ALT_OUTPUT_SP() for __alternative_atomic64() (Uros Bizjak)
   - x86: Use ALT_OUTPUT_SP() for __arch_{,try_}cmpxchg64_emu() (Uros
     Bizjak)

  KCSAN, seqlocks:
   - Support seqcount_latch_t (Marco Elver)

  <linux/cleanup.h>:
   - Add if_not_guard() conditional guard helper (David Lechner)
   - Adjust scoped_guard() macros to avoid potential warning (Przemek
     Kitszel)
   - Remove address space of returned pointer (Uros Bizjak)

  WW mutexes:
   - locking/ww_mutex: Adjust to lockdep nest_lock requirements (Thomas
     Hellström)

  Rust integration:
   - Fix raw_spin_lock initialization on PREEMPT_RT (Eder Zulian)

  Misc cleanups & fixes:
   - lockdep: Fix wait-type check related warnings (Ahmed Ehab)
   - lockdep: Use info level for initial info messages (Jiri Slaby)
   - spinlocks: Make __raw_* lock ops static (Geert Uytterhoeven)
   - pvqspinlock: Convert fields of 'enum vcpu_state' to uppercase
     (Qiuxu Zhuo)
   - iio: magnetometer: Fix if () scoped_guard() formatting (Stephen
     Rothwell)
   - rtmutex: Fix misleading comment (Peter Zijlstra)
   - percpu-rw-semaphores: Fix grammar in percpu-rw-semaphore.rst (Xiu
     Jianfeng)"

* tag 'locking-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
  locking/Documentation: Fix grammar in percpu-rw-semaphore.rst
  iio: magnetometer: fix if () scoped_guard() formatting
  rust: helpers: Avoid raw_spin_lock initialization for PREEMPT_RT
  kcsan, seqlock: Fix incorrect assumption in read_seqbegin()
  seqlock, treewide: Switch to non-raw seqcount_latch interface
  kcsan, seqlock: Support seqcount_latch_t
  time/sched_clock: Broaden sched_clock()'s instrumentation coverage
  time/sched_clock: Swap update_clock_read_data() latch writes
  locking/atomic/x86: Use ALT_OUTPUT_SP() for __arch_{,try_}cmpxchg64_emu()
  locking/atomic/x86: Use ALT_OUTPUT_SP() for __alternative_atomic64()
  cleanup: Add conditional guard helper
  cleanup: Adjust scoped_guard() macros to avoid potential warning
  locking/osq_lock: Use atomic_try_cmpxchg_release() in osq_unlock()
  cleanup: Remove address space of returned pointer
  locking/rtmutex: Fix misleading comment
  locking/rt: Annotate unlock followed by lock for sparse.
  locking/rt: Add sparse annotation for RCU.
  locking/rt: Remove one __cond_lock() in RT's spin_trylock_irqsave()
  locking/rt: Add sparse annotation PREEMPT_RT's sleeping locks.
  locking/pvqspinlock: Convert fields of 'enum vcpu_state' to uppercase
  ...
2024-11-19 12:43:11 -08:00
Linus Torvalds
6ac81fd55e vfs-6.13.mgtime
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcScQAKCRCRxhvAZXjc
 oj+5AP4k822a77wc/3iPFk379naIvQ4dsrgemh0/Pb6ZvzvkFQEAi3vFCfzCDR2x
 SkJF/RwXXKZv6U31QXMRt2Qo6wfBuAc=
 =nVlm
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs multigrain timestamps from Christian Brauner:
 "This is another try at implementing multigrain timestamps. This time
  with significant help from the timekeeping maintainers to reduce the
  performance impact.

  Thomas provided a base branch that contains the required timekeeping
  interfaces for the VFS. It serves as the base for the multi-grain
  timestamp work:

   - Multigrain timestamps allow the kernel to use fine-grained
     timestamps when an inode's attributes is being actively observed
     via ->getattr(). With this support, it's possible for a file to get
     a fine-grained timestamp, and another modified after it to get a
     coarse-grained stamp that is earlier than the fine-grained time. If
     this happens then the files can appear to have been modified in
     reverse order, which breaks VFS ordering guarantees.

     To prevent this, a floor value is maintained for multigrain
     timestamps. Whenever a fine-grained timestamp is handed out, record
     it, and when later coarse-grained stamps are handed out, ensure
     they are not earlier than that value. If the coarse-grained
     timestamp is earlier than the fine-grained floor, return the floor
     value instead.

     The timekeeper changes add a static singleton atomic64_t into
     timekeeper.c that is used to keep track of the latest fine-grained
     time ever handed out. This is tracked as a monotonic ktime_t value
     to ensure that it isn't affected by clock jumps. Because it is
     updated at different times than the rest of the timekeeper object,
     the floor value is managed independently of the timekeeper via a
     cmpxchg() operation, and sits on its own cacheline.

     Two new public timekeeper interfaces are added:

      (1) ktime_get_coarse_real_ts64_mg() fills a timespec64 with the
          later of the coarse-grained clock and the floor time

      (2) ktime_get_real_ts64_mg() gets the fine-grained clock value,
          and tries to swap it into the floor. A timespec64 is filled
          with the result.

   - The VFS has always used coarse-grained timestamps when updating the
     ctime and mtime after a change. This has the benefit of allowing
     filesystems to optimize away a lot metadata updates, down to around
     1 per jiffy, even when a file is under heavy writes.

     Unfortunately, this has always been an issue when we're exporting
     via NFSv3, which relies on timestamps to validate caches. A lot of
     changes can happen in a jiffy, so timestamps aren't sufficient to
     help the client decide when to invalidate the cache. Even with
     NFSv4, a lot of exported filesystems don't properly support a
     change attribute and are subject to the same problems with
     timestamp granularity. Other applications have similar issues with
     timestamps (e.g backup applications).

     If we were to always use fine-grained timestamps, that would
     improve the situation, but that becomes rather expensive, as the
     underlying filesystem would have to log a lot more metadata
     updates.

     This adds a way to only use fine-grained timestamps when they are
     being actively queried. Use the (unused) top bit in
     inode->i_ctime_nsec as a flag that indicates whether the current
     timestamps have been queried via stat() or the like. When it's set,
     we allow the kernel to use a fine-grained timestamp iff it's
     necessary to make the ctime show a different value.

     This solves the problem of being able to distinguish the timestamp
     between updates, but introduces a new problem: it's now possible
     for a file being changed to get a fine-grained timestamp. A file
     that is altered just a bit later can then get a coarse-grained one
     that appears older than the earlier fine-grained time. This
     violates timestamp ordering guarantees.

     This is where the earlier mentioned timkeeping interfaces help. A
     global monotonic atomic64_t value is kept that acts as a timestamp
     floor. When we go to stamp a file, we first get the latter of the
     current floor value and the current coarse-grained time. If the
     inode ctime hasn't been queried then we just attempt to stamp it
     with that value.

     If it has been queried, then first see whether the current coarse
     time is later than the existing ctime. If it is, then we accept
     that value. If it isn't, then we get a fine-grained time and try to
     swap that into the global floor. Whether that succeeds or fails, we
     take the resulting floor time, convert it to realtime and try to
     swap that into the ctime.

     We take the result of the ctime swap whether it succeeds or fails,
     since either is just as valid.

     Filesystems can opt into this by setting the FS_MGTIME fstype flag.
     Others should be unaffected (other than being subject to the same
     floor value as multigrain filesystems)"

* tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: reduce pointer chasing in is_mgtime() test
  tmpfs: add support for multigrain timestamps
  btrfs: convert to multigrain timestamps
  ext4: switch to multigrain timestamps
  xfs: switch to multigrain timestamps
  Documentation: add a new file documenting multigrain timestamps
  fs: add percpu counters for significant multigrain timestamp events
  fs: tracepoints around multigrain timestamp events
  fs: handle delegated timestamps in setattr_copy_mgtime
  timekeeping: Add percpu counter for tracking floor swap events
  timekeeping: Add interfaces for handling timestamps with a floor value
  fs: have setattr_copy handle multigrain timestamps appropriately
  fs: add infrastructure for multigrain timestamps
2024-11-18 09:15:39 -08:00
Nam Cao
3c2fb01521 hrtimers: Delete hrtimer_init_on_stack()
hrtimer_init_on_stack() is now unused. Delete it.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/510ce0d2944c4a382ea51e51d03dcfb73ba0f4f7.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:07 +01:00
Nam Cao
d82fadc727 alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack()
hrtimer_setup() and hrtimer_setup_on_stack() take the callback function
pointer as argument and initialize the timer completely.

Replace the hrtimer_init*() variants and the open coded initialization of
hrtimer::function with the new setup mechanism.

Switch to use the new functions.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/2bae912336103405adcdab96b88d3ea0353b4228.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:07 +01:00
Nam Cao
f3bef7aaa6 hrtimers: Delete hrtimer_init_sleeper_on_stack()
hrtimer_init_sleeper_on_stack() is now unused. Delete it.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/52549846635c0b3a2abf82101f539efdabcd9778.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:06 +01:00
Nam Cao
8fae141107 timers: Switch to use hrtimer_setup_sleeper_on_stack()
hrtimer_setup_sleeper_on_stack() replaces hrtimer_init_sleeper_on_stack()
to keep the naming convention consistent.

Convert the usage sites over to it. The conversion was done with
Coccinelle.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/299c07f0f96af8ab3a7631b47b6ca22b06b20577.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:06 +01:00
Nam Cao
c9bd83abfe hrtimers: Introduce hrtimer_setup_sleeper_on_stack()
The hrtimer_init*() API is replaced by hrtimer_setup*() variants to
initialize the timer including the callback function at once.

hrtimer_init_sleeper_on_stack() does not need user to setup the callback
function separately, so a new variant would not be strictly necessary.

Nonetheless, to keep the naming convention consistent, introduce
hrtimer_setup_sleeper_on_stack(). hrtimer_init_on_stack() will be removed
once all users are converted.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/7b5e18e6dd0ace9eaa211201528cb9dc23752454.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:05 +01:00
Nam Cao
444cb7db4c hrtimers: Introduce hrtimer_setup_on_stack()
To initialize hrtimer on stack, hrtimer_init_on_stack() needs to be called
and also hrtimer::function must be set. This is error-prone and awkward to
use.

Introduce hrtimer_setup_on_stack() which does both of these things, so that
users of hrtimer can be simplified.

The new setup function also has a sanity check for the provided function
pointer. If NULL, a warning is emitted and a dummy callback installed.

hrtimer_init_on_stack() will be removed as soon as all of its users have
been converted to the new function.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/4b05e2ab3a82c517adf67fabc0f0cd8fe118b97c.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:05 +01:00
Nam Cao
908a1d7754 hrtimers: Introduce hrtimer_setup() to replace hrtimer_init()
To initialize hrtimer, hrtimer_init() needs to be called and also
hrtimer::function must be set. This is error-prone and awkward to use.

Introduce hrtimer_setup() which does both of these things, so that users of
hrtimer can be simplified.

The new setup function also has a sanity check for the provided function
pointer. If NULL, a warning is emitted and a dummy callback installed.

hrtimer_init() will be removed as soon as all of its users have been
converted to the new function.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/5057c1ddbfd4b92033cd93d37fe38e6b069d5ba6.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:05 +01:00
Nam Cao
fbf920f255 hrtimers: Add missing hrtimer_init() trace points
hrtimer_init*_on_stack() is not covered by tracing when
CONFIG_DEBUG_OBJECTS_TIMERS=y.

Rework the functions similar to hrtimer_init() and hrtimer_init_sleeper()
so that the hrtimer_init() tracepoint is unconditionally available.

The rework makes hrtimer_init_sleeper() unused. Delete it.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/74528e8abf2bb96e8bee85ffacbf14e15cf89f0d.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:04 +01:00
Sebastian Andrzej Siewior
49a1763950 softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.
The timer and hrtimer soft interrupts are raised in hard interrupt
context. With threaded interrupts force enabled or on PREEMPT_RT this leads
to waking the ksoftirqd for the processing of the soft interrupt.

ksoftirqd runs as SCHED_OTHER task which means it will compete with other
tasks for CPU resources.  This can introduce long delays for timer
processing on heavy loaded systems and is not desired.

Split the TIMER_SOFTIRQ and HRTIMER_SOFTIRQ processing into a dedicated
timers thread and let it run at the lowest SCHED_FIFO priority.
Wake-ups for RT tasks happen from hardirq context so only timer_list timers
and hrtimers for "regular" tasks are processed here. The higher priority
ensures that wakeups are performed before scheduling SCHED_OTHER tasks.

Using a dedicated variable to store the pending softirq bits values ensure
that the timer are not accidentally picked up by ksoftirqd and other
threaded interrupts.

It shouldn't be picked up by ksoftirqd since it runs at lower priority.
However if ksoftirqd is already running while a timer fires, then ksoftird
will be PI-boosted due to the BH-lock to ktimer's priority.

The timer thread can pick up pending softirqs from ksoftirqd but only
if the softirq load is high. It is not be desired that the picked up
softirqs are processed at SCHED_FIFO priority under high softirq load
but this can already happen by a PI-boost by a force-threaded interrupt.

[ frederic@kernel.org: rcutorture.c fixes, storm fix by introduction of
  local_timers_pending() for tick_nohz_next_event() ]

[ junxiao.chang@intel.com: Ensure ktimersd gets woken up even if a
  softirq is currently served. ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org> [rcutorture]
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241106150419.2593080-4-bigeasy@linutronix.de
2024-11-07 02:44:38 +01:00
Sebastian Andrzej Siewior
a02976cfce timers: Use __raise_softirq_irqoff() to raise the softirq.
Raising the timer soft interrupt is always done from hard interrupt
context, so it can be reduced to just setting the TIMER soft interrupt
flag. The soft interrupt will be invoked on return from interrupt.

Use therefore __raise_softirq_irqoff() to raise the TIMER soft interrupt,
which is a trivial optimization.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241106150419.2593080-3-bigeasy@linutronix.de
2024-11-07 02:44:38 +01:00
Sebastian Andrzej Siewior
7a7f5065bc hrtimer: Use __raise_softirq_irqoff() to raise the softirq
Raising the hrtimer soft interrupt is always done from hard interrupt
context, so it can be reduced to just setting the HRTIMER soft interrupt
flag. The soft interrupt will be invoked on return from interrupt.

Use therefore __raise_softirq_irqoff() to raise the HRTIMER soft interrupt,
which is a trivial optimization.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241106150419.2593080-2-bigeasy@linutronix.de
2024-11-07 02:44:38 +01:00
Thomas Gleixner
2634303f87 alarmtimers: Remove return value from alarm functions
Now that the SIG_IGN problem is solved in the core code, the alarmtimer
callbacks do not require a return value anymore.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241105064214.318837272@linutronix.de
2024-11-07 02:14:46 +01:00
Thomas Gleixner
6b0aa14578 alarmtimers: Remove the throttle mechanism from alarm_forward_now()
Now that ignored posix timer signals are requeued and the timers are
rearmed on signal delivery the workaround to keep such timers alive and
self rearm them is not longer required.

Remove the unused alarm timer parts.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064214.252443020@linutronix.de
2024-11-07 02:14:45 +01:00
Thomas Gleixner
7a66f72b09 posix-timers: Cleanup SIG_IGN workaround leftovers
Now that ignored posix timer signals are requeued and the timers are
rearmed on signal delivery the workaround to keep such timers alive and
self rearm them is not longer required.

Remove the relevant hacks and the not longer required return values from
the related functions. The alarm timer workarounds will be cleaned up in a
separate step.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064214.187239060@linutronix.de
2024-11-07 02:14:45 +01:00
Thomas Gleixner
df7a996b4d signal: Queue ignored posixtimers on ignore list
Queue posixtimers which have their signal ignored on the ignored list:

   1) When the timer fires and the signal has SIG_IGN set

   2) When SIG_IGN is installed via sigaction() and a timer signal
      is already queued

This only happens when the signal is for a valid timer, which delivered the
signal in periodic mode. One-shot timer signals are correctly dropped.

Due to the lock order constraints (sighand::siglock nests inside
timer::lock) the signal code cannot access any of the timer fields which
are relevant to make this decision, e.g. timer::it_status.

This is addressed by establishing a protection scheme which requires to
lock both locks on the timer side for modifying decision fields in the
timer struct and therefore makes it possible for the signal delivery to
evaluate with only sighand:siglock being held:

  1) Move the NULLification of timer->it_signal into the sighand::siglock
     protected section of timer_delete() and check timer::it_signal in the
     code path which determines whether the signal is dropped or queued on
     the ignore list.

     This ensures that a deleted timer cannot be moved onto the ignore
     list, which would prevent it from being freed on exit() as it is not
     longer in the process' posix timer list.

     If the timer got moved to the ignored list before deletion then it is
     removed from the ignored list under sighand lock in timer_delete().

  2) Provide a new timer::it_sig_periodic flag, which gets set in the
     signal queue path with both timer and sighand locks held if the timer
     is actually in periodic mode at expiry time.

     The ignore list code checks this flag under sighand::siglock and drops
     the signal when it is not set.

     If it is set, then the signal is moved to the ignored list independent
     of the actual state of the timer.

     When the signal is un-ignored later then the signal is moved back to
     the signal queue. On signal delivery the posix timer side decides
     about dropping the signal if the timer was re-armed, dis-armed or
     deleted based on the signal sequence counter check.

     If the thread/process exits then not yet delivered signals are
     discarded which means the reference of the timer containing the
     sigqueue is dropped and frees the timer.

     This is way cheaper than requiring all code paths to lock
     sighand::siglock of the target thread/process on any modification of
     timer::it_status or going all the way and removing pending signals
     from the signal queues on every rearm, disarm or delete operation.

So the protection scheme here is that on the timer side both timer::lock
and sighand::siglock have to be held for modifying

   timer::it_signal
   timer::it_sig_periodic

which means that on the signal side holding sighand::siglock is enough to
evaluate these fields.
                                                                                                                                                                                                                                                                                                                             
In posixtimer_deliver_signal() holding timer::lock is sufficient to do the
sequence validation against timer::it_signal_seq because a concurrent
expiry is waiting on timer::lock to be released.

This completes the SIG_IGN handling and such timers are not longer self
rearmed which avoids pointless wakeups.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064214.120756416@linutronix.de
2024-11-07 02:14:45 +01:00
Thomas Gleixner
0e20cd33ac posix-timers: Handle ignored list on delete and exit
To handle posix timer signals on sigaction(SIG_IGN) properly, the timers
will be queued on a separate ignored list.

Add the necessary cleanup code for timer_delete() and exit_itimers().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.987530588@linutronix.de
2024-11-07 02:14:45 +01:00
Thomas Gleixner
647da5f709 posix-timers: Move sequence logic into struct k_itimer
The posix timer signal handling uses siginfo::si_sys_private for handling
the sequence counter check. That indirection is not longer required and the
sequence count value at signal queueing time can be stored in struct
k_itimer itself.

This removes the requirement of treating siginfo::si_sys_private special as
it's now always zero as the kernel does not touch it anymore.

Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/all/20241105064213.852619866@linutronix.de
2024-11-07 02:14:45 +01:00
Thomas Gleixner
6017a158be posix-timers: Embed sigqueue in struct k_itimer
To cure the SIG_IGN handling for posix interval timers, the preallocated
sigqueue needs to be embedded into struct k_itimer to prevent life time
races of all sorts.

Now that the prerequisites are in place, embed the sigqueue into struct
k_itimer and fixup the relevant usage sites.

Aside of preparing for proper SIG_IGN handling, this spares an extra
allocation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.719695194@linutronix.de
2024-11-07 02:14:44 +01:00
Thomas Gleixner
11629b9808 signal: Replace resched_timer logic
In preparation for handling ignored posix timer signals correctly and
embedding the sigqueue struct into struct k_itimer, hand down a pointer to
the sigqueue struct into posix_timer_deliver_signal() instead of just
having a boolean flag.

No functional change.

Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/all/20241105064213.652658158@linutronix.de
2024-11-07 02:14:44 +01:00
Thomas Gleixner
0360ed14d9 signal: Refactor send_sigqueue()
To handle posix timers which have their signal ignored via SIG_IGN properly
it is required to requeue a ignored signal for delivery when SIG_IGN is
lifted so the timer gets rearmed.

Split the required code out of send_sigqueue() so it can be reused in
context of sigaction().

While at it rename send_sigqueue() to posixtimer_send_sigqueue() so its
clear what this is about.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.586453412@linutronix.de
2024-11-07 02:14:44 +01:00
Thomas Gleixner
ef1c5bcd6d posix-timers: Store PID type in the timer
instead of re-evaluating the signal delivery mode everywhere.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.519086500@linutronix.de
2024-11-07 02:14:44 +01:00
Thomas Gleixner
5d916a0988 posix-timers: Add a refcount to struct k_itimer
To cure the SIG_IGN handling for posix interval timers, the preallocated
sigqueue needs to be embedded into struct k_itimer to prevent life time
races of all sorts.

To make that work correctly it needs reference counting so that timer
deletion does not free the timer prematuraly when there is a signal queued
or delivered concurrently.

Add a rcuref to the posix timer part.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.304756440@linutronix.de
2024-11-07 02:14:43 +01:00
Thomas Gleixner
4cf7bf2a2f posix-cpu-timers: Use dedicated flag for CPU timer nanosleep
POSIX CPU timer nanosleep creates a k_itimer on stack and uses the sigq
pointer to detect the nanosleep case in the expiry function.

Prepare for embedding sigqueue into struct k_itimer by using a dedicated
flag for nanosleep.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.238550394@linutronix.de
2024-11-07 02:14:43 +01:00
Thomas Gleixner
bf635681c9 posix-cpu-timers: Cleanup the firing logic
The firing flag of a posix CPU timer is tristate:

  0: when the timer is not about to deliver a signal

  1: when the timer has expired, but the signal has not been delivered yet

 -1: when the timer was queued for signal delivery and a rearm operation
     raced against it and supressed the signal delivery.

This is a pointless exercise as this can be simply expressed with a
boolean. Only if set, the signal is delivered. This makes delete and rearm
consistent with the rest of the posix timers.

Convert firing to bool and fixup the usage sites accordingly and add
comments why the timer cannot be dequeued right away.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241105064213.172848618@linutronix.de
2024-11-07 02:14:43 +01:00
Thomas Gleixner
b06b0345ff posix-timers: Make signal overrun accounting sensible
The handling of the timer overrun in the signal code is inconsistent as it
takes previous overruns into account. This is just wrong as after the
reprogramming of a timer the overrun count starts over from a clean state,
i.e. 0.

Don't touch info::si_overrun in send_sigqueue() and only store the overrun
value at signal delivery time, which is computed from the timer itself
relative to the expiry time.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.106738193@linutronix.de
2024-11-07 02:14:43 +01:00
Thomas Gleixner
513793bc6a posix-timers: Make signal delivery consistent
Signals of timers which are reprogammed, disarmed or deleted can deliver
signals related to the past. The POSIX spec is blury about this:

 - "The effect of disarming or resetting a timer with pending expiration
    notifications is unspecified."

 - "The disposition of pending signals for the deleted timer is
    unspecified."

In both cases it is reasonable to expect that pending signals are
discarded. Especially in the reprogramming case it does not make sense to
account for previous overruns or to deliver a signal for a timer which has
been disarmed. This makes the behaviour consistent and understandable.

Remove the si_sys_private check from the signal delivery code and invoke
posix_timer_deliver_signal() unconditionally for posix timer related
signals.

Change posix_timer_deliver_signal() so it controls the actual signal
delivery via the return value. It now instructs the signal code to drop the
signal when:

  1) The timer does not longer exist in the hash table

  2) The timer signal_seq value is not the same as the si_sys_private value
     which was set when the signal was queued.

This is also a preparatory change to embed the sigqueue into the k_itimer
structure, which in turn allows to remove the si_sys_private magic.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241105064213.040348644@linutronix.de
2024-11-07 02:14:43 +01:00
Thomas Gleixner
15cbfb92ef posix-cpu-timers: Correctly update timer status in posix_cpu_timer_del()
If posix_cpu_timer_del() exits early due to task not found or sighand
invalid, it fails to clear the state of the timer. That's harmless but
inconsistent.

These early exits are accounted as successful delete. Move the update of
the timer state into the success return path, so all "successful" deletions
are handled.

Reported-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241105064212.974053438@linutronix.de
2024-11-07 02:14:43 +01:00
Marco Elver
93190bc35d seqlock, treewide: Switch to non-raw seqcount_latch interface
Switch all instrumentable users of the seqcount_latch interface over to
the non-raw interface.

Co-developed-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241104161910.780003-5-elver@google.com
2024-11-05 12:55:35 +01:00
Marco Elver
8ab40fc2b9 time/sched_clock: Broaden sched_clock()'s instrumentation coverage
Most of sched_clock()'s implementation is ineligible for instrumentation
due to relying on sched_clock_noinstr().

Split the implementation off into an __always_inline function
__sched_clock(), which is then used by the noinstr and instrumentable
version, to allow more of sched_clock() to be covered by various
instrumentation.

This will allow instrumentation with the various sanitizers (KASAN,
KCSAN, KMSAN, UBSAN). For KCSAN, we know that raw seqcount_latch usage
without annotations will result in false positive reports: tell it that
all of __sched_clock() is "atomic" for the latch reader; later changes
in this series will take care of the writers.

Co-developed-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241104161910.780003-3-elver@google.com
2024-11-05 12:55:35 +01:00
Marco Elver
1139c71df5 time/sched_clock: Swap update_clock_read_data() latch writes
Swap the writes to the odd and even copies to make the writer critical
section look like all other seqcount_latch writers.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241104161910.780003-2-elver@google.com
2024-11-05 12:55:34 +01:00
Thomas Gleixner
c163e40af9 timekeeping: Always check for negative motion
clocksource_delta() has two variants. One with a check for negative motion,
which is only selected by x86. This is a historic leftover as this function
was previously used in the time getter hot paths.

Since 135225a363 timekeeping_cycles_to_ns() has unconditional protection
against this as a by-product of the protection against 64bit math overflow.

clocksource_delta() is only used in the clocksource watchdog and in
timekeeping_advance(). The extra conditional there is not hurting anyone.

Remove the config option and unconditionally prevent negative motion of the
readout.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241031120328.599430157@linutronix.de
2024-11-02 10:14:31 +01:00
Thomas Gleixner
d44d26987b timekeeping: Remove CONFIG_DEBUG_TIMEKEEPING
Since 135225a363 timekeeping_cycles_to_ns() handles large offsets which
would lead to 64bit multiplication overflows correctly. It's also protected
against negative motion of the clocksource unconditionally, which was
exclusive to x86 before.

timekeeping_advance() handles large offsets already correctly.

That means the value of CONFIG_DEBUG_TIMEKEEPING which analyzed these cases
is very close to zero. Remove all of it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241031120328.536010148@linutronix.de
2024-11-02 10:14:31 +01:00
Thomas Gleixner
1d4199cbbe timers: Add missing READ_ONCE() in __run_timer_base()
__run_timer_base() checks base::next_expiry without holding
base::lock. That can race with a remote CPU updating next_expiry under the
lock. This is an intentional and harmless data race, but lacks a
READ_ONCE(), so KCSAN complains about this.

Add the missing READ_ONCE(). All other places are covered already.

Fixes: 79f8b28e85 ("timers: Annotate possible non critical data race of next_expiry")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/87a5emyqk0.ffs@tglx
Closes: https://lore.kernel.org/oe-lkp/202410301205.ef8e9743-lkp@intel.com
2024-10-31 11:45:01 +01:00
Frederic Weisbecker
a6347864d9 tick: Remove now unneeded low-res tick stop on CPUHP_AP_TICK_DYING
The generic clockevent layer now detaches and stops the underlying
clockevent from the dying CPU, unifying the tick behaviour for both
periodic and oneshot mode on offline CPUs. There is no more need for
the tick layer to care about that.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241029125451.54574-4-frederic@kernel.org
2024-10-31 10:41:42 +01:00
Frederic Weisbecker
3b1596a21f clockevents: Shutdown and unregister current clockevents at CPUHP_AP_TICK_DYING
The way the clockevent devices are finally stopped while a CPU is
offlining is currently chaotic. The layout being by order:

1) tick_sched_timer_dying() stops the tick and the underlying clockevent
  but only for oneshot case. The periodic tick and its related
  clockevent still runs.

2) tick_broadcast_offline() detaches and stops the per-cpu oneshot
  broadcast and append it to the released list.

3) Some individual clockevent drivers stop the clockevents (a second time if
  the tick is oneshot)

4) Once the CPU is dead, a control CPU remotely detaches and stops
  (a 3rd time if oneshot mode) the CPU clockevent and adds it to the
  released list.

5) The released list containing the broadcast device released on step 2)
   and the remotely detached clockevent from step 4) are unregistered.

These random events can be factorized if the current clockevent is
detached and stopped by the dying CPU at the generic layer, that is
from the dying CPU:

a) Stop the tick
b) Stop/detach the underlying per-cpu oneshot broadcast clockevent
c) Stop/detach the underlying clockevent
d) Release / unregister the clockevents from b) and c)
e) Release / unregister the remaining clockevents from the dying CPU.
   This part could be performed by the dying CPU

This way the drivers and the tick layer don't need to care about
clockevent operations during cpuhotplug down. This also unifies the tick
behaviour on offline CPUs between oneshot and periodic modes, avoiding
offline ticks altogether for sanity.

Adopt the simplification.

[ tglx: Remove the WARN_ON() in clockevents_register_device() as that
  	is called from an upcoming CPU before the CPU is marked online ]

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241029125451.54574-3-frederic@kernel.org
2024-10-31 10:41:42 +01:00
Frederic Weisbecker
17a8945f36 clockevents: Improve clockevents_notify_released() comment
When a new clockevent device is added and replaces a previous device,
the latter is put into the released list. Then the released list is
added back.

This may look counter-intuitive but the reason is that released device
might be suitable for other uses. For example a released CPU regular
clockevent can be a better replacement for the current broadcast event.
Similarly a released broadcast clockevent can be a better replacement
for the current regular clockevent of a given CPU.

Improve comments stating about these subtleties.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241029125451.54574-2-frederic@kernel.org
2024-10-31 10:41:42 +01:00
Thomas Gleixner
1550dde8a5 posix-timers: Add proper state tracking
Right now the state tracking is done by two struct members:

 - it_active:
     A boolean which tracks armed/disarmed state

 - it_signal_seq:
     A sequence counter which is used to invalidate settings
     and prevent rearming

Replace it_active with it_status and keep properly track about the states
in one place.

This allows to reuse it_signal_seq to track reprogramming, disarm and
delete operations in order to drop signals which are related to the state
previous of those operations.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241001083835.670337048@linutronix.de
2024-10-29 11:43:19 +01:00
Thomas Gleixner
cd1e93aeda posix-timers: Rename k_itimer:: It_requeue_pending
Prepare for using this struct member to do a proper reprogramming and
deletion accounting so that stale signals can be dropped.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241001083835.611997737@linutronix.de
2024-10-29 11:43:19 +01:00
Thomas Gleixner
2860d4d315 posix-timers: Drop signal if timer has been deleted or reprogrammed
No point in delivering a signal from the past. POSIX does not specify the
behaviour here:

 - "The effect of disarming or resetting a timer with pending expiration
    notifications is unspecified."

 - "The disposition of pending signals for the deleted timer is unspecified."

In both cases it is reasonable to expect that pending signals are
discarded. Especially in the reprogramming case it does not make sense to
account for previous overruns or to deliver a signal for a timer which has
been disarmed.

Drop the signal as that is conistent and understandable behaviour.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241001083835.553646280@linutronix.de
2024-10-29 11:43:19 +01:00
Thomas Gleixner
c775ea28d4 signal: Allow POSIX timer signals to be dropped
In case that a timer was reprogrammed or deleted an already pending signal
is obsolete. Right now such signals are kept around and eventually
delivered. While POSIX is blury about this:

 - "The effect of disarming or resetting a timer with pending expiration
    notifications is unspecified."

 - "The disposition of pending signals for the deleted timer is
    unspecified."

it is reasonable in both cases to expect that pending signals are discarded
as they have no meaning anymore.

Prepare the signal code to allow dropping posix timer signals.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241001083835.494416923@linutronix.de
2024-10-29 11:43:19 +01:00
Thomas Gleixner
4febce44cf posix-timers: Cure si_sys_private race
The si_sys_private member of the siginfo which is embedded in the
preallocated sigqueue is used by the posix timer code to decide whether a
timer must be reprogrammed on signal delivery.

The handling of this is racy as a long standing comment in that code
documents. It is modified with the timer lock held, but without sighand
lock being held. The actual signal delivery code checks for it under
sighand lock without holding the timer lock.

Hand the new value to send_sigqueue() as argument and store it with sighand
lock held. This is an intermediate change to address this issue.

The arguments to this function will be cleanup in subsequent changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241001083835.434338954@linutronix.de
2024-10-29 11:43:18 +01:00
Thomas Gleixner
68f99be287 signal: Confine POSIX_TIMERS properly
Move the itimer rearming out of the signal code and consolidate all posix
timer related functions in the signal code under one ifdef.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241001083835.314100569@linutronix.de
2024-10-29 11:43:18 +01:00
Miguel Ojeda
92b043fd99 time: Fix references to _msecs_to_jiffies() handling of values
The details about the handling of the "normal" values were moved
to the _msecs_to_jiffies() helpers in commit ca42aaf0c8 ("time:
Refactor msecs_to_jiffies"). However, the same commit still mentioned
__msecs_to_jiffies() in the added documentation.

Thus point to _msecs_to_jiffies() instead.

Fixes: ca42aaf0c8 ("time: Refactor msecs_to_jiffies")
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241025110141.157205-2-ojeda@kernel.org
2024-10-25 19:50:10 +02:00
Miguel Ojeda
b05aefc1f5 time: Partially revert cleanup on msecs_to_jiffies() documentation
The documentation's intention is to compare msecs_to_jiffies() (first
sentence) with __msecs_to_jiffies() (second sentence), which is what the
original documentation did. One of the cleanups in commit f3cb80804b
("time: Fix various kernel-doc problems") may have thought the paragraph
was talking about the latter since that is what it is being documented.

Thus revert that part of the change.

Fixes: f3cb80804b ("time: Fix various kernel-doc problems")
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241025110141.157205-1-ojeda@kernel.org
2024-10-25 19:49:16 +02:00
Anna-Maria Behnsen
147ba94302 timekeeping: Merge timekeeping_update_staged() and timekeeping_update()
timekeeping_update_staged() is the only call site of timekeeping_update().

Merge those functions. No functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-25-554456a44a15@linutronix.de
2024-10-25 19:49:16 +02:00
Anna-Maria Behnsen
0026766dfd timekeeping: Remove TK_MIRROR timekeeping_update() action
All call sites of using TK_MIRROR flag in timekeeping_update() are
gone. The TK_MIRROR dependent code path is therefore dead code.

Remove it along with the TK_MIRROR define.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-24-554456a44a15@linutronix.de
2024-10-25 19:49:15 +02:00
Anna-Maria Behnsen
ae455cb7b8 timekeeping: Rework do_adjtimex() to use shadow_timekeeper
Updates of the timekeeper can be done by operating on the shadow timekeeper
and afterwards copying the result into the real timekeeper. This has the
advantage, that the sequence count write protected region is kept as small
as possible.

Convert do_adjtimex() to use this scheme and take the opportunity to use a
scoped_guard() for locking.

That requires to have a separate function for updating the leap state so
that the update is protected by the sequence count. This also brings the
timekeeper and the shadow timekeeper in sync for this state, which was not
the case so far. That's not a correctness problem as the state is only used
at the read sides which use the real timekeeper, but it's inconsistent
nevertheless.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-23-554456a44a15@linutronix.de
2024-10-25 19:49:15 +02:00
Anna-Maria Behnsen
d05eae8776 timekeeping: Rework timekeeping_suspend() to use shadow_timekeeper
Updates of the timekeeper can be done by operating on the shadow timekeeper
and afterwards copying the result into the real timekeeper. This has the
advantage, that the sequence count write protected region is kept as small
as possible.

While the sequence count held time is not relevant for the resume path as
there is no concurrency, there is no reason to have this function
different than all the other update sites.

Convert timekeeping_inject_offset() to use this scheme and cleanup the
variable declarations while at it.

As halt_fast_timekeeper() does not need protection sequence counter, it is
no problem to move it with this change outside of the sequence counter
protected area. But it still needs to be executed while holding the lock.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-22-554456a44a15@linutronix.de
2024-10-25 19:49:15 +02:00
Anna-Maria Behnsen
b2350d954d timekeeping: Rework timekeeping_resume() to use shadow_timekeeper
Updates of the timekeeper can be done by operating on the shadow timekeeper
and afterwards copying the result into the real timekeeper. This has the
advantage, that the sequence count write protected region is kept as small
as possible.

While the sequence count held time is not relevant for the resume path as
there is no concurrency, there is no reason to have this function
different than all the other update sites.

Convert timekeeping_inject_offset() to use this scheme and cleanup the
variable declaration while at it.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-21-554456a44a15@linutronix.de
2024-10-25 19:49:15 +02:00
Anna-Maria Behnsen
2b473e65de timekeeping: Rework timekeeping_inject_sleeptime64() to use shadow_timekeeper
Updates of the timekeeper can be done by operating on the shadow timekeeper
and afterwards copying the result into the real timekeeper. This has the
advantage, that the sequence count write protected region is kept as small
as possible.

Convert timekeeping_inject_sleeptime64() to use this scheme.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-20-554456a44a15@linutronix.de
2024-10-25 19:49:15 +02:00
Anna-Maria Behnsen
2cab490b41 timekeeping: Rework timekeeping_init() to use shadow_timekeeper
For timekeeping_init() the sequence count write held time is not relevant
and it could keep working on the real timekeeper, but there is no reason to
make it different from other timekeeper updates.

Convert it to operate on the shadow timekeeper.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-19-554456a44a15@linutronix.de
2024-10-25 19:49:15 +02:00
Anna-Maria Behnsen
351619fc99 timekeeping: Rework change_clocksource() to use shadow_timekeeper
Updates of the timekeeper can be done by operating on the shadow timekeeper
and afterwards copying the result into the real timekeeper. This has the
advantage, that the sequence count write protected region is kept as small
as possible.

Convert change_clocksource() to use this scheme.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-18-554456a44a15@linutronix.de
2024-10-25 19:49:15 +02:00
Anna-Maria Behnsen
82214756d3 timekeeping: Rework timekeeping_inject_offset() to use shadow_timekeeper
Updates of the timekeeper can be done by operating on the shadow timekeeper
and afterwards copying the result into the real timekeeper. This has the
advantage, that the sequence count write protected region is kept as small
as possible.

Convert timekeeping_inject_offset() to use this scheme.

That allows to use a scoped_guard() for locking the timekeeper lock as the
usage of the shadow timekeeper allows a rollback in the error case instead
of the full timekeeper update of the original code.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-17-554456a44a15@linutronix.de
2024-10-25 19:49:15 +02:00
Anna-Maria Behnsen
bba9898ef3 timekeeping: Rework do_settimeofday64() to use shadow_timekeeper
Updates of the timekeeper can be done by operating on the shadow timekeeper
and afterwards copying the result into the real timekeeper. This has the
advantage, that the sequence count write protected region is kept as small
as possible.

Convert do_settimeofday64() to use this scheme.

That allows to use a scoped_guard() for locking the timekeeper lock as the
usage of the shadow timekeeper allows a rollback in the error case instead
of the full timekeeper update of the original code.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-16-554456a44a15@linutronix.de
2024-10-25 19:49:14 +02:00
Thomas Gleixner
97e5379253 timekeeping: Provide timekeeping_restore_shadow()
Functions which operate on the real timekeeper, e.g. do_settimeofday(),
have error conditions. If they are hit a full timekeeping update is still
required because the already committed operations modified the timekeeper.

When switching these functions to operate on the shadow timekeeper then the
full update can be avoided in the error case, but the modified shadow
timekeeper has to be restored.

Provide a helper function for that.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-15-554456a44a15@linutronix.de
2024-10-25 19:49:14 +02:00
Anna-Maria Behnsen
6b1ef640f4 timekeeping: Introduce combined timekeeping action flag
Instead of explicitly listing all the separate timekeeping actions flags,
introduce a new one which covers all actions except TK_MIRROR action.

No functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-14-554456a44a15@linutronix.de
2024-10-25 19:49:14 +02:00
Anna-Maria Behnsen
5aa6c43eca timekeeping: Split out timekeeper update of timekeeping_advanced()
timekeeping_advance() is the only optimized function which uses
shadow_timekeeper for updating the real timekeeper to keep the sequence
counter protected region as small as possible.

To be able to transform timekeeper updates in other functions to use the
same logic, split out functionality into a separate function
timekeeper_update_staged().

While at it, document the reason why the sequence counter must be write
held over the call to timekeeping_update() and the copying to the real
timekeeper and why using a pointer based update is suboptimal.

No functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-13-554456a44a15@linutronix.de
2024-10-25 19:49:14 +02:00
Anna-Maria Behnsen
1d72d7b5fd timekeeping: Add struct tk_data as argument to timekeeping_update()
Updates of the timekeeper are done in two ways:

 1. Updating timekeeper and afterwards memcpy()'ing the result into
    shadow_timekeeper using timekeeping_update(). Used everywhere for
    updates except in timekeeping_advance(); the sequence counter protected
    region starts before the first change to the timekeeper is done.

 2. Updating shadow_timekeeper and then memcpy()'ing the result into
    timekeeper.  Used only by in timekeeping_advance(); The seqence counter
    protected region is only around timekeeping_update() and the memcpy for
    copy from shadow to timekeeper.

The second option is fast path optimized. The sequence counter protected
region is as short as possible.

As this behaviour is mainly documented by commit messages, but not in code,
it makes the not easy timekeeping code more complicated to read.

There is no reason why updates to the timekeeper can't use the optimized
version everywhere. With this, the code will be cleaner, as code is reused
instead of duplicated.

To be able to access tk_data which contains all required information, add a
pointer to tk_data as an argument to timekeeping_update(). With that
convert the comment about holding the lock into a lockdep assert.

No functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-12-554456a44a15@linutronix.de
2024-10-25 19:49:14 +02:00
Anna-Maria Behnsen
a5f9e4e4ef timekeeping: Introduce tkd_basic_setup() to make lock and seqcount init reusable
Initialization of lock and seqcount needs to be done for every instance of
timekeeper struct. To be able to easily reuse it, create a separate
function for it.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-11-554456a44a15@linutronix.de
2024-10-25 19:49:14 +02:00
Anna-Maria Behnsen
10f7c178a9 timekeeping: Define a struct type for tk_core to make it reusable
The struct tk_core uses is not reusable. As long as there is only a single
timekeeper, this is not a problem. But when the timekeeper infrastructure
will be reused for per ptp clock timekeepers, an explicit struct type is
required.

Define struct tk_data as explicit struct type for tk_core.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-10-554456a44a15@linutronix.de
2024-10-25 19:49:14 +02:00
Anna-Maria Behnsen
8c4799b184 timekeeping: Move timekeeper_lock into tk_core
timekeeper_lock protects updates to struct tk_core but is not part of
struct tk_core. As long as there is only a single timekeeper, this is not a
problem. But when the timekeeper infrastructure will be reused for per ptp
clock timekeepers, timekeeper_lock needs to be part of tk_core.

Move the lock into tk_core, move initialisation of the lock and sequence
counter into timekeeping_init() and update all users of timekeeper_lock.

As this is touching all lock sites, convert them to use:

  guard(raw_spinlock_irqsave)(&tk_core.lock);

instead of lock/unlock functions whenever possible.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-9-554456a44a15@linutronix.de
2024-10-25 19:49:14 +02:00
Thomas Gleixner
dbdcf8c4ca timekeeping: Encapsulate locking/unlocking of timekeeper_lock
timekeeper_lock protects updates of timekeeper (tk_core). It is also used
by vdso_update_begin/end() and not only internally by the timekeeper code.

As long as there is only a single timekeeper, this works fine.  But when
the timekeeper infrastructure will be reused for per ptp clock timekeepers,
timekeeper_lock needs to be part of tk_core..

Therefore encapuslate locking/unlocking of timekeeper_lock and make the
lock static.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-8-554456a44a15@linutronix.de
2024-10-25 19:49:13 +02:00
Thomas Gleixner
20c7b582e8 timekeeping: Move shadow_timekeeper into tk_core
tk_core requires shadow_timekeeper to allow timekeeping_advance() updating
without holding the timekeeper sequence count write locked. This allows the
readers to make progress up to the actual update where the shadow
timekeeper is copied over to the real timekeeper.

As long as there is only a single timekeeper, having them separate is
fine. But when the timekeeper infrastructure will be reused for per ptp
clock timekeepers, shadow_timekeeper needs to be part of tk_core.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-7-554456a44a15@linutronix.de
2024-10-25 19:49:13 +02:00
Thomas Gleixner
c2a329566a timekeeping: Simplify code in timekeeping_advance()
timekeeping_advance() takes the timekeeper_lock and releases it before
returning. When an early return is required, goto statements are used to
make sure the lock is realeased properly. When the code was written the
locking guard() was not yet available.

Use the guard() to simplify the code and while at it cleanup ordering of
function variables. No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-5-554456a44a15@linutronix.de
2024-10-25 19:49:13 +02:00
Thomas Gleixner
1f7226b1e7 timekeeping: Abort clocksource change in case of failure
There is no point to go through a full timekeeping update when acquiring a
module reference or enabling the new clocksource fails.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-4-554456a44a15@linutronix.de
2024-10-25 19:49:13 +02:00
Anna-Maria Behnsen
9fe7d9a984 timekeeping: Avoid duplicate leap state update
do_adjtimex() invokes tk_update_leap_state() unconditionally even when a
previous invocation of timekeeping_update() already did that update.

Put it into the else path which is invoked when timekeeping_update() is not
called.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-3-554456a44a15@linutronix.de
2024-10-25 19:49:13 +02:00
Thomas Gleixner
886150fb4f timekeeping: Don't stop time readers across hard_pps() update
hard_pps() update does not modify anything which might be required by time
readers so forcing readers out of the way during the update is a pointless
exercise.

The interaction with adjtimex() and timekeeper updates which call into the
NTP code is properly serialized by timekeeper_lock.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-2-554456a44a15@linutronix.de
2024-10-25 19:49:13 +02:00
Thomas Gleixner
14f1e3b3df timekeeping: Read NTP tick length only once
No point in reading it a second time when the comparison fails.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241009-devel-anna-maria-b4-timers-ptp-timekeeping-v2-1-554456a44a15@linutronix.de
2024-10-25 19:49:12 +02:00
Linus Torvalds
d44cd82264 Including fixes from netfiler, xfrm and bluetooth.
Current release - regressions:
 
   - posix-clock: Fix unbalanced locking in pc_clock_settime()
 
   - netfilter: fix typo causing some targets not to load on IPv6
 
 Current release - new code bugs:
 
   - xfrm: policy: remove last remnants of pernet inexact list
 
 Previous releases - regressions:
 
   - core: fix races in netdev_tx_sent_queue()/dev_watchdog()
 
   - bluetooth: fix UAF on sco_sock_timeout
 
   - eth: hv_netvsc: fix VF namespace also in synthetic NIC NETDEV_REGISTER event
 
   - eth: usbnet: fix name regression
 
   - eth: be2net: fix potential memory leak in be_xmit()
 
   - eth: plip: fix transmit path breakage
 
 Previous releases - always broken:
 
   - sched: deny mismatched skip_sw/skip_hw flags for actions created by classifiers
 
   - netfilter: bpf: must hold reference on net namespace
 
   - eth: virtio_net: fix integer overflow in stats
 
   - eth: bnxt_en: replace ptp_lock with irqsave variant
 
   - eth: octeon_ep: add SKB allocation failures handling in __octep_oq_process_rx()
 
 Misc:
 
   - MAINTAINERS: add Simon as an official reviewer
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmcaTkUSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkW8kP/iYfaxQ8zR61wUU7bOcVUSnEADR9XQ1H
 Nta5Z0tDJprZv254XW3hYDzU0Iy3OgclRE1oewF5fQVLn6Sfg4U5awxRTNdJw7KV
 wj62ziAv/xht2W/4nBsNfYkOZaDAibItbKtxlkOhgCGXSrXBoS22IonKRqEv2HLV
 Gu0vAY/VI9YNvB5Z6SEKFmQp2bWfX79AChVT72shLBLakOCUHBavk/DOU56XH1Ci
 IRmU5Lt8ysXWxCTF91rPCAbMyuxBbIv6phIKPV2ALpRUd6ha5nBqcl0wcS7Y1E+/
 0XOV71zjcXFoE/6hc5W3/mC7jm+ipXKVJOnIkCcWq40p6kDVJJ+E1RWEr5JxGEyF
 FtnUCZ8iK/F3/jSalMras2z+AZ/CGtfHF9wAS3YfMGtOJJb/k4dCxAddp7UzD9O4
 yxAJhJ0DrVuplzwovL5owoJJXeRAMQeFydzHBYun5P8Sc9TtvviICi19fMgKGn4O
 eUQhjgZZY371sPnTDLDEw1Oqzs9qeaeV3S2dSeFJ98PQuPA5KVOf/R2/CptBIMi5
 +UNcqeXrlUeYSBW94pPioEVStZDrzax5RVKh/Jo1tTnKzbnWDOOKZqSVsGPMWXdO
 0aBlGuSsNe36VDg2C0QMxGk7+gXbKmk9U4+qVQH3KMpB8uqdAu5deMbTT6dfcwBV
 O/BaGiqoR4ak
 =dR3Q
 -----END PGP SIGNATURE-----

Merge tag 'net-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from netfiler, xfrm and bluetooth.

  Oddly this includes a fix for a posix clock regression; in our
  previous PR we included a change there as a pre-requisite for
  networking one. That fix proved to be buggy and requires the follow-up
  included here. Thomas suggested we should send it, given we sent the
  buggy patch.

  Current release - regressions:

   - posix-clock: Fix unbalanced locking in pc_clock_settime()

   - netfilter: fix typo causing some targets not to load on IPv6

  Current release - new code bugs:

   - xfrm: policy: remove last remnants of pernet inexact list

  Previous releases - regressions:

   - core: fix races in netdev_tx_sent_queue()/dev_watchdog()

   - bluetooth: fix UAF on sco_sock_timeout

   - eth: hv_netvsc: fix VF namespace also in synthetic NIC
     NETDEV_REGISTER event

   - eth: usbnet: fix name regression

   - eth: be2net: fix potential memory leak in be_xmit()

   - eth: plip: fix transmit path breakage

  Previous releases - always broken:

   - sched: deny mismatched skip_sw/skip_hw flags for actions created by
     classifiers

   - netfilter: bpf: must hold reference on net namespace

   - eth: virtio_net: fix integer overflow in stats

   - eth: bnxt_en: replace ptp_lock with irqsave variant

   - eth: octeon_ep: add SKB allocation failures handling in
     __octep_oq_process_rx()

  Misc:

   - MAINTAINERS: add Simon as an official reviewer"

* tag 'net-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (40 commits)
  net: dsa: mv88e6xxx: support 4000ps cycle counter period
  net: dsa: mv88e6xxx: read cycle counter period from hardware
  net: dsa: mv88e6xxx: group cycle counter coefficients
  net: usb: qmi_wwan: add Fibocom FG132 0x0112 composition
  hv_netvsc: Fix VF namespace also in synthetic NIC NETDEV_REGISTER event
  net: dsa: microchip: disable EEE for KSZ879x/KSZ877x/KSZ876x
  Bluetooth: ISO: Fix UAF on iso_sock_timeout
  Bluetooth: SCO: Fix UAF on sco_sock_timeout
  Bluetooth: hci_core: Disable works on hci_unregister_dev
  posix-clock: posix-clock: Fix unbalanced locking in pc_clock_settime()
  r8169: avoid unsolicited interrupts
  net: sched: use RCU read-side critical section in taprio_dump()
  net: sched: fix use-after-free in taprio_change()
  net/sched: act_api: deny mismatched skip_sw/skip_hw flags for actions created by classifiers
  net: usb: usbnet: fix name regression
  mlxsw: spectrum_router: fix xa_store() error checking
  virtio_net: fix integer overflow in stats
  net: fix races in netdev_tx_sent_queue()/dev_watchdog()
  net: wwan: fix global oob in wwan_rtnl_policy
  netfilter: xtables: fix typo causing some targets not to load on IPv6
  ...
2024-10-24 16:43:50 -07:00
Julia Lawall
2e529e637c posix-timers: Replace call_rcu() by kfree_rcu() for simple kmem_cache_free() callback
Since SLOB was removed and since commit 6c6c47b063 ("mm, slab: call
kvfree_rcu_barrier() from kmem_cache_destroy()"), it is not longer
necessary to use call_rcu() when the callback only performs
kmem_cache_free(). Use kfree_rcu() directly.

The changes were made using Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Link: https://lore.kernel.org/all/20241013201704.49576-12-Julia.Lawall@inria.fr
2024-10-24 11:22:54 +02:00
Jinjie Ruan
6e62807c7f posix-clock: posix-clock: Fix unbalanced locking in pc_clock_settime()
If get_clock_desc() succeeds, it calls fget() for the clockid's fd,
and get the clk->rwsem read lock, so the error path should release
the lock to make the lock balance and fput the clockid's fd to make
the refcount balance and release the fd related resource.

However the below commit left the error path locked behind resulting in
unbalanced locking. Check timespec64_valid_strict() before
get_clock_desc() to fix it, because the "ts" is not changed
after that.

Fixes: d8794ac20a ("posix-clock: Fix missing timespec64 check in pc_clock_settime()")
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Acked-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
[pabeni@redhat.com: fixed commit message typo]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-23 16:05:01 +02:00
Linus Torvalds
2b4d25010d - Add PREEMPT_RT maintainers
- Fix another aspect of delayed dequeued tasks wrt determining their state,
   i.e., whether they're runnable or blocked
 
 - Handle delayed dequeued tasks and their migration wrt PSI properly
 
 - Fix the situation where a delayed dequeue task gets enqueued into a new
   class, which should not happen
 
 - Fix a case where memory allocation would happen while the runqueue lock is
   held, which is a no-no
 
 - Do not over-schedule when tasks with shorter slices preempt the currently
   running task
 
 - Make sure delayed to deque entities are properly handled before unthrottling
 
 - Other smaller cleanups and improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmcU3tMACgkQEsHwGGHe
 VUqpuhAAqqyi2NNgrIOlEWh/Ej4NQZL7KleF84cSpKCIBK2somYX5ksgMcUgn82i
 bIuDVErQu/a4lhNAf5zn7TO3yuPA1Q5xd/453qBlWM9ApkH0S69Mp9f0yocVu8F0
 t3XsgXm+/R8A4TYbiv8cB+r1Xt8E5NUP6RkNIKCHbPLAG94gqYF8UZEJ9sAl9ZXw
 qEWc9afpnp+4LQ9PlzePuaM976LWUPB49OoFZMnFmY1VkvFuVjkjXSVzJX6l4qB7
 Omo/+TXOOBSHXVVflNx/68Q16irFHAnqwPPrLCBQWBLIPz3iRiZjV9ptD9tUZkRM
 M+klL7w0jRG+8wa9fTwuqybmBNIBt4Az1/WUw9Lc3ryEWRsCKzkGT8au3lv5FpQY
 CTwIIBSMmUcqQSG40R0HHS3nDR4UBFFD0PAww+8cJQZc0IPd2rT9/hfqYdt3sq2Z
 vV9rmTFOcDlApeDdCGcfC7zJhdgVuBgDVjdTsE5SNRUduBUsBYOeLDnT+0Qi0ArJ
 txVINGxQDm6jz512f4CAB/xzUcYpU4o639Z1Jkd6a8QbO1NBZGX1ioOcvPEMhmFF
 f/qFyM8ctR5Kj6LJCZiDcstqtAZviW1d2uMp48gk2QeSvkCyIUQqrWshItd02iBG
 TZdSYRvSYtYSIz7WYtE/CABUDmrJGjuLtb+jOrR93//TsWwwVdE=
 =1D7H
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduling fixes from Borislav Petkov:

 - Add PREEMPT_RT maintainers

 - Fix another aspect of delayed dequeued tasks wrt determining their
   state, i.e., whether they're runnable or blocked

 - Handle delayed dequeued tasks and their migration wrt PSI properly

 - Fix the situation where a delayed dequeue task gets enqueued into a
   new class, which should not happen

 - Fix a case where memory allocation would happen while the runqueue
   lock is held, which is a no-no

 - Do not over-schedule when tasks with shorter slices preempt the
   currently running task

 - Make sure delayed to deque entities are properly handled before
   unthrottling

 - Other smaller cleanups and improvements

* tag 'sched_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  MAINTAINERS: Add an entry for PREEMPT_RT.
  sched/fair: Fix external p->on_rq users
  sched/psi: Fix mistaken CPU pressure indication after corrupted task state bug
  sched/core: Dequeue PSI signals for blocked tasks that are delayed
  sched: Fix delayed_dequeue vs switched_from_fair()
  sched/core: Disable page allocation in task_tick_mm_cid()
  sched/deadline: Use hrtick_enabled_dl() before start_hrtick_dl()
  sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running
  sched: Fix sched_delayed vs cfs_bandwidth
2024-10-20 11:30:56 -07:00
Anna-Maria Behnsen
6279abf16a timers: Add a warning to usleep_range_state() for wrong order of arguments
There is a warning in checkpatch script that triggers, when min and max
arguments of usleep_range_state() are in reverse order. This check does
only cover callsites which uses constants. Add this check into the code as
a WARN_ON_ONCE() to also cover callsites not using constants and fix the
mis-usage by resetting the delta to 0.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241014-devel-anna-maria-b4-timers-flseep-v3-9-dc8b907cb62f@linutronix.de
2024-10-16 00:36:47 +02:00
Anna-Maria Behnsen
f36eb17141 timers: Update function descriptions of sleep/delay related functions
A lot of commonly used functions for inserting a sleep or delay lack a
proper function description. Add function descriptions to all of them to
have important information in a central place close to the code.

No functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241014-devel-anna-maria-b4-timers-flseep-v3-5-dc8b907cb62f@linutronix.de
2024-10-16 00:36:47 +02:00
Anna-Maria Behnsen
cf5b6ef0c3 timers: Update schedule_[hr]timeout*() related function descriptions
schedule_timeout*() functions do not have proper kernel-doc formatted
function descriptions. schedule_hrtimeout() and schedule_hrtimeout_range()
have a almost identical description.

Add missing function descriptions. Remove copy of function description and
add a pointer to the existing description instead.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241014-devel-anna-maria-b4-timers-flseep-v3-3-dc8b907cb62f@linutronix.de
2024-10-16 00:36:46 +02:00
Anna-Maria Behnsen
da7bd0a9e0 timers: Move *sleep*() and timeout functions into a separate file
All schedule_timeout() and *sleep*() related functions are interfaces on
top of timer list timers and hrtimers to add a sleep to the code. As they
are built on top of the timer list timers and hrtimers, the [hr]timer
interfaces are already used except when queuing the timer in
schedule_timeout(). But there exists the appropriate interface add_timer()
which does the same job with an extra check for an already pending timer.

Split all those functions as they are into a separate file and use
add_timer() instead of __mod_timer() in schedule_timeout().

While at it fix minor formatting issues and a multi line printk function
call in schedule_timeout().

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241014-devel-anna-maria-b4-timers-flseep-v3-2-dc8b907cb62f@linutronix.de
2024-10-16 00:36:46 +02:00
Wang Jinchao
a849881a9e time: Remove '%' from numeric constant in kernel-doc comment
Change %0 to 0 in kernel-doc comments. %0 is not valid.

Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241009022135.92400-2-wangjinchao@xfusion.com
2024-10-16 00:30:26 +02:00
Thomas Weißschuh
39c089a01a vdso: Remove timekeeper argument of __arch_update_vsyscall()
No implementation of this hook uses the passed in timekeeper anymore.

This avoids including a non-VDSO header while building the VDSO, which can
lead to compilation errors.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/all/20241010-vdso-generic-arch_update_vsyscall-v1-1-7fe5a3ea4382@linutronix.de
2024-10-15 17:50:28 +02:00
Jinjie Ruan
d8794ac20a posix-clock: Fix missing timespec64 check in pc_clock_settime()
As Andrew pointed out, it will make sense that the PTP core
checked timespec64 struct's tv_sec and tv_nsec range before calling
ptp->info->settime64().

As the man manual of clock_settime() said, if tp.tv_sec is negative or
tp.tv_nsec is outside the range [0..999,999,999], it should return EINVAL,
which include dynamic clocks which handles PTP clock, and the condition is
consistent with timespec64_valid(). As Thomas suggested, timespec64_valid()
only check the timespec is valid, but not ensure that the time is
in a valid range, so check it ahead using timespec64_valid_strict()
in pc_clock_settime() and return -EINVAL if not valid.

There are some drivers that use tp->tv_sec and tp->tv_nsec directly to
write registers without validity checks and assume that the higher layer
has checked it, which is dangerous and will benefit from this, such as
hclge_ptp_settime(), igb_ptp_settime_i210(), _rcar_gen4_ptp_settime(),
and some drivers can remove the checks of itself.

Cc: stable@vger.kernel.org
Fixes: 0606f422b4 ("posix clocks: Introduce dynamic clocks")
Acked-by: Richard Cochran <richardcochran@gmail.com>
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Link: https://patch.msgid.link/20241009072302.1754567-2-ruanjinjie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14 17:22:43 -07:00
Peter Zijlstra
cd9626e9eb sched/fair: Fix external p->on_rq users
Sean noted that ever since commit 152e11f6df ("sched/fair: Implement
delayed dequeue") KVM's preemption notifiers have started
mis-classifying preemption vs blocking.

Notably p->on_rq is no longer sufficient to determine if a task is
runnable or blocked -- the aforementioned commit introduces tasks that
remain on the runqueue even through they will not run again, and
should be considered blocked for many cases.

Add the task_is_runnable() helper to classify things and audit all
external users of the p->on_rq state. Also add a few comments.

Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Reported-by: Sean Christopherson <seanjc@google.com>
Tested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20241010091843.GK33184@noisy.programming.kicks-ass.net
2024-10-14 09:14:35 +02:00
Dr. David Alan Gilbert
bafffd56c6 clocksource: Remove unused clocksource_change_rating
clocksource_change_rating() has been unused since 2017's commit
63ed4e0c67 ("Drivers: hv: vmbus: Consolidate all Hyper-V specific clocksource code")

Remove it.

__clocksource_change_rating now only has one use which is ifdef'd.
Move it into the ifdef'd section.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241010135446.213098-1-linux@treblig.org
2024-10-10 21:03:10 +02:00
Jeff Layton
2a15385742
timekeeping: Add percpu counter for tracking floor swap events
The mgtime_floor value is a global variable for tracking the latest
fine-grained timestamp handed out. Because it's a global, track the
number of times that a new floor value is assigned.

Add a new percpu counter to the timekeeping code to track the number of
floor swap events that have occurred. A later patch will add a debugfs
file to display this counter alongside other stats involving multigrain
timestamps.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits
Link: https://lore.kernel.org/all/20241002-mgtime-v10-2-d1c4717f5284@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-10 10:20:46 +02:00
Jeff Layton
ee3283c608
timekeeping: Add interfaces for handling timestamps with a floor value
Multigrain timestamps allow the kernel to use fine-grained timestamps when
an inode's attributes is being actively observed via ->getattr().  With
this support, it's possible for a file to get a fine-grained timestamp, and
another modified after it to get a coarse-grained stamp that is earlier
than the fine-grained time.  If this happens then the files can appear to
have been modified in reverse order, which breaks VFS ordering guarantees
[1].

To prevent this, maintain a floor value for multigrain timestamps.
Whenever a fine-grained timestamp is handed out, record it, and when later
coarse-grained stamps are handed out, ensure they are not earlier than that
value. If the coarse-grained timestamp is earlier than the fine-grained
floor, return the floor value instead.

Add a static singleton atomic64_t into timekeeper.c that is used to keep
track of the latest fine-grained time ever handed out. This is tracked as a
monotonic ktime_t value to ensure that it isn't affected by clock
jumps. Because it is updated at different times than the rest of the
timekeeper object, the floor value is managed independently of the
timekeeper via a cmpxchg() operation, and sits on its own cacheline.

Add two new public interfaces:

- ktime_get_coarse_real_ts64_mg() fills a timespec64 with the later of the
  coarse-grained clock and the floor time

- ktime_get_real_ts64_mg() gets the fine-grained clock value, and tries
  to swap it into the floor. A timespec64 is filled with the result.

The floor value is global and updated via a single try_cmpxchg(). If
that fails then the operation raced with a concurrent update. Any
concurrent update must be later than the existing floor value, so any
racing tasks can accept any resulting floor value without retrying.

[1]: POSIX requires that files be stamped with realtime clock values, and
     makes no provision for dealing with backward clock jumps. If a backward
     realtime clock jump occurs, then files can appear to have been modified
     in reverse order.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241002-mgtime-v10-1-d1c4717f5284@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-10 10:20:46 +02:00
Thomas Gleixner
b7f6d3a09d Merge branch 'timers/vfs' into timers/core
Pick up the VFS specific interfaces so further timekeeping changes can be
based on them.
2024-10-06 21:00:01 +02:00
Jeff Layton
96f9a366ec timekeeping: Add percpu counter for tracking floor swap events
The mgtime_floor value is a global variable for tracking the latest
fine-grained timestamp handed out. Because it's a global, track the
number of times that a new floor value is assigned.

Add a new percpu counter to the timekeeping code to track the number of
floor swap events that have occurred. A later patch will add a debugfs
file to display this counter alongside other stats involving multigrain
timestamps.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits
Link: https://lore.kernel.org/all/20241002-mgtime-v10-2-d1c4717f5284@kernel.org
2024-10-06 20:56:07 +02:00
Jeff Layton
70c8fd00a9 timekeeping: Add interfaces for handling timestamps with a floor value
Multigrain timestamps allow the kernel to use fine-grained timestamps when
an inode's attributes is being actively observed via ->getattr().  With
this support, it's possible for a file to get a fine-grained timestamp, and
another modified after it to get a coarse-grained stamp that is earlier
than the fine-grained time.  If this happens then the files can appear to
have been modified in reverse order, which breaks VFS ordering guarantees
[1].

To prevent this, maintain a floor value for multigrain timestamps.
Whenever a fine-grained timestamp is handed out, record it, and when later
coarse-grained stamps are handed out, ensure they are not earlier than that
value. If the coarse-grained timestamp is earlier than the fine-grained
floor, return the floor value instead.

Add a static singleton atomic64_t into timekeeper.c that is used to keep
track of the latest fine-grained time ever handed out. This is tracked as a
monotonic ktime_t value to ensure that it isn't affected by clock
jumps. Because it is updated at different times than the rest of the
timekeeper object, the floor value is managed independently of the
timekeeper via a cmpxchg() operation, and sits on its own cacheline.

Add two new public interfaces:

- ktime_get_coarse_real_ts64_mg() fills a timespec64 with the later of the
  coarse-grained clock and the floor time

- ktime_get_real_ts64_mg() gets the fine-grained clock value, and tries
  to swap it into the floor. A timespec64 is filled with the result.

The floor value is global and updated via a single try_cmpxchg(). If
that fails then the operation raced with a concurrent update. Any
concurrent update must be later than the existing floor value, so any
racing tasks can accept any resulting floor value without retrying.

[1]: POSIX requires that files be stamped with realtime clock values, and
     makes no provision for dealing with backward clock jumps. If a backward
     realtime clock jump occurs, then files can appear to have been modified
     in reverse order.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241002-mgtime-v10-1-d1c4717f5284@kernel.org
2024-10-06 20:56:07 +02:00
Jeff Layton
8c111f1b96 timekeeping: Don't use seqcount loop in ktime_mono_to_any() on 64-bit systems
ktime_mono_to_any() only fetches the offset inside the loop. This is a
single word on 64-bit CPUs, and seqcount_read_begin() implies a full SMP
barrier.

Use READ_ONCE() to fetch the offset instead of doing a seqcount loop on
64-bit and add the matching WRITE_ONCE()'s to update the offsets in
tk_set_wall_to_mono() and tk_update_sleep_time().

[ tglx: Get rid of the #ifdeffery ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240910-mgtime-v3-1-84406ed53fad@kernel.org
2024-10-02 18:06:03 +02:00
Thomas Gleixner
b98b276873 Merge branch 'timers/kvm' into timers/core
Bring in the update which is provided to arm64/kvm so subsequent
timekeeping work does not conflict.
2024-10-02 17:11:56 +02:00
Vincent Donnefort
8102c4daf4 timekeeping: Add the boot clock to system time snapshot
For tracing purpose, the boot clock is interesting as it doesn't stop on
suspend. Export it as part of the time snapshot. This will later allow
the hypervisor to add boot clock timestamps to its events.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911093029.3279154-5-vdonnefort@google.com
2024-10-02 17:10:41 +02:00
Thomas Gleixner
6fadb4a61d ntp: Move pps monitors into ntp_data
Finalize the conversion from static variables to struct based data.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-21-2d52f4e13476@linutronix.de
2024-10-02 16:53:41 +02:00
Thomas Gleixner
12850b4658 ntp: Move pps_freq/stabil into ntp_data
Continue the conversion from static variables to struct based data.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-20-2d52f4e13476@linutronix.de
2024-10-02 16:53:41 +02:00
Thomas Gleixner
b1c89a762f ntp: Move pps_shift/intcnt into ntp_data
Continue the conversion from static variables to struct based data.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-19-2d52f4e13476@linutronix.de
2024-10-02 16:53:40 +02:00
Thomas Gleixner
db45e9bce8 ntp: Move pps_fbase into ntp_data
Continue the conversion from static variables to struct based data.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-18-2d52f4e13476@linutronix.de
2024-10-02 16:53:40 +02:00
Thomas Gleixner
9d7130dfc0 ntp: Move pps_jitter into ntp_data
Continue the conversion from static variables to struct based data.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-17-2d52f4e13476@linutronix.de
2024-10-02 16:53:40 +02:00
Thomas Gleixner
5cc953b8ae ntp: Move pps_ft into ntp_data
Continue the conversion from static variables to struct based data.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-16-2d52f4e13476@linutronix.de
2024-10-02 16:53:40 +02:00
Thomas Gleixner
931a177f70 ntp: Move pps_valid into ntp_data
Continue the conversion from static variables to struct based data.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-15-2d52f4e13476@linutronix.de
2024-10-02 16:53:40 +02:00
Thomas Gleixner
75d956b947 ntp: Move ntp_next_leap_sec into ntp_data
Continue the conversion from static variables to struct based data.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-14-2d52f4e13476@linutronix.de
2024-10-02 16:53:40 +02:00
Thomas Gleixner
bb6400a298 ntp: Move time_adj/ntp_tick_adj into ntp_data
Continue the conversion from static variables to struct based data.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-13-2d52f4e13476@linutronix.de
2024-10-02 16:53:40 +02:00
Thomas Gleixner
161b8ec281 ntp: Move time_freq/reftime into ntp_data
Continue the conversion from static variables to struct based data.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-12-2d52f4e13476@linutronix.de
2024-10-02 16:53:40 +02:00
Thomas Gleixner
7891cf2961 ntp: Move time_max/esterror into ntp_data
Continue the conversion from static variables to struct based data.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-11-2d52f4e13476@linutronix.de
2024-10-02 16:53:39 +02:00
Thomas Gleixner
d51435548e ntp: Move time_offset/constant into ntp_data
Continue the conversion from static variables to struct based data.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-10-2d52f4e13476@linutronix.de
2024-10-02 16:53:39 +02:00
Thomas Gleixner
bee18a2301 ntp: Move tick_stat* into ntp_data
Continue the conversion from static variables to struct based data.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-9-2d52f4e13476@linutronix.de
2024-10-02 16:53:39 +02:00
Thomas Gleixner
ec93ec22aa ntp: Move tick_length* into ntp_data
Continue the conversion from static variables to struct based data.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-8-2d52f4e13476@linutronix.de
2024-10-02 16:53:39 +02:00
Thomas Gleixner
68f66f97c5 ntp: Introduce struct ntp_data
All NTP data is held in static variables. That prevents the NTP code from
being reuasble for non-system time timekeepers, e.g. per PTP clock
timekeeping.

Introduce struct ntp_data and move tick_usec into it for a start.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-7-2d52f4e13476@linutronix.de
2024-10-02 16:53:39 +02:00
Thomas Gleixner
136bccbc2e ntp: Read reference time only once
The reference time is required twice in ntp_update_offset(). It will not
change in the meantime as the calling code holds the timekeeper lock. Read
it only once and store it into a local variable.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-6-2d52f4e13476@linutronix.de
2024-10-02 16:53:39 +02:00
Thomas Gleixner
48c3c65f64 ntp: Convert functions with only two states to bool
is_error_status() and ntp_synced() return whether a state is set or
not. Both functions use unsigned int for it even if it would be a perfect
job for a bool.

Use bool instead of unsigned int. And while at it, move ntp_synced()
function to the place where it is used.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-5-2d52f4e13476@linutronix.de
2024-10-02 16:53:39 +02:00
Anna-Maria Behnsen
38007dc032 ntp: Cleanup formatting of code
Code is partially formatted in a creative way which makes reading
harder. Examples are function calls over several lines where the
indentation does not start at the same height then the open bracket after
the function name.

Improve formatting but do not make a functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-4-2d52f4e13476@linutronix.de
2024-10-02 16:53:38 +02:00
Thomas Gleixner
a0581cdb2e ntp: Clean up comments
Usage of different comment formatting makes fast reading and parsing the
code harder. There are several multi-line comments which do not follow the
coding style by starting with a line only containing '/*'. There are also
comments which do not start with capitals.

Clean up all those comments to be consistent and remove comments which
document the obvious.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-3-2d52f4e13476@linutronix.de
2024-10-02 16:53:38 +02:00
Thomas Gleixner
66606a9384 ntp: Make tick_usec static
There are no users of tick_usec outside of the NTP core code. Therefore
make tick_usec static.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-2-2d52f4e13476@linutronix.de
2024-10-02 16:53:38 +02:00
Thomas Gleixner
a849a0273d ntp: Remove unused tick_nsec
tick_nsec is only updated in the NTP core, but there are no users.

Remove it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20240911-devel-anna-maria-b4-timers-ptp-ntp-v1-1-2d52f4e13476@linutronix.de
2024-10-02 16:53:38 +02:00
Al Viro
cb787f4ac0 [tree-wide] finally take no_llseek out
no_llseek had been defined to NULL two years ago, in commit 868941b144
("fs: remove no_llseek")

To quote that commit,

  At -rc1 we'll need do a mechanical removal of no_llseek -

  git grep -l -w no_llseek | grep -v porting.rst | while read i; do
	sed -i '/\<no_llseek\>/d' $i
  done

  would do it.

Unfortunately, that hadn't been done.  Linus, could you do that now, so
that we could finally put that thing to rest? All instances are of the
form
	.llseek = no_llseek,
so it's obviously safe.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-27 08:18:43 -07:00
Linus Torvalds
2004cef11e In the v6.12 scheduler development cycle we had 63 commits from 18 contributors:
- Implement the SCHED_DEADLINE server infrastructure - Daniel Bristot de Oliveira's
    last major contribution to the kernel:
 
      "SCHED_DEADLINE servers can help fixing starvation issues of low priority
      tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU
      cycles. Today we have RT Throttling; DEADLINE servers should be able to
      replace and improve that."
 
      (Daniel Bristot de Oliveira, Peter Zijlstra, Joel Fernandes,
       Youssef Esmat, Huang Shijie)
 
  - Preparatory changes for sched_ext integration:
 
      - Use set_next_task(.first) where required
      - Fix up set_next_task() implementations
      - Clean up DL server vs. core sched
      - Split up put_prev_task_balance()
      - Rework pick_next_task()
      - Combine the last put_prev_task() and the first set_next_task()
      - Rework dl_server
      - Add put_prev_task(.next)
 
       (Peter Zijlstra, with a fix by Tejun Heo)
 
  - Complete the EEVDF transition and refine EEVDF scheduling:
 
      - Implement delayed dequeue
      - Allow shorter slices to wakeup-preempt
      - Use sched_attr::sched_runtime to set request/slice suggestion
      - Document the new feature flags
      - Remove unused and duplicate-functionality fields
      - Simplify & unify pick_next_task_fair()
      - Misc debuggability enhancements
 
       (Peter Zijlstra, with fixes/cleanups by Dietmar Eggemann,
        Valentin Schneider and Chuyi Zhou)
 
  - Initialize the vruntime of a new task when it is first enqueued,
    resulting in significant decrease in latency of newly woken tasks.
    (Zhang Qiao)
 
  - Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
    (K Prateek Nayak, Peter Zijlstra)
 
  - Clean up and clarify the usage of Clean up usage of rt_task()
    (Qais Yousef)
 
  - Preempt SCHED_IDLE entities in strict cgroup hierarchies
    (Tianchen Ding)
 
  - Clarify the documentation of time units for deadline scheduler
    parameters. (Christian Loehle)
 
  - Remove the HZ_BW chicken-bit feature flag introduced a year ago,
    the original change seems to be working fine.
    (Phil Auld)
 
  - Misc fixes and cleanups (Chen Yu, Dan Carpenter, Huang Shijie,
    Peilin He, Qais Yousefm and Vincent Guittot)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmbr8qcRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gdbw/+Mj3zWfYP+dtUkfgrR2FClPAJoo1/9Dz0
 LYD8XgYHu8rEJ0Aq+VbdkgYGUt9utvzUFPIxvWFDcldQl57KwhF4hp9Ir+PqJyYC
 NolQ1q8ddo1hnslxnEg6SgHVzQq/4FqMM0nDNUkQETCx6zTyFFeRf+q7o/2c2m5B
 uI9dSU1Wrx7XrXm2D3kB8+xP+ZRy+qhbFN5Pfuz96mhelfklylgKMfPzgAiCT/7T
 JTbQhQ2HdcCNgiLoSrWsHBDy2UYpouP4zb4jyd+lDQzhSUJrj3u4Xy4vVmuTKq+y
 sTgWlgKB+MTuh9UuJ4UYzSnMqg161UlMvtXeH84ABmAqDNGHRPtOKrrlcLtJ3D4x
 m1SPhNnsvpjOu2pH0XLIS8al3VUesWND5S+rucHRYSq6Nvhivf4MTvRJlicXXurL
 Mt2APnIlhGJuKBNWnmyZovVdtO0ZUUPlaZWfr3rCS4txAVo+HwWhsm3uhtTycQqN
 gazsCiuGh6Jds90ZqA/BvdLWG+DY8J0xLlV3ex4pCXuQ/HFrabVWTyThJsULhrZ2
 5mTdWIsocPctNMO9/RHMy7vJI7G7ljgHEquWVn5kiGGzXhK6VwVwKAMpfgXGw+YA
 yVP6/M7a7g2yEzj69gXkcDa8k/kedMVquJ/G/8YhZM7u7sPqsMjpmaGsqsJRfnpT
 ChngAzap+kA=
 =TEC6
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Implement the SCHED_DEADLINE server infrastructure - Daniel Bristot
   de Oliveira's last major contribution to the kernel:

     "SCHED_DEADLINE servers can help fixing starvation issues of low
      priority tasks (e.g., SCHED_OTHER) when higher priority tasks
      monopolize CPU cycles. Today we have RT Throttling; DEADLINE
      servers should be able to replace and improve that."

   (Daniel Bristot de Oliveira, Peter Zijlstra, Joel Fernandes, Youssef
   Esmat, Huang Shijie)

 - Preparatory changes for sched_ext integration:
     - Use set_next_task(.first) where required
     - Fix up set_next_task() implementations
     - Clean up DL server vs. core sched
     - Split up put_prev_task_balance()
     - Rework pick_next_task()
     - Combine the last put_prev_task() and the first set_next_task()
     - Rework dl_server
     - Add put_prev_task(.next)

   (Peter Zijlstra, with a fix by Tejun Heo)

 - Complete the EEVDF transition and refine EEVDF scheduling:
     - Implement delayed dequeue
     - Allow shorter slices to wakeup-preempt
     - Use sched_attr::sched_runtime to set request/slice suggestion
     - Document the new feature flags
     - Remove unused and duplicate-functionality fields
     - Simplify & unify pick_next_task_fair()
     - Misc debuggability enhancements

   (Peter Zijlstra, with fixes/cleanups by Dietmar Eggemann, Valentin
   Schneider and Chuyi Zhou)

 - Initialize the vruntime of a new task when it is first enqueued,
   resulting in significant decrease in latency of newly woken tasks
   (Zhang Qiao)

 - Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
   (K Prateek Nayak, Peter Zijlstra)

 - Clean up and clarify the usage of Clean up usage of rt_task()
   (Qais Yousef)

 - Preempt SCHED_IDLE entities in strict cgroup hierarchies
   (Tianchen Ding)

 - Clarify the documentation of time units for deadline scheduler
   parameters (Christian Loehle)

 - Remove the HZ_BW chicken-bit feature flag introduced a year ago,
   the original change seems to be working fine (Phil Auld)

 - Misc fixes and cleanups (Chen Yu, Dan Carpenter, Huang Shijie,
   Peilin He, Qais Yousefm and Vincent Guittot)

* tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
  sched/cpufreq: Use NSEC_PER_MSEC for deadline task
  cpufreq/cppc: Use NSEC_PER_MSEC for deadline task
  sched/deadline: Clarify nanoseconds in uapi
  sched/deadline: Convert schedtool example to chrt
  sched/debug: Fix the runnable tasks output
  sched: Fix sched_delayed vs sched_core
  kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
  kthread: Fix task state in kthread worker if being frozen
  sched/pelt: Use rq_clock_task() for hw_pressure
  sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c
  sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
  sched: Add put_prev_task(.next)
  sched: Rework dl_server
  sched: Combine the last put_prev_task() and the first set_next_task()
  sched: Rework pick_next_task()
  sched: Split up put_prev_task_balance()
  sched: Clean up DL server vs core sched
  sched: Fixup set_next_task() implementations
  sched: Use set_next_task(.first) where required
  sched/fair: Properly deactivate sched_delayed task upon class change
  ...
2024-09-19 15:55:58 +02:00
Linus Torvalds
9ea925c806 Updates for timers and timekeeping:
- Core:
 
 	- Overhaul of posix-timers in preparation of removing the
 	  workaround for periodic timers which have signal delivery
 	  ignored.
 
         - Remove the historical extra jiffie in msleep()
 
 	  msleep() adds an extra jiffie to the timeout value to ensure
 	  minimal sleep time. The timer wheel ensures minimal sleep
 	  time since the large rewrite to a non-cascading wheel, but the
 	  extra jiffie in msleep() remained unnoticed. Remove it.
 
         - Make the timer slack handling correct for realtime tasks.
 
 	  The procfs interface is inconsistent and does neither reflect
 	  reality nor conforms to the man page. Show the correct 0 slack
 	  for real time tasks and enforce it at the core level instead of
 	  having inconsistent individual checks in various timer setup
 	  functions.
 
         - The usual set of updates and enhancements all over the place.
 
   - Drivers:
 
         - Allow the ACPI PM timer to be turned off during suspend
 
 	- No new drivers
 
 	- The usual updates and enhancements in various drivers
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn7jQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYobqnD/9COlU0nwsulABI/aNIrsh6iYvnCC9v
 14CcNta7Qn+157Wfw9BWOyHdNhR1/fPCXE8jJ71zTyIOeW27HV2JyTtxTwe9ZcdK
 ViHAaj7YcIjcVUEC3StCoRCPnvLslEw4qJA5AOQuDyMivdQn+YVa2c0baJxKaXZt
 xk4HZdMj4NAS0jRKnoZSwtKW/+Oz6rR4GAWrZo+Zs1/8ur3HfqnQfi8lJ1hJtLLW
 V7XDCVRvamVi6Ah3ocYPPp/1P6yeQDA1ge9aMddqaza5STWISXRtSnFMUmYP3rbS
 FaL8TyL+ilfny8pkGB2WlG6nLuSbtvogtdEh1gG1k1RmZt44kAtk8ba/KiWFPBSb
 zK9cjojRMBS71f9G4kmb5F4rnXoLsg1YbD1Nzhz3wq2Cs1Z90dc2QwMren0zoQ1x
 Fn56ueRyAiagBlnrSaKyso/2RvqJTNoSdi3RkpjYeAph0UoDCqvTvKjGAf1mWiw1
 T/1lUWSVqWHnzZbM7XXzzajIN9bl6A7bbqlcAJ2O9vZIDt7273DG+bQym9Vh6Why
 0LTGGERHxzKBsG7WRg+2Gmvv6S18UPKRo8tLtlA758rHlFuPTZCShWrIriwSNl1K
 Hxon+d4BparSnm1h9W/NHPKJA574UbWRCBjdk58IkAj8DxZZY4ORD9SMP+ggkV7G
 F6p9cgoDNP9KFg==
 =jE0N
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "Core:

   - Overhaul of posix-timers in preparation of removing the workaround
     for periodic timers which have signal delivery ignored.

   - Remove the historical extra jiffie in msleep()

     msleep() adds an extra jiffie to the timeout value to ensure
     minimal sleep time. The timer wheel ensures minimal sleep time
     since the large rewrite to a non-cascading wheel, but the extra
     jiffie in msleep() remained unnoticed. Remove it.

   - Make the timer slack handling correct for realtime tasks.

     The procfs interface is inconsistent and does neither reflect
     reality nor conforms to the man page. Show the correct 0 slack for
     real time tasks and enforce it at the core level instead of having
     inconsistent individual checks in various timer setup functions.

   - The usual set of updates and enhancements all over the place.

  Drivers:

   - Allow the ACPI PM timer to be turned off during suspend

   - No new drivers

   - The usual updates and enhancements in various drivers"

* tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
  ntp: Make sure RTC is synchronized when time goes backwards
  treewide: Fix wrong singular form of jiffies in comments
  cpu: Use already existing usleep_range()
  timers: Rename next_expiry_recalc() to be unique
  platform/x86:intel/pmc: Fix comment for the pmc_core_acpi_pm_timer_suspend_resume function
  clocksource/drivers/jcore: Use request_percpu_irq()
  clocksource/drivers/cadence-ttc: Add missing clk_disable_unprepare in ttc_setup_clockevent
  clocksource/drivers/asm9260: Add missing clk_disable_unprepare in asm9260_timer_init
  clocksource/drivers/qcom: Add missing iounmap() on errors in msm_dt_timer_init()
  clocksource/drivers/ingenic: Use devm_clk_get_enabled() helpers
  platform/x86:intel/pmc: Enable the ACPI PM Timer to be turned off when suspended
  clocksource: acpi_pm: Add external callback for suspend/resume
  clocksource/drivers/arm_arch_timer: Using for_each_available_child_of_node_scoped()
  dt-bindings: timer: rockchip: Add rk3576 compatible
  timers: Annotate possible non critical data race of next_expiry
  timers: Remove historical extra jiffie for timeout in msleep()
  hrtimer: Use and report correct timerslack values for realtime tasks
  hrtimer: Annotate hrtimer_cpu_base_.*_expiry() for sparse.
  timers: Add sparse annotation for timer_sync_wait_running().
  signal: Replace BUG_ON()s
  ...
2024-09-17 07:25:37 +02:00
Linus Torvalds
cb69d86550 Updates for the interrupt subsystem:
- Core:
 	- Remove a global lock in the affinity setting code
 
 	  The lock protects a cpumask for intermediate results and the lock
 	  causes a bottleneck on simultaneous start of multiple virtual
 	  machines. Replace the lock and the static cpumask with a per CPU
 	  cpumask which is nicely serialized by raw spinlock held when
 	  executing this code.
 
 	- Provide support for giving a suffix to interrupt domain names.
 
 	  That's required to support devices with subfunctions so that the
 	  domain names are distinct even if they originate from the same
 	  device node.
 
 	- The usual set of cleanups and enhancements all over the place
 
   - Drivers:
 
 	- Support for longarch AVEC interrupt chip
 
 	- Refurbishment of the Armada driver so it can be extended for new
           variants.
 
 	- The usual set of cleanups and enhancements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn5p8THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRFtD/43eB3h5usY2OPW0JmDqrE6qnzsvjPZ
 1H52BcmMcOuI6yCfTnbi/fBB52mwSEGq9Dmt1GXradyq9/CJDIqZ1ajI1rA2jzW2
 YdbeTDpKm1rS2ddzfp2LT2BryrNt+7etrRO7qHn4EKSuOcNuV2f58WPbIIqasvaK
 uPbUDVDPrvXxLNcjoab6SqaKrEoAaHSyKpd0MvDd80wHrtcSC/QouW7JDSUXv699
 RwvLebN1OF6mQ2J8Z3DLeCQpcbAs+UT8UvID7kYUJi1g71J/ZY+xpMLoX/gHiDNr
 isBtsuEAiZeNaFpksc7A6Jgu5ljZf2/aLCqbPLlHaduHFNmo94x9KUbIF2cpEMN+
 rsf5Ff7AVh1otz3cUwLLsm+cFLWRRoZdLuncn7rrgB4Yg0gll7qzyLO6YGvQHr8U
 Ocj1RXtvvWsMk4XzhgCt1AH/42cO6go+bhA4HspeYykNpsIldIUl1MeFbO8sWiDJ
 kybuwiwHp3oaMLjEK4Lpq65u7Ll8Lju2zRde65YUJN2nbNmJFORrOLmeC1qsr6ri
 dpend6n2qD9UD1oAt32ej/uXnG160nm7UKescyxiZNeTm1+ez8GW31hY128ifTY3
 4R3urGS38p3gazXBsfw6eqkeKx0kEoDNoQqrO5gBvb8kowYTvoZtkwMGAN9OADwj
 w6vvU0i+NIyVMA==
 =JlJ2
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
 "Core:

   - Remove a global lock in the affinity setting code

     The lock protects a cpumask for intermediate results and the lock
     causes a bottleneck on simultaneous start of multiple virtual
     machines. Replace the lock and the static cpumask with a per CPU
     cpumask which is nicely serialized by raw spinlock held when
     executing this code.

   - Provide support for giving a suffix to interrupt domain names.

     That's required to support devices with subfunctions so that the
     domain names are distinct even if they originate from the same
     device node.

   - The usual set of cleanups and enhancements all over the place

  Drivers:

   - Support for longarch AVEC interrupt chip

   - Refurbishment of the Armada driver so it can be extended for new
     variants.

   - The usual set of cleanups and enhancements all over the place"

* tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (73 commits)
  genirq: Use cpumask_intersects()
  genirq/cpuhotplug: Use cpumask_intersects()
  irqchip/apple-aic: Only access system registers on SoCs which provide them
  irqchip/apple-aic: Add a new "Global fast IPIs only" feature level
  irqchip/apple-aic: Skip unnecessary enabling of use_fast_ipi
  dt-bindings: apple,aic: Document A7-A11 compatibles
  irqdomain: Use IS_ERR_OR_NULL() in irq_domain_trim_hierarchy()
  genirq/msi: Use kmemdup_array() instead of kmemdup()
  genirq/proc: Change the return value for set affinity permission error
  genirq/proc: Use irq_move_pending() in show_irq_affinity()
  genirq/proc: Correctly set file permissions for affinity control files
  genirq: Get rid of global lock in irq_do_set_affinity()
  genirq: Fix typo in struct comment
  irqchip/loongarch-avec: Add AVEC irqchip support
  irqchip/loongson-pch-msi: Prepare get_pch_msi_handle() for AVECINTC
  irqchip/loongson-eiointc: Rename CPUHP_AP_IRQ_LOONGARCH_STARTING
  LoongArch: Architectural preparation for AVEC irqchip
  LoongArch: Move irqchip function prototypes to irq-loongson.h
  irqchip/loongson-pch-msi: Switch to MSI parent domains
  softirq: Remove unused 'action' parameter from action callback
  ...
2024-09-17 07:09:17 +02:00
Linus Torvalds
a64405b78b Updates for the clocksource watchdog:
- Make the uncertainty margin handling more robust to prevent false
     positives
 
   - Clarify comments
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn6xETHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoafvEAC30/0ePYUhwnJuhm1DboYLWWVGJlSI
 WpdwmL1YgBUTAjJC0wuYA+Fc8XV9grvxY86bPAsU2Lep85znCSIK/YEO4tsN5Tnq
 +aWGR8Q+2j80MxGCP2mFtG1P0K35P4dopRJgvGmE/xtqKxJ3IPryhfjegHVREDoe
 PNQt59lW1WZhydLXB60S9FCAJeLFRWAoc2tOIr9T2CnsRkc8aXdWnpEU40bN+nFH
 rX09ynPPOlxo5f6z5YWBOuv7LgjvNjmJP3qf8d4Sp4stj7+kI3ipZh6mlolek4l+
 sMlbBAelQk+eiPYPIbthZvMvhM3J+YFXy4nh1LPYSRqWIMsgtobxwNUvWV0CqZaQ
 ZEImJqh+QJA27RAD13Uiv3N68prgW7yj/65b3EbjiK9tka2+67JVWTnlxuuyYu+S
 JxetHfxLKqohGhKAgAeO11efcf4HBh82afqDvMlWvTwDxyqMVLXM2NeT2iOgeYZk
 eH85R7ophcVdQAHZlYagsYA1oQMyT+UD40kiZzrX6WVHqX5ohVpnV/X8cJUVcUGa
 phOdX03ARzTSYdbE+J61bfAMCS2Tme34LkBMgZmooVfKwmnJOCGcskljtLOssm7N
 pTU7dI5dJLt6P+HZ6dbxf0xd0z6mNLXbVDlibEfR1ISfhLOHyH+2ywM0OJFDI0sP
 Ydt3U0n8sNFiXw==
 =Y38F
 -----END PGP SIGNATURE-----

Merge tag 'timers-clocksource-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull clocksource watchdog updates from Thomas Gleixner:

 - Make the uncertainty margin handling more robust to prevent false
   positives

 - Clarify comments

* tag 'timers-clocksource-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin
  clocksource: Fix comments on WATCHDOG_THRESHOLD & WATCHDOG_MAX_SKEW
  clocksource: Improve comments for watchdog skew bounds
2024-09-17 07:05:08 +02:00
Benjamin ROBIN
35b603f8a7 ntp: Make sure RTC is synchronized when time goes backwards
sync_hw_clock() is normally called every 11 minutes when time is
synchronized. This issue is that this periodic timer uses the REALTIME
clock, so when time moves backwards (the NTP server jumps into the past),
the timer expires late.

If the timer expires late, which can be days later, the RTC will no longer
be updated, which is an issue if the device is abruptly powered OFF during
this period. When the device will restart (when powered ON), it will have
the date prior to the ADJ_SETOFFSET call.

A normal NTP server should not jump in the past like that, but it is
possible... Another way of reproducing this issue is to use phc2sys to
synchronize the REALTIME clock with, for example, an IRIG timecode with
the source always starting at the same date (not synchronized).

Also, if the time jump in the future by less than 11 minutes, the RTC may
not be updated immediately (minor issue). Consider the following scenario:
 - Time is synchronized, and sync_hw_clock() was just called (the timer
   expires in 11 minutes).
 - A time jump is realized in the future by a couple of minutes.
 - The time is synchronized again.
 - Users may expect that RTC to be updated as soon as possible, and not
   after 11 minutes (for the same reason, if a power loss occurs in this
   period).

Cancel periodic timer on any time jump (ADJ_SETOFFSET) greater than or
equal to 1s. The timer will be relaunched at the end of do_adjtimex() if
NTP is still considered synced. Otherwise the timer will be relaunched
later when NTP is synced. This way, when the time is synchronized again,
the RTC is updated after less than 2 seconds.

Signed-off-by: Benjamin ROBIN <dev@benjarobin.fr>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240908140836.203911-1-dev@benjarobin.fr
2024-09-10 13:50:40 +02:00
Thomas Gleixner
2f7eedca6c Merge branch 'linus' into timers/core
To update with the latest fixes.
2024-09-10 13:49:53 +02:00
Anna-Maria Behnsen
bd7c8ff9fe treewide: Fix wrong singular form of jiffies in comments
There are several comments all over the place, which uses a wrong singular
form of jiffies.

Replace 'jiffie' by 'jiffy'. No functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-3-e98760256370@linutronix.de
2024-09-08 20:47:40 +02:00
Anna-Maria Behnsen
fe90c5ba88 timers: Rename next_expiry_recalc() to be unique
next_expiry_recalc is the name of a function as well as the name of a
struct member of struct timer_base. This might lead to confusion.

Rename next_expiry_recalc() to timer_recalc_next_expiry(). No functional
change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-1-e98760256370@linutronix.de
2024-09-08 20:47:40 +02:00
Anna-Maria Behnsen
79f8b28e85 timers: Annotate possible non critical data race of next_expiry
Global timers could be expired remotely when the target CPU is idle. After
a remote timer expiry, the remote timer_base->next_expiry value is updated
while holding the timer_base->lock. When the formerly idle CPU becomes
active at the same time and checks whether timers need to expire, this
check is done lockless as it is on the local CPU. This could lead to a data
race, which was reported by sysbot:

  https://lore.kernel.org/r/000000000000916e55061f969e14@google.com

When the value is read lockless but changed by the remote CPU, only two non
critical scenarios could happen:

1) The already update value is read -> everything is perfect

2) The old value is read -> a superfluous timer soft interrupt is raised

The same situation could happen when enqueueing a new first pinned timer by
a remote CPU also with non critical scenarios:

1) The already update value is read -> everything is perfect

2) The old value is read -> when the CPU is idle, an IPI is executed
nevertheless and when the CPU isn't idle, the updated value will be visible
on the next tick and the timer might be late one jiffie.

As this is very unlikely to happen, the overhead of doing the check under
the lock is a way more effort, than a superfluous timer soft interrupt or a
possible 1 jiffie delay of the timer.

Document and annotate this non critical behavior in the code by using
READ/WRITE_ONCE() pair when accessing timer_base->next_expiry.

Reported-by: syzbot+bf285fcc0a048e028118@syzkaller.appspotmail.com
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20240829154305.19259-1-anna-maria@linutronix.de
Closes: https://lore.kernel.org/lkml/000000000000916e55061f969e14@google.com
2024-09-04 11:57:56 +02:00
Anna-Maria Behnsen
4381b895f5 timers: Remove historical extra jiffie for timeout in msleep()
msleep() and msleep_interruptible() add a jiffie to the requested timeout.

This extra jiffie was introduced to ensure that the timeout will not happen
earlier than specified.

Since the rework of the timer wheel, the enqueue path already takes care of
this. So the extra jiffie added by msleep*() is pointless now.

Remove this extra jiffie in msleep() and msleep_interruptible().

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/all/20240829074133.4547-1-anna-maria@linutronix.de
2024-08-29 16:17:18 +02:00
Felix Moessbauer
ed4fb6d7ef hrtimer: Use and report correct timerslack values for realtime tasks
The timerslack_ns setting is used to specify how much the hardware
timers should be delayed, to potentially dispatch multiple timers in a
single interrupt. This is a performance optimization. Timers of
realtime tasks (having a realtime scheduling policy) should not be
delayed.

This logic was inconsitently applied to the hrtimers, leading to delays
of realtime tasks which used timed waits for events (e.g. condition
variables). Due to the downstream override of the slack for rt tasks,
the procfs reported incorrect (non-zero) timerslack_ns values.

This is changed by setting the timer_slack_ns task attribute to 0 for
all tasks with a rt policy. By that, downstream users do not need to
specially handle rt tasks (w.r.t. the slack), and the procfs entry
shows the correct value of "0". Setting non-zero slack values (either
via procfs or PR_SET_TIMERSLACK) on tasks with a rt policy is ignored,
as stated in "man 2 PR_SET_TIMERSLACK":

  Timer slack is not applied to threads that are scheduled under a
  real-time scheduling policy (see sched_setscheduler(2)).

The special handling of timerslack on rt tasks in downstream users
is removed as well.

Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240814121032.368444-2-felix.moessbauer@siemens.com
2024-08-23 20:13:02 +02:00
Caleb Sander Mateos
e68ac2b488 softirq: Remove unused 'action' parameter from action callback
When soft interrupt actions are called, they are passed a pointer to the
struct softirq action which contains the action's function pointer.

This pointer isn't useful, as the action callback already knows what
function it is. And since each callback handles a specific soft interrupt,
the callback also knows which soft interrupt number is running.

No soft interrupt action callback actually uses this parameter, so remove
it from the function pointer signature. This clarifies that soft interrupt
actions are global routines and makes it slightly cheaper to call them.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/all/20240815171549.3260003-1-csander@purestorage.com
2024-08-20 17:13:40 +02:00
Sebastian Andrzej Siewior
330dd6d9c0 hrtimer: Annotate hrtimer_cpu_base_.*_expiry() for sparse.
The two hrtimer_cpu_base_.*_expiry() functions are wrappers around the
locking functions and sparse complains about the missing counterpart.

Add sparse annotation to denote that this bevaviour is expected.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240812105326.2240000-3-bigeasy@linutronix.de
2024-08-14 12:44:41 +02:00
Sebastian Andrzej Siewior
38cd4cee73 timers: Add sparse annotation for timer_sync_wait_running().
timer_sync_wait_running() first releases two locks and then acquires
them again. This is unexpected and sparse complains about it.

Add sparse annotation for timer_sync_wait_running() to note that the
locking is expected.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240812105326.2240000-2-bigeasy@linutronix.de
2024-08-14 12:44:41 +02:00
Qais Yousef
ae04f69de0 sched/rt: Rename realtime_{prio, task}() to rt_or_dl_{prio, task}()
Some find the name realtime overloaded. Use rt_or_dl() as an
alternative, hopefully better, name.

Suggested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Qais Yousef <qyousef@layalina.io>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240610192018.1567075-4-qyousef@layalina.io
2024-08-07 18:32:38 +02:00
Qais Yousef
130fd056dd sched/rt: Clean up usage of rt_task()
rt_task() checks if a task has RT priority. But depends on your
dictionary, this could mean it belongs to RT class, or is a 'realtime'
task, which includes RT and DL classes.

Since this has caused some confusion already on discussion [1], it
seemed a clean up is due.

I define the usage of rt_task() to be tasks that belong to RT class.
Make sure that it returns true only for RT class and audit the users and
replace the ones required the old behavior with the new realtime_task()
which returns true for RT and DL classes. Introduce similar
realtime_prio() to create similar distinction to rt_prio() and update
the users that required the old behavior to use the new function.

Move MAX_DL_PRIO to prio.h so it can be used in the new definitions.

Document the functions to make it more obvious what is the difference
between them. PI-boosted tasks is a factor that must be taken into
account when choosing which function to use.

Rename task_is_realtime() to realtime_task_policy() as the old name is
confusing against the new realtime_task().

No functional changes were intended.

[1] https://lore.kernel.org/lkml/20240506100509.GL40213@noisy.programming.kicks-ass.net/

Signed-off-by: Qais Yousef <qyousef@layalina.io>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20240610192018.1567075-2-qyousef@layalina.io
2024-08-07 18:32:37 +02:00
Thomas Gleixner
5916be8a53 timekeeping: Fix bogus clock_was_set() invocation in do_adjtimex()
The addition of the bases argument to clock_was_set() fixed up all call
sites correctly except for do_adjtimex(). This uses CLOCK_REALTIME
instead of CLOCK_SET_WALL as argument. CLOCK_REALTIME is 0.

As a result the effect of that clock_was_set() notification is incomplete
and might result in timers expiring late because the hrtimer code does
not re-evaluate the affected clock bases.

Use CLOCK_SET_WALL instead of CLOCK_REALTIME to tell the hrtimers code
which clock bases need to be re-evaluated.

Fixes: 17a1b8826b ("hrtimer: Add bases argument to clock_was_set()")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/877ccx7igo.ffs@tglx
2024-08-05 16:14:14 +02:00
Justin Stitt
06c03c8edc ntp: Safeguard against time_constant overflow
Using syzkaller with the recently reintroduced signed integer overflow
sanitizer produces this UBSAN report:

UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:738:18
9223372036854775806 + 4 cannot be represented in type 'long'
Call Trace:
 handle_overflow+0x171/0x1b0
 __do_adjtimex+0x1236/0x1440
 do_adjtimex+0x2be/0x740

The user supplied time_constant value is incremented by four and then
clamped to the operating range.

Before commit eea83d896e ("ntp: NTP4 user space bits update") the user
supplied value was sanity checked to be in the operating range. That change
removed the sanity check and relied on clamping after incrementing which
does not work correctly when the user supplied value is in the overflow
zone of the '+ 4' operation.

The operation requires CAP_SYS_TIME and the side effect of the overflow is
NTP getting out of sync.

Similar to the fixups for time_maxerror and time_esterror, clamp the user
space supplied value to the operating range.

[ tglx: Switch to clamping ]

Fixes: eea83d896e ("ntp: NTP4 user space bits update")
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240517-b4-sio-ntp-c-v2-1-f3a80096f36f@google.com
Closes: https://github.com/KSPP/linux/issues/352
2024-08-05 16:14:14 +02:00
Justin Stitt
87d571d6fb ntp: Clamp maxerror and esterror to operating range
Using syzkaller alongside the newly reintroduced signed integer overflow
sanitizer spits out this report:

UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:461:16
9223372036854775807 + 500 cannot be represented in type 'long'
Call Trace:
 handle_overflow+0x171/0x1b0
 second_overflow+0x2d6/0x500
 accumulate_nsecs_to_secs+0x60/0x160
 timekeeping_advance+0x1fe/0x890
 update_wall_time+0x10/0x30

time_maxerror is unconditionally incremented and the result is checked
against NTP_PHASE_LIMIT, but the increment itself can overflow, resulting
in wrap-around to negative space.

Before commit eea83d896e ("ntp: NTP4 user space bits update") the user
supplied value was sanity checked to be in the operating range. That change
removed the sanity check and relied on clamping in handle_overflow() which
does not work correctly when the user supplied value is in the overflow
zone of the '+ 500' operation.

The operation requires CAP_SYS_TIME and the side effect of the overflow is
NTP getting out of sync.

Miroslav confirmed that the input value should be clamped to the operating
range and the same applies to time_esterror. The latter is not used by the
kernel, but the value still should be in the operating range as it was
before the sanity check got removed.

Clamp them to the operating range.

[ tglx: Changed it to clamping and included time_esterror ] 

Fixes: eea83d896e ("ntp: NTP4 user space bits update")
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Link: https://lore.kernel.org/all/20240517-b4-sio-ntp-usec-v2-1-d539180f2b79@google.com
Closes: https://github.com/KSPP/linux/issues/354
2024-08-05 16:14:14 +02:00
Paul E. McKenney
4ac1dd3245 clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin
Right now, cs_watchdog_read() does clocksource sanity checks based
on WATCHDOG_MAX_SKEW, which sets a floor on any clocksource's
.uncertainty_margin.  These sanity checks can therefore act
inappropriately for clocksources with large uncertainty margins.

One reason for a clocksource to have a large .uncertainty_margin is when
that clocksource has long read-out latency, given that it does not make
sense for the .uncertainty_margin to be smaller than the read-out latency.
With the current checks, cs_watchdog_read() could reject all normal
reads from a clocksource with long read-out latencies, such as those
from legacy clocksources that are no longer implemented in hardware.

Therefore, recast the cs_watchdog_read() checks in terms of the
.uncertainty_margin values of the clocksources involved in the timespan in
question.  The first covers two watchdog reads and one cs read, so use
twice the watchdog .uncertainty_margin plus that of the cs.  The second
covers only a pair of watchdog reads, so use twice the watchdog
.uncertainty_margin.

Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240802154618.4149953-4-paulmck@kernel.org
2024-08-02 18:37:13 +02:00
Paul E. McKenney
f33a5d4bd9 clocksource: Fix comments on WATCHDOG_THRESHOLD & WATCHDOG_MAX_SKEW
The WATCHDOG_THRESHOLD macro is no longer used to supply a default value
for ->uncertainty_margin, but WATCHDOG_MAX_SKEW now is.

Therefore, update the comments to reflect this change.

Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20240802154618.4149953-3-paulmck@kernel.org
2024-08-02 18:37:13 +02:00
Borislav Petkov
17915131ae clocksource: Improve comments for watchdog skew bounds
Add more detail on the rationale for bounding the clocksource
->uncertainty_margin below at about 500ppm.

Signed-off-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240802154618.4149953-1-paulmck@kernel.org
2024-08-02 18:37:13 +02:00
Paul E. McKenney
f2655ac2c0 clocksource: Fix brown-bag boolean thinko in cs_watchdog_read()
The current "nretries > 1 || nretries >= max_retries" check in
cs_watchdog_read() will always evaluate to true, and thus pr_warn(), if
nretries is greater than 1.  The intent is instead to never warn on the
first try, but otherwise warn if the successful retry was the last retry.

Therefore, change that "||" to "&&".

Fixes: db3a34e174 ("clocksource: Retry clock read if long delays detected")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240802154618.4149953-2-paulmck@kernel.org
2024-08-02 18:29:28 +02:00
Thomas Gleixner
6881e75237 tick/broadcast: Move per CPU pointer access into the atomic section
The recent fix for making the take over of the broadcast timer more
reliable retrieves a per CPU pointer in preemptible context.

This went unnoticed as compilers hoist the access into the non-preemptible
region where the pointer is actually used. But of course it's valid that
the compiler keeps it at the place where the code puts it which rightfully
triggers:

  BUG: using smp_processor_id() in preemptible [00000000] code:
       caller is hotplug_cpu__broadcast_tick_pull+0x1c/0xc0

Move it to the actual usage site which is in a non-preemptible region.

Fixes: f7d43dd206 ("tick/broadcast: Make takeover of broadcast hrtimer reliable")
Reported-by: David Wang <00107082@163.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Yu Liao <liaoyu15@huawei.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/87ttg56ers.ffs@tglx
2024-07-31 12:37:43 +02:00
Thomas Gleixner
566e2d8253 posix-timers: Consolidate signal queueing
Rename posix_timer_event() to posix_timer_queue_signal() as this is what
the function is about.

Consolidate the requeue pending and deactivation updates into that function
as there is no point in doing this in all incarnations of posix timers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29 21:57:35 +02:00
Thomas Gleixner
24aea4cc48 posix-cpu-timers: Make k_itimer::it_active consistent
Posix CPU timers are not updating k_itimer::it_active which makes it
impossible to base decisions in the common posix timer code on it.

Update it when queueing or dequeueing posix CPU timers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29 21:57:35 +02:00
Thomas Gleixner
20f13385b5 posix-timers: Consolidate timer setup
hrtimer based and CPU timers have their own way to install the new interval
and to reset overrun and signal handling related data.

Create a helper function and do the same operation for all variants.

This also makes the handling of the interval consistent. It's only stored
when the timer is actually armed, i.e. timer->it_value != 0. Before that it
was stored unconditionally for posix CPU timers and conditionally for the
other posix timers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29 21:57:35 +02:00
Thomas Gleixner
52dea0a15c posix-timers: Convert timer list to hlist
No requirement for a real list. Spare a few bytes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29 21:57:35 +02:00
Thomas Gleixner
aca1dc0ce1 posix-timers: Clear overrun in common_timer_set()
Keeping the overrun count of the previous setup around is just wrong. The
new setting has nothing to do with the previous one and has to start from a
clean slate.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29 21:57:35 +02:00
Thomas Gleixner
bfa408f03f posix-timers: Retrieve interval in common timer_settime() code
No point in doing this all over the place.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29 21:57:35 +02:00
Thomas Gleixner
c20b99e324 posix-cpu-timers: Simplify posix_cpu_timer_set()
Avoid the late sighand lock/unlock dance when a timer is not armed to
enforce reevaluation of the timer base so that the process wide CPU timer
sampling can be disabled.

Do it right at the point where the arming decision is made which already
has sighand locked.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29 21:57:34 +02:00
Thomas Gleixner
286bfaccea posix-cpu-timers: Remove incorrect comment in posix_cpu_timer_set()
A leftover from historical code which describes fiction.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29 21:57:34 +02:00
Thomas Gleixner
c44462661e posix-cpu-timers: Use @now instead of @val for clarity
posix_cpu_timer_set() uses @val as variable for the current time. That's
confusing at best.

Use @now as anywhere else and rewrite the confusing comment about clock
sampling.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29 21:57:34 +02:00
Thomas Gleixner
bd29d773ea posix-cpu-timers: Do not arm SIGEV_NONE timers
There is no point in arming SIGEV_NONE timers as they never deliver a
signal. timer_gettime() is handling the expiry time correctly and that's
all SIGEV_NONE timers care about.

Prevent arming them and remove the expiry handler code which just disarms
them.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29 21:57:34 +02:00
Thomas Gleixner
d471ff397c posix-cpu-timers: Replace old expiry retrieval in posix_cpu_timer_set()
Reuse the split out __posix_cpu_timer_get() function which does already the
right thing.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29 21:57:34 +02:00
Thomas Gleixner
5f9d4a1065 posix-cpu-timers: Handle SIGEV_NONE timers correctly in timer_set()
Expired SIGEV_NONE oneshot timers must return 0 nsec for the expiry time in
timer_get(), but the posix CPU timer implementation returns 1 nsec.

Add the missing conditional.

This will be cleaned up in a follow up patch.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29 21:57:34 +02:00
Thomas Gleixner
d786b8ba9f posix-cpu-timers: Handle SIGEV_NONE timers correctly in timer_get()
Expired SIGEV_NONE oneshot timers must return 0 nsec for the expiry time in
timer_get(), but the posix CPU timer implementation returns 1 nsec.

Add the missing conditional.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29 21:57:34 +02:00
Thomas Gleixner
1c50284257 posix-cpu-timers: Handle interval timers correctly in timer_get()
timer_gettime() must return the remaining time to the next expiry of a
timer or 0 if the timer is not armed and no signal pending, but posix CPU
timers fail to forward a timer which is already expired.

Add the required logic to address that.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29 21:57:34 +02:00
Thomas Gleixner
b3e866b2df posix-cpu-timers: Save interval only for armed timers
There is no point to return the interval for timers which have been
disarmed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29 21:57:34 +02:00
Thomas Gleixner
d859704bf1 posix-cpu-timers: Split up posix_cpu_timer_get()
In preparation for addressing issues in the timer_get() and timer_set()
functions of posix CPU timers.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29 21:57:34 +02:00
Linus Torvalds
5256184b61 Fixes and minor updates for the timer migration code:
- Stop testing the group->parent pointer as it is not guaranteed to be
       stable over a chain of operations by design. This includes a warning
       which would be nice to have but it produces false positives due to
       the racy nature of the check.
 
     - Plug a race between CPUs going in and out of idle and a CPU hotplug
       operation. The latter can create and connect a new hierarchy level
       which is missed in the concurrent updates of CPUs which go into idle.
       As a result the events of such a CPU might not be processed and
       timers go stale.
 
       Cure it by splitting the hotplug operation into a prepare and online
       callback. The prepare callback is guaranteed to run on an online and
       therefore active CPU. This CPU updates the hierarchy and being online
       ensures that there is always at least one migrator active which
       handles the modified hierarchy correctly when going idle. The online
       callback which runs on the incoming CPU then just marks the CPU
       active and brings it into operation.
 
     - Improve tracing and polish the code further so it is more obvious
       what's going on.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmajmXYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWaGD/4iAuj0S3AQ+odB/Fkg1cvCF8YbOJGy
 PEMSDnYtW7ErShEwVQMsWnXCFhIMuqMN0KIzOChpya2ZkmdT47mNwKuwdOMEQZzg
 dKhmqGWR+sk4LMszCFbf5u6JjJeQ+nnxthgJ1IieJwC4VEfRceXYk7ng6Wvu1+lU
 JEIukUh9nRJWma7FYW8MeNZ4lJGdvawZ5UjUAkPtzeKWn0+0/oqV5t1c8E/1jBbi
 sKZW2soL716Xd/3QJUKKmtAcH7yDFwq5AY5bJURr5ztJw/yr2loVdvAPEdn/RQ6f
 fzN/J+nu2ig14g/QvhI8Ke+HbHJEZHpo6simSZRbdaqnCX3R/lYZwLDe7EGjqKIb
 0slsx2V1UxQ+qIYRplrtr/HGChjG/mXDLPIWRWjsiUAqyygy6QtUIko9AuH99Kd6
 7cBjOzajKIAA/J9SUD03VgjXcQ53bW64NMe2pOX9ED1mbfmmu/ROd0neOgksKw5o
 G5XQ+T6tNOoHMzJgv4R8PiViVdrf53A/g1wYTY1RR3XI8IWpounkyDExDvbtigGo
 N+reKoawDGpXeMAByO2E6UDFNA05NYPjlvSrzTS5ywwyF1qCowKI1Qyup9wA/God
 WJvfesmOJHtfcuUcVZf6Pm6+otJiKrT3reauFd6laEbyRuGTKvtNN/pQa6yxiZzT
 FTxZKpcYkPE76g==
 =G1Lg
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer migration updates from Thomas Gleixner:
 "Fixes and minor updates for the timer migration code:

   - Stop testing the group->parent pointer as it is not guaranteed to
     be stable over a chain of operations by design.

     This includes a warning which would be nice to have but it produces
     false positives due to the racy nature of the check.

   - Plug a race between CPUs going in and out of idle and a CPU hotplug
     operation. The latter can create and connect a new hierarchy level
     which is missed in the concurrent updates of CPUs which go into
     idle. As a result the events of such a CPU might not be processed
     and timers go stale.

     Cure it by splitting the hotplug operation into a prepare and
     online callback. The prepare callback is guaranteed to run on an
     online and therefore active CPU. This CPU updates the hierarchy and
     being online ensures that there is always at least one migrator
     active which handles the modified hierarchy correctly when going
     idle. The online callback which runs on the incoming CPU then just
     marks the CPU active and brings it into operation.

   - Improve tracing and polish the code further so it is more obvious
     what's going on"

* tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/migration: Fix grammar in comment
  timers/migration: Spare write when nothing changed
  timers/migration: Rename childmask by groupmask to make naming more obvious
  timers/migration: Read childmask and parent pointer in a single place
  timers/migration: Use a single struct for hierarchy walk data
  timers/migration: Improve tracing
  timers/migration: Move hierarchy setup into cpuhotplug prepare callback
  timers/migration: Do not rely always on group->parent
2024-07-27 10:19:55 -07:00
Joel Granados
78eb4ea25c sysctl: treewide: constify the ctl_table argument of proc_handlers
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.

This patch has been generated by the following coccinelle script:

```
  virtual patch

  @r1@
  identifier ctl, write, buffer, lenp, ppos;
  identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

  @r2@
  identifier func, ctl, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos)
  { ... }

  @r3@
  identifier func;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int , void *, size_t *, loff_t *);

  @r4@
  identifier func, ctl;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int , void *, size_t *, loff_t *);

  @r5@
  identifier func, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

```

* Code formatting was adjusted in xfs_sysctl.c to comply with code
  conventions. The xfs_stats_clear_proc_handler,
  xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
  adjusted.

* The ctl_table argument in proc_watchdog_common was const qualified.
  This is called from a proc_handler itself and is calling back into
  another proc_handler, making it necessary to change it as part of the
  proc_handler migration.

Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-07-24 20:59:29 +02:00