linux

mirror of https://github.com/torvalds/linux.git synced 2026-07-27 17:47:41 +02:00

Author	SHA1	Message	Date
Thomas Gleixner	920f893f73	posix-cpu-timers: Prevent UAF caused by non-leader exec() race Wongi and Jungwoo decoded and reported a non-leader exec() related race which can result in an UAF: sys_timer_delete() exec() posix_cpu_timer_del() // Observes old leader p = pid_task(pid, pid_type); de_thread() switch_leader(); release_task(old_leader) __exit_signal(old_leader) sighand = lock(old_leader, sighand); posix_cpu_timers_exit(); sighand = lock_task_sighand(p) unhash_task(old_leader); sh = lock(p, sighand) old_leader->sighand = NULL; unlock(sighand); (p->sighand == NULL) unlock(sh) return NULL; // Returns without action if(!sighand) return 0; free_posix_timer(); This is "harmless" unless the deleted timer was armed and enqueued in p->signal because on exec() a TGID targeted timer is inherited. As sys_timer_delete() freed the underlying posix timer object run_posix_cpu_timers() or any timerqueue related add/delete operations on other timers will access the freed object's timerqueue node, which results in an UAF. There is a similar problem vs. posix_cpu_timer_set(). For regular posix timers it just transiently returns -ESRCH to user space, but for the use case in do_cpu_nanosleep() it's the same UAF just that the k_itimer is allocated on the stack. Also posix_cpu_timer_rearm() fails to rearm the timer, which means it stops to expire. While debating solutions Frederic pointed out another problem: posix_cpu_timer_del(tmr) __exit_signal(p) posix_cpu_timers_exit(p); unhash_task(p); p->sighand = NULL; sh = lock_task_sighand(p) sighand = p->sighand; if (!sighand) return NULL; lock(sighand); if (!sh) WARN_ON_ONCE(timer_queued(tmr)); On weakly ordered architectures it is not guaranteed that posix_cpu_timer_del() will observe the stores in posix_cpu_timers_exit() when p->sighand is observed as NULL, which means the WARN() can be a false positive. Solve these issues by: 1) Changing the store in __exit_signal() to smp_store_release(). 2) Adding a smp_acquire__after_ctrl_dep() into the !sighand path of lock_task_sighand(). 3) Creating a helper function for looking up the task and locking sighand which does not return when sighand == NULL. Instead it retries the task lookup and only if that fails it gives up. 4) Using that helper in the three affected functions. #1/#2 ensures that the reader side which observes sighand == NULL also observes all preceeding stores, i.e. the stores in posix_cpu_timers_exit() and the ones in unhash_task(). #3 ensures that the above described non-leader exec() situation is handled gracefully. When the task lookup returns the old leader, but sighand == NULL then it retries. In the non-leader exec() case the subsequent task lookup will observe the new leader due to #1/#2. In normal exit() scenarios the subsequent lookup fails. When the task lookup fails, the function also checks whether the timer is still enqueued and issues a warning if that's the case. Unfortunately there is nothing which can be done about it, but as the task is already not longer visible the timer should not be accessed anymore. This check also requires memory ordering, which is not provided when the first lookup fails. To achieve that the check is preceeded by a smp_rmb() which pairs with the smp_wmb() in write_seqlock() in __exit_signal(). That ensures that the stores in posix_cpu_timers*_exit() are visible. The history of the non-leader exec() issue goes back to the early days of posix CPU timers, which stored a pointer to the group leader task in the timer. That obviously fails when a non-leader exec() switches the leader. commit `e0a7021710` ("posix-cpu-timers: workaround to suppress the problems with mt exec") added a temporary workaround for that in 2010 which survived about 10 years. The fix for the workaround changed the task pointer to a pid pointer, but failed to see the subtle race described above. So the Fixes tag picks that commit, which seems to be halfways accurate. Thanks to Frederic Weissbecker, Oleg Nesterov and Peter Zijlstra for review, feedback and suggestions and to Wongi and Jungwoo for the excellent bug report and analysis! Fixes: `55e8c8eb2c` ("posix-cpu-timers: Store a reference to a pid not a task") Reported-by: Wongi Lee <qw3rtyp0@gmail.com> Reported-by: Jungwoo Lee <jwlee2217@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: stable@vger.kernel.org	2026-07-05 11:44:06 +02:00
Wang Yan	269f2b43fa	time: Fix off-by-one in compat settimeofday() usec validation The compat version of settimeofday() uses '>' instead of '>=' when validating tv_usec against USEC_PER_SEC, allowing the value 1000000 to pass the check. After the subsequent conversion to nanoseconds (tv_nsec *= NSEC_PER_USEC), this results in tv_nsec == NSEC_PER_SEC, which violates the timespec invariant that tv_nsec must be strictly less than NSEC_PER_SEC. The native settimeofday() was already fixed in commit `ce4abda5e1` ("time: Fix off-by-one in settimeofday() usec validation"), but the compat counterpart was missed. Fix it by using '>=' to reject tv_usec values outside the valid range [0, USEC_PER_SEC - 1]. Fixes: `5e0fb1b57b` ("y2038: time: avoid timespec usage in settimeofday()") Signed-off-by: Wang Yan <wangyan01@kylinos.cn> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260622103348.120255-1-wangyan01@kylinos.cn	2026-06-22 13:20:20 +02:00
Ethan Nelson-Moore	f8aceb1adb	hrtimer: Correct CONFIG_NO_HZ_COMMON macro name in comment A comment in kernel/time/hrtimer.c incorrectly refers to CONFIG_NOHZ_COMMON instead of CONFIG_NO_HZ_COMMON. Correct it. Discovered while searching for CONFIG_* symbols referenced in code but not defined in any Kconfig file. Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260609034314.25029-1-enelsonmoore@gmail.com	2026-06-17 17:59:45 +02:00
Zhan Xusheng	26aff38fef	posix-cpu-timers: Use u64 multiplication in update_rlimit_cpu() update_rlimit_cpu() converts the RLIMIT_CPU value to nanoseconds with u64 nsecs = rlim_new * NSEC_PER_SEC; On 32-bit kernels both rlim_new (unsigned long) and NSEC_PER_SEC (1000000000L) are 32-bit, so the multiplication is performed in unsigned long and truncated for rlim_new > 4 seconds before being widened to u64. The same file already casts to u64 for the matching computation in check_process_timers(): u64 softns = (u64)soft * NSEC_PER_SEC; As a result, the truncated value is installed into the CPUCLOCK_PROF expiry cache (nextevt), causing the process CPU timer to be programmed to fire prematurely for any RLIMIT_CPU soft limit >= 5 seconds. The actual SIGXCPU/SIGKILL decision in check_process_timers() already casts to u64 and is therefore correct, so limit enforcement is not broken; only the expiry-cache programming is wrong. Apply the same cast here so both paths convert rlim_cur identically. 64-bit kernels are unaffected. Fixes: `858cf3a8c5` ("timers/itimer: Convert internal cputime_t units to nsec") Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260616112017.1681372-1-zhanxusheng@xiaomi.com	2026-06-17 17:58:34 +02:00
Mikhail Gavrilov	8fa3082118	timekeeping: Register default clocksource before taking tk_core.lock Commit `f24df84cbe` ("time/jiffies: Register jiffies clocksource before usage") moved the jiffies clocksource registration into clocksource_default_clock(), so that it is registered lazily on the first call. __clocksource_register() acquires clocksource_mutex, but the first caller is timekeeping_init(), which invokes clocksource_default_clock() while holding tk_core.lock, a raw spinlock. Acquiring a sleeping mutex while holding a raw spinlock is invalid. The default clocksource only has to be registered before tk_setup_internals() consumes its mult/shift/maxadj. Neither clocksource_default_clock(), the ->enable() callback, nor the registration itself need tk_core.lock, so fetch and enable the clock before acquiring the lock. This preserves the "register before usage" ordering while keeping clocksource_mutex out of the raw spinlock section. clocksource_default_clock() has a second caller, clocksource_done_booting(), which invokes it with clocksource_mutex already held. That path avoids a recursive lock because timekeeping_init() has already run and set cs_jiffies_registered, so the registration is skipped there. This change does not alter that; it only fixes the invalid wait context in timekeeping_init(). Fixes: `f24df84cbe` ("time/jiffies: Register jiffies clocksource before usage") Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reported-by: Breno Leitao <leitao@debian.org> Reported-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Breno Leitao <leitao@debian.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260616070914.65818-1-mikhail.v.gavrilov@gmail.com	2026-06-17 16:55:26 +02:00
Linus Torvalds	186d3c4e92	A series of updates for the VDSO: - Remove the redundant CONFIG_GENERIC_TIME_VSYSCALL after converting the remaining users over. - Rework and sanitize the MIPS VDSO handling, so it does not handle the time related VDSO if there is no VDSO capable clocksource available. Also stop mapping VDSO data pages unconditionally even if there is no usage possible. -----BEGIN PGP SIGNATURE----- iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmotog4QHHRnbHhAa2Vy bmVsLm9yZwAKCRCmGPVMDXSYoYD5D/41UOOboDxkv3Ei0gS12RfFL/pHWPotWMvA eUS4Ohko2pYubZB+iv87QQD5HC6TBN9yXbUjpvS0wN9jVSlXgQOdJ0JkNYYbPfjH nsBt7g/ksCA0Tv0KxnUUtECCTu6frtXkSi5iDU03DVNvxLMDmRaXg+D10TKGiCfb q1g2IoenA6YWz8pUkwOSI5wInepN8Dt9s5VXRsBQS9J9Uin/ZWAIwghIEhfFPNZu braNUSyJ+A7xISipOjWD1kEfBjcUzr24AuMZ9XHpxnykdWoa57qNEoTnjm+bVWpU hK0ED4nXSM8qxioV1Kye0h+hUi0T7amNtq6OSfAL+oXWHINvUe2PybqK+sZ1pPFb 0MSP/J4mV9CIvriPA33fTXw/VxigI/puCMtGuSLLXHwh2xW8hadz7wGU9lqF4E7E f+X9NYqZSmx/82z0KKPFtIQyfrozquRBzy2ZiUloh7ek7GKF5tvuN8lds3Ghuf/6 9Hqr/YquG1Zuu+j95ozVIm7fhDjmMktxAwyAbg5WkNGqOEw7MyhoUGXjKKG29uLa PRitT9DklL5vndzjizkLc1C2DuC++OhpnJCqRWOjJ9/TbvBC6Rn6aKwYyVTFX9xM 4VauNLQg0q9uG0TyW6o3aakaFk9OIKT+vl7bWO0G4g2YZaKDOs7s+b9sdNNNQq5R lTL5F0KUCQ== =Z90m -----END PGP SIGNATURE----- Merge tag 'timers-vdso-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull vdso updates from Thomas Gleixner: - Remove the redundant CONFIG_GENERIC_TIME_VSYSCALL after converting the remaining users over. - Rework and sanitize the MIPS VDSO handling, so it does not handle the time related VDSO if there is no VDSO capable clocksource available. Also stop mapping VDSO data pages unconditionally even if there is no usage possible. * tag 'timers-vdso-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: MIPS: VDSO: Fold MIPS_CLOCK_VSYSCALL into MIPS_GENERIC_GETTIMEOFDAY MIPS: VDSO: Gate microMIPS restriction on GCC version MIPS: VDSO: Fold MIPS_DISABLE_VDSO into MIPS_GENERIC_GETTIMEOFDAY clocksource/drivers/mips-gic-timer: Only use VDSO_CLOCKMODE_GIC when it is a available MIPS: csrc-r4k: Only use VDSO_CLOCKMODE_R4K when it is a available MIPS: VDSO: Only map the data pages when the vDSO is used MIPS: Introduce Kconfig MIPS_GENERIC_GETTIMEOFDAY vdso/datastore: Always provide symbol declarations MAINTAINERS: Add include/linux/vdso_datastore.h to vDSO block vdso/gettimeofday: Rename __arch_get_vdso_u_timens_data() vdso/treewide: Drop GENERIC_TIME_VSYSCALL vdso/vsyscall: Gate update_vsyscall() behind CONFIG_GENERIC_GETTIMEOFDAY riscv: vdso: Drop CONFIG_GENERIC_TIME_VSYSCALL guard around syscall fallbacks	2026-06-15 13:57:13 +05:30
Linus Torvalds	2d6d57f889	Updates for NTP/timekeeping and PTP: - Expand timekeeping snapshot mechanisms The various snapshot functions are mostly used for PTP to collect "atomic" snapshots of various involved clocks. They lack support for the recently introduced AUX clocks and do not provide the underlying counter value (e.g. TSC) to user space. Exposing the counter value snapshot allows for better control and steering. Convert the hard wired ktime_get_snapshot() to take a clock ID, which allows the caller to select the clock ID to be captured along with CLOCK_MONONOTONIC_RAW. Additionally capture the underlying hardware counter value and the clock source ID of the counter. Expand the hardware based snapshot capture where devices provide a mechanism to snapshot the hardware PTP clock and the system counter (usually via PCI/PTM) to support AUX clocks and also provide the captured counter value back to the caller and not only the clock timestamps derived from it. - Add a new optional read_snapshot() callback to clocksources That is required to capture atomic snapshots from clocksources which are derived from TSC with a scaling mechanism (e.g. Hyper-V, KVMclock). The value pair is handed back in the snapshot structure to the callers, so they can do the necessary correlations in a more precise way. This touches usage sites of the affected functions and data structure all over the tree, but stays fully backwards compatible for the existing user space exposed interfaces. New PTP IOCTLs will provide access to the extended functionality in later kernel versions. -----BEGIN PGP SIGNATURE----- iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmotoSAQHHRnbHhAa2Vy bmVsLm9yZwAKCRCmGPVMDXSYoYWbEACv/g3pGDxWxfzOI2h6vQgGxDvD3LwmdhPE bzXRRaxp3/J0rZTQmCghknVGDPVjepNQgKkUMXfaFG2UZmiPHG5qVTXO6DddguS4 cQc0SUO3e422lUPCoBmTULZ+vlctb4LJsWXPQHYNKC73KqMJtWte7T2HBiFDK5RB O0S34DZtkvOW4tHIu0RwlwCXZ0gcO+zsjxKA8K/P6sMtKBQU1/rRkZZCx2KCvq0F Rx3NTGoY4if/C83YBq1cEn8BvXrcQQH4ZOOWuySsLGJGRPZ1dXGP+JtfRWutk/f9 HZztlaXcEz71dJXlhBxc0Eb/86uC3POEq7ZYvQdzLbsSZ/3AbalksL9CLyxgdHtc U964SuwOVPcYfEytd4TWb1nu7JgOR0olYK+l4AbCt4EdKst5TADCJ7rtlZV3Idp+ Yg1GN3TwJcKItUNX9Szk+7MbvB8EWOEl7Obahfm48qDK1pqFe08qhOzSCeRXu+Bb QiupC3ndzUB1Yjf3DPV6wQl4Fl/TscrAVrPlnGCOJEKXtUKFxvcKquy/W29UD//w NuxKO2zK05UDsbBEwnZiCrdSGGNiLBYUbHfx2UvA7M0rfrrbjmG4rCFPotxhNb54 UuqgdM8G45MkyBV3qSSh3VC0XeD7UqzQtMYgUjjhvtapLlsri69vzL2DnQUcajSG dgjzIg9O3g== =kBrz -----END PGP SIGNATURE----- Merge tag 'timers-ptp-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull timekeeping updates from Thomas Gleixner: "Updates for NTP/timekeeping and PTP: - Expand timekeeping snapshot mechanisms The various snapshot functions are mostly used for PTP to collect "atomic" snapshots of various involved clocks. They lack support for the recently introduced AUX clocks and do not provide the underlying counter value (e.g. TSC) to user space. Exposing the counter value snapshot allows for better control and steering. Convert the hard wired ktime_get_snapshot() to take a clock ID, which allows the caller to select the clock ID to be captured along with CLOCK_MONONOTONIC_RAW. Additionally capture the underlying hardware counter value and the clock source ID of the counter. Expand the hardware based snapshot capture where devices provide a mechanism to snapshot the hardware PTP clock and the system counter (usually via PCI/PTM) to support AUX clocks and also provide the captured counter value back to the caller and not only the clock timestamps derived from it. - Add a new optional read_snapshot() callback to clocksources That is required to capture atomic snapshots from clocksources which are derived from TSC with a scaling mechanism (e.g. Hyper-V, KVMclock). The value pair is handed back in the snapshot structure to the callers, so they can do the necessary correlations in a more precise way. This touches usage sites of the affected functions and data structure all over the tree, but stays fully backwards compatible for the existing user space exposed interfaces. New PTP IOCTLs will provide access to the extended functionality in later kernel versions" * tag 'timers-ptp-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (28 commits) ptp: vmclock: Use hw_cycles from snapshot for precise TSC pairing x86/kvmclock: Implement read_snapshot() for kvmclock clocksource clocksource/hyperv: Implement read_snapshot() for TSC page clocksource timekeeping: Add clocksource read_snapshot() method and hw_cycles to snapshot ptp: Switch to ktime_get_snapshot_id() for pre/post timestamps timekeeping: Add support for AUX clock cross timestamping timekeeping: Remove system_device_crosststamp::sys_realtime ALSA: hda/common: Use system_device_crosststamp::sys_systime wifi: iwlwifi: Use system_device_crosststamp::sys_systime ptp: Use system_device_crosststamp::sys_systime timekeeping: Prepare for cross timestamps on arbitrary clock IDs timekeeping: Remove ktime_get_snapshot() virtio_rtc: Use provided clock ID for history snapshot net/mlx5: Use provided clock ID for history snapshot igc: Use provided clock ID for history snapshot ice/ptp: Use provided clock ID for history snapshot wifi: iwlwifi: Adopt PTP cross timestamps to core changes timekeeping: Add CLOCK ID to system_device_crosststamp timekeeping: Add system_counterval_t to struct system_device_crosststamp timekeeping: Add CLOCK_AUX support for ktime_get_snapshot_id() ...	2026-06-15 13:51:27 +05:30
Linus Torvalds	a53fcff8fc	Updates for the NOHZ subsystem: - Fix a long standing TOCTOU in get_cpu_sleep_time_us() - Make the CPU offline NOHZ handling more robust by disabling NOHZ on the outgoing CPU early instead of creating unneeded state which needs to be undone. - Unify idle CPU time accounting instead of having two different accounting mechanisms. These two different mechanisms are not really independent, but the different properties can in the worst case cause that gloabl idle time can be observed going backwards. - Consolidate the idle/iowait time retrieval interfaces instead of converting back and forth between them. - Make idle interrupt time accounting more robust. The original code assumes that interrupt time accouting is enabled and therefore stops elapsing idle time while an interrupt is handled in NOHZ dyntick state. That assumption is not correct as interrupt time accounting can be disabled at compile and runtime. - Fix an accounting error between dyntick idle time and dyntick idle steal time. The stolen time is not accounted and therefore idle time becomes inaccurate. The stolen time is now accounted after the fact as there is no way to predict the steal time upfront. -----BEGIN PGP SIGNATURE----- iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmotm7YQHHRnbHhAa2Vy bmVsLm9yZwAKCRCmGPVMDXSYoWEOEADG/RzW9l7fwUcsnubRyHER7GoIEZGP/VO/ nLQnI+M1LwdVyK8Oq8WntSnqOGevVdUKQlLFYMXdwovS8TWCi7PWcQeMIsfJnwT/ 6ugwo4E3mSBcseMjN8eHkEYH+1YmN6JSYQ+5eXT0JUXUxJlMFgtv3ZnTbQOF6Y3k Nkz1THojzdgTMn6WU/01AOXDR8Nhb2gOOQLDF/1ItZWnsDhbrDE2l99OaAELfeIo 8BZFQBPYRfuR8HaGgd8m2OPGnbw+cXMHUrTscMMQvmYv2wJVcNMB4AmVaUmGMRTR c3xs8QG2nUmIm72ENfe2pPDPyxy9JUJ54ro6/rLtcNQZ2wVCznhNahm6wRUa8f/0 7gnQE5nS2SNtdpL5StDVzk2+AZl6SrU8+ss51f1owPcNwKs+tt9XE+GVyL5wZpC8 IeD2SzVQLIIc0+ZvIkt9n76lLJkiLZkmt6UAnvhiz5NNn8XQlgC/4Uadk9CwC5vx t6swpE06D7vx7lm+ycfYUl6VPje5g/YvkkDFJqJulIlEDNpvb85+W6lMTyO7rwVl ln5TzOGPp7lJDwu6dAfg0QtHcUmUIFq05FuOIRNC4ZNxkxL2XDMzoe5IHxcWzU9O CHFGBUNZf+30vxvKeh+rkjZalt7/vgQaxb5somGmqv8g/6SJtJyB2qlq+gqH/+lQ MeG8tFGtVQ== =+cVw -----END PGP SIGNATURE----- Merge tag 'timers-nohz-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull NOHZ updates from Thomas Gleixner: - Fix a long standing TOCTOU in get_cpu_sleep_time_us() - Make the CPU offline NOHZ handling more robust by disabling NOHZ on the outgoing CPU early instead of creating unneeded state which needs to be undone. - Unify idle CPU time accounting instead of having two different accounting mechanisms. These two different mechanisms are not really independent, but the different properties can in the worst case cause that gloabl idle time can be observed going backwards. - Consolidate the idle/iowait time retrieval interfaces instead of converting back and forth between them. - Make idle interrupt time accounting more robust. The original code assumes that interrupt time accouting is enabled and therefore stops elapsing idle time while an interrupt is handled in NOHZ dyntick state. That assumption is not correct as interrupt time accounting can be disabled at compile and runtime. - Fix an accounting error between dyntick idle time and dyntick idle steal time. The stolen time is not accounted and therefore idle time becomes inaccurate. The stolen time is now accounted after the fact as there is no way to predict the steal time upfront. * tag 'timers-nohz-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: sched/cputime: Handle dyntick-idle steal time correctly sched/cputime: Handle idle irqtime gracefully sched/cputime: Provide get_cpu_[idle\|iowait]_time_us() off-case tick/sched: Consolidate idle time fetching APIs tick/sched: Account tickless idle cputime only when tick is stopped tick/sched: Remove unused fields tick/sched: Move dyntick-idle cputime accounting to cputime code tick/sched: Remove nohz disabled special case in cputime fetch tick/sched: Unify idle cputime accounting s390/time: Prepare to stop elapsing in dynticks-idle powerpc/time: Prepare to stop elapsing in dynticks-idle sched/cputime: Correctly support generic vtime idle time sched/cputime: Remove superfluous and error prone kcpustat_field() parameter sched/idle: Handle offlining first in idle loop tick/sched: Fix TOCTOU in nohz idle time fetch	2026-06-15 13:48:52 +05:30
Linus Torvalds	a60ce761d9	Updates for the time/timer core subsystem: - Harden the user space controllable hrtimer interfaces further to protect against unpriviledged DoS attempts by arming timers in the past. - Add per-capacity hierarchies to the timer migration code to prevent timer migration accross different capacity domains. This code has been disabled last minute as there is a pathological problem with SoCs which advertise a larger number of capacity domains. The problem is under investigation and the code won't be active before v7.3, but that turned out to be less intrusive than a full revert as it preserves the preparatory steps and allows people to work on the final resolution - Export time namespace functionality as a recent user can be built as a module. - Initialize the jiffies clocksource before using it. The recent hardening against time moving backward requires that the related members of struct clocksource have been initialized, otherwise it clamps the readout to 0, which makes time stand sill and causes boot delays. - Fix a more than twenty year old PID reference count leak in an error path of the POSIX CPU timer code. - The usual small fixes, improvements and cleanups all over the place. -----BEGIN PGP SIGNATURE----- iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmotj9MQHHRnbHhAa2Vy bmVsLm9yZwAKCRCmGPVMDXSYoZgMEADHAbRUevdga3L6BkPy5X/4HwKp5YpwIzAO NXrIugAWjObQnhiPoVV9IYrBSj7MKX1hXTOIt/4G8vj/LK1Q7f/YtLIhp5mi7Qhe EJXS5gRvjbHwlOLko+IRyRNJIy0PcL5VRCSEqmS/wbAblfmuGsVXP6Z4mlbWdG3i hux6w3sbbVy1rYo/hYEM+sA4IATH6cepzSNbT3jStyqZGWYDRuGlimcMGlIKe0hG mczpQx+MoBTXL3PsZJrTu8cLACSqZRNkiNaX1yh3xKGLQdrchbydSmL55ucUvQQb HWWotjt0PWHJgWNQVZpwFA+V2otV+XT7nVoxQlOQHwQxe1qroa781Sg6ZPXNH6Av KH88cvUuu7f6vVRtijUmiar9IVBy4MGZ7CpZLqCEBeAxMhwATidHOzxYZIGKOxJF k9Im1nIWkem3eMqGdwrmAuDHNQA9KCDgencDpT+6jJ/6rMuezIteVJjlJFsR4KGa 1bXwX6NqlLvs4dWfiihP8mLsn73t1CeN0HJSJkdjib6tHS0AQpVCV35z8QgRmeMs bSiFfVFutqAwu1KUC3ESKtzGDVirIjI6FAhbUOzE5SUq0APu1gXBih8142jPuQGz 9hyWBdVVsUIfQdEqY9xhSnXhUXcekPOe5rnDGECNP8MHxP9bNASTBHeOY3ELeKxp 7+Bjk34eqg== =KRcs -----END PGP SIGNATURE----- Merge tag 'timers-core-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull timer core updates from Thomas Gleixner: "Updates for the time/timer core subsystem: - Harden the user space controllable hrtimer interfaces further to protect against unpriviledged DoS attempts by arming timers in the past. - Add per-capacity hierarchies to the timer migration code to prevent timer migration accross different capacity domains. This code has been disabled last minute as there is a pathological problem with SoCs which advertise a larger number of capacity domains. The problem is under investigation and the code won't be active before v7.3, but that turned out to be less intrusive than a full revert as it preserves the preparatory steps and allows people to work on the final resolution - Export time namespace functionality as a recent user can be built as a module. - Initialize the jiffies clocksource before using it. The recent hardening against time moving backward requires that the related members of struct clocksource have been initialized, otherwise it clamps the readout to 0, which makes time stand sill and causes boot delays. - Fix a more than twenty year old PID reference count leak in an error path of the POSIX CPU timer code. - The usual small fixes, improvements and cleanups all over the place" * tag 'timers-core-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (31 commits) posix-cpu-timers: Fix pid refcount leak in do_cpu_nanosleep() error path time/jiffies: Register jiffies clocksource before usage timers/migration: Temporarily disable per capacity hierarchies timers/migration: Turn tmigr_hierarchy level_list into a flexible array timers/migration: Deactivate per-capacity hierarchies under nohz_full timers/migration: Fix hotplug migrator selection target on asymetric capacity machines ntsync: Honour caller's time namespace for absolute MONOTONIC timeouts time/namespace: Export init_time_ns and do_timens_ktime_to_host() timers/migration: Update stale @online doc to @available timers: Fix flseep() typo in kernel-doc comment hrtimer: Fix the bogus return type of __hrtimer_start_range_ns() hrtimer: Return ktime_t from hrtimer_get_next_event()/hrtimer_next_event_without() clocksource: Clean up clocksource_update_freq() functions alarmtimer: Remove stale return description from alarm_handle_timer() selftests/posix_timers: Use CLOCK_THREAD_CPUTIME_ID for ITIMER_PROF measurements scripts/timers: Add timer_migration_tree.py timers/migration: Handle capacity in connect tracepoints timers/migration: Split per-capacity hierarchies timers/migration: Track CPUs in a hierarchy timers/migration: Abstract out hierarchy to prepare for CPU capacity awareness ...	2026-06-15 13:39:12 +05:30
Linus Torvalds	f20e2fdaae	Updates for clocksource/clockevent drivers: - Add devm helpers for clocksources, which allows to simplify driver teardown and probe failure handling. - More module conversion work - Update the support for the ARM EL2 virtual timer including the required ACPI changes. - Add clockevent and clocksource support for the TI Dual Mode Timer - Fix the support for multiple watchdog instances in the TEGRA186 driver - Add D1 timer support to the SUN5I driver - The usual devicetree updates, cleanups and small fixes all over the place -----BEGIN PGP SIGNATURE----- iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmotjBUQHHRnbHhAa2Vy bmVsLm9yZwAKCRCmGPVMDXSYoaXND/0Z+wx27W4g4n9OTc//dfG+iV4eg05On9Fg LhlnyanjvIiZwWEBR0AdLDFHgxCqQV0fHpgldsz/Qlk/5uXcpsrGIhqwAiWynppU SLcZyW5XrZp4syTQfIxaAwAqMzGfW3nHWEkaRYCo/Q8oTDfvrXzY0AWYoK+bDkL3 NmnIuNGdLD+VoD4E3tn0/Q/MTtA4+DI9Cq98S5lOOchkO01ZOxx9OTalhAGFEvOo zk6YSW2qeUtvc3stahPSvAOaC6EcAalp7wNjWDeHV/c2kUmoKlGpsgvb8Mt+0vPD exNdctuTFGU2zrS1s0IB22n8RkSCzloxiTBs4Ec88ecn+cLIwq3oeuqchNiH/Quw YzQWuYhG1ML5i5/g3sAnYu+Hj5DOvB/0IBiYlCxKH4rf49RtBn1AQWt4OzdKzuQS /fnYqw+4N5XCodHzQbFv7Kbs+21lXXYubt9RLvSl6QYc4xfJRfWCS7rJdXUExw3D 5visGs+tUozi8C23t+BCWJ7jJsd0uNk51gmvOPUI3ybgL5YFwz1YdY4OwAAQT57i MGiDbC4nMz7nlyWkl40ZVjOvl8bNvCGozq30tRux+CPmpYUi/87ToraKZy8ulSra +26/ekATKXSfcyUh+TinA10AnnmCdxUyfA1WpZ6z/bCVdZqqHXK6HBf0OY8+u3um zhrXH2BJBQ== =4vQE -----END PGP SIGNATURE----- Merge tag 'timers-clocksource-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull clocksource updates from Thomas Gleixner: "Updates for clocksource/clockevent drivers: - Add devm helpers for clocksources, which allows to simplify driver teardown and probe failure handling. - More module conversion work - Update the support for the ARM EL2 virtual timer including the required ACPI changes. - Add clockevent and clocksource support for the TI Dual Mode Timer - Fix the support for multiple watchdog instances in the TEGRA186 driver - Add D1 timer support to the SUN5I driver - The usual devicetree updates, cleanups and small fixes all over the place" * tag 'timers-clocksource-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (24 commits) clocksource: move NXP timer selection to drivers/clocksource clocksource/drivers/timer-tegra186: Reserve and service a kernel watchdog clocksource/drivers/timer-tegra186: Register all accessible watchdog timers clocksource/drivers/timer-tegra186: Correct num_wdts for Tegra186 and Tegra234 clocksource/drivers/timer-tegra186: Fix support for multiple watchdog instances clocksource/drivers/timer-ti-dm: Add clockevent support clocksource/drivers/timer-ti-dm: Add clocksource support clocksource/drivers/timer-ti-dm: Fix property name in comment dt-bindings: timer: arm,arch_timer: Fix requirements for interrupt description clocksource/drivers/arm_arch_timer: Default to EL2 virtual timer when running VHE ACPI: GTDT: Parse information related to the EL2 virtual timer ACPI: GTDT: Account for GTDTv3 size when walking the platform timer descriptors clocksource: Add devm_clocksource_register_*() helpers clocksource/drivers/sun5i: Add D1 hstimer support dt-bindings: timer: allwinner,sun5i-a13-hstimer: add H616 and D1 dt-bindings: timer: Add StarFive JHB100 clint dt-bindings: timer: renesas,rz-mtu3: document RZ/{T2H,N2H} dt-bindings: timer: renesas,rz-mtu3: Remove TCIU8 interrupt dt-bindings: timer: Remove sifive,fine-ctr-bits property clocksource/drivers/timer-of: Make the code compatible with modules ...	2026-06-15 13:34:03 +05:30
WenTao Liang	87bd2ad568	posix-cpu-timers: Fix pid refcount leak in do_cpu_nanosleep() error path In do_cpu_nanosleep(), posix_cpu_timer_create() takes a pid reference via get_pid() and stores it in timer.it.cpu.pid. If the subsequent posix_cpu_timer_set() call fails, the function returns immediately without calling posix_cpu_timer_del() to release the pid reference, causing a leak. Fix it by calling posix_cpu_timer_del() before the unlock-and-return on the error path, consistent with the other exit paths in the same function. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: WenTao Liang <vulab@iscas.ac.cn> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260611161738.97043-1-vulab@iscas.ac.cn	2026-06-13 16:16:02 +02:00
Thomas Gleixner	f24df84cbe	time/jiffies: Register jiffies clocksource before usage Teddy reported that a XEN HVM has a long boot delay, which was bisected to the recent enhancements to the negative motion detection. It turned out that the jiffies clocksource is used in early boot before it is registered, which leaves the max_delta_raw field at zero. That causes the read out to be clamped to the max delta of 0, which means time is not making progress. Cure it by ensuring that it is initialized before its first usage in timekeeping_init(). Fixes: `76031d9536` ("clocksource: Make negative motion detection more robust") Reported-by: Teddy Astie <teddy.astie@vates.tech> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Teddy Astie <teddy.astie@vates.tech> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/87y0gn3fve.ffs@fw13 Closes: https://lore.kernel.org/all/1780914594.8631fc262581453bbf619ec5b2062170.19ea6c8227b000701b@vates.tech	2026-06-13 15:22:40 +02:00
Frederic Weisbecker	bb0c250e8e	timers/migration: Temporarily disable per capacity hierarchies Some workloads with different CPU capacities consume more power with timer migration than before. The recently introduced per capacity hierarchies were supposed to alleviate this problem. However it appears to also regress other types of workloads, especially when plenty of capacities live together in the same machine. Disable the feature until a reasonable solution is found. Fixes: `098cbaad8e` ("timers/migration: Split per-capacity hierarchies") Reported-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260609123356.28449-1-frederic@kernel.org Closes: https://lore.kernel.org/all/3b79338f-6cfc-4722-8062-9103db2c8ad1@arm.com	2026-06-09 17:25:54 +02:00
Amit Matityahu	d486b4934a	timers/migration: Fix livelock in tmigr_handle_remote_up() tmigr_handle_remote_cpu() skips timer_expire_remote() when cpu == smp_processor_id(), assuming the local softirq path already handled this CPU's timers. This assumption is wrong because jiffies can advance after the handling of the CPU's global timers in run_timer_base(BASE_GLOBAL) and before tmigr_handle_remote() evaluates the expiry times. As a consequence a timer which expires after the CPU local timer wheel advanced and becomes expired in the remote handling is ignored and the callback is never invoked and removed from the timer wheel. What's worse is that fetch_next_timer_interrupt_remote() keeps reporting it as expired, and the event is re-queued with expires == now on each iteration. The goto-again loop spins indefinitely. Fix this by calling timer_expire_remote() unconditionally. That's minimal overhead for the common case as __run_timer_base() returns immediately if there is nothing to expire in the local wheel. [ tglx: Amend change log and add a comment ] Fixes: `7ee9887703` ("timers: Implement the hierarchical pull model") Reported-by: Alon Kariv <alonka@amazon.com> Signed-off-by: Amit Matityahu <amitmat@amazon.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260603170139.33628-1-amitmat@amazon.com	2026-06-04 14:35:33 +02:00
David Woodhouse	ca1ec8bfac	timekeeping: Add clocksource read_snapshot() method and hw_cycles to snapshot Add a read_snapshot() callback to struct clocksource which returns the derived clocksource value while also providing the underlying hardware counter reading and the related clocksource ID. This allows ktime_get_snapshot_id() to populate new hw_cycles and hw_csid fields in struct system_time_snapshot. For clocksources that are derived from an underlying counter (e.g., Hyper-V TSC page scales TSC to 10MHz, kvmclock scales TSC to 1GHz), this provides atomic access to both the derived value needed for timekeeping calculations, and the raw hardware counter needed by consumers like KVM's master clock and the vmclock PTP driver. [ tglx: Reworked it slightly ] Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Assisted-by: Kiro:claude-opus-4.6-1m Link: https://patch.msgid.link/20260526230635.136914-1-dwmw2@infradead.org Link: https://patch.msgid.link/20260529195558.202568489@kernel.org	2026-06-04 11:04:18 +02:00
Thomas Gleixner	9aebde8af6	timekeeping: Add support for AUX clock cross timestamping Now that all prerequisites are in place add the final support for AUX clocks in get_device_system_crosststamp(), which enables the PTP layer to support hardware cross timestamps with a new IOTCL. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195558.097464513@kernel.org	2026-06-04 11:04:18 +02:00
Thomas Gleixner	09fb74f77e	timekeeping: Prepare for cross timestamps on arbitrary clock IDs PTP device system crosstime stamps support only CLOCK_REALTIME, which is meaningless for AUX clocks. The PTP core hands in the clock ID already, so prepare the core code to honor it. - Add a new sys_systime field to struct system_device_crosststamp which aliases the sys_realtime field. Once all users are converted sys_realtime can be removed. - Prepare get_device_system_crosststamp() and the related code for it by switching to sys_systime and providing the initial changes to utilize different time keepers. No functional change intended. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.846634842@kernel.org	2026-06-04 11:04:17 +02:00
Thomas Gleixner	6c14771816	timekeeping: Add system_counterval_t to struct system_device_crosststamp An upcoming extension to the PTP IOCTL requires to return the system counter value and the clocksource ID to user space. get_device_system_crosststamp() has this information already. Extend struct system_device_crosststamp with a system_counterval_t member and fill in the data. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.429406675@kernel.org	2026-06-04 11:04:16 +02:00
Thomas Gleixner	23c1bfa9f8	timekeeping: Add CLOCK_AUX support for ktime_get_snapshot_id() Now that all users are converted it's possible to enable snapshotting of CLOCK_AUX time. The underlying clocksource is the same as for all other CLOCK variants. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.380601005@kernel.org	2026-06-04 11:04:16 +02:00
Thomas Gleixner	705e106807	timekeeping: Remove system_time_snapshot::real/boot/raw All users are converted over to ktime_get_snapshot_id() and system_time_snapshot::systime and ::monoraw. Remove the leftovers. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.330029635@kernel.org	2026-06-04 11:04:16 +02:00
Thomas Weißschuh	96942092d5	vdso/treewide: Drop GENERIC_TIME_VSYSCALL This Kconfig symbol is not used anymore, remove it. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260519-vdso-generic_time_vsyscal-v1-3-5c2a5905d5f5@linutronix.de	2026-06-02 21:41:23 +02:00
Rosen Penev	45b49d7e3a	timers/migration: Turn tmigr_hierarchy level_list into a flexible array The level_list array is allocated separately right after the parent struct. The size of the array is already known. Move level_list to the struct tail as a flexible array member and fold the two allocations into a single kzalloc_flex(). Signed-off-by: Rosen Penev <rosenp@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Assisted-by: Claude:Opus-4.7 Link: https://patch.msgid.link/20260522231618.41622-1-rosenp@gmail.com	2026-06-02 21:34:03 +02:00
Frederic Weisbecker	d4f198c136	timers/migration: Deactivate per-capacity hierarchies under nohz_full NOHZ_FULL CPUs global timers are guaranteed to be handled by the timekeeper CPU, which never stops its tick and therefore remains active in the hierarchy. But since the introduction of per-capacity hierarchies, this guarantee is broken because the timekeeper may not belong to the same hierarchy as all the NOHZ_FULL CPUs. Fix it with simply turning off capacity awareness when NOHZ_FULL is running and force a single hierarchy. NOHZ_FULL is not exactly optimized powerwise anyway. Fixes: `098cbaad8e` ("timers/migration: Split per-capacity hierarchies") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260519220926.63437-3-frederic@kernel.org	2026-06-02 21:34:03 +02:00
Frederic Weisbecker	e4a70f5fbd	timers/migration: Fix hotplug migrator selection target on asymetric capacity machines When a top-level migrator is deactivated, either at CPU down hotplug time or when a CPU is domain isolated, a new migrator is elected among the available CPUs and woken up to take over the migration duty. However that election must happen at the scope of a given hierarchy and not globally, which the introduction of per-capacity hierarchies failed to handle. As a result a given hierarchy may end up without migrator to handle global timers. Fix it by making sure that the new migrator belongs to the same hierarchy as the outgoing CPU. Fixes: `098cbaad8e` ("timers/migration: Split per-capacity hierarchies") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260519220926.63437-2-frederic@kernel.org	2026-06-02 21:34:03 +02:00
Frederic Weisbecker	6a1f6a9dd0	tick/sched: Account tickless idle cputime only when tick is stopped There is no real point in switching to dyntick-idle cputime accounting mode if the tick is not actually stopped. This just adds overhead, notably fetching the GTOD, on each idle exit and each idle IRQ entry for no reason during short idle trips. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-12-frederic@kernel.org	2026-06-02 21:27:26 +02:00
Frederic Weisbecker	29807c524d	tick/sched: Remove unused fields Remove fields after the dyntick-idle cputime migration to scheduler code. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-11-frederic@kernel.org	2026-06-02 21:27:25 +02:00
Frederic Weisbecker	a5fe724e20	tick/sched: Move dyntick-idle cputime accounting to cputime code Although the dynticks-idle cputime accounting is necessarily tied to the tick subsystem, the actual related accounting code has no business residing there and should be part of the scheduler cputime code. Move away the relevant pieces and state machine to where they belong. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-10-frederic@kernel.org	2026-06-02 21:27:25 +02:00
Frederic Weisbecker	bd0c77cd46	tick/sched: Remove nohz disabled special case in cputime fetch Even when nohz is not runtime enabled, the dynticks idle cputime accounting can run and the common idle cputime accessors are still relevant. Remove the nohz disabled special case accordingly. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-9-frederic@kernel.org	2026-06-02 21:27:25 +02:00
Frederic Weisbecker	cf6444c3e1	tick/sched: Unify idle cputime accounting The non-vtime dynticks-idle cputime accounting is a big mess that accumulates within two concurrent statistics, each having their own shortcomings: * The accounting for online CPUs which is based on the delta between tick_nohz_start_idle() and tick_nohz_stop_idle(). Pros: - Works when the tick is off - Has nsecs granularity Cons: - Account idle steal time but doesn't substract it from idle cputime. - Assumes CONFIG_IRQ_TIME_ACCOUNTING by not accounting IRQs but the IRQ time is simply ignored when CONFIG_IRQ_TIME_ACCOUNTING=n - The windows between 1) idle task scheduling and the first call to tick_nohz_start_idle() and 2) idle task between the last tick_nohz_stop_idle() and the rest of the idle time are blindspots wrt. cputime accounting (though mostly insignificant amount) - Relies on private fields outside of kernel stats, with specific accessors. * The accounting for offline CPUs which is based on ticks and the jiffies delta during which the tick was stopped. Pros: - Handles steal time correctly - Handle CONFIG_IRQ_TIME_ACCOUNTING=y and CONFIG_IRQ_TIME_ACCOUNTING=n correctly. - Handles the whole idle task - Accounts directly to kernel stats, without midlayer accumulator. Cons: - Doesn't elapse when the tick is off, which doesn't make it suitable for online CPUs. - Has TICK_NSEC granularity (jiffies) - Needs to track the dyntick-idle ticks that were accounted and substract them from the total jiffies time spent while the tick was stopped. This is an ugly workaround. Having two different accounting for a single context is not the only problem: since those accountings are of different natures, it is possible to observe the global idle time going backward after a CPU goes offline. Clean up the situation with introducing a hybrid approach that stays coherent and works for both online and offline CPUs: * Tick based or native vtime accounting operate before the idle loop is entered and resume once the idle loop prepares to exit. * When the idle loop starts, switch to dynticks-idle accounting as is done currently, except that the statistics accumulate directly to the relevant kernel stat fields. * Private dyntick cputime accounting fields are removed. * Works on both online and offline case. Further improvement will include: * Only switch to dynticks-idle cputime accounting when the tick actually goes in dynticks mode. * Handle CONFIG_IRQ_TIME_ACCOUNTING=n correctly such that the dynticks-idle accounting still elapses while on IRQs. * Correctly substract idle steal cputime from idle time Reported-by: Xin Zhao <jackzxcui1989@163.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-8-frederic@kernel.org	2026-06-02 21:27:25 +02:00
Frederic Weisbecker	650a59805a	sched/cputime: Correctly support generic vtime idle time Currently whether generic vtime is running or not, the idle cputime is fetched from the nohz accounting. However generic vtime already does its own idle cputime accounting. Only the kernel stat accessors are not plugged to support it. Read the idle generic vtime cputime when it's running, this will allow to later more clearly split nohz and vtime cputime accounting. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-5-frederic@kernel.org	2026-06-02 21:27:25 +02:00
Frederic Weisbecker	86db4084b4	tick/sched: Fix TOCTOU in nohz idle time fetch When the nohz idle time is fetched, the current clock timestamp is taken outside the seqcount, which can result in a race as reported by Sashiko: get_cpu_sleep_time_us() tick_nohz_start_idle() ----------------------- --------------------- now = ktime_get() write_seqcount_begin(idle_sleeptime_seq); idle_entrytime = ktime_get() tick_sched_flag_set(ts, TS_FLAG_IDLE_ACTIVE); write_seqcount_end(&ts->idle_sleeptime_seq); read_seqcount_begin(idle_sleeptime_seq) delta = now - idle_entrytime); //!! But now < idle_entrytime idle = *sleeptime + delta; read_seqcount_retry(&ts->idle_sleeptime_seq, seq) Here the read side fetches the timestamp before the write side and its update. As a result the time delta computed on the read side is negative (ktime_t is signed) and breaks the cputime monotonicity guarantee. This could possibly be fixed with reading the current clock timestamp inside the seqcount but the reader overhead might then increase. Also simply checking that the current timestamp is above the idle entry time is enough to prevent any issue of the like. Fixes: `620a30fa0b` ("timers/nohz: Protect idle/iowait sleep time under seqcount") Reported-by: Sashiko Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260508131647.43868-2-frederic@kernel.org	2026-06-02 21:27:24 +02:00
Naveen Kumar Chaudhary	ce4abda5e1	time: Fix off-by-one in settimeofday() usec validation The validation check uses '>' instead of '>=' when comparing tv_usec against USEC_PER_SEC, allowing the value 1000000 through. After conversion to nanoseconds (*= 1000), this produces tv_nsec == NSEC_PER_SEC, violating the timespec invariant that tv_nsec must be less than NSEC_PER_SEC. Use '>=' to reject tv_usec values that are not in the valid range of 0 to 999999. Fixes: `5e0fb1b57b` ("y2038: time: avoid timespec usage in settimeofday()") Signed-off-by: Naveen Kumar Chaudhary <naveen.osdev@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Acked-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/4rikk44zew3s6577dugmx4jyblz7o5c57niuap6ct3td5yfm6w@gh7pcumg7qor	2026-06-02 21:07:55 +02:00
Naveen Kumar Chaudhary	c1ca14ca22	clockevents: Fix duplicate type specifier in stub function parameter The stub for arch_inlined_clockevent_set_next_coupled() has 'u64 u64 cycles' in its parameter list. Since u64 is a typedef, the compiler parses the second 'u64' as the parameter name, making 'cycles' an unused token. Remove the duplicate so the parameter is correctly named. Fixes: `89f951a1e8` ("clockevents: Provide support for clocksource coupled comparators") Signed-off-by: Naveen Kumar Chaudhary <naveen.osdev@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/7tostpvxzdn6tobmyow63a5rweatls5kux3scqp2vzhe7mv6uq@ecr746b4hyhf	2026-06-02 21:07:55 +02:00
Maoyi Xie	766e828b01	time/namespace: Export init_time_ns and do_timens_ktime_to_host() timens_ktime_to_host() in compares the current time namespace against init_time_ns for the fast path. It calls do_timens_ktime_to_host() for the offset case. Both symbols are needed at link time by any caller of the inline. All current callers are builtin, but ntsync can be built as module, which prevents it from using it. Export both with EXPORT_SYMBOL_GPL. Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260528063311.3300393-2-maoyixie.tju@gmail.com	2026-06-02 21:05:36 +02:00
Thomas Gleixner	ef22786707	timekeeping: Use system_time_snapshot::systime/monoraw instead of ::real/raw system_time_snapshot::systime provides the same information as system_time_snapshot::real when the snapshot was taken with ktime_get_snapshot_id(CLOCK_REALTIME). Convert the history interpolation over to use 'systime' and 'monoraw' as 'real/raw' are going away once all users are converted. As a side effect this is the first step to support CLOCK_AUX with get_device_crosstime_stamp() and the history interpolation. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.024415766@kernel.org	2026-06-02 11:39:57 +02:00
Thomas Gleixner	eba302268a	timekeeping: Provide ktime_get_snapshot_id() ktime_get_snapshot() provides a snapshot of the underlying clocksource counter value and the corresponding CLOCK_MONOTONIC_RAW, CLOCK_REALTIME and CLOCK_BOOTTIME timestamps. There is no usage of CLOCK_REALTIME and CLOCK_BOOTTIME at the same time and CLOCK_BOOTTIME support was just added for the ARM64 KVM tracing mechanism, which needs CLOCK_BOOTTIME and the underlying clocksource counter value. ktime_get_snapshot() is also not suitable for usage with CLOCK_AUX, but that's a prerequisite to support PTP hardware timestamping for CLOCK_AUX steering. As a first step, rename ktime_get_snapshot() to ktime_get_snapshot_id(), which now takes a clockid argument to select the clock which needs to be captured. The result is stored in system_time_snapshot::systime, which will replace the system_time_snapshot::real/boot members once all usage sites have been converted. ktime_get_snapshot() is a simple wrapper which hands in CLOCK_REALTIME as clockid argument for the conversion period. That means CLOCK_REALTIME is now captured twice, but that redunancy is only temporary. As all usage sites of struct system_time_snapshot has to be updated anyway, rename the 'raw' member to 'monoraw' for clarity. No functional change vs. current users of ktime_get_snapshot() Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195556.971591633@kernel.org	2026-06-02 11:39:57 +02:00
Zhan Xusheng	45a13ba52c	timers/migration: Update stale @online doc to @available Commit `8312cab5ff` ("timers/migration: Rename 'online' bit to 'available'") renamed the 'online' field of struct tmigr_cpu to 'available'. The kernel doc comment above the struct still describes the old field name. Update it to reflect the actual field name and use the 'available' wording in the description. Fixes: `8312cab5ff` ("timers/migration: Rename 'online' bit to 'available'") Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260526022106.1302279-1-zhanxusheng@xiaomi.com	2026-05-28 18:35:11 +02:00
Daniel Lezcano	3eb4923e68	clocksource: Add devm_clocksource_register_() helpers Introduce device-managed helpers for clocksource registration. The clocksource framework currently provides __clocksource_register_scale() along with convenience wrappers for Hz and kHz registration. However, drivers must handle error paths and cleanup manually, typically by pairing registration with an explicit clocksource_unregister() call. Add a devm-based variant, __devm_clocksource_register_scale(), along with devm_clocksource_register_hz() and devm_clocksource_register_khz() helpers. These helpers register the clocksource and attach a devres action to automatically unregister it on driver detach or probe failure. This simplifies driver code by: removing explicit cleanup paths * ensuring correct teardown ordering * aligning with the devm-based resource management model widely used across the kernel While drivers can open-code devm_add_action_or_reset(), providing a dedicated helper avoids duplication, reduces boilerplate, and ensures consistent usage across drivers, following patterns used in other subsystems. This is also particularly useful for drivers built as modules, where device-managed resource handling avoids manual cleanup in remove paths and ensures correct teardown on module unload. This helper is self-contained and can be adopted progressively by drivers. No functional change. Signed-off-by: Daniel Lezcano <daniel.lezcano@oss.qualcomm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260506153831.605159-1-daniel.lezcano@oss.qualcomm.com	2026-05-18 11:02:51 +02:00
Thomas Gleixner	5d330d652d	hrtimer: Fix the bogus return type of __hrtimer_start_range_ns() __hrtimer_start_range_ns() has a bool return type, but returns actually three different values, which are checked at the call site. Make the return type int. Fixes: `bd5956166d` ("hrtimer: Provide hrtimer_start_range_ns_user()") Reported-by: Dan Carpenter <error27@gmail.com Signed-off-by: Thomas Gleixner <tglx@kernel.org>	2026-05-13 22:41:12 +02:00
Thomas Weißschuh	3af1f49f41	hrtimer: Return ktime_t from hrtimer_get_next_event()/hrtimer_next_event_without() These functions really work in terms of ktime_t and not u64. Change their return types and adapt the callers. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260504-hrtimer-next_event-v2-1-7a5d0550b42f@linutronix.de	2026-05-06 08:33:09 +02:00
Thomas Weißschuh	33d4bfc496	clocksource: Clean up clocksource_update_freq() functions Remove the unused functions __clocksource_update_freq_hz() and __clocksource_update_freq_khz(). Then make __clocksource_update_freq_scale() static as it is not used from external callers anymore. Also clean up the comment accordingly. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Acked-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260504-clocksource-update_freq-v2-1-3e696fb01776@linutronix.de	2026-05-06 08:33:08 +02:00
Zhan Xusheng	ed3b3c4976	alarmtimer: Remove stale return description from alarm_handle_timer() alarm_handle_timer() was converted from returning enum alarmtimer_restart to void, but the kernel-doc "Return:" line was not removed. Remove the stale description. Fixes: `2634303f87` ("alarmtimers: Remove return value from alarm functions") Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260429080635.166790-1-zhanxusheng@xiaomi.com	2026-05-06 08:33:08 +02:00
Frederic Weisbecker	5a7dfbcbbd	timers/migration: Handle capacity in connect tracepoints This let tracers know to which hierarchy a CPU belongs to. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260423165354.95152-6-frederic@kernel.org	2026-05-06 08:33:07 +02:00
Frederic Weisbecker	098cbaad8e	timers/migration: Split per-capacity hierarchies Systems with heterogeneous CPU capacities, such as big.LITTLE, have reported power issues since the introduction of the new timer migration code. Timers migrate from small capacity CPUs to big ones, degrading their target residency and thus overall power consumption. Solve this with splitting hierarchies per CPU capacity. For example in a big.LITTLE machine, split a single hierarchy in two: one for big capacity CPUs and another one for small capacity CPUs. This way global timers only migrate across CPUs of the same capacity. For simplicity purpose, split hierarchies keep the same number of possible levels as if there were a single hierarchy, even though the CPUs are distributed between multiple hierarchies. This could be a problem on NUMA systems with heterogeneous CPU capacities (provided that ever exists yet) where useless intermediate nodes may be created. Solving this properly will imply on boot to know in advance how many capacities are available and the number of CPUs for each of them. Reported-by: Sehee Jeong <sehee1.jeong@samsung.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260423165354.95152-5-frederic@kernel.org	2026-05-06 08:33:07 +02:00
Frederic Weisbecker	3ba2548838	timers/migration: Track CPUs in a hierarchy When a new root is created, the old root is connected to it and propagates up its own assumed to be active state, since the hotplug control CPU is itself active and part of the old root. However with per-capacity hierarchies, this assumption won't be true anymore because the hotplug control CPU calling the timer migration prepare callback may not belong to the same hierarchy as the booting CPU. To solve this, track the available CPUs per hierarchies so that the root connection can be offlined to safe CPUs. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260423165354.95152-4-frederic@kernel.org	2026-05-06 08:33:06 +02:00
Frederic Weisbecker	ff65875f80	timers/migration: Abstract out hierarchy to prepare for CPU capacity awareness In order to prepare for separating out CPUs from different capacities in distinct hierarchies, create a hierarchy structure that group setup must rely upon. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260423165354.95152-3-frederic@kernel.org	2026-05-06 08:33:06 +02:00
Thomas Gleixner	86db5e92f2	Merge branch 'timers/urgent' into timers/core Pull in the timer migration fix so further changes can be applied.	2026-05-06 08:21:45 +02:00
Frederic Weisbecker	bd3c45dd01	timers/migration: Fix another hotplug activation race The hotplug control CPU is assumed to be active in the hierarchy but that doesn't imply that the root is active. If the current CPU is not the one that activated the current hierarchy, and the CPU performing this duty is still halfway through the tree, the root may still be observed inactive. And this can break the activation of a new root as in the following scenario: 1) Initially, the whole system has 64 CPUs and only CPU 63 is awake. [GRP1:0] active / \| \ / \| \ [GRP0:0] [...] [GRP0:7] idle idle active / \| \ \| CPU 0 CPU 1 ... CPU 63 idle idle active 2) CPU 63 goes idle _but_ due to a #VMEXIT it hasn't yet reached the [GRP1:0]->parent dereference (that would be NULL and stop the walk) in __walk_groups_from(). [GRP1:0] idle / \| \ / \| \ [GRP0:0] [...] [GRP0:7] idle idle idle / \| \ \| CPU 0 CPU 1 ... CPU 63 idle idle idle 3) CPU 1 wakes up, activates GRP0:0 but didn't yet manage to propagate up to GRP1:0 due to yet another #VMEXIT. [GRP1:0] idle / \| \ / \| \ [GRP0:0] [...] [GRP0:7] active idle idle / \| \ \| CPU 0 CPU 1 ... CPU 63 idle active idle 3) CPU 0 wakes up and doesn't need to walk above GRP0:0 as it's CPU 1 role. [GRP1:0] idle / \| \ / \| \ [GRP0:0] [...] [GRP0:7] active idle idle / \| \ \| CPU 0 CPU 1 ... CPU 63 active active idle 4) CPU 0 boots CPU 64. It creates a new root for it. [GRP2:0] idle / \ / \ [GRP1:0] [GRP1:1] idle idle / \| \ \ / \| \ \ [GRP0:0] [...] [GRP0:7] [GRP0:8] active idle idle idle / \| \ \| \| CPU 0 CPU 1 ... CPU 63 CPU 64 active active idle offline 5) CPU 0 activates the new root, but note that GRP1:0 is still idle, waiting for CPU 1 to resume from #VMEXIT and activate it. [GRP2:0] active / \ / \ [GRP1:0] [GRP1:1] idle idle / \| \ \ / \| \ \ [GRP0:0] [...] [GRP0:7] [GRP0:8] active idle idle idle / \| \ \| \| CPU 0 CPU 1 ... CPU 63 CPU 64 active active idle offline 6) CPU 63 resumes after #VMEXIT and sees the new GRP1:0 parent. Therefore it propagates the stale inactive state of GRP1:0 up to GRP2:0. [GRP2:0] idle / \ / \ [GRP1:0] [GRP1:1] idle idle / \| \ \ / \| \ \ [GRP0:0] [...] [GRP0:7] [GRP0:8] active idle idle idle / \| \ \| \| CPU 0 CPU 1 ... CPU 63 CPU 64 active active idle offline 7) CPU 1 resumes after #VMEXIT and finally activates GRP1:0. But it doesn't observe its parent link because no ordering enforced that. Therefore GRP2:0 is spuriously left idle. [GRP2:0] idle / \ / \ [GRP1:0] [GRP1:1] active idle / \| \ \ / \| \ \ [GRP0:0] [...] [GRP0:7] [GRP0:8] active idle idle idle / \| \ \| \| CPU 0 CPU 1 ... CPU 63 CPU 64 active active idle offline Such races are highly theoretical and the problem would solve itself once the old root ever becomes idle again. But it still leaves a taste of discomfort. Fix it with enforcing a fully ordered atomic read of the old root state before propagating the activate state up to the new root. It has a two directions ordering effect: * Acquire + release of the latest old root state: If the hotplug control CPU is not the one that woke up the old root, make sure to acquire its active state and propagate it upwards through the ordered chain of activation (the acquire pairs with the cmpxchg() in tmigr_active_up() and subsequent releases will pair with atomic_read_acquire() and smp_mb__after_atomic() in tmigr_inactive_up()). * Release: If the hotplug control CPU is not the one that must wake up the old root, but the CPU covering that is lagging behind its duty, publish the links from the old root to the new parents. This way the lagging CPU will propagate the active state itself. Fixes: `7ee9887703` ("timers: Implement the hierarchical pull model") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260423165354.95152-2-frederic@kernel.org	2026-05-06 08:21:12 +02:00
Thomas Gleixner	ed78a70194	alarmtimer: Remove unused interfaces All alarmtimer users are converted to alarm_start_timer(). Remove the now unused interfaces. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/20260408114952.670899355@kernel.org	2026-05-01 21:36:14 +02:00
Thomas Gleixner	f4b58f61da	alarmtimer: Convert posix timer functions to alarm_start_timer() Use the new alarm_start_timer() for arming and rearming posix interval timers and for clock_nanosleep() so that already expired timers do not go through the full timer interrupt cycle. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260408114952.400451460@kernel.org	2026-05-01 21:36:13 +02:00

1 2 3 4 5 ...

2967 Commits