linux/kernel
John Stultz d9e8fada0c timekeeping: Avoid possible deadlock from clock_was_set_delayed
commit 6fdda9a9c5 upstream.

As part of normal operaions, the hrtimer subsystem frequently calls
into the timekeeping code, creating a locking order of
  hrtimer locks -> timekeeping locks

clock_was_set_delayed() was suppoed to allow us to avoid deadlocks
between the timekeeping the hrtimer subsystem, so that we could
notify the hrtimer subsytem the time had changed while holding
the timekeeping locks. This was done by scheduling delayed work
that would run later once we were out of the timekeeing code.

But unfortunately the lock chains are complex enoguh that in
scheduling delayed work, we end up eventually trying to grab
an hrtimer lock.

Sasha Levin noticed this in testing when the new seqlock lockdep
enablement triggered the following (somewhat abrieviated) message:

[  251.100221] ======================================================
[  251.100221] [ INFO: possible circular locking dependency detected ]
[  251.100221] 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053 Not tainted
[  251.101967] -------------------------------------------------------
[  251.101967] kworker/10:1/4506 is trying to acquire lock:
[  251.101967]  (timekeeper_seq){----..}, at: [<ffffffff81160e96>] retrigger_next_event+0x56/0x70
[  251.101967]
[  251.101967] but task is already holding lock:
[  251.101967]  (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70
[  251.101967]
[  251.101967] which lock already depends on the new lock.
[  251.101967]
[  251.101967]
[  251.101967] the existing dependency chain (in reverse order) is:
[  251.101967]
-> #5 (hrtimer_bases.lock#11){-.-...}:
[snipped]
-> #4 (&rt_b->rt_runtime_lock){-.-...}:
[snipped]
-> #3 (&rq->lock){-.-.-.}:
[snipped]
-> #2 (&p->pi_lock){-.-.-.}:
[snipped]
-> #1 (&(&pool->lock)->rlock){-.-...}:
[  251.101967]        [<ffffffff81194803>] validate_chain+0x6c3/0x7b0
[  251.101967]        [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580
[  251.101967]        [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0
[  251.101967]        [<ffffffff84398500>] _raw_spin_lock+0x40/0x80
[  251.101967]        [<ffffffff81153e69>] __queue_work+0x1a9/0x3f0
[  251.101967]        [<ffffffff81154168>] queue_work_on+0x98/0x120
[  251.101967]        [<ffffffff81161351>] clock_was_set_delayed+0x21/0x30
[  251.101967]        [<ffffffff811c4bd1>] do_adjtimex+0x111/0x160
[  251.101967]        [<ffffffff811e2711>] compat_sys_adjtimex+0x41/0x70
[  251.101967]        [<ffffffff843a4b49>] ia32_sysret+0x0/0x5
[  251.101967]
-> #0 (timekeeper_seq){----..}:
[snipped]
[  251.101967] other info that might help us debug this:
[  251.101967]
[  251.101967] Chain exists of:
  timekeeper_seq --> &rt_b->rt_runtime_lock --> hrtimer_bases.lock#11

[  251.101967]  Possible unsafe locking scenario:
[  251.101967]
[  251.101967]        CPU0                    CPU1
[  251.101967]        ----                    ----
[  251.101967]   lock(hrtimer_bases.lock#11);
[  251.101967]                                lock(&rt_b->rt_runtime_lock);
[  251.101967]                                lock(hrtimer_bases.lock#11);
[  251.101967]   lock(timekeeper_seq);
[  251.101967]
[  251.101967]  *** DEADLOCK ***
[  251.101967]
[  251.101967] 3 locks held by kworker/10:1/4506:
[  251.101967]  #0:  (events){.+.+.+}, at: [<ffffffff81154960>] process_one_work+0x200/0x530
[  251.101967]  #1:  (hrtimer_work){+.+...}, at: [<ffffffff81154960>] process_one_work+0x200/0x530
[  251.101967]  #2:  (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70
[  251.101967]
[  251.101967] stack backtrace:
[  251.101967] CPU: 10 PID: 4506 Comm: kworker/10:1 Not tainted 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053
[  251.101967] Workqueue: events clock_was_set_work

So the best solution is to avoid calling clock_was_set_delayed() while
holding the timekeeping lock, and instead using a flag variable to
decide if we should call clock_was_set() once we've released the locks.

This works for the case here, where the do_adjtimex() was the deadlock
trigger point. Unfortuantely, in update_wall_time() we still hold
the jiffies lock, which would deadlock with the ipi triggered by
clock_was_set(), preventing us from calling it even after we drop the
timekeeping lock. So instead call clock_was_set_delayed() at that point.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13 13:48:04 -08:00
..
cpu sched, idle: Fix the idle polling state logic 2013-11-29 11:11:42 -08:00
debug
events perf: Fix perf ring buffer memory ordering 2013-11-20 12:27:47 -08:00
gcov
irq irq: Enable all irqs unconditionally in irq_resume 2013-12-11 22:36:27 -08:00
power PM / hibernate: Avoid overflow in hibernate_preallocate_memory() 2013-12-04 10:56:58 -08:00
sched sched: Guarantee new group-entities always have weight 2014-01-15 15:28:54 -08:00
time timekeeping: Avoid possible deadlock from clock_was_set_delayed 2014-02-13 13:48:04 -08:00
trace ftrace: Have function graph only trace based on global_ops filters 2014-02-13 13:48:03 -08:00
.gitignore
acct.c
async.c
audit_tree.c kernel/audit_tree.c:audit_add_tree_rule(): protect `rule' from kill_rules() 2013-06-12 16:29:46 -07:00
audit_watch.c
audit.c audit: reset audit backlog wait time after error recovery 2014-02-13 13:47:59 -08:00
audit.h audit: fix mq_open and mq_unlink to add the MQ root as a hidden parent audit_names record 2013-12-04 10:57:03 -08:00
auditfilter.c
auditsc.c audit: fix mq_open and mq_unlink to add the MQ root as a hidden parent audit_names record 2013-12-04 10:57:03 -08:00
backtracetest.c
bounds.c
capability.c
cgroup_freezer.c
cgroup.c cgroup: use a dedicated workqueue for cgroup destruction 2013-12-04 10:57:20 -08:00
compat.c
configs.c
context_tracking.c Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-06-20 08:18:35 -10:00
cpu_pm.c
cpu.c CPU hotplug: provide a generic helper to disable/enable CPU hotplug 2013-06-12 16:29:44 -07:00
cpuset.c cpuset: Fix memory allocator deadlock 2013-12-04 10:57:16 -08:00
crash_dump.c
cred.c
delayacct.c
dma.c
elfcore.c
exec_domain.c
exit.c move exit_task_namespaces() outside of exit_notify() 2013-06-15 05:39:08 +04:00
extable.c
fork.c mm: fix TLB flush race between migration, and change_protection_range 2014-01-09 12:24:23 -08:00
freezer.c libata, freezer: avoid block device removal while system is frozen 2014-01-09 12:24:23 -08:00
futex_compat.c
futex.c futex: fix handling of read-only-mapped hugepages 2013-12-20 07:45:08 -08:00
groups.c
hrtimer.c hrtimers: Move SMP function call to thread context 2013-07-28 16:30:22 -07:00
hung_task.c
irq_work.c
itimer.c
jump_label.c
kallsyms.c
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kexec.c PCI: Disable Bus Master only on kexec reboot 2013-12-20 07:45:08 -08:00
kmod.c
kprobes.c
ksysfs.c
kthread.c
latencytop.c
lglock.c
lockdep_internals.h
lockdep_proc.c
lockdep_states.h
lockdep.c
Makefile
modsign_certificate.S
modsign_pubkey.c
module_signing.c
module-internal.h
module.c module: do percpu allocation after uniqueness check. No, really! 2013-07-13 11:42:26 -07:00
mutex-debug.c
mutex-debug.h
mutex.c
mutex.h
notifier.c
nsproxy.c
padata.c
panic.c
params.c
pid_namespace.c
pid.c pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup 2013-09-26 17:18:27 -07:00
posix-cpu-timers.c
posix-timers.c
printk.c printk: Fix rq->lock vs logbuf_lock unlock lock inversion 2013-07-25 14:07:31 -07:00
profile.c
ptrace.c exec/ptrace: fix get_dumpable() incorrect tests 2013-11-29 11:11:44 -08:00
range.c range: Do not add new blank slot with add_range_with_merge 2013-06-18 11:32:10 -05:00
rcu.h
rcupdate.c
rcutiny_plugin.h
rcutiny.c
rcutorture.c
rcutree_plugin.h
rcutree_trace.c
rcutree.c rcu: Fix deadlock with CPU hotplug, RCU GP init, and timer migration 2013-06-10 13:37:12 -07:00
rcutree.h rcu: Don't call wakeup() with rcu_node structure ->lock held 2013-06-10 13:37:11 -07:00
relay.c
res_counter.c
resource.c
rtmutex_common.h
rtmutex-debug.c
rtmutex-debug.h
rtmutex-tester.c
rtmutex.c
rtmutex.h
rwsem.c
seccomp.c
semaphore.c
signal.c
smp.c
smpboot.c
smpboot.h
softirq.c irq: Force hardirq exit's softirq processing on its own stack 2013-10-13 16:08:34 -07:00
spinlock.c
srcu.c
stacktrace.c
stop_machine.c
sys_ni.c
sys.c reboot: rigrate shutdown/reboot to boot cpu 2013-06-12 16:29:44 -07:00
sysctl_binary.c
sysctl.c
task_work.c
taskstats.c
test_kprobes.c
time.c
timeconst.bc
timer.c timer: Fix jiffies wrap behavior of round_jiffies_common() 2013-07-21 18:21:31 -07:00
tracepoint.c
tsacct.c
uid16.c
up.c
user_namespace.c userns: limit the maximum depth of user_namespace->parent chain 2013-08-11 18:35:25 -07:00
user-return-notifier.c
user.c
utsname_sysctl.c
utsname.c
wait.c
watchdog.c
workqueue_internal.h
workqueue.c workqueue: fix ordered workqueues in NUMA setups 2013-12-04 10:57:16 -08:00