linux/kernel
Steven Rostedt f266611ef3 sched: Try not to migrate higher priority RT tasks
Commit: 43fa5460fe upstream

When first working on the RT scheduler design, we concentrated on
keeping all CPUs running RT tasks instead of having multiple RT
tasks on a single CPU waiting for the migration thread to move
them. Instead we take a more proactive stance and push or pull RT
tasks from one CPU to another on wakeup or scheduling.

When an RT task wakes up on a CPU that is running another RT task,
instead of preempting it and killing the cache of the running RT
task, we look to see if we can migrate the RT task that is waking
up, even if the RT task waking up is of higher priority.

This may sound a bit odd, but RT tasks should be limited in
migration by the user anyway. But in practice, people do not do
this, which causes high prio RT tasks to bounce around the CPUs.
This becomes even worse when we have priority inheritance, because
a high prio task can block on a lower prio task and boost its
priority. When the lower prio task wakes up the high prio task, if
it happens to be on the same CPU it will migrate off of it.

But in reality, the above does not happen much either, because the
wake up of the lower prio task, which has already been boosted, if
it was on the same CPU as the higher prio task, it would then
migrate off of it. But anyway, we do not want to migrate them
either.

To examine the scheduling, I created a test program and examined it
under kernelshark. The test program created CPU * 2 threads, where
each thread had a different priority. The program takes different
options. The options used in this change log was to have priority
inheritance mutexes or not.

All threads did the following loop:

static void grab_lock(long id, int iter, int l)
{
	ftrace_write("thread %ld iter %d, taking lock %d\n",
		     id, iter, l);
	pthread_mutex_lock(&locks[l]);
	ftrace_write("thread %ld iter %d, took lock %d\n",
		     id, iter, l);
	busy_loop(nr_tasks - id);
	ftrace_write("thread %ld iter %d, unlock lock %d\n",
		     id, iter, l);
	pthread_mutex_unlock(&locks[l]);
}

void *start_task(void *id)
{
	[...]
	while (!done) {
		for (l = 0; l < nr_locks; l++) {
			grab_lock(id, i, l);
			ftrace_write("thread %ld iter %d sleeping\n",
				     id, i);
			ms_sleep(id);
		}
		i++;
	}
	[...]
}

The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The
ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes
to the ftrace buffer to help analyze via ftrace.

The higher the id, the higher the prio, the shorter it does the
busy loop, but the longer it spins. This is usually the case with
RT tasks, the lower priority tasks usually run longer than higher
priority tasks.

At the end of the test, it records the number of loops each thread
took, as well as the number of voluntary preemptions, non-voluntary
preemptions, and number of migrations each thread took, taking the
information from /proc/$$/sched and /proc/$$/status.

Running this on a 4 CPU processor, the results without changes to
the kernel looked like this:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:         53      3220       1470             98
  1:        562       773        724             98
  2:        752       933       1375             98
  3:        749        39        697             98
  4:        758         5        515             98
  5:        764         2        679             99
  6:        761         2        535             99
  7:        757         3        346             99

total:     5156       4977      6341            787

Each thread regardless of priority migrated a few hundred times.
The higher priority tasks, were a little better but still took
quite an impact.

By letting higher priority tasks bump the lower prio task from the
CPU, things changed a bit:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:         37      2835       1937             98
  1:        666      1821       1865             98
  2:        654      1003       1385             98
  3:        664       635        973             99
  4:        698       197        352             99
  5:        703       101        159             99
  6:        708         1         75             99
  7:        713         1          2             99

total:     4843       6594      6748            789

The total # of migrations did not change (several runs showed the
difference all within the noise). But we now see a dramatic
improvement to the higher priority tasks. (kernelshark showed that
the watchdog timer bumped the highest priority task to give it the
2 count. This was actually consistent with every run).

Notice that the # of iterations did not change either.

The above was with priority inheritance mutexes. That is, when the
higher prority task blocked on a lower priority task, the lower
priority task would inherit the higher priority task (which shows
why task 6 was bumped so many times). When not using priority
inheritance mutexes, the current kernel shows this:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:         56      3101       1892             95
  1:        594       713        937             95
  2:        625       188        618             95
  3:        628         4        491             96
  4:        640         7        468             96
  5:        631         2        501             96
  6:        641         1        466             96
  7:        643         2        497             96

total:     4458       4018      5870            765

Not much changed with or without priority inheritance mutexes. But
if we let the high priority task bump lower priority tasks on
wakeup we see:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:        115      3439       2782             98
  1:        633      1354       1583             99
  2:        652       919       1218             99
  3:        645       713        934             99
  4:        690         3          3             99
  5:        694         1          4             99
  6:        720         3          4             99
  7:        747         0          1            100

Which shows a even bigger change. The big difference between task 3
and task 4 is because we have only 4 CPUs on the machine, causing
the 4 highest prio tasks to always have preference.

Although I did not measure cache misses, and I'm sure there would
be little to measure since the test was not data intensive, I could
imagine large improvements for higher priority tasks when dealing
with lower priority tasks. Thus, I'm satisfied with making the
change and agreeing with what Gregory Haskins argued a few years
ago when we first had this discussion.

One final note. All tasks in the above tests were RT tasks. Any RT
task will always preempt a non RT task that is running on the CPU
the RT task wants to run on.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Gregory Haskins <ghaskins@novell.com>
LKML-Reference: <20100921024138.605460343@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17 15:37:20 -08:00
..
gcov gcov: fix null-pointer dereference for certain module types 2010-09-20 13:17:53 -07:00
irq irq: Add new IRQ flag IRQF_NO_SUSPEND 2010-08-13 13:19:50 -07:00
power PM / Hibernate: Fix PM_POST_* notification with user-space suspend 2011-01-07 14:43:06 -08:00
time timekeeping: Fix clock_gettime vsyscall time warp 2010-08-13 13:20:13 -07:00
trace tracing: Fix panic when lseek() called on "trace" opened for writing 2011-01-07 14:43:10 -08:00
.gitignore
acct.c bsdacct: fix uid/gid misreporting 2009-12-18 14:03:52 -08:00
async.c
audit_tree.c fix more leaks in audit_tree.c tag_chunk() 2010-01-18 10:19:50 -08:00
audit_watch.c Audit: reorganize struct audit_watch to save 8 bytes 2009-09-24 03:50:25 -04:00
audit.c Audit: send signal info if selinux is disabled 2009-09-24 03:50:26 -04:00
audit.h
auditfilter.c
auditsc.c Audit: rearrange audit_context to save 16 bytes per struct 2009-09-24 03:50:26 -04:00
backtracetest.c
bounds.c
capability.c sched: Remove remaining USER_SCHED code 2011-02-17 15:37:19 -08:00
cgroup_freezer.c Freezer: Fix buggy resume test for tasks frozen with cgroup freezer 2010-04-26 07:41:17 -07:00
cgroup.c cgroups: fix 2.6.32 regression causing BUG_ON() in cgroup_diput() 2010-01-18 10:19:32 -08:00
compat.c compat: Make compat_alloc_user_space() incorporate the access_ok() 2010-09-20 13:17:57 -07:00
configs.c
cpu.c sched: _cpu_down(): Don't play with current->cpus_allowed 2010-09-20 13:18:08 -07:00
cpuset.c sched: Make select_fallback_rq() cpuset friendly 2010-09-20 13:18:08 -07:00
cred.c sched: Remove remaining USER_SCHED code 2011-02-17 15:37:19 -08:00
delayacct.c headers: taskstats_kern.h trim 2009-09-18 09:48:52 -07:00
dma.c
exec_domain.c
exit.c sched: Remove remaining USER_SCHED code 2011-02-17 15:37:19 -08:00
extable.c
fork.c sched: Fix fork vs hotplug vs cpuset namespaces 2010-09-20 13:18:02 -07:00
freezer.c sched: fix nr_uninterruptible accounting of frozen tasks really 2009-07-18 14:19:53 +02:00
futex_compat.c futex: Fix compat_futex to be same as futex for REQUEUE_PI 2009-08-10 15:41:12 +02:00
futex.c futex: Fix errors in nested key ref-counting 2010-11-22 10:47:31 -08:00
groups.c kernel/groups.c: fix integer overflow in groups_search 2010-09-20 13:17:54 -07:00
hrtimer.c hrtimer: Preserve timer state in remove_hrtimer() 2010-10-28 21:44:01 -07:00
hung_task.c sysctl: remove "struct file *" argument of ->proc_handler 2009-09-24 07:21:04 -07:00
itimer.c itimers: Add tracepoints for itimer 2009-08-29 14:10:07 +02:00
kallsyms.c kallsyms: use new arch_is_kernel_text() 2009-09-23 07:39:30 -07:00
Kconfig.freezer
Kconfig.hz
Kconfig.preempt
kexec.c kexec: fix omitting offset in extended crashkernel syntax 2009-07-29 19:10:34 -07:00
kfifo.c kfifo: Use "const" definitions 2009-09-19 13:13:17 -07:00
kgdb.c
kmod.c Revert "kmod: fix race in usermodehelper code" 2009-09-23 18:12:10 -07:00
kprobes.c const: constify remaining file_operations 2009-10-01 16:11:11 -07:00
ksysfs.c sched: Remove USER_SCHED 2011-02-17 15:37:19 -08:00
kthread.c cpuset: fix the problem that cpuset_mem_spread_node() returns an offline node 2010-04-01 15:58:46 -07:00
latencytop.c latencytop: fix per task accumulator 2010-12-09 13:26:51 -08:00
lockdep_internals.h lockdep: BFS cleanup 2009-07-24 10:53:29 +02:00
lockdep_proc.c seq_file: constify seq_operations 2009-09-23 07:39:29 -07:00
lockdep_states.h
lockdep.c Revert "lockdep: fix incorrect percpu usage" 2010-06-01 09:45:46 -07:00
Makefile SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
module.c dynamic debug: move ddebug_remove_module() down into free_module() 2010-08-02 10:20:47 -07:00
mutex-debug.c headers: remove sched.h from interrupt.h 2009-10-11 11:20:58 -07:00
mutex-debug.h
mutex.c mutex: Fix optimistic spinning vs. BKL 2010-07-05 11:10:31 -07:00
mutex.h
notifier.c
ns_cgroup.c cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time 2009-09-24 07:20:58 -07:00
nsproxy.c
panic.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-10-08 12:16:35 -07:00
params.c param: fix setting arrays of bool 2009-10-29 08:56:20 +10:30
perf_event.c Fix racy use of anon_inode_getfd() in perf_event.c 2010-07-05 11:10:30 -07:00
pid_namespace.c pidns: deny CLONE_PARENT|CLONE_NEWPID combination 2009-09-24 07:21:04 -07:00
pid.c mm: also use alloc_large_system_hash() for the PID hash table 2009-09-22 07:17:38 -07:00
pm_qos_params.c
posix-cpu-timers.c itimers: Add tracepoints for itimer 2009-08-29 14:10:07 +02:00
posix-timers.c posix_timer: Fix error path in timer_create 2010-07-05 11:10:30 -07:00
printk.c nohz: Fix printk_needs_cpu() return value on offline cpus 2011-01-07 14:43:03 -08:00
profile.c profile: fix stats and data leakage 2010-05-26 14:29:18 -07:00
ptrace.c ptrace: use safer wake up on ptrace_detach() 2011-02-17 15:37:03 -08:00
rcupdate.c rcu: Move rcu_barrier() to rcutree 2009-10-07 08:11:20 +02:00
rcutorture.c rcu: Clean up code to address Ingo's checkpatch feedback 2009-09-23 19:46:30 +02:00
rcutree_plugin.h rcu: Remove inline from forward-referenced functions 2009-12-18 14:03:04 -08:00
rcutree_trace.c rcu: Make hot-unplugged CPU relinquish its own RCU callbacks 2009-10-07 08:11:20 +02:00
rcutree.c rcu: Fix note_new_gpnum() uses of ->gpnum 2009-12-18 14:03:01 -08:00
rcutree.h rcu: Remove inline from forward-referenced functions 2009-12-18 14:03:04 -08:00
relay.c const: mark struct vm_struct_operations 2009-09-27 11:39:25 -07:00
res_counter.c memcg: some modification to softlimit under hierarchical memory reclaim. 2009-10-01 16:11:13 -07:00
resource.c walk system ram range 2009-09-23 07:39:41 -07:00
rtmutex_common.h
rtmutex-debug.c
rtmutex-debug.h
rtmutex-tester.c
rtmutex.c rtmutex: Avoid deadlock in rt_mutex_start_proxy_lock() 2009-08-06 05:50:21 +02:00
rtmutex.h
rwsem.c
sched_clock.c sched: Fix cpu_clock() in NMIs, on !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK 2010-01-22 15:18:30 -08:00
sched_cpupri.c sched: Add new prio to cpupri before removing old prio 2009-08-02 14:26:09 +02:00
sched_cpupri.h
sched_debug.c sched: Remove remaining USER_SCHED code 2011-02-17 15:37:19 -08:00
sched_fair.c sched: Fix select_idle_sibling() logic in select_task_rq_fair() 2010-09-20 13:18:12 -07:00
sched_features.h sched: Add new wakeup preemption mode: WAKEUP_RUNNING 2009-09-17 10:17:25 +02:00
sched_idletask.c sched: Fix TASK_WAKING vs fork deadlock 2010-09-20 13:18:09 -07:00
sched_rt.c sched: Try not to migrate higher priority RT tasks 2011-02-17 15:37:20 -08:00
sched_stats.h
sched.c sched: Increment cache_nice_tries only on periodic lb 2011-02-17 15:37:20 -08:00
seccomp.c
semaphore.c
signal.c signals: check_kill_permission(): don't check creds if same_thread_group() 2010-07-05 11:10:56 -07:00
slow-work-debugfs.c SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
slow-work.c slow-work: use get_ref wrapper instead of directly calling get_ref 2010-08-10 10:20:45 -07:00
slow-work.h SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
smp.c kernel/smp.c: fix smp_call_function_many() SMP race 2011-02-17 15:37:07 -08:00
softirq.c softirq: add BLOCK_IOPOLL to softirq_to_name 2009-09-17 15:53:44 -04:00
softlockup.c softlockup: Stop spurious softlockup messages due to overflow 2010-04-01 15:58:47 -07:00
spinlock.c locking: Allow arch-inlined spinlocks 2009-08-31 18:08:50 +02:00
srcu.c
stacktrace.c
stop_machine.c
sys_ni.c Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ 2009-09-24 15:13:11 -07:00
sys.c sched: Remove USER_SCHED 2011-02-17 15:37:19 -08:00
sysctl_check.c NET: fix oops at bootime in sysctl code 2010-02-09 04:51:02 -08:00
sysctl.c kernel/sysctl.c: fix stable merge error in NOMMU mmap_min_addr 2010-01-18 10:19:49 -08:00
taskstats.c
test_kprobes.c
time.c time: Prevent 32 bit overflow with set_normalized_timespec() 2009-09-15 10:17:30 +02:00
timeconst.pl
timer.c nohz: Fix get_next_timer_interrupt() vs cpu hotplug 2011-01-07 14:43:03 -08:00
tracepoint.c trivial: fix typo "to to" in multiple files 2009-09-21 15:14:55 +02:00
tsacct.c
uid16.c headers: utsname.h redux 2009-09-23 18:13:10 -07:00
up.c
user_namespace.c
user.c sched: Remove remaining USER_SCHED code 2011-02-17 15:37:19 -08:00
utsname_sysctl.c sysctl: remove "struct file *" argument of ->proc_handler 2009-09-24 07:21:04 -07:00
utsname.c
wait.c locking, sched: Give waitqueue spinlocks their own lockdep classes 2009-08-10 14:43:09 +02:00
workqueue.c workqueue: fix race condition in schedule_on_each_cpu() 2009-11-17 17:40:33 -08:00