linux/kernel
Nikhil Rao 1d3d2371a6 sched: Drop group_capacity to 1 only if local group has extra capacity
Commit: 75dd321d79 upstream

When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
only if the local group has extra capacity. The extra check prevents the case
where you always pull from the heaviest group when it is already under-utilized
(possible with a large weight task outweighs the tasks on the system).

For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task,
and each task is running on one core. In this case, we observe the following
events when balancing at the NUMA domain:

- find_busiest_group() will always pick the sched group containing the niced
  task to be the busiest group.
- find_busiest_queue() will then always pick one of the cpus running the
  nice0 task (never picks the cpu with the nice -15 task since
  weighted_cpuload > imbalance).
- The load balancer fails to migrate the task since it is the running task
  and increments sd->nr_balance_failed.
- It repeats the above steps a few more times until sd->nr_balance_failed > 5,
  at which point it kicks off the active load balancer, wakes up the migration
  thread and kicks the nice 0 task off the cpu.

The load balancer doesn't stop until we kick out all nice 0 tasks from
the sched group, leaving you with 3 idle cpus and one cpu running the
nice -15 task.

When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child
domain (in this case MC) has SD_PREFER_SIBLING set.  Subsequent load checks are
not relevant because the niced task has a very large weight.

In this patch, we add an extra condition to the "if(prefer_sibling)" check in
update_sd_lb_stats(). We drop the capacity of a group only if the local group
has extra capacity, ie. nr_running < group_capacity. This patch preserves the
original intent of the prefer_siblings check (to spread tasks across the system
in low utilization scenarios) and fixes the case above.

It helps in the following ways:
- In low utilization cases (where nr_tasks << nr_cpus), we still drop
  group_capacity down to 1 if we prefer siblings.
- On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most
  likely be > sgs.group_capacity.
- When balancing large weight tasks, if the local group does not have extra
  capacity, we do not pick the group with the niced task as the busiest group.
  This prevents failed balances, active migration and the under-utilization
  described above.

Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1287173550-30365-5-git-send-email-ncrao@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17 15:37:24 -08:00
..
gcov gcov: fix null-pointer dereference for certain module types 2010-09-20 13:17:53 -07:00
irq irq: Add new IRQ flag IRQF_NO_SUSPEND 2010-08-13 13:19:50 -07:00
power PM / Hibernate: Fix PM_POST_* notification with user-space suspend 2011-01-07 14:43:06 -08:00
time timekeeping: Fix clock_gettime vsyscall time warp 2010-08-13 13:20:13 -07:00
trace tracing: Fix panic when lseek() called on "trace" opened for writing 2011-01-07 14:43:10 -08:00
.gitignore
acct.c bsdacct: fix uid/gid misreporting 2009-12-18 14:03:52 -08:00
async.c async: Fix lack of boot-time console due to insufficient synchronization 2009-06-08 12:31:53 -07:00
audit_tree.c fix more leaks in audit_tree.c tag_chunk() 2010-01-18 10:19:50 -08:00
audit_watch.c Audit: reorganize struct audit_watch to save 8 bytes 2009-09-24 03:50:25 -04:00
audit.c Audit: send signal info if selinux is disabled 2009-09-24 03:50:26 -04:00
audit.h Fix rule eviction order for AUDIT_DIR 2009-06-24 00:02:38 -04:00
auditfilter.c Audit: clean up all op= output to include string quoting 2009-06-24 00:00:52 -04:00
auditsc.c Audit: rearrange audit_context to save 16 bytes per struct 2009-09-24 03:50:26 -04:00
backtracetest.c
bounds.c
capability.c sched: Remove remaining USER_SCHED code 2011-02-17 15:37:19 -08:00
cgroup_freezer.c Freezer: Fix buggy resume test for tasks frozen with cgroup freezer 2010-04-26 07:41:17 -07:00
cgroup.c cgroups: fix 2.6.32 regression causing BUG_ON() in cgroup_diput() 2010-01-18 10:19:32 -08:00
compat.c compat: Make compat_alloc_user_space() incorporate the access_ok() 2010-09-20 13:17:57 -07:00
configs.c
cpu.c sched: _cpu_down(): Don't play with current->cpus_allowed 2010-09-20 13:18:08 -07:00
cpuset.c sched: Make select_fallback_rq() cpuset friendly 2010-09-20 13:18:08 -07:00
cred.c sched: Remove remaining USER_SCHED code 2011-02-17 15:37:19 -08:00
delayacct.c headers: taskstats_kern.h trim 2009-09-18 09:48:52 -07:00
dma.c
exec_domain.c Get rid of indirect include of fs_struct.h 2009-03-31 23:00:27 -04:00
exit.c sched: Remove remaining USER_SCHED code 2011-02-17 15:37:19 -08:00
extable.c Merge branch 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-04-05 11:04:19 -07:00
fork.c sched: Fix fork vs hotplug vs cpuset namespaces 2010-09-20 13:18:02 -07:00
freezer.c sched: fix nr_uninterruptible accounting of frozen tasks really 2009-07-18 14:19:53 +02:00
futex_compat.c futex: Fix compat_futex to be same as futex for REQUEUE_PI 2009-08-10 15:41:12 +02:00
futex.c futex: Fix errors in nested key ref-counting 2010-11-22 10:47:31 -08:00
groups.c kernel/groups.c: fix integer overflow in groups_search 2010-09-20 13:17:54 -07:00
hrtimer.c hrtimer: Preserve timer state in remove_hrtimer() 2010-10-28 21:44:01 -07:00
hung_task.c sysctl: remove "struct file *" argument of ->proc_handler 2009-09-24 07:21:04 -07:00
itimer.c itimers: Add tracepoints for itimer 2009-08-29 14:10:07 +02:00
kallsyms.c kallsyms: use new arch_is_kernel_text() 2009-09-23 07:39:30 -07:00
Kconfig.freezer
Kconfig.hz
Kconfig.preempt
kexec.c kexec: fix omitting offset in extended crashkernel syntax 2009-07-29 19:10:34 -07:00
kfifo.c kfifo: Use "const" definitions 2009-09-19 13:13:17 -07:00
kgdb.c sysrq, intel_fb: fix sysrq g collision 2009-05-15 07:56:24 -05:00
kmod.c Revert "kmod: fix race in usermodehelper code" 2009-09-23 18:12:10 -07:00
kprobes.c const: constify remaining file_operations 2009-10-01 16:11:11 -07:00
ksysfs.c sched: Remove USER_SCHED 2011-02-17 15:37:19 -08:00
kthread.c cpuset: fix the problem that cpuset_mem_spread_node() returns an offline node 2010-04-01 15:58:46 -07:00
latencytop.c latencytop: fix per task accumulator 2010-12-09 13:26:51 -08:00
lockdep_internals.h lockdep: BFS cleanup 2009-07-24 10:53:29 +02:00
lockdep_proc.c seq_file: constify seq_operations 2009-09-23 07:39:29 -07:00
lockdep_states.h
lockdep.c Revert "lockdep: fix incorrect percpu usage" 2010-06-01 09:45:46 -07:00
Makefile SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
module.c dynamic debug: move ddebug_remove_module() down into free_module() 2010-08-02 10:20:47 -07:00
mutex-debug.c headers: remove sched.h from interrupt.h 2009-10-11 11:20:58 -07:00
mutex-debug.h
mutex.c mutex: Fix optimistic spinning vs. BKL 2010-07-05 11:10:31 -07:00
mutex.h
notifier.c
ns_cgroup.c cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time 2009-09-24 07:20:58 -07:00
nsproxy.c nsproxy: extract create_nsproxy() 2009-06-18 13:03:56 -07:00
panic.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-10-08 12:16:35 -07:00
params.c param: fix setting arrays of bool 2009-10-29 08:56:20 +10:30
perf_event.c Fix racy use of anon_inode_getfd() in perf_event.c 2010-07-05 11:10:30 -07:00
pid_namespace.c pidns: deny CLONE_PARENT|CLONE_NEWPID combination 2009-09-24 07:21:04 -07:00
pid.c mm: also use alloc_large_system_hash() for the PID hash table 2009-09-22 07:17:38 -07:00
pm_qos_params.c
posix-cpu-timers.c itimers: Add tracepoints for itimer 2009-08-29 14:10:07 +02:00
posix-timers.c posix_timer: Fix error path in timer_create 2010-07-05 11:10:30 -07:00
printk.c nohz: Fix printk_needs_cpu() return value on offline cpus 2011-01-07 14:43:03 -08:00
profile.c profile: fix stats and data leakage 2010-05-26 14:29:18 -07:00
ptrace.c ptrace: use safer wake up on ptrace_detach() 2011-02-17 15:37:03 -08:00
rcupdate.c rcu: Move rcu_barrier() to rcutree 2009-10-07 08:11:20 +02:00
rcutorture.c rcu: Clean up code to address Ingo's checkpatch feedback 2009-09-23 19:46:30 +02:00
rcutree_plugin.h rcu: Remove inline from forward-referenced functions 2009-12-18 14:03:04 -08:00
rcutree_trace.c rcu: Make hot-unplugged CPU relinquish its own RCU callbacks 2009-10-07 08:11:20 +02:00
rcutree.c rcu: Fix note_new_gpnum() uses of ->gpnum 2009-12-18 14:03:01 -08:00
rcutree.h rcu: Remove inline from forward-referenced functions 2009-12-18 14:03:04 -08:00
relay.c const: mark struct vm_struct_operations 2009-09-27 11:39:25 -07:00
res_counter.c memcg: some modification to softlimit under hierarchical memory reclaim. 2009-10-01 16:11:13 -07:00
resource.c walk system ram range 2009-09-23 07:39:41 -07:00
rtmutex_common.h rt_mutex: add proxy lock routines 2009-04-06 11:14:02 +02:00
rtmutex-debug.c
rtmutex-debug.h
rtmutex-tester.c
rtmutex.c rtmutex: Avoid deadlock in rt_mutex_start_proxy_lock() 2009-08-06 05:50:21 +02:00
rtmutex.h
rwsem.c
sched_clock.c sched: Fix cpu_clock() in NMIs, on !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK 2010-01-22 15:18:30 -08:00
sched_cpupri.c sched: Add new prio to cpupri before removing old prio 2009-08-02 14:26:09 +02:00
sched_cpupri.h cpumask: remove cpumask_t from core 2009-03-30 22:05:17 +10:30
sched_debug.c sched: Remove remaining USER_SCHED code 2011-02-17 15:37:19 -08:00
sched_fair.c sched: suppress RCU lockdep splat in task_fork_fair 2011-02-17 15:37:21 -08:00
sched_features.h sched: Add new wakeup preemption mode: WAKEUP_RUNNING 2009-09-17 10:17:25 +02:00
sched_idletask.c sched: Fix TASK_WAKING vs fork deadlock 2010-09-20 13:18:09 -07:00
sched_rt.c sched: Give CPU bound RT tasks preference 2011-02-17 15:37:21 -08:00
sched_stats.h sched: remove unused fields from struct rq 2009-03-24 23:16:51 +01:00
sched.c sched: Drop group_capacity to 1 only if local group has extra capacity 2011-02-17 15:37:24 -08:00
seccomp.c x86-64: seccomp: fix 32/64 syscall hole 2009-03-02 15:41:30 -08:00
semaphore.c
signal.c signals: check_kill_permission(): don't check creds if same_thread_group() 2010-07-05 11:10:56 -07:00
slow-work-debugfs.c SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
slow-work.c slow-work: use get_ref wrapper instead of directly calling get_ref 2010-08-10 10:20:45 -07:00
slow-work.h SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
smp.c kernel/smp.c: fix smp_call_function_many() SMP race 2011-02-17 15:37:07 -08:00
softirq.c softirq: add BLOCK_IOPOLL to softirq_to_name 2009-09-17 15:53:44 -04:00
softlockup.c softlockup: Stop spurious softlockup messages due to overflow 2010-04-01 15:58:47 -07:00
spinlock.c locking: Allow arch-inlined spinlocks 2009-08-31 18:08:50 +02:00
srcu.c
stacktrace.c
stop_machine.c cpumask: remove cpumask_t from core 2009-03-30 22:05:17 +10:30
sys_ni.c Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ 2009-09-24 15:13:11 -07:00
sys.c sched: Remove USER_SCHED 2011-02-17 15:37:19 -08:00
sysctl_check.c NET: fix oops at bootime in sysctl code 2010-02-09 04:51:02 -08:00
sysctl.c kernel/sysctl.c: fix stable merge error in NOMMU mmap_min_addr 2010-01-18 10:19:49 -08:00
taskstats.c genetlink: make netns aware 2009-07-12 14:03:27 -07:00
test_kprobes.c
time.c time: Prevent 32 bit overflow with set_normalized_timespec() 2009-09-15 10:17:30 +02:00
timeconst.pl
timer.c nohz: Fix get_next_timer_interrupt() vs cpu hotplug 2011-01-07 14:43:03 -08:00
tracepoint.c trivial: fix typo "to to" in multiple files 2009-09-21 15:14:55 +02:00
tsacct.c Fix fixpoint divide exception in acct_update_integrals 2009-03-09 08:13:35 -07:00
uid16.c headers: utsname.h redux 2009-09-23 18:13:10 -07:00
up.c
user_namespace.c Fix recursive lock in free_uid()/free_user_ns() 2009-02-27 16:26:21 -08:00
user.c sched: Remove remaining USER_SCHED code 2011-02-17 15:37:19 -08:00
utsname_sysctl.c sysctl: remove "struct file *" argument of ->proc_handler 2009-09-24 07:21:04 -07:00
utsname.c utsns: extract creeate_uts_ns() 2009-06-18 13:03:55 -07:00
wait.c locking, sched: Give waitqueue spinlocks their own lockdep classes 2009-08-10 14:43:09 +02:00
workqueue.c workqueue: fix race condition in schedule_on_each_cpu() 2009-11-17 17:40:33 -08:00