Commit Graph

15851 Commits

Author SHA1 Message Date
Mark Brown
c04ee7fcbf This is the 3.10.2 stable release
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.20 (GNU/Linux)
 
 iQIcBAABAgAGBQJR7ImhAAoJEDjbvchgkmk+jKoQAI2jmoRbE7e3lqoxsVvIAbmU
 HdzcFP708XMNdylVygXtGVcOXLME4raaH/bSOa+CwXL45fjRLqSASpKcbIrl3AOq
 XVmiHwYkpXvclFUTgCYqL8yFpv4zLkRnE5mh1ctvi74/cjKZ8IXy+3iCCNoYfTO8
 bmN30Not6+BNbPjEzjj9lcqrqckMyIvg9JBQtz0/Rfp0AykR19cPGVePLhDASQvq
 OP3KfXjWz1UHQLXHuzanlm7wiTI5tqyT96Bg13Mata/JTxSoLQXIHqn7pdV6UKto
 6Ae7H3TMAzkabRwIwH22ZmGwnNLLrbgkYHgbWv7tSOWQGfN7fhqXE5C0/GnPgcgA
 2kfmibRSHZUbrq4BPGUMVd/GJBD9Z2NXZndOXJ3JJFrsznqmcq99BlVsjJvLbrQG
 R5e4hmuddks6k4nfRhNk9QhL7QQeHgooV/Z9/Z+kYbq2bZSe27MNybymJAhBR+eh
 lLqS0nHfYWuScsg7VG3ELQtwJSQ0SXna1I9tHrUPjMoz9J3Nm3yUagxtPP5aTF4k
 rUCtE01IvpiHpJaaq5R3i5lYtzzKmPO6whMdfrsUprNV1AeyOlXu6GGKeYL8ZHPj
 Wiq6v9u5osduhPQtls8oJBGb7WAahasRZyQCX3wqnUgLoADP5PRccehFWjCqsf4u
 SeQ3CufmojuVLQMn8pcP
 =X9Ud
 -----END PGP SIGNATURE-----

Merge tag 'v3.10.2' into linux-linaro-lsk

This is the 3.10.2 stable release
2013-07-22 11:16:31 +01:00
Bart Van Assche
d2f5933432 timer: Fix jiffies wrap behavior of round_jiffies_common()
commit 9e04d3804d upstream.

Direct compare of jiffies related values does not work in the wrap
around case. Replace it with time_is_after_jiffies().

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/519BC066.5080600@acm.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-21 18:21:31 -07:00
Ben Hutchings
3dc8601b54 genirq: Fix can_request_irq() for IRQs without an action
commit 2779db8d37 upstream.

Commit 02725e7471 ('genirq: Use irq_get/put functions'),
inadvertently changed can_request_irq() to return 0 for IRQs that have
no action.  This causes pcibios_lookup_irq() to select only IRQs that
already have an action with IRQF_SHARED set, or to fail if there are
none.  Change can_request_irq() to return 1 for IRQs that have no
action (if the first two conditions are met).

Reported-by: Bjarni Ingi Gislason <bjarniig@rhi.hi.is>
Tested-by: Bjarni Ingi Gislason <bjarniig@rhi.hi.is> (against 3.2)
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: 709647@bugs.debian.org
Link: http://bugs.debian.org/709647
Link: http://lkml.kernel.org/r/1372383630.23847.40.camel@deadeye.wl.decadent.org.uk
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-21 18:21:27 -07:00
Li Zefan
83d0eb7975 cgroup: fix umount vs cgroup_event_remove() race
commit 1c8158eeae upstream.

 commit 5db9a4d99b
 Author: Tejun Heo <tj@kernel.org>
 Date:   Sat Jul 7 16:08:18 2012 -0700

     cgroup: fix cgroup hierarchy umount race

This commit fixed a race caused by the dput() in css_dput_fn(), but
the dput() in cgroup_event_remove() can also lead to the same BUG().

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-21 18:21:25 -07:00
Mark Brown
0a37d59e3c This is the 3.10.1 stable release
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.20 (GNU/Linux)
 
 iQIcBAABAgAGBQJR4Z+/AAoJEDjbvchgkmk+GzkP/RD9oh04MLqPvlpMk480QYwj
 FXFsRr3Gm6FADM8/jGkzrG4G5V0X1paFTjDOptCsCup7TVxm6wuVh6ab8BeXjPOR
 YpjjWffqBzd1RPchmU2zjHWh94wVWleirAieYpwS5w8QtCPYrACnoIlG4Jo8gs7U
 PkjwQT0e+dmbVJbLMNSdzWFX6iGPHwjmqbIdzR4ECjGd2I8AWxLCookEd5aa8qU7
 g+QemXmhMjqqntsq94R+4faarKMSZfWcUrEzL+NQQvtHIHIOgqBaOmqGJoBbgnY+
 ZXpUvxrTsS+oO8shoVFHpjsUSa9BZ62P3tunu3ASq7ojKkiwfDg28LTXjDu1IYoN
 iEDCq8SHfUIhkb9PkBfeYGIh9v1rVqNq7JW/yhmUrIeXKktRYR43AxT9+QcwMBlp
 xf/9yA1H2HImOkpYOMnCmrk+nkCAXBlPlVa1DbW755EZxqqRrIzrNhe5AO8a0gJO
 rnXz/C+79oAKiRqgg9k6Z5HvRTADXOQ2WWck30yaHVXABjTV3aMQqekIlKhvCpeG
 PbEEg+hC4WsULCdjR6ngc2FMCzO08RhOzs/aV85zwSZ7ABKeGZvgGSoc1w7uGpji
 SK0DBd6cQbG/4J3AgfQnbnjQxjtUezo+P3MKI+CP5EsvTK1jYgITIMmPIi+kUUHq
 Y/e2A9Y11VcRv+1p0VWJ
 =1u6w
 -----END PGP SIGNATURE-----

Merge tag 'v3.10.1' into linux-linaro-lsk

This is the 3.10.1 stable release
2013-07-19 10:30:43 +01:00
Mark Brown
38b5268356 Merge remote-tracking branch 'lsk/v3.10/topic/big.LITTLE' into linux-linaro-lsk 2013-07-18 16:42:46 +01:00
Viresh Kumar
7399c448c5 workqueue: Add system wide power_efficient workqueues
This patch adds system wide workqueues aligned towards power saving. This is
done by allocating them with WQ_UNBOUND flag if 'wq_power_efficient' is set to
'true'.

tj: updated comments a bit.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 0668106ca3)
Signed-off-by: Mark Brown <broonie@linaro.org>
2013-07-18 14:30:02 +01:00
Viresh Kumar
61a10f9763 workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueues
Workqueues can be performance or power-oriented. Currently, most workqueues are
bound to the CPU they were created on. This gives good performance (due to cache
effects) at the cost of potentially waking up otherwise idle cores (Idle from
scheduler's perspective. Which may or may not be physically idle) just to
process some work. To save power, we can allow the work to be rescheduled on a
core that is already awake.

Workqueues created with the WQ_UNBOUND flag will allow some power savings.
However, we don't change the default behaviour of the system.  To enable
power-saving behaviour, a new config option CONFIG_WQ_POWER_EFFICIENT needs to
be turned on. This option can also be overridden by the
workqueue.power_efficient boot parameter.

tj: Updated config description and comments.  Renamed
    CONFIG_WQ_POWER_EFFICIENT to CONFIG_WQ_POWER_EFFICIENT_DEFAULT.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit cee22a1505)
Signed-off-by: Mark Brown <broonie@linaro.org>
2013-07-18 14:29:31 +01:00
Jon Medhurst
a0e1bccdf9 Merge branches 'master-arm-multi_pmu_v2', 'master-config-fragments', 'master-hw-bkpt-fix', 'master-misc-patches' and 'master-task-placement-v2-updates' into big-LITTLE-MP-master-v19
Updates:
 -------
 - Rebased over 3.10 final
 - Differences from big-LITTLE-MP-master-v18
   - New Patches:
     - master-config-fragments: 1 new patch
       - "config: Disable priority filtering for HMP Scheduler"
     - master-misc-patches: 1 new patch
       - "mm: make vmstat_update periodic run conditional"
   - New Branches:
     - master-task-placement-v2-updates: 7 patches
       New patches from ARM added in a new topic branch stacked on top
       of master-task-placement-v2-sysfs...
       - Revert "sched: Enable HMP priority filter by default"
       - "HMP: Use unweighted load for hmp migration decisions"
       - "HMP: Select least-loaded CPU when performing HMP Migrations"
       - "HMP: Avoid multiple calls to hmp_domain_min_load in fast path"
       - "HMP: Force new non-kernel tasks onto big CPUs until load stabilises"
       - "sched: Restrict nohz balance kicks to stay in the HMP domain"
       - "HMP: experimental: Force all rt tasks to start on little domain."

 Commands used for merge:
 -----------------------
 $ git checkout -b big-LITTLE-MP-master-v19 v3.10
 $ git merge master-arm-multi_pmu_v2 master-config-fragments \
     master-hw-bkpt-fix master-misc-patches master-task-placement-v2 \
     master-task-placement-v2-sysfs master-task-placement-v2-updates
2013-07-18 11:49:27 +01:00
Dietmar Eggemann
4ab2679351 HMP: experimental: Force all rt tasks to start on little domain.
This patch restricts the allowed cpu mask for rt tasks initially started
with a full cpu mask to the little domain.

An rt task is specified as real time in __setscheduler() which is finally
called for all rt tasks (kernel and user land). In this function we
restrict the allowed cpu mask to the little domain.

This also prevents that a rt tasks can later be pushed to the big domain
because the function find_lowest_rq() will only recognize the allowed cpu
mask of a task to find the new cpu the task runs on.

Current kludges of the patch:

* Since we do not have an API to get the cpu mask of the A7 cluster,
hmp_slow_cpu_mask is made global in arm/kernel/topology.c for now.

* The watchdog_enable() function calls sched_setscheduler() before
kthread_bind() for the cpu specific watchdog kernel threads. The order of
these two calls has to be changed to make this patch work.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2013-07-17 11:32:41 +01:00
Chris Redpath
6eada00873 sched: Restrict nohz balance kicks to stay in the HMP domain
There is little point in doing a nohz balance kick on a CPU from a
different HMP domain, since the unset SD_LOAD_BALANCE flag on the CPU
domain level prevents tasks from being balanced across clusters
except through the per-task load driven hmp_migrate/hmp_offload paths.

Further, the nohz balance kick is actively harmful to power usage if
all the tasks fit into the little domain since it causes the big
domain to wake up and do a lot of calculation to determine that
there is nothing to do.

A more generic solution is to walk the sched domain tree and determine
the intersection of potential idle balance cpus with visibility of
tasks on the current CPU, however HMP domains are more easily
accessible.

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2013-07-17 11:32:30 +01:00
Chris Redpath
954978dd2c HMP: Force new non-kernel tasks onto big CPUs until load stabilises
Initialise the load stats for new tasks so that they do not
see the instability in early task life which makes it so hard to
decide which CPU is appropriate.

Also, change the fork balance algorithm so that the least loaded of
the CPUs in the big cluster is chosen regardless of the bigness of
the parent task.

This is intended to help performance for applications which use
many short-lived tasks. Although best practise is usually to use
a thread pool, apps which do not do this should not be subject to
the randomness of the early stats.

We should ignore real-time threads for forking on big CPUs, but
it is not possible to figure out if a new thread is real-time or
not at the fork stage. Instead, we prevent kernel threads from
getting the initial boost - when they later become real-time they
will only be on big if their compute requirements demand it.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2013-07-17 11:32:30 +01:00
Chris Redpath
3f3b210703 HMP: Avoid multiple calls to hmp_domain_min_load in fast path
When evaluating a migration we make two calls to hmp_domain_min_load.
This is unnecessary if we pass on the target CPU information from the
hmp_up_migration path.

In hmp_down_migration, we don't consider the load of the target CPUS.

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2013-07-17 11:32:29 +01:00
Chris Redpath
08d7db89a2 HMP: Select least-loaded CPU when performing HMP Migrations
The reference patch set always selects the first CPU in an HMP
domain as a migration target. In busy situations, this means that
the migrated thread cannot make immediate use of an idle CPU but
must share a busy one until the load balancer runs across the big
domain.

This patch uses the hmp_domain_min_load function introduced in
global balancing to figure out which of the CPUs is the least busy
and selects that as a migration target - in both directions.

This essentially implements a task-spread strategy and is intended
to maximise performance of migrated threads but is likely
to use more power than the packing strategy previously employed.

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2013-07-17 11:32:29 +01:00
Chris Redpath
ede58a69a3 HMP: Use unweighted load for hmp migration decisions
Normal task and runqueue loading is scaled according to priority
to end up with a weighted load, known as the contribution.

We want the CPU time to be allotted according to priority, but
we also want to make big/little decisions based upon raw load.

It is common, for example, for Android apps following the dev
guide to end up with all their long-running or async action
threads as low priority unless they override the AsyncThread
constructor. All these threads are such low priority that they
become invisible to the hmp_offload routine.

Using unweighted load here allows us to maximise CPU usage in busy
situations.

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2013-07-17 11:32:28 +01:00
Chris Redpath
7e64466300 sched: cfs.nr_running does not contain the intended metric
rq->nr_running is the actual number of runnable tasks we wish to use
to determine if a task is alone on a CPU.

Change-Id: Icaf3022e02924ecdc94e14d4146c6fadd9580e2b
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2013-07-17 11:12:27 +01:00
Morten Rasmussen
cf71912f48 sched: Basic global balancing support for HMP
This patch introduces an extra-check at task up-migration to
prevent overloading the cpus in the faster hmp_domain while the
slower hmp_domain is not fully utilized. The patch also introduces
a periodic balance check that can down-migrate tasks if the faster
domain is oversubscribed and the slower is under-utilized.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2013-07-17 11:12:27 +01:00
Chris Redpath
ae570aeb1d ARM: Fix build breakage when big.LITTLE.conf is not used.
Change-Id: I8641f5e930c65b5672130bd4a18d9868bb3ca594
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Liviu Dudau <liviu.dudau@arm.com>
2013-07-17 11:12:27 +01:00
Chris Redpath
71b5dbd6d5 ARM: Experimental Frequency-Invariant Load Scaling Patch
Evaluation Patch to investigate using load as a representation of the
amount of POTENTIAL cpu compute capacity used rather than a representation
of the CURRENT cpu compute capacity.

If CPUFreq is enabled, scales load in accordance with frequency.

Powersave/performance CPUFreq governors are detected and scaling is
disabled while these governors are in use. This is because when a
single-frequency governor is in use, potential CPU capacity is static.

So long as the governors and CPUFreq subsystem correctly report the
frequencies available, the scaling should self tune.

Adds an additional file to sysfs to allow this feature to be disabled
for experimentation.

/sys/kernel/hmp/frequency_invariant_load_scale

write 0 to disable, 1 to enable.

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2013-07-17 11:12:27 +01:00
Olivier Cozette
0e48eed05c ARM: Change load tracking scale using sysfs
These functions allow to change the load average period used
in the task load average computation through
/sys/kernel/hmp/load_avg_period_ms. This period is the time
in ms to go from 0 to 0.5 load average while running or the
time from 1 to 0.5 while sleeping.

The default one used is 32 and gives the same load_avg_ratio
computation than without this patch. These functions also allow
to change the up and down threshold of HMP using
/sys/kernel/hmp/{up,down}_threshold. Both must be between 0 and
1024. The thresholds are divided by 1024 before being compared
to the load_avg_ratio.

If /sys/kernel/hmp/load_avg_period_ms is 128 and
/sys/kernel/hmp/up_threshold is 512, a task will be migrated
to a bigger cluster after running for 128ms. Because after
load_avg_period_ms the load average is 0.5 and real up_threshold
us 512 / 1024 = 0.5.

Signed-off-by: Olivier Cozette <olivier.cozette@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2013-07-17 11:12:27 +01:00
Chris Redpath
b64cc6f7e5 sched: Ignore offline CPUs in HMP migration & load stats
Previously, an offline CPU would always appear to have a zero load
and this would distort the offload functionality used for balancing
big and little domains.

Maintain a mask of online CPUs in each domain and use this instead.

Change-Id: I639b564b2f40cb659af8ceb8bd37f84b8a1fe323
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2013-07-17 11:12:26 +01:00
Chris Redpath
d2c920023c sched: Do not ignore grouped tasks during HMP forced migration.
If the entity is not a task, it is a cfs group rq. Iterate up to
find the task entity.

Change-Id: I7cab7aba0798f6f14e38ad32e566d90e5937ffbc
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2013-07-17 11:12:26 +01:00
Morten Rasmussen
eeebbf595c sched: Only down migrate low priority tasks if allowed by affinity mask
Adds an extra check intersection of the task affinity mask and the slower
hmp_domain cpumask before down migrating low priority tasks.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2013-07-17 11:12:26 +01:00
Morten Rasmussen
76525733b4 sched: SCHED_HMP multi-domain task migration control
We need a way to prevent tasks that are migrating up and down the
hmp_domains from migrating straight on through before the load has
adapted to the new compute capacity of the CPU on the new hmp_domain.
This patch adds a next up/down migration delay that prevents the task
from doing another migration in the same direction until the delay
has expired.

Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
2013-07-17 11:12:25 +01:00
Morten Rasmussen
0d811e649a sched: Add HMP task migration ftrace event
Adds ftrace event for tracing task migrations using HMP
optimized scheduling.

Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
2013-07-17 11:12:25 +01:00
Morten Rasmussen
b9d3d56128 sched: Add ftrace events for entity load-tracking
Adds ftrace events for key variables related to the entity
load-tracking to help debugging scheduler behaviour. Allows tracing
of load contribution and runqueue residency ratio for both entities
and runqueues as well as entity CPU usage ratio.

Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
2013-07-17 11:12:25 +01:00
Morten Rasmussen
943106d943 sched: Introduce priority-based task migration filter
Introduces a priority threshold which prevents low priority task
from migrating to faster hmp_domains (cpus). This is useful for
user-space software which assigns lower task priority to background
task.

Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
2013-07-17 11:12:24 +01:00
Morten Rasmussen
2dd22b22c9 sched: Forced task migration on heterogeneous systems
This patch introduces forced task migration for moving suitable
currently running tasks between hmp_domains. Task behaviour is likely
to change over time. Tasks running in a less capable hmp_domain may
change to become more demanding and should therefore be migrated up.
They are unlikely go through the select_task_rq_fair() path anytime
soon and therefore need special attention.

This patch introduces a period check (SCHED_TICK) of the currently
running task on all runqueues and sets up a forced migration using
stop_machine_no_wait() if the task needs to be migrated.

Ideally, this should not be implemented by polling all runqueues.

Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
2013-07-17 11:12:24 +01:00
Morten Rasmussen
798e82cab1 sched: Task placement for heterogeneous systems based on task load-tracking
This patch introduces the basic SCHED_HMP infrastructure. Each class of
cpus is represented by a hmp_domain and tasks will only be moved between
these domains when their load profiles suggest it is beneficial.

SCHED_HMP relies heavily on the task load-tracking introduced in Paul
Turners fair group scheduling patch set:

<https://lkml.org/lkml/2012/8/23/267>

SCHED_HMP requires that the platform implements arch_get_hmp_domains()
which should set up the platform specific list of hmp_domains. It is
also assumed that the platform disables SD_LOAD_BALANCE for the
appropriate sched_domains.
Tasks placement takes place every time a task is to be inserted into
a runqueue based on its load history. The task placement decision is
based on load thresholds.

There are no restrictions on the number of hmp_domains, however,
multiple (>2) has not been tested and the up/down migration policy is
rather simple.

Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
2013-07-17 11:12:24 +01:00
Morten Rasmussen
be6ef1d56e sched: entity load-tracking load_avg_ratio
This patch adds load_avg_ratio to each task. The load_avg_ratio is a
variant of load_avg_contrib which is not scaled by the task priority. It
is calculated like this:

runnable_avg_sum * NICE_0_LOAD / (runnable_avg_period + 1).

Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
2013-07-17 11:12:24 +01:00
Paul Turner
0841c6ae0b sched: implement usage tracking
With the frame-work for runnable tracking now fully in place.  Per-entity usage
tracking is a simple and low-overhead addition.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
2013-07-17 11:12:23 +01:00
Thomas Gleixner
1dfb365825 genirq: Add default affinity mask command line option
If we isolate CPUs, then we don't want random device interrupts on
them. Even w/o the user space irq balancer enabled we can end up with
irqs on non boot cpus.

Allow to restrict the default irq affinity mask.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-07-17 11:12:18 +01:00
Zhang Yi
ab1842f10b futex: Take hugepages into account when generating futex_key
commit 13d60f4b6a upstream.

The futex_keys of process shared futexes are generated from the page
offset, the mapping host and the mapping index of the futex user space
address. This should result in an unique identifier for each futex.

Though this is not true when futexes are located in different subpages
of an hugepage. The reason is, that the mapping index for all those
futexes evaluates to the index of the base page of the hugetlbfs
mapping. So a futex at offset 0 of the hugepage mapping and another
one at offset PAGE_SIZE of the same hugepage mapping have identical
futex_keys. This happens because the futex code blindly uses
page->index.

Steps to reproduce the bug:

1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0
   and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs
   mapping.

   The mutexes must be initialized as PTHREAD_PROCESS_SHARED because
   PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as
   their keys solely depend on the user space address.

2. Lock mutex1 and mutex2

3. Create thread1 and in the thread function lock mutex1, which
   results in thread1 blocking on the locked mutex1.

4. Create thread2 and in the thread function lock mutex2, which
   results in thread2 blocking on the locked mutex2.

5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2
   still blocks on mutex2 because the futex_key points to mutex1.

To solve this issue we need to take the normal page index of the page
which contains the futex into account, if the futex is in an hugetlbfs
mapping. In other words, we calculate the normal page mapping index of
the subpage in the hugetlbfs mapping.

Mappings which are not based on hugetlbfs are not affected and still
use page->index.

Thanks to Mel Gorman who provided a patch for adding proper evaluation
functions to the hugetlbfs code to avoid exposing hugetlbfs specific
details to the futex code.

[ tglx: Massaged changelog ]

Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Tested-by: Ma Chenggong <ma.chenggong@zte.com.cn>
Reviewed-by: 'Mel Gorman' <mgorman@suse.de>
Acked-by: 'Darren Hart' <dvhart@linux.intel.com>
Cc: 'Peter Zijlstra' <peterz@infradead.org>
Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-13 11:42:26 -07:00
Rusty Russell
0b72ac9665 module: do percpu allocation after uniqueness check. No, really!
commit 8d8022e8ab upstream.

v3.8-rc1-5-g1fb9341 was supposed to stop parallel kvm loads exhausting
percpu memory on large machines:

    Now we have a new state MODULE_STATE_UNFORMED, we can insert the
    module into the list (and thus guarantee its uniqueness) before we
    allocate the per-cpu region.

In my defence, it didn't actually say the patch did this.  Just that
we "can".

This patch actually *does* it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Tested-by: Jim Hull <jim.hull@hp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-13 11:42:26 -07:00
Mathieu Desnoyers
706b23bde2 Fix: kernel/ptrace.c: ptrace_peek_siginfo() missing __put_user() validation
This __put_user() could be used by unprivileged processes to write into
kernel memory.  The issue here is that even if copy_siginfo_to_user()
fails, the error code is not checked before __put_user() is executed.

Luckily, ptrace_peek_siginfo() has been added within the 3.10-rc cycle,
so it has not hit a stable release yet.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-29 11:29:08 -07:00
Linus Torvalds
a75930c633 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
 "Correct an ordering issue in the tick broadcast code.  I really wish
  we'd get compensation for pain and suffering for each line of code we
  write to work around dysfunctional timer hardware."

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick: Fix tick_broadcast_pending_mask not cleared
2013-06-29 10:27:19 -07:00
Linus Torvalds
54faf77d06 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Three small fixlets"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hw_breakpoint: Use cpu_possible_mask in {reserve,release}_bp_slot()
  hw_breakpoint: Fix cpu check in task_bp_pinned(cpu)
  kprobes: Fix arch_prepare_kprobe to handle copy insn failures
2013-06-26 08:51:44 -10:00
Linus Torvalds
f71194a7d4 Merge branch 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Peter Anvin:
 "This series fixes a couple of build failures, and fixes MTRR cleanup
  and memory setup on very specific memory maps.

  Finally, it fixes triggering backtraces on all CPUs, which was
  inadvertently disabled on x86."

* 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/efi: Fix dummy variable buffer allocation
  x86: Fix trigger_all_cpu_backtrace() implementation
  x86: Fix section mismatch on load_ucode_ap
  x86: fix build error and kconfig for ia32_emulation and binfmt
  range: Do not add new blank slot with add_range_with_merge
  x86, mtrr: Fix original mtrr range get for mtrr_cleanup
2013-06-21 06:33:48 -10:00
Daniel Lezcano
ea8deb8dfa tick: Fix tick_broadcast_pending_mask not cleared
The recent modification in the cpuidle framework consolidated the
timer broadcast code across the different drivers by setting a new
flag in the idle state. It tells the cpuidle core code to enter/exit
the broadcast mode for the cpu when entering a deep idle state. The
broadcast timer enter/exit is no longer handled by the back-end
driver.

This change made the local interrupt to be enabled *before* calling
CLOCK_EVENT_NOTIFY_EXIT.

On a tegra114, a four cores system, when the flag has been introduced
in the driver, the following warning appeared:

WARNING: at kernel/time/tick-broadcast.c:578 tick_broadcast_oneshot_control
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.10.0-rc3-next-20130529+ #15
[<c00667f8>] (tick_broadcast_oneshot_control+0x1a4/0x1d0) from [<c0065cd0>] (tick_notify+0x240/0x40c)
[<c0065cd0>] (tick_notify+0x240/0x40c) from [<c0044724>] (notifier_call_chain+0x44/0x84)
[<c0044724>] (notifier_call_chain+0x44/0x84) from [<c0044828>] (raw_notifier_call_chain+0x18/0x20)
[<c0044828>] (raw_notifier_call_chain+0x18/0x20) from [<c00650cc>] (clockevents_notify+0x28/0x170)
[<c00650cc>] (clockevents_notify+0x28/0x170) from [<c033f1f0>] (cpuidle_idle_call+0x11c/0x168)
[<c033f1f0>] (cpuidle_idle_call+0x11c/0x168) from [<c000ea94>] (arch_cpu_idle+0x8/0x38)
[<c000ea94>] (arch_cpu_idle+0x8/0x38) from [<c005ea80>] (cpu_startup_entry+0x60/0x134)
[<c005ea80>] (cpu_startup_entry+0x60/0x134) from [<804fe9a4>] (0x804fe9a4)

I don't have the hardware, so I wasn't able to reproduce the warning
but after looking a while at the code, I deduced the following:

 1. the CPU2 enters a deep idle state and sets the broadcast timer

 2. the timer expires, the tick_handle_oneshot_broadcast function is
    called, setting the tick_broadcast_pending_mask and waking up the
    idle cpu CPU2

 3. the CPU2 exits idle handles the interrupt and then invokes
    tick_broadcast_oneshot_control with CLOCK_EVENT_NOTIFY_EXIT which
    runs the following code:

    [...]
    if (dev->next_event.tv64 == KTIME_MAX)
            goto out;

    if (cpumask_test_and_clear_cpu(cpu,
                                 tick_broadcast_pending_mask))
            goto out;
    [...]

    So if there is no next event scheduled for CPU2, we fulfil the
    first condition and jump out without clearing the
    tick_broadcast_pending_mask.

 4. CPU2 goes to deep idle again and calls
    tick_broadcast_oneshot_control with CLOCK_NOTIFY_EVENT_ENTER but
    with the tick_broadcast_pending_mask set for CPU2, triggering the
    warning.

The issue only surfaced due to the modifications of the cpuidle
framework, which resulted in interrupts being enabled before the call
to the clockevents code. If the call happens before interrupts have
been enabled, the warning cannot trigger, because there is still the
event pending which caused the broadcast timer expiry.

Move the check for the next event below the check for the pending bit,
so the pending bit gets cleared whether an event is scheduled on the
cpu or not.

[ tglx: Massaged changelog ]

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reported-and-tested-by: Joseph Lo <josephl@nvidia.com>
Cc: Stephen Warren <swarren@nvidia.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/1371485735-31249-1-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-21 13:10:34 +02:00
Linus Torvalds
a3d5c3460a Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Two smaller fixes - plus a context tracking tracing fix that is a bit
  bigger"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tracing/context-tracking: Add preempt_schedule_context() for tracing
  sched: Fix clear NOHZ_BALANCE_KICK
  sched/x86: Construct all sibling maps if smt
2013-06-20 08:18:35 -10:00
Linus Torvalds
86c76676cf Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Four fixes.  The mmap ones are unfortunately larger than desired -
  fuzzing uncovered bugs that needed perf context life time management
  changes to fix properly"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix broken PEBS-LL support on SNB-EP/IVB-EP
  perf: Fix mmap() accounting hole
  perf: Fix perf mmap bugs
  kprobes: Fix to free gone and unused optprobes
2013-06-20 08:17:36 -10:00
Linus Torvalds
805e318548 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull cpu idle fixes from Thomas Gleixner:
 - Add a missing irq enable. Fallout of the idle conversion
 - Fix stackprotector wreckage caused by the idle conversion

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  idle: Enable interrupts in the weak arch_cpu_idle() implementation
  idle: Add the stack canary init to cpu_startup_entry()
2013-06-20 08:16:07 -10:00
Linus Torvalds
4db88eb4c3 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 - Fix inconstinant clock usage in virtual time accounting
 - Fix a build error in KVM caused by the NOHZ work
 - Remove a pointless timekeeping duty assignment which breaks NOHZ
 - Use a proper notifier return value to avoid random behaviour

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick: Remove useless timekeeping duty attribution to broadcast source
  nohz: Fix notifier return val that enforce timekeeping
  kvm: Move guest entry/exit APIs to context_tracking
  vtime: Use consistent clocks among nohz accounting
2013-06-20 08:15:13 -10:00
Oleg Nesterov
c790b0ad23 hw_breakpoint: Use cpu_possible_mask in {reserve,release}_bp_slot()
fetch_bp_busy_slots() and toggle_bp_slot() use
for_each_online_cpu(), this is obviously wrong wrt cpu_up() or
cpu_down(), we can over/under account the per-cpu numbers.

For example:

	# echo 0 >> /sys/devices/system/cpu/cpu1/online
	# perf record -e mem:0x10 -p 1 &
	# echo 1 >> /sys/devices/system/cpu/cpu1/online
	# perf record -e mem:0x10,mem:0x10,mem:0x10,mem:0x10 -C1 -a &
	# taskset -p 0x2 1

triggers the same WARN_ONCE("Can't find any breakpoint slot") in
arch_install_hw_breakpoint().

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20130620155009.GA6327@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20 17:57:01 +02:00
Oleg Nesterov
8b4d801b2b hw_breakpoint: Fix cpu check in task_bp_pinned(cpu)
trinity fuzzer triggered WARN_ONCE("Can't find any breakpoint
slot") in arch_install_hw_breakpoint() but the problem is not
arch-specific.

The problem is, task_bp_pinned(cpu) checks "cpu == iter->cpu"
but this doesn't account the "all cpus" events with iter->cpu <
0.

This means that, say, register_user_hw_breakpoint(tsk) can
happily create the arbitrary number > HBP_NUM of breakpoints
which can not be activated. toggle_bp_task_slot() is equally
wrong by the same reason and nr_task_bp_pinned[] can have
negative entries.

Simple test:

	# perl -e 'sleep 1 while 1' &
	# perf record -e mem:0x10,mem:0x10,mem:0x10,mem:0x10,mem:0x10 -p `pidof perl`

Before this patch this triggers the same problem/WARN_ON(),
after the patch it correctly fails with -ENOSPC.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20130620155006.GA6324@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20 17:57:00 +02:00
Steven Rostedt
29bb9e5a75 tracing/context-tracking: Add preempt_schedule_context() for tracing
Dave Jones hit the following bug report:

 ===============================
 [ INFO: suspicious RCU usage. ]
 3.10.0-rc2+ #1 Not tainted
 -------------------------------
 include/linux/rcupdate.h:771 rcu_read_lock() used illegally while idle!
 other info that might help us debug this:
 RCU used illegally from idle CPU! rcu_scheduler_active = 1, debug_locks = 0
 RCU used illegally from extended quiescent state!
 2 locks held by cc1/63645:
  #0:  (&rq->lock){-.-.-.}, at: [<ffffffff816b39fd>] __schedule+0xed/0x9b0
  #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff8109d645>] cpuacct_charge+0x5/0x1f0

 CPU: 1 PID: 63645 Comm: cc1 Not tainted 3.10.0-rc2+ #1 [loadavg: 40.57 27.55 13.39 25/277 64369]
 Hardware name: Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H, BIOS F12a 04/23/2010
  0000000000000000 ffff88010f78fcf8 ffffffff816ae383 ffff88010f78fd28
  ffffffff810b698d ffff88011c092548 000000000023d073 ffff88011c092500
  0000000000000001 ffff88010f78fd60 ffffffff8109d7c5 ffffffff8109d645
 Call Trace:
  [<ffffffff816ae383>] dump_stack+0x19/0x1b
  [<ffffffff810b698d>] lockdep_rcu_suspicious+0xfd/0x130
  [<ffffffff8109d7c5>] cpuacct_charge+0x185/0x1f0
  [<ffffffff8109d645>] ? cpuacct_charge+0x5/0x1f0
  [<ffffffff8108dffc>] update_curr+0xec/0x240
  [<ffffffff8108f528>] put_prev_task_fair+0x228/0x480
  [<ffffffff816b3a71>] __schedule+0x161/0x9b0
  [<ffffffff816b4721>] preempt_schedule+0x51/0x80
  [<ffffffff816b4800>] ? __cond_resched_softirq+0x60/0x60
  [<ffffffff816b6824>] ? retint_careful+0x12/0x2e
  [<ffffffff810ff3cc>] ftrace_ops_control_func+0x1dc/0x210
  [<ffffffff816be280>] ftrace_call+0x5/0x2f
  [<ffffffff816b681d>] ? retint_careful+0xb/0x2e
  [<ffffffff816b4805>] ? schedule_user+0x5/0x70
  [<ffffffff816b4805>] ? schedule_user+0x5/0x70
  [<ffffffff816b6824>] ? retint_careful+0x12/0x2e
 ------------[ cut here ]------------

What happened was that the function tracer traced the schedule_user() code
that tells RCU that the system is coming back from userspace, and to
add the CPU back to the RCU monitoring.

Because the function tracer does a preempt_disable/enable_notrace() calls
the preempt_enable_notrace() checks the NEED_RESCHED flag. If it is set,
then preempt_schedule() is called. But this is called before the user_exit()
function can inform the kernel that the CPU is no longer in user mode and
needs to be accounted for by RCU.

The fix is to create a new preempt_schedule_context() that checks if
the kernel is still in user mode and if so to switch it to kernel mode
before calling schedule. It also switches back to user mode coming back
from schedule in need be.

The only user of this currently is the preempt_enable_notrace(), which is
only used by the tracing subsystem.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1369423420.6828.226.camel@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:55:10 +02:00
Vincent Guittot
873b4c65b5 sched: Fix clear NOHZ_BALANCE_KICK
I have faced a sequence where the Idle Load Balance was sometime not
triggered for a while on my platform, in the following scenario:

 CPU 0 and CPU 1 are running tasks and CPU 2 is idle

 CPU 1 kicks the Idle Load Balance
 CPU 1 selects CPU 2 as the new Idle Load Balancer
 CPU 2 sets NOHZ_BALANCE_KICK for CPU 2
 CPU 2 sends a reschedule IPI to CPU 2

 While CPU 3 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2

 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
       task A quickly goes back to sleep (before a tick occurs on CPU 2)
 CPU 2 goes back to idle with NOHZ_BALANCE_KICK set

Whenever CPU 2 will be selected as the ILB, no reschedule IPI will be sent
because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be
performed.

We must wait for the sched softirq to be raised on CPU 2 thanks to another
part the kernel to come back to clear NOHZ_BALANCE_KICK.

The proposed solution clears NOHZ_BALANCE_KICK in schedule_ipi if
we can't raise the sched_softirq for the Idle Load Balance.

Change since V1:

- move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB
  can't run on this CPU (as suggested by Peter)

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1370419991-13870-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:55:09 +02:00
Peter Zijlstra
9bb5d40cd9 perf: Fix mmap() accounting hole
Vince's fuzzer once again found holes. This time it spotted a leak in
the locked page accounting.

When an event had redirected output and its close() was the last
reference to the buffer we didn't have a vm context to undo accounting.

Change the code to destroy the buffer on the last munmap() and detach
all redirected events at that time. This provides us the right context
to undo the vm accounting.

Reported-and-tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130604084421.GI8923@twins.programming.kicks-ass.net
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:44:13 +02:00
Yinghai Lu
0541881502 range: Do not add new blank slot with add_range_with_merge
Joshua reported: Commit cd7b304dfaf1 (x86, range: fix missing merge
during add range) broke mtrr cleanup on his setup in 3.9.5.
corresponding commit in upstream is fbe06b7bae.

The reason is add_range_with_merge could generate blank spot.

We could avoid that by searching new expanded start/end, that
new range should include all connected ranges in range array.
At last add the new expanded start/end to the range array.
Also move up left array so do not add new blank slot in the
range array.

-v2: move left array to avoid enhance add_range()
-v3: include fix from Joshua about memmove declaring when
     DYN_DEBUG is used.

Reported-by: Joshua Covington <joshuacov@googlemail.com>
Tested-by: Joshua Covington <joshuacov@googlemail.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371154622-8929-3-git-send-email-yinghai@kernel.org
Cc: <stable@vger.kernel.org> v3.9
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-06-18 11:32:10 -05:00
Linus Torvalds
d0ff934881 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS fixes from Al Viro:
 "Several fixes + obvious cleanup (you've missed a couple of open-coded
  can_lookup() back then)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  snd_pcm_link(): fix a leak...
  use can_lookup() instead of direct checks of ->i_op->lookup
  move exit_task_namespaces() outside of exit_notify()
  fput: task_work_add() can fail if the caller has passed exit_task_work()
  ncpfs: fix rmdir returns Device or resource busy
2013-06-14 19:18:56 -10:00