mirror of
https://github.com/torvalds/linux.git
synced 2026-07-03 11:37:20 +02:00
e883110aad
61148 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
3b38722efd |
memcg, vmscan: integrate soft reclaim tighter with zone shrinking code
This patchset is sitting out of tree for quite some time without any objections. I would be really happy if it made it into 3.12. I do not want to push it too hard but I think this work is basically ready and waiting more doesn't help. The basic idea is quite simple. Pull soft reclaim into shrink_zone in the first step and get rid of the previous soft reclaim infrastructure. shrink_zone is done in two passes now. First it tries to do the soft limit reclaim and it falls back to reclaim-all mode if no group is over the limit or no pages have been scanned. The second pass happens at the same priority so the only time we waste is the memcg tree walk which has been updated in the third step to have only negligible overhead. As a bonus we will get rid of a _lot_ of code by this and soft reclaim will not stand out like before when it wasn't integrated into the zone shrinking code and it reclaimed at priority 0 (the testing results show that some workloads suffers from such an aggressive reclaim). The clean up is in a separate patch because I felt it would be easier to review that way. The second step is soft limit reclaim integration into targeted reclaim. It should be rather straight forward. Soft limit has been used only for the global reclaim so far but it makes sense for any kind of pressure coming from up-the-hierarchy, including targeted reclaim. The third step (patches 4-8) addresses the tree walk overhead by enhancing memcg iterators to enable skipping whole subtrees and tracking number of over soft limit children at each level of the hierarchy. This information is updated same way the old soft limit tree was updated (from memcg_check_events) so we shouldn't see an additional overhead. In fact mem_cgroup_update_soft_limit is much simpler than tree manipulation done previously. __shrink_zone uses mem_cgroup_soft_reclaim_eligible as a predicate for mem_cgroup_iter so the decision whether a particular group should be visited is done at the iterator level which allows us to decide to skip the whole subtree as well (if there is no child in excess). This reduces the tree walk overhead considerably. * TEST 1 ======== My primary test case was a parallel kernel build with 2 groups (make is running with -j8 with a distribution .config in a separate cgroup without any hard limit) on a 32 CPU machine booted with 1GB memory and both builds run taskset to Node 0 cpus. I was mostly interested in 2 setups. Default - no soft limit set and - and 0 soft limit set to both groups. The first one should tell us whether the rework regresses the default behavior while the second one should show us improvements in an extreme case where both workloads are always over the soft limit. /usr/bin/time -v has been used to collect the statistics and each configuration had 3 runs after fresh boot without any other load on the system. base is mmotm-2013-07-18-16-40 rework all 8 patches applied on top of base * No-limit User no-limit/base: min: 651.92 max: 672.65 avg: 664.33 std: 8.01 runs: 6 no-limit/rework: min: 657.34 [100.8%] max: 668.39 [99.4%] avg: 663.13 [99.8%] std: 3.61 runs: 6 System no-limit/base: min: 69.33 max: 71.39 avg: 70.32 std: 0.79 runs: 6 no-limit/rework: min: 69.12 [99.7%] max: 71.05 [99.5%] avg: 70.04 [99.6%] std: 0.59 runs: 6 Elapsed no-limit/base: min: 398.27 max: 422.36 avg: 408.85 std: 7.74 runs: 6 no-limit/rework: min: 386.36 [97.0%] max: 438.40 [103.8%] avg: 416.34 [101.8%] std: 18.85 runs: 6 The results are within noise. Elapsed time has a bigger variance but the average looks good. * 0-limit User 0-limit/base: min: 573.76 max: 605.63 avg: 585.73 std: 12.21 runs: 6 0-limit/rework: min: 645.77 [112.6%] max: 666.25 [110.0%] avg: 656.97 [112.2%] std: 7.77 runs: 6 System 0-limit/base: min: 69.57 max: 71.13 avg: 70.29 std: 0.54 runs: 6 0-limit/rework: min: 68.68 [98.7%] max: 71.40 [100.4%] avg: 69.91 [99.5%] std: 0.87 runs: 6 Elapsed 0-limit/base: min: 1306.14 max: 1550.17 avg: 1430.35 std: 90.86 runs: 6 0-limit/rework: min: 404.06 [30.9%] max: 465.94 [30.1%] avg: 434.81 [30.4%] std: 22.68 runs: 6 The improvement is really huge here (even bigger than with my previous testing and I suspect that this highly depends on the storage). Page fault statistics tell us at least part of the story: Minor 0-limit/base: min: 37180461.00 max: 37319986.00 avg: 37247470.00 std: 54772.71 runs: 6 0-limit/rework: min: 36751685.00 [98.8%] max: 36805379.00 [98.6%] avg: 36774506.33 [98.7%] std: 17109.03 runs: 6 Major 0-limit/base: min: 170604.00 max: 221141.00 avg: 196081.83 std: 18217.01 runs: 6 0-limit/rework: min: 2864.00 [1.7%] max: 10029.00 [4.5%] avg: 5627.33 [2.9%] std: 2252.71 runs: 6 Same as with my previous testing Minor faults are more or less within noise but Major fault count is way bellow the base kernel. While this looks as a nice win it is fair to say that 0-limit configuration is quite artificial. So I was playing with 0-no-limit loads as well. * TEST 2 ======== The following results are from 2 groups configuration on a 16GB machine (single NUMA node). - A running stream IO (dd if=/dev/zero of=local.file bs=1024) with 2*TotalMem with 0 soft limit. - B running a mem_eater which consumes TotalMem-1G without any limit. The mem_eater consumes the memory in 100 chunks with 1s nap after each mmap+poppulate so that both loads have chance to fight for the memory. The expected result is that B shouldn't be reclaimed and A shouldn't see a big dropdown in elapsed time. User base: min: 2.68 max: 2.89 avg: 2.76 std: 0.09 runs: 3 rework: min: 3.27 [122.0%] max: 3.74 [129.4%] avg: 3.44 [124.6%] std: 0.21 runs: 3 System base: min: 86.26 max: 88.29 avg: 87.28 std: 0.83 runs: 3 rework: min: 81.05 [94.0%] max: 84.96 [96.2%] avg: 83.14 [95.3%] std: 1.61 runs: 3 Elapsed base: min: 317.28 max: 332.39 avg: 325.84 std: 6.33 runs: 3 rework: min: 281.53 [88.7%] max: 298.16 [89.7%] avg: 290.99 [89.3%] std: 6.98 runs: 3 System time improved slightly as well as Elapsed. My previous testing has shown worse numbers but this again seem to depend on the storage speed. My theory is that the writeback doesn't catch up and prio-0 soft reclaim falls into wait on writeback page too often in the base kernel. The patched kernel doesn't do that because the soft reclaim is done from the kswapd/direct reclaim context. This can be seen on the following graph nicely. The A's group usage_in_bytes regurarly drops really low very often. All 3 runs http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream.png resp. a detail of the single run http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream-one-run.png mem_eater seems to be doing better as well. It gets to the full allocation size faster as can be seen on the following graph: http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/mem_eater-one-run.png /proc/meminfo collected during the test also shows that rework kernel hasn't swapped that much (well almost not at all): base: max: 123900 K avg: 56388.29 K rework: max: 300 K avg: 128.68 K kswapd and direct reclaim statistics are of no use unfortunatelly because soft reclaim is not accounted properly as the counters are hidden by global_reclaim() checks in the base kernel. * TEST 3 ======== Another test was the same configuration as TEST2 except the stream IO was replaced by a single kbuild (16 parallel jobs bound to Node0 cpus same as in TEST1) and mem_eater allocated TotalMem-200M so kbuild had only 200MB left. Kbuild did better with the rework kernel here as well: User base: min: 860.28 max: 872.86 avg: 868.03 std: 5.54 runs: 3 rework: min: 880.81 [102.4%] max: 887.45 [101.7%] avg: 883.56 [101.8%] std: 2.83 runs: 3 System base: min: 84.35 max: 85.06 avg: 84.79 std: 0.31 runs: 3 rework: min: 85.62 [101.5%] max: 86.09 [101.2%] avg: 85.79 [101.2%] std: 0.21 runs: 3 Elapsed base: min: 135.36 max: 243.30 avg: 182.47 std: 45.12 runs: 3 rework: min: 110.46 [81.6%] max: 116.20 [47.8%] avg: 114.15 [62.6%] std: 2.61 runs: 3 Minor base: min: 36635476.00 max: 36673365.00 avg: 36654812.00 std: 15478.03 runs: 3 rework: min: 36639301.00 [100.0%] max: 36695541.00 [100.1%] avg: 36665511.00 [100.0%] std: 23118.23 runs: 3 Major base: min: 14708.00 max: 53328.00 avg: 31379.00 std: 16202.24 runs: 3 rework: min: 302.00 [2.1%] max: 414.00 [0.8%] avg: 366.33 [1.2%] std: 47.22 runs: 3 Again we can see a significant improvement in Elapsed (it also seems to be more stable), there is a huge dropdown for the Major page faults and much more swapping: base: max: 583736 K avg: 112547.43 K rework: max: 4012 K avg: 124.36 K Graphs from all three runs show the variability of the kbuild quite nicely. It even seems that it took longer after every run with the base kernel which would be quite surprising as the source tree for the build is removed and caches are dropped after each run so the build operates on a freshly extracted sources everytime. http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater.png My other testing shows that this is just a matter of timing and other runs behave differently the std for Elapsed time is similar ~50. Example of other three runs: http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater2.png So to wrap this up. The series is still doing good and improves the soft limit. The testing results for bunch of cgroups with both stream IO and kbuild loads can be found in "memcg: track children in soft limit excess to improve soft limit". This patch: Memcg soft reclaim has been traditionally triggered from the global reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim then picked up a group which exceeds the soft limit the most and reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages. The infrastructure requires per-node-zone trees which hold over-limit groups and keep them up-to-date (via memcg_check_events) which is not cost free. Although this overhead hasn't turned out to be a bottle neck the implementation is suboptimal because mem_cgroup_update_tree has no idea which zones consumed memory over the limit so we could easily end up having a group on a node-zone tree having only few pages from that node-zone. This patch doesn't try to fix node-zone trees management because it seems that integrating soft reclaim into zone shrinking sounds much easier and more appropriate for several reasons. First of all 0 priority reclaim was a crude hack which might lead to big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot of dirty/writeback pages). Soft reclaim should be applicable also to the targeted reclaim which is awkward right now without additional hacks. Last but not least the whole infrastructure eats quite some code. After this patch shrink_zone is done in 2 passes. First it tries to do the soft reclaim if appropriate (only for global reclaim for now to keep compatible with the original state) and fall back to ignoring soft limit if no group is eligible to soft reclaim or nothing has been scanned during the first pass. Only groups which are over their soft limit or any of their parents up the hierarchy is over the limit are considered eligible during the first pass. Soft limit tree which is not necessary anymore will be removed in the follow up patch to make this patch smaller and easier to review. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Glauber Costa <glommer@openvz.org> Reviewed-by: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ying Han <yinghan@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Glauber Costa <glommer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
c2d95729e3 |
Merge branch 'akpm' (patches from Andrew Morton)
Merge first patch-bomb from Andrew Morton: - Some pidns/fork/exec tweaks - OCFS2 updates - Most of MM - there remain quite a few memcg parts which depend on pending core cgroups changes. Which might have been already merged - I'll check tomorrow... - Various misc stuff all over the place - A few block bits which I never got around to sending to Jens - relatively minor things. - MAINTAINERS maintenance - A small number of lib/ updates - checkpatch updates - epoll - firmware/dmi-scan - Some kprobes work for S390 - drivers/rtc updates - hfsplus feature work - vmcore feature work - rbtree upgrades - AOE updates - pktcdvd cleanups - PPS - memstick - w1 - New "inittmpfs" feature, which does the obvious - More IPC work from Davidlohr. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (303 commits) lz4: fix compression/decompression signedness mismatch ipc: drop ipc_lock_check ipc, shm: drop shm_lock_check ipc: drop ipc_lock_by_ptr ipc, shm: guard against non-existant vma in shmdt(2) ipc: document general ipc locking scheme ipc,msg: drop msg_unlock ipc: rename ids->rw_mutex ipc,shm: shorten critical region for shmat ipc,shm: cleanup do_shmat pasta ipc,shm: shorten critical region for shmctl ipc,shm: make shmctl_nolock lockless ipc,shm: introduce shmctl_nolock ipc: drop ipcctl_pre_down ipc,shm: shorten critical region in shmctl_down ipc,shm: introduce lockless functions to obtain the ipc object initmpfs: use initramfs if rootfstype= or root= specified initmpfs: make rootfs use tmpfs when CONFIG_TMPFS enabled initmpfs: move rootfs code from fs/ramfs/ to init/ initmpfs: move bdi setup from init_rootfs to init_ramfs ... |
||
|
|
b34081f1cd |
lz4: fix compression/decompression signedness mismatch
LZ4 compression and decompression functions require different in signedness input/output parameters: unsigned char for compression and signed char for decompression. Change decompression API to require "(const) unsigned char *". Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Kyungsik Lee <kyungsik.lee@lge.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Yann Collet <yann.collet.73@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
d9a605e40b |
ipc: rename ids->rw_mutex
Since in some situations the lock can be shared for readers, we shouldn't be calling it a mutex, rename it to rwsem. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
57f150a58c |
initmpfs: move rootfs code from fs/ramfs/ to init/
When the rootfs code was a wrapper around ramfs, having them in the same file made sense. Now that it can wrap another filesystem type, move it in with the init code instead. This also allows a subsequent patch to access rootfstype= command line arg. Signed-off-by: Rob Landley <rob@landley.net> Cc: Jeff Layton <jlayton@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Stephen Warren <swarren@nvidia.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jim Cromie <jim.cromie@gmail.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
5e4c0d9741 |
lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interrupt
With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
one such possible user), the following race can happen:
radix_tree_preload()
...
radix_tree_insert()
radix_tree_node_alloc()
if (rtp->nr) {
ret = rtp->nodes[rtp->nr - 1];
<interrupt>
...
radix_tree_preload()
...
radix_tree_insert()
radix_tree_node_alloc()
if (rtp->nr) {
ret = rtp->nodes[rtp->nr - 1];
And we give out one radix tree node twice. That clearly results in radix
tree corruption with different results (usually OOPS) depending on which
two users of radix tree race.
We fix the problem by making radix_tree_node_alloc() always allocate fresh
radix tree nodes when in interrupt. Using preloading when in interrupt
doesn't make sense since all the allocations have to be atomic anyway and
we cannot steal nodes from process-context users because some users rely
on radix_tree_insert() succeeding after radix_tree_preload().
in_interrupt() check is somewhat ugly but we cannot simply key off passed
gfp_mask as that is acquired from root_gfp_mask() and thus the same for
all preload users.
Another part of the fix is to avoid node preallocation in
radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again,
preallocation in such case doesn't make sense and when preallocation would
happen in interrupt we could possibly leak some allocated nodes. However,
some users of radix_tree_preload() require following radix_tree_insert()
to succeed. To avoid unexpected effects for these users,
radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
and we provide a new function radix_tree_maybe_preload() for those users
which get different gfp mask from different call sites and which are
prepared to handle radix_tree_insert() failure.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
2b52908925 |
rbtree: add rbtree_postorder_for_each_entry_safe() helper
Because deletion (of the entire tree) is a relatively common use of the rbtree_postorder iteration, and because doing it safely means fiddling with temporary storage, provide a helper to simplify postorder rbtree iteration. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Reviewed-by: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
9dee5c5151 |
rbtree: add postorder iteration functions
Postorder iteration yields all of a node's children prior to yielding the node itself, and this particular implementation also avoids examining the leaf links in a node after that node has been yielded. In what I expect will be its most common usage, postorder iteration allows the deletion of every node in an rbtree without modifying the rbtree nodes (no _requirement_ that they be nulled) while avoiding referencing child nodes after they have been "deleted" (most commonly, freed). I have only updated zswap to use this functionality at this point, but numerous bits of code (most notably in the filesystem drivers) use a hand rolled postorder iteration that NULLs child links as it traverses the tree. Each of those instances could be replaced with this common implementation. 1 & 2 add rbtree postorder iteration functions. 3 adds testing of the iteration to the rbtree runtime tests 4 allows building the rbtree runtime tests as builtins 5 updates zswap. This patch: Add postorder iteration functions for rbtree. These are useful for safely freeing an entire rbtree without modifying the tree at all. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Reviewed-by: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
9cb218131d |
vmcore: introduce remap_oldmem_pfn_range()
For zfcpdump we can't map the HSA storage because it is only available via a read interface. Therefore, for the new vmcore mmap feature we have introduce a new mechanism to create mappings on demand. This patch introduces a new architecture function remap_oldmem_pfn_range() that should be used to create mappings with remap_pfn_range() for oldmem areas that can be directly mapped. For zfcpdump this is everything besides of the HSA memory. For the areas that are not mapped by remap_oldmem_pfn_range() a generic vmcore a new generic vmcore fault handler mmap_vmcore_fault() is called. This handler works as follows: * Get already available or new page from page cache (find_or_create_page) * Check if /proc/vmcore page is filled with data (PageUptodate) * If yes: Return that page * If no: Fill page using __vmcore_read(), set PageUptodate, and return page Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: Jan Willeke <willeke@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
be8a8d069e |
vmcore: introduce ELF header in new memory feature
For s390 we want to use /proc/vmcore for our SCSI stand-alone dump
(zfcpdump). We have support where the first HSA_SIZE bytes are saved into
a hypervisor owned memory area (HSA) before the kdump kernel is booted.
When the kdump kernel starts, it is restricted to use only HSA_SIZE bytes.
The advantages of this mechanism are:
* No crashkernel memory has to be defined in the old kernel.
* Early boot problems (before kexec_load has been done) can be dumped
* Non-Linux systems can be dumped.
We modify the s390 copy_oldmem_page() function to read from the HSA memory
if memory below HSA_SIZE bytes is requested.
Since we cannot use the kexec tool to load the kernel in this scenario,
we have to build the ELF header in the 2nd (kdump/new) kernel.
So with the following patch set we would like to introduce the new
function that the ELF header for /proc/vmcore can be created in the 2nd
kernel memory.
The following steps are done during zfcpdump execution:
1. Production system crashes
2. User boots a SCSI disk that has been prepared with the zfcpdump tool
3. Hypervisor saves CPU state of boot CPU and HSA_SIZE bytes of memory into HSA
4. Boot loader loads kernel into low memory area
5. Kernel boots and uses only HSA_SIZE bytes of memory
6. Kernel saves registers of non-boot CPUs
7. Kernel does memory detection for dump memory map
8. Kernel creates ELF header for /proc/vmcore
9. /proc/vmcore uses this header for initialization
10. The zfcpdump user space reads /proc/vmcore to write dump to SCSI disk
- copy_oldmem_page() copies from HSA for memory below HSA_SIZE
- copy_oldmem_page() copies from real memory for memory above HSA_SIZE
Currently for s390 we create the ELF core header in the 2nd kernel with a
small trick. We relocate the addresses in the ELF header in a way that
for the /proc/vmcore code it seems to be in the 1st kernel (old) memory
and the read_from_oldmem() returns the correct data. This allows the
/proc/vmcore code to use the ELF header in the 2nd kernel.
This patch:
Exchange the old mechanism with the new and much cleaner function call
override feature that now offcially allows to create the ELF core header
in the 2nd kernel.
To use the new feature the following function have to be defined
by the architecture backend code to read from new memory:
* elfcorehdr_alloc: Allocate ELF header
* elfcorehdr_free: Free the memory of the ELF header
* elfcorehdr_read: Read from ELF header
* elfcorehdr_read_notes: Read from ELF notes
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Jan Willeke <willeke@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
131b2f9f12 |
exec: kill "int depth" in search_binary_handler()
Nobody except search_binary_handler() should touch ->recursion_depth, "int depth" buys nothing but complicates the code, kill it. Probably we should also kill "fn" and the !NULL check, ->load_binary should be always defined. And it can not go away after read_unlock() or this code is buggy anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Evgeniy Polyakov <zbr@ioremap.net> Cc: Zach Levis <zml@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
af96397de8 |
kprobes: allow to specify custom allocator for insn caches
The current two insn slot caches both use module_alloc/module_free to allocate and free insn slot cache pages. For s390 this is not sufficient since there is the need to allocate insn slots that are either within the vmalloc module area or within dma memory. Therefore add a mechanism which allows to specify an own allocator for an own insn slot cache. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
c802d64a35 |
kprobes: unify insn caches
The current kpropes insn caches allocate memory areas for insn slots
with module_alloc(). The assumption is that the kernel image and module
area are both within the same +/- 2GB memory area.
This however is not true for s390 where the kernel image resides within
the first 2GB (DMA memory area), but the module area is far away in the
vmalloc area, usually somewhere close below the 4TB area.
For new pc relative instructions s390 needs insn slots that are within
+/- 2GB of each area. That way we can patch displacements of
pc-relative instructions within the insn slots just like x86 and
powerpc.
The module area works already with the normal insn slot allocator,
however there is currently no way to get insn slots that are within the
first 2GB on s390 (aka DMA area).
Therefore this patch set modifies the kprobes insn slot cache code in
order to allow to specify a custom allocator for the insn slot cache
pages. In addition architecure can now have private insn slot caches
withhout the need to modify common code.
Patch 1 unifies and simplifies the current insn and optinsn caches
implementation. This is a preparation which allows to add more
insn caches in a simple way.
Patch 2 adds the possibility to specify a custom allocator.
Patch 3 makes s390 use the new insn slot mechanisms and adds support for
pc-relative instructions with long displacements.
This patch (of 3):
The two insn caches (insn, and optinsn) each have an own mutex and
alloc/free functions (get_[opt]insn_slot() / free_[opt]insn_slot()).
Since there is the need for yet another insn cache which satifies dma
allocations on s390, unify and simplify the current implementation:
- Move the per insn cache mutex into struct kprobe_insn_cache.
- Move the alloc/free functions to kprobe.h so they are simply
wrappers for the generic __get_insn_slot/__free_insn_slot functions.
The implementation is done with a DEFINE_INSN_CACHE_OPS() macro
which provides the alloc/free functions for each cache if needed.
- move the struct kprobe_insn_cache to kprobe.h which allows to generate
architecture specific insn slot caches outside of the core kprobes
code.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
f9597f24c0 |
syscalls.h: add forward declarations for inplace syscall wrappers
Unclutter -Wmissing-prototypes warning types (enabled at make W=1)
linux/include/linux/syscalls.h:190:18: warning: no previous prototype for 'SyS_semctl' [-Wmissing-prototypes]
asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
^
linux/include/linux/syscalls.h:183:2: note: in expansion of macro '__SYSCALL_DEFINEx'
__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
^
by adding forward declarations right before definitions.
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
bff2dc42bc |
smp.h: move !SMP version of on_each_cpu() out-of-line
All of the other non-trivial !SMP versions of functions in smp.h are out-of-line in up.c. Move on_each_cpu() there as well. This allows us to get rid of the #include <linux/irqflags.h>. The drawback is that this makes both the x86_64 and i386 defconfig !SMP kernels about 200 bytes larger each. Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
fa688207c9 |
smp: quit unconditionally enabling irq in on_each_cpu_mask and on_each_cpu_cond
As in commit
|
||
|
|
f9121153fd |
mm/hwpoison: don't need to hold compound lock for hugetlbfs page
compound lock is introduced by commit e9da73d67("thp: compound_lock."), it
is used to serialize put_page against __split_huge_page_refcount(). In
addition, transparent hugepages will be splitted in hwpoison handler and
just one subpage will be poisoned. There is unnecessary to hold compound
lock for hugetlbfs page. This patch replace compound_trans_order by
compond_order in the place where the page is hugetlbfs page.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
5a53748568 |
mm/page-writeback.c: add strictlimit feature
The feature prevents mistrusted filesystems (ie: FUSE mounts created by
unprivileged users) to grow a large number of dirty pages before
throttling. For such filesystems balance_dirty_pages always check bdi
counters against bdi limits. I.e. even if global "nr_dirty" is under
"freerun", it's not allowed to skip bdi checks. The only use case for now
is fuse: it sets bdi max_ratio to 1% by default and system administrators
are supposed to expect that this limit won't be exceeded.
The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag. A
filesystem may set the flag when it initializes its BDI.
The problematic scenario comes from the fact that nobody pays attention to
the NR_WRITEBACK_TEMP counter (i.e. number of pages under fuse
writeback). The implementation of fuse writeback releases original page
(by calling end_page_writeback) almost immediately. A fuse request queued
for real processing bears a copy of original page. Hence, if userspace
fuse daemon doesn't finalize write requests in timely manner, an
aggressive mmap writer can pollute virtually all memory by those temporary
fuse page copies. They are carefully accounted in NR_WRITEBACK_TEMP, but
nobody cares.
To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
problem" as a shortcut for "a possibility of uncontrolled grow of amount
of RAM consumed by temporary pages allocated by kernel fuse to process
writeback".
The problem was very easy to reproduce. There is a trivial example
filesystem implementation in fuse userspace distribution: fusexmp_fh.c. I
added "sleep(1);" to the write methods, then recompiled and mounted it.
Then created a huge file on the mount point and run a simple program which
mmap-ed the file to a memory region, then wrote a data to the region. An
hour later I observed almost all RAM consumed by fuse writeback. Since
then some unrelated changes in kernel fuse made it more difficult to
reproduce, but it is still possible now.
Putting this theoretical happens-in-the-lab thing aside, there is another
thing that really hurts real world (FUSE) users. This is write-through
page cache policy FUSE currently uses. I.e. handling write(2), kernel
fuse populates page cache and flushes user data to the server
synchronously. This is excessively suboptimal. Pavel Emelyanov's patches
("writeback cache policy") solve the problem, but they also make resolving
NR_WRITEBACK_TEMP problem absolutely necessary. Otherwise, simply copying
a huge file to a fuse mount would result in memory starvation. Miklos,
the maintainer of FUSE, believes strictlimit feature the way to go.
And eventually putting FUSE topics aside, there is one more use-case for
strictlimit feature. Using a slow USB stick (mass storage) in a machine
with huge amount of RAM installed is a well-known pain. Let's make simple
computations. Assuming 64GB of RAM installed, existing implementation of
balance_dirty_pages will start throttling only after 9.6GB of RAM becomes
dirty (freerun == 15% of total RAM). So, the command "cp 9GB_file
/media/my-usb-storage/" may return in a few seconds, but subsequent
"umount /media/my-usb-storage/" will take more than two hours if effective
throughput of the storage is, to say, 1MB/sec.
After inclusion of strictlimit feature, it will be trivial to add a knob
(e.g. /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on demand.
Manually or via udev rule. May be I'm wrong, but it seems to be quite a
natural desire to limit the amount of dirty memory for some devices we are
not fully trust (in the sense of sustainable throughput).
[akpm@linux-foundation.org: fix warning in page-writeback.c]
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
7d9f073b8d |
mm/writeback: make writeback_inodes_wb static
It's not used globally and could be static. Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
6e543d5780 |
mm: vmscan: fix do_try_to_free_pages() livelock
This patch is based on KOSAKI's work and I add a little more description, please refer https://lkml.org/lkml/2012/6/14/74. Currently, I found system can enter a state that there are lots of free pages in a zone but only order-0 and order-1 pages which means the zone is heavily fragmented, then high order allocation could make direct reclaim path's long stall(ex, 60 seconds) especially in no swap and no compaciton enviroment. This problem happened on v3.4, but it seems issue still lives in current tree, the reason is do_try_to_free_pages enter live lock: kswapd will go to sleep if the zones have been fully scanned and are still not balanced. As kswapd thinks there's little point trying all over again to avoid infinite loop. Instead it changes order from high-order to 0-order because kswapd think order-0 is the most important. Look at |
||
|
|
7a8010cd36 |
mm: munlock: manual pte walk in fast path instead of follow_page_mask()
Currently munlock_vma_pages_range() calls follow_page_mask() to obtain each individual struct page. This entails repeated full page table translations and page table lock taken for each page separately. This patch avoids the costly follow_page_mask() where possible, by iterating over ptes within single pmd under single page table lock. The first pte is obtained by get_locked_pte() for non-THP page acquired by the initial follow_page_mask(). The rest of the on-stack pagevec for munlock is filled up using pte_walk as long as pte_present() and vm_normal_page() are sufficient to obtain the struct page. After this patch, a 14% speedup was measured for munlocking a 56GB large memory area with THP disabled. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jörn Engel <joern@logfs.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
d9104d1ca9 |
mm: track vma changes with VM_SOFTDIRTY bit
Pavel reported that in case if vma area get unmapped and then mapped (or expanded) in-place, the soft dirty tracker won't be able to recognize this situation since it works on pte level and ptes are get zapped on unmap, loosing soft dirty bit of course. So to resolve this situation we need to track actions on vma level, there VM_SOFTDIRTY flag comes in. When new vma area created (or old expanded) we set this bit, and keep it here until application calls for clearing soft dirty bit. Thus when user space application track memory changes now it can detect if vma area is renewed. Reported-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Rob Landley <rob@landley.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
e76b63f80d |
memblock, numa: binary search node id
Current early_pfn_to_nid() on arch that support memblock go over
memblock.memory one by one, so will take too many try near the end.
We can use existing memblock_search to find the node id for given pfn,
that could save some time on bigger system that have many entries
memblock.memory array.
Here are the timing differences for several machines. In each case with
the patch less time was spent in __early_pfn_to_nid().
3.11-rc5 with patch difference (%)
-------- ---------- --------------
UV1: 256 nodes 9TB: 411.66 402.47 -9.19 (2.23%)
UV2: 255 nodes 16TB: 1141.02 1138.12 -2.90 (0.25%)
UV2: 64 nodes 2TB: 128.15 126.53 -1.62 (1.26%)
UV2: 32 nodes 2TB: 121.87 121.07 -0.80 (0.66%)
Time in seconds.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
83467efbdb |
mm: migrate: check movability of hugepage in unmap_and_move_huge_page()
Currently hugepage migration works well only for pmd-based hugepages (mainly due to lack of testing,) so we had better not enable migration of other levels of hugepages until we are ready for it. Some users of hugepage migration (mbind, move_pages, and migrate_pages) do page table walk and check pud/pmd_huge() there, so they are safe. But the other users (softoffline and memory hotremove) don't do this, so without this patch they can try to migrate unexpected types of hugepages. To prevent this, we introduce hugepage_migration_support() as an architecture dependent check of whether hugepage are implemented on a pmd basis or not. And on some architecture multiple sizes of hugepages are available, so hugepage_migration_support() also checks hugepage size. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
c8721bbbdd |
mm: memory-hotplug: enable memory hotplug to handle hugepage
Until now we can't offline memory blocks which contain hugepages because a hugepage is considered as an unmovable page. But now with this patch series, a hugepage has become movable, so by using hugepage migration we can offline such memory blocks. What's different from other users of hugepage migration is that we need to decompose all the hugepages inside the target memory block into free buddy pages after hugepage migration, because otherwise free hugepages remaining in the memory block intervene the memory offlining. For this reason we introduce new functions dissolve_free_huge_page() and dissolve_free_huge_pages(). Other than that, what this patch does is straightforwardly to add hugepage migration code, that is, adding hugepage code to the functions which scan over pfn and collect hugepages to be migrated, and adding a hugepage allocation function to alloc_migrate_target(). As for larger hugepages (1GB for x86_64), it's not easy to do hotremove over them because it's larger than memory block. So we now simply leave it to fail as it is. [yongjun_wei@trendmicro.com.cn: remove duplicated include] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
71ea2efb1e |
mm: migrate: remove VM_HUGETLB from vma flag check in vma_migratable()
Enable hugepage migration from migrate_pages(2), move_pages(2), and mbind(2). Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <dhillf@gmail.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
74060e4d78 |
mm: mbind: add hugepage migration code to mbind()
Extend do_mbind() to handle vma with VM_HUGETLB set. We will be able to migrate hugepage with mbind(2) after applying the enablement patch which comes later in this series. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
b8ec1cee5a |
mm: soft-offline: use migrate_pages() instead of migrate_huge_page()
Currently migrate_huge_page() takes a pointer to a hugepage to be migrated
as an argument, instead of taking a pointer to the list of hugepages to be
migrated. This behavior was introduced in commit
|
||
|
|
31caf665e6 |
mm: migrate: make core migration code aware of hugepage
Currently hugepage migration is available only for soft offlining, but it's also useful for some other users of page migration (clearly because users of hugepage can enjoy the benefit of mempolicy and memory hotplug.) So this patchset tries to extend such users to support hugepage migration. The target of this patchset is to enable hugepage migration for NUMA related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and memory hotplug. This patchset does not add hugepage migration for memory compaction, because users of memory compaction mainly expect to construct thp by arranging raw pages, and there's little or no need to compact hugepages. CMA, another user of page migration, can have benefit from hugepage migration, but is not enabled to support it for now (just because of lack of testing and expertise in CMA.) Hugepage migration of non pmd-based hugepage (for example 1GB hugepage in x86_64, or hugepages in architectures like ia64) is not enabled for now (again, because of lack of testing.) As for how these are achived, I extended the API (migrate_pages()) to handle hugepage (with patch 1 and 2) and adjusted code of each caller to check and collect movable hugepages (with patch 3-7). Remaining 2 patches are kind of miscellaneous ones to avoid unexpected behavior. Patch 8 is about making sure that we only migrate pmd-based hugepages. And patch 9 is about choosing appropriate zone for hugepage allocation. My test is mainly functional one, simply kicking hugepage migration via each entry point and confirm that migration is done correctly. Test code is available here: git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git And I always run libhugetlbfs test when changing hugetlbfs's code. With this patchset, no regression was found in the test. This patch (of 9): Before enabling each user of page migration to support hugepage, this patch enables the list of pages for migration to link not only LRU pages, but also hugepages. As a result, putback_movable_pages() and migrate_pages() can handle both of LRU pages and hugepages. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
674470d979 |
lib/genalloc.c: fix overflow of ending address of memory chunk
In struct gen_pool_chunk, end_addr means the end address of memory chunk (inclusive), but in the implementation it is treated as address + size of memory chunk (exclusive), so it points to the address plus one instead of correct ending address. The ending address of memory chunk plus one will cause overflow on the memory chunk including the last address of memory map, e.g. when starting address is 0xFFF00000 and size is 0x100000 on 32bit machine, ending address will be 0x100000000. Use correct ending address like starting address + size - 1. [akpm@linux-foundation.org: add comment to struct gen_pool_chunk:end_addr] Signed-off-by: Joonyoung Shim <jy0922.shim@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
2bb921e526 |
vmstat: create separate function to fold per cpu diffs into local counters
The main idea behind this patchset is to reduce the vmstat update overhead by avoiding interrupt enable/disable and the use of per cpu atomics. This patch (of 3): It is better to have a separate folding function because refresh_cpu_vm_stats() also does other things like expire pages in the page allocator caches. If we have a separate function then refresh_cpu_vm_stats() is only called from the local cpu which allows additional optimizations. The folding function is only called when a cpu is being downed and therefore no other processor will be accessing the counters. Also simplifies synchronization. [akpm@linux-foundation.org: fix UP build] Signed-off-by: Christoph Lameter <cl@linux.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> CC: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
d2cf5ad631 |
swap: clean-up #ifdef in page_mapping()
PageSwapCache() is always false when !CONFIG_SWAP, so compiler properly discard related code. Therefore, we don't need #ifdef explicitly. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
81c0a2bb51 |
mm: page_alloc: fair zone allocator policy
Each zone that holds userspace pages of one workload must be aged at a
speed proportional to the zone size. Otherwise, the time an individual
page gets to stay in memory depends on the zone it happened to be
allocated in. Asymmetry in the zone aging creates rather unpredictable
aging behavior and results in the wrong pages being reclaimed, activated
etc.
But exactly this happens right now because of the way the page allocator
and kswapd interact. The page allocator uses per-node lists of all zones
in the system, ordered by preference, when allocating a new page. When
the first iteration does not yield any results, kswapd is woken up and the
allocator retries. Due to the way kswapd reclaims zones below the high
watermark while a zone can be allocated from when it is above the low
watermark, the allocator may keep kswapd running while kswapd reclaim
ensures that the page allocator can keep allocating from the first zone in
the zonelist for extended periods of time. Meanwhile the other zones
rarely see new allocations and thus get aged much slower in comparison.
The result is that the occasional page placed in lower zones gets
relatively more time in memory, even gets promoted to the active list
after its peers have long been evicted. Meanwhile, the bulk of the
working set may be thrashing on the preferred zone even though there may
be significant amounts of memory available in the lower zones.
Even the most basic test -- repeatedly reading a file slightly bigger than
memory -- shows how broken the zone aging is. In this scenario, no single
page should be able stay in memory long enough to get referenced twice and
activated, but activation happens in spades:
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 0
nr_active_file 8
nr_inactive_file 1582
nr_active_file 11994
$ cat data data data data >/dev/null
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 70
nr_inactive_file 258753
nr_active_file 443214
nr_inactive_file 149793
nr_active_file 12021
Fix this with a very simple round robin allocator. Each zone is allowed a
batch of allocations that is proportional to the zone's size, after which
it is treated as full. The batch counters are reset when all zones have
been tried and the allocator enters the slowpath and kicks off kswapd
reclaim. Allocation and reclaim is now fairly spread out to all
available/allowable zones:
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 174
nr_active_file 4865
nr_inactive_file 53
nr_active_file 860
$ cat data data data data >/dev/null
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 666622
nr_active_file 4988
nr_inactive_file 190969
nr_active_file 937
When zone_reclaim_mode is enabled, allocations will now spread out to all
zones on the local node, not just the first preferred zone (which on a 4G
node might be a tiny Normal zone).
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paul Bolle <paul.bollee@gmail.com>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Tested-by: Kevin Hilman <khilman@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
f92310c187 |
mm/page_alloc.c: fix the value of fallback_migratetype in alloc_extfrag tracepoint()
In the current code, the value of fallback_migratetype that is printed using the mm_page_alloc_extfrag tracepoint, is the value of the migratetype *after* it has been set to the preferred migratetype (if the ownership was changed). Obviously that wouldn't have been the original intent. (We already have a separate 'change_ownership' field to tell whether the ownership of the pageblock was changed from the fallback_migratetype to the preferred type.) The intent of the fallback_migratetype field is to show the migratetype from which we borrowed pages in order to satisfy the allocation request. So fix the code to print that value correctly. Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan@kernel.org> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ebc2a1a691 |
swap: make cluster allocation per-cpu
swap cluster allocation is to get better request merge to improve performance. But the cluster is shared globally, if multiple tasks are doing swap, this will cause interleave disk access. While multiple tasks swap is quite common, for example, each numa node has a kswapd thread doing swap and multiple threads/processes doing direct page reclaim. ioscheduler can't help too much here, because tasks don't send swapout IO down to block layer in the meantime. Block layer does merge some IOs, but a lot not, depending on how many tasks are doing swapout concurrently. In practice, I've seen a lot of small size IO in swapout workloads. We makes the cluster allocation per-cpu here. The interleave disk access issue goes away. All tasks swapout to their own cluster, so swapout will become sequential, which can be easily merged to big size IO. If one CPU can't get its per-cpu cluster (for example, there is no free cluster anymore in the swap), it will fallback to scan swap_map. The CPU can still continue swap. We don't need recycle free swap entries of other CPUs. In my test (swap to a 2-disk raid0 partition), this improves around 10% swapout throughput, and request size is increased significantly. How does this impact swap readahead is uncertain though. On one side, page reclaim always isolates and swaps several adjancent pages, this will make page reclaim write the pages sequentially and benefit readahead. On the other side, several CPU write pages interleave means the pages don't live _sequentially_ but relatively _near_. In the per-cpu allocation case, if adjancent pages are written by different cpus, they will live relatively _far_. So how this impacts swap readahead depends on how many pages page reclaim isolates and swaps one time. If the number is big, this patch will benefit swap readahead. Of course, this is about sequential access pattern. The patch has no impact for random access pattern, because the new cluster allocation algorithm is just for SSD. Alternative solution is organizing swap layout to be per-mm instead of this per-cpu approach. In the per-mm layout, we allocate a disk range for each mm, so pages of one mm live in swap disk adjacently. per-mm layout has potential issues of lock contention if multiple reclaimers are swap pages from one mm. For a sequential workload, per-mm layout is better to implement swap readahead, because pages from the mm are adjacent in disk. But per-cpu layout isn't very bad in this workload, as page reclaim always isolates and swaps several pages one time, such pages will still live in disk sequentially and readahead can utilize this. For a random workload, per-mm layout isn't beneficial of request merge, because it's quite possible pages from different mm are swapout in the meantime and IO can't be merged in per-mm layout. while with per-cpu layout we can merge requests from any mm. Considering random workload is more popular in workloads with swap (and per-cpu approach isn't too bad for sequential workload too), I'm choosing per-cpu layout. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
815c2c543d |
swap: make swap discard async
swap can do cluster discard for SSD, which is good, but there are some problems here: 1. swap do the discard just before page reclaim gets a swap entry and writes the disk sectors. This is useless for high end SSD, because an overwrite to a sector implies a discard to original sector too. A discard + overwrite == overwrite. 2. the purpose of doing discard is to improve SSD firmware garbage collection. Idealy we should send discard as early as possible, so firmware can do something smart. Sending discard just after swap entry is freed is considered early compared to sending discard before write. Of course, if workload is already bound to gc speed, sending discard earlier or later doesn't make 3. block discard is a sync API, which will delay scan_swap_map() significantly. 4. Write and discard command can be executed parallel in PCIe SSD. Making swap discard async can make execution more efficiently. This patch makes swap discard async and moves discard to where swap entry is freed. Discard and write have no dependence now, so above issues can be avoided. Idealy we should do discard for any freed sectors, but some SSD discard is very slow. This patch still does discard for a whole cluster. My test does a several round of 'mmap, write, unmap', which will trigger a lot of swap discard. In a fusionio card, with this patch, the test runtime is reduced to 18% of the time without it, so around 5.5x faster. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
2a8f944934 |
swap: change block allocation algorithm for SSD
I'm using a fast SSD to do swap. scan_swap_map() sometimes uses up to 20~30% CPU time (when cluster is hard to find, the CPU time can be up to 80%), which becomes a bottleneck. scan_swap_map() scans a byte array to search a 256 page cluster, which is very slow. Here I introduced a simple algorithm to search cluster. Since we only care about 256 pages cluster, we can just use a counter to track if a cluster is free. Every 256 pages use one int to store the counter. If the counter of a cluster is 0, the cluster is free. All free clusters will be added to a list, so searching cluster is very efficient. With this, scap_swap_map() overhead disappears. This might help low end SD card swap too. Because if the cluster is aligned, SD firmware can do flash erase more efficiently. We only enable the algorithm for SSD. Hard disk swap isn't fast enough and has downside with the algorithm which might introduce regression (see below). The patch slightly changes which cluster is choosen. It always adds free cluster to list tail. This can help wear leveling for low end SSD too. And if no cluster found, the scan_swap_map() will do search from the end of last cluster. So if no cluster found, the scan_swap_map() will do search from the end of last free cluster, which is random. For SSD, this isn't a problem at all. Another downside is the cluster must be aligned to 256 pages, which will reduce the chance to find a cluster. I would expect this isn't a big problem for SSD because of the non-seek penality. (And this is the reason I only enable the algorithm for SSD). Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
6df46865ff |
mm: vmstats: track TLB flush stats on UP too
The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush
counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled
for SMP.
UP systems do not do remote TLB flushes, so compile those counters out on
UP.
arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly. This is
probably an optimization since both the mtrr code and __flush_tlb() write
cr4. It would probably be safe to make that a flush_tlb_all() (and then
get these statistics), but the mtrr code is ancient and I'm hesitant to
touch it other than to just stick in the counters.
[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
9824cf9753 |
mm: vmstats: tlb flush counters
I was investigating some TLB flush scaling issues and realized that we do not have any good methods for figuring out how many TLB flushes we are doing. It would be nice to be able to do these in generic code, but the arch-independent calls don't explicitly specify whether we actually need to do remote flushes or not. In the end, we really need to know if we actually _did_ global vs. local invalidations, so that leaves us with few options other than to muck with the counters from arch-specific code. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ef0855d334 |
mm: mempolicy: turn vma_set_policy() into vma_dup_policy()
Simple cleanup. Every user of vma_set_policy() does the same work, this looks a bit annoying imho. And the new trivial helper which does mpol_dup() + vma_set_policy() to simplify the callers. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
bab55417b1 |
block: support embedded device command line partition
Read block device partition table from command line. The partition used for fixed block device (eMMC) embedded device. It is no MBR, save storage space. Bootloader can be easily accessed by absolute address of data on the block device. Users can easily change the partition. This code reference MTD partition, source "drivers/mtd/cmdlinepart.c" About the partition verbose reference "Documentation/block/cmdline-partition.txt" [akpm@linux-foundation.org: fix printk text] [yongjun_wei@trendmicro.com.cn: fix error return code in parse_parts()] Signed-off-by: Cai Zhiyong <caizhiyong@huawei.com> Cc: Karel Zak <kzak@redhat.com> Cc: "Wanglin (Albert)" <albert.wanglin@huawei.com> Cc: Marius Groeger <mag@sysgo.de> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Brian Norris <computersforpeace@gmail.com> Cc: Artem Bityutskiy <dedekind@infradead.org> Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
e1403b8edf |
include/linux/sched.h: don't use task->pid/tgid in same_thread_group/has_group_leader_pid
task_struct->pid/tgid should go away. 1. Change same_thread_group() to use task->signal for comparison. 2. Change has_group_leader_pid(task) to compare task_pid(task) with signal->leader_pid. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Sergey Dyasly <dserrg@gmail.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
3b8967d713 |
include/linux/smp.h:on_each_cpu(): switch back to a C function
Revert commit
|
||
|
|
bbda1baeeb |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Brown paper bag fix in HTB scheduler, class options set incorrectly
due to a typoe. Fix from Vimalkumar.
2) It's possible for the ipv6 FIB garbage collector to run before all
the necessary datastructure are setup during init, defer the
notifier registry to avoid this problem. Fix from Michal Kubecek.
3) New i40e ethernet driver from the Intel folks.
4) Add new qmi wwan device IDs, from Bjørn Mork.
5) Doorbell lock in bnx2x driver is not initialized properly in some
configurations, fix from Ariel Elior.
6) Revert an ipv6 packet option padding change that broke standardized
ipv6 implementation test suites. From Jiri Pirko.
7) Fix synchronization of ARP information in bonding layer, from
Nikolay Aleksandrov.
8) Fix missing error return resulting in illegal memory accesses in
openvswitch, from Daniel Borkmann.
9) SCTP doesn't signal poll events properly due to mistaken operator
precedence, fix also from Daniel Borkmann.
10) __netdev_pick_tx() passes wrong index to sk_tx_queue_set() which
essentially disables caching of TX queue in sockets :-/ Fix from
Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (29 commits)
net_sched: htb: fix a typo in htb_change_class()
net: qmi_wwan: add new Qualcomm devices
ipv6: don't call fib6_run_gc() until routing is ready
net: tilegx driver: avoid compiler warning
fib6_rules: fix indentation
irda: vlsi_ir: Remove casting the return value which is a void pointer
irda: donauboe: Remove casting the return value which is a void pointer
net: fix multiqueue selection
net: sctp: fix smatch warning in sctp_send_asconf_del_ip
net: sctp: fix bug in sctp_poll for SOCK_SELECT_ERR_QUEUE
net: fib: fib6_add: fix potential NULL pointer dereference
net: ovs: flow: fix potential illegal memory access in __parse_flow_nlattrs
bcm63xx_enet: remove deprecated IRQF_DISABLED
net: korina: remove deprecated IRQF_DISABLED
macvlan: Move skb_clone check closer to call
qlcnic: Fix warning reported by kbuild test robot.
bonding: fix bond_arp_rcv setting and arp validate desync state
bonding: fix store_arp_validate race with mode change
ipv6/exthdrs: accept tlv which includes only padding
bnx2x: avoid atomic allocations during initialization
...
|
||
|
|
2c861cc65e |
ipv6: don't call fib6_run_gc() until routing is ready
When loading the ipv6 module, ndisc_init() is called before ip6_route_init(). As the former registers a handler calling fib6_run_gc(), this opens a window to run the garbage collector before necessary data structures are initialized. If a network device is initialized in this window, adding MAC address to it triggers a NETDEV_CHANGEADDR event, leading to a crash in fib6_clean_all(). Take the event handler registration out of ndisc_init() into a separate function ndisc_late_init() and move it after ip6_route_init(). Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
|
2b76db6a0f |
for-linus-3.12-merge minor 9p fixes and tweaks for 3.12 merge window
The first fixes namespace issues which causes a kernel NULL pointer dereference, the second fixes uevent handling to work better with udev, and the third switches some code to use srlcpy instead of strncpy in order to be safer. All changes have been baking in for-next for at least 2 weeks. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) Comment: GPGTools - http://gpgtools.org iQIcBAABAgAGBQJSMJjZAAoJEDZk62b0Tg6x81sQAKa60QStBKhnL65bvG+ooIsS mhwfmFyaWOKw1ezwY2Vk0+JnmKDBpKmqjjwyL3nLP18TcRZStPiFdcJBKWl+czge FTv14t54CcjysYPbYN7+gUap4F5mfg0mcHaR0UGow505dNyjwd7mqkZhy1IqhdvP Ue/h0RE46GeNtdirxrKBdEfW/7TAL0tcoRgjKu0ev1V2sXCJZywuXgkzWjByRXwT JOg04gGnYThuek0/KUPRhf0KxB0CyKrZiics7LGb40HkYYxs7ahADACttLyiDr8l GntfHXLgvVlU5QcSbKRfLp0zNbi7AxWmJrwYsEwpas4tUw1Q+pVJ2EE2Ameuq5G+ LrMGmRVQCVYw8UN+OYUO7glhXEJcCPJj6vxgm+NVXx24yaQyGI1aTsIEjHwZ/hkm wlQHC47z6/fIypkXpsU6pYWF/r3GwXHokYReejATQWEPIzIxvHeThe0jjqMLth7F zmsHZTpmECqtti1fizy5wBZD25wAIxdf+rf8nKy1VvcSN4s08ESSlC/kV/siNeko efFnL8xbjP5SPEVoBtXM6eTDHrQ0S+ACSGWtp0FGXKOW4PKzS60ve2Stp+FYZgQc WgXI7+NBU6Z9z+cZ9bsY0hrGwK1YZiR4F3KJ5ofTuxAO6n7zd+N3fGBuQJ2tiW9P pKtIXNozWqnAU9Wx4rGa =YbFT -----END PGP SIGNATURE----- Merge tag 'for-linus-3.12-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs Pull 9p updates from Eric Van Hensbergen: "Minor 9p fixes and tweaks for 3.12 merge window The first fixes namespace issues which causes a kernel NULL pointer dereference, the second fixes uevent handling to work better with udev, and the third switches some code to use srlcpy instead of strncpy in order to be safer. All changes have been baking in for-next for at least 2 weeks" * tag 'for-linus-3.12-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: fs/9p: avoid accessing utsname after namespace has been torn down 9p: send uevent after adding/removing mount_tag attribute fs: 9p: use strlcpy instead of strncpy |
||
|
|
a22a0fdba4 |
New drivers:
- APM X-Gene system reboot driver by Feng Kan and Loc Ho (APM).
- Qualcomm MSM reboot/poweroff driver by Abhimanyu Kapur (Codeaurora).
- Texas Instruments BQ24190 charger driver by Mark A. Greer (Animal Creek
Technologies).
- Texas Instruments TWL4030 MADC battery driver by Lukas Märdian and Marek
Belisko (Golden Delicious Computers). The driver is used on Freerunner
GTA04 phones.
Highlighted fixes and improvements:
- Suspend/wakeup logic improvements: power supply objects will block
system suspend until all power supply events are processed. Thanks to
Zoran Markovic (Linaro), Arve Hjonnevag and Todd Poynor (Google).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABAgAGBQJSL/sOAAoJEGgI9fZJve1bDMAP/ih4C5NsuRsQxZMzTWIft80+
S9r/co8Q/1LXgxqdMe1xq0pPgceuXCTHXjSK+ZM+/A9Qi/jFVg1S8skrAlUzMq2m
PaRbyAVxtahU1uUIXqQLNbYO/IuRKNdXnREVt6w4K9BMjbjhsNdIRG6zOdsLxhKL
r64sm3fqEJNRxIGuGzXPnMLNTL99uDJmMs1oPHAJwGXqWQA74K8iBHRgzi+kTAxG
juDerMGoxU7modEQMlK4jKYj2bju4liYsrQ9pQBU1QUmHgPl7QLigXHtnkbWivrN
nPPSJCNdtCXXtazU0ptkI717aPo0SBtEVQETF960ZTASdoqt2Ver8h4RW7dBAOIw
tc4hdbJjZvuaDz4Os2Wy0NW+1uXVmj/Jyl9WtNtI8XYBMT0GWNN16xZZYKX303UG
h0Myw2XhNko95g6XxQBfiat5VlH8aSIDz3LVTM1RC9zrDvjmpsI7bx+17Tw2+McK
WdPNNHjCcgRNOIVt/vLYxQYxbJWfPf6IW2nn1v9Kc96K5WD2+3n6bn833szxFu75
FnNghYLKkp9iXKXgGysHKoczAQa0eUxjAE/UAo10i4Eg7JmXF2Q4A+wwa2QrMutK
FlfQo8pJGAiw0xGjkbPw8DkgBEkY8v9cJEIr0BUI6LwUoraeAjlkLjC5Im3V9bYM
YRSzcSEESc3J/1YTGmam
=k4RI
-----END PGP SIGNATURE-----
Merge tag 'for-v3.12' of git://git.infradead.org/battery-2.6
Pull battery/power supply driver updates from Anton Vorontsov:
"New drivers:
- APM X-Gene system reboot driver by Feng Kan and Loc Ho (APM).
- Qualcomm MSM reboot/poweroff driver by Abhimanyu Kapur (Codeaurora).
- Texas Instruments BQ24190 charger driver by Mark A. Greer (Animal
Creek Technologies).
- Texas Instruments TWL4030 MADC battery driver by Lukas Märdian and
Marek Belisko (Golden Delicious Computers). The driver is used on
Freerunner GTA04 phones.
Highlighted fixes and improvements:
- Suspend/wakeup logic improvements: power supply objects will block
system suspend until all power supply events are processed. Thanks
to Zoran Markovic (Linaro), Arve Hjonnevag and Todd Poynor (Google)"
* tag 'for-v3.12' of git://git.infradead.org/battery-2.6:
rx51_battery: Fix channel number when reading adc value
power: Add twl4030_madc battery driver.
bq24190_charger: Workaround SS definition problem on i386 builds
power_supply: Prevent suspend until power supply events are processed
vexpress-poweroff: Should depend on the required infrastructure
twl4030-charger: Fix compiler warning with regulator_enable()
rx51_battery: Replace hardcoded channels values.
bq24190_charger: Add support for TI BQ24190 Battery Charger
ab8500-charger: We print an unintended error message
max8925_power: Fix missing of_node_put
power_supply: Replace strict_strtol() with kstrtol()
power: Add APM X-Gene system reboot driver
power_supply: tosa_battery: Get rid of irq_to_gpio usage
power supply: collie_battery: Convert to use dev_pm_ops
power_supply: Make goldfish_battery depend on GOLDFISH || COMPILE_TEST
power: reset: Add msm restart support
MAINTAINERS: drivers/power: add entry for SmartReflex AVS drivers
|
||
|
|
fa1586a7e4 |
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie: "Daniel had some fixes queued up, that were delayed, the stolen memory ones and vga arbiter ones are quite useful, along with his usual bunch of stuff, nothing for HSW outputs yet. The one nouveau fix is for a regression I caused with the poweroff stuff" * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (30 commits) drm/nouveau: fix oops on runtime suspend/resume drm/i915: Delay disabling of VGA memory until vgacon->fbcon handoff is done drm/i915: try not to lose backlight CBLV precision drm/i915: Confine page flips to BCS on Valleyview drm/i915: Skip stolen region initialisation if none is reserved drm/i915: fix gpu hang vs. flip stall deadlocks drm/i915: Hold an object reference whilst we shrink it drm/i915: fix i9xx_crtc_clock_get for multiplied pixels drm/i915: handle sdvo input pixel multiplier correctly again drm/i915: fix hpd work vs. flush_work in the pageflip code deadlock drm/i915: fix up the relocate_entry refactoring drm/i915: Fix pipe config warnings when dealing with LVDS fixed mode drm/i915: Don't call sg_free_table() if sg_alloc_table() fails i915: Update VGA arbiter support for newer devices vgaarb: Fix VGA decodes changes vgaarb: Don't disable resources that are not owned drm/i915: Pin pages whilst mapping the dma-buf drm/i915: enable trickle feed on Haswell x86: add early quirk for reserving Intel graphics stolen memory v5 drm/i915: split PCI IDs out into i915_drm.h v4 ... |
||
|
|
cf596766fc |
Merge branch 'nfsd-next' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields: "This was a very quiet cycle! Just a few bugfixes and some cleanup" * 'nfsd-next' of git://linux-nfs.org/~bfields/linux: rpc: let xdr layer allocate gssproxy receieve pages rpc: fix huge kmalloc's in gss-proxy rpc: comment on linux_cred encoding, treat all as unsigned rpc: clean up decoding of gssproxy linux creds svcrpc: remove unused rq_resused nfsd4: nfsd4_create_clid_dir prints uninitialized data nfsd4: fix leak of inode reference on delegation failure Revert "nfsd: nfs4_file_get_access: need to be more careful with O_RDWR" sunrpc: prepare NFS for 2038 nfsd4: fix setlease error return nfsd: nfs4_file_get_access: need to be more careful with O_RDWR |
||
|
|
31f7c3a688 |
Device tree core updates for v3.12
Generally minor changes. A bunch of bug fixes, particularly for
initialization and some refactoring. Most notable change if feeding the
entire flattened tree into the random pool at boot. May not be
significant, but shouldn't hurt either.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJSL12LAAoJEEFnBt12D9kB64gP/RBipnYbo3RPanHg+lE/J1V7
KSVFNGKWJHxTg47VVC1YJGIG21jqxAilpdS2MQL5FP7iyd+IzvtHpQiJgp+2G+pq
di06yrdyrYErxRgZgGQi8IpR538ZzOEVLCKJGdb09YelkRzPT5au7CC1MAsX3qco
yba7PHk0/Nc4hZE4aGbgR1DlRmn86ob7mM0KFE/LORaSN2BueMgWcwKhQXYNGyoh
assX4yNhAbUG6Bgw7paBLDGqHh8c5Ei5AppU8yPb+N094jgYHBJryUoDlzzUHD23
qqiEqHhUKT0TpgHNs8KH0WZFugcmjKvYEbzdzadBxqfXnJN4fKSEcdfF3iz4T14j
U6EZks89GoHwA523OghUZkKNOqlsUdWfdKz+8/grQqKisYwDcf3fCxEYk/4weDCQ
b6fFlOv6+AI3btjXp6F511ZKxyT4ZZzkHjp/ZSrhBygyamNZfax0ma0j+ZS9AZql
kPxQS0nOve6NKaP7vXxMmW5sGMnL19ER/Hm31wthGcWI43GVebUdklnzfGaEeSjs
pmP8oiCNemceqVpiPKxcOxiguf/eyIjP1SFXbguASygUmQeTDbbJ8n1FYznCitue
xJgWttKWsEf/aMR3eJtQ3aBmHR3rijAV4E28Wlq8XMkocwvpQm2zMocS2Z5BJ80S
hi1kQVy8+RxNX96tOSp1
=GSWl
-----END PGP SIGNATURE-----
Merge tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux
Pull device tree core updates from Grant Likely:
"Generally minor changes. A bunch of bug fixes, particularly for
initialization and some refactoring. Most notable change if feeding
the entire flattened tree into the random pool at boot. May not be
significant, but shouldn't hurt either"
Tim Bird questions whether the boot time cost of the random feeding may
be noticeable. And "add_device_randomness()" is definitely not some
speed deamon of a function.
* tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux:
of/platform: add error reporting to of_amba_device_create()
irq/of: Fix comment typo for irq_of_parse_and_map
of: Feed entire flattened device tree into the random pool
of/fdt: Clean up casting in unflattening path
of/fdt: Remove duplicate memory clearing on FDT unflattening
gpio: implement gpio-ranges binding document fix
of: call __of_parse_phandle_with_args from of_parse_phandle
of: introduce of_parse_phandle_with_fixed_args
of: move of_parse_phandle()
of: move documentation of of_parse_phandle_with_args
of: Fix missing memory initialization on FDT unflattening
of: consolidate definition of early_init_dt_alloc_memory_arch()
of: Make of_get_phy_mode() return int i.s.o. const int
include: dt-binding: input: create a DT header defining key codes.
of/platform: Staticize of_platform_device_create_pdata()
of: Specify initrd location using 64-bit
dt: Typo fix
OF: make of_property_for_each_{u32|string}() use parameters if OF is not enabled
|