linux/block
Jeff Moyer 8a9a6a1337 cfq-iosched: fix incorrect filing of rt async cfqq
commit c6ce194325 upstream.

Hi,

If you can manage to submit an async write as the first async I/O from
the context of a process with realtime scheduling priority, then a
cfq_queue is allocated, but filed into the wrong async_cfqq bucket.  It
ends up in the best effort array, but actually has realtime I/O
scheduling priority set in cfqq->ioprio.

The reason is that cfq_get_queue assumes the default scheduling class and
priority when there is no information present (i.e. when the async cfqq
is created):

static struct cfq_queue *
cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
	      struct bio *bio, gfp_t gfp_mask)
{
	const int ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
	const int ioprio = IOPRIO_PRIO_DATA(cic->ioprio);

cic->ioprio starts out as 0, which is "invalid".  So a class of 0
(IOPRIO_CLASS_NONE) is passed to cfq_async_queue_prio like so:

		async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);

static struct cfq_queue **
cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
{
        switch (ioprio_class) {
        case IOPRIO_CLASS_RT:
                return &cfqd->async_cfqq[0][ioprio];
        case IOPRIO_CLASS_NONE:
                ioprio = IOPRIO_NORM;
                /* fall through */
        case IOPRIO_CLASS_BE:
                return &cfqd->async_cfqq[1][ioprio];
        case IOPRIO_CLASS_IDLE:
                return &cfqd->async_idle_cfqq;
        default:
                BUG();
        }
}

Here, instead of returning a class mapped from the process' scheduling
priority, we get back the bucket associated with IOPRIO_CLASS_BE.
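
To make the fall-through concrete, here is a minimal user-space sketch of the
lookup.  The ioprio macros and class values follow the kernel's encoding as I
understand it; struct model_bucket and model_async_bucket() are hypothetical
names invented for this illustration, and the function returns array indices
rather than a pointer into cfqd->async_cfqq.

```c
#include <assert.h>

/* Encoding modelled on the kernel's ioprio definitions. */
#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_PRIO_CLASS(mask)	((mask) >> IOPRIO_CLASS_SHIFT)
#define IOPRIO_PRIO_DATA(mask)	((mask) & ((1U << IOPRIO_CLASS_SHIFT) - 1))
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_NORM		4

enum {
	IOPRIO_CLASS_NONE,
	IOPRIO_CLASS_RT,
	IOPRIO_CLASS_BE,
	IOPRIO_CLASS_IDLE,
};

/* Hypothetical stand-in for a slot in cfqd->async_cfqq[row][level]. */
struct model_bucket {
	int row;	/* 0 = RT array, 1 = BE array, 2 = the single idle queue */
	int level;	/* priority level within the row; 0 for idle */
};

/* Mirrors the switch in cfq_async_queue_prio(), returning indices only. */
static struct model_bucket model_async_bucket(unsigned int cic_ioprio)
{
	int ioprio_class = (int)IOPRIO_PRIO_CLASS(cic_ioprio);
	int ioprio = (int)IOPRIO_PRIO_DATA(cic_ioprio);
	struct model_bucket b = { 0, 0 };

	switch (ioprio_class) {
	case IOPRIO_CLASS_RT:
		b.row = 0;
		b.level = ioprio;
		break;
	case IOPRIO_CLASS_NONE:
		ioprio = IOPRIO_NORM;
		/* fall through */
	case IOPRIO_CLASS_BE:
		b.row = 1;
		b.level = ioprio;
		break;
	case IOPRIO_CLASS_IDLE:
		b.row = 2;
		b.level = 0;
		break;
	default:
		assert(0);	/* the kernel calls BUG() here */
	}
	return b;
}
```

With cic_ioprio == 0 the model lands in row 1 (best effort) at level
IOPRIO_NORM, no matter what the submitting task's CPU scheduling policy is.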

Now, there is no queue allocated there yet, so we create it:

		cfqq = cfq_find_alloc_queue(cfqd, is_sync, cic, bio, gfp_mask);

That function ends up doing this:

			cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
			cfq_init_prio_data(cfqq, cic);

cfq_init_cfqq marks the priority as having changed.  Then, cfq_init_prio_data
does this:

	ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
	switch (ioprio_class) {
	default:
		printk(KERN_ERR "cfq: bad prio %x\n", ioprio_class);
	case IOPRIO_CLASS_NONE:
		/*
		 * no prio set, inherit CPU scheduling settings
		 */
		cfqq->ioprio = task_nice_ioprio(tsk);
		cfqq->ioprio_class = task_nice_ioclass(tsk);
		break;

So we basically have two code paths that treat IOPRIO_CLASS_NONE
differently, which results in an RT async cfqq filed into a best effort
bucket.
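
The disagreement between the two paths can be modelled in user space.  This is
only a sketch: struct model_task, the MODEL_SCHED_* values and all model_*
functions are hypothetical, and model_task_ioclass() assumes that
task_nice_ioclass() maps SCHED_FIFO/SCHED_RR to the RT class and SCHED_IDLE to
the idle class, with everything else best effort.

```c
#include <assert.h>
#include <stdbool.h>

#define IOPRIO_CLASS_SHIFT	13

enum {
	IOPRIO_CLASS_NONE,
	IOPRIO_CLASS_RT,
	IOPRIO_CLASS_BE,
	IOPRIO_CLASS_IDLE,
};

/* Hypothetical, simplified task: only the policy matters here. */
enum model_policy {
	MODEL_SCHED_NORMAL,
	MODEL_SCHED_FIFO,
	MODEL_SCHED_RR,
	MODEL_SCHED_IDLE,
};

struct model_task {
	enum model_policy policy;
};

/* Assumed behaviour of task_nice_ioclass(): derive the I/O class from the CPU policy. */
static int model_task_ioclass(const struct model_task *t)
{
	if (t->policy == MODEL_SCHED_IDLE)
		return IOPRIO_CLASS_IDLE;
	if (t->policy == MODEL_SCHED_FIFO || t->policy == MODEL_SCHED_RR)
		return IOPRIO_CLASS_RT;
	return IOPRIO_CLASS_BE;
}

/* Path 1: the bucket lookup treats "no class" as best effort. */
static int model_bucket_class(unsigned int cic_ioprio)
{
	int c = (int)(cic_ioprio >> IOPRIO_CLASS_SHIFT);

	return c == IOPRIO_CLASS_NONE ? IOPRIO_CLASS_BE : c;
}

/* Path 2: queue initialisation treats "no class" as "inherit from the task". */
static int model_queue_class(unsigned int cic_ioprio, const struct model_task *t)
{
	int c = (int)(cic_ioprio >> IOPRIO_CLASS_SHIFT);

	return c == IOPRIO_CLASS_NONE ? model_task_ioclass(t) : c;
}

/* True when the queue's own class differs from the bucket it was filed in. */
static bool model_misfiled(unsigned int cic_ioprio, const struct model_task *t)
{
	return model_bucket_class(cic_ioprio) != model_queue_class(cic_ioprio, t);
}
```

In this model, a task with an RT CPU policy and cic_ioprio == 0 is reported as
misfiled, while a normal task with the same unset priority is not.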

Attached is a patch which fixes the problem.  I'm not sure how to make
it cleaner.  Suggestions would be welcome.
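
The patch itself is not reproduced in this text.  As an illustration only, one
way to make the two paths agree is to resolve an unset priority from the task
before choosing the bucket, so both use the same class and level.  The sketch
below is a user-space model, not the committed change: every model_* name is
hypothetical, and the nice-to-level formula is my assumption about what
task_nice_ioprio() computes.

```c
#include <assert.h>

#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_PRIO_CLASS(mask)	((mask) >> IOPRIO_CLASS_SHIFT)
#define IOPRIO_PRIO_DATA(mask)	((mask) & ((1U << IOPRIO_CLASS_SHIFT) - 1))

enum {
	IOPRIO_CLASS_NONE,
	IOPRIO_CLASS_RT,
	IOPRIO_CLASS_BE,
	IOPRIO_CLASS_IDLE,
};

/* Hypothetical, simplified task: CPU policy and nice value only. */
enum model_policy {
	MODEL_SCHED_NORMAL,
	MODEL_SCHED_FIFO,
	MODEL_SCHED_RR,
	MODEL_SCHED_IDLE,
};

struct model_task {
	enum model_policy policy;
	int nice;	/* -20..19 */
};

struct model_prio {
	int ioprio_class;
	int ioprio;
};

/* Resolve "no priority set" once, from the task, before any bucket lookup. */
static struct model_prio model_effective_prio(unsigned int cic_ioprio,
					      const struct model_task *t)
{
	struct model_prio p = {
		(int)IOPRIO_PRIO_CLASS(cic_ioprio),
		(int)IOPRIO_PRIO_DATA(cic_ioprio),
	};

	if (p.ioprio_class == IOPRIO_CLASS_NONE) {
		/* Assumed mapping: nice -20..19 onto levels 0..7. */
		p.ioprio = (t->nice + 20) / 5;
		if (t->policy == MODEL_SCHED_IDLE)
			p.ioprio_class = IOPRIO_CLASS_IDLE;
		else if (t->policy == MODEL_SCHED_FIFO ||
			 t->policy == MODEL_SCHED_RR)
			p.ioprio_class = IOPRIO_CLASS_RT;
		else
			p.ioprio_class = IOPRIO_CLASS_BE;
	}
	return p;
}

/* Row of async_cfqq chosen once the class is known: 0 = RT, 1 = BE, 2 = idle. */
static int model_bucket_row(struct model_prio p)
{
	switch (p.ioprio_class) {
	case IOPRIO_CLASS_RT:
		return 0;
	case IOPRIO_CLASS_BE:
		return 1;
	case IOPRIO_CLASS_IDLE:
		return 2;
	default:
		assert(0);	/* IOPRIO_CLASS_NONE can no longer reach here */
		return -1;
	}
}
```

Because the class is resolved before the lookup, an RT task with an unset
cic->ioprio is filed under row 0 in this model, matching the class that queue
initialisation would store in the cfqq.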

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06 14:40:50 -08:00
..
partitions partitions/efi.c: replace useless kzalloc's by kmalloc's 2013-04-30 08:34:25 +02:00
blk-cgroup.c blkcg: don't call into policy draining if root_blkg is already gone 2014-09-17 09:04:02 -07:00
blk-cgroup.h Update of blkg_stat and blkg_rwstat may happen in bh context. While u64_stats_fetch_retry is only preempt_disable on 32bit UP system. This is not enough to avoid preemption by bh and may read strange 64 bit value. 2013-12-11 22:36:27 -08:00
blk-core.c blktrace: fix accounting of partially completed requests 2014-05-30 21:52:11 -07:00
blk-exec.c Merge branch 'for-3.9/core' of git://git.kernel.dk/linux-block 2013-02-28 12:52:24 -08:00
blk-flush.c Block: blk-flush: Fixed indent code style 2013-03-22 12:22:51 -06:00
blk-integrity.c scatterlist: introduce sg_unmark_end 2013-03-20 15:43:04 +10:30
blk-ioc.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
blk-iopoll.c tree-wide: fix assorted typos all over the place 2009-12-04 15:39:55 +01:00
blk-lib.c block: add cond_resched() to potentially long running ioctl discard loop 2014-02-22 12:41:28 -08:00
blk-map.c block: re-use existing 'reading' variable instead of checking direction again 2011-12-21 15:27:24 +01:00
blk-merge.c scatterlist: introduce sg_unmark_end 2013-03-20 15:43:04 +10:30
blk-settings.c block: fix alignment_offset math that assumes io_min is a power-of-2 2014-11-14 08:47:55 -08:00
blk-softirq.c sched, block: Unify cache detection 2012-01-27 13:28:48 +01:00
blk-sysfs.c block: avoid using uninitialized value in from queue_var_store 2013-04-03 21:53:57 +02:00
blk-tag.c block: don't assume last put of shared tags is for the host 2014-07-31 12:53:48 -07:00
blk-throttle.c block: Rename queue dead flag 2012-12-06 14:30:58 +01:00
blk-timeout.c block: fix race between request completion and timeout handling 2013-11-29 11:11:50 -08:00
blk.h block: __elv_next_request() shouldn't call into the elevator if bypassing 2014-02-22 12:41:28 -08:00
bsg-lib.c bsg: Remove unused function bsg_goose_queue() 2012-12-06 14:33:02 +01:00
bsg.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
cfq-iosched.c cfq-iosched: fix incorrect filing of rt async cfqq 2015-03-06 14:40:50 -08:00
compat_ioctl.c block: provide compat ioctl for BLKZEROOUT 2014-07-31 12:53:48 -07:00
deadline-iosched.c elevator: Fix a race in elevator switching 2013-08-20 08:43:03 -07:00
elevator.c elevator: acquire q->sysfs_lock in elevator_change() 2013-12-08 07:29:27 -08:00
genhd.c genhd: check for int overflow in disk_expand_part_tbl() 2015-01-16 06:59:02 -08:00
ioctl.c Merge branch 'for-3.7/core' of git://git.kernel.dk/linux-block 2012-10-11 09:04:23 +09:00
Kconfig block: don't select PERCPU_RWSEM 2013-02-22 10:42:45 +01:00
Kconfig.iosched blkcg: make CONFIG_BLK_CGROUP bool 2012-03-06 21:27:21 +01:00
Makefile separate partition format handling from generic code 2012-01-03 22:54:06 -05:00
noop-iosched.c elevator: Fix a race in elevator switching 2013-08-20 08:43:03 -07:00
partition-generic.c block: Fix dev_t minor allocation lifetime 2014-10-05 14:54:12 -07:00
scsi_ioctl.c scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND 2014-11-14 08:47:59 -08:00