linux/drivers
Christian Brauner 90ee6ed776
fs: port files to file_ref
Port files to rely on file_ref reference to improve scaling and gain
overflow protection.

- We continue to WARN during get_file() in case a file that is already
  marked dead is revived as get_file() is only valid if the caller
  already holds a reference to the file. This hasn't changed just the
  check changes.

- The semantics for epoll and ttm's dmabuf usage have changed. Both
  epoll and ttm synchronize with __fput() to prevent the underlying file
  from beeing freed.

  (1) epoll

      Explaining epoll is straightforward using a simple diagram.
      Essentially, the mutex of the epoll instance needs to be taken in both
      __fput() and around epi_fget() preventing the file from being freed
      while it is polled or preventing the file from being resurrected.

          CPU1                                   CPU2
          fput(file)
          -> __fput(file)
             -> eventpoll_release(file)
                -> eventpoll_release_file(file)
                                                 mutex_lock(&ep->mtx)
                                                 epi_item_poll()
                                                 -> epi_fget()
                                                    -> file_ref_get(file)
                                                 mutex_unlock(&ep->mtx)
                   mutex_lock(&ep->mtx);
                   __ep_remove()
                   mutex_unlock(&ep->mtx);
             -> kmem_cache_free(file)

  (2) ttm dmabuf

      This explanation is a bit more involved. A regular dmabuf file stashed
      the dmabuf in file->private_data and the file in dmabuf->file:

          file->private_data = dmabuf;
          dmabuf->file = file;

      The generic release method of a dmabuf file handles file specific
      things:

          f_op->release::dma_buf_file_release()

      while the generic dentry release method of a dmabuf handles dmabuf
      freeing including driver specific things:

          dentry->d_release::dma_buf_release()

      During ttm dmabuf initialization in ttm_object_device_init() the ttm
      driver copies the provided struct dma_buf_ops into a private location:

          struct ttm_object_device {
                  spinlock_t object_lock;
                  struct dma_buf_ops ops;
                  void (*dmabuf_release)(struct dma_buf *dma_buf);
                  struct idr idr;
          };

          ttm_object_device_init(const struct dma_buf_ops *ops)
          {
                  // copy original dma_buf_ops in private location
                  tdev->ops = *ops;

                  // stash the release method of the original struct dma_buf_ops
                  tdev->dmabuf_release = tdev->ops.release;

                  // override the release method in the copy of the struct dma_buf_ops
                  // with ttm's own dmabuf release method
                  tdev->ops.release = ttm_prime_dmabuf_release;
          }

      When a new dmabuf is created the struct dma_buf_ops with the overriden
      release method set to ttm_prime_dmabuf_release is passed in exp_info.ops:

          DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
          exp_info.ops = &tdev->ops;
          exp_info.size = prime->size;
          exp_info.flags = flags;
          exp_info.priv = prime;

      The call to dma_buf_export() then sets

          mutex_lock_interruptible(&prime->mutex);
          dma_buf = dma_buf_export(&exp_info)
          {
                  dmabuf->ops = exp_info->ops;
          }
          mutex_unlock(&prime->mutex);

      which creates a new dmabuf file and then install a file descriptor to
      it in the callers file descriptor table:

          ret = dma_buf_fd(dma_buf, flags);

      When that dmabuf file is closed we now get:

          fput(file)
          -> __fput(file)
             -> f_op->release::dma_buf_file_release()
             -> dput()
                -> d_op->d_release::dma_buf_release()
                   -> dmabuf->ops->release::ttm_prime_dmabuf_release()
                      mutex_lock(&prime->mutex);
                      if (prime->dma_buf == dma_buf)
                            prime->dma_buf = NULL;
                      mutex_unlock(&prime->mutex);

      Where we can see that prime->dma_buf is set to NULL. So when we have
      the following diagram:

          CPU1                                                          CPU2
          fput(file)
          -> __fput(file)
             -> f_op->release::dma_buf_file_release()
             -> dput()
                -> d_op->d_release::dma_buf_release()
                   -> dmabuf->ops->release::ttm_prime_dmabuf_release()
                                                                        ttm_prime_handle_to_fd()
                                                                        mutex_lock_interruptible(&prime->mutex)
                                                                        dma_buf = prime->dma_buf
                                                                        dma_buf && get_dma_buf_unless_doomed(dma_buf)
                                                                        -> file_ref_get(dma_buf->file)
                                                                        mutex_unlock(&prime->mutex);

                      mutex_lock(&prime->mutex);
                      if (prime->dma_buf == dma_buf)
                            prime->dma_buf = NULL;
                      mutex_unlock(&prime->mutex);
             -> kmem_cache_free(file)

      The logic of the mechanism is the same as for epoll: sync with
      __fput() preventing the file from being freed. Here the
      synchronization happens through the ttm instance's prime->mutex.
      Basically, the lifetime of the dma_buf and the file are tighly
      coupled.

  Both (1) and (2) used to call atomic_inc_not_zero() to check whether
  the file has already been marked dead and then refuse to revive it.

  This is only safe because both (1) and (2) sync with __fput() and thus
  prevent kmem_cache_free() on the file being called and thus prevent
  the file from being immediately recycled due to SLAB_TYPESAFE_BY_RCU.

  Both (1) and (2) have been ported from atomic_inc_not_zero() to
  file_ref_get(). That means a file that is already in the process of
  being marked as FILE_REF_DEAD:

  file_ref_put()
  cnt = atomic_long_dec_return()
  -> __file_ref_put(cnt)
     if (cnt == FIlE_REF_NOREF)
             atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD)

  can be revived again:

  CPU1                                                             CPU2
  file_ref_put()
  cnt = atomic_long_dec_return()
  -> __file_ref_put(cnt)
     if (cnt == FIlE_REF_NOREF)
                                                                   file_ref_get()
                                                                   // Brings reference back to FILE_REF_ONEREF
                                                                   atomic_long_add_negative()
             atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD)

  This is fine and inherent to the file_ref_get()/file_ref_put()
  semantics. For both (1) and (2) this is safe because __fput() is
  prevented from making progress if file_ref_get() fails due to the
  aforementioned synchronization mechanisms.

  Two cases need to be considered that affect both (1) epoll and (2) ttm
  dmabuf:

   (i) fput()'s file_ref_put() and marks the file as FILE_REF_NOREF but
       before that fput() can mark the file as FILE_REF_DEAD someone
       manages to sneak in a file_ref_get() and brings the refcount back
       from FILE_REF_NOREF to FILE_REF_ONEREF. In that case the original
       fput() doesn't call __fput(). For epoll the poll will finish and
       for ttm dmabuf the file can be used again. For ttm dambuf this is
       actually an advantage because it avoids immediately allocating
       a new dmabuf object.

       CPU1                                                             CPU2
       file_ref_put()
       cnt = atomic_long_dec_return()
       -> __file_ref_put(cnt)
          if (cnt == FIlE_REF_NOREF)
                                                                        file_ref_get()
                                                                        // Brings reference back to FILE_REF_ONEREF
                                                                        atomic_long_add_negative()
                  atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD)

  (ii) fput()'s file_ref_put() marks the file FILE_REF_NOREF and
       also suceeds in actually marking it FILE_REF_DEAD and then calls
       into __fput() to free the file.

       When either (1) or (2) call file_ref_get() they fail as
       atomic_long_add_negative() will return true.

       At the same time, both (1) and (2) all file_ref_get() under
       mutexes that __fput() must also acquire preventing
       kmem_cache_free() from freeing the file.

  So while this might be treated as a change in semantics for (1) and
  (2) it really isn't. It if should end up causing issues this can be
  fixed by adding a helper that does something like:

  long cnt = atomic_long_read(&ref->refcnt);
  do {
          if (cnt < 0)
                  return false;
  } while (!atomic_long_try_cmpxchg(&ref->refcnt, &cnt, cnt + 1));
  return true;

  which would block FILE_REF_NOREF to FILE_REF_ONEREF transitions.

- Jann correctly pointed out that kmem_cache_zalloc() cannot be used
  anymore once files have been ported to file_ref_t.

  The kmem_cache_zalloc() call will memset() the whole struct file to
  zero when it is reallocated. This will also set file->f_ref to zero
  which mens that a concurrent file_ref_get() can return true:

  CPU1                            CPU2
                                  __get_file_rcu()
                                    rcu_dereference_raw()
  close()
    [frees file]
  alloc_empty_file()
    kmem_cache_zalloc()
      [reallocates same file]
      memset(..., 0, ...)
                                    file_ref_get()
                                      [increments 0->1, returns true]
    init_file()
      file_ref_init(..., 1)
        [sets to 0]
                                    rcu_dereference_raw()
                                    fput()
                                      file_ref_put()
                                        [decrements 0->FILE_REF_NOREF, frees file]
    [UAF]

   causing a concurrent __get_file_rcu() call to acquire a reference to
   the file that is about to be reallocated and immediately freeing it
   on realizing that it has been recycled. This causes a UAF for the
   task that reallocated/recycled the file.

   This is prevented by switching from kmem_cache_zalloc() to
   kmem_cache_alloc() and initializing the fields manually. With
   file->f_ref initialized last.

   Note that a memset() also isn't guaranteed to atomically update an
   unsigned long so it's theoretically possible to see torn and
   therefore bogus counter values.

Link: https://lore.kernel.org/r/20241007-brauner-file-rcuref-v2-3-387e24dc9163@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-30 09:57:43 +01:00
..
accel dma-mapping updates for linux 6.12 2024-09-19 11:12:49 +02:00
accessibility
acpi cxl fixes for 6.12-rc2 2024-10-05 10:40:16 -07:00
amba ARM: 9416/1: amba: make amba_bustype constant 2024-09-04 15:01:17 +01:00
android binder: modify the comment for binder_proc_unlock 2024-09-11 16:02:45 +02:00
ata move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
atm
auxdisplay move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
base move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
bcma PCI: Rename CRS Completion Status to RRS 2024-09-10 19:52:30 -05:00
block block-6.12-20241004 2024-10-04 10:43:44 -07:00
bluetooth Including fixes from ieee802154, bluetooth and netfilter. 2024-10-03 09:44:00 -07:00
bus Driver core update for 6.12-rc1 2024-09-27 08:48:37 -07:00
cache
cdrom
cdx
char move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
clk move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
clocksource Updates for x86 timers: 2024-09-17 15:27:01 +02:00
comedi move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
connector
counter move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
cpufreq Power management fixes for 6.12-rc2 2024-10-04 11:57:15 -07:00
cpuidle pmdomain core: 2024-09-18 10:49:45 +02:00
crypto move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
cxl move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
dax
dca
devfreq PM / devfreq: imx-bus: Use of_property_present() 2024-09-05 01:23:56 +09:00
dio
dma soc: convert ep93xx to devicetree 2024-09-26 12:00:25 -07:00
dma-buf drm next for 6.12-rc1 2024-09-19 10:18:15 +02:00
dpll
edac - Drop a now obsolete ppc4xx_edac driver 2024-09-16 06:36:37 +02:00
eisa
extcon Char/Misc and other driver changes for 6.12-rc1 2024-09-26 10:13:08 -07:00
firewire move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
firmware drm fixes for 6.12-rc2 2024-10-04 11:25:14 -07:00
fpga move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
fsi move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
gnss [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
gpio gpiolib: Fix potential NULL pointer dereference in gpiod_get_label() 2024-10-03 20:51:47 +02:00
gpu fs: port files to file_ref 2024-10-30 09:57:43 +01:00
greybus move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
hid Getting rid of asm/unaligned.h includes 2024-10-02 16:42:28 -07:00
hsi
hte
hv drm next for 6.12-rc1 2024-09-19 10:18:15 +02:00
hwmon move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
hwspinlock
hwtracing [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
i2c i2c-for-6.12-rc2 2024-10-05 10:31:04 -07:00
i3c i3c: master: svc: Fix use after free vulnerability in svc_i3c_master Driver Due to Race Condition 2024-09-17 16:51:45 +02:00
idle intel_idle: fix ACPI _CST matching for newer Xeon platforms 2024-09-25 22:30:33 +02:00
iio move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
infiniband [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
input Getting rid of asm/unaligned.h includes 2024-10-02 16:42:28 -07:00
interconnect
iommu [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
ipack
irqchip Merge tag 'irq-core-2024-09-16' into loongarch-next 2024-09-17 22:20:12 +08:00
isdn move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
leds move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
macintosh move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
mailbox mailbox, remoteproc: omap2+: fix compile testing 2024-09-27 09:11:05 -05:00
mcb
md Getting rid of asm/unaligned.h includes 2024-10-02 16:42:28 -07:00
media move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
memory
memstick move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
message SCSI misc on 20240928 2024-09-29 09:22:34 -07:00
mfd move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
misc move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
mmc move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
most
mtd move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
mux
net Including fixes from ieee802154, bluetooth and netfilter. 2024-10-03 09:44:00 -07:00
nfc move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
ntb ntb: Force physically contiguous allocation of rx ring buffers 2024-09-20 10:51:25 -04:00
nubus
nvdimm virtio: features, fixes, cleanups 2024-09-26 08:43:17 -07:00
nvme move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
nvmem Char/Misc and other driver changes for 6.12-rc1 2024-09-26 10:13:08 -07:00
of Kbuild updates for v6.12 2024-09-24 13:02:06 -07:00
opp Merge branches 'pm-sleep', 'pm-opp' and 'pm-tools' 2024-09-11 19:02:23 +02:00
parisc parisc: pdc_stable: Constify struct kobj_type 2024-09-09 08:53:17 +02:00
parport
pci move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
pcmcia move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
peci move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
perf drivers/perf: riscv: Align errno for unsupported perf event 2024-10-01 02:47:39 -07:00
phy phy-for-6.12 2024-09-23 14:05:10 -07:00
pinctrl soc: convert ep93xx to devicetree 2024-09-26 12:00:25 -07:00
platform platform-drivers-x86 for v6.12-2 2024-10-06 11:11:01 -07:00
pmdomain pmdomain: core: Reduce debug summary table width 2024-09-13 13:41:33 +02:00
pnp
power move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
powercap
pps [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
ps3
ptp move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
pwm soc: convert ep93xx to devicetree 2024-09-26 12:00:25 -07:00
rapidio
ras
regulator regulator: sm5703: Remove because it is unused and fails to build 2024-09-13 19:08:14 +01:00
remoteproc mhu-v3, omap2+ : fix kconfig dependencies 2024-09-29 09:53:04 -07:00
reset
rpmsg rpmsg: glink: Avoid -Wflex-array-member-not-at-end warnings 2024-09-13 14:09:47 -07:00
rtc move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
s390 more s390 updates for 6.12 merge window 2024-09-28 09:11:46 -07:00
sbus [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
scsi move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
sh sh: intc: Replace simple_strtoul() with kstrtoul() 2024-09-26 17:25:29 +02:00
siox
slimbus
soc move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
soundwire soundwire updates for 6.12 2024-09-23 14:00:46 -07:00
spi spi: Fixes for v6.12 2024-10-05 10:25:04 -07:00
spmi
ssb
staging move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
target move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
tc
tee optee: Fix a NULL vs IS_ERR() check 2024-09-09 12:22:06 +02:00
thermal move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
thunderbolt thunderbolt: Changes for v6.12 merge window 2024-09-11 15:17:43 +02:00
tty move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
ufs move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
uio uio: Constify struct kobj_type 2024-09-11 16:02:54 +02:00
usb move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
vdpa virtio: features, fixes, cleanups 2024-09-26 08:43:17 -07:00
vfio [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
vhost move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
video move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
virt [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
virtio virtio: features, fixes, cleanups 2024-09-26 08:43:17 -07:00
w1 w1: ds2482: Drop explicit initialization of struct i2c_device_id::driver_data to 0 2024-09-06 19:18:32 +02:00
watchdog move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
xen xen: Fix config option reference in XEN_PRIVCMD definition 2024-10-02 16:14:30 +02:00
zorro
Kconfig
Makefile leds: Init leds class earlier 2024-09-04 17:24:58 -05:00