mirror of
https://github.com/torvalds/linux.git
synced 2026-05-14 17:32:42 +02:00
Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let
bpf program serialize access to other variables.
Example:
struct hash_elem {
	int cnt;
	struct bpf_spin_lock lock;
};
struct hash_elem *val = bpf_map_lookup_elem(&hash_map, &key);
if (val) {
	bpf_spin_lock(&val->lock);
	val->cnt++;
	bpf_spin_unlock(&val->lock);
}
Restrictions and safety checks:
- bpf_spin_lock is only allowed inside HASH and ARRAY maps.
- BTF description of the map is mandatory for safety analysis.
- bpf program can take one bpf_spin_lock at a time, since two or more can
  cause deadlocks.
- only one 'struct bpf_spin_lock' is allowed per map element.
  It drastically simplifies implementation yet allows bpf program to use
  any number of bpf_spin_locks.
- while bpf_spin_lock is held, calls (either bpf2bpf or helpers) are not
  allowed, other than bpf_spin_unlock() itself.
- bpf program must call bpf_spin_unlock() before returning.
- bpf program can access 'struct bpf_spin_lock' only via
bpf_spin_lock()/bpf_spin_unlock() helpers.
- load/store into 'struct bpf_spin_lock lock;' field is not allowed.
- to use bpf_spin_lock() helper the BTF description of map value must be
  a struct and have 'struct bpf_spin_lock anyname;' field at the top level.
  Nested lock inside another struct is not allowed.
- syscall map_lookup doesn't copy bpf_spin_lock field to user space.
- syscall map_update and program map_update do not update bpf_spin_lock field.
- bpf_spin_lock cannot be on the stack or inside networking packet.
bpf_spin_lock can only be inside HASH or ARRAY map value.
- bpf_spin_lock is available to root only and to all program types except
  the tracing and socket filter progs noted below.
- bpf_spin_lock is not allowed in inner maps of map-in-map.
- ld_abs is not allowed inside spin_lock-ed region.
- tracing progs and socket filter progs cannot use bpf_spin_lock due to
  insufficient preemption checks.
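
The map_lookup and map_update restrictions above amount to copying a map value around the lock field rather than over it. The following is a minimal user-space sketch of that idea, not the kernel code from this commit: 'struct sketch_spin_lock' and sketch_copy_value_skip_lock() are hypothetical names, and the lock is assumed to be a single 32-bit word at a known offset inside the value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for 'struct bpf_spin_lock' (assumed to be one 32-bit word). */
struct sketch_spin_lock {
	uint32_t val;
};

/* Same layout as the example map value in the commit message. */
struct hash_elem {
	int cnt;
	struct sketch_spin_lock lock;
};

/*
 * Copy a map value from src to dst, leaving the lock bytes in dst untouched.
 * lock_off is the byte offset of the lock field inside the value.
 */
static void sketch_copy_value_skip_lock(void *dst, const void *src,
					size_t value_size, size_t lock_off)
{
	const size_t lock_size = sizeof(struct sketch_spin_lock);
	size_t tail_off;

	/* The lock must lie entirely inside the value. */
	assert(lock_off + lock_size <= value_size);
	tail_off = lock_off + lock_size;

	/* Bytes before the lock. */
	memcpy(dst, src, lock_off);
	/* Bytes after the lock. */
	memcpy((char *)dst + tail_off, (const char *)src + tail_off,
	       value_size - tail_off);
}
```

With this shape, a held lock in the source never leaks into the destination, and an update cannot overwrite the destination's lock state.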
Implementation details:
- cgroup-bpf class of programs can nest with xdp/tc programs.
  Hence bpf_spin_lock is equivalent to spin_lock_irqsave.
  Other solutions to avoid nested bpf_spin_lock are possible, like making
  sure that all networking progs run with softirq disabled.
  spin_lock_irqsave is the simplest and doesn't add overhead to the
  programs that don't use it.
- arch_spinlock_t is used when it is implemented as queued_spin_lock
- archs can force their own arch_spinlock_t
- on architectures where queued_spin_lock is not available and
  sizeof(arch_spinlock_t) != sizeof(__u32), a trivial lock is used.
- presence of bpf_spin_lock inside map value could have been indicated via
  extra flag during map_create, but specifying it via BTF is cleaner.
  It provides introspection for map key/value and reduces user mistakes.
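
The "trivial lock" fallback mentioned above can be pictured as a test-and-set spin lock on one 32-bit word. The sketch below uses C11 atomics in user space and is an illustration only: 'struct sketch_trivial_lock' and both function names are hypothetical, and the interrupt disabling implied by spin_lock_irqsave has no user-space equivalent, so it is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical trivial lock: 0 means free, 1 means held. */
struct sketch_trivial_lock {
	_Atomic uint32_t val;
};

static void sketch_trivial_lock_acquire(struct sketch_trivial_lock *l)
{
	for (;;) {
		/* Spin on plain loads until the lock looks free. */
		while (atomic_load_explicit(&l->val, memory_order_relaxed) != 0)
			;
		/* Try to claim it; a previous value of 0 means we now own it. */
		if (atomic_exchange_explicit(&l->val, 1, memory_order_acquire) == 0)
			return;
	}
}

static void sketch_trivial_lock_release(struct sketch_trivial_lock *l)
{
	/* Releasing a lock that is not held is a caller bug. */
	assert(atomic_load_explicit(&l->val, memory_order_relaxed) == 1);
	atomic_store_explicit(&l->val, 0, memory_order_release);
}
```

Spinning on loads before attempting the exchange keeps waiters from issuing a write on every iteration; a queued lock additionally gives fairness, which this sketch does not attempt.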
Next steps:
- allow bpf_spin_lock in other map types (like cgroup local storage)
- introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
  to request kernel to grab bpf_spin_lock before rewriting the value.
  That will serialize access to map elements.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| Name | | |
|---|---|---|
| 6lowpan | ||
| 9p | ||
| 802 | ||
| 8021q | ||
| appletalk | ||
| atm | ||
| ax25 | ||
| batman-adv | ||
| bluetooth | ||
| bpf | ||
| bpfilter | ||
| bridge | ||
| caif | ||
| can | ||
| ceph | ||
| core | ||
| dcb | ||
| dccp | ||
| decnet | ||
| dns_resolver | ||
| dsa | ||
| ethernet | ||
| hsr | ||
| ieee802154 | ||
| ife | ||
| ipv4 | ||
| ipv6 | ||
| iucv | ||
| kcm | ||
| key | ||
| l2tp | ||
| l3mdev | ||
| lapb | ||
| llc | ||
| mac80211 | ||
| mac802154 | ||
| mpls | ||
| ncsi | ||
| netfilter | ||
| netlabel | ||
| netlink | ||
| netrom | ||
| nfc | ||
| nsh | ||
| openvswitch | ||
| packet | ||
| phonet | ||
| psample | ||
| qrtr | ||
| rds | ||
| rfkill | ||
| rose | ||
| rxrpc | ||
| sched | ||
| sctp | ||
| smc | ||
| strparser | ||
| sunrpc | ||
| switchdev | ||
| tipc | ||
| tls | ||
| unix | ||
| vmw_vsock | ||
| wimax | ||
| wireless | ||
| x25 | ||
| xdp | ||
| xfrm | ||
| compat.c | ||
| Kconfig | ||
| Makefile | ||
| socket.c | ||
| sysctl_net.c | ||