There are a lot of different structures that need to have a "frozen" abi
for the next 5+ years. Add padding to a lot of them in order to be able
to handle any future changes that might be needed due to LTS and
security fixes that might come up.
It's a best guess, based on what has happened in the past from the
5.4.0..5.4.129 release (1 1/2 years). Yes, past changes do not mean
that future changes will also be needed in the same area, but that is a
hint that those areas are both well maintained and looked after, and
there have been previous problems found in them.
Also the list of structures that are being required based on OEM usage
in the android/ symbol lists were consulted as that's a larger list than
what has been changed in the past.
Hopefully we caught everything we need to worry about, only time will
tell...
Bug: 151154716
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I880bbcda0628a7459988eeb49d18655522697664
This reverts commit 1aff922933 as we are
free to update the ABI at this point in time.
Bug: 161946584
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Iba293953586fd78503ef5f47db40ccdc03b098f5
This reverts commit 0224432a8f as it
breaks the abi and we can't allow that at this point in time.
Bug: 161946584
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I21e6c33ac7011d8489e766fd9b5e32bd528dc456
Changes in 5.10.30
xfrm/compat: Cleanup WARN()s that can be user-triggered
ALSA: aloop: Fix initialization of controls
ALSA: hda/realtek: Fix speaker amp setup on Acer Aspire E1
ALSA: hda/conexant: Apply quirk for another HP ZBook G5 model
ASoC: intel: atom: Stop advertising non working S24LE support
nfc: fix refcount leak in llcp_sock_bind()
nfc: fix refcount leak in llcp_sock_connect()
nfc: fix memory leak in llcp_sock_connect()
nfc: Avoid endless loops caused by repeated llcp_sock_connect()
selinux: make nslot handling in avtab more robust
selinux: fix cond_list corruption when changing booleans
selinux: fix race between old and new sidtab
xen/evtchn: Change irq_info lock to raw_spinlock_t
net: ipv6: check for validity before dereferencing cfg->fc_nlinfo.nlh
net: dsa: lantiq_gswip: Let GSWIP automatically set the xMII clock
net: dsa: lantiq_gswip: Don't use PHY auto polling
net: dsa: lantiq_gswip: Configure all remaining GSWIP_MII_CFG bits
drm/i915: Fix invalid access to ACPI _DSM objects
ACPI: processor: Fix build when CONFIG_ACPI_PROCESSOR=m
IB/hfi1: Fix probe time panic when AIP is enabled with a buggy BIOS
LOOKUP_MOUNTPOINT: we are cleaning "jumped" flag too late
gcov: re-fix clang-11+ support
ia64: fix user_stack_pointer() for ptrace()
nds32: flush_dcache_page: use page_mapping_file to avoid races with swapoff
ocfs2: fix deadlock between setattr and dio_end_io_write
fs: direct-io: fix missing sdio->boundary
ethtool: fix incorrect datatype in set_eee ops
of: property: fw_devlink: do not link ".*,nr-gpios"
parisc: parisc-agp requires SBA IOMMU driver
parisc: avoid a warning on u8 cast for cmpxchg on u8 pointers
ARM: dts: turris-omnia: configure LED[2]/INTn pin as interrupt pin
batman-adv: initialize "struct batadv_tvlv_tt_vlan_data"->reserved field
ice: Continue probe on link/PHY errors
ice: Increase control queue timeout
ice: prevent ice_open and ice_stop during reset
ice: fix memory allocation call
ice: remove DCBNL_DEVRESET bit from PF state
ice: Fix for dereference of NULL pointer
ice: Use port number instead of PF ID for WoL
ice: Cleanup fltr list in case of allocation issues
iwlwifi: pcie: properly set LTR workarounds on 22000 devices
ice: fix memory leak of aRFS after resuming from suspend
net: hso: fix null-ptr-deref during tty device unregistration
libbpf: Fix bail out from 'ringbuf_process_ring()' on error
bpf: Enforce that struct_ops programs be GPL-only
bpf: link: Refuse non-O_RDWR flags in BPF_OBJ_GET
ethernet/netronome/nfp: Fix a use after free in nfp_bpf_ctrl_msg_rx
libbpf: Ensure umem pointer is non-NULL before dereferencing
libbpf: Restore umem state after socket create failure
libbpf: Only create rx and tx XDP rings when necessary
bpf: Refcount task stack in bpf_get_task_stack
bpf, sockmap: Fix sk->prot unhash op reset
bpf, sockmap: Fix incorrect fwd_alloc accounting
net: ensure mac header is set in virtio_net_hdr_to_skb()
i40e: Fix sparse warning: missing error code 'err'
i40e: Fix sparse error: 'vsi->netdev' could be null
i40e: Fix sparse error: uninitialized symbol 'ring'
i40e: Fix sparse errors in i40e_txrx.c
vdpa/mlx5: Fix suspend/resume index restoration
net: sched: sch_teql: fix null-pointer dereference
net: sched: fix action overwrite reference counting
nl80211: fix beacon head validation
nl80211: fix potential leak of ACL params
cfg80211: check S1G beacon compat element length
mac80211: fix time-is-after bug in mlme
mac80211: fix TXQ AC confusion
net: hsr: Reset MAC header for Tx path
net-ipv6: bugfix - raw & sctp - switch to ipv6_can_nonlocal_bind()
net: let skb_orphan_partial wake-up waiters.
thunderbolt: Fix a leak in tb_retimer_add()
thunderbolt: Fix off by one in tb_port_find_retimer()
usbip: add sysfs_lock to synchronize sysfs code paths
usbip: stub-dev synchronize sysfs code paths
usbip: vudc synchronize sysfs code paths
usbip: synchronize event handler with sysfs code paths
driver core: Fix locking bug in deferred_probe_timeout_work_func()
scsi: pm80xx: Fix chip initialization failure
scsi: target: iscsi: Fix zero tag inside a trace event
percpu: make pcpu_nr_empty_pop_pages per chunk type
i2c: turn recovery error on init to debug
KVM: x86/mmu: change TDP MMU yield function returns to match cond_resched
KVM: x86/mmu: Merge flush and non-flush tdp_mmu_iter_cond_resched
KVM: x86/mmu: Rename goal_gfn to next_last_level_gfn
KVM: x86/mmu: Ensure forward progress when yielding in TDP MMU iter
KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed
KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap
KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping
KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages
KVM: x86/mmu: preserve pending TLB flush across calls to kvm_tdp_mmu_zap_sp
net: sched: fix err handler in tcf_action_init()
ice: Refactor DCB related variables out of the ice_port_info struct
ice: Recognize 860 as iSCSI port in CEE mode
xfrm: interface: fix ipv4 pmtu check to honor ip header df
xfrm: Use actual socket sk instead of skb socket for xfrm_output_resume
remoteproc: qcom: pil_info: avoid 64-bit division
regulator: bd9571mwv: Fix AVS and DVFS voltage range
ARM: OMAP4: Fix PMIC voltage domains for bionic
ARM: OMAP4: PM: update ROM return address for OSWR and OFF
net: xfrm: Localize sequence counter per network namespace
esp: delete NETIF_F_SCTP_CRC bit from features for esp offload
ASoC: SOF: Intel: HDA: fix core status verification
ASoC: wm8960: Fix wrong bclk and lrclk with pll enabled for some chips
xfrm: Fix NULL pointer dereference on policy lookup
virtchnl: Fix layout of RSS structures
i40e: Added Asym_Pause to supported link modes
i40e: Fix kernel oops when i40e driver removes VF's
hostfs: fix memory handling in follow_link()
amd-xgbe: Update DMA coherency values
vxlan: do not modify the shared tunnel info when PMTU triggers an ICMP reply
geneve: do not modify the shared tunnel info when PMTU triggers an ICMP reply
sch_red: fix off-by-one checks in red_check_params()
drivers/net/wan/hdlc_fr: Fix a double free in pvc_xmit
arm64: dts: imx8mm/q: Fix pad control of SD1_DATA0
xfrm: Provide private skb extensions for segmented and hw offloaded ESP packets
can: bcm/raw: fix msg_namelen values depending on CAN_REQUIRED_SIZE
can: isotp: fix msg_namelen values depending on CAN_REQUIRED_SIZE
mlxsw: spectrum: Fix ECN marking in tunnel decapsulation
ethernet: myri10ge: Fix a use after free in myri10ge_sw_tso
gianfar: Handle error code at MAC address change
net: dsa: Fix type was not set for devlink port
cxgb4: avoid collecting SGE_QBASE regs during traffic
net:tipc: Fix a double free in tipc_sk_mcast_rcv
ARM: dts: imx6: pbab01: Set vmmc supply for both SD interfaces
net/ncsi: Avoid channel_monitor hrtimer deadlock
net: qrtr: Fix memory leak on qrtr_tx_wait failure
nfp: flower: ignore duplicate merge hints from FW
net: phy: broadcom: Only advertise EEE for supported modes
I2C: JZ4780: Fix bug for Ingenic X1000.
ASoC: sunxi: sun4i-codec: fill ASoC card owner
net/mlx5e: Fix mapping of ct_label zero
net/mlx5e: Fix ethtool indication of connector type
net/mlx5: Don't request more than supported EQs
net/rds: Fix a use after free in rds_message_map_pages
xdp: fix xdp_return_frame() kernel BUG throw for page_pool memory model
soc/fsl: qbman: fix conflicting alignment attributes
i40e: Fix display statistics for veb_tc
RDMA/rtrs-clt: Close rtrs client conn before destroying rtrs clt session files
drm/msm: Set drvdata to NULL when msm_drm_init() fails
net: udp: Add support for getsockopt(..., ..., UDP_GRO, ..., ...);
mptcp: forbit mcast-related sockopt on MPTCP sockets
scsi: ufs: core: Fix task management request completion timeout
scsi: ufs: core: Fix wrong Task Tag used in task management request UPIUs
net: cls_api: Fix uninitialised struct field bo->unlocked_driver_cb
net: macb: restore cmp registers on resume path
clk: fix invalid usage of list cursor in register
clk: fix invalid usage of list cursor in unregister
workqueue: Move the position of debug_work_activate() in __queue_work()
s390/cpcmd: fix inline assembly register clobbering
perf inject: Fix repipe usage
net: openvswitch: conntrack: simplify the return expression of ovs_ct_limit_get_default_limit()
openvswitch: fix send of uninitialized stack memory in ct limit reply
i2c: designware: Adjust bus_freq_hz when refuse high speed mode set
iwlwifi: fix 11ax disabled bit in the regulatory capability flags
can: mcp251x: fix support for half duplex SPI host controllers
tipc: increment the tmp aead refcnt before attaching it
net: hns3: clear VF down state bit before request link status
net/mlx5: Fix placement of log_max_flow_counter
net/mlx5: Fix PPLM register mapping
net/mlx5: Fix PBMC register mapping
RDMA/cxgb4: check for ipv6 address properly while destroying listener
perf report: Fix wrong LBR block sorting
RDMA/qedr: Fix kernel panic when trying to access recv_cq
drm/vc4: crtc: Reduce PV fifo threshold on hvs4
i40e: Fix parameters in aq_get_phy_register()
RDMA/addr: Be strict with gid size
vdpa/mlx5: should exclude header length and fcs from mtu
vdpa/mlx5: Fix wrong use of bit numbers
RAS/CEC: Correct ce_add_elem()'s returned values
clk: socfpga: fix iomem pointer cast on 64-bit
lockdep: Address clang -Wformat warning printing for %hd
dt-bindings: net: ethernet-controller: fix typo in NVMEM
net: sched: bump refcount for new action in ACT replace mode
gpiolib: Read "gpio-line-names" from a firmware node
cfg80211: remove WARN_ON() in cfg80211_sme_connect
net: tun: set tun->dev->addr_len during TUNSETLINK processing
drivers: net: fix memory leak in atusb_probe
drivers: net: fix memory leak in peak_usb_create_dev
net: mac802154: Fix general protection fault
net: ieee802154: nl-mac: fix check on panid
net: ieee802154: fix nl802154 del llsec key
net: ieee802154: fix nl802154 del llsec dev
net: ieee802154: fix nl802154 add llsec key
net: ieee802154: fix nl802154 del llsec devkey
net: ieee802154: forbid monitor for set llsec params
net: ieee802154: forbid monitor for del llsec seclevel
net: ieee802154: stop dump llsec params for monitors
Revert "net: sched: bump refcount for new action in ACT replace mode"
Linux 5.10.30
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie8754a2e4dfef03bf1f2b878843cde19a4adab21
[ Upstream commit e88add19f6 ]
A sequence counter write section must be serialized or its internal
state can get corrupted. The "xfrm_state_hash_generation" seqcount is
global, but its write serialization lock (net->xfrm.xfrm_state_lock) is
instantiated per network namespace. The write protection is thus
insufficient.
To provide full protection, localize the sequence counter per network
namespace instead. This should be safe as both the seqcount read and
write sections access data exclusively within the network namespace. It
also lays the foundation for transforming "xfrm_state_hash_generation"
data type from seqcount_t to seqcount_LOCKNAME_t in further commits.
Fixes: b65e3d7be0 ("xfrm: state: add sequence count to detect hash resizes")
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
and associated inet_is_local_unbindable_port() helper function:
use it to make explicitly binding to an unbindable port return
-EPERM 'Operation not permitted'.
Autobind doesn't honour this new sysctl since:
(a) you can simply set both if that's the behaviour you desire
(b) there could be a use for preventing explicit while allowing auto
(c) it's faster in the relatively critical path of doing port selection
during connect() to only check one bitmap instead of both
Various ports may have special use cases which are not suitable for
use by general userspace applications. Currently, ports specified in
ip_local_reserved_ports sysctl will not be returned only in case of
automatic port assignment, but nothing prevents you from explicitly
binding to them - even from an entirely unprivileged process.
In certain cases it is desirable to prevent the host from assigning the
ports even in case of explicit binds, even from superuser processes.
Example use cases might be:
- a port being stolen by the nic for remote serial console, remote
power management or some other sort of debugging functionality
(crash collection, gdb, direct access to some other microcontroller
on the nic or motherboard, remote management of the nic itself).
- a transparent proxy where packets are being redirected: in case
a socket matches this connection, packets from this application
would be incorrectly sent to one of the endpoints.
Initially I wanted to solve this problem via the simple one line:
static inline bool inet_port_requires_bind_service(struct net *net, unsigned short port) {
- return port < net->ipv4.sysctl_ip_prot_sock;
+ return port < net->ipv4.sysctl_ip_prot_sock || inet_is_local_reserved_port(net, port);
}
However, this doesn't work for two reasons:
(a) it changes userspace visible behaviour of the existing local
reserved ports sysctl, and there appears to be enough documentation
on the internet talking about setting it to make this a bad idea
(b) it doesn't prevent privileged apps from using these ports,
CAP_BIND_SERVICE is relatively likely to be available to, for example,
a recursive DNS server so it can listed on port 53, which also needs
to do src port randomization for outgoing queries due to security
reasons (and it thus does manual port binding).
If we *know* that certain ports are simply unusable, then it's better
nothing even gets the opportunity to try to use them. This way we at
least get a quick failure, instead of some sort of timeout (or possibly
even corruption of the data stream of the non-kernel based use case).
Test:
vm:~# cat /proc/sys/net/ipv4/ip_local_unbindable_ports
vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM, 0); s.bind(("::", 3967))'
vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, 0); s.bind(("::", 3967))'
vm:~# echo 3967 > /proc/sys/net/ipv4/ip_local_unbindable_ports
vm:~# cat /proc/sys/net/ipv4/ip_local_unbindable_ports
3967
vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM, 0); s.bind(("::", 3967))'
socket.error: (1, 'Operation not permitted')
vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, 0); s.bind(("::", 3967))'
socket.error: (1, 'Operation not permitted')
Cc: Sean Tranchetti <stranche@codeaurora.org>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Linux SCTP <linux-sctp@vger.kernel.org>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Bug: 140404597
Change-Id: Ie96207bea90ae1345adf7b45724d0caf4d6e52c2
Signed-off-by: Todd Kjos <tkjos@google.com>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Two minor conflicts:
1) net/ipv4/route.c, adding a new local variable while
moving another local variable and removing it's
initial assignment.
2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
One pretty prints the port mode differently, whilst another
changes the driver to try and obtain the port mode from
the port node rather than the switch node.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the only listener of the nexthop notification chain is the
VXLAN driver. Subsequent patches will add more listeners (e.g., device
drivers such as netdevsim) that need to be able to block when processing
notifications.
Therefore, convert the notification chain to a blocking one. This is
safe as notifications are always emitted from process context.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit adds a new TCP feature to reflect the tos value received in
SYN, and send it out on the SYN-ACK, and eventually set the tos value of
the established socket with this reflected tos value. This provides a
way to set the traffic class/QoS level for all traffic in the same
connection to be the same as the incoming SYN request. It could be
useful in data centers to provide equivalent QoS according to the
incoming request.
This feature is guarded by /proc/sys/net/ipv4/tcp_reflect_tos, and is by
default turned off.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On x86_64, each notification results in one skbuff allocation which
consumes at least 768 bytes due to the skbuff overhead.
This patch coalesces several notifications into one single skbuff, so
each notification consumes at least ~211 bytes, that ~3.5 times less
memory consumption. As a result, this is reducing the chances to exhaust
the netlink socket receive buffer.
Rule of thumb is that each notification batch only contains netlink
messages whose report flag is the same, nfnetlink_send() requires this
to do appropriate delivery to userspace, either via unicast (echo
mode) or multicast (monitor mode).
The skbuff control buffer is used to annotate the report flag for later
handling at the new coalescing routine.
The batch skbuff notification size is NLMSG_GOODSIZE, using a larger
skbuff would allow for more socket receiver buffer savings (to amortize
the cost of the skbuff even more), however, going over that size might
break userspace applications, so let's be conservative and stick to
NLMSG_GOODSIZE.
Reported-by: Phil Sutter <phil@nwl.cc>
Acked-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
To support multi-prog link-based attachments for new netns attach types, we
need to keep track of more than one bpf_link per attach type. Hence,
convert net->bpf.links into a list, that currently can be either empty or
have just one item.
Instead of reusing bpf_prog_list from bpf-cgroup, we link together
bpf_netns_link's themselves. This makes list management simpler as we don't
have to allocate, initialize, and later release list elements. We can do
this because multi-prog attachment will be available only for bpf_link, and
we don't need to build a list of programs attached directly and indirectly
via links.
No functional changes intended.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200625141357.910330-4-jakub@cloudflare.com
Prepare for having multi-prog attachments for new netns attach types by
storing programs to run in a bpf_prog_array, which is well suited for
iterating over programs and running them in sequence.
After this change bpf(PROG_QUERY) may block to allocate memory in
bpf_prog_array_copy_to_user() for collected program IDs. This forces a
change in how we protect access to the attached program in the query
callback. Because bpf_prog_array_copy_to_user() can sleep, we switch from
an RCU read lock to holding a mutex that serializes updaters.
Because we allow only one BPF flow_dissector program to be attached to
netns at all times, the bpf_prog_array pointed by net->bpf.run_array is
always either detached (null) or one element long.
No functional changes intended.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200625141357.910330-3-jakub@cloudflare.com
Extend bpf() syscall subcommands that operate on bpf_link, that is
LINK_CREATE, LINK_UPDATE, OBJ_GET_INFO, to accept attach types tied to
network namespaces (only flow dissector at the moment).
Link-based and prog-based attachment can be used interchangeably, but only
one can exist at a time. Attempts to attach a link when a prog is already
attached directly, and the other way around, will be met with -EEXIST.
Attempts to detach a program when link exists result in -EINVAL.
Attachment of multiple links of same attach type to one netns is not
supported with the intention to lift the restriction when a use-case
presents itself. Because of that link create returns -E2BIG when trying to
create another netns link, when one already exists.
Link-based attachments to netns don't keep a netns alive by holding a ref
to it. Instead links get auto-detached from netns when the latter is being
destroyed, using a pernet pre_exit callback.
When auto-detached, link lives in defunct state as long there are open FDs
for it. -ENOLINK is returned if a user tries to update a defunct link.
Because bpf_link to netns doesn't hold a ref to struct net, special care is
taken when releasing, updating, or filling link info. The netns might be
getting torn down when any of these link operations are in progress. That
is why auto-detach and update/release/fill_info are synchronized by the
same mutex. Also, link ops have to always check if auto-detach has not
happened yet and if netns is still alive (refcnt > 0).
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-5-jakub@cloudflare.com
In order to:
(1) attach more than one BPF program type to netns, or
(2) support attaching BPF programs to netns with bpf_link, or
(3) support multi-prog attach points for netns
we will need to keep more state per netns than a single pointer like we
have now for BPF flow dissector program.
Prepare for the above by extracting netns_bpf that is part of struct net,
for storing all state related to BPF programs attached to netns.
Turn flow dissector callbacks for querying/attaching/detaching a program
into generic ones that operate on netns_bpf. Next patch will move the
generic callbacks into their own module.
This is similar to how it is organized for cgroup with cgroup_bpf.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-3-jakub@cloudflare.com
This patch adds nexthop add/del notifiers. To be used by
vxlan driver in a later patch. Could possibly be used by
switchdev drivers in the future.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a sysctl to control hrtimer slack, default of 100 usec.
This gives the opportunity to reduce system overhead,
and help very short RTT flows.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current route nexthop API maintains user space compatibility
with old route API by default. Dumps and netlink notifications
support both new and old API format. In systems which have
moved to the new API, this compatibility mode cancels some
of the performance benefits provided by the new nexthop API.
This patch adds new sysctl nexthop_compat_mode which is on
by default but provides the ability to turn off compatibility
mode allowing systems to run entirely with the new routing
API. Old route API behaviour and support is not modified by this
sysctl.
Uses a single sysctl to cover both ipv4 and ipv6 following
other sysctls. Covers dumps and delete notifications as
suggested by David Ahern.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Exported via same /proc file as the Linux TCP MIB counters, so "netstat -s"
or "nstat" will show them automatically.
The MPTCP MIB counters are allocated in a distinct pcpu area in order to
avoid bloating/wasting TCP pcpu memory.
Counters are allocated once the first MPTCP socket is created in a
network namespace and free'd on exit.
If no sockets have been allocated, all-zero mptcp counters are shown.
The MIB counter list is taken from the multipath-tcp.org kernel, but
only a few counters have been picked up so far. The counter list can
be increased at any time later on.
v2 -> v3:
- remove 'inline' in foo.c files (David S. Miller)
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit aacd9289af ("tcp: bind() use stronger
condition for bind_conflict") introduced a restriction to forbid to bind
SO_REUSEADDR enabled sockets to the same (addr, port) tuple in order to
assign ports dispersedly so that we can connect to the same remote host.
The change results in accelerating port depletion so that we fail to bind
sockets to the same local port even if we want to connect to the different
remote hosts.
You can reproduce this issue by following instructions below.
1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
2. set SO_REUSEADDR to two sockets.
3. bind two sockets to (localhost, 0) and the latter fails.
Therefore, when ephemeral ports are exhausted, bind(0) should fallback to
the legacy behaviour to enable the SO_REUSEADDR option and make it possible
to connect to different remote (addr, port) tuples.
This patch allows us to bind SO_REUSEADDR enabled sockets to the same
(addr, port) only when net.ipv4.ip_autobind_reuse is set 1 and all
ephemeral ports are exhausted. This also allows connect() and listen() to
share ports in the following way and may break some applications. So the
ip_autobind_reuse is 0 by default and disables the feature.
1. setsockopt(sk1, SO_REUSEADDR)
2. setsockopt(sk2, SO_REUSEADDR)
3. bind(sk1, saddr, 0)
4. bind(sk2, saddr, 0)
5. connect(sk1, daddr)
6. listen(sk2)
If it is set 1, we can fully utilize the 4-tuples, but we should use
IP_BIND_ADDRESS_NO_PORT for bind()+connect() as possible.
The notable thing is that if all sockets bound to the same port have
both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
ephemeral port and also do listen().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces a list of pending module requests. This new module
list is composed of nft_module_request objects that contain the module
name and one status field that tells if the module has been already
loaded (the 'done' field).
In the first pass, from the preparation phase, the netlink command finds
that a module is missing on this list. Then, a module request is
allocated and added to this list and nft_request_module() returns
-EAGAIN. This triggers the abort path with the autoload parameter set on
from nfnetlink, request_module() is called and the module request enters
the 'done' state. Since the mutex is released when loading modules from
the abort phase, the module list is zapped so this is iteration occurs
over a local list. Therefore, the request_module() calls happen when
object lists are in consistent state (after fulling aborting the
transaction) and the commit list is empty.
On the second pass, the netlink command will find that it already tried
to load the module, so it does not request it again and
nft_request_module() returns 0. Then, there is a look up to find the
object that the command was missing. If the module was successfully
loaded, the command proceeds normally since it finds the missing object
in place, otherwise -ENOENT is reported to userspace.
This patch also updates nfnetlink to include the reason to enter the
abort phase, which is required for this new autoload module rationale.
Fixes: ec7470b834 ("netfilter: nf_tables: store transaction list locally while requesting module")
Reported-by: syzbot+29125d208b3dae9a7019@syzkaller.appspotmail.com
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch introduces a sysctl knob "net.ipv4.tcp_no_ssthresh_metrics_save"
that disables TCP ssthresh metrics cache by default. Other parts of TCP
metrics cache, e.g. rtt, cwnd, remain unchanged.
As modern networks becoming more and more dynamic, TCP metrics cache
today often causes more harm than benefits. For example, the same IP
address is often shared by different subscribers behind NAT in residential
networks. Even if the IP address is not shared by different users,
caching the slow-start threshold of a previous short flow using loss-based
congestion control (e.g. cubic) often causes the future longer flows of
the same network path to exit slow-start prematurely with abysmal
throughput.
Caching ssthresh is very risky and can lead to terrible performance.
Therefore it makes sense to make disabling ssthresh caching by
default and opt-in for specific networks by the administrators.
This practice also has worked well for several years of deployment with
CUBIC congestion control at Google.
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use a per namespace counter, increment it on successful creation
of any route using the source address, decrement it on deletion
of such routes.
This allows us to check easily if the routing decision in the
current namespace depends on the packet source. Will be used
by the next patch.
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a new feature defined in section 5 of rfc7829: "Primary Path
Switchover". By introducing a new tunable parameter:
Primary.Switchover.Max.Retrans (PSMR)
The primary path will be changed to another active path when the path
error counter on the old primary path exceeds PSMR, so that "the SCTP
sender is allowed to continue data transmission on a new working path
even when the old primary destination address becomes active again".
This patch is to add this tunable parameter, 'ps_retrans' per netns,
sock, asoc and transport. It also allows a user to change ps_retrans
per netns by sysctl, and ps_retrans per sock/asoc/transport will be
initialized with it.
The check will be done in sctp_do_8_2_transport_strike() when this
feature is enabled.
Note this feature is disabled by initializing 'ps_retrans' per netns
as 0xffff by default, and its value can't be less than 'pf_retrans'
when changing by sysctl.
v3->v4:
- add define SCTP_PS_RETRANS_MAX 0xffff, and use it on extra2 of
sysctl 'ps_retrans'.
- add a new entry for ps_retrans on ip-sysctl.txt.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As said in rfc7829, section 3, point 12:
The SCTP stack SHOULD expose the PF state of its destination
addresses to the ULP as well as provide the means to notify the
ULP of state transitions of its destination addresses from
active to PF, and vice versa. However, it is recommended that
an SCTP stack implementing SCTP-PF also allows for the ULP to be
kept ignorant of the PF state of its destinations and the
associated state transitions, thus allowing for retention of the
simpler state transition model of [RFC4960] in the ULP.
Not only does it allow to expose the PF state to ULP, but also
allow to ignore sctp-pf to ULP.
So this patch is to add pf_expose per netns, sock and asoc. And in
sctp_assoc_control_transport(), ulp_notify will be set to false if
asoc->expose is not 'enabled' in next patch.
It also allows a user to change pf_expose per netns by sysctl, and
pf_expose per sock and asoc will be initialized with it.
Note that pf_expose also works for SCTP_GET_PEER_ADDR_INFO sockopt,
to not allow a user to query the state of a sctp-pf peer address
when pf_expose is 'disabled', as said in section 7.3.
v1->v2:
- Fix a build warning noticed by Nathan Chancellor.
v2->v3:
- set pf_expose to UNUSED by default to keep compatible with old
applications.
v3->v4:
- add a new entry for pf_expose on ip-sysctl.txt, as Marcelo suggested.
- change this patch to 1/5, and move sctp_assoc_control_transport
change into 2/5, as Marcelo suggested.
- use SCTP_PF_EXPOSE_UNSET instead of SCTP_PF_EXPOSE_UNUSED, and
set SCTP_PF_EXPOSE_UNSET to 0 in enum, as Marcelo suggested.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a skeleton structure for adding TLS statistics.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch improves the code reability by removing the redundant "can_"
prefix from the members of struct netns_can (as the struct netns_can itself
is the member "can" of the struct net.)
The conversion is done with:
sed -i \
-e "s/struct can_dev_rcv_lists \*can_rx_alldev_list;/struct can_dev_rcv_lists *rx_alldev_list;/" \
-e "s/spinlock_t can_rcvlists_lock;/spinlock_t rcvlists_lock;/" \
-e "s/struct timer_list can_stattimer;/struct timer_list stattimer; /" \
-e "s/can\.can_rx_alldev_list/can.rx_alldev_list/g" \
-e "s/can\.can_rcvlists_lock/can.rcvlists_lock/g" \
-e "s/can\.can_stattimer/can.stattimer/g" \
include/net/netns/can.h \
net/can/*.[ch]
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch gives the members of the struct netns_can that are holding
the statistics a sensible name, by renaming struct netns_can::can_stats
into struct netns_can::pkg_stats and struct netns_can::can_pstats into
struct netns_can::rcv_lists_stats.
The conversion is done with:
sed -i \
-e "s:\(struct[^*]*\*\)can_stats;.*:\1pkg_stats;:" \
-e "s:\(struct[^*]*\*\)can_pstats;.*:\1rcv_lists_stats;:" \
-e "s/can\.can_stats/can.pkg_stats/g" \
-e "s/can\.can_pstats/can.rcv_lists_stats/g" \
net/can/*.[ch] \
include/net/netns/can.h
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch renames both "struct s_stats" and "struct s_pstats", to
"struct can_pkg_stats" and "struct can_rcv_lists_stats" to better
reflect their meaning and improve code readability.
The conversion is done with:
sed -i \
-e "s/struct s_stats/struct can_pkg_stats/g" \
-e "s/struct s_pstats/struct can_rcv_lists_stats/g" \
net/can/*.[ch] \
include/net/netns/can.h
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch is to add ecn flag for both netns_sctp and sctp_endpoint,
net->sctp.ecn_enable is set 1 by default, and ep->ecn_enable will
be initialized with net->sctp.ecn_enable.
asoc->peer.ecn_capable will be set during negotiation only when
ep->ecn_enable is set on both sides.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current implementation of TCP MTU probing can considerably
underestimate the MTU on lossy connections allowing the MSS to get down to
48. We have found that in almost all of these cases on our networks these
paths can handle much larger MTUs meaning the connections are being
artificially limited. Even though TCP MTU probing can raise the MSS back up
we have seen this not to be the case causing connections to be "stuck" with
an MSS of 48 when heavy loss is present.
Prior to pushing out this change we could not keep TCP MTU probing enabled
b/c of the above reasons. Now with a reasonble floor set we've had it
enabled for the past 6 months.
The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
administrators the ability to control the floor of MSS probing.
Signed-off-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some TCP peers announce a very small MSS option in their SYN and/or
SYN/ACK messages.
This forces the stack to send packets with a very high network/cpu
overhead.
Linux has enforced a minimal value of 48. Since this value includes
the size of TCP options, and that the options can consume up to 40
bytes, this means that each segment can include only 8 bytes of payload.
In some cases, it can be useful to increase the minimal value
to a saner value.
We still let the default to 48 (TCP_MIN_SND_MSS), for compatibility
reasons.
Note that TCP_MAXSEG socket option enforces a minimal value
of (TCP_MIN_MSS). David Miller increased this minimal value
in commit c39508d6f1 ("tcp: Make TCP_MAXSEG minimum more correct.")
from 64 to 88.
We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS.
CVE-2019-11479 -- tcp mss hardcoded to 48
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Barebones start point for nexthops. Implementation for RTM commands,
notifications, management of rbtree for holding nexthops by id, and
kernel side data structures for nexthops and nexthop config.
Nexthops are maintained in an rbtree sorted by id. Similar to routes,
nexthops are configured per namespace using netns_nexthop struct added
to struct net.
Nexthop notifications are sent when a nexthop is added or deleted,
but NOT if the delete is due to a device event or network namespace
teardown (which also involves device events). Applications are
expected to use the device down event to flush nexthops and any
routes used by the nexthops.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch will add rcu grace period before fqdir
rhashtable destruction, so we need to dynamically allocate
fqdir structures to not force expensive synchronize_rcu() calls
in netns dismantle path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename the @frags fields from structs netns_ipv4, netns_ipv6,
netns_nf_frag and netns_ieee802154_lowpan to @fqdir
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) struct netns_frags is renamed to struct fqdir
This structure is really holding many frag queues in a hash table.
2) (struct inet_frag_queue)->net field is renamed to fqdir
since net is generally associated to a 'struct net' pointer
in networking stack.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We use the zero and one to limit the boolean options setting.
After this patch we only set 0 or 1 to boolean options for nf
conntrack sysctl.
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
To make ICMPv6 closer to ICMPv4, add ratemask parameter. Since the ICMP
message types use larger numeric values, a simple bitmask doesn't fit.
I use large bitmap. The input and output are the in form of list of
ranges. Set the default to rate limit all error messages but Packet Too
Big. For Packet Too Big, use ratemask instead of hard-coded.
There are functions where icmpv6_xrlim_allow() and icmpv6_global_allow()
aren't called. This patch only adds them to icmpv6_echo_reply().
Rate limiting error messages is mandated by RFC 4443 but RFC 4890 says
that it is also acceptable to rate limit informational messages. Thus,
I removed the current hard-coded behavior of icmpv6_mask_allow() that
doesn't rate limit informational messages.
v2: Add dummy function proc_do_large_bitmap() if CONFIG_PROC_SYSCTL
isn't defined, expand the description in ip-sysctl.txt and remove
unnecessary conditional before kfree().
v3: Inline the bitmap instead of dynamically allocated. Still is a
pointer to it is needed because of the way proc_do_large_bitmap work.
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor comment merge conflict in mlx5.
Staging driver has a fixup due to the skb->xmit_more changes
in 'net-next', but was removed in 'net'.
Signed-off-by: David S. Miller <davem@davemloft.net>
net_hash_mix() currently uses kernel address of a struct net,
and is used in many places that could be used to reveal this
address to a patient attacker, thus defeating KASLR, for
the typical case (initial net namespace, &init_net is
not dynamically allocated)
I believe the original implementation tried to avoid spending
too many cycles in this function, but security comes first.
Also provide entropy regardless of CONFIG_NET_NS.
Fixes: 0b4419162a ("netns: introduce the net_hash_mix "salt" for hashes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reported-by: Benny Pinkas <benny@pinkas.net>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to Amit Klein and Benny Pinkas, IP ID generation is too weak
and might be used by attackers.
Even with recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix())
having 64bit key and Jenkins hash is risky.
It is time to switch to siphash and its 128bit keys.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reported-by: Benny Pinkas <benny@pinkas.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In addition to icmp_echo_ignore_multicast, there is a need to also
prevent responding to pings to anycast addresses for security.
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv4 has icmp_echo_ignore_broadcast to prevent responding to broadcast pings.
IPv6 needs a similar mechanism.
v1->v2:
- Remove NET_IPV6_ICMP_ECHO_IGNORE_MULTICAST.
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use percpu allocation for the ipv6.icmp_sk.
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2019-01-29
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Teach verifier dead code removal, this also allows for optimizing /
removing conditional branches around dead code and to shrink the
resulting image. Code store constrained architectures like nfp would
have hard time doing this at JIT level, from Jakub.
2) Add JMP32 instructions to BPF ISA in order to allow for optimizing
code generation for 32-bit sub-registers. Evaluation shows that this
can result in code reduction of ~5-20% compared to 64 bit-only code
generation. Also add implementation for most JITs, from Jiong.
3) Add support for __int128 types in BTF which is also needed for
vmlinux's BTF conversion to work, from Yonghong.
4) Add a new command to bpftool in order to dump a list of BPF-related
parameters from the system or for a specific network device e.g. in
terms of available prog/map types or helper functions, from Quentin.
5) Add AF_XDP sock_diag interface for querying sockets from user
space which provides information about the RX/TX/fill/completion
rings, umem, memory usage etc, from Björn.
6) Add skb context access for skb_shared_info->gso_segs field, from Eric.
7) Add support for testing flow dissector BPF programs by extending
existing BPF_PROG_TEST_RUN infrastructure, from Stanislav.
8) Split BPF kselftest's test_verifier into various subgroups of tests
in order better deal with merge conflicts in this area, from Jakub.
9) Add support for queue/stack manipulations in bpftool, from Stanislav.
10) Document BTF, from Yonghong.
11) Dump supported ELF section names in libbpf on program load
failure, from Taeung.
12) Silence a false positive compiler warning in verifier's BTF
handling, from Peter.
13) Fix help string in bpftool's feature probing, from Prashant.
14) Remove duplicate includes in BPF kselftests, from Yue.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Track each AF_XDP socket in a per-netns list. This will be used later
by the sock_diag interface for querying sockets from userspace.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Those were needed we still had modular trackers.
As we don't have those anymore, prefer direct calls and remove all
the (un)register infrastructure associated with this.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>