linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-12 16:18:45 +02:00

Author	SHA1	Message	Date
Kuniyuki Iwashima	019c892e46	ipmr: Call ipmr_fib_lookup() under RCU. Yi Lai reported RCU splat in reg_vif_xmit() below. [0] When CONFIG_IP_MROUTE_MULTIPLE_TABLES=n, ipmr_fib_lookup() uses rcu_dereference() without explicit rcu_read_lock(). Although rcu_read_lock_bh() is already held by the caller __dev_queue_xmit(), lockdep requires explicit rcu_read_lock() for rcu_dereference(). Let's move up rcu_read_lock() in reg_vif_xmit() to cover ipmr_fib_lookup(). [0]: WARNING: suspicious RCU usage 7.1.0-rc2-next-20260504-9d0d467c3572 #1 Not tainted ----------------------------- net/ipv4/ipmr.c:329 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 2 locks held by syz.2.17/1779: #0: ffffffff87896440 (rcu_read_lock_bh){....}-{1:3}, at: local_bh_disable include/linux/bottom_half.h:20 [inline] #0: ffffffff87896440 (rcu_read_lock_bh){....}-{1:3}, at: rcu_read_lock_bh include/linux/rcupdate.h:891 [inline] #0: ffffffff87896440 (rcu_read_lock_bh){....}-{1:3}, at: __dev_queue_xmit+0x239/0x4140 net/core/dev.c:4792 #1: ffff88801a199d18 (_xmit_PIMREG#2){+...}-{3:3}, at: spin_lock include/linux/spinlock.h:342 [inline] #1: ffff88801a199d18 (_xmit_PIMREG#2){+...}-{3:3}, at: __netif_tx_lock include/linux/netdevice.h:4795 [inline] #1: ffff88801a199d18 (_xmit_PIMREG#2){+...}-{3:3}, at: __dev_queue_xmit+0x1d5d/0x4140 net/core/dev.c:4865 stack backtrace: CPU: 1 UID: 0 PID: 1779 Comm: syz.2.17 Not tainted 7.1.0-rc2-next-20260504-9d0d467c3572 #1 PREEMPT(lazy) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x121/0x150 lib/dump_stack.c:120 dump_stack+0x19/0x20 lib/dump_stack.c:129 lockdep_rcu_suspicious+0x15b/0x1f0 kernel/locking/lockdep.c:6878 ipmr_fib_lookup net/ipv4/ipmr.c:329 [inline] reg_vif_xmit+0x2ee/0x3c0 net/ipv4/ipmr.c:540 __netdev_start_xmit include/linux/netdevice.h:5382 [inline] netdev_start_xmit include/linux/netdevice.h:5391 [inline] xmit_one net/core/dev.c:3889 [inline] dev_hard_start_xmit+0x170/0x700 net/core/dev.c:3905 __dev_queue_xmit+0x1df1/0x4140 net/core/dev.c:4871 dev_queue_xmit include/linux/netdevice.h:3423 [inline] packet_xmit+0x252/0x370 net/packet/af_packet.c:276 packet_snd net/packet/af_packet.c:3082 [inline] packet_sendmsg+0x39ad/0x5650 net/packet/af_packet.c:3114 sock_sendmsg_nosec net/socket.c:797 [inline] __sock_sendmsg net/socket.c:812 [inline] ____sys_sendmsg+0xa21/0xba0 net/socket.c:2716 ___sys_sendmsg+0x121/0x1c0 net/socket.c:2770 __sys_sendmsg+0x177/0x220 net/socket.c:2802 __do_sys_sendmsg net/socket.c:2807 [inline] __se_sys_sendmsg net/socket.c:2805 [inline] __x64_sys_sendmsg+0x80/0xc0 net/socket.c:2805 x64_sys_call+0x1d9c/0x21c0 arch/x86/include/generated/asm/syscalls_64.h:47 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xc1/0x1020 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f37e563ee5d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe5caa7fa8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00000000005c5fa0 RCX: 00007f37e563ee5d RDX: 0000000000000000 RSI: 00002000000012c0 RDI: 0000000000000004 RBP: 00000000005c5fa0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 00000000005c5fac R15: 00000000005c5fa0 </TASK> Fixes: `b3b6babf47` ("ipmr: Free mr_table after RCU grace period.") Reported-by: syzkaller <syzkaller@googlegroups.com> Reported-by: Yi Lai <yi1.lai@intel.com> Closes: https://lore.kernel.org/netdev/afrY34dLXNUboevf@ly-workstation/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260506065955.1695753-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-07 08:38:37 -07:00
Eric Dumazet	4f34002e2e	ipmr: prevent info-leak in pmr_cache_report() Yiming Qian reported: <quote> ipmr_cache_report()` allocates a report skb with `alloc_skb(128, GFP_ATOMIC)` and appends a `struct igmphdr` using `skb_put()`. In the non-`IGMPMSG_WHOLEPKT` path it initializes only: - `igmp->type` - `igmp->code` but does not initialize: - `igmp->csum` - `igmp->group` Later, `igmpmsg_netlink_event()` copies the bytes after `sizeof(struct igmpmsg)` into the `IPMRA_CREPORT_PKT` netlink attribute and emits `RTM_NEWCACHEREPORT` on `RTNLGRP_IPV4_MROUTE_R`. As a result, 6 bytes of stale heap data from the skb head are disclosed to userspace. </quote> Let's use skb_put_zero() instead of skb_put() to fix this bug. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Reported-by: Yiming Qian <yimingqian591@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260430070611.4004529-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-01 17:01:40 -07:00
Kuniyuki Iwashima	b3b6babf47	ipmr: Free mr_table after RCU grace period. With CONFIG_IP_MROUTE_MULTIPLE_TABLES=n, ipmr_fib_lookup() does not check if net->ipv4.mrt is NULL. Since default_device_exit_batch() is called after ->exit_rtnl(), a device could receive IGMP packets and access net->ipv4.mrt during/after ipmr_rules_exit_rtnl(). If ipmr_rules_exit_rtnl() had already cleared it and freed the memory, the access would trigger null-ptr-deref or use-after-free. Let's fix it by using RCU helper and free mrt after RCU grace period. In addition, check_net(net) is added to mroute_clean_tables() and ipmr_cache_unresolved() to synchronise via mfc_unres_lock. This prevents ipmr_cache_unresolved() from putting skb into c->_c.mfc_un.unres.unresolved after mroute_clean_tables() purges it. For the same reason, timer_shutdown_sync() is moved after mroute_clean_tables(). Since rhltable_destroy() holds mutex internally, rcu_work is used, and it is placed as the first member because rcu_head must be placed within <4K offset. mr_table is alraedy 3864 bytes without rcu_work. Note that IP6MR is not yet converted to ->exit_rtnl(), so this change is not needed for now but will be. Fixes: `b22b018674` ("ipmr: Convert ipmr_net_exit_batch() to ->exit_rtnl().") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260423053456.4097409-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-27 18:46:17 -07:00
Kuniyuki Iwashima	bddafc06ca	ipmr: Don't hold RTNL for ipmr_rtm_route(). ipmr_mfc_add() and ipmr_mfc_delete() are already protected by a dedicated mutex. rtm_to_ipmr_mfcc() calls __ipmr_get_table(), __dev_get_by_index(), amd ipmr_find_vif(). Once __dev_get_by_index() is converted to dev_get_by_index_rcu(), we can move the other two functions under that same RCU section and drop RTNL for ipmr_rtm_route(). Let's do that conversion and drop ASSERT_RTNL() in mr_call_mfc_notifiers(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-16-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:49:41 -08:00
Kuniyuki Iwashima	3c1e53e554	ipmr: Add dedicated mutex for mrt->{mfc_hash,mfc_cache_list}. We will no longer hold RTNL for ipmr_rtm_route() to modify the MFC hash table. Only __dev_get_by_index() in rtm_to_ipmr_mfcc() is the RTNL dependant, otherwise, we just need protection for mrt->mfc_hash and mrt->mfc_cache_list. Let's add a new mutex for ipmr_mfc_add(), ipmr_mfc_delete(), and mroute_clean_tables() (setsockopt(MRT_FLUSH or MRT_DONE)). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-15-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:49:41 -08:00
Kuniyuki Iwashima	4480d5fa1f	ipmr/ip6mr: Convert net->ipv[46].ipmr_seq to atomic_t. We will no longer hold RTNL for ipmr_mfc_add() and ipmr_mfc_delete(). MFC entry can be loosely connected with VIF by its index for mrt->vif_table[] (stored in mfc_parent), but the two tables are not synchronised. i.e. Even if VIF 1 is removed, MFC for VIF 1 is not automatically removed. The only field that the MFC/VIF interfaces share is net->ipv[46].ipmr_seq, which is protected by RTNL. Adding a new mutex for both just to protect a single field is overkill. Let's convert the field to atomic_t. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-14-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:49:41 -08:00
Kuniyuki Iwashima	478c2add78	ipmr: Call fib_rules_unregister() without RTNL. fib_rules_unregister() removes ops from net->rules_ops under spinlock, calls ops->delete() for each rule, and frees the ops. ipmr_rules_ops_template does not have ->delete(), and any operation does not require RTNL there. Let's move fib_rules_unregister() from ipmr_rules_exit_rtnl() to ipmr_net_exit(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-12-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:49:41 -08:00
Kuniyuki Iwashima	4a11adcd9e	ipmr: Remove RTNL in ipmr_rules_init() and ipmr_net_init(). When ipmr_free_table() is called from ipmr_rules_init() or ipmr_net_init(), the netns is not yet published. Thus, no device should have been registered, and mroute_clean_tables() will not call vif_delete(), so unregister_netdevice_many() is unnecessary. unregister_netdevice_many() does nothing if the list is empty, but it requires RTNL due to the unconditional ASSERT_RTNL() at the entry of unregister_netdevice_many_notify(). Let's remove unnecessary RTNL and ASSERT_RTNL() and instead add WARN_ON_ONCE() in ipmr_free_table(). Note that we use a local list for the new WARN_ON_ONCE() because dev_kill_list passed from ipmr_rules_exit_rtnl() may have some devices when other ops->init() fails after ipmr durnig setup_net(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-11-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:49:40 -08:00
Kuniyuki Iwashima	b22b018674	ipmr: Convert ipmr_net_exit_batch() to ->exit_rtnl(). ipmr_net_ops uses ->exit_batch() to acquire RTNL only once for dying network namespaces. ipmr does not depend on the ordering of ->exit_rtnl() and ->exit_batch() of other pernet_operations (unlike fib_net_ops). Once ipmr_free_table() is called and all devices are queued for destruction in ->exit_rtnl(), later during NETDEV_UNREGISTER, ipmr_device_event() will not see anything in vif table and just do nothing. Let's convert ipmr_net_exit_batch() to ->exit_rtnl(). Note that fib_rules_unregister() does not need RTNL and we will remove RTNL and unregister_netdevice_many() in ipmr_net_init(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-10-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:49:40 -08:00
Kuniyuki Iwashima	b7fdc3cfb6	ipmr: Move unregister_netdevice_many() out of ipmr_free_table(). This is a prep commit to convert ipmr_net_exit_batch() to ->exit_rtnl(). Let's move unregister_netdevice_many() in ipmr_free_table() to its callers. Now ipmr_rules_exit() can do batching all tables per netns. Note that later we will remove RTNL and unregister_netdevice_many() in ipmr_rules_init(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-9-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:49:40 -08:00
Kuniyuki Iwashima	3810f9529d	ipmr: Move unregister_netdevice_many() out of mroute_clean_tables(). This is a prep commit to convert ipmr_net_exit_batch() to ->exit_rtnl(). Let's move unregister_netdevice_many() in mroute_clean_tables() to its callers. As a bonus, mrtsock_destruct() can do batching for all tables. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-8-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:49:40 -08:00
Kuniyuki Iwashima	2c698bab29	ipmr: Convert ipmr_rtm_dumproute() to RCU. ipmr_rtm_dumproute() calls mr_table_dump() or mr_rtm_dumproute(), and mr_rtm_dumproute() finally calls mr_table_dump(). mr_table_dump() calls the passed function, _ipmr_fill_mroute(). _ipmr_fill_mroute() is a wrapper of ipmr_fill_mroute() to cast struct mr_mfc * to struct mfc_cache *. ipmr_fill_mroute() can be already called safely under RCU. Let's convert ipmr_rtm_dumproute() to RCU. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-7-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:49:40 -08:00
Kuniyuki Iwashima	295a17b3ea	ipmr: Convert ipmr_rtm_getroute() to RCU. ipmr_rtm_getroute() calls __ipmr_get_table(), ipmr_cache_find(), and ipmr_fill_mroute(). The table is not removed until netns dismantle, and net->ipv4.mr_tables is managed with RCU list API, so __ipmr_get_table() is safe under RCU. struct mfc_cache is freed by mr_cache_put() after RCU grace period, so we can use ipmr_cache_find() under RCU. rcu_read_lock() around it was just to avoid lockdep splat for rhl_for_each_entry_rcu(). ipmr_fill_mroute() calls mr_fill_mroute(), which properly uses RCU. Let's drop RTNL for ipmr_rtm_getroute() and use RCU instead. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:49:40 -08:00
Kuniyuki Iwashima	2bd6c9d600	ipmr: Use MAXVIFS in mroute_msgsize(). mroute_msgsize() calculates skb size needed for ipmr_fill_mroute(). The size differs based on mrt->maxvif. We will drop RTNL for ipmr_rtm_getroute() and mrt->maxvif may change under RCU. To avoid -EMSGSIZE, let's calculate the size with the maximum value of mrt->maxvif, MAXVIFS. struct rtnexthop is 8 bytes and MAXVIFS is 32, so the maximum delta is 256 bytes, which is small enough. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:49:39 -08:00
Kuniyuki Iwashima	402a8111d7	ipmr: Convert ipmr_rtm_dumplink() to RCU. net->ipv4.mr_tables is updated under RTNL and can be read safely under RCU. Once created, the multicast route tables are not removed until netns dismantle. ipmr_rtm_dumplink() does not need RTNL protection for ipmr_for_each_table() and ipmr_fill_table() if RCU is held. Even if mrt->maxvif changes concurrently, ipmr_fill_vif() returns true to continue dumping the next table. Let's convert it to RCU. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:49:39 -08:00
Kuniyuki Iwashima	261950e039	ipmr: Annotate access to mrt->mroute_do_{pim,assert,wrvifwhole}. These fields in struct mr_table are updated in ip_mroute_setsockopt() under RTNL: * mroute_do_pim * mroute_do_assert * mroute_do_wrvifwhole However, ip_mroute_getsockopt() does not hold RTNL and read the first two fields locklessly, and ip_mr_forward() reads all the three under RCU. pim_rcv_v1() also reads mroute_do_pim locklessly. Let's use WRITE_ONCE() and READ_ONCE() for them. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:49:39 -08:00
Eric Dumazet	fe8570186f	ipv4: use dst4_mtu() instead of dst_mtu() When we expect an IPv4 dst, use dst4_mtu() instead of dst_mtu() to save some code space. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260130210303.3888261-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 17:49:29 -08:00
Eric Dumazet	6ad8de3cef	ipv4: start using dst_dev_rcu() Change icmpv4_xrlim_allow(), ip_defrag() to prevent possible UAF. Change ipmr_prepare_xmit(), ipmr_queue_fwd_xmit(), ip_mr_output(), ipv4_neigh_lookup() to use lockdep enabled dst_dev_rcu(). Fixes: `4a6ce2b6f2` ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250828195823.3958522-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-29 19:36:32 -07:00
Guillaume Nault	1bec9d0c00	ipv4: Convert ->flowi4_tos to dscp_t. Convert the ->flowic_tos field of struct flowi_common from __u8 to dscp_t, rename it ->flowic_dscp and propagate these changes to struct flowi and struct flowi4. We've had several bugs in the past where ECN bits could interfere with IPv4 routing, because these bits were not properly cleared when setting ->flowi4_tos. These bugs should be fixed now and the dscp_t type has been introduced to ensure that variables carrying DSCP values don't accidentally have any ECN bits set. Several variables and structure fields have been converted to dscp_t already, but the main IPv4 routing structure, struct flowi4, is still using a __u8. To avoid any future regression, this patch converts it to dscp_t. There are many users to convert at once. Fortunately, around half of ->flowi4_tos users already have a dscp_t value at hand, which they currently convert to __u8 using inet_dscp_to_dsfield(). For all of these users, we just need to drop that conversion. But, although we try to do the __u8 <-> dscp_t conversions at the boundaries of the network or of user space, some places still store TOS/DSCP variables as __u8 in core networking code. Those can hardly be converted either because the data structure is part of UAPI or because the same variable or field is also used for handling ECN in other parts of the code. In all of these cases where we don't have a dscp_t variable at hand, we need to use inet_dsfield_to_dscp() when interacting with ->flowi4_dscp. Changes since v1: * Fix space alignment in __bpf_redirect_neigh_v4() (Ido). Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/29acecb45e911d17446b9a3dbdb1ab7b821ea371.1756128932.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-26 17:34:31 -07:00
Stanislav Fomichev	ffea116834	net: s/dev_get_port_parent_id/netif_get_port_parent_id/ Commit `cc34acd577` ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-2-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-18 17:27:46 -07:00
Eric Dumazet	beead7eea8	net: ipv4: guard ip_mr_output() with rcu syzbot found at least one path leads to an ip_mr_output() without RCU being held. Add guard(rcu)() to fix this in a concise way. WARNING: CPU: 0 PID: 0 at net/ipv4/ipmr.c:2302 ip_mr_output+0xbb1/0xe70 net/ipv4/ipmr.c:2302 Call Trace: <IRQ> igmp_send_report+0x89e/0xdb0 net/ipv4/igmp.c:799 igmp_timer_expire+0x204/0x510 net/ipv4/igmp.c:-1 call_timer_fn+0x17e/0x5f0 kernel/time/timer.c:1747 expire_timers kernel/time/timer.c:1798 [inline] __run_timers kernel/time/timer.c:2372 [inline] __run_timer_base+0x61a/0x860 kernel/time/timer.c:2384 run_timer_base kernel/time/timer.c:2393 [inline] run_timer_softirq+0xb7/0x180 kernel/time/timer.c:2403 handle_softirqs+0x286/0x870 kernel/softirq.c:579 __do_softirq kernel/softirq.c:613 [inline] invoke_softirq kernel/softirq.c:453 [inline] __irq_exit_rcu+0xca/0x1f0 kernel/softirq.c:680 irq_exit_rcu+0x9/0x30 kernel/softirq.c:696 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1050 [inline] sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1050 Fixes: `35bec72a24` ("net: ipv4: Add ip_mr_output()") Reported-by: syzbot+f02fb9e43bd85c6c66ae@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/685e841a.a00a0220.129264.0002.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Petr Machata <petrm@nvidia.com> Cc: Roopa Prabhu <roopa@nvidia.com> Cc: Nikolay Aleksandrov <razor@blackwall.org> Cc: Benjamin Poirier <bpoirier@nvidia.com> Cc: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2025-06-28 10:44:21 +01:00
Petr Machata	35bec72a24	net: ipv4: Add ip_mr_output() Multicast routing is today handled in the input path. Locally generated MC packets don't hit the IPMR code today. Thus if a VXLAN remote address is multicast, the driver needs to set an OIF during route lookup. Thus MC routing configuration needs to be kept in sync with the VXLAN FDB and MDB. Ideally, the VXLAN packets would be routed by the MC routing code instead. To that end, this patch adds support to route locally generated multicast packets. The newly-added routines do largely what ip_mr_input() and ip_mr_forward() do: make an MR cache lookup to find where to send the packets, and use ip_mc_output() to send each of them. When no cache entry is found, the packet is punted to the daemon for resolution. However, an installation that uses a VXLAN underlay netdevice for which it also has matching MC routes, would get a different routing with this patch. Previously, the MC packets would be delivered directly to the underlay port, whereas now they would be MC-routed. In order to avoid this change in behavior, introduce an IPCB flag. Only if the flag is set will ip_mr_output() actually engage, otherwise it reverts to ip_mc_output(). This code is based on work by Roopa Prabhu and Nikolay Aleksandrov. Signed-off-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/0aadbd49330471c0f758d54afb05eb3b6e3a6b65.1750113335.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-06-17 18:18:45 -07:00
Petr Machata	b2e653bcff	net: ipv4: ipmr: Split ipmr_queue_xmit() in two Some of the work of ipmr_queue_xmit() is specific to IPMR forwarding, and should not take place on the output path. In order to allow reuse of the common parts, split the function into two: the ipmr_prepare_xmit() helper that takes care of the common bits, and the ipmr_queue_fwd_xmit(), which invokes the former and encapsulates the whole forwarding algorithm. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/4e8db165572a4f8bd29a723a801e854e9d20df4d.1750113335.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-06-17 18:18:45 -07:00
Petr Machata	3b7bc938e0	net: ipv4: ipmr: ipmr_queue_xmit(): Drop local variable `dev' The variable is used for caching of rt->dst.dev. The netdevice referenced therein does not change during the scope of validity of that local. At the same time, the local is only used twice, and each of these uses will end up in a different function in the following patches, further eliminating any use the local could have had. Drop the local altogether and inline the uses. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/c80600a4b51679fe78f429ccb6d60892c2f9e4de.1750113335.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-06-17 18:18:45 -07:00
Ingo Molnar	41cb08555c	treewide, timers: Rename from_timer() to timer_container_of() Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com	2025-06-08 09:07:37 +02:00
Jakub Kicinski	33e1b1b399	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.15-rc8). Conflicts: `80f2ab46c2` ("irdma: free iwdev->rf after removing MSI-X") `4bcc063939` ("ice, irdma: fix an off by one in error handling code") `c24a65b6a2` ("iidc/ice/irdma: Update IDC to support multiple consumers") https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au No extra adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-22 09:42:41 -07:00
Paolo Abeni	c46286fdd6	mr: consolidate the ipmr_can_free_table() checks. Guoyu Yin reported a splat in the ipmr netns cleanup path: WARNING: CPU: 2 PID: 14564 at net/ipv4/ipmr.c:440 ipmr_free_table net/ipv4/ipmr.c:440 [inline] WARNING: CPU: 2 PID: 14564 at net/ipv4/ipmr.c:440 ipmr_rules_exit+0x135/0x1c0 net/ipv4/ipmr.c:361 Modules linked in: CPU: 2 UID: 0 PID: 14564 Comm: syz.4.838 Not tainted 6.14.0 #1 Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:ipmr_free_table net/ipv4/ipmr.c:440 [inline] RIP: 0010:ipmr_rules_exit+0x135/0x1c0 net/ipv4/ipmr.c:361 Code: ff df 48 c1 ea 03 80 3c 02 00 75 7d 48 c7 83 60 05 00 00 00 00 00 00 5b 5d 41 5c 41 5d 41 5e e9 71 67 7f 00 e8 4c 2d 8a fd 90 <0f> 0b 90 eb 93 e8 41 2d 8a fd 0f b6 2d 80 54 ea 01 31 ff 89 ee e8 RSP: 0018:ffff888109547c58 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff888108c12dc0 RCX: ffffffff83e09868 RDX: ffff8881022b3300 RSI: ffffffff83e098d4 RDI: 0000000000000005 RBP: ffff888104288000 R08: 0000000000000000 R09: ffffed10211825c9 R10: 0000000000000001 R11: ffff88801816c4a0 R12: 0000000000000001 R13: ffff888108c13320 R14: ffff888108c12dc0 R15: fffffbfff0b74058 FS: 00007f84f39316c0(0000) GS:ffff88811b100000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f84f3930f98 CR3: 0000000113b56000 CR4: 0000000000350ef0 Call Trace: <TASK> ipmr_net_exit_batch+0x50/0x90 net/ipv4/ipmr.c:3160 ops_exit_list+0x10c/0x160 net/core/net_namespace.c:177 setup_net+0x47d/0x8e0 net/core/net_namespace.c:394 copy_net_ns+0x25d/0x410 net/core/net_namespace.c:516 create_new_namespaces+0x3f6/0xaf0 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xc3/0x180 kernel/nsproxy.c:228 ksys_unshare+0x78d/0x9a0 kernel/fork.c:3342 __do_sys_unshare kernel/fork.c:3413 [inline] __se_sys_unshare kernel/fork.c:3411 [inline] __x64_sys_unshare+0x31/0x40 kernel/fork.c:3411 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xa6/0x1a0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f84f532cc29 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f84f3931038 EFLAGS: 00000246 ORIG_RAX: 0000000000000110 RAX: ffffffffffffffda RBX: 00007f84f5615fa0 RCX: 00007f84f532cc29 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000040000400 RBP: 00007f84f53fba18 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 00007f84f5615fa0 R15: 00007fff51c5f328 </TASK> The running kernel has CONFIG_IP_MROUTE_MULTIPLE_TABLES disabled, and the sanity check for such build is still too loose. Address the issue consolidating the relevant sanity check in a single helper regardless of the kernel configuration. Also share it between the ipv4 and ipv6 code. Reported-by: Guoyu Yin <y04609127@gmail.com> Fixes: `50b9420444` ("ipmr: tune the ipmr_can_free_table() checks.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/372dc261e1bf12742276e1b984fc5a071b7fc5a8.1747321903.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-16 17:53:48 -07:00
Breno Leitao	04e00a849e	ipv4: Use nlmsg_payload in ipmr file Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250415-nlmsg_v2-v1-7-a1c75d493fd7@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-16 18:33:03 -07:00
Thomas Gleixner	8fa7292fee	treewide: Switch/rename to timer_delete[_sync]() timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2025-04-05 10:30:12 +02:00
Nicolas Dichtel	0c493da863	net: rename netns_local to netns_immutable The name 'netns_local' is confusing. A following commit will export it via netlink, so let's use a more explicit name. Reported-by: Eric Dumazet <edumazet@google.com> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-04 12:44:48 +01:00
Eric Dumazet	3440fa34ad	inet: ipmr: fix data-races Following fields of 'struct mr_mfc' can be updated concurrently (no lock protection) from ip_mr_forward() and ip6_mr_forward() - bytes - pkt - wrong_if - lastuse They also can be read from other functions. Convert bytes, pkt and wrong_if to atomic_long_t, and use READ_ONCE()/WRITE_ONCE() for lastuse. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250114221049.1190631-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-01-15 15:07:23 -08:00
Paolo Abeni	50b9420444	ipmr: tune the ipmr_can_free_table() checks. Eric reported a syzkaller-triggered splat caused by recent ipmr changes: WARNING: CPU: 2 PID: 6041 at net/ipv6/ip6mr.c:419 ip6mr_free_table+0xbd/0x120 net/ipv6/ip6mr.c:419 Modules linked in: CPU: 2 UID: 0 PID: 6041 Comm: syz-executor183 Not tainted 6.12.0-syzkaller-10681-g65ae975e97d5 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:ip6mr_free_table+0xbd/0x120 net/ipv6/ip6mr.c:419 Code: 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 75 58 49 83 bc 24 c0 0e 00 00 00 74 09 e8 44 ef a9 f7 90 <0f> 0b 90 e8 3b ef a9 f7 48 8d 7b 38 e8 12 a3 96 f7 48 89 df be 0f RSP: 0018:ffffc90004267bd8 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff88803c710000 RCX: ffffffff89e4d844 RDX: ffff88803c52c880 RSI: ffffffff89e4d87c RDI: ffff88803c578ec0 RBP: 0000000000000001 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: ffff88803c578000 R13: ffff88803c710000 R14: ffff88803c710008 R15: dead000000000100 FS: 00007f7a855ee6c0(0000) GS:ffff88806a800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7a85689938 CR3: 000000003c492000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ip6mr_rules_exit+0x176/0x2d0 net/ipv6/ip6mr.c:283 ip6mr_net_exit_batch+0x53/0xa0 net/ipv6/ip6mr.c:1388 ops_exit_list+0x128/0x180 net/core/net_namespace.c:177 setup_net+0x4fe/0x860 net/core/net_namespace.c:394 copy_net_ns+0x2b4/0x6b0 net/core/net_namespace.c:500 create_new_namespaces+0x3ea/0xad0 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228 ksys_unshare+0x45d/0xa40 kernel/fork.c:3334 __do_sys_unshare kernel/fork.c:3405 [inline] __se_sys_unshare kernel/fork.c:3403 [inline] __x64_sys_unshare+0x31/0x40 kernel/fork.c:3403 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f7a856332d9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 51 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f7a855ee238 EFLAGS: 00000246 ORIG_RAX: 0000000000000110 RAX: ffffffffffffffda RBX: 00007f7a856bd308 RCX: 00007f7a856332d9 RDX: 00007f7a8560f8c6 RSI: 0000000000000000 RDI: 0000000062040200 RBP: 00007f7a856bd300 R08: 00007fff932160a7 R09: 00007f7a855ee6c0 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f7a856bd30c R13: 0000000000000000 R14: 00007fff93215fc0 R15: 00007fff932160a8 </TASK> The root cause is a network namespace creation failing after successful initialization of the ipmr subsystem. Such a case is not currently matched by the ipmr_can_free_table() helper. New namespaces are zeroed on allocation and inserted into net ns list only after successful creation; when deleting an ipmr table, the list next pointer can be NULL only on netns initialization failure. Update the ipmr_can_free_table() checks leveraging such condition. Reported-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+6e8cb445d4b43d006e0c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=6e8cb445d4b43d006e0c Fixes: `11b6e701bc` ("ipmr: add debug check for mr table cleanup") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/8bde975e21bbca9d9c27e36209b2dd4f1d7a3f00.1733212078.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-12-04 18:49:16 -08:00
Paolo Abeni	f6d7695b5a	ipmr: fix build with clang and DEBUG_NET disabled. Sasha reported a build issue in ipmr:: net/ipv4/ipmr.c:320:13: error: function 'ipmr_can_free_table' is not \ needed and will not be emitted \ [-Werror,-Wunneeded-internal-declaration] 320 \| static bool ipmr_can_free_table(struct net *net) Apparently clang is too smart with BUILD_BUG_ON_INVALID(), let's fallback to a plain WARN_ON_ONCE(). Reported-by: Sasha Levin <sashal@kernel.org> Closes: https://qa-reports.linaro.org/lkft/sashal-linus-next/build/v6.11-25635-g6813e2326f1e/testrun/26111580/suite/build/test/clang-nightly-lkftconfig/details/ Fixes: `11b6e701bc` ("ipmr: add debug check for mr table cleanup") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/ee75faa926b2446b8302ee5fc30e129d2df73b90.1732810228.git.pabeni@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-11-28 17:40:54 +01:00
Paolo Abeni	fc9c273d6d	ipmr: fix tables suspicious RCU usage Similar to the previous patch, plumb the RCU lock inside the ipmr_get_table(), provided a lockless variant and apply the latter in the few spots were the lock is already held. Fixes: `709b46e8d9` ("net: Add compat ioctl support for the ipv4 multicast ioctl SIOCGETSGCNT") Fixes: `f0ad0860d0` ("ipv4: ipmr: support multiple tables") Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-11-28 10:23:23 +01:00
Paolo Abeni	11b6e701bc	ipmr: add debug check for mr table cleanup The multicast route tables lifecycle, for both ipv4 and ipv6, is protected by RCU using the RTNL lock for write access. In many places a table pointer escapes the RCU (or RTNL) protected critical section, but such scenarios are actually safe because tables are deleted only at namespace cleanup time or just after allocation, in case of default rule creation failure. Tables freed at namespace cleanup time are assured to be alive for the whole netns lifetime; tables freed just after creation time are never exposed to other possible users. Ensure that the free conditions are respected in ip{,6}mr_free_table, to document the locking schema and to prevent future possible introduction of 'table del' operation from breaking it. Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-11-28 10:22:42 +01:00
Johannes Berg	a885a6b2d3	net: convert to nla_get__default() Most of the original conversion is from the spatch below, but I edited some and left out other instances that were either buggy after conversion (where default values don't fit into the type) or just looked strange. @@ expression attr, def; expression val; identifier fn =~ "^nla_get_."; fresh identifier dfn = fn ## "_default"; @@ ( -if (attr) - val = fn(attr); -else - val = def; +val = dfn(attr, def); \| -if (!attr) - val = def; -else - val = fn(attr); +val = dfn(attr, def); \| -if (!attr) - return def; -return fn(attr); +return dfn(attr, def); \| -attr ? fn(attr) : def +dfn(attr, def) \| -!attr ? def : fn(attr) +dfn(attr, def) ) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org> Link: https://patch.msgid.link/20241108114145.0580b8684e7f.I740beeaa2f70ebfc19bfca1045a24d6151992790@changeid Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-11-11 10:32:06 -08:00
Guillaume Nault	6ab04392dd	ipv4: Prepare ipmr_rt_fib_lookup() to future .flowi4_tos conversion. Use ip4h_dscp() to get the DSCP from the IPv4 header, then convert the dscp_t value to __u8 with inet_dscp_to_dsfield(). Then, when we'll convert .flowi4_tos to dscp_t, we'll just have to drop the inet_dscp_to_dsfield() call. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/462402a097260357a7aba80228612305f230b6a9.1729530028.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-10-29 11:21:05 -07:00
Kuniyuki Iwashima	3ac84e31b3	ipmr: Use rtnl_register_many(). We will remove rtnl_register() and rtnl_register_module() in favour of rtnl_register_many(). When it succeeds for built-in callers, rtnl_register_many() guarantees all rtnetlink types in the passed array are supported, and there is no chance that a part of message types is not supported. Let's use rtnl_register_many() instead. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241014201828.91221-9-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-10-15 18:52:26 -07:00
Eric Dumazet	2698acd6ea	net: do not acquire rtnl in fib_seq_sum() After we made sure no fib_seq_read() handlers needs RTNL anymore, we can remove RTNL from fib_seq_sum(). Note that after RTNL was dropped, fib_seq_sum() result was possibly outdated anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20241009184405.3752829-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-10-11 15:35:05 -07:00
Eric Dumazet	055202b16c	ipmr: use READ_ONCE() to read net->ipv[46].ipmr_seq mr_call_vif_notifiers() and mr_call_mfc_notifiers() already uses WRITE_ONCE() on the write side. Using RTNL to protect the reads seems a big hammer. Constify 'struct net' argument of ip6mr_rules_seq_read() and ipmr_rules_seq_read(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20241009184405.3752829-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-10-11 15:35:05 -07:00
Ido Schimmel	97edbbaad3	ipv4: ipmr: Unmask upper DSCP bits in ipmr_queue_xmit() Unmask the upper DSCP bits when calling ip_route_output_ports() so that in the future it could perform the FIB lookup according to the full DSCP value. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240903135327.2810535-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-09-04 16:57:11 -07:00
Alexander Lobakin	05c1280a2b	netdev_features: convert NETIF_F_NETNS_LOCAL to dev->netns_local "Interface can't change network namespaces" is rather an attribute, not a feature, and it can't be changed via Ethtool. Make it a "cold" private flag instead of a netdev_feature and free one more bit. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-09-03 11:36:43 +02:00
Li Zetao	ebe39f95bc	ipmr: delete redundant judgment statements The initial value of err is -ENOBUFS, and err is guaranteed to be less than 0 before all goto errout. Therefore, on the error path of errout, there is no need to repeatedly judge that err is less than 0, and delete redundant judgments to make the code more concise. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-08-23 14:27:45 +01:00
Ido Schimmel	2bc9778b66	ipv4: ipmr: Unmask upper DSCP bits in ipmr_rt_fib_lookup() Unmask the upper DSCP bits when calling ipmr_fib_lookup() so that in the future it could perform the FIB lookup according to the full DSCP value. Note that ipmr_fib_lookup() performs a FIB rule lookup (returning the relevant routing table) and that IPv4 multicast FIB rules do not support matching on TOS / DSCP. However, it is still worth unmasking the upper DSCP bits in case support for DSCP matching is ever added. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-7-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-08-22 16:59:57 -07:00
Alexander Lobakin	117aef12a7	ip_tunnel: use a separate struct to store tunnel params in the kernel Unlike IPv6 tunnels which use purely-kernel __ip6_tnl_parm structure to store params inside the kernel, IPv4 tunnel code uses the same ip_tunnel_parm which is being used to talk with the userspace. This makes it difficult to alter or add any fields or use a different format for whatever data. Define struct ip_tunnel_parm_kern, a 1:1 copy of ip_tunnel_parm for now, and use it throughout the code. Define the pieces, where the copy user <-> kernel happens, as standalone functions, and copy the data there field-by-field, so that the kernel-side structure could be easily modified later on and the users wouldn't have to care about this. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-01 10:49:28 +01:00
Gavrilov Ilia	5c3be3e0eb	ipmr: fix incorrect parameter validation in the ip_mroute_getsockopt() function The 'olr' variable can't be negative when assigned the result of 'min_t' because all 'min_t' parameters are cast to unsigned int, and then the minimum one is chosen. To fix the logic, check 'olr' as read from 'optlen', where the types of relevant variables are (signed) int. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-03-11 09:53:22 +00:00
Eric Dumazet	22e36ea9f5	inet: allow ip_valid_fib_dump_req() to be called with RTNL or RCU Add a new field into struct fib_dump_filter, to let callers tell if they use RTNL locking or RCU. This is used in the following patch, when inet_dump_fib() no longer holds RTNL. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-26 11:46:12 +00:00
Kunwu Chan	eec70af2b4	ipmr: Simplify the allocation of slab caches Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. And change cache name from 'ip_mrt_cache' to 'mfc_cache'. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-21 11:28:57 +00:00
Nicolas Dichtel	e622502c31	ipmr: fix kernel panic when forwarding mcast packets The stacktrace was: [ 86.305548] BUG: kernel NULL pointer dereference, address: 0000000000000092 [ 86.306815] #PF: supervisor read access in kernel mode [ 86.307717] #PF: error_code(0x0000) - not-present page [ 86.308624] PGD 0 P4D 0 [ 86.309091] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 86.309883] CPU: 2 PID: 3139 Comm: pimd Tainted: G U 6.8.0-6wind-knet #1 [ 86.311027] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014 [ 86.312728] RIP: 0010:ip_mr_forward (/build/work/knet/net/ipv4/ipmr.c:1985) [ 86.313399] Code: f9 1f 0f 87 85 03 00 00 48 8d 04 5b 48 8d 04 83 49 8d 44 c5 00 48 8b 40 70 48 39 c2 0f 84 d9 00 00 00 49 8b 46 58 48 83 e0 fe <80> b8 92 00 00 00 00 0f 84 55 ff ff ff 49 83 47 38 01 45 85 e4 0f [ 86.316565] RSP: 0018:ffffad21c0583ae0 EFLAGS: 00010246 [ 86.317497] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 86.318596] RDX: ffff9559cb46c000 RSI: 0000000000000000 RDI: 0000000000000000 [ 86.319627] RBP: ffffad21c0583b30 R08: 0000000000000000 R09: 0000000000000000 [ 86.320650] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001 [ 86.321672] R13: ffff9559c093a000 R14: ffff9559cc00b800 R15: ffff9559c09c1d80 [ 86.322873] FS: 00007f85db661980(0000) GS:ffff955a79d00000(0000) knlGS:0000000000000000 [ 86.324291] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 86.325314] CR2: 0000000000000092 CR3: 000000002f13a000 CR4: 0000000000350ef0 [ 86.326589] Call Trace: [ 86.327036] <TASK> [ 86.327434] ? show_regs (/build/work/knet/arch/x86/kernel/dumpstack.c:479) [ 86.328049] ? __die (/build/work/knet/arch/x86/kernel/dumpstack.c:421 /build/work/knet/arch/x86/kernel/dumpstack.c:434) [ 86.328508] ? page_fault_oops (/build/work/knet/arch/x86/mm/fault.c:707) [ 86.329107] ? do_user_addr_fault (/build/work/knet/arch/x86/mm/fault.c:1264) [ 86.329756] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.330350] ? __irq_work_queue_local (/build/work/knet/kernel/irq_work.c:111 (discriminator 1)) [ 86.331013] ? exc_page_fault (/build/work/knet/./arch/x86/include/asm/paravirt.h:693 /build/work/knet/arch/x86/mm/fault.c:1515 /build/work/knet/arch/x86/mm/fault.c:1563) [ 86.331702] ? asm_exc_page_fault (/build/work/knet/./arch/x86/include/asm/idtentry.h:570) [ 86.332468] ? ip_mr_forward (/build/work/knet/net/ipv4/ipmr.c:1985) [ 86.333183] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.333920] ipmr_mfc_add (/build/work/knet/./include/linux/rcupdate.h:782 /build/work/knet/net/ipv4/ipmr.c:1009 /build/work/knet/net/ipv4/ipmr.c:1273) [ 86.334583] ? __pfx_ipmr_hash_cmp (/build/work/knet/net/ipv4/ipmr.c:363) [ 86.335357] ip_mroute_setsockopt (/build/work/knet/net/ipv4/ipmr.c:1470) [ 86.336135] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.336854] ? ip_mroute_setsockopt (/build/work/knet/net/ipv4/ipmr.c:1470) [ 86.337679] do_ip_setsockopt (/build/work/knet/net/ipv4/ip_sockglue.c:944) [ 86.338408] ? __pfx_unix_stream_read_actor (/build/work/knet/net/unix/af_unix.c:2862) [ 86.339232] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.339809] ? aa_sk_perm (/build/work/knet/security/apparmor/include/cred.h:153 /build/work/knet/security/apparmor/net.c:181) [ 86.340342] ip_setsockopt (/build/work/knet/net/ipv4/ip_sockglue.c:1415) [ 86.340859] raw_setsockopt (/build/work/knet/net/ipv4/raw.c:836) [ 86.341408] ? security_socket_setsockopt (/build/work/knet/security/security.c:4561 (discriminator 13)) [ 86.342116] sock_common_setsockopt (/build/work/knet/net/core/sock.c:3716) [ 86.342747] do_sock_setsockopt (/build/work/knet/net/socket.c:2313) [ 86.343363] __sys_setsockopt (/build/work/knet/./include/linux/file.h:32 /build/work/knet/net/socket.c:2336) [ 86.344020] __x64_sys_setsockopt (/build/work/knet/net/socket.c:2340) [ 86.344766] do_syscall_64 (/build/work/knet/arch/x86/entry/common.c:52 /build/work/knet/arch/x86/entry/common.c:83) [ 86.345433] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.346161] ? syscall_exit_work (/build/work/knet/./include/linux/audit.h:357 /build/work/knet/kernel/entry/common.c:160) [ 86.346938] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.347657] ? syscall_exit_to_user_mode (/build/work/knet/kernel/entry/common.c:215) [ 86.348538] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.349262] ? do_syscall_64 (/build/work/knet/./arch/x86/include/asm/cpufeature.h:171 /build/work/knet/arch/x86/entry/common.c:98) [ 86.349971] entry_SYSCALL_64_after_hwframe (/build/work/knet/arch/x86/entry/entry_64.S:129) The original packet in ipmr_cache_report() may be queued and then forwarded with ip_mr_forward(). This last function has the assumption that the skb dst is set. After the below commit, the skb dst is dropped by ipv4_pktinfo_prepare(), which causes the oops. Fixes: `bb7403655b` ("ipmr: support IP_PKTINFO on cache report IGMP msg") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240125141847.1931933-1-nicolas.dichtel@6wind.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-01-26 21:05:26 -08:00
Zhengchao Shao	b4c1d4d973	fib: remove unnecessary input parameters in fib_default_rule_add When fib_default_rule_add is invoked, the value of the input parameter 'flags' is always 0. Rules uses kzalloc to allocate memory, so 'flags' has been initialized to 0. Therefore, remove the input parameter 'flags' in fib_default_rule_add. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240102071519.3781384-1-shaozhengchao@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-01-03 16:42:48 -08:00

1 2 3 4 5 ...

429 Commits