linux/drivers
Daniel Borkmann fcc9c69a6e ipvlan, l3mdev: fix broken l3s mode wrt local routes
[ Upstream commit d5256083f6 ]

While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
I ran into the issue that while l3 mode is working fine, l3s mode
does not have any connectivity to kube-apiserver and hence all pods
end up in Error state as well. The ipvlan master device sits on
top of a bond device and hostns traffic to kube-apiserver (also running
in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
where the latter is the address of the bond0. While in l3 mode, a
curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
works fine from hostns, neither of them do in case of l3s. In the
latter only a curl to https://127.0.0.1:37573 appeared to work where
for local addresses of bond0 I saw kernel suddenly starting to emit
ARP requests to query HW address of bond0 which remained unanswered
and neighbor entries in INCOMPLETE state. These ARP requests only
happen while in l3s.

Debugging this further, I found the issue is that l3s mode is piggy-
backing on l3 master device, and in this case local routes are using
l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
f5a0aab84b ("net: ipv4: dst for local input routes should use l3mdev
if relevant") and 5f02ce24c2 ("net: l3mdev: Allow the l3mdev to be
a loopback"). I found that reverting them back into using the
net->loopback_dev fixed ipvlan l3s connectivity and got everything
working for the CNI.

Now judging from 4fbae7d83c ("ipvlan: Introduce l3s mode") and the
l3mdev paper in [0], the sole reason why ipvlan l3s is relying
on l3 master device is to get the l3mdev_ip_rcv() receive hook for
setting the dst entry of the input route without adding its own
ipvlan specific hacks into the receive path, however, any l3 domain
semantics beyond just that are breaking l3s operation. Note that
ipvlan also has the ability to dynamically switch its internal
operation from l3 to l3s for all ports via ipvlan_set_port_mode()
at runtime. In any case, l3 vs l3s solely distinguishes itself by
'de-confusing' netfilter through switching skb->dev to ipvlan slave
device late in NF_INET_LOCAL_IN before handing the skb to L4.

Minimal fix taken here is to add an IFF_L3MDEV_RX_HANDLER flag which,
if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
without any additional l3mdev semantics on top. This should also have
minimal impact since dev->priv_flags is already hot in cache. With
this set, l3s mode is working fine and I also get things like
masquerading pod traffic on the ipvlan master properly working.
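
The idea of the fix can be illustrated with a small userspace model. This is
only a sketch, not the kernel code: the struct, the `sk_` prefixed helpers and
the flag bit positions are hypothetical stand-ins, and the master lookup is
simplified. It shows a device that carries only the RX-handler flag still
being selected for the receive hook, while not being reported as an l3 master,
which is the case where local routes would keep using the loopback device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit positions; the real values live in the kernel headers. */
#define SK_IFF_L3MDEV_MASTER     (1u << 0)
#define SK_IFF_L3MDEV_SLAVE      (1u << 1)
#define SK_IFF_L3MDEV_RX_HANDLER (1u << 2)

/* Simplified stand-in for a network device (assumption, not struct net_device). */
struct sk_net_device {
	unsigned int priv_flags;
	struct sk_net_device *upper;	/* models the master upper device */
};

static bool sk_is_l3_master(const struct sk_net_device *dev)
{
	return dev->priv_flags & SK_IFF_L3MDEV_MASTER;
}

static bool sk_is_l3_slave(const struct sk_net_device *dev)
{
	return dev->priv_flags & SK_IFF_L3MDEV_SLAVE;
}

static bool sk_has_l3_rx_handler(const struct sk_net_device *dev)
{
	return dev->priv_flags & SK_IFF_L3MDEV_RX_HANDLER;
}

/* Device whose receive hook would run for a packet arriving on dev, or NULL. */
const struct sk_net_device *sk_l3_rcv_target(const struct sk_net_device *dev)
{
	if (sk_is_l3_slave(dev))
		return dev->upper;
	if (sk_is_l3_master(dev) || sk_has_l3_rx_handler(dev))
		return dev;
	return NULL;
}

/*
 * Device that full l3 domain semantics would apply to, or NULL.
 * The RX-handler flag alone deliberately does not count here.
 */
const struct sk_net_device *sk_l3_master_dev(const struct sk_net_device *dev)
{
	if (sk_is_l3_master(dev))
		return dev;
	if (sk_is_l3_slave(dev) && dev->upper && sk_is_l3_master(dev->upper))
		return dev->upper;
	return NULL;
}

/* Small self-check of the two cases discussed in the text. */
void sk_demo(void)
{
	struct sk_net_device ipvlan_master = { SK_IFF_L3MDEV_RX_HANDLER, NULL };
	struct sk_net_device plain = { 0, NULL };

	/* RX hook is taken, but no l3 master is reported. */
	assert(sk_l3_rcv_target(&ipvlan_master) == &ipvlan_master);
	assert(sk_l3_master_dev(&ipvlan_master) == NULL);

	/* A device without any of the flags gets neither. */
	assert(sk_l3_rcv_target(&plain) == NULL);
	assert(sk_l3_master_dev(&plain) == NULL);
}
```

Keeping the two predicates separate is the point of the sketch: the receive
path checks one extra flag bit, while every other consumer of the "is this an
l3 master" question is unaffected.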

  [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf

Fixes: f5a0aab84b ("net: ipv4: dst for local input routes should use l3mdev if relevant")
Fixes: 5f02ce24c2 ("net: l3mdev: Allow the l3mdev to be a loopback")
Fixes: 4fbae7d83c ("ipvlan: Introduce l3s mode")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: David Ahern <dsa@cumulusnetworks.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Martynas Pumputis <m@lambda.lt>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-06 17:30:06 +01:00
accessibility
acpi acpi/nfit: Fix command-supported detection 2019-01-31 08:14:37 +01:00
amba
android binder: fix race that allows malicious free of live buffer 2018-12-05 19:32:11 +01:00
ata libata: whitelist all SAMSUNG MZ7KM* solid-state disks 2018-12-21 14:15:20 +01:00
atm
auxdisplay auxdisplay: charlcd: fix x/y command parsing 2019-01-13 09:51:03 +01:00
base sysfs: Disable lockdep for driver bind/unbind files 2019-01-26 09:32:42 +01:00
bcma
block nbd: Use set_blocksize() to set device blocksize 2019-01-22 21:40:38 +01:00
bluetooth Bluetooth: btusb: Add support for Intel bluetooth device 8087:0029 2019-01-26 09:32:42 +01:00
bus Merge branch 'perm-fix' into omap-for-v4.19/fixes-v2 2018-08-28 09:58:03 -07:00
cdrom cdrom: fix improper type cast, which can leat to information leak. 2018-11-21 09:19:12 +01:00
char char/mwave: fix potential Spectre v1 vulnerability 2019-01-31 08:14:36 +01:00
clk clk: socfpga: stratix10: fix naming convention for the fixed-clocks 2019-01-31 08:14:34 +01:00
clocksource clocksource/drivers/integrator-ap: Add missing of_node_put() 2019-01-26 09:32:42 +01:00
connector
cpufreq cpufreq: scmi: Fix frequency invariance in slow path 2019-01-16 22:04:29 +01:00
cpuidle powerpc/pseries/cpuidle: Fix preempt warning 2019-01-26 09:32:37 +01:00
crypto crypto: talitos - fix ablkcipher for CONFIG_VMAP_STACK 2019-01-22 21:40:32 +01:00
dax mm, devm_memremap_pages: fix shutdown handling 2019-01-13 09:51:04 +01:00
dca
devfreq
dio
dma dmaengine: cppi41: delete channel from pending list when stop channel 2018-12-13 09:16:20 +01:00
dma-buf
edac EDAC, skx_edac: Fix logical channel intermediate decoding 2018-11-13 11:08:44 -08:00
eisa
extcon
firewire
firmware efi/libstub: Disable some warnings for x86{,_64} 2019-01-26 09:32:36 +01:00
fmc
fpga fpga: altera-cvp: fix probing for multiple FPGAs on the bus 2019-01-26 09:32:36 +01:00
fsi fsi: master-ast-cf: select GENERIC_ALLOCATOR 2018-12-17 09:24:35 +01:00
gnss gnss: sirf: fix activation retry handling 2018-12-13 09:16:22 +01:00
gpio gpio: pl061: Move irq_chip definition inside struct pl061 2019-01-26 09:32:33 +01:00
gpu drm/msm/gpu: fix building without debugfs 2019-02-06 17:30:06 +01:00
hid HID: ite: Add USB id match for another ITE based keyboard rfkill key quirk 2019-01-13 09:50:56 +01:00
hsi
hv Drivers: hv: vmbus: Check for ring when getting debug info 2019-01-31 08:14:36 +01:00
hwmon hwmon: (w83795) temp4_type has writable permission 2018-12-17 09:24:33 +01:00
hwspinlock
hwtracing intel_th: msu: Fix an off-by-one in attribute store 2019-01-13 09:51:10 +01:00
i2c i2c: dev: prevent adapter retries and timeout being set as minus value 2019-01-16 22:04:34 +01:00
ide ide: fix a typo in the settings proc file name 2019-01-31 08:14:42 +01:00
idle
iio iio: dac: ad5686: fix bit shift read register 2019-01-13 09:51:08 +01:00
infiniband IB/usnic: Fix potential deadlock 2019-01-26 09:32:42 +01:00
input Input: uinput - fix undefined behavior in uinput_validate_absinfo() 2019-01-31 08:14:37 +01:00
iommu iommu/vt-d: Handle domain agaw being less than iommu agaw 2019-01-13 09:51:09 +01:00
ipack
irqchip irqchip/gic-v3-its: Align PCI Multi-MSI allocation on their size 2019-01-31 08:14:39 +01:00
isdn isdn: fix kernel-infoleak in capi_unlocked_ioctl 2019-01-09 17:38:31 +01:00
leds leds: pwm: silently error out on EPROBE_DEFER 2019-01-13 09:51:08 +01:00
lightnvm lightnvm: pblk: fix race condition on metadata I/O 2018-11-13 11:08:21 -08:00
macintosh macintosh: therm_windtunnel: drop using attach_adapter 2018-08-24 14:42:42 +02:00
mailbox mailbox: PCC: handle parse error 2018-11-13 11:08:18 -08:00
mcb
md dm crypt: fix parsing of extended IV arguments 2019-01-31 08:14:38 +01:00
media media: venus: core: Set dma maximum segment size 2019-01-26 09:32:38 +01:00
memory memory: ti-aemif: fix a potential NULL-pointer dereference 2018-09-06 10:04:07 -07:00
memstick
message
mfd mfd: tps6586x: Handle interrupts on suspend 2019-01-22 21:40:33 +01:00
misc misc: ibmvsm: Fix potential NULL pointer dereference 2019-01-31 08:14:35 +01:00
mmc mmc: meson-gx: Free irq in release() callback 2019-01-31 08:14:36 +01:00
mtd mtd: rawnand: qcom: fix memory corruption that causes panic 2019-01-16 22:04:34 +01:00
mux mux: adgs1408: use the correct MODULE_LICENSE 2018-10-12 17:36:39 +02:00
net ipvlan, l3mdev: fix broken l3s mode wrt local routes 2019-02-06 17:30:06 +01:00
nfc NFC: nfcmrvl_uart: fix OF child-node lookup 2018-11-13 11:08:48 -08:00
ntb
nubus
nvdimm mm, devm_memremap_pages: fix shutdown handling 2019-01-13 09:51:04 +01:00
nvme nvmet-rdma: fix null dereference under heavy load 2019-01-31 08:14:41 +01:00
nvmem nvmem: check the return value of nvmem_add_cells() 2018-11-13 11:08:35 -08:00
of of: overlay: add missing of_node_put() after add new node to changeset 2019-01-26 09:32:34 +01:00
opp opp: ti-opp-supply: Correct the supply in _get_optimal_vdd_voltage call 2018-12-01 09:37:27 +01:00
oprofile
parisc
parport
pci PCI: dwc: Move interrupt acking into the proper callback 2019-01-16 22:04:35 +01:00
pcmcia pcmcia: Implement CLKRUN protocol disabling for Ricoh bridges 2018-11-13 11:08:17 -08:00
perf drivers/perf: hisi: Fixup one DDRC PMU register offset 2019-01-13 09:51:10 +01:00
phy phy: qcom-qusb2: Fix HSTX_TRIM tuning with fused value for SDM845 2018-12-17 09:24:34 +01:00
pinctrl pinctrl: meson: fix pull enable register calculation 2019-01-13 09:50:54 +01:00
platform platform/x86: asus-wmi: Tell the EC the OS will handle the display off hotkey 2019-01-26 09:32:34 +01:00
pnp
power power: supply: olpc_battery: correct the temperature units 2019-01-13 09:51:10 +01:00
powercap
pps
ps3
ptp ptp: fix Spectre v1 vulnerability 2018-10-17 22:00:22 -07:00
pwm
rapidio
ras
regulator regulator: fix crash caused by null driver data 2018-09-20 09:04:51 -07:00
remoteproc remoteproc: qcom: q6v5: Propagate EPROBE_DEFER 2018-11-13 11:08:52 -08:00
reset ARM: SoC: late updates 2018-08-25 14:12:36 -07:00
rpmsg rpmsg: smd: fix memory leak on channel create 2018-11-13 11:08:55 -08:00
rtc rtc: m41t80: Correct alarm month range with RTC reads 2019-01-09 17:38:48 +01:00
s390 s390/smp: fix CPU hotplug deadlock with CPU rescan 2019-01-31 08:14:35 +01:00
sbus drivers/sbus/char: add of_node_put() 2018-12-21 14:15:17 +01:00
scsi scsi: ufs: Use explicit access size in ufshcd_dump_regs 2019-01-31 08:14:38 +01:00
sfi
sh
siox
slimbus slimbus: ngd: mark PM functions as __maybe_unused 2018-12-19 19:19:49 +01:00
sn
soc soc: ti: QMSS: Fix usage of irq_set_affinity_hint 2018-11-21 09:19:18 +01:00
soundwire soundwire: Fix acquiring bus lock twice during master release 2018-08-27 09:49:48 +05:30
spi spi: bcm2835: Unbreak the build of esoteric configs 2019-01-09 17:38:49 +01:00
spmi
ssb
staging staging: rtl8188eu: Add device code for D-Link DWA-121 rev B1 2019-01-31 08:14:36 +01:00
target scsi: target/core: Make sure that target_wait_for_sess_cmds() waits long enough 2019-01-26 09:32:38 +01:00
tc TC: Set DMA masks for devices 2018-11-13 11:08:51 -08:00
tee ARM: SoC driver updates 2018-08-23 13:52:46 -07:00
thermal thermal: armada: fix legacy validity test sense 2018-12-21 14:15:22 +01:00
thunderbolt thunderbolt: Prevent root port runtime suspend during NVM upgrade 2018-12-17 09:24:36 +01:00
tty vt: invoke notifier on screen size change 2019-01-31 08:14:40 +01:00
uio uio: Fix an Oops on load 2018-11-27 16:13:09 +01:00
usb usb: dwc3: gadget: Clear req->needs_extra_trb flag on cleanup 2019-01-31 08:14:42 +01:00
uwb
vfio vfio/type1: Fix unmap overflow off-by-one 2019-01-16 22:04:34 +01:00
vhost vhost: log dirty page correctly 2019-01-31 08:14:32 +01:00
video vgacon: unconfuse vc_origin when using soft scrollback 2019-01-31 08:14:36 +01:00
virt
virtio virtio, vhost: fixes, tweaks 2018-08-24 08:45:19 -07:00
visorbus
vlynq
vme
w1 w1: omap-hdq: fix missing bus unregister at removal 2018-11-13 11:08:48 -08:00
watchdog
xen xen: Fix x86 sched_clock() interface for xen 2019-01-22 21:40:32 +01:00
zorro
Kconfig
Makefile