linux/arch/arm
Eric Biggers 15023db7dc UPSTREAM: crypto: arm/blake2b - add NEON-accelerated BLAKE2b
Add a NEON-accelerated implementation of BLAKE2b.

On Cortex-A7 (which these days is the most common ARM processor that
doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as
SHA-256, and slightly faster than SHA-1.  It is also almost three times
as fast as the generic implementation of BLAKE2b:

	Algorithm            Cycles per byte (on 4096-byte messages)
	===================  =======================================
	blake2b-256-neon     14.0
	sha1-neon            16.3
	blake2s-256-arm      18.8
	sha1-asm             20.8
	blake2s-256-generic  26.0
	sha256-neon          28.9
	sha256-asm           32.0
	blake2b-256-generic  38.9
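
As a quick check of the comparisons above, the ratios can be recomputed
from the table.  This is only an illustrative Python sketch: the
cycles-per-byte figures are copied from the table, while the `CPB`
dictionary and the `speedup()` helper are names made up for this example
and are not part of the patch.

```python
# Cycles per byte on 4096-byte messages, copied from the table above.
CPB = {
    "blake2b-256-neon": 14.0,
    "sha1-neon": 16.3,
    "blake2s-256-arm": 18.8,
    "sha1-asm": 20.8,
    "blake2s-256-generic": 26.0,
    "sha256-neon": 28.9,
    "sha256-asm": 32.0,
    "blake2b-256-generic": 38.9,
}


def speedup(baseline: str, candidate: str) -> float:
    """Return how many times faster `candidate` is than `baseline`.

    Lower cycles per byte means faster, so the ratio is baseline / candidate.
    """
    return CPB[baseline] / CPB[candidate]


if __name__ == "__main__":
    for baseline in ("sha256-neon", "sha1-neon", "blake2b-256-generic"):
        ratio = speedup(baseline, "blake2b-256-neon")
        print(f"blake2b-256-neon vs {baseline}: {ratio:.2f}x")
```

With the figures above this gives about 2.06x over sha256-neon, 1.16x
over sha1-neon, and 2.78x over blake2b-256-generic.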

This implementation isn't directly based on any other implementation,
but it borrows some ideas from previous NEON code I've written as well
as from chacha-neon-core.S.  At least on Cortex-A7, it is faster than
the other NEON implementations of BLAKE2b I'm aware of (the
implementation in the BLAKE2 official repository using intrinsics, and
Andrew Moon's implementation which can be found in SUPERCOP).  It does
only one block at a time, so it performs well on short messages too.

NEON-accelerated BLAKE2b is useful because there is interest in using
BLAKE2b-256 for dm-verity on low-end Android devices (specifically,
devices that lack the ARMv8 Crypto Extensions) to replace SHA-1.  On
these devices, the performance cost of upgrading to SHA-256 may be
unacceptable, whereas BLAKE2b-256 would actually improve performance.

Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which
is intended for 32-bit platforms), on 32-bit ARM processors with NEON,
BLAKE2b is actually faster than BLAKE2s.  This is because NEON supports
64-bit operations, and because BLAKE2s's block size is too small for
NEON to be helpful for it.  The best I've been able to do with BLAKE2s
on Cortex-A7 is 18.8 cpb with an optimized scalar implementation.
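
To make the word-size point concrete, here is a sketch of the BLAKE2
mixing function G as specified in RFC 7693, parameterized by word size.
BLAKE2b works on 64-bit words with rotations (32, 24, 16, 63) and
128-byte blocks; BLAKE2s works on 32-bit words with rotations
(16, 12, 8, 7) and 64-byte blocks.  This is not the NEON code from this
patch, only a plain Python model; the names `rotr`, `g`, `BLAKE2B` and
`BLAKE2S` are chosen for this example, and the explicit masking stands
in for fixed-width registers because Python integers are unbounded.

```python
def rotr(x: int, n: int, bits: int) -> int:
    """Rotate the `bits`-wide word `x` right by `n` positions."""
    mask = (1 << bits) - 1
    return ((x >> n) | (x << (bits - n))) & mask


# (word size in bits, rotation constants) per RFC 7693.
BLAKE2B = (64, (32, 24, 16, 63))
BLAKE2S = (32, (16, 12, 8, 7))


def g(a, b, c, d, x, y, variant=BLAKE2B):
    """One application of the BLAKE2 mixing function G.

    a, b, c, d are four words of the working state; x and y are two
    message words.  Returns the updated (a, b, c, d).
    """
    bits, (r1, r2, r3, r4) = variant
    mask = (1 << bits) - 1
    a = (a + b + x) & mask
    d = rotr(d ^ a, r1, bits)
    c = (c + d) & mask
    b = rotr(b ^ c, r2, bits)
    a = (a + b + y) & mask
    d = rotr(d ^ a, r3, bits)
    c = (c + d) & mask
    b = rotr(b ^ c, r4, bits)
    return a, b, c, d


if __name__ == "__main__":
    # Mix a single set message bit into an all-zero BLAKE2b column.
    print([hex(w) for w in g(0, 0, 0, 0, 1, 0)])
```

Every step in G for BLAKE2b is a 64-bit add, XOR, or rotate, which is
the kind of operation the text above says NEON supports natively on
32-bit ARM.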

(I didn't try BLAKE2sp or BLAKE3, which in theory would be faster, but
they're more complex as they require running multiple hashes at once.
Note that BLAKE2b already uses all the NEON bandwidth on the Cortex-A7,
so I expect that any speedup from BLAKE2sp or BLAKE3 would come only
from the smaller number of rounds, not from the extra parallelism.)

For now this BLAKE2b implementation is only wired up to the shash API,
since there is no library API for BLAKE2b yet.  However, I've tried to
keep things consistent with BLAKE2s, e.g. by defining
blake2b_compress_arch() which is analogous to blake2s_compress_arch()
and could be exported for use by the library API later if needed.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

(cherry picked from commit 1862eb0073)
Bug: 178411248
Change-Id: I01557f0a63db0eb21778d0cc7b582ad89898d44a
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-02-23 08:06:20 +01:00
boot ARM: dts: lpc32xx: Revert set default clock rate of HCLK PLL 2021-02-17 11:02:24 +01:00
common ARM/sa1111: add a missing include of dma-map-ops.h 2020-10-20 09:40:33 +02:00
configs mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING 2020-12-06 10:19:07 -08:00
crypto UPSTREAM: crypto: arm/blake2b - add NEON-accelerated BLAKE2b 2021-02-23 08:06:20 +01:00
include Merge 5.10.17 into android12-5.10 2021-02-18 11:21:01 +01:00
kernel Merge 5.10.17 into android12-5.10 2021-02-18 11:21:01 +01:00
lib arm: propagate the calling convention changes down to csum_partial_copy_from_user() 2020-08-20 15:45:16 -04:00
mach-actions
mach-alpine
mach-artpec
mach-asm9260
mach-aspeed
mach-at91 ARM: at91: pm: remove unnecessary at91sam9x60_idle 2020-08-17 11:18:59 +02:00
mach-axxia
mach-bcm ARM: bcm: Enable BCM7038_L1_IRQ for ARCH_BRCMSTB 2020-08-17 09:20:34 -07:00
mach-berlin
mach-clps711x
mach-cns3xxx
mach-davinci ARM: SoC platform updates 2020-10-24 10:33:08 -07:00
mach-digicolor
mach-dove
mach-ebsa110
mach-efm32
mach-ep93xx treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
mach-exynos Samsung mach/soc changes for v5.10 2020-09-26 12:55:43 -07:00
mach-footbridge ARM: footbridge: fix dc21285 PCI configuration accessors 2021-02-10 09:29:20 +01:00
mach-gemini
mach-highbank dma-mapping: split <linux/dma-mapping.h> 2020-10-06 07:07:03 +02:00
mach-hisi ARM: hisi: add support for SD5203 SoC 2020-09-30 09:56:03 +08:00
mach-imx ARM: imx: build suspend-imx6.S with arm instruction set 2021-02-03 23:28:44 +01:00
mach-integrator
mach-iop32x
mach-ixp4xx ARM/ixp4xx: add a missing include of dma-map-ops.h 2020-10-13 13:28:22 +02:00
mach-keystone ARM: keystone: remove SECTION_SIZE_BITS/MAX_PHYSMEM_BITS 2020-12-07 15:32:04 +01:00
mach-lpc18xx
mach-lpc32xx
mach-mediatek
mach-meson
mach-milbeaut
mach-mmp treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
mach-moxart
mach-mstar ARM: mstar: Select MStar intc 2020-10-03 12:47:56 -07:00
mach-mv78xx0
mach-mvebu mvebu fixes for 5.9 (part 1) 2020-10-26 10:11:55 +01:00
mach-mxs
mach-nomadik
mach-npcm
mach-nspire
mach-omap1 ARM: OMAP1: OSK: fix ohci-omap breakage 2021-02-10 09:29:11 +01:00
mach-omap2 ARM: OMAP2+: Fix suspcious RCU usage splats for omap_enter_idle_coupled 2021-02-17 11:02:22 +01:00
mach-orion5x treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
mach-oxnas
mach-picoxcell
mach-prima2 ANDROID: ARM: prima2: Register with kernel restart handler 2020-08-12 12:30:27 -07:00
mach-pxa power: supply: gpio-charger: Convert to GPIO descriptors 2020-08-27 16:47:14 +02:00
mach-qcom
mach-rda
mach-realtek
mach-realview
mach-rockchip
mach-rpc treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
mach-s3c ARM: SoC platform updates 2020-10-24 10:33:08 -07:00
mach-s5pv210 ARM: s5pv210: use private pm save/restore 2020-08-19 21:33:11 +02:00
mach-sa1100 power: supply: gpio-charger: Convert to GPIO descriptors 2020-08-27 16:47:14 +02:00
mach-shmobile ARM: SoC platform updates 2020-10-24 10:33:08 -07:00
mach-socfpga ARM: socfpga: PM: add missing put_device() call in socfpga_setup_ocram_self_refresh() 2020-07-28 13:57:36 -05:00
mach-spear
mach-sti
mach-stm32 ARM: stm32: Replace HTTP links with HTTPS ones 2020-10-03 12:38:54 -07:00
mach-sunxi ARM: sunxi: Add machine match for the Allwinner V3 SoC 2020-11-02 10:28:14 +01:00
mach-tango
mach-tegra treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
mach-u300
mach-uniphier
mach-ux500
mach-versatile
mach-vexpress
mach-vt8500
mach-zx
mach-zynq
mm ARM: 9025/1: Kconfig: CPU_BIG_ENDIAN depends on !LD_IS_LLD 2021-02-07 15:37:13 +01:00
net
nwfpe
oprofile
plat-omap PM: AVS: smartreflex Move driver to soc specific drivers 2020-10-16 18:28:43 +02:00
plat-orion ARM: orion/gpio: Make use of for_each_requested_gpio() 2020-07-18 22:49:23 +02:00
plat-pxa
plat-versatile
probes ARM: 9019/1: kprobes: Avoid fortify_panic() when copying optprobe template 2020-10-27 12:11:51 +00:00
tools mm/madvise: introduce process_madvise() syscall: an external memory hinting API 2020-10-18 09:27:10 -07:00
vdso kbuild: explicitly specify the build id style 2020-10-09 23:57:30 +09:00
vfp ARM: 9044/1: vfp: use undef hook for VFP support detection 2020-12-30 11:54:02 +01:00
xen Merge 5.10.17 into android12-5.10 2021-02-18 11:21:01 +01:00
Kbuild ARM: 8981/1: add arch/arm/Kbuild 2020-07-21 16:33:35 +01:00
Kconfig kbuild: Hoist '--orphan-handling' into Kconfig 2020-12-01 22:45:36 +09:00
Kconfig-nommu
Kconfig.assembler ARM: 8991/1: use VFP assembler mnemonics if available 2020-07-21 16:33:39 +01:00
Kconfig.debug ARM: SoC platform updates 2020-10-24 10:33:08 -07:00
Makefile kbuild: Hoist '--orphan-handling' into Kconfig 2020-12-01 22:45:36 +09:00