API:
- Replace crypto_get_default_rng with crypto_stdrng_get_bytes.
- Remove simd skcipher support.
- Allow algorithm types to be disabled when CRYPTO_SELFTESTS is off.
Algorithms:
- Remove CPU-based des/3des acceleration.
- Add test vectors for authenc(hmac(md5),cbc(aes)).
- Add test vectors for authenc(hmac(md5),cbc(des)).
- Add test vectors for authenc(hmac(md5),rfc3686(ctr(aes))).
- Add test vectors for authenc(hmac(sha1),rfc3686(ctr(aes))).
- Add test vectors for authenc(hmac(sha224),rfc3686(ctr(aes))).
- Add test vectors for authenc(hmac(sha256),rfc3686(ctr(aes))).
- Add test vectors for authenc(hmac(sha384),rfc3686(ctr(aes))).
- Add test vectors for authenc(hmac(sha512),rfc3686(ctr(aes))).
- Replace spin lock with mutex in jitterentropy.
Drivers:
- Add authenc algorithms to safexcel.
- Add support for zstd in qat.
- Add wireless mode support for QAT GEN6.
- Add anti-rollback support for QAT GEN6.
- Add support for ctr(aes), gcm(aes), and ccm(aes) in dthev2.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmne7qgACgkQxycdCkmx
i6cm0w/9HNFzIWuZWh4Q8k1d/SX32/2p40EMvlw9QFO8wt0gsMtbk6NN5G3sIfhL
36+rT8Vo5yg9MahTqAspXKjP+QTev5D7/nsDa/FzOSA1JxyvBbgV7X33k8EZjcgT
+ffuh0WbaWlutYw07o2h4cNPz1Yp4M0hp2IdzvY0Y3q9D05eiwis1SQzUVPmTs6K
I6OP+4JjJbqubOgJxsltEoeCH9ZP0fObRWmAiVm6rwk9uX4CY32nzi3QOttXQ0su
4F/useoRwWQ1t7FTy8/fcVtFpL/G8hAFSQ4un5ODhDWL7taV5sZPXQBwXUuoVQM6
aNjZlaju/MB7gnAOrBvSsniohAAqRUNR8O7P8QW6mDrFmDhUZ3ZILmCKW+VwF5SG
a4fV94XgBVOnKIqD01cc++8mb6keX/88KJW79AEWLeJ9YZ9BuyFphr9OEBFAIHqx
xG+iEg4uoVxwC52//oGt/yZaZKK3C1y/Zey5bOjfErKq3ATXGIvawaAzdvB9mh6Q
iAnl71JpR4mrs++fAyUCKM+dfvdmQYDq6HJayMdg+IHAIeIvyMnPjsGigdVJvE65
RpBKW4aclfiYaDwX9Jf703mHR1uuKGP1GKpz8U+JXN4Ax2JPg0maC1N3wFkDypYO
HUNKgEk/173f1HTjU0JjbqvqJh+rKQ3ZbHpLxZrYtnSMukDwRO0=
=KoAB
-----END PGP SIGNATURE-----
Merge tag 'v7.1-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
"API:
- Replace crypto_get_default_rng with crypto_stdrng_get_bytes
- Remove simd skcipher support
- Allow algorithm types to be disabled when CRYPTO_SELFTESTS is off
Algorithms:
- Remove CPU-based des/3des acceleration
- Add test vectors for authenc(hmac(md5),cbc({aes,des})) and
authenc(hmac({md5,sha1,sha224,sha256,sha384,sha512}),rfc3686(ctr(aes)))
- Replace spin lock with mutex in jitterentropy
Drivers:
- Add authenc algorithms to safexcel
- Add support for zstd in qat
- Add wireless mode support for QAT GEN6
- Add anti-rollback support for QAT GEN6
- Add support for ctr(aes), gcm(aes), and ccm(aes) in dthev2"
* tag 'v7.1-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (129 commits)
crypto: af_alg - use sock_kmemdup in alg_setkey_by_key_serial
crypto: vmx - remove CRYPTO_DEV_VMX from Kconfig
crypto: omap - convert reqctx buffer to fixed-size array
crypto: atmel-sha204a - add Thorsten Blum as maintainer
crypto: atmel-ecc - add Thorsten Blum as maintainer
crypto: qat - fix IRQ cleanup on 6xxx probe failure
crypto: geniv - Remove unused spinlock from struct aead_geniv_ctx
crypto: qce - simplify qce_xts_swapiv()
crypto: hisilicon - Fix dma_unmap_single() direction
crypto: talitos - rename first/last to first_desc/last_desc
crypto: talitos - fix SEC1 32k ahash request limitation
crypto: jitterentropy - replace long-held spinlock with mutex
crypto: hisilicon - remove unused and non-public APIs for qm and sec
crypto: hisilicon/qm - drop redundant variable initialization
crypto: hisilicon/qm - remove else after return
crypto: hisilicon/qm - add const qualifier to info_name in struct qm_cmd_dump_item
crypto: hisilicon - fix the format string type error
crypto: ccree - fix a memory leak in cc_mac_digest()
crypto: qat - add support for zstd
crypto: qat - use swab32 macro
...
Since DES and Triple DES are obsolete, there is very little point in
maintining architecture-optimized code for them. Remove it.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Instead of exposing the x86-optimized SM3 code via an x86-specific
crypto_shash algorithm, instead just implement the sm3_blocks() library
function. This is much simpler, it makes the SM3 library functions be
x86-optimized, and it fixes the longstanding issue where the
x86-optimized SM3 code was disabled by default. SM3 still remains
available through crypto_shash, but individual architectures no longer
need to handle it.
Tweak the prototype of sm3_transform_avx() to match what the library
expects, including changing the block count to size_t. Note that the
assembly code actually already treated this argument as size_t.
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20260321040935.410034-10-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Remove the "ghash-pclmulqdqni" crypto_shash algorithm. Move the
corresponding assembly code into lib/crypto/, and wire it up to the
GHASH library.
This makes the GHASH library be optimized with x86's carryless
multiplication instructions. It also greatly reduces the amount of
x86-specific glue code that is needed, and it fixes the issue where this
GHASH optimization was disabled by default.
Rename and adjust the prototypes of the assembly functions to make them
fit better with the library. Remove the byte-swaps (pshufb
instructions) that are no longer necessary because the library keeps the
accumulator in POLYVAL format rather than GHASH format.
Rename clmul_ghash_mul() to polyval_mul_pclmul() to reflect that it
really does a POLYVAL style multiplication. Wire it up to both
ghash_mul_arch() and polyval_mul_arch().
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20260319061723.1140720-15-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Remove the "aes-aesni" crypto_cipher algorithm and the code specific to
its implementation. It is no longer necessary because the AES library
is now optimized with x86 AES-NI, and crypto/aes.c exposes the AES
library via the crypto_cipher API.
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20260112192035.10427-19-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Migrate the x86_64 implementations of NH into lib/crypto/. This makes
the nh() function be optimized on x86_64 kernels.
Note: this temporarily makes the adiantum template not utilize the
x86_64 optimized NH code. This is resolved in a later commit that
converts the adiantum template to use nh() instead of "nhpoly1305".
Link: https://lore.kernel.org/r/20251211011846.8179-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Migrate the x86_64 implementation of POLYVAL into lib/crypto/, wiring it
up to the POLYVAL library interface. This makes the POLYVAL library be
properly optimized on x86_64.
This drops the x86_64 optimizations of polyval in the crypto_shash API.
That's fine, since polyval will be removed from crypto_shash entirely
since it is unneeded there. But even if it comes back, the crypto_shash
API could just be implemented on top of the library API, as usual.
Adjust the names and prototypes of the assembly functions to align more
closely with the rest of the library code.
Also replace a movaps instruction with movups to remove the assumption
that the key struct is 16-byte aligned. Users can still align the key
if they want (and at least in this case, movups is just as fast as
movaps), but it's inconvenient to require it.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251109234726.638437-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
version supports the =@ccCOND syntax
- Remove a bunch of AS_* Kconfig symbols which detect assembler support for
various instruction mnemonics now that the minimum assembler version
supports them all
- The usual cleanups all over the place
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmjqQswACgkQEsHwGGHe
VUoFQg/9EoQ8TnWyzdTQ83+4sy1ePIgY+WyRPlDPmyoAjGN1WTT1NUY2JBeaW5CA
UVKJlaO2Nh/c5YypuJR2PtpPuJlNvRBLwpN3Lj+PiAhaYv8gcyeZg64c4MaRaTyc
yuoj5CaEhyQ16CDBPAjxDQ6+68YHjltlDSZainj77YWSzcBSflJCYH1RnNlCHiM9
ggBIoFmWltrCEDDW6d0Phl+Fh3K4tuYexRucIavgE+k4ZD+XqujWeLTaau837yW7
CMvN16elGorWGRBGiaRGH2sbrh8ruYPw4lr5DlFl7ApoBmxgK9s9peicUHtHQz4H
E9/c2XjGwVE4MtCI5IfeqG87DfojVeiWkXO30CMRalsFlbZzKs4JwalspIzgxH4s
m2tsfN++y9eC1b4a8EaSVWBk03xmmNWM7FqjC3LOMyV0aI9dqj/u36aadHMC/GsL
Rwl1GCnJnwu0Z7bho7L2qB0om4NOkX8H3uyzoOzDNC+RTKvgwumI0LpJBwrUrqW7
Ftf7hIc52hj94drN2RsVtvu3ueBNJF8SW4VJ13UJyZyJDnB4Os2wrI9aJ1vBam1e
md90pVVGjiXg/PhoCPDHPYzPs8oV2zNEJ0im/wNhkCH42yMAoIlbFDS77JghzSF2
sI9vMJVsLN7y/SbiysejTBG83j1dEPIpkC7oSzkYOZNNjCKRWWo=
=dW6J
-----END PGP SIGNATURE-----
Merge tag 'x86_cleanups_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:
- Simplify inline asm flag output operands now that the minimum
compiler version supports the =@ccCOND syntax
- Remove a bunch of AS_* Kconfig symbols which detect assembler support
for various instruction mnemonics now that the minimum assembler
version supports them all
- The usual cleanups all over the place
* tag 'x86_cleanups_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm: Remove code depending on __GCC_ASM_FLAG_OUTPUTS__
x86/sgx: Use ENCLS mnemonic in <kernel/cpu/sgx/encls.h>
x86/mtrr: Remove license boilerplate text with bad FSF address
x86/asm: Use RDPKRU and WRPKRU mnemonics in <asm/special_insns.h>
x86/idle: Use MONITORX and MWAITX mnemonics in <asm/mwait.h>
x86/entry/fred: Push __KERNEL_CS directly
x86/kconfig: Remove CONFIG_AS_AVX512
crypto: x86 - Remove CONFIG_AS_VPCLMULQDQ
crypto: X86 - Remove CONFIG_AS_VAES
crypto: x86 - Remove CONFIG_AS_GFNI
x86/kconfig: Drop unused and needless config X86_64_SMP
Reorganize the Curve25519 library code:
- Build a single libcurve25519 module, instead of up to three modules:
libcurve25519, libcurve25519-generic, and an arch-specific module.
- Move the arch-specific Curve25519 code from arch/$(SRCARCH)/crypto/ to
lib/crypto/$(SRCARCH)/. Centralize the build rules into
lib/crypto/Makefile and lib/crypto/Kconfig.
- Include the arch-specific code directly in lib/crypto/curve25519.c via
a header, rather than using a separate .c file.
- Eliminate the entanglement with CRYPTO. CRYPTO_LIB_CURVE25519 no
longer selects CRYPTO, and the arch-specific Curve25519 code no longer
depends on CRYPTO.
This brings Curve25519 in line with the latest conventions for
lib/crypto/, used by other algorithms. The exception is that I kept the
generic code in separate translation units for now. (Some of the
function names collide between the x86 and generic Curve25519 code. And
the Curve25519 functions are very long anyway, so inlining doesn't
matter as much for Curve25519 as it does for some other algorithms.)
Link: https://lore.kernel.org/r/20250906213523.84915-11-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Curve25519 is used only via the library API, not the crypto_kpp API. In
preparation for removing the unused crypto_kpp API for Curve25519,
remove the unused "curve25519-x86" kpp algorithm.
Note that the underlying x86_64 optimized Curve25519 code remains fully
supported and accessible via the library API.
It's also worth noting that even if the kpp support for Curve25519 comes
back later, there is no need for arch-specific kpp glue code like this,
as a single kpp algorithm that wraps the library API is sufficient.
Link: https://lore.kernel.org/r/20250906213523.84915-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Current minimum required version of binutils is 2.30, which supports GFNI
instruction mnemonics.
Remove check for assembler support of GFNI instructions and all relevant
macros for conditional compilation.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/20250819085855.333380-1-ubizjak@gmail.com
Instead of exposing the x86-optimized SHA-1 code via x86-specific
crypto_shash algorithms, instead just implement the sha1_blocks()
library function. This is much simpler, it makes the SHA-1 library
functions be x86-optimized, and it fixes the longstanding issue where
the x86-optimized SHA-1 code was disabled by default. SHA-1 still
remains available through crypto_shash, but individual architectures no
longer need to handle it.
To match sha1_blocks(), change the type of the nblocks parameter of the
assembly functions from int to size_t. The assembly functions actually
already treated it as size_t.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250712232329.818226-14-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Instead of exposing the x86-optimized SHA-512 code via x86-specific
crypto_shash algorithms, instead just implement the sha512_blocks()
library function. This is much simpler, it makes the SHA-512 (and
SHA-384) library functions be x86-optimized, and it fixes the
longstanding issue where the x86-optimized SHA-512 code was disabled by
default. SHA-512 still remains available through crypto_shash, but
individual architectures no longer need to handle it.
To match sha512_blocks(), change the type of the nblocks parameter of
the assembly functions from int to size_t. The assembly functions
actually already treated it as size_t.
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250630160320.2888-15-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Instead of providing crypto_shash algorithms for the arch-optimized
SHA-256 code, instead implement the SHA-256 library. This is much
simpler, it makes the SHA-256 library functions be arch-optimized, and
it fixes the longstanding issue where the arch-optimized SHA-256 was
disabled by default. SHA-256 still remains available through
crypto_shash, but individual architectures no longer need to handle it.
To match sha256_blocks_arch(), change the type of the nblocks parameter
of the assembly functions from int to size_t. The assembly functions
actually already treated it as size_t.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Continue disentangling the crypto library functions from the generic
crypto infrastructure by moving the x86 BLAKE2s, ChaCha, and Poly1305
library functions into a new directory arch/x86/lib/crypto/ that does
not depend on CRYPTO. This mirrors the distinction between crypto/ and
lib/crypto/.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
arch/x86/crypto/Kconfig is sourced only when CONFIG_X86=y, so there is
no need for the symbols defined inside it to depend on X86.
In the case of CRYPTO_TWOFISH_586 and CRYPTO_TWOFISH_X86_64, the
dependency was actually on '(X86 || UML_X86)', which suggests that these
two symbols were intended to be available under user-mode Linux as well.
Yet, again these symbols were defined only when CONFIG_X86=y, so that
was not the case. Just remove this redundant dependency.
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The x86 Poly1305 code never falls back to the generic code, so selecting
CRYPTO_LIB_POLY1305_GENERIC is unnecessary.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Since crypto/poly1305.c now registers a poly1305-$(ARCH) shash algorithm
that uses the architecture's Poly1305 library functions, individual
architectures no longer need to do the same. Therefore, remove the
redundant shash algorithm from the arch-specific code and leave just the
library functions there.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Since crypto/chacha.c now registers chacha20-$(ARCH), xchacha20-$(ARCH),
and xchacha12-$(ARCH) skcipher algorithms that use the architecture's
ChaCha and HChaCha library functions, individual architectures no longer
need to do the same. Therefore, remove the redundant skcipher
algorithms and leave just the library functions.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Current minimum required version of binutils is 2.25,
which supports AVX-512 instruction mnemonics.
Remove check for assembler support of AVX-512 instructions
and all relevant macros for conditional compilation.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stop wrapping skcipher and aead algorithms with the crypto SIMD helper
(crypto/simd.c). The only purpose of doing so was to work around x86
not always supporting kernel-mode FPU in softirqs. Specifically, if a
hardirq interrupted a task context kernel-mode FPU section and then a
softirqs were run at the end of that hardirq, those softirqs could not
use kernel-mode FPU. This has now been fixed. In combination with the
fact that the skcipher and aead APIs only support task and softirq
contexts, these can now just use kernel-mode FPU unconditionally on x86.
This simplifies the code and improves performance.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stop wrapping skcipher and aead algorithms with the crypto SIMD helper
(crypto/simd.c). The only purpose of doing so was to work around x86
not always supporting kernel-mode FPU in softirqs. Specifically, if a
hardirq interrupted a task context kernel-mode FPU section and then a
softirqs were run at the end of that hardirq, those softirqs could not
use kernel-mode FPU. This has now been fixed. In combination with the
fact that the skcipher and aead APIs only support task and softirq
contexts, these can now just use kernel-mode FPU unconditionally on x86.
This simplifies the code and improves performance.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stop wrapping skcipher and aead algorithms with the crypto SIMD helper
(crypto/simd.c). The only purpose of doing so was to work around x86
not always supporting kernel-mode FPU in softirqs. Specifically, if a
hardirq interrupted a task context kernel-mode FPU section and then a
softirqs were run at the end of that hardirq, those softirqs could not
use kernel-mode FPU. This has now been fixed. In combination with the
fact that the skcipher and aead APIs only support task and softirq
contexts, these can now just use kernel-mode FPU unconditionally on x86.
This simplifies the code and improves performance.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stop wrapping skcipher and aead algorithms with the crypto SIMD helper
(crypto/simd.c). The only purpose of doing so was to work around x86
not always supporting kernel-mode FPU in softirqs. Specifically, if a
hardirq interrupted a task context kernel-mode FPU section and then a
softirqs were run at the end of that hardirq, those softirqs could not
use kernel-mode FPU. This has now been fixed. In combination with the
fact that the skcipher and aead APIs only support task and softirq
contexts, these can now just use kernel-mode FPU unconditionally on x86.
This simplifies the code and improves performance.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stop wrapping skcipher and aead algorithms with the crypto SIMD helper
(crypto/simd.c). The only purpose of doing so was to work around x86
not always supporting kernel-mode FPU in softirqs. Specifically, if a
hardirq interrupted a task context kernel-mode FPU section and then a
softirqs were run at the end of that hardirq, those softirqs could not
use kernel-mode FPU. This has now been fixed. In combination with the
fact that the skcipher and aead APIs only support task and softirq
contexts, these can now just use kernel-mode FPU unconditionally on x86.
This simplifies the code and improves performance.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stop wrapping skcipher and aead algorithms with the crypto SIMD helper
(crypto/simd.c). The only purpose of doing so was to work around x86
not always supporting kernel-mode FPU in softirqs. Specifically, if a
hardirq interrupted a task context kernel-mode FPU section and then a
softirqs were run at the end of that hardirq, those softirqs could not
use kernel-mode FPU. This has now been fixed. In combination with the
fact that the skcipher and aead APIs only support task and softirq
contexts, these can now just use kernel-mode FPU unconditionally on x86.
This simplifies the code and improves performance.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stop wrapping skcipher and aead algorithms with the crypto SIMD helper
(crypto/simd.c). The only purpose of doing so was to work around x86
not always supporting kernel-mode FPU in softirqs. Specifically, if a
hardirq interrupted a task context kernel-mode FPU section and then a
softirqs were run at the end of that hardirq, those softirqs could not
use kernel-mode FPU. This has now been fixed. In combination with the
fact that the skcipher and aead APIs only support task and softirq
contexts, these can now just use kernel-mode FPU unconditionally on x86.
This simplifies the code and improves performance.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stop wrapping skcipher and aead algorithms with the crypto SIMD helper
(crypto/simd.c). The only purpose of doing so was to work around x86
not always supporting kernel-mode FPU in softirqs. Specifically, if a
hardirq interrupted a task context kernel-mode FPU section and then a
softirqs were run at the end of that hardirq, those softirqs could not
use kernel-mode FPU. This has now been fixed. In combination with the
fact that the skcipher and aead APIs only support task and softirq
contexts, these can now just use kernel-mode FPU unconditionally on x86.
This simplifies the code and improves performance.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The ARCH_MAY_HAVE patch missed arm64, mips and s390. But it may
also lead to arch options being enabled but ineffective because
of modular/built-in conflicts.
As the primary user of all these options wireguard is selecting
the arch options anyway, make the same selections at the lib/crypto
option level and hide the arch options from the user.
Instead of selecting them centrally from lib/crypto, simply set
the default of each arch option as suggested by Eric Biggers.
Change the Crypto API generic algorithms to select the top-level
lib/crypto options instead of the generic one as otherwise there
is no way to enable the arch options (Eric Biggers). Introduce a
set of INTERNAL options to work around dependency cycles on the
CONFIG_CRYPTO symbol.
Fixes: 1047e21aec ("crypto: lib/Kconfig - Fix lib built-in failure when arch is modular")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Arnd Bergmann <arnd@kernel.org>
Closes: https://lore.kernel.org/oe-kbuild-all/202502232152.JC84YDLp-lkp@intel.com/
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The HAVE_ARCH Kconfig options in lib/crypto try to solve the
modular versus built-in problem, but it still fails when the
the LIB option (e.g., CRYPTO_LIB_CURVE25519) is selected externally.
Fix this by introducing a level of indirection with ARCH_MAY_HAVE
Kconfig options, these then go on to select the ARCH_HAVE options
if the ARCH Kconfig options matches that of the LIB option.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501230223.ikroNDr1-lkp@intel.com/
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Move the x86 CRC-T10DIF assembly code into the lib directory and wire it
up to the library interface. This allows it to be used without going
through the crypto API. It remains usable via the crypto API too via
the shash algorithms that use the library interface. Thus all the
arch-specific "shash" code becomes unnecessary and is removed.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241202012056.209768-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Move the x86 CRC32 assembly code into the lib directory and wire it up
to the library interface. This allows it to be used without going
through the crypto API. It remains usable via the crypto API too via
the shash algorithms that use the library interface. Thus all the
arch-specific "shash" code becomes unnecessary and is removed.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20241202010844.144356-14-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Start using SSE4.1 instructions in the AES-NI AEGIS code, with the first
use case being preparing the length block in fewer instructions.
In practice this does not reduce the set of CPUs on which the code can
run, because all Intel and AMD CPUs with AES-NI also have SSE4.1.
Upgrade the existing SSE2 feature check to SSE4.1, though it seems this
check is not strictly necessary; the aesni-intel module has been getting
away with using SSE4.1 despite checking for AES-NI only.
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Update the kconfig help and module description to reflect that VAES
instructions are now used in some cases. Also fix XTR => XCTR.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add implementations of AES-GCM for x86_64 CPUs that support VAES (vector
AES), VPCLMULQDQ (vector carryless multiplication), and either AVX512 or
AVX10. There are two implementations, sharing most source code: one
using 256-bit vectors and one using 512-bit vectors. This patch
improves AES-GCM performance by up to 162%; see Tables 1 and 2 below.
I wrote the new AES-GCM assembly code from scratch, focusing on
correctness, performance, code size (both source and binary), and
documenting the source. The new assembly file aes-gcm-avx10-x86_64.S is
about 1200 lines including extensive comments, and it generates less
than 8 KB of binary code. The main loop does 4 vectors at a time, with
the AES and GHASH instructions interleaved. Any remainder is handled
using a simple 1 vector at a time loop, with masking.
Several VAES + AVX512 implementations of AES-GCM exist from Intel,
including one in OpenSSL and one proposed for inclusion in Linux in 2021
(https://lore.kernel.org/linux-crypto/1611386920-28579-6-git-send-email-megha.dey@intel.com/).
These aren't really suitable to be used, though, due to the massive
amount of binary code generated (696 KB for OpenSSL, 200 KB for Linux)
and well as the significantly larger amount of assembly source (4978
lines for OpenSSL, 1788 lines for Linux). Also, Intel's code does not
support 256-bit vectors, which makes it not usable on future
AVX10/256-only CPUs, and also not ideal for certain Intel CPUs that have
downclocking issues. So I ended up starting from scratch. Usually my
much shorter code is actually slightly faster than Intel's AVX512 code,
though it depends on message length and on which of Intel's
implementations is used; for details, see Tables 3 and 4 below.
To facilitate potential integration into other projects, I've
dual-licensed aes-gcm-avx10-x86_64.S under Apache-2.0 OR BSD-2-Clause,
the same as the recently added RISC-V crypto code.
The following two tables summarize the performance improvement over the
existing AES-GCM code in Linux that uses AES-NI and AVX2:
Table 1: AES-256-GCM encryption throughput improvement,
CPU microarchitecture vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake | 42% | 48% | 60% | 62% | 70% | 69% |
Intel Sapphire Rapids | 157% | 145% | 162% | 119% | 96% | 96% |
Intel Emerald Rapids | 156% | 144% | 161% | 115% | 95% | 100% |
AMD Zen 4 | 103% | 89% | 78% | 56% | 54% | 54% |
| 300 | 200 | 64 | 63 | 16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake | 66% | 48% | 49% | 70% | 53% |
Intel Sapphire Rapids | 80% | 60% | 41% | 62% | 38% |
Intel Emerald Rapids | 79% | 60% | 41% | 62% | 38% |
AMD Zen 4 | 51% | 35% | 27% | 32% | 25% |
Table 2: AES-256-GCM decryption throughput improvement,
CPU microarchitecture vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake | 42% | 48% | 59% | 63% | 67% | 71% |
Intel Sapphire Rapids | 159% | 145% | 161% | 125% | 102% | 100% |
Intel Emerald Rapids | 158% | 144% | 161% | 124% | 100% | 103% |
AMD Zen 4 | 110% | 95% | 80% | 59% | 56% | 54% |
| 300 | 200 | 64 | 63 | 16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake | 67% | 56% | 46% | 70% | 56% |
Intel Sapphire Rapids | 79% | 62% | 39% | 61% | 39% |
Intel Emerald Rapids | 80% | 62% | 40% | 58% | 40% |
AMD Zen 4 | 49% | 36% | 30% | 35% | 28% |
The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 4000 MB/s to 6000 MB/s would be
listed as 50%. They were collected by directly measuring the Linux
crypto API performance using a custom kernel module. Note that indirect
benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
include more overhead and won't see quite as much of a difference. All
these benchmarks used an associated data length of 16 bytes. Note that
AES-GCM is almost always used with short associated data lengths.
The following two tables summarize how the performance of my code
compares with Intel's AVX512 AES-GCM code, both the version that is in
OpenSSL and the version that was proposed for inclusion in Linux.
Neither version exists in Linux currently, but these are alternative
AES-GCM implementations that could be chosen instead of mine. I
collected the following numbers on Emerald Rapids using a userspace
benchmark program that calls the assembly functions directly.
I've also included a comparison with Cloudflare's AES-GCM implementation
from https://boringssl-review.googlesource.com/c/boringssl/+/65987/3.
Table 3: VAES-based AES-256-GCM encryption throughput in MB/s,
implementation name vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
---------------------+-------+-------+-------+-------+-------+-------+
This implementation | 14171 | 12956 | 12318 | 9588 | 7293 | 6449 |
AVX512_Intel_OpenSSL | 14022 | 12467 | 11863 | 9107 | 5891 | 6472 |
AVX512_Intel_Linux | 13954 | 12277 | 11530 | 8712 | 6627 | 5898 |
AVX512_Cloudflare | 12564 | 11050 | 10905 | 8152 | 5345 | 5202 |
| 300 | 200 | 64 | 63 | 16 |
---------------------+-------+-------+-------+-------+-------+
This implementation | 4939 | 3688 | 1846 | 1821 | 738 |
AVX512_Intel_OpenSSL | 4629 | 4532 | 2734 | 2332 | 1131 |
AVX512_Intel_Linux | 4035 | 2966 | 1567 | 1330 | 639 |
AVX512_Cloudflare | 3344 | 2485 | 1141 | 1127 | 456 |
Table 4: VAES-based AES-256-GCM decryption throughput in MB/s,
implementation name vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
---------------------+-------+-------+-------+-------+-------+-------+
This implementation | 14276 | 13311 | 13007 | 11086 | 8268 | 8086 |
AVX512_Intel_OpenSSL | 14067 | 12620 | 12421 | 9587 | 5954 | 7060 |
AVX512_Intel_Linux | 14116 | 12795 | 11778 | 9269 | 7735 | 6455 |
AVX512_Cloudflare | 13301 | 12018 | 11919 | 9182 | 7189 | 6726 |
| 300 | 200 | 64 | 63 | 16 |
---------------------+-------+-------+-------+-------+-------+
This implementation | 6454 | 5020 | 2635 | 2602 | 1079 |
AVX512_Intel_OpenSSL | 5184 | 5799 | 2957 | 2545 | 1228 |
AVX512_Intel_Linux | 4394 | 4247 | 2235 | 1635 | 922 |
AVX512_Cloudflare | 4289 | 3851 | 1435 | 1417 | 574 |
So, usually my code is actually slightly faster than Intel's code,
though the OpenSSL implementation has a slight edge on messages shorter
than 256 bytes in this microbenchmark. (This also holds true when doing
the same tests on AMD Zen 4.) It can be seen that the large code size
(up to 94x larger!) of the Intel implementations doesn't seem to bring
much benefit, so starting from scratch with much smaller code, as I've
done, seems appropriate. The performance of my code on messages shorter
than 256 bytes could be improved through a limited amount of unrolling,
but it's unclear it would be worth it, given code size considerations
(e.g. caches) that don't get measured in microbenchmarks.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The minimum version of binutils for kernel build is currently 2.23 and
it doesn't support GFNI.
So, it fails to build the aria-avx512 if the old binutils is used.
aria-avx512 requires GFNI, so it should not be allowed to build if the
old binutils is used.
The AS_AVX512 and AS_GFNI are added to the Kconfig to disable build
aria-avx512 if the old binutils is used.
Fixes: c970d42001 ("crypto: x86/aria - implement aria-avx512")
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
aria-avx512 implementation uses AVX512 and GFNI.
It supports 64way parallel processing.
So, byteslicing code is changed to support 64way parallel.
And it exports some aria-avx2 functions such as encrypt() and decrypt().
AVX and AVX2 have 16 registers.
They should use memory to store/load state because of lack of registers.
But AVX512 supports 32 registers.
So, it doesn't require store/load in the s-box layer.
It means that it can reduce overhead of store/load in the s-box layer.
Also code become much simpler.
Benchmark with modprobe tcrypt mode=610 num_mb=8192, i3-12100:
ARIA-AVX512(128bit and 256bit)
testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption
tcrypt: 1 operation in 1504 cycles (1024 bytes)
tcrypt: 1 operation in 4595 cycles (4096 bytes)
tcrypt: 1 operation in 1763 cycles (1024 bytes)
tcrypt: 1 operation in 5540 cycles (4096 bytes)
testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption
tcrypt: 1 operation in 1502 cycles (1024 bytes)
tcrypt: 1 operation in 4615 cycles (4096 bytes)
tcrypt: 1 operation in 1759 cycles (1024 bytes)
tcrypt: 1 operation in 5554 cycles (4096 bytes)
ARIA-AVX2 with GFNI(128bit and 256bit)
testing speed of multibuffer ecb(aria) (ecb-aria-avx2) encryption
tcrypt: 1 operation in 2003 cycles (1024 bytes)
tcrypt: 1 operation in 5867 cycles (4096 bytes)
tcrypt: 1 operation in 2358 cycles (1024 bytes)
tcrypt: 1 operation in 7295 cycles (4096 bytes)
testing speed of multibuffer ecb(aria) (ecb-aria-avx2) decryption
tcrypt: 1 operation in 2004 cycles (1024 bytes)
tcrypt: 1 operation in 5956 cycles (4096 bytes)
tcrypt: 1 operation in 2409 cycles (1024 bytes)
tcrypt: 1 operation in 7564 cycles (4096 bytes)
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
aria-avx2 implementation uses AVX2, AES-NI, and GFNI.
It supports 32way parallel processing.
So, byteslicing code is changed to support 32way parallel.
And it exports some aria-avx functions such as encrypt() and decrypt().
There are two main logics, s-box layer and diffusion layer.
These codes are the same as aria-avx implementation.
But some instruction are exchanged because they don't support 256bit
registers.
Also, AES-NI doesn't support 256bit register.
So, aesenclast and aesdeclast are used twice like below:
vextracti128 $1, ymm0, xmm6;
vaesenclast xmm7, xmm0, xmm0;
vaesenclast xmm7, xmm6, xmm6;
vinserti128 $1, xmm6, ymm0, ymm0;
Benchmark with modprobe tcrypt mode=610 num_mb=8192, i3-12100:
ARIA-AVX2 with GFNI(128bit and 256bit)
testing speed of multibuffer ecb(aria) (ecb-aria-avx2) encryption
tcrypt: 1 operation in 2003 cycles (1024 bytes)
tcrypt: 1 operation in 5867 cycles (4096 bytes)
tcrypt: 1 operation in 2358 cycles (1024 bytes)
tcrypt: 1 operation in 7295 cycles (4096 bytes)
testing speed of multibuffer ecb(aria) (ecb-aria-avx2) decryption
tcrypt: 1 operation in 2004 cycles (1024 bytes)
tcrypt: 1 operation in 5956 cycles (4096 bytes)
tcrypt: 1 operation in 2409 cycles (1024 bytes)
tcrypt: 1 operation in 7564 cycles (4096 bytes)
ARIA-AVX with GFNI(128bit and 256bit)
testing speed of multibuffer ecb(aria) (ecb-aria-avx) encryption
tcrypt: 1 operation in 2761 cycles (1024 bytes)
tcrypt: 1 operation in 9390 cycles (4096 bytes)
tcrypt: 1 operation in 3401 cycles (1024 bytes)
tcrypt: 1 operation in 11876 cycles (4096 bytes)
testing speed of multibuffer ecb(aria) (ecb-aria-avx) decryption
tcrypt: 1 operation in 2735 cycles (1024 bytes)
tcrypt: 1 operation in 9424 cycles (4096 bytes)
tcrypt: 1 operation in 3369 cycles (1024 bytes)
tcrypt: 1 operation in 11954 cycles (4096 bytes)
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The implementation is based on the 32-bit implementation of the aria.
Also, aria-avx process steps are the similar to the camellia-avx.
1. Byteslice(16way)
2. Add-round-key.
3. Sbox
4. Diffusion layer.
Except for s-box, all steps are the same as the aria-generic
implementation. s-box step is very similar to camellia and
sm4 implementation.
There are 2 implementations for s-box step.
One is to use AES-NI and affine transformation, which is the same as
Camellia, sm4, and others.
Another is to use GFNI.
GFNI implementation is faster than AES-NI implementation.
So, it uses GFNI implementation if the running CPU supports GFNI.
There are 4 s-boxes in the ARIA and the 2 s-boxes are the same as
AES's s-boxes.
To calculate the first sbox, it just uses the aesenclast and then
inverts shift_row.
No more process is needed for this job because the first s-box is
the same as the AES encryption s-box.
To calculate the second sbox(invert of s1), it just uses the aesdeclast
and then inverts shift_row.
No more process is needed for this job because the second s-box is
the same as the AES decryption s-box.
To calculate the third s-box, it uses the aesenclast,
then affine transformation, which is combined AES inverse affine and
ARIA S2.
To calculate the last s-box, it uses the aesdeclast,
then affine transformation, which is combined X2 and AES forward affine.
The optimized third and last s-box logic and GFNI s-box logic are
implemented by Jussi Kivilinna.
The aria-generic implementation is based on a 32-bit implementation,
not an 8-bit implementation. the aria-avx Diffusion Layer implementation
is based on aria-generic implementation because 8-bit implementation is
not fit for parallel implementation but 32-bit is enough to fit for this.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Shorten menu titles and make them consistent:
- acronym
- name
- architecture features in parenthesis
- no suffixes like "<something> algorithm", "support", or
"hardware acceleration", or "optimized"
Simplify help text descriptions, update references, and ensure that
https references are still valid.
Signed-off-by: Robert Elliott <elliott@hpe.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Shorten menu titles and make them consistent:
- acronym
- name
- architecture features in parenthesis
- no suffixes like "<something> algorithm", "support", or
"hardware acceleration", or "optimized"
Simplify help text descriptions, update references, and ensure that
https references are still valid.
Signed-off-by: Robert Elliott <elliott@hpe.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Shorten menu titles and make them consistent:
- acronym
- name
- architecture features in parenthesis
- no suffixes like "<something> algorithm", "support", or
"hardware acceleration", or "optimized"
Simplify help text descriptions, update references, and ensure that
https references are still valid.
Signed-off-by: Robert Elliott <elliott@hpe.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Shorten menu titles and make them consistent:
- acronym
- name
- architecture features in parenthesis
- no suffixes like "<something> algorithm", "support", or
"hardware acceleration", or "optimized"
Simplify help text descriptions, update references, and ensure that
https references are still valid.
Signed-off-by: Robert Elliott <elliott@hpe.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Shorten menu titles and make them consistent:
- acronym
- name
- architecture features in parenthesis
- no suffixes like "<something> algorithm", "support", or
"hardware acceleration", or "optimized"
Simplify help text descriptions, update references, and ensure that
https references are still valid.
Signed-off-by: Robert Elliott <elliott@hpe.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Move CPU-specific crypto/Kconfig entries to arch/xxx/crypto/Kconfig
and create a submenu for them under the Crypto API menu.
Suggested-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Robert Elliott <elliott@hpe.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>