Commit Graph

264 Commits

Author SHA1 Message Date
David Gow
5772f65352 x86/boot/e820: Re-enable BIOS fallback if e820 table is empty
In commit:

  157266edcc ("x86/boot/e820: Simplify append_e820_table() and remove restriction on single-entry tables")

the check on the number of entries in the e820 table was removed. The intention
was to support single-entry maps, but by removing the check entirely, we also
skip the fallback (to, e.g., the BIOS 88h function).

This means that if no E820 map is passed in from the bootloader (which is the
case on some bootloaders, like linld), we end up with an empty memory map, and
the kernel fails to boot (either by deadlocking on OOM, or by failing to
allocate the real mode trampoline, or similar).

Re-instate the check in append_e820_table(), but only check that nr_entries is
non-zero. This allows e820__memory_setup_default() to fall back to other memory
size sources, and doesn't affect e820__memory_setup_extended(), as the latter
ignores the return value from append_e820_table().

In doing so, we also update the return values to be proper error codes, with
-ENOENT for this case (there are no entries), and -EINVAL for the case where an
entry appears invalid. Given none of the callers check the actual value -- just
whether it's nonzero -- this is largely aesthetic in practice.

Tested against linld, and the kernel boots again fine.

[ mingo: Readability edits to the comment and the changelog. ]

Fixes: 157266edcc ("x86/boot/e820: Simplify append_e820_table() and remove restriction on single-entry tables")
Signed-off-by: David Gow <david@davidgow.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: stable@vger.kernel.org
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://patch.msgid.link/20260416065746.1896647-1-david@davidgow.net
2026-05-07 10:04:54 +02:00
Linus Torvalds
e812928be2 cxl changes for v7.0
- A set of commits that introduces cxl_memdev_attach and pave way for
   soft reserved handling, type2 accelerator enabling, and LSA 2.0
   enabling. All these series require the endpoint driver to settle
   before continuing the memdev driver probe.
 
 dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready
 cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation
 cxl/mem: Drop @host argument to devm_cxl_add_memdev()
 cxl/mem: Convert devm_cxl_add_memdev() to scope-based-cleanup
 cxl/port: Arrange for always synchronous endpoint attach
 cxl/mem: Arrange for always-synchronous memdev attach
 cxl/mem: Fix devm_cxl_memdev_edac_release() confusion
 
 - A set to address CXL port error protocol handling and reporting. The
   large patch series was split into 3 parts. Part 1 and 2 are included
   here with part 3 coming later. Part 1 consists of a series of code
   refactoring to PCI AER sub-system that addresses CXL and also CXL
   RAS code to prepare for port error handling. Part 2 refactors the
   CXL code to move management of component registers to cxl_port
   objects to allow all CXL AER errors to be handled through the
   cxl_port hierarchy.
 
 Part 2:
 cxl/port: Move endpoint component register management to cxl_port
 cxl/port: Map Port RAS registers
 cxl/port: Move dport RAS setup to dport add time
 cxl/port: Move dport probe operations to a driver event
 cxl/port: Move decoder setup before dport creation
 cxl/port: Cleanup dport removal with a devres group
 cxl/port: Reduce number of @dport variables in cxl_port_add_dport()
 cxl/port: Cleanup handling of the nr_dports 0 -> 1 transition
 
 Part 1:
 cxl: Update RAS handler interfaces to also support CXL Ports
 cxl/mem: Clarify @host for devm_cxl_add_nvdimm()
 PCI/AER: Update struct aer_err_info with kernel-doc formatting
 PCI/AER: Report CXL or PCIe bus type in AER trace logging
 PCI/AER: Use guard() in cxl_rch_handle_error_iter()
 PCI/AER: Move CXL RCH error handling to aer_cxl_rch.c
 PCI/AER: Update is_internal_error() to be non-static is_aer_internal_error()
 PCI/AER: Export pci_aer_unmask_internal_errors()
 cxl/pci: Move CXL driver's RCH error handling into core/ras_rch.c
 PCI/AER: Replace PCIEAER_CXL symbol with CXL_RAS
 cxl/pci: Remove CXL VH handling in CONFIG_PCIEAER_CXL conditional blocks from core/pci.c
 PCI: Replace cxl_error_is_native() with pcie_aer_is_native()
 cxl/pci: Remove unnecessary CXL RCH handling helper functions
 cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
 PCI: Introduce pcie_is_cxl()
 PCI: Update CXL DVSEC definitions
 PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
 
 - A set of patches to provide AMD Zen5 platform address translation for
   CXL using ACPI PRMT. Set includes a conventions document to explain
   why this is needed and how it's implemented.
 
 cxl: Disable HPA/SPA translation handlers for Normalized Addressing
 cxl/region: Factor out code into cxl_region_setup_poison()
 cxl/atl: Lock decoders that need address translation
 cxl: Enable AMD Zen5 address translation using ACPI PRMT
 cxl/acpi: Prepare use of EFI runtime services
 cxl: Introduce callback for HPA address ranges translation
 cxl/region: Use region data to get the root decoder
 cxl/region: Add @hpa_range argument to function cxl_calc_interleave_pos()
 cxl/region: Separate region parameter setup and region construction
 cxl: Simplify cxl_root_ops allocation and handling
 cxl/region: Store HPA range in struct cxl_region
 cxl/region: Store root decoder in struct cxl_region
 cxl/region: Rename misleading variable name @hpa to @hpa_range
 Documentation/driver-api/cxl: ACPI PRM Address Translation Support and AMD Zen5 enablement
 cxl, doc: Moving conventions in separate files
 cxl, doc: Remove isonum.txt inclusion
 
 - A set of misc CXL patches of fixes, cleanups, and updates. Including
   CXL address translation for unaligned MOD3 regions.
 
 cxl: Fix premature commit_end increment on decoder commit failure
 cxl/region: Use do_div() for 64-bit modulo operation
 cxl/region: Translate HPA to DPA and memdev in unaligned regions
 cxl/region: Translate DPA->HPA in unaligned MOD3 regions
 cxl/core: Fix cxl_dport debugfs EINJ entries
 cxl/acpi: Remove cxl_acpi_set_cache_size()
 cxl/hdm: Fix newline character in dev_err() messages
 cxl/pci: Remove outdated FIXME comment and BUILD_BUG_ON
 Documentation/driver-api/cxl: device hotplug section
 Documentation/driver-api/cxl: BIOS/EFI expectation update
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAmmOFXcACgkQYGjFFmlT
 OEojaxAApQJFLyX1MkPbhtm6j6GRzzEAEWTBX2XsmliZf1JhfahsNMWI69kO33rm
 LddF+nyZNEl/foyHgUaxVzlQwqWuihyp7Qk2djXnMzLsuCAsWhPbB9j0RgJUN8h5
 N4U76AmOdmhLlXH4CCqoW2jNy0OjxNdgp1FtTHv7VO7RxgRE9MFJRkLulKxB03wy
 t6lRZXPofEFcHen40DlYRtW26vy1BYUO0dng2f16DxWrb1ztdACH/zVqCJJtdoFc
 FAT5EaQCeRYZ9Yz4dONw3DcUjYlG6NcRN9FWNiptBn1Pb7pUX55Le8lfD3qZg0an
 m3lWRs1T/lGz7pWmz4GPUKDwGFCEqLqd4oSz5v+dFR3JJxjJpRzKa19y5TfqK/LF
 diqNZsDD9gCXE1HXzNr1YcbllpU2cPRPf58gWG9bLmG5xUUmScib8LoTMfgcCJW5
 SlC6kf7BFLkJfDTcFaILc/UANeZaLGhrV0vyJntfGyT5EqKOcfjQEvrZvofA8mef
 bdxt0IRDW4D+7kkcuR33OipTVUFG3ban8yYq4zXD64dmeHF76gwdJm3nyXsqdtpc
 IYIIhz0W6pbTKjJ2fy1rZcTac1ZaALstyaF4bYWIjyF3NylPM8tDi48DFr+DGgeX
 xkFs2B9p5vY5Cq73gCmSWsi3PBPTjWzeRp7YZrV6VoBd9uqewUs=
 =blFQ
 -----END PGP SIGNATURE-----

Merge tag 'cxl-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

Pull CXL updates from Dave Jiang:

 - Introduce cxl_memdev_attach and pave way for soft reserved handling,
   type2 accelerator enabling, and LSA 2.0 enabling. All these series
   require the endpoint driver to settle before continuing the memdev
   driver probe.

 - Address CXL port error protocol handling and reporting.

   The large patch series was split into three parts. The first two
   parts are included here with the final part coming later.

   The first part consists of a series of code refactoring to PCI AER
   sub-system that addresses CXL and also CXL RAS code to prepare for
   port error handling.

   The second part refactors the CXL code to move management of
   component registers to cxl_port objects to allow all CXL AER errors
   to be handled through the cxl_port hierarchy.

 - Provide AMD Zen5 platform address translation for CXL using ACPI
   PRMT. This includes a conventions document to explain why this is
   needed and how it's implemented.

 - Misc CXL patches of fixes, cleanups, and updates. Including CXL
   address translation for unaligned MOD3 regions.

[ TLA service: CXL is "Compute Express Link" ]

* tag 'cxl-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (59 commits)
  cxl: Disable HPA/SPA translation handlers for Normalized Addressing
  cxl/region: Factor out code into cxl_region_setup_poison()
  cxl/atl: Lock decoders that need address translation
  cxl: Enable AMD Zen5 address translation using ACPI PRMT
  cxl/acpi: Prepare use of EFI runtime services
  cxl: Introduce callback for HPA address ranges translation
  cxl/region: Use region data to get the root decoder
  cxl/region: Add @hpa_range argument to function cxl_calc_interleave_pos()
  cxl/region: Separate region parameter setup and region construction
  cxl: Simplify cxl_root_ops allocation and handling
  cxl/region: Store HPA range in struct cxl_region
  cxl/region: Store root decoder in struct cxl_region
  cxl/region: Rename misleading variable name @hpa to @hpa_range
  Documentation/driver-api/cxl: ACPI PRM Address Translation Support and AMD Zen5 enablement
  cxl, doc: Moving conventions in separate files
  cxl, doc: Remove isonum.txt inclusion
  cxl/port: Unify endpoint and switch port lookup
  cxl/port: Move endpoint component register management to cxl_port
  cxl/port: Map Port RAS registers
  cxl/port: Move dport RAS setup to dport add time
  ...
2026-02-12 16:33:05 -08:00
Dan Williams
bc62f5b308 dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready
Insert Soft Reserved memory into a dedicated soft_reserve_resource tree
instead of the iomem_resource tree at boot. Delay publishing these ranges
into the iomem hierarchy until ownership is resolved and the HMEM path
is ready to consume them.

Publishing Soft Reserved ranges into iomem too early conflicts with CXL
hotplug and prevents region assembly when those ranges overlap CXL
windows.

Follow up patches will reinsert Soft Reserved ranges into iomem after CXL
window publication is complete and HMEM is ready to claim the memory. This
provides a cleaner handoff between EFI-defined memory ranges and CXL
resource management without trimming or deleting resources later.

In the meantime "Soft Reserved" resources will no longer appear in
/proc/iomem, only their results. I.e. with "memmap=4G%4G+0xefffffff"

Before:
100000000-1ffffffff : Soft Reserved
  100000000-1ffffffff : dax1.0
    100000000-1ffffffff : System RAM (kmem)

After:
100000000-1ffffffff : dax1.0
  100000000-1ffffffff : System RAM (kmem)

The expectation is that this does not lead to a user visible regression
because the dax1.0 device is created in both instances.

Co-developed-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
[Smita: incorporate feedback from x86 maintainer review]
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Link: https://patch.msgid.link/20251120031925.87762-2-Smita.KoralahalliChannabasappa@amd.com
[djbw: cleanups and clarifications]
Link: https://lore.kernel.org/69443f707b025_1cee10022@dwillia2-mobl4.notmuch
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2026-01-16 09:02:36 -07:00
Ingo Molnar
6c08d768a5 x86/boot/e820: Use <linux/sizes.h> symbols for literals
Use the human-readable SZ_* constants.

Suggested-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/92a15c2d-055c-4f4e-b232-32030a8e5e54@suse.com
2025-12-14 09:24:18 +01:00
Ingo Molnar
0d9daff414 x86/boot/e820: Make sure e820_search_gap() finds all gaps
The current implementation of e820_search_gap() searches gaps
in a reverse search from MAX_GAP_END back to 0, contrary to
what its main comment claims:

    * Search for a gap in the E820 memory space from 0 to MAX_GAP_END (4GB).

But gaps can not only be beyond E820 RAM ranges, they can be below
them as well. For example this function will not find the proper
PCI gap for simplified memory map layouts that have a single RAM
range that crosses the 4GB boundary.

Rework the function to have a proper forward search of
E820 table entries.

This makes the code somewhat bigger:

   text       data        bss        dec        hex    filename
   7613      44072          0      51685       c9e5    e820.o.before
   7645      44072          0      51717       ca05    e820.o.after

but it now both implements what it claims to do, and is more
straightforward to read.

( This also allows 'idx' to be the regular u32 again, not an 'int'
  underflowing to -1. )

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://patch.msgid.link/20250515120549.2820541-29-mingo@kernel.org
2025-12-14 09:19:42 +01:00
Ingo Molnar
4ad03f133c x86/boot/e820: Simplify the e820__range_remove() API
Right now e820__range_remove() has two parameters to control the
E820 type of the range removed:

	extern void e820__range_remove(u64 start, u64 size, enum e820_type old_type, bool check_type);

Since E820 types start at 1, zero has a natural meaning of 'no type.

Consolidate the (old_type,check_type) parameters into a single (filter_type)
parameter:

	extern void e820__range_remove(u64 start, u64 size, enum e820_type filter_type);

Note that both e820__mapped_raw_any() and e820__mapped_any()
already have such semantics for their 'type' parameter, although
it's currently not used with '0' by in-kernel code.

Also, the __e820__mapped_all() internal helper already has such
semantics implemented as well, and the e820__get_entry_type() API
uses the '0' type to such effect.

This simplifies not just e820__range_remove(), and synchronizes its
use of type filters with other E820 API functions, but simplifies
usage sites as well, such as parse_memmap_one(), beyond the reduction
of the number of parameters:

  -               else if (from)
  -                       e820__range_remove(start_at, mem_size, from, 1);
                  else
  -                       e820__range_remove(start_at, mem_size, 0, 0);
  +                       e820__range_remove(start_at, mem_size, from);

The generated code gets smaller as well:

	add/remove: 0/0 grow/shrink: 0/5 up/down: 0/-66 (-66)

	Function                                     old     new   delta
	parse_memopt                                 112     107      -5
	efi_init                                    1048    1039      -9
	setup_arch                                  2719    2709     -10
	e820__range_remove                           283     273     -10
	parse_memmap_opt                             559     527     -32

	Total: Before=22,675,600, After=22,675,534, chg -0.00%

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://patch.msgid.link/20250515120549.2820541-28-mingo@kernel.org
2025-12-14 09:19:42 +01:00
Ingo Molnar
8b886d8a4d x86/boot/e820: Remove e820__range_remove()'s unused return parameter
None of the usage sites make use of the 'real_removed_size'
return parameter of e820__range_remove(), and it's hard
to contemplate much constructive use: E820 maps can have
holes, and removing a fixed range may result in removal
of any number of bytes from 0 to the requested size.

So remove this pointless calculation. This simplifies
the function a bit:

   text       data        bss        dec        hex    filename
   7645      44072          0      51717       ca05    e820.o.before
   7597      44072          0      51669       c9d5    e820.o.after

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://patch.msgid.link/20250515120549.2820541-27-mingo@kernel.org
2025-12-14 09:19:42 +01:00
Ingo Molnar
157266edcc x86/boot/e820: Simplify append_e820_table() and remove restriction on single-entry tables
So append_e820_table() begins with this weird condition that checks 'nr_entries':

    static int __init append_e820_table(struct boot_e820_entry *entries, u32 nr_entries)
    {
            /* Only one memory region (or negative)? Ignore it */
            if (nr_entries < 2)
                    return -1;

Firstly, 'nr_entries' has been an u32 since 2017 and cannot be negative.

Secondly, there's nothing inherently wrong with single-entry E820 maps,
especially in virtualized environments.

So remove this restriction and remove the __append_e820_table()
indirection.

Also:

 - fix/update comments
 - remove obsolete comments

This shrinks the generated code a bit as well:

   text       data        bss        dec        hex    filename
   7549      44072          0      51621       c9a5    e820.o.before
   7533      44072          0      51605       c995    e820.o.after

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://patch.msgid.link/20250515120549.2820541-26-mingo@kernel.org
2025-12-14 09:19:41 +01:00
Ingo Molnar
af0cf1646d x86/boot/e820: Standardize __init/__initdata tag placement
So the e820.c file has a hodgepodge of __init and __initdata tag
placements:

    static int __init e820_search_gap(unsigned long *max_gap_start, unsigned long *max_gap_size)
    __init void e820__setup_pci_gap(void)
    __init void e820__reallocate_tables(void)
    void __init e820__memory_setup_extended(u64 phys_addr, u32 data_len)
    void __init e820__register_nosave_regions(unsigned long limit_pfn)
    static int __init e820__register_nvs_regions(void)
    u64 __init e820__memblock_alloc_reserved(u64 size, u64 align)

Standardize on the style used by e820__setup_pci_gap() and place
them before the storage class.

In addition to the consistency, as a bonus this makes the grep output
rather clean looking:

    __init void e820__range_remove(u64 start, u64 size, enum e820_type filter_type)
    __init void e820__update_table_print(void)
    __init static void e820__update_table_kexec(void)
    __init static int e820_search_gap(unsigned long *max_gap_start, unsigned long *max_gap_size)
    __init void e820__setup_pci_gap(void)
    __init void e820__reallocate_tables(void)
    __init void e820__memory_setup_extended(u64 phys_addr, u32 data_len)
    __init void e820__register_nosave_regions(unsigned long limit_pfn)
    __init static int e820__register_nvs_regions(void)

... and if one learns to just ignore the leftmost '__init' noise then
the rest of the line looks just like a regular C function definition.

With the 'mixed' tag placement style the __init tag breaks up the function's
prototype for no good reason.

Do the same for __initdata.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://patch.msgid.link/20250515120549.2820541-25-mingo@kernel.org
2025-12-14 09:19:41 +01:00
Ingo Molnar
7df2f811b2 x86/boot/e820: Simplify & clarify __e820__range_add() a bit
Use 'entry_new' to make clear we are allocating a new entry.

Change the table-full message to say that the table is full.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://patch.msgid.link/20250515120549.2820541-24-mingo@kernel.org
2025-12-14 09:19:41 +01:00
Ingo Molnar
95060e411f x86/boot/e820: Rename gap_start/gap_size to max_gap_start/max_gap_start in e820_search_gap() et al
The PCI gap searching functions pass around pointers to the
gap_start/gap_size variables, which refer to the maximum
size gap found so far.

Rename the variables to say so, and disambiguate their namespace
from 'current gap' variables.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://patch.msgid.link/20250515120549.2820541-23-mingo@kernel.org
2025-12-14 09:19:41 +01:00
Ingo Molnar
f40f3f32b3 x86/boot/e820: Change e820_search_gap() to search for the highest-address PCI gap
Right now the main x86 function that determines the position and size
of the 'PCI gap', e820_search_gap(), has this curious property:

        while (--idx >= 0) {
	...
			if (gap >= *gap_size) {

I.e. it will iterate the E820 table backwards, from its end to the beginning,
and will search for larger and larger gaps in the memory map below 4GB,
until it finishes with the table.

This logic will, should there be two gaps with the same size, pick the
one with the lower physical address - which is contrary to usual
practice that the PCI gap is just below 4GB.

Furthermore, the commit that introduced this weird logic 16 years ago:

  3381959da5 ("x86: cleanup e820_setup_gap(), add e820_search_gap(), v2")

  -                       if (gap > gapsize) {
  +                       if (gap >= *gapsize) {

didn't even declare this change, the title says it's a cleanup,
and the changelog declares it as a preparatory refactoring
for a later bugfix:

  809d9a8f93 ("x86/PCI: ACPI based PCI gap calculation")

which bugfix was reverted only 1 day later without much of an
explanation, and was never reintroduced:

  58b6e55384 ("Revert "x86/PCI: ACPI based PCI gap calculation"")

So based on the Git archeology and by the plain reading of the
code I declare this '>=' change an unintended bug and side effect.

Change it to '>' again.

It should not make much of a difference in practice, as the
likelihood of having *two* largest gaps with exactly the same
size are very low outside of weird user-provided memory maps.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://patch.msgid.link/20250515120549.2820541-22-mingo@kernel.org
2025-12-14 09:19:40 +01:00
Ingo Molnar
cff02bff04 x86/boot/e820: Clean up e820__setup_pci_gap()/e820_search_gap() a bit
Apply misc cleanups:

 - Use a bit more readable variable names, we haven't run out of
   underscore characters in the kernel yet.

 - s/0x400000/SZ_4M

 - s/1024*1024/SZ_1M

Suggested-by: Andy Shevchenko <andy@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://patch.msgid.link/20250515120549.2820541-21-mingo@kernel.org
2025-12-14 09:19:40 +01:00
Ingo Molnar
58dcd82d2e x86/boot/e820: Standardize e820 table index variable types under 'u32'
So we have 'idx' types of 'int' and 'unsigned int', and sometimes
we assign 'u32' fields such as e820_table::nr_entries to these 'int'
values.

While there's no real risk of overflow with these tables, make it
all cleaner by standardizing on a single type: u32.

This also happens to shrink the code a bit:

   text      data      bss        dec        hex    filename
   7745     44072        0      51817       ca69    e820.o.before
   7613     44072        0      51685       c9e5    e820.o.after

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://patch.msgid.link/20250515120549.2820541-19-mingo@kernel.org
2025-12-14 09:19:40 +01:00
Ingo Molnar
dc043d6463 x86/boot/e820: Standardize e820 table index variable names under 'idx'
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://patch.msgid.link/20250515120549.2820541-18-mingo@kernel.org
2025-12-14 09:19:40 +01:00
Ingo Molnar
a515ca9664 x86/boot/e820: Remove unnecessary header inclusions
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://patch.msgid.link/20250515120549.2820541-17-mingo@kernel.org
2025-12-14 09:19:39 +01:00
Ingo Molnar
2774ae1046 x86/boot/e820: Clean up __refdata use a bit
So __refdata, like __init, is more of a storage class specifier,
so move the attribute in front of the type, not after the variable
name. This also aligns it vertically.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://patch.msgid.link/20250515120549.2820541-16-mingo@kernel.org
2025-12-14 09:19:39 +01:00
Ingo Molnar
a4803df3a2 x86/boot/e820: Clean up __e820__range_add() a bit
- Use 'idx' index variable instead of a weird 'x'
 - Make the error message E820-specific
 - Group the code a bit better

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://patch.msgid.link/20250515120549.2820541-15-mingo@kernel.org
2025-12-14 09:19:39 +01:00
Ingo Molnar
4a7a13e04c x86/boot/e820: Improve e820_print_type() messages
For E820_TYPE_RESERVED, print:

  'reserved'  -> 'device reserved'

For E820_TYPE_PRAM and E820_TYPE_PMEM:

  'persistent' -> 'persistent RAM'

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://patch.msgid.link/20250515120549.2820541-14-mingo@kernel.org
2025-12-14 09:19:39 +01:00
Ingo Molnar
44f732f3ec x86/boot/e820: Clean up confusing and self-contradictory verbiage around E820 related resource allocations
So the E820 code has a rather confusing area of code at around
e820__reserve_resources(), which is, by its plain reading,
rather self-contradictory. For example, the comment explaining
e820__reserve_resources() claims:

 - '* Mark E820 reserved areas as busy for the resource manager'

By 'E820 reserved areas' one can naively conclude that it's
talking about E820_TYPE_RESERVED areas - while those areas
are treated in exactly the opposite fashion by do_mark_busy():

        switch (type) {
        case E820_TYPE_RESERVED:
        case E820_TYPE_SOFT_RESERVED:
        case E820_TYPE_PRAM:
        case E820_TYPE_PMEM:
                return false;

Ie. E820_TYPE_RESERVED areas are *not* marked busy for the
resource manager, because E820_TYPE_RESERVED areas are
device regions that might eventually be claimed by a device driver.

This type of confusion permeates this whole area of code,
making it exceedingly difficult to read (for me at least).

So untangle it bit by bit:

 - Instead of talking about ambiguous 'reserved areas',
   talk about 'E820 device address regions' instead,
   and 'register'/'lock' them.

 - The do_mark_busy() function is a misnomer as well, because despite
   its name it 'does' nothing - it only determines what type
   of resource handling an E820 type should receive from the
   kernel. Rename it to e820_device_region() and negate its
   meaning, to avoid the 'busy/reserved' confusion. Because
   that's what this code is really about: filtering out
   device regions such as E820_TYPE_RESERVED, E820_TYPE_PRAM,
   E820_TYPE_PMEM, etc., and allowing them to be claimed
   by device drivers later on.

 - All other E820 regions (system regions) are registered and
   locked early on, before the PCI resource manager does its
   search for device BAR addresses, etc.

Also fix this somewhat misleading comment:

	/*
	 * Try to bump up RAM regions to reasonable boundaries, to
	 * avoid stolen RAM:
	 */

and explain that here we register artificial 'gap' resources
at the end of suspiciously sized RAM regions, as heuristics
to try to avoid buggy firmware with undeclared 'stolen RAM' regions:

	/*
	 * Create additional 'gaps' at the end of RAM regions,
	 * rounding them up to 64k/1MB/64MB boundaries, should
	 * they be weirdly sized, and register extra, locked
	 * resource regions for them, to make sure drivers
	 * won't claim those addresses.
	 *
	 * These are basically blind guesses and heuristics to
	 * avoid resource conflicts with broken firmware that
	 * doesn't properly list 'stolen RAM' as a system region
	 * in the E820 map.
	 */

Also improve the printout of this extra resource a bit: make the
message more unambiguous, and upgrade it from pr_debug() (where
very few people will see it), to pr_info() (where it will make
it into the syslog on default distro configs).

Also fix spelling and improve comment placement.

No change in functionality intended.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://patch.msgid.link/20250515120549.2820541-13-mingo@kernel.org
2025-12-14 09:19:38 +01:00
Ingo Molnar
d214484f50 x86/boot/e820: Remove pointless early_panic() indirection
early_panic() is a pointless wrapper around panic():

	static void __init early_panic(char *msg)
	{
		early_printk(msg);
		panic(msg);
	}

panic() will already do a printk() of 'msg', and an early_printk() if
earlyprintk is enabled. There's no need to print it separately.

Remove the function.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://patch.msgid.link/20250515120549.2820541-12-mingo@kernel.org
2025-12-14 09:19:38 +01:00
Ingo Molnar
eea78dc546 x86/boot/e820: Use 'u64' consistently instead of 'unsigned long long'
There's a number of structure fields and local variables related
to E820 entry physical addresses that are defined as 'unsigned long long',
but then are compared to u64 fields.

Make the types all consistently u64.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://patch.msgid.link/20250515120549.2820541-11-mingo@kernel.org
2025-12-14 09:19:38 +01:00
Ingo Molnar
1d7bc219e2 x86/boot/e820: Call the PCI gap a 'gap' in the boot log printout
It is a bit weird and inconsistent that the PCI gap is
advertised during bootup as 'mem'ory:

  [mem 0xc0000000-0xfed1bfff] available for PCI devices
   ^^^

It's not really memory, it's a gap that PCI devices can decode
and use and they often do not map it to any memory themselves.

So advertise it for what it is, a gap:

  [gap 0xc0000000-0xfed1bfff] available for PCI devices

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://patch.msgid.link/20250515120549.2820541-10-mingo@kernel.org
2025-12-14 09:19:38 +01:00
Ingo Molnar
fa06d58805 x86/boot/e820: Print E820_TYPE_RAM entries as ... RAM entries
So it is a bit weird that the actual RAM entries of the E820 table
are not actually called RAM, but 'usable':

	BIOS-e820: [mem 0x0000000000100000-0x000000007ffdbfff]    1.9 GB usable

'usable' is pretty passive-aggressive in that context and ambiguous,
most E820 entries denote 'usable' address ranges - reserved ranges
may be used by devices, or the platform.

Clarify and disambiguate this by making the boot log entry
explicitly say 'System RAM', like in /proc/iomem:

	BIOS-e820: [mem 0x0000000000100000-0x000000007ffdbfff]    1.9 GB System RAM

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://patch.msgid.link/20250515120549.2820541-9-mingo@kernel.org
2025-12-14 09:19:38 +01:00
Ingo Molnar
c87f944777 x86/boot/e820: Make the field separator space character part of e820_print_type()
We are going to add more columns to the E820 table printout,
so make e820_print_type()'s field separator (space character)
part of the function itself.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://patch.msgid.link/20250515120549.2820541-7-mingo@kernel.org
2025-12-14 09:19:37 +01:00
Ingo Molnar
4d8e5a682b x86/boot/e820: Print gaps in the E820 table
Gaps in the E820 table are not obvious at a glance and can
easily be overlooked.

Print out gaps in the E820 table:

Before:

	BIOS-provided physical RAM map:
	BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
	BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
	BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
	BIOS-e820: [mem 0x0000000000100000-0x000000007ffdbfff] usable
	BIOS-e820: [mem 0x000000007ffdc000-0x000000007fffffff] reserved
	BIOS-e820: [mem 0x00000000b0000000-0x00000000bfffffff] reserved
	BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
	BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
	BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
	BIOS-e820: [mem 0x000000fd00000000-0x000000ffffffffff] reserved
After:

	BIOS-provided physical RAM map:
	BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
	BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
	BIOS-e820: [gap 0x00000000000a0000-0x00000000000effff]
	BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
	BIOS-e820: [mem 0x0000000000100000-0x000000007ffdbfff] usable
	BIOS-e820: [mem 0x000000007ffdc000-0x000000007fffffff] reserved
	BIOS-e820: [gap 0x0000000080000000-0x00000000afffffff]
	BIOS-e820: [mem 0x00000000b0000000-0x00000000bfffffff] reserved
	BIOS-e820: [gap 0x00000000c0000000-0x00000000fed1bfff]
	BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
	BIOS-e820: [gap 0x00000000fed20000-0x00000000feffbfff]
	BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
	BIOS-e820: [gap 0x00000000ff000000-0x00000000fffbffff]
	BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
	BIOS-e820: [gap 0x0000000100000000-0x000000fcffffffff]
	BIOS-e820: [mem 0x000000fd00000000-0x000000ffffffffff] reserved

Also warn about badly ordered E820 table entries:

	BUG: out of order E820 entry!

( this is printed before the entry is printed, so there's no need to
  print any additional data with the warning. )

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://patch.msgid.link/20250515120549.2820541-6-mingo@kernel.org
2025-12-14 09:19:37 +01:00
Ingo Molnar
3e57abd455 x86/boot/e820: Mark e820__print_table() static
There are no external users of this function left.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://patch.msgid.link/20250515120549.2820541-5-mingo@kernel.org
2025-12-14 09:19:37 +01:00
Ingo Molnar
0bb4a8bdbd x86/boot/e820: Simplify e820__print_table() a bit
Introduce 'entry' for the current table entry and shorten
repetitious use of e820_table->entries[i].

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://patch.msgid.link/20250515120549.2820541-3-mingo@kernel.org
2025-12-14 09:19:36 +01:00
Ingo Molnar
db0d69c570 x86/boot/e820: Remove inverted boolean logic from the e820_nomerge() function name, rename it to e820_type_mergeable()
It's a bad practice to put inverted logic into function names,
flip it back and rename it to e820_type_mergeable().

Add/update a few comments about this function while at it.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://patch.msgid.link/20250515120549.2820541-2-mingo@kernel.org
2025-12-14 09:19:36 +01:00
Sean Christopherson
6276c67f2b x86: Restrict KVM-induced symbol exports to KVM modules where obvious/possible
Extend KVM's export macro framework to provide EXPORT_SYMBOL_FOR_KVM(),
and use the helper macro to export symbols for KVM throughout x86 if and
only if KVM will build one or more modules, and only for those modules.

To avoid unnecessary exports when CONFIG_KVM=m but kvm.ko will not be
built (because no vendor modules are selected), let arch code #define
EXPORT_SYMBOL_FOR_KVM to suppress/override the exports.

Note, the set of symbols to restrict to KVM was generated by manual search
and audit; any "misses" are due to human error, not some grand plan.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251112173944.1380633-5-seanjc%40google.com
2025-11-12 15:29:38 -08:00
Alexander Graf
a2daf83e10 x86/e820: temporarily enable KHO scratch for memory below 1M
KHO kernels are special and use only scratch memory for memblock
allocations, but memory below 1M is ignored by kernel after early boot and
cannot be naturally marked as scratch.

To allow allocation of the real-mode trampoline and a few (if any) other
very early allocations from below 1M forcibly mark the memory below 1M as
scratch.

After real mode trampoline is allocated, clear that scratch marking.

Link: https://lkml.kernel.org/r/20250509074635.3187114-13-changyuanl@google.com
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12 23:50:41 -07:00
Mike Rapoport (Microsoft)
83b2d345e1 x86/e820: Discard high memory that can't be addressed by 32-bit systems
Dave Hansen reports the following crash on a 32-bit system with
CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:

  > 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
  > obviously wasn't allocated, thus the oops.

  BUG: unable to handle page fault for address: f75fe000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  *pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
  Oops: Oops: 0002 [#1] SMP NOPTI
  CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  EIP: __free_pages_core+0x3c/0x74
  ...
  Call Trace:
   memblock_free_pages+0x11/0x2c
   memblock_free_all+0x2ce/0x3a0
   mm_core_init+0xf5/0x320
   start_kernel+0x296/0x79c
   i386_start_kernel+0xad/0xb0
   startup_32_smp+0x151/0x154

The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
by max_pfn.

The bug was introduced by this recent commit:

  6faea3422e ("arch, mm: streamline HIGHMEM freeing")

Previously, freeing of high memory was also clamped to the end of
ZONE_HIGHMEM but after this change, memblock_free_all() tries to
free memory above the of ZONE_HIGHMEM as well and that causes
access to mem_map[] entries beyond the end of the memory map.

To fix this, discard the memory after max_pfn from memblock on
32-bit systems so that core MM would be aware only of actually
usable memory.

Fixes: 6faea3422e ("arch, mm: streamline HIGHMEM freeing")
Reported-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davide Ciminaghi <ciminaghi@gnudd.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org
Link: https://lore.kernel.org/r/20250413080858.743221-1-rppt@kernel.org # discussion and submission
2025-04-19 16:48:18 +02:00
Myrrh Periwinkle
f2f29da9f0 x86/e820: Fix handling of subpage regions when calculating nosave ranges in e820__register_nosave_regions()
While debugging kexec/hibernation hangs and crashes, it turned out that
the current implementation of e820__register_nosave_regions() suffers from
multiple serious issues:

 - The end of last region is tracked by PFN, causing it to find holes
   that aren't there if two consecutive subpage regions are present

 - The nosave PFN ranges derived from holes are rounded out (instead of
   rounded in) which makes it inconsistent with how explicitly reserved
   regions are handled

Fix this by:

 - Treating reserved regions as if they were holes, to ensure consistent
   handling (rounding out nosave PFN ranges is more correct as the
   kernel does not use partial pages)

 - Tracking the end of the last RAM region by address instead of pages
   to detect holes more precisely

These bugs appear to have been introduced about ~18 years ago with the very
first version of e820_mark_nosave_regions(), and its flawed assumptions were
carried forward uninterrupted through various waves of rewrites and renames.

[ mingo: Added Git archeology details, for kicks and giggles. ]

Fixes: e8eff5ac29 ("[PATCH] Make swsusp avoid memory holes and reserved memory regions on x86_64")
Reported-by: Roberto Ricci <io@r-ricci.it>
Tested-by: Roberto Ricci <io@r-ricci.it>
Signed-off-by: Myrrh Periwinkle <myrrhperiwinkle@qtmlabs.xyz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Len Brown <len.brown@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250406-fix-e820-nosave-v3-1-f3787bc1ee1d@qtmlabs.xyz
Closes: https://lore.kernel.org/all/Z4WFjBVHpndct7br@desktop0a/
2025-04-07 19:20:08 +02:00
Dave Young
a2498e5c45 x86/kexec: Export e820_table_kexec[] to sysfs
Previously the e820_table_kexec[] was exported to sysfs since kexec-tools uses
the memmap entries to prepare the e820 table for the new kernel.

The following commit, ~8 years ago, introduced e820_table_firmware[] and changed
the behavior to export the firmware table instead:

  12df216c61 ("x86/boot/e820: Introduce the bootloader provided e820_table_firmware[] table")

Originally the kexec_file_load and kexec_load syscalls both used e820_table_kexec[].
Since the sysfs exported entries are from e820_table_firmware[] people
now need to tune both tables for kexec.

Restore the old behavior so the kexec_load and kexec_file_load syscalls work with
only one table update.  The e820_table_firmware[] is used by hibernation kernel
code and it works without the sysfs exporting. Also remove the SEV
e820_table_firmware[] updating code.

Also update the code comments here and drop the comments about setup_data
reservation since it is not needed any more after this change was made
a year ago:

  fc7f27cda8 ("x86/kexec: Do not update E820 kexec table for setup_data")

[ mingo: Tidy up the changelog and comments. ]

Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Link: https://lore.kernel.org/r/Z5jcb1GKhLvH8kDc@darkstar.users.ipa.redhat.com
2025-02-22 12:44:45 +01:00
Mike Rapoport (Microsoft)
efe659ac01 x86/e820: Drop obsolete E820_TYPE_RESERVED_KERN and related code
E820_TYPE_RESERVED_KERN is a relict from the ancient history that was used
to early reserve setup_data, see:

  28bb223795 ("x86: move reserve_setup_data to setup.c")

Nowadays setup_data is anyway reserved in memblock and there is no point in
carrying E820_TYPE_RESERVED_KERN that behaves exactly like E820_TYPE_RAM
but only complicates the code.

A bonus for removing E820_TYPE_RESERVED_KERN is a small but measurable
speedup of 20 microseconds in init_mem_mappings() on a VM with 32GB or RAM.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20250214090651.3331663-5-rppt@kernel.org
2025-02-21 16:05:00 +01:00
Mike Rapoport (Microsoft)
2d6bff3139 x86/boot: Move setting of memblock parameters to e820__memblock_setup()
Changing memblock parameters, namely bottom_up and allocation upper
limit does not have any effect before memblock initialization in
e820__memblock_setup().

Move the calls to memblock_set_bottom_up() and memblock_set_current_limit()
to e820__memblock_setup() to group all the memblock initial setup and make
setup_arch() more readable.

No functional changes.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250214090651.3331663-2-rppt@kernel.org
2025-02-21 16:05:00 +01:00
Guo Weikang
c6f239796b mm/memblock: add memblock_alloc_or_panic interface
Before SLUB initialization, various subsystems used memblock_alloc to
allocate memory.  In most cases, when memory allocation fails, an
immediate panic is required.  To simplify this behavior and reduce
repetitive checks, introduce `memblock_alloc_or_panic`.  This function
ensures that memory allocation failures result in a panic automatically,
improving code readability and consistency across subsystems that require
this behavior.

[guoweikang.kernel@gmail.com: arch/s390: save_area_alloc default failure behavior changed to panic]
  Link: https://lkml.kernel.org/r/20250109033136.2845676-1-guoweikang.kernel@gmail.com
  Link: https://lore.kernel.org/lkml/Z2fknmnNtiZbCc7x@kernel.org/
Link: https://lkml.kernel.org/r/20250102072528.650926-1-guoweikang.kernel@gmail.com
Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>	[s390]
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:38 -08:00
Kirill A. Shutemov
06fa48d85b x86/mm: Make e820__end_ram_pfn() cover E820_TYPE_ACPI ranges
e820__end_of_ram_pfn() is used to calculate max_pfn which, among other things,
guides where direct mapping ends. Any memory above max_pfn is not going to be
present in the direct mapping.

e820__end_of_ram_pfn() finds the end of the RAM based on the highest
E820_TYPE_RAM range. But it doesn't includes E820_TYPE_ACPI ranges into
calculation.

Despite the name, E820_TYPE_ACPI covers not only ACPI data, but also EFI tables
and might be required by kernel to function properly.

Usually the problem is hidden because there is some E820_TYPE_RAM memory above
E820_TYPE_ACPI. But crashkernel only presents pre-allocated crash memory as
E820_TYPE_RAM on boot. If the pre-allocated range is small, it can fit under the
last E820_TYPE_ACPI range.

Modify e820__end_of_ram_pfn() and e820__end_of_low_ram_pfn() to cover
E820_TYPE_ACPI memory.

The problem was discovered during debugging kexec for TDX guest. TDX guest uses
E820_TYPE_ACPI to store the unaccepted memory bitmap and pass it between the
kernels on kexec.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-13-kirill.shutemov@linux.intel.com
2024-06-17 17:46:08 +02:00
Ashish Kalra
d6d85ac15c x86/e820: Add a new e820 table update helper
Add a new API helper e820__range_update_table() with which to update an
arbitrary e820 table. Move all current users of
e820__range_update_kexec() to this new helper.

  [ bp: Massage. ]

Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/b726af213ad55053f8a7a1e793b01bb3f1ca9dd5.1714090302.git.ashish.kalra@amd.com
2024-04-29 11:15:31 +02:00
Dave Young
fc7f27cda8 x86/kexec: Do not update E820 kexec table for setup_data
crashkernel reservation failed on a Thinkpad t440s laptop recently.
Actually the memblock reservation succeeded, but later insert_resource()
failed.

Test steps:
  kexec load -> /* make sure add crashkernel param eg. crashkernel=160M */
    kexec reboot ->
        dmesg|grep "crashkernel reserved";
            crashkernel memory range like below reserved successfully:
              0x00000000d0000000 - 0x00000000da000000
        But no such "Crash kernel" region in /proc/iomem

The background story:

Currently the E820 code reserves setup_data regions for both the current
kernel and the kexec kernel, and it inserts them into the resources list.

Before the kexec kernel reboots nobody passes the old setup_data, and
kexec only passes fresh SETUP_EFI/SETUP_IMA/SETUP_RNG_SEED if needed.
Thus the old setup data memory is not used at all.

Due to old kernel updates the kexec e820 table as well so kexec kernel
sees them as E820_TYPE_RESERVED_KERN regions, and later the old setup_data
regions are inserted into resources list in the kexec kernel by
e820__reserve_resources().

Note, due to no setup_data is passed in for those old regions they are not
early reserved (by function early_reserve_memory), and the crashkernel
memblock reservation will just treat them as usable memory and it could
reserve the crashkernel region which overlaps with the old setup_data
regions. And just like the bug I noticed here, kdump insert_resource
failed because e820__reserve_resources has added the overlapped chunks
in /proc/iomem already.

Finally, looking at the code, the old setup_data regions are not used
at all as no setup_data is passed in by the kexec boot loader. Although
something like SETUP_PCI etc could be needed, kexec should pass
the info as new setup_data so that kexec kernel can take care of them.
This should be taken care of in other separate patches if needed.

Thus drop the useless buggy code here.

Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Eric DeVolder <eric.devolder@oracle.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/Zf0T3HCG-790K-pZ@darkstar.users.ipa.redhat.com
2024-03-22 10:07:45 +01:00
Jiri Bohac
7fd817c906 x86/e820: Don't reserve SETUP_RNG_SEED in e820
SETUP_RNG_SEED in setup_data is supplied by kexec and should
not be reserved in the e820 map.

Doing so reserves 16 bytes of RAM when booting with kexec.
(16 bytes because data->len is zeroed by parse_setup_data so only
sizeof(setup_data) is reserved.)

When kexec is used repeatedly, each boot adds two entries in the
kexec-provided e820 map as the 16-byte range splits a larger
range of usable memory. Eventually all of the 128 available entries
get used up. The next split will result in losing usable memory
as the new entries cannot be added to the e820 map.

Fixes: 68b8e9713c ("x86/setup: Use rng seeds from setup_data")
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/ZbmOjKnARGiaYBd5@dwarf.suse.cz
2024-03-01 10:27:20 -08:00
Yuntao Wang
50c66d7b04 x86/setup: Move duplicate boot_cpu_data definition out of the ifdeffery
Both the if and else blocks define an exact same boot_cpu_data variable, move
the duplicate variable definition out of the if/else block.

In addition, do some other minor cleanups.

  [ bp: Massage. ]

Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20220601122914.820890-1-ytcoode@gmail.com
2023-01-11 12:45:16 +01:00
Wang Yong
b7d1f15b5c x86/boot/e820: Fix typo in e820.c comment
change "itsmain" to "its main".

Fixes: 544a0f47e7 ("x86/boot/e820: Rename e820_table_saved to e820_table_firmware and improve the description")
Signed-off-by: Wang Yong <yongw.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20221211103849.173870-1-yongw.kernel@gmail.com
2023-01-11 12:45:03 +01:00
Jonathan McDowell
b69a2afd5a x86/kexec: Carry forward IMA measurement log on kexec
On kexec file load, the Integrity Measurement Architecture (IMA)
subsystem may verify the IMA signature of the kernel and initramfs, and
measure it. The command line parameters passed to the kernel in the
kexec call may also be measured by IMA.

A remote attestation service can verify a TPM quote based on the TPM
event log, the IMA measurement list and the TPM PCR data. This can
be achieved only if the IMA measurement log is carried over from the
current kernel to the next kernel across the kexec call.

PowerPC and ARM64 both achieve this using device tree with a
"linux,ima-kexec-buffer" node. x86 platforms generally don't make use of
device tree, so use the setup_data mechanism to pass the IMA buffer to
the new kernel.

Signed-off-by: Jonathan McDowell <noodles@fb.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com> # IMA function definitions
Link: https://lore.kernel.org/r/YmKyvlF3my1yWTvK@noodles-fedora-PC23Y6EG
2022-07-01 15:22:16 +02:00
Ross Philipson
7228918b34 x86/boot: Fix memremap of setup_indirect structures
As documented, the setup_indirect structure is nested inside
the setup_data structures in the setup_data list. The code currently
accesses the fields inside the setup_indirect structure but only
the sizeof(struct setup_data) is being memremapped. No crash
occurred but this is just due to how the area is remapped under the
covers.

Properly memremap both the setup_data and setup_indirect structures
in these cases before accessing them.

Fixes: b3c72fc9a7 ("x86/boot: Introduce setup_indirect")
Signed-off-by: Ross Philipson <ross.philipson@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/1645668456-22036-2-git-send-email-ross.philipson@oracle.com
2022-03-09 12:49:44 +01:00
Linus Torvalds
5469f160e6 Power management updates for 5.13-rc1
- Add idle states table for IceLake-D to the intel_idle driver and
    update IceLake-X C6 data in it (Artem Bityutskiy).
 
  - Fix the C7 idle state on Tegra114 in the tegra cpuidle driver and
    drop the unused do_idle() firmware call from it (Dmitry Osipenko).
 
  - Fix cpuidle-qcom-spm Kconfig entry (He Ying).
 
  - Fix handling of possible negative tick_nohz_get_next_hrtimer()
    return values of in cpuidle governors (Rafael Wysocki).
 
  - Add support for frequency-invariance to the ACPI CPPC cpufreq
    driver and update the frequency-invariance engine (FIE) to use it
    as needed (Viresh Kumar).
 
  - Simplify the default delay_us setting in the ACPI CPPC cpufreq
    driver (Tom Saeger).
 
  - Clean up frequency-related computations in the intel_pstate
    cpufreq driver (Rafael Wysocki).
 
  - Fix TBG parent setting for load levels in the armada-37xx
    cpufreq driver and drop the CPU PM clock .set_parent method for
    armada-37xx (Marek Behún).
 
  - Fix multiple issues in the armada-37xx cpufreq driver (Pali Rohár).
 
  - Fix handling of dev_pm_opp_of_cpumask_add_table() return values
    in cpufreq-dt to take the -EPROBE_DEFER one into acconut as
    appropriate (Quanyang Wang).
 
  - Fix format string in ia64-acpi-cpufreq (Sergei Trofimovich).
 
  - Drop the unused for_each_policy() macro from cpufreq (Shaokun
    Zhang).
 
  - Simplify computations in the schedutil cpufreq governor to avoid
    unnecessary overhead (Yue Hu).
 
  - Fix typos in the s5pv210 cpufreq driver (Bhaskar Chowdhury).
 
  - Fix cpufreq documentation links in Kconfig (Alexander Monakov).
 
  - Fix PCI device power state handling in pci_enable_device_flags()
    to avoid issuse in some cases when the device depends on an ACPI
    power resource (Rafael Wysocki).
 
  - Add missing documentation of pm_runtime_resume_and_get() (Alan
    Stern).
 
  - Add missing static inline stub for pm_runtime_has_no_callbacks()
    to pm_runtime.h and drop the unused try_to_freeze_nowarn()
    definition (YueHaibing).
 
  - Drop duplicate struct device declaration from pm.h and fix a
    structure type declaration in intel_rapl.h (Wan Jiabing).
 
  - Use dev_set_name() instead of an open-coded equivalent of it in
    the wakeup sources code and drop a redundant local variable
    initialization from it (Andy Shevchenko, Colin Ian King).
 
  - Use crc32 instead of md5 for e820 memory map integrity check
    during resume from hibernation on x86 (Chris von Recklinghausen).
 
  - Fix typos in comments in the system-wide and hibernation support
    code (Lu Jialin).
 
  - Modify the generic power domains (genpd) code to avoid resuming
    devices in the "prepare" phase of system-wide suspend and
    hibernation (Ulf Hansson).
 
  - Add Hygon Fam18h RAPL support to the intel_rapl power capping
    driver (Pu Wen).
 
  - Add MAINTAINERS entry for the dynamic thermal power management
    (DTPM) code (Daniel Lezcano).
 
  - Add devm variants of operating performance points (OPP) API
    functions and switch over some users of the OPP framework to
    the new resource-managed API (Yangtao Li and Dmitry Osipenko).
 
  - Update devfreq core:
 
    * Register devfreq devices as cooling devices on demand (Daniel
      Lezcano).
 
    * Add missing unlock opeation in devfreq_add_device() (Lukasz
      Luba).
 
    * Use the next frequency as resume_freq instead of the previous
      frequency when using the opp-suspend property (Dong Aisheng).
 
    * Check get_dev_status in devfreq_update_stats() (Dong Aisheng).
 
    * Fix set_freq path for the userspace governor in Kconfig (Dong
      Aisheng).
 
    * Remove invalid description of get_target_freq() (Dong Aisheng).
 
  - Update devfreq drivers:
 
    * imx8m-ddrc: Remove imx8m_ddrc_get_dev_status() and unneeded
      of_match_ptr() (Dong Aisheng, Fabio Estevam).
 
    * rk3399_dmc: dt-bindings: Add rockchip,pmu phandle and drop
      references to undefined symbols (Enric Balletbo i Serra, Gaël
      PORTAY).
 
    * rk3399_dmc: Use dev_err_probe() to simplify the code (Krzysztof
      Kozlowski).
 
    * imx-bus: Remove unneeded of_match_ptr() (Fabio Estevam).
 
  - Fix kernel-doc warnings in three places (Pierre-Louis Bossart).
 
  - Fix typo in the pm-graph utility code (Ricardo Ribalda).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmCHAUISHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxAxMP/0tFjgxeaJ3chYaiqoPlk2QX/XdwqJvm
 8OOu2qBMWbt2bubcIlAPpdlCNaERI4itF7E8za7t9alswdq7YPWGmNR9snCXUKhD
 BzERuicZTeOcCk2P3DTgzLVc4EzF6wutA3lTdYYZIpf+LuuB+guG8zgMzScRHIsM
 N3I83O+iLTA9ifIqN0/wH//a0ISvo6rSWtcbx+48d5bYvYixW7CsBmoxWHhGiYsw
 4PJ4AzbdNJEhQp91SBYPIAmqwV88FZUPoYnRazXMxOSevMewhP9JuCHXAujC3gLV
 l5d2TBaBV4EBYLD5tfCpJvHMXhv/yBpg6KRivjri+zEnY1TAqIlfR4vYiL7puVvQ
 PdwjyvNFDNGyUSX/AAwYF6F4WCtIhw8hCahw6Dw2zcDz0plFdRZmWAiTdmIjECJK
 8EvwJNlmdl27G1y+EBc6NnwzEFZQwiu9F5bUHUkmc3fF1M1aFTza8WDNDo30TC94
 94c+uVBRL2fBePl4FfGZATfJbOMb8+vDIkroQxrIjQDT/7Ha3Mz75JZDRHItZo92
 +4fES2eFdfZISCLIQMBY5TeXox3O8qsirC1k4qELwy71gPUE9CpP3FkxKIvyZLlv
 +6fS9ttpUfyFBF7gqrEy+ziVk1Rm4oorLmWCtthz4xyerzj5+vtZQqKzcySk0GA5
 hYkseZkedR6y
 =t+SG
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These add some new hardware support (for example, IceLake-D idle
  states in intel_idle), fix some issues (for example, the handling of
  negative "sleep length" values in cpuidle governors), add new
  functionality to the existing drivers (for example, scale-invariance
  support in the ACPI CPPC cpufreq driver) and clean up code all over.

  Specifics:

   - Add idle states table for IceLake-D to the intel_idle driver and
     update IceLake-X C6 data in it (Artem Bityutskiy).

   - Fix the C7 idle state on Tegra114 in the tegra cpuidle driver and
     drop the unused do_idle() firmware call from it (Dmitry Osipenko).

   - Fix cpuidle-qcom-spm Kconfig entry (He Ying).

   - Fix handling of possible negative tick_nohz_get_next_hrtimer()
     return values of in cpuidle governors (Rafael Wysocki).

   - Add support for frequency-invariance to the ACPI CPPC cpufreq
     driver and update the frequency-invariance engine (FIE) to use it
     as needed (Viresh Kumar).

   - Simplify the default delay_us setting in the ACPI CPPC cpufreq
     driver (Tom Saeger).

   - Clean up frequency-related computations in the intel_pstate cpufreq
     driver (Rafael Wysocki).

   - Fix TBG parent setting for load levels in the armada-37xx cpufreq
     driver and drop the CPU PM clock .set_parent method for armada-37xx
     (Marek Behún).

   - Fix multiple issues in the armada-37xx cpufreq driver (Pali Rohár).

   - Fix handling of dev_pm_opp_of_cpumask_add_table() return values in
     cpufreq-dt to take the -EPROBE_DEFER one into acconut as
     appropriate (Quanyang Wang).

   - Fix format string in ia64-acpi-cpufreq (Sergei Trofimovich).

   - Drop the unused for_each_policy() macro from cpufreq (Shaokun
     Zhang).

   - Simplify computations in the schedutil cpufreq governor to avoid
     unnecessary overhead (Yue Hu).

   - Fix typos in the s5pv210 cpufreq driver (Bhaskar Chowdhury).

   - Fix cpufreq documentation links in Kconfig (Alexander Monakov).

   - Fix PCI device power state handling in pci_enable_device_flags() to
     avoid issuse in some cases when the device depends on an ACPI power
     resource (Rafael Wysocki).

   - Add missing documentation of pm_runtime_resume_and_get() (Alan
     Stern).

   - Add missing static inline stub for pm_runtime_has_no_callbacks() to
     pm_runtime.h and drop the unused try_to_freeze_nowarn() definition
     (YueHaibing).

   - Drop duplicate struct device declaration from pm.h and fix a
     structure type declaration in intel_rapl.h (Wan Jiabing).

   - Use dev_set_name() instead of an open-coded equivalent of it in the
     wakeup sources code and drop a redundant local variable
     initialization from it (Andy Shevchenko, Colin Ian King).

   - Use crc32 instead of md5 for e820 memory map integrity check during
     resume from hibernation on x86 (Chris von Recklinghausen).

   - Fix typos in comments in the system-wide and hibernation support
     code (Lu Jialin).

   - Modify the generic power domains (genpd) code to avoid resuming
     devices in the "prepare" phase of system-wide suspend and
     hibernation (Ulf Hansson).

   - Add Hygon Fam18h RAPL support to the intel_rapl power capping
     driver (Pu Wen).

   - Add MAINTAINERS entry for the dynamic thermal power management
     (DTPM) code (Daniel Lezcano).

   - Add devm variants of operating performance points (OPP) API
     functions and switch over some users of the OPP framework to the
     new resource-managed API (Yangtao Li and Dmitry Osipenko).

   - Update devfreq core:

      * Register devfreq devices as cooling devices on demand (Daniel
        Lezcano).

      * Add missing unlock opeation in devfreq_add_device() (Lukasz
        Luba).

      * Use the next frequency as resume_freq instead of the previous
        frequency when using the opp-suspend property (Dong Aisheng).

      * Check get_dev_status in devfreq_update_stats() (Dong Aisheng).

      * Fix set_freq path for the userspace governor in Kconfig (Dong
        Aisheng).

      * Remove invalid description of get_target_freq() (Dong Aisheng).

   - Update devfreq drivers:

      * imx8m-ddrc: Remove imx8m_ddrc_get_dev_status() and unneeded
        of_match_ptr() (Dong Aisheng, Fabio Estevam).

      * rk3399_dmc: dt-bindings: Add rockchip,pmu phandle and drop
        references to undefined symbols (Enric Balletbo i Serra, Gaël
        PORTAY).

      * rk3399_dmc: Use dev_err_probe() to simplify the code (Krzysztof
        Kozlowski).

      * imx-bus: Remove unneeded of_match_ptr() (Fabio Estevam).

   - Fix kernel-doc warnings in three places (Pierre-Louis Bossart).

   - Fix typo in the pm-graph utility code (Ricardo Ribalda)"

* tag 'pm-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (74 commits)
  PM: wakeup: remove redundant assignment to variable retval
  PM: hibernate: x86: Use crc32 instead of md5 for hibernation e820 integrity check
  cpufreq: Kconfig: fix documentation links
  PM: wakeup: use dev_set_name() directly
  PM: runtime: Add documentation for pm_runtime_resume_and_get()
  cpufreq: intel_pstate: Simplify intel_pstate_update_perf_limits()
  cpufreq: armada-37xx: Fix module unloading
  cpufreq: armada-37xx: Remove cur_frequency variable
  cpufreq: armada-37xx: Fix determining base CPU frequency
  cpufreq: armada-37xx: Fix driver cleanup when registration failed
  clk: mvebu: armada-37xx-periph: Fix workaround for switching from L1 to L0
  clk: mvebu: armada-37xx-periph: Fix switching CPU freq from 250 Mhz to 1 GHz
  cpufreq: armada-37xx: Fix the AVS value for load L1
  clk: mvebu: armada-37xx-periph: remove .set_parent method for CPU PM clock
  cpufreq: armada-37xx: Fix setting TBG parent for load levels
  cpuidle: Fix ARM_QCOM_SPM_CPUIDLE configuration
  cpuidle: tegra: Remove do_idle firmware call
  cpuidle: tegra: Fix C7 idling state on Tegra114
  PM: sleep: fix typos in comments
  cpufreq: Remove unused for_each_policy macro
  ...
2021-04-26 15:10:25 -07:00
Chris von Recklinghausen
f5d1499ae2 PM: hibernate: x86: Use crc32 instead of md5 for hibernation e820 integrity check
Hibernation fails on a system in fips mode because md5 is used for the e820
integrity check and is not available. Use crc32 instead.

The check is intended to detect whether the E820 memory map provided
by the firmware after cold boot unexpectedly differs from the one that
was in use when the hibernation image was created. In this case, the
hibernation image cannot be restored, as it may cover memory regions
that are no longer available to the OS.

A non-cryptographic checksum such as CRC-32 is sufficient to detect such
inadvertent deviations.

Fixes: 62a03defea ("PM / hibernate: Verify the consistent of e820 memory map by md5 digest")
Reviewed-by: Eric Biggers <ebiggers@google.com>
Tested-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
[ rjw: Subject edit ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-04-21 19:03:37 +02:00
Ingo Molnar
d9f6e12fb0 x86: Fix various typos in comments
Fix ~144 single-word typos in arch/x86/ code comments.

Doing this in a single commit should reduce the churn.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-kernel@vger.kernel.org
2021-03-18 15:31:53 +01:00
Dan Williams
88e9a5b796 efi/fake_mem: arrange for a resource entry per efi_fake_mem instance
In preparation for attaching a platform device per iomem resource teach
the efi_fake_mem code to create an e820 entry per instance.  Similar to
E820_TYPE_PRAM, bypass merging resource when the e820 map is sanitized.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Link: https://lkml.kernel.org/r/159643096068.4062302.11590041070221681669.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:27 -07:00
Thomas Gleixner
37d1a04b13 Rebase locking/kcsan to locking/urgent
Merge the state of the locking kcsan branch before the read/write_once()
and the atomics modifications got merged.

Squash the fallout of the rebase on top of the read/write once and atomic
fallback work into the merge. The history of the original branch is
preserved in tag locking-kcsan-2020-06-02.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-06-11 20:02:46 +02:00