Kernel: Virtual Paging

From TechPubs Wiki

Revision as of 23:55, 9 January 2026 by Raion

Overview

The IRIX kernel virtual memory subsystem manages physical page allocation, deallocation, and mapping for kernel use. Its free page list is organized into cache color buckets, with separate queues for clean/stale and associated/unassociated pages. It also supports large pages (contiguous allocation and coalescing), NUMA-aware node-specific freelists, cache coloring with virtual coherency exception (VCE) avoidance, and separate handling of direct-mapped (K0/K1) and mapped (K2) addresses. Key characteristics:

pfd_t structures (page frame data, pfdat) track per-page metadata (flags, use count, hash chains, etc.).
phead_t bucket arrays, one per node and page size, manage the free lists.
Aggressive cache line reuse and coloring minimize cache conflicts.
A reservation system built on memory pools prevents deadlock.
Special handling for the R10000 speculation bug (lowmem separation).
Poisoned page support for ECC/uncorrectable errors.
Kernel stack pool for fast allocation.

The implementation prioritizes low-latency kernel allocation with careful TLB and cache management.

Key Functions

Core Allocation

kvpalloc / kvpalloc_node: Allocate virtual + physical pages (K2 + physical).
kvalloc: Allocate only K2 virtual space.
kpalloc / kpalloc_node: Map physical pages into existing K2 space.
pagealloc / pagealloc_node / pagealloc_size: Core physical page allocator (single or large pages).
contig_memalloc / kmem_contig_alloc: Physically contiguous allocation.

Deallocation

kvpfree / kvpffree: Free virtual + physical (handles K0/K1/K2).
kvfree: Free only K2 virtual space.
pagefree / pagefree_size: Return physical page to freelist (with coalescing).
kmem_contig_free: Free contiguous block.

Large Page Support

lpage_alloc_contig_physmem: Allocate large contiguous block.
lpage_free_contig_physmem: Free large block.
lpage_coalesce: Background merging of adjacent free base pages.
lpage_split: Break large page into smaller ones.
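The coalescing condition can be illustrated with a toy model. This is a minimal sketch, not the lpage_coalesce implementation: the `page_free` array, `NPAGES`, and both helper functions are hypothetical, and large pages are modeled as generic power-of-two runs of base pages rather than the page sizes the hardware actually supports. It shows only the core test: a large page can be formed when a naturally aligned run of base pages is entirely free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 64u                 /* hypothetical size of the toy memory */

static bool page_free[NPAGES];     /* true if the base page is on a free list */

/* A large page of 2^order base pages must start on a 2^order boundary
 * and every base page inside it must currently be free. */
static bool can_coalesce(size_t pfn, unsigned order)
{
    size_t n = (size_t)1 << order;

    if ((pfn & (n - 1)) != 0 || pfn + n > NPAGES)
        return false;
    for (size_t i = 0; i < n; i++)
        if (!page_free[pfn + i])
            return false;
    return true;
}

/* Scan all aligned candidates of one order, as a background pass might. */
static size_t count_coalescible(unsigned order)
{
    size_t n = (size_t)1 << order;
    size_t found = 0;

    for (size_t pfn = 0; pfn + n <= NPAGES; pfn += n)
        if (can_coalesce(pfn, order))
            found++;
    return found;
}
```

Splitting is the inverse operation: a free large page at an aligned pfn is returned to the lower-order lists as its constituent smaller pages.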

Special Cases

page_mapin / page_mapout: Temporary mapping for copy/zero.
page_copy / page_zero: COW and fault-time zeroing (with BTE on SN0).
page_discard / page_error_clean: Handle ECC/poisoned pages.
kstack_alloc / kstack_free: Kernel stack page pool.

Undocumented or IRIX-Specific Interfaces and Behaviors

Critical Structures (from pfdat.h, page.h, etc.)

pfd_t (page frame data):
pf_flags: P_QUEUE, P_HASH, P_ANON, P_DONE, P_WAIT, P_DIRTY, P_DUMP, P_BULKDATA, P_ERROR, P_HWBAD, etc.
pf_use: Reference count.
pf_next / pf_prev: Free list links.
pf_hchain: Hash chain.
pf_tag: Vnode or anon handle.
pf_pageno: File offset.

phead_t (per-color bucket):
ph_count: Number of pages.
ph_list[PH_NLISTS]: CLEAN/STALE, ASSOC/NOASSOC (and POISONOUS on NUMA).

pg_free_t (per node): freelists, phead arrays, rotors, counters.
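To make the relationship between these structures concrete, here is a simplified sketch. It is not the declaration from pfdat.h or page.h: field names follow the list above, but the field types, the flag bit values, the list ordering, the use of NULL-terminated lists, and the helpers `phead_push` and `phead_pop` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit values; the real ones are defined in pfdat.h. */
#define P_QUEUE 0x0001u            /* page is on a free list */
#define P_HASH  0x0002u            /* page is on a hash chain */

/* Free lists inside one bucket (ordering assumed; NUMA adds POISONOUS). */
enum { CLEAN_ASSOC, CLEAN_NOASSOC, STALE_ASSOC, STALE_NOASSOC, PH_NLISTS };

typedef struct pfd {
    struct pfd *pf_next;           /* free list links */
    struct pfd *pf_prev;
    struct pfd *pf_hchain;         /* hash chain */
    void       *pf_tag;            /* vnode or anon handle */
    uint32_t    pf_pageno;         /* file offset, in pages */
    uint32_t    pf_flags;
    uint16_t    pf_use;            /* reference count */
} pfd_t;

typedef struct phead {
    unsigned ph_count;             /* pages in this bucket, all lists */
    pfd_t   *ph_list[PH_NLISTS];   /* list heads (simplified) */
} phead_t;

/* Put an unreferenced page at the head of one list in the bucket. */
static void phead_push(phead_t *ph, int list, pfd_t *pfd)
{
    assert(list >= 0 && list < PH_NLISTS);
    assert(pfd->pf_use == 0 && !(pfd->pf_flags & P_QUEUE));
    pfd->pf_prev = NULL;
    pfd->pf_next = ph->ph_list[list];
    if (pfd->pf_next != NULL)
        pfd->pf_next->pf_prev = pfd;
    ph->ph_list[list] = pfd;
    pfd->pf_flags |= P_QUEUE;
    ph->ph_count++;
}

/* Take the first page off one list, or return NULL if it is empty. */
static pfd_t *phead_pop(phead_t *ph, int list)
{
    pfd_t *pfd = ph->ph_list[list];

    if (pfd == NULL)
        return NULL;
    ph->ph_list[list] = pfd->pf_next;
    if (pfd->pf_next != NULL)
        pfd->pf_next->pf_prev = NULL;
    pfd->pf_next = pfd->pf_prev = NULL;
    pfd->pf_flags &= ~P_QUEUE;
    pfd->pf_use = 1;               /* caller now holds the only reference */
    ph->ph_count--;
    return pfd;
}
```

In this model, P_QUEUE and pf_use together describe whether a page is free, which is why both are updated in the same place.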


Free List Organization

Per-node, per-page-size phead arrays.
Buckets indexed by cache color (pheadmask).
Separate lists: CLEAN/STALE × ASSOC/NOASSOC (plus POISONOUS).
Rotor for round-robin uncached allocation.
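The bucket lookup can be sketched as follows. This is an illustrative model under stated assumptions, not the pagealloc code: the bucket count `NCOLORS`, the per-list counters, the list preference order in `pref`, and the functions `pick_page` and `next_color` are hypothetical. Only the idea of masking a color with pheadmask and of keeping four lists per bucket comes from the description above.

```c
#include <assert.h>

enum { CLEAN_ASSOC, CLEAN_NOASSOC, STALE_ASSOC, STALE_NOASSOC, PH_NLISTS };

#define NCOLORS 8u                           /* hypothetical, power of two */
static const unsigned pheadmask = NCOLORS - 1;

/* Assumed preference: unassociated before associated, clean before stale. */
static const int pref[PH_NLISTS] = {
    CLEAN_NOASSOC, STALE_NOASSOC, CLEAN_ASSOC, STALE_ASSOC
};

typedef struct {
    unsigned nfree[PH_NLISTS];               /* pages on each list */
} bucket_t;

/* Take one page, starting at the wanted color. With exact set, only that
 * color is acceptable; otherwise neighbouring colors are tried in turn.
 * Returns 0 and reports the bucket and list used, or -1 if none is free. */
static int pick_page(bucket_t *b, unsigned want, int exact,
                     unsigned *color_out, int *list_out)
{
    for (unsigned i = 0; i < NCOLORS; i++) {
        unsigned c = (want + i) & pheadmask;

        for (int k = 0; k < PH_NLISTS; k++) {
            int l = pref[k];

            if (b[c].nfree[l] > 0) {
                b[c].nfree[l]--;
                *color_out = c;
                *list_out = l;
                return 0;
            }
        }
        if (exact)
            break;
    }
    return -1;
}

/* Rotor: callers with no color preference are spread across buckets. */
static unsigned rotor;

static unsigned next_color(void)
{
    return rotor++ & pheadmask;
}
```

A caller that needs a specific color passes `exact = 1`, while a caller with no preference can start from `next_color()` so that successive allocations do not all land in the same bucket.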

NUMA and Migration

Node-local freelists.
Round-robin or radial search fallback.
Poisoned page handling (directory clearing, discard queues).
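The two fallback strategies differ only in the order in which nodes are visited. The sketch below models that order with per-node free counters; the node count, the `node_dist` hop table, and the function names are hypothetical, and a real system would derive distances from the interconnect topology.

```c
#include <assert.h>

#define NNODES  4
#define MAXDIST 2

/* Hypothetical hop counts between nodes. */
static const int node_dist[NNODES][NNODES] = {
    { 0, 1, 1, 2 },
    { 1, 0, 2, 1 },
    { 1, 2, 0, 1 },
    { 2, 1, 1, 0 },
};

static unsigned node_free[NNODES];   /* free pages per node */

/* Radial search: the local node first, then nodes at increasing distance.
 * Returns the node a page was taken from, or -1 if every node is empty. */
static int alloc_radial(int local)
{
    for (int d = 0; d <= MAXDIST; d++)
        for (int n = 0; n < NNODES; n++)
            if (node_dist[local][n] == d && node_free[n] > 0) {
                node_free[n]--;
                return n;
            }
    return -1;
}

/* Round-robin: a rotor spreads allocations across nodes regardless of
 * distance, resuming after the node that satisfied the last request. */
static int rr_rotor;

static int alloc_round_robin(void)
{
    for (int i = 0; i < NNODES; i++) {
        int n = (rr_rotor + i) % NNODES;

        if (node_free[n] > 0) {
            node_free[n]--;
            rr_rotor = (n + 1) % NNODES;
            return n;
        }
    }
    return -1;
}
```

Radial search keeps memory close to the requester at the cost of draining nearby nodes first; round-robin trades locality for an even spread.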

R10000 Speculation Workaround

Low memory (below 256 MB) is separated from high memory.
Special sxbrk variants (low/high memory).
Kernel reference tracking (krpf) for DMA safety.
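The 256 MB boundary translates into a page frame number threshold. The following sketch works through that arithmetic; it assumes 4 KB base pages, and `pfn_class`, `count_free`, and the counters are hypothetical names used only to show how a freed page could be routed to a low or high pool.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT   12u                      /* assumed 4 KB base pages */
#define LOWMEM_BYTES (256ull << 20)           /* 256 MB boundary */
#define LOWMEM_PFNS  (LOWMEM_BYTES >> PAGE_SHIFT)

typedef enum { MEM_LOW, MEM_HIGH } memclass_t;

/* Classify a page frame by which side of the boundary it falls on. */
static memclass_t pfn_class(uint64_t pfn)
{
    return pfn < LOWMEM_PFNS ? MEM_LOW : MEM_HIGH;
}

/* Separate accounting per class, so a waiter for low memory is not
 * satisfied by a page freed in high memory and vice versa. */
static unsigned nfree[2];

static void count_free(uint64_t pfn)
{
    nfree[pfn_class(pfn)]++;
}
```

With 4 KB pages the threshold is 2^28 / 2^12 = 65536 page frames; a different base page size shifts the threshold accordingly.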

Cache and TLB Optimizations

Direct K0/K1 preferred when possible.
VCE avoidance via color validation.
Stale → clean promotion with selective cache flush.
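The distinction between direct-mapped and mapped kernel addresses, and the color check, can be sketched with plain address arithmetic. The sketch uses the 32-bit MIPS kernel segment layout (K0 cached and unmapped, K1 uncached and unmapped, K2 mapped through the TLB); 64-bit kernels use a different direct-map region, which is not modeled here. The page size, the color mask, and all function names are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define K0BASE 0x80000000u         /* cached, unmapped */
#define K1BASE 0xA0000000u         /* uncached, unmapped */
#define K2BASE 0xC0000000u         /* mapped through the TLB */
#define K0SIZE 0x20000000u         /* 512 MB direct-map window */

typedef enum { SEG_USER, SEG_K0, SEG_K1, SEG_K2 } kseg_t;

static kseg_t kseg_of(uint32_t va)
{
    if (va >= K2BASE) return SEG_K2;
    if (va >= K1BASE) return SEG_K1;
    if (va >= K0BASE) return SEG_K0;
    return SEG_USER;
}

/* Only physical memory inside the window can be reached without a TLB entry. */
static bool k0_reachable(uint64_t pa)
{
    return pa < K0SIZE;
}

static uint32_t phys_to_k0(uint32_t pa)
{
    assert(k0_reachable(pa));
    return pa | K0BASE;
}

static uint32_t k0_to_phys(uint32_t va)
{
    return va & (K0SIZE - 1);
}

#define PAGE_SHIFT 12u             /* assumed 4 KB pages */
#define COLOR_MASK 0x7u            /* hypothetical: 8 cache colors */

static unsigned color_of(uint32_t addr)
{
    return (addr >> PAGE_SHIFT) & COLOR_MASK;
}

/* Color validation: a mapping is treated as safe when the virtual and
 * physical colors match, so every alias indexes the same cache lines. */
static bool color_ok(uint32_t va, uint32_t pa)
{
    return color_of(va) == color_of(pa);
}
```

A free routine that accepts K0, K1, and K2 addresses would branch on something like `kseg_of`: only K2 addresses have virtual space and TLB mappings to release, while K0/K1 addresses convert straight back to a physical page.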

Similarities to illumos and BSD Kernel Implementations

illumos (Solaris-derived)

Strong similarity:

Page freelist with hash buckets and color awareness.
kmem allocator for kernel objects (kmem caches are comparable to IRIX zones).
Contiguous allocation via vmem.
Page daemon (vhand equivalent).
NUMA support evolved differently (locality groups).

Porting: the illumos kmem/page subsystem is the closest match, but it lacks exact phead coloring and large-page coalescing.

BSD (FreeBSD)

More divergent:

Simple page queues (free, cache, etc.).
UMA/kmem for slabs, vm_page for physical pages.
Contiguous allocation via contigmalloc.
No per-color buckets or large-page splitting.

Porting: BSD is simpler; it lacks IRIX's sophisticated coloring, NUMA freelists, and coalescing.

Overall, IRIX VM is classic SVR4 with heavy MIPS/NUMA/R10000 optimizations. illumos provides the nearest modern analog; BSD is too simplified to correspond one-to-one. For replication, preserve the pfdat flags, phead structure, node freelists, and coalescing logic.