xv6-riscv_ch3
- This chapter covers the fundamental concepts of paging hardware, memory allocation, and process address-space management, including practical code such as creating address spaces, allocating physical memory, and the process-management functions sbrk and exec.
ch3: Page tables
- Page tables are the most popular mechanism through which the operating system provides each process with its own private address space and memory.
- Xv6 performs a few tricks: mapping the same memory (a trampoline page) in several address spaces, and guarding kernel and user stacks with an unmapped page. The rest of this chapter explains the page tables that the RISC-V hardware provides and how xv6 uses them.
3.1 Paging hardware
- As a reminder, RISC-V instructions (both user and kernel) manipulate virtual addresses. The machine’s RAM, or physical memory, is indexed with physical addresses. The RISC-V page table hardware connects these two kinds of addresses, by mapping each virtual address to a physical address.
- Xv6 runs on Sv39 RISC-V, which means that only the bottom 39 bits of a 64-bit virtual address are used; the top 25 bits are not used. In this Sv39 configuration, a RISC-V page table is logically an array of 2^27 (134,217,728) page table entries (PTEs). Each PTE contains a 44-bit physical page number (PPN) and some flags. The paging hardware translates a virtual address by using the top 27 bits of the 39 bits to index into the page table to find a PTE, and making a 56-bit physical address whose top 44 bits come from the PPN in the PTE and whose bottom 12 bits are copied from the original virtual address. Figure 3.1 shows this process with a logical view of the page table as a simple array of PTEs (see Figure 3.2 for a fuller story). A page table gives the operating system control over virtual-to-physical address translations at the granularity of aligned chunks of 4096 (2^12) bytes. Such a chunk is called a page.
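To make the arithmetic concrete, here is a minimal sketch (not xv6 code) of the logical one-level translation of Figure 3.1; the PTE layout assumed here (44-bit PPN starting at bit 10) matches the one xv6 uses:

```c
#include <stdint.h>

#define PGSHIFT 12                              // 4096-byte pages
#define PTE2PA(pte) (((pte) >> 10) << PGSHIFT)  // PPN lives in bits 10..53

// Translate a 39-bit Sv39 virtual address using a flat array of 2^27 PTEs.
uint64_t translate(uint64_t *pt, uint64_t va) {
  uint64_t index  = (va >> PGSHIFT) & 0x7FFFFFF; // top 27 of the 39 bits
  uint64_t offset = va & 0xFFF;                  // bottom 12 bits, copied
  return PTE2PA(pt[index]) | offset;             // 56-bit physical address
}
```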
- In Sv39 RISC-V, the top 25 bits of a virtual address are not used for translation. The physical address also has room for growth: there is room in the PTE format for the physical page number to grow by another 10 bits. The designers of RISC-V chose these numbers based on technology predictions. 2^39 bytes is 512 GB, which should be enough address space for applications running on RISC-V computers. 2^56 bytes is enough physical memory space for the near future to fit many I/O devices and RAM chips. If more is needed, the RISC-V designers have defined Sv48 with 48-bit virtual addresses [3].
- As Figure 3.2 shows, a RISC-V CPU translates a virtual address into a physical address in three steps.
A page table is stored in physical memory as a three-level tree. The root of the tree is a 4096-byte
page-table page that contains 512 PTEs, which contain the physical addresses for page-table pages
in the next level of the tree. Each of those pages contains 512 PTEs for the final level in the tree.
The paging hardware uses the top 9 bits of the 27 bits to select a PTE in the root page-table page,
the middle 9 bits to select a PTE in a page-table page in the next level of the tree, and the bottom
9 bits to select the final PTE. (In Sv48 RISC-V a page table has four levels, and bits 39 through 47
of a virtual address index into the top-level.)
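In xv6 the three 9-bit indexes are extracted with the PX macro; a condensed excerpt in the style of kernel/riscv.h:

```c
#define PGSHIFT 12                  // bits of in-page offset
#define PXMASK  0x1FF               // 9 bits per level
#define PXSHIFT(level) (PGSHIFT + 9*(level))
#define PX(level, va) ((((uint64)(va)) >> PXSHIFT(level)) & PXMASK)
// PX(2, va) indexes the root page-table page, PX(1, va) the middle
// level, and PX(0, va) the bottom level.
```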
If any of the three PTEs required to translate an address is not present, the paging hardware
raises a page-fault exception, leaving it up to the kernel to handle the exception (see Chapter 4).
The three-level structure of Figure 3.2 allows a memory-efficient way of recording PTEs, com-
pared to the single-level design of Figure 3.1. In the common case in which large ranges of virtual
addresses have no mappings, the three-level structure can omit entire page directories. For exam-
ple, if an application uses only a few pages starting at address zero, then the entries 1 through 511
of the top-level page directory are invalid, and the kernel doesn’t have to allocate pages for those
511 intermediate page directories. Furthermore, the kernel also doesn’t have to allocate pages for
the bottom-level page directories for those 511 intermediate page directories. So, in this example,
the three-level design saves 511 pages for intermediate page directories and 511 × 512 pages for
bottom-level page directories.
Although a CPU walks the three-level structure in hardware as part of executing a load or store
instruction, a potential downside of three levels is that the CPU must load three PTEs from memory
to perform the translation of the virtual address in the load/store instruction to a physical address.
To avoid the cost of loading PTEs from physical memory, a RISC-V CPU caches page table entries
in a Translation Look-aside Buffer (TLB).
- Each PTE contains flag bits that tell the paging hardware how the associated virtual address is allowed to be used. PTE_V indicates whether the PTE is present: if it is not set, a reference to the page causes an exception (i.e., is not allowed). PTE_R controls whether instructions are allowed to read the page. PTE_W controls whether instructions are allowed to write the page. PTE_X controls whether the CPU may interpret the contents of the page as instructions and execute them. PTE_U controls whether instructions in user mode are allowed to access the page; if PTE_U is not set, the PTE can be used only in supervisor mode. Figure 3.2 shows how it all works. The flags and all other paging-hardware-related structures are defined in (kernel/riscv.h).
- To tell a CPU to use a page table, the kernel must write the physical address of the root page-table page into the satp register. A CPU will translate all addresses generated by subsequent instructions using the page table pointed to by its own satp. Each CPU has its own satp so that different CPUs can run different processes, each with a private address space described by its own page table.
- A few notes about terms used in this book. Physical memory refers to storage cells in RAM.
A byte of physical memory has an address, called a physical address. Instructions that dereference
addresses (such as loads, stores, jumps, and function calls) use only virtual addresses, which the
paging hardware translates to physical addresses, and then sends to the RAM hardware to read or
write storage. An address space is the set of virtual addresses that are valid in a given page table; each xv6 process has a separate user address space, and the xv6 kernel has its own address space as
well. User memory refers to a process’s user address space plus the physical memory that the page
table allows the process to access. Virtual memory refers to the ideas and techniques associated
with managing page tables and using them to achieve goals such as isolation.
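For reference, the PTE flag bits described above are defined in kernel/riscv.h (excerpt):

```c
#define PTE_V (1L << 0) // valid: the PTE is present
#define PTE_R (1L << 1) // readable
#define PTE_W (1L << 2) // writable
#define PTE_X (1L << 3) // executable
#define PTE_U (1L << 4) // accessible to user-mode instructions
```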
3.2 Kernel address space
- Xv6 maintains one page table per process, describing each process’s user address space, plus a sin-
gle page table that describes the kernel’s address space. The kernel configures the layout of its ad-
dress space to give itself access to physical memory and various hardware resources at predictable virtual addresses. Figure 3.3 shows how this layout maps kernel virtual addresses to physical addresses. The file (kernel/memlayout.h) declares the constants for xv6’s kernel memory layout.
- The kernel gets at RAM and memory-mapped device registers using “direct mapping;” that is, mapping the resources at virtual addresses that are equal to the physical address. For example, the kernel itself is located at KERNBASE=0x80000000 in both the virtual address space and in physical memory. Direct mapping simplifies kernel code that reads or writes physical memory.
- There are a couple of kernel virtual addresses that aren’t direct-mapped:
- The trampoline page. It is mapped at the top of the virtual address space; user page tables have this same mapping. Chapter 4 discusses the role of the trampoline page, but we see here an interesting use case of page tables; a physical page (holding the trampoline code) is mapped twice in the virtual address space of the kernel: once at the top of the virtual address space and once with a direct mapping.
- The kernel stack pages. Each process has its own kernel stack, which is mapped high so that below it xv6 can leave an unmapped guard page. The guard page’s PTE is invalid (i.e., PTE_V is not set), so that if the kernel overflows a kernel stack, it will likely cause an exception and the kernel will panic. Without a guard page an overflowing stack would overwrite other kernel memory, resulting in incorrect operation. A panic crash is preferable.
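The constants behind this layout, condensed from kernel/memlayout.h and kernel/riscv.h (lightly abridged):

```c
#define KERNBASE 0x80000000L                 // kernel loads here; direct-mapped
#define PHYSTOP  (KERNBASE + 128*1024*1024)  // assumed top of physical RAM

// highest virtual address xv6 uses: one bit less than the Sv39 maximum,
// to avoid having to sign-extend addresses with the high bit set
#define MAXVA (1L << (9 + 9 + 9 + 12 - 1))

#define TRAMPOLINE (MAXVA - PGSIZE)          // trampoline page at the very top

// kernel stack for process p: below the trampoline, each stack
// separated from the next by an unmapped guard page
#define KSTACK(p) (TRAMPOLINE - ((p)+1)*2*PGSIZE)
```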
3.3 Code: creating an address space
- Most of the xv6 code for manipulating address spaces and page tables resides in vm.c (kernel/vm.c:1). The central data structure is pagetable_t, which is really a pointer to a RISC-V root page-table page; a pagetable_t may be either the kernel page table, or one of the per-process page tables. The central functions are walk, which finds the PTE for a virtual address,and mappages, which installs PTEs for new mappings. Functions starting with kvm manipulate the kernel page table; functions starting with uvm manipulate a user page table; other functions are used for both. copyout and copyin copy data to and from user virtual addresses provided as system call arguments; they are in vm.c because they need to explicitly translate those addresses in order to find the corresponding physical memory.
- Early in the boot sequence, main calls kvminit (kernel/vm.c:54) to create the kernel’s page ta-
ble using kvmmake (kernel/vm.c:20). This call occurs before xv6 has enabled paging on the RISC-V,
so addresses refer directly to physical memory. kvmmake first allocates a page of physical mem-
ory to hold the root page-table page. Then it calls kvmmap to install the translations that the kernel
needs. The translations include the kernel’s instructions and data, physical memory up to PHYSTOP,
and memory ranges which are actually devices. proc_mapstacks (kernel/proc.c:33) allocates a
kernel stack for each process. It calls kvmmap to map each stack at the virtual address generated
by KSTACK, which leaves room for the invalid stack-guard pages.
- kvmmap (kernel/vm.c:132) calls mappages (kernel/vm.c:144), which installs mappings into a
page table for a range of virtual addresses to a corresponding range of physical addresses. It does
this separately for each virtual address in the range, at page intervals. For each virtual address to
be mapped, mappages calls walk to find the address of the PTE for that address. It then initializes
the PTE to hold the relevant physical page number, the desired permissions (PTE_W, PTE_X, and/or
PTE_R), and PTE_V to mark the PTE as valid (kernel/vm.c:165).walk
(kernel/vm.c:86) mimics the RISC-V paging hardware as it looks up the PTE for a virtual
address (see Figure 3.2). walk descends the page table one level at a time, using each level’s 9
bits of virtual address to index into the relevant page directory page. At each level it finds either
the PTE of the next level’s page directory page, or the PTE of final page (kernel/vm.c:92). If a PTE
in a first or second level page directory page isn’t valid, then the required directory page hasn’t
yet been allocated; if the alloc argument is set, walk allocates a new page-table page and puts
its physical address in the PTE. It returns the address of the PTE in the lowest layer in the tree
(kernel/vm.c:102).
- main calls kvminithart (kernel/vm.c:62) to install the kernel page table. It writes the physical address of the root page-table page into the register satp. After this the CPU will translate addresses using the kernel page table. Since the kernel uses a direct mapping, the virtual address of the next instruction now maps to the right physical memory address.
- Each RISC-V CPU caches page table entries in a Translation Look-aside Buffer (TLB), and when xv6 changes a page table, it must tell the CPU to invalidate the corresponding cached TLB entries. If it didn’t, then at some point later the TLB might use an old cached mapping, pointing to a physical page that in the meantime has been allocated to another process; as a result, a process might be able to scribble on some other process’s memory. The RISC-V has an instruction sfence.vma that flushes the current CPU’s TLB. Xv6 executes sfence.vma in kvminithart after reloading the satp register, and in the trampoline code that switches to a user page table before returning to user space (kernel/trampoline.S:89).
It is also necessary to issue sfence.vma before changing satp, in order to wait for comple-
tion of all outstanding loads and stores. This wait ensures that preceding updates to the page table
have completed, and ensures that preceding loads and stores use the old page table, not the new
one.
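Both points show up in kvminithart itself; a condensed version from kernel/vm.c:

```c
// Switch the hardware page-table register to the kernel's page table,
// and enable paging.
void
kvminithart()
{
  // wait for any previous writes to page-table memory to finish, and
  // ensure preceding loads and stores used the old page table
  sfence_vma();

  // MAKE_SATP sets the Sv39 mode bits and converts the root page's
  // physical address into PPN form
  w_satp(MAKE_SATP(kernel_pagetable));

  // flush stale entries from the TLB
  sfence_vma();
}
```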
To avoid flushing the complete TLB, RISC-V CPUs may support address space identifiers
(ASIDs) [3]. The kernel can then flush just the TLB entries for a particular address space. Xv6
does not use this feature.
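Tying this section together, here is walk, condensed from kernel/vm.c (the out-of-range address check is abridged; PA2PTE is the inverse of PTE2PA, shifting a physical address into PPN form):

```c
// Return the address of the PTE in page table pagetable that corresponds
// to virtual address va. If alloc != 0, create any needed page-table pages.
pte_t *
walk(pagetable_t pagetable, uint64 va, int alloc)
{
  for(int level = 2; level > 0; level--){
    pte_t *pte = &pagetable[PX(level, va)];
    if(*pte & PTE_V){
      pagetable = (pagetable_t)PTE2PA(*pte);   // follow the tree downward
    } else {
      // the interior page-table page doesn't exist yet
      if(!alloc || (pagetable = (pde_t*)kalloc()) == 0)
        return 0;
      memset(pagetable, 0, PGSIZE);
      *pte = PA2PTE(pagetable) | PTE_V;
    }
  }
  return &pagetable[PX(0, va)];                // PTE in the bottom level
}
```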
3.4 Physical memory allocation
- The kernel must allocate and free physical memory at run-time for page tables, user memory,
kernel stacks, and pipe buffers.
- Xv6 uses the physical memory between the end of the kernel and PHYSTOP for run-time alloca-
tion. It allocates and frees whole 4096-byte pages at a time. It keeps track of which pages are free
by threading a linked list through the pages themselves. Allocation consists of removing a page
from the linked list; freeing consists of adding the freed page to the list.
3.5 Code: Physical memory allocator
- The allocator resides in kalloc.c (kernel/kalloc.c:1). The allocator’s data structure is a free list
of physical memory pages that are available for allocation. Each free page’s list element is a
struct run (kernel/kalloc.c:17). Where does the allocator get the memory to hold that data struc-
ture? It stores each free page’s run structure in the free page itself, since there’s nothing else stored
there. The free list is protected by a spin lock (kernel/kalloc.c:21-24). The list and the lock are
wrapped in a struct to make clear that the lock protects the fields in the struct. For now, ignore the
lock and the calls to acquire and release; Chapter 6 will examine locking in detail.
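The data structure itself is tiny (kernel/kalloc.c):

```c
struct run {
  struct run *next;        // stored inside the free page itself
};

struct {
  struct spinlock lock;    // protects freelist
  struct run *freelist;
} kmem;
```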
The function main calls kinit to initialize the allocator (kernel/kalloc.c:27). kinit initializes
the free list to hold every page between the end of the kernel and PHYSTOP. Xv6 ought to de-
termine how much physical memory is available by parsing configuration information provided
by the hardware. Instead xv6 assumes that the machine has 128 megabytes of RAM. kinit calls
freerange to add memory to the free list via per-page calls to kfree. A PTE can only refer to
a physical address that is aligned on a 4096-byte boundary (is a multiple of 4096), so freerange
uses PGROUNDUP to ensure that it frees only aligned physical addresses. The allocator starts with
no memory; these calls to kfree give it some to manage.
The allocator sometimes treats addresses as integers in order to perform arithmetic on them
(e.g., traversing all pages in freerange), and sometimes uses addresses as pointers to read and
write memory (e.g., manipulating the run structure stored in each page); this dual use of addresses
is the main reason that the allocator code is full of C type casts.
The function kfree (kernel/kalloc.c:47) begins by setting every byte in the memory being freed
to the value 1. This will cause code that uses memory after freeing it (uses “dangling references”)
to read garbage instead of the old valid contents; hopefully that will cause such code to break faster.
Then kfree prepends the page to the free list: it casts pa to a pointer to struct run, records the
old start of the free list in r->next, and sets the free list equal to r. kalloc removes and returns
the first element in the free list.
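Condensed versions of both functions, modeled on kernel/kalloc.c:

```c
void
kfree(void *pa)
{
  struct run *r;

  // end is the first address after the kernel image (from kernel.ld)
  if(((uint64)pa % PGSIZE) != 0 || (char*)pa < end || (uint64)pa >= PHYSTOP)
    panic("kfree");

  memset(pa, 1, PGSIZE);     // fill with junk to catch dangling references

  r = (struct run*)pa;
  acquire(&kmem.lock);
  r->next = kmem.freelist;   // prepend the page to the free list
  kmem.freelist = r;
  release(&kmem.lock);
}

void *
kalloc(void)
{
  struct run *r;

  acquire(&kmem.lock);
  r = kmem.freelist;         // take the first free page
  if(r)
    kmem.freelist = r->next;
  release(&kmem.lock);

  if(r)
    memset((char*)r, 5, PGSIZE); // fill with junk
  return (void*)r;
}
```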
3.6 Process address space
- Each process has its own page table, and when xv6 switches between processes, it also changes
page tables. Figure 3.4 shows a process’s address space in more detail than Figure 2.3. A process’s
user memory starts at virtual address zero and can grow up to MAXVA (kernel/riscv.h:375), allowing
a process to address in principle 256 Gigabytes of memory.
A process’s address space consists of pages that contain the text of the program (which xv6
maps with the permissions PTE_R, PTE_X, and PTE_U), pages that contain the pre-initialized data
of the program, a page for the stack, and pages for the heap. Xv6 maps the data, stack, and heap
with the permissions PTE_R, PTE_W, and PTE_U.
Using permissions within a user address space is a common technique to harden a user process.
If the text were mapped with PTE_W, then a process could accidentally modify its own program;
for example, a programming error may cause the program to write to a null pointer, modifying
instructions at address 0, and then continue running, perhaps creating more havoc. To detect such
errors immediately, xv6 maps the text without PTE_W; if a program accidentally attempts to store
to address 0, the hardware will refuse to execute the store and will raise a page fault (see Section 4.6).
The kernel then kills the process and prints out an informative message so that the developer can
track down the problem.
Similarly, by mapping data without PTE_X, a user program cannot accidentally jump to an
address in the program’s data and start executing at that address.
In the real world, hardening a process by setting permissions carefully also aids in defending
against security attacks. An adversary may feed carefully-constructed input to a program (e.g., a
Web server) that triggers a bug in the program in the hope of turning that bug into an exploit [14].
Setting permissions carefully and other techniques, such as randomizing the layout of the user
address space, make such attacks harder.
The stack is a single page, and is shown with the initial contents as created by exec. Strings
containing the command-line arguments, as well as an array of pointers to them, are at the very
top of the stack. Just under that are values that allow a program to start at main as if the function
main(argc, argv) had just been called.
To detect a user stack overflowing the allocated stack memory, xv6 places an inaccessible guard
page right below the stack by clearing the PTE_U flag. If the user stack overflows and the process
tries to use an address below the stack, the hardware will generate a page-fault exception because
the guard page is inaccessible to a program running in user mode. A real-world operating system
might instead automatically allocate more memory for the user stack when it overflows.
When a process asks xv6 for more user memory, xv6 grows the process’s heap. Xv6 first uses kalloc to allocate physical pages. It then adds PTEs to the process’s page table that point to the
new physical pages. Xv6 sets the PTE_W, PTE_R, PTE_U, and PTE_V flags in these PTEs. Most
processes do not use the entire user address space; xv6 leaves PTE_V clear in unused PTEs.
We see here a few nice examples of use of page tables. First, different processes’ page tables
translate user addresses to different pages of physical memory, so that each process has private user
memory. Second, each process sees its memory as having contiguous virtual addresses starting at
zero, while the process’s physical memory can be non-contiguous. Third, the kernel maps a page
with trampoline code at the top of the user address space (without PTE_U), thus a single page of
physical memory shows up in all address spaces, but can be used only by the kernel.
3.7 Code: sbrk
- sbrk is the system call for a process to shrink or grow its memory. The system call is implemented
by the function growproc (kernel/proc.c:260). growproc calls uvmalloc or uvmdealloc, de-
pending on whether n is positive or negative. uvmalloc (kernel/vm.c:233) allocates physical mem-
ory with kalloc, zeros the allocated memory, and adds PTEs to the user page table with mappages.
uvmdealloc calls uvmunmap (kernel/vm.c:178), which uses walk to find PTEs and kfree to
free the physical memory they refer to.
Xv6 uses a process’s page table not just to tell the hardware how to map user virtual addresses, but also as the only record of which physical memory pages are allocated to that process. That is the reason why freeing user memory (in uvmunmap) requires examination of the user page table.
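growproc itself is short; a lightly abridged version modeled on kernel/proc.c (recent versions of xv6 pass PTE_W so the new pages are writable):

```c
int
growproc(int n)
{
  uint64 sz;
  struct proc *p = myproc();

  sz = p->sz;
  if(n > 0){
    // grow: allocate physical pages and map them writable
    if((sz = uvmalloc(p->pagetable, sz, sz + n, PTE_W)) == 0)
      return -1;
  } else if(n < 0){
    // shrink: unmap pages and free the physical memory
    sz = uvmdealloc(p->pagetable, sz, sz + n);
  }
  p->sz = sz;
  return 0;
}
```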
3.8 Code: exec
- A binary is typically the output of the compiler and linker, and holds machine instructions and program data.
- exec (kernel/exec.c:23) opens the named binary path using
namei (kernel/exec.c:36), which is explained in Chapter 8. Then, it reads the ELF header. Xv6
binaries are formatted in the widely-used ELF format, defined in (kernel/elf.h). An ELF binary
consists of an ELF header, struct elfhdr (kernel/elf.h:6), followed by a sequence of program
section headers, struct proghdr (kernel/elf.h:25). Each proghdr describes a section of the
application that must be loaded into memory; xv6 programs have two program section headers:
one for instructions and one for data.
The first step is a quick check that the file probably contains an ELF binary. An ELF binary
starts with the four-byte “magic number” 0x7F, ‘E’, ‘L’, ‘F’, or ELF_MAGIC (kernel/elf.h:3). If
the ELF header has the right magic number, exec assumes that the binary is well-formed.
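In code, the check is a single comparison; ELF_MAGIC is “\x7FELF” read as a little-endian 32-bit integer (kernel/elf.h), and the surrounding error handling is abridged here:

```c
#define ELF_MAGIC 0x464C457FU   // "\x7FELF" in little-endian order

// in exec, after readi has filled in the elf header:
if(elf.magic != ELF_MAGIC)
  goto bad;                     // not an ELF binary: clean up and fail
```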
exec allocates a new page table with no user mappings with proc_pagetable (kernel/exec.c:49),
allocates memory for each ELF segment with uvmalloc (kernel/exec.c:65), and loads each segment
into memory with loadseg (kernel/exec.c:10). loadseg uses walkaddr to find the physical ad-
dress of the allocated memory at which to write each page of the ELF segment, and readi to read
from the file.
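A condensed version of loadseg, modeled on kernel/exec.c (argument checking abridged):

```c
// Load a program segment into pagetable at virtual address va.
// The pages from va to va+sz must already be mapped (uvmalloc did that).
static int
loadseg(pagetable_t pagetable, uint64 va, struct inode *ip, uint offset, uint sz)
{
  uint i, n;
  uint64 pa;

  for(i = 0; i < sz; i += PGSIZE){
    pa = walkaddr(pagetable, va + i);   // translate va in software
    if(pa == 0)
      panic("loadseg: address should exist");
    if(sz - i < PGSIZE)
      n = sz - i;                       // last, partial page
    else
      n = PGSIZE;
    if(readi(ip, 0, (uint64)pa, offset+i, n) != n)
      return -1;                        // short read from the file
  }
  return 0;
}
```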
The program section headers for /init, the first user program created with exec, look like this (reconstructed and abridged from objdump -p output; the field values match those discussed below):

```
$ objdump -p user/_init
...
0x70000003 off 0x6bb0 filesz 0x4a memsz 0x0 flags r--
      LOAD off 0x1000 vaddr 0x0 filesz 0x1000 memsz 0x1000 flags r-x
      LOAD off 0x2000 vaddr 0x1000 filesz 0x10 memsz 0x30 flags rw-
     STACK filesz 0x0 memsz 0x0 flags rw-
```

The output of objdump -p shows the Program Header Table of an ELF (Executable and Linkable Format) file. This is a crucial part of the ELF file, telling the operating system how to load each segment into memory for execution.
🌟 Explanation of ELF Program Header fields; each segment is described by the following fields:

Field | Meaning |
---|---|
off | Byte offset of the segment in the file, counted from the beginning. |
vaddr | Virtual address of the segment in memory. |
paddr | Physical address; usually ignored by modern OSes. |
align | Alignment requirement; the segment must be aligned to this (often a page size). |
filesz | Size of the segment in the file (in bytes). loadseg reads this many bytes. |
memsz | Total size the segment occupies in memory after loading (may be larger than filesz, e.g., for bss). |
flags | Permission flags: r (read), w (write), x (execute). |
type | Segment type (e.g., LOAD, STACK, NOTE, DYNAMIC). |
4 Program Headers (Segments):
1. First Segment: Type 0x70000003 (non-standard)
- type=0x70000003 is processor-specific and non-standard; it is generally ignored.
- off=0x6bb0: the offset within the file.
- filesz=0x4a: 74 bytes of data.
- memsz=0x0: although there is data in the file, nothing is loaded into memory (e.g., debug info or notes).
- This is typically metadata, not code or data.
2. Second Segment: The actual code segment (text segment)
- Type: LOAD, to be loaded into memory.
- off=0x1000: loading starts at offset 0x1000 in the file.
- vaddr=0x0: to be loaded at virtual address 0x0 (important).
- filesz = memsz = 0x1000: both file and memory size are 4 KB.
- flags=r-x: readable and executable, indicating the code segment.
🔑 This segment contains program instructions (text segment), and exec/loadseg will load it into memory at 0x0.
3. Third Segment: Data segment
- Type: LOAD, also needs to be loaded.
- File offset: 0x2000.
- Virtual address: 0x1000.
- filesz = 0x10, memsz = 0x30: 16 bytes come from the file; the rest is BSS (zero-initialized).
- flags = rw-: read/write permission, indicating a data segment.
🔑 This segment holds global variables: part comes from the file, and the rest is zeroed out by memset or page initialization.
4. Fourth Segment: Stack segment
- Type: STACK, indicating a stack segment.
- It usually has no actual content; the kernel allocates the stack itself rather than loading it from the ELF.
- This is just a marker indicating that the ELF expects a stack.
🔧 Main points in how exec loads ELF in xv6

ELF segment loading
- Each segment (like .text, .data) specifies a vaddr (virtual address).
- The kernel reads filesz bytes from the file offset and loads them into memory at vaddr.
- If memsz > filesz, the remaining space (like .bss) must be zero-initialized.
User stack initialization
- One page of memory is allocated for the user stack; the top stores the argument strings.
- A “guard page” is placed below the stack and made inaccessible.
- Arguments are passed to main() via a0 (argc) and a1 (argv).
Security checks
- A malicious ELF file could attempt to set dangerous vaddr values or trigger integer overflows.
- xv6 performs checks, such as preventing vaddr + memsz overflows.
- On RISC-V, xv6 uses separate page tables for user and kernel space to prevent interference.
ELF loading steps (in exec)
1. The kernel reads the ELF header and locates the Program Headers.
2. For each LOAD-type segment (sketched in code below):
   - It calls uvmalloc to allocate memory from vaddr to vaddr + memsz.
   - It uses loadseg to read filesz bytes from the file into that memory.
   - If memsz > filesz, the remaining bytes are cleared to zero.
3. Finally, exec creates the user stack, sets up the trap frame and arguments, and jumps to user mode for execution.
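A condensed sketch of the per-segment loop in exec, modeled on kernel/exec.c (recent versions of xv6 convert the segment’s ELF flags into PTE permissions with a small helper, flags2perm; error paths abridged):

```c
for(i = 0, off = elf.phoff; i < elf.phnum; i++, off += sizeof(ph)){
  // read the i'th program header from the file
  if(readi(ip, 0, (uint64)&ph, off, sizeof(ph)) != sizeof(ph))
    goto bad;
  if(ph.type != ELF_PROG_LOAD)
    continue;                              // skip non-LOAD segments
  if(ph.memsz < ph.filesz)
    goto bad;
  if(ph.vaddr + ph.memsz < ph.vaddr)
    goto bad;                              // reject integer-overflow tricks
  // grow the address space to cover [vaddr, vaddr + memsz);
  // uvmalloc zeroes the new pages, which initializes bss
  uint64 sz1;
  if((sz1 = uvmalloc(pagetable, sz, ph.vaddr + ph.memsz,
                     flags2perm(ph.flags))) == 0)
    goto bad;
  sz = sz1;
  // copy filesz bytes of the segment from the file into memory
  if(loadseg(pagetable, ph.vaddr, ip, ph.off, ph.filesz) < 0)
    goto bad;
}
```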
3.9 Real world
- Xv6 is simplified by the kernel’s use of a direct map between virtual and physical addresses, and
by its assumption that there is physical RAM at address 0x80000000, where the kernel expects to
be loaded. This works with QEMU, but on real hardware it turns out to be a bad idea; real hardware
places RAM and devices at unpredictable physical addresses, so that (for example) there might be
no RAM at 0x80000000, where xv6 expects to be able to store the kernel. More serious kernel
designs exploit the page table to turn arbitrary hardware physical memory layouts into predictable
kernel virtual address layouts.
- RISC-V supports protection at the level of physical addresses, but xv6 doesn’t use that feature.
- On machines with lots of memory it might make sense to use RISC-V’s support for “super
pages.” Small pages make sense when physical memory is small, to allow allocation and page-out
to disk with fine granularity. For example, if a program uses only 8 kilobytes of memory, giving
it a whole 4-megabyte super-page of physical memory is wasteful. Larger pages make sense on
machines with lots of RAM, and may reduce overhead for page-table manipulation.
- The xv6 kernel’s lack of a malloc-like allocator that can provide memory for small objects
prevents the kernel from using sophisticated data structures that would require dynamic allocation.
A more elaborate kernel would likely allocate many different sizes of small blocks, rather than (as
in xv6) just 4096-byte blocks; a real kernel allocator would need to handle small allocations as
well as large ones.