xv6-riscv_ch4
- How traps and system calls work on RISC-V. It introduces the trap mechanism, how user programs invoke system calls, how the kernel handles those traps, and how arguments are passed. It also covers kernel-mode traps, page faults, and real-world implications like protection and isolation.
ch4: Traps and system calls
There are three kinds of event which cause the CPU to set aside ordinary execution of instructions and force a transfer of control to special code that handles the event. We use trap as a generic term for these situations.
- One situation is a system call, when a user program executes the ecall instruction to ask the kernel to do something for it.
- Another situation is an exception: an instruction (user or kernel) does something illegal, such as divide by zero or use an invalid virtual address.
- The third situation is a device interrupt, when a device signals that it needs attention, for example when the disk hardware finishes a read or write request.
Xv6 handles all traps in the kernel; traps are not delivered to user code. Handling traps in the kernel is natural for system calls. It makes sense for interrupts since isolation demands that only the kernel be allowed to use devices, and because the kernel is a convenient mechanism with which to share devices among multiple processes. It also makes sense for exceptions since xv6 responds to all exceptions from user space by killing the offending program.
Xv6 trap handling proceeds in four stages:
- hardware actions taken by the RISC-V CPU,
- some assembly instructions that prepare the way for kernel C code,
- a C function that decides what to do with the trap,
- and the system call or device-driver service routine.
While commonality among the three trap types suggests that a kernel could handle all traps with a single code path, it turns out to be convenient to have separate code for two distinct cases: traps from user space, and traps from kernel space. Kernel code (assembler or C) that processes a trap is often called a handler; the first handler instructions are usually written in assembler (rather than C) and are sometimes called a vector.
4.1 RISC-V trap machinery
Each RISC-V CPU has a set of control registers that the kernel writes to tell the CPU how to handle traps, and that the kernel can read to find out about a trap that has occurred. The RISC-V documents contain the full story [3]. riscv.h (kernel/riscv.h:1) contains definitions that xv6 uses. Here’s an outline of the most important registers:
- stvec: The kernel writes the address of its trap handler here; the RISC-V jumps to the address in stvec to handle a trap.
- sepc: When a trap occurs, RISC-V saves the program counter here (since the pc is then overwritten with the value in stvec). The sret (return from trap) instruction copies sepc to the pc. The kernel can write sepc to control where sret goes.
- scause: RISC-V puts a number here that describes the reason for the trap.
- sscratch: The trap handler code uses sscratch to help it avoid overwriting user registers before saving them.
- sstatus: The SIE bit in sstatus controls whether device interrupts are enabled. If the kernel clears SIE, the RISC-V will defer device interrupts until the kernel sets SIE. The SPP bit indicates whether a trap came from user mode or supervisor mode, and controls to what mode sret returns.
The above registers relate to traps handled in supervisor mode, and they cannot be read or written in user mode. Each CPU on a multi-core chip has its own set of these registers, and more than one CPU may be handling a trap at any given time.
When it needs to force a trap, the RISC-V hardware does the following for all trap types:
- If the trap is a device interrupt, and the sstatus SIE bit is clear, don’t do any of the following.
- Disable interrupts by clearing the SIE bit in sstatus.
- Copy the pc to sepc.
- Save the current mode (user or supervisor) in the SPP bit in sstatus.
- Set scause to reflect the trap’s cause.
- Set the mode to supervisor.
- Copy stvec to the pc.
- Start executing at the new pc.
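The steps above can be sketched as a small C model of one CPU's trap state. This is an illustration, not xv6 or hardware code: struct hart, take_trap, and sret_model are hypothetical names, and the model covers only the steps listed here plus the sret behavior described in the register outline. Real hardware does additional bookkeeping (for example, it remembers the old SIE value) that is left out. The bit positions for SIE (bit 1) and SPP (bit 8) follow the RISC-V privileged specification.

```c
#include <stdint.h>
#include <stdbool.h>

#define SSTATUS_SIE (UINT64_C(1) << 1)  /* supervisor interrupt enable */
#define SSTATUS_SPP (UINT64_C(1) << 8)  /* previous mode: 1 = supervisor, 0 = user */

enum mode { MODE_USER = 0, MODE_SUPERVISOR = 1 };

/* Hypothetical model of the trap-related state of one CPU (hart). */
struct hart {
  uint64_t pc, stvec, sepc, scause, sstatus;
  enum mode mode;
};

/* Model of the hardware trap sequence.
   Returns true if the trap was taken, false if it was deferred. */
bool take_trap(struct hart *h, uint64_t cause, bool is_device_interrupt) {
  /* A device interrupt with SIE clear is deferred: do nothing. */
  if (is_device_interrupt && !(h->sstatus & SSTATUS_SIE))
    return false;

  h->sstatus &= ~SSTATUS_SIE;       /* disable interrupts */
  h->sepc = h->pc;                  /* save the interrupted pc */
  if (h->mode == MODE_SUPERVISOR)   /* record the previous mode in SPP */
    h->sstatus |= SSTATUS_SPP;
  else
    h->sstatus &= ~SSTATUS_SPP;
  h->scause = cause;                /* record why the trap happened */
  h->mode = MODE_SUPERVISOR;        /* switch to supervisor mode */
  h->pc = h->stvec;                 /* continue at the kernel's handler */
  return true;
}

/* Model of the two sret effects named in the text:
   copy sepc to the pc and return to the mode recorded in SPP. */
void sret_model(struct hart *h) {
  h->mode = (h->sstatus & SSTATUS_SPP) ? MODE_SUPERVISOR : MODE_USER;
  h->pc = h->sepc;
}
```

Note what the model does not touch: there is no page table field and no general-purpose register file in struct hart, which mirrors the point that the hardware leaves those to kernel software.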
Note that the CPU doesn’t switch to the kernel page table, doesn’t switch to a stack in the kernel, and doesn’t save any registers other than the pc. Kernel software must perform these tasks. One reason that the CPU does minimal work during a trap is to provide flexibility to software; for example, some operating systems omit a page table switch in some situations to increase trap performance.
It’s worth thinking about whether any of the steps listed above could be omitted, perhaps in search of faster traps. Though there are situations in which a simpler sequence can work, many of the steps would be dangerous to omit in general. For example, suppose that the CPU didn’t switch program counters. Then a trap from user space could switch to supervisor mode while still running user instructions. Those user instructions could break user/kernel isolation, for example by modifying the satp register to point to a page table that allowed accessing all of physical memory. It is thus important that the CPU switch to a kernel-specified instruction address, namely stvec.
4.2 Traps from user space
Xv6 handles traps differently depending on whether the trap occurs while executing in the kernel
or in user code. Here is the story for traps from user code; Section 4.5 describes traps from kernel
code.
A trap may occur while executing in user space if the user program makes a system call (ecall
instruction), or does something illegal, or if a device interrupts. The high-level path of a trap from
user space is uservec (kernel/trampoline.S:22), then usertrap (kernel/trap.c:37); and when
returning, usertrapret (kernel/trap.c:90) and then userret (kernel/trampoline.S:101).
A major constraint on the design of xv6’s trap handling is the fact that the RISC-V hardware
does not switch page tables when it forces a trap. This means that the trap handler address in
stvec must have a valid mapping in the user page table, since that’s the page table in force when
the trap handling code starts executing. Furthermore, xv6’s trap handling code needs to switch to
the kernel page table; in order to be able to continue executing after that switch, the kernel page
table must also have a mapping for the handler pointed to by stvec.
Xv6 satisfies these requirements using a trampoline page. The trampoline page contains uservec,
the xv6 trap handling code that stvec points to. The trampoline page is mapped in every process’s
page table at address TRAMPOLINE, which is at the top of the virtual address space so that it will be
above memory that programs use for themselves. The trampoline page is also mapped at address
TRAMPOLINE in the kernel page table. See Figure 2.3 and Figure 3.3. Because the trampoline
page is mapped in the user page table, traps can start executing there in supervisor mode. Because
the trampoline page is mapped at the same address in the kernel address space, the trap handler
can continue to execute after it switches to the kernel page table.
The code for the uservec trap handler is in trampoline.S (kernel/trampoline.S:22). When
uservec starts, all 32 registers contain values owned by the interrupted user code. These 32
values need to be saved somewhere in memory, so that later on the kernel can restore them before
returning to user space. Storing to memory requires use of a register to hold the address, but at this
point there are no general-purpose registers available! Luckily RISC-V provides a helping hand in
the form of the sscratch register. The csrw instruction at the start of uservec saves a0 in
sscratch. Now uservec has one register (a0) to play with.
uservec’s next task is to save the 32 user registers. The kernel allocates, for each process, a
page of memory for a trapframe structure that (among other things) has space to save the 32
user registers (kernel/proc.h:43). Because satp still refers to the user page table, uservec needs
the trapframe to be mapped in the user address space. Xv6 maps each process’s trapframe at virtual
address TRAPFRAME in that process’s user page table; TRAPFRAME is just below TRAMPOLINE.
The process’s p->trapframe also points to the trapframe, though at its physical address so the
kernel can use it through the kernel page table.
Thus uservec loads address TRAPFRAME into a0 and saves all the user registers there,
including the user’s a0, read back from sscratch.
The trapframe contains the address of the current process’s kernel stack, the current CPU’s
hartid, the address of the usertrap function, and the address of the kernel page table. uservec
retrieves these values, switches satp to the kernel page table, and jumps to usertrap.
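The register-saving sequence can be sketched in C, with the caveat that the real uservec is RISC-V assembly and cannot be written this way. The sketch below is a model under stated assumptions: struct cpu_model, struct trapframe_model, and uservec_model are hypothetical names, the trapframe is reduced to the fields mentioned above, and the address constants are written the way xv6's memlayout.h is understood to define them (TRAMPOLINE as the top page of the address space, TRAPFRAME one page below).

```c
#include <stdint.h>

#define PGSIZE     4096ULL
#define MAXVA      (1ULL << (9 + 9 + 9 + 12 - 1))
#define TRAMPOLINE (MAXVA - PGSIZE)       /* highest page of the address space */
#define TRAPFRAME  (TRAMPOLINE - PGSIZE)  /* the page just below the trampoline */

enum { REG_A0 = 10, NREGS = 32 };  /* a0 is register x10 in the RISC-V ABI */

/* Simplified stand-in for xv6's trapframe (hypothetical layout). */
struct trapframe_model {
  uint64_t kernel_satp;    /* kernel page table */
  uint64_t kernel_sp;      /* top of this process's kernel stack */
  uint64_t kernel_trap;    /* address of usertrap */
  uint64_t kernel_hartid;  /* current CPU's hartid */
  uint64_t user_regs[NREGS];
};

/* Hypothetical model of the CPU state uservec works with. */
struct cpu_model {
  uint64_t x[NREGS];  /* general-purpose registers */
  uint64_t sscratch;
  uint64_t satp;
};

/* tf stands for the memory that the user page table maps at TRAPFRAME. */
void uservec_model(struct cpu_model *c, struct trapframe_model *tf) {
  c->sscratch = c->x[REG_A0];  /* park the user's a0 in sscratch */
  c->x[REG_A0] = TRAPFRAME;    /* a0 is now free to hold the trapframe address */

  for (int i = 0; i < NREGS; i++)   /* save every register except a0 */
    if (i != REG_A0)
      tf->user_regs[i] = c->x[i];
  tf->user_regs[REG_A0] = c->sscratch;  /* user's a0, read back from sscratch */

  c->satp = tf->kernel_satp;   /* switch to the kernel page table */
  /* the real code would now jump to the address in tf->kernel_trap */
}
```

The key point the model makes visible is ordering: a0 must be parked in sscratch before it is overwritten with TRAPFRAME, and the saved copy of a0 has to come from sscratch rather than from the register itself.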
The job of usertrap is to determine the cause of the trap, process it, and return
(kernel/trap.c:37). It first changes stvec so that a trap while in the kernel will be handled by kernelvec
rather than uservec. It saves the sepc register (the saved user program counter), because
usertrap might call yield to switch to another process’s kernel thread, and that process might
return to user space, in the process of which it will modify sepc. If the trap is a system call,
usertrap calls syscall to handle it; if a device interrupt, devintr; otherwise it’s an
exception, and the kernel kills the faulting process. The system call path adds four to the saved user
program counter because RISC-V, in the case of a system call, leaves the saved program counter pointing
to the ecall instruction but user code needs to resume executing at the subsequent instruction.
On the way out, usertrap checks if the process has been killed or should yield the CPU (if this
trap is a timer interrupt).
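The three-way decision in usertrap can be sketched as follows. This is a simplified model, not the xv6 source: usertrap_model, struct proc_model, and the device_pending flag (standing in for devintr recognizing an interrupt) are hypothetical. The one hard fact used is that scause value 8 means an environment call from user mode, per the RISC-V privileged specification.

```c
#include <stdint.h>

#define SCAUSE_ECALL_FROM_U 8  /* environment call (ecall) from user mode */

enum trap_action { DID_SYSCALL, DID_DEVICE_INTERRUPT, KILLED_PROCESS };

/* Hypothetical, minimal view of a process for this sketch. */
struct proc_model {
  uint64_t epc;  /* saved user program counter */
  int killed;
};

/* device_pending stands in for devintr() recognizing a device interrupt. */
enum trap_action usertrap_model(struct proc_model *p, uint64_t scause,
                                uint64_t sepc, int device_pending) {
  p->epc = sepc;  /* save sepc early: a later yield may let it be overwritten */

  if (scause == SCAUSE_ECALL_FROM_U) {
    p->epc += 4;  /* sepc points at the ecall; resume at the next instruction */
    return DID_SYSCALL;
  }
  if (device_pending)
    return DID_DEVICE_INTERRUPT;

  p->killed = 1;  /* anything else from user space is a fatal exception */
  return KILLED_PROCESS;
}
```

The increment is 4 because ecall is a 4-byte instruction; without it, returning to user space would execute the same ecall again.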
The first step in returning to user space is the call to usertrapret (kernel/trap.c:90). This
function sets up the RISC-V control registers to prepare for a future trap from user space: setting
stvec to uservec and preparing the trapframe fields that uservec relies on. usertrapret
sets sepc to the previously saved user program counter. At the end, usertrapret calls userret
on the trampoline page that is mapped in both user and kernel page tables; the reason is that
assembly code in userret will switch page tables.
usertrapret’s call to userret passes a pointer to the process’s user page table in a0
(kernel/trampoline.S:101). userret switches satp to the process’s user page table. Recall that the
user page table maps both the trampoline page and TRAPFRAME, but nothing else from the kernel.
The trampoline page mapping at the same virtual address in user and kernel page tables allows
userret to keep executing after changing satp. From this point on, the only data userret
can use is the register contents and the content of the trapframe. userret loads the TRAPFRAME
address into a0, restores saved user registers from the trapframe via a0, restores the saved user
a0, and executes sret to return to user space.
4.3 Code: Calling system calls
Chapter 2 ended with initcode.S invoking the exec system call (user/initcode.S:11). Let’s look
at how the user call makes its way to the exec system call’s implementation in the kernel.
initcode.S places the arguments for exec in registers a0 and a1, and puts the system call
number in a7. System call numbers match the entries in the syscalls array, a table of function
pointers (kernel/syscall.c:107). The ecall instruction traps into the kernel and causes uservec,
usertrap, and then syscall to execute, as we saw above.
syscall (kernel/syscall.c:132) retrieves the system call number from the saved a7 in the trapframe
and uses it to index into syscalls. For the first system call, a7 contains SYS_exec
(kernel/syscall.h:8), resulting in a call to the system call implementation function sys_exec.
When sys_exec returns, syscall records its return value in p->trapframe->a0. This will
cause the original user-space call to exec() to return that value, since the C calling convention
on RISC-V places return values in a0. System calls conventionally return negative numbers to
indicate errors, and zero or positive numbers for success. If the system call number is invalid,
syscall prints an error and returns −1.
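The dispatch described above is a table lookup through function pointers. A minimal sketch, assuming a two-entry table: the names ending in _demo, struct tf_regs, and syscall_model are hypothetical, and the handlers return fixed values purely for illustration.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical system call numbers for this sketch. */
#define SYS_getpid_demo 1
#define SYS_fail_demo   2

/* Only the two saved registers the dispatcher touches. */
struct tf_regs { uint64_t a0, a7; };

static uint64_t sys_getpid_demo(void) { return 42; }
static uint64_t sys_fail_demo(void)   { return (uint64_t)-1; }  /* error result */

/* Table of function pointers indexed by system call number. */
static uint64_t (*syscalls_demo[])(void) = {
  [SYS_getpid_demo] = sys_getpid_demo,
  [SYS_fail_demo]   = sys_fail_demo,
};

void syscall_model(struct tf_regs *tf) {
  uint64_t num = tf->a7;  /* call number as saved from the user's a7 */
  size_t n = sizeof(syscalls_demo) / sizeof(syscalls_demo[0]);

  if (num > 0 && num < n && syscalls_demo[num] != NULL) {
    /* The return value goes in the saved a0, which becomes the
       user-visible return value once registers are restored. */
    tf->a0 = syscalls_demo[num]();
  } else {
    fprintf(stderr, "unknown sys call %llu\n", (unsigned long long)num);
    tf->a0 = (uint64_t)-1;
  }
}
```

The bounds check matters because a7 is entirely under the user program's control; indexing the table without it would let user code steer the kernel to an arbitrary function pointer.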
4.4 Code: System call arguments
System call implementations in the kernel need to find the arguments passed by user code. Because
user code calls system call wrapper functions, the arguments are initially where the RISC-V C
calling convention places them: in registers. The kernel trap code saves user registers to the current
process’s trap frame, where kernel code can find them. The kernel functions argint, argaddr,
and argfd retrieve the n’th system call argument from the trap frame as an integer, pointer, or a file
descriptor. They all call argraw to retrieve the appropriate saved user register (kernel/syscall.c:34).
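The argument helpers amount to selecting one of the saved argument registers by index. A rough sketch, assuming six argument registers a0 through a5: the _model names and struct tf_args are hypothetical, and the sketch reports an out-of-range index with a return code where a kernel would more likely treat it as a bug and panic.

```c
#include <stdint.h>

/* Saved user argument registers, as they would sit in the trapframe. */
struct tf_args { uint64_t a0, a1, a2, a3, a4, a5; };

/* Fetch the n'th raw argument. Returns 0 on success, -1 if n is out of range. */
int argraw_model(const struct tf_args *tf, int n, uint64_t *out) {
  switch (n) {
  case 0: *out = tf->a0; return 0;
  case 1: *out = tf->a1; return 0;
  case 2: *out = tf->a2; return 0;
  case 3: *out = tf->a3; return 0;
  case 4: *out = tf->a4; return 0;
  case 5: *out = tf->a5; return 0;
  default: return -1;
  }
}

/* Interpret the n'th argument as a 32-bit integer. */
int argint_model(const struct tf_args *tf, int n, int *ip) {
  uint64_t v;
  if (argraw_model(tf, n, &v) < 0)
    return -1;
  *ip = (int)v;
  return 0;
}

/* Interpret the n'th argument as a user virtual address. It is only
   fetched here; validity is checked later, when the address is used. */
int argaddr_model(const struct tf_args *tf, int n, uint64_t *ap) {
  return argraw_model(tf, n, ap);
}
```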
Some system calls pass pointers as arguments, and the kernel must use those pointers to read
or write user memory. The exec system call, for example, passes the kernel an array of pointers
referring to string arguments in user space. These pointers pose two challenges. First, the user
program may be buggy or malicious, and may pass the kernel an invalid pointer or a pointer intended
to trick the kernel into accessing kernel memory instead of user memory. Second, the xv6 kernel
page table mappings are not the same as the user page table mappings, so the kernel cannot use
ordinary instructions to load or store from user-supplied addresses.
The kernel implements functions that safely transfer data to and from user-supplied addresses.
fetchstr is an example (kernel/syscall.c:25). File system calls such as exec use fetchstr to
retrieve string file-name arguments from user space. fetchstr calls copyinstr to do the hard
work.
copyinstr (kernel/vm.c:415) copies up to max bytes to dst from virtual address srcva in
the user page table pagetable. Since pagetable is not the current page table, copyinstr
uses walkaddr (which calls walk) to look up srcva in pagetable, yielding physical address
pa0. The kernel’s page table maps all of physical RAM at virtual addresses that are equal to the
RAM’s physical address. This allows copyinstr to directly copy string bytes from pa0 to dst.
walkaddr (kernel/vm.c:109) checks that the user-supplied virtual address is part of the process’s user address space, so programs cannot trick the kernel into reading other memory. A similar
function, copyout, copies data from the kernel to a user-supplied address.
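The page-by-page copy can be illustrated with a toy model. Everything here is an assumption made for the sketch: the "page table" is just an array of host pointers indexed by virtual page number, walkaddr_model stands in for walkaddr, and there are no permission bits. The loop structure follows the description above: translate one page, copy bytes until the terminating NUL or the end of that page, then move to the next page.

```c
#include <stdint.h>
#include <stddef.h>

#define PGSIZE 4096u
#define NPAGES 4

/* Toy user page table: pages[i] is the memory backing virtual page i,
   or NULL if that page is unmapped. */
struct toy_pagetable { char *pages[NPAGES]; };

/* Stand-in for walkaddr: translate a page-aligned virtual address.
   Returns NULL if the page is unmapped or outside the address space. */
static char *walkaddr_model(const struct toy_pagetable *pt, uint64_t va0) {
  uint64_t idx = va0 / PGSIZE;
  if (idx >= NPAGES)
    return NULL;
  return pt->pages[idx];
}

/* Copy a NUL-terminated string from user address srcva to dst,
   copying at most max bytes. Returns 0 on success, -1 on error. */
int copyinstr_model(const struct toy_pagetable *pt, char *dst,
                    uint64_t srcva, uint64_t max) {
  while (max > 0) {
    uint64_t va0 = srcva - (srcva % PGSIZE);  /* round down to page start */
    char *pa0 = walkaddr_model(pt, va0);
    if (pa0 == NULL)
      return -1;                              /* invalid user pointer */

    uint64_t n = PGSIZE - (srcva - va0);      /* bytes left in this page */
    if (n > max)
      n = max;

    const char *p = pa0 + (srcva - va0);
    while (n > 0) {
      *dst = *p;
      if (*p == '\0')
        return 0;                             /* copied the terminator */
      n--; max--; p++; dst++;
    }
    srcva = va0 + PGSIZE;                     /* continue on the next page */
  }
  return -1;                                  /* no NUL within max bytes */
}
```

Translating each page separately is what makes the copy safe: a string that starts in a mapped page but runs into an unmapped one fails at the page boundary instead of reading whatever happens to follow in physical memory.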
4.5 Traps from kernel space
Xv6 handles traps from kernel code in a different way than traps from user code. When entering
the kernel, usertrap points stvec to the assembly code at kernelvec (kernel/kernelvec.S:12).
Since kernelvec only executes if xv6 was already in the kernel, kernelvec can rely on
satp being set to the kernel page table, and on the stack pointer referring to a valid kernel stack.
kernelvec pushes all 32 registers onto the stack, from which it will later restore them so that
the interrupted kernel code can resume without disturbance.
kernelvec saves the registers on the stack of the interrupted kernel thread, which makes
sense because the register values belong to that thread. This is particularly important if the trap
causes a switch to a different thread – in that case the trap will actually return from the stack of the
new thread, leaving the interrupted thread’s saved registers safely on its stack.
kernelvec jumps to kerneltrap (kernel/trap.c:135) after saving registers. kerneltrap
is prepared for two types of traps: device interrupts and exceptions. It calls devintr
(kernel/trap.c:185) to check for and handle the former. If the trap isn’t a device interrupt, it must be an
exception, and that is always a fatal error if it occurs in the xv6 kernel; the kernel calls panic and
stops executing.
If kerneltrap was called due to a timer interrupt, and a process’s kernel thread is running
(as opposed to a scheduler thread), kerneltrap calls yield to give other threads a chance to
run. At some point one of those threads will yield, and let our thread and its kerneltrap resume
again. Chapter 7 explains what happens in yield.
When kerneltrap’s work is done, it needs to return to whatever code was interrupted
by the trap. Because a yield may have disturbed sepc and the previous mode in sstatus,
kerneltrap saves them when it starts. It now restores those control registers and returns to
kernelvec (kernel/kernelvec.S:38). kernelvec pops the saved registers from the stack and
executes sret, which copies sepc to pc and resumes the interrupted kernel code.
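The save-and-restore pattern around yield can be shown with a small model. The names are hypothetical: sepc_csr and sstatus_csr stand in for the per-CPU control registers, and yield_model simulates other threads running and taking their own traps, which overwrites both registers.

```c
#include <stdint.h>

/* Stand-ins for the CPU's sepc and sstatus control registers. */
static uint64_t sepc_csr, sstatus_csr;

/* While this thread is switched out, other threads take traps of their
   own, so the CSRs hold unrelated values by the time we run again. */
static void yield_model(void) {
  sepc_csr = 0xdeadbeef;
  sstatus_csr = 0;
}

void kerneltrap_model(int timer_interrupt_in_process_thread) {
  /* Keep copies in local variables; locals live on this thread's own
     kernel stack and so survive a switch to another thread. */
  uint64_t sepc = sepc_csr;
  uint64_t sstatus = sstatus_csr;

  if (timer_interrupt_in_process_thread)
    yield_model();

  /* Restore the CSRs so that sret resumes the code this trap interrupted. */
  sepc_csr = sepc;
  sstatus_csr = sstatus;
}
```

Without the restore, sret would jump to whatever address the most recent trap on this CPU happened to leave in sepc.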
It’s worth thinking through how the trap return happens if kerneltrap called yield due to
a timer interrupt.
Xv6 sets a CPU’s stvec to kernelvec when that CPU enters the kernel from user space;
you can see this in usertrap (kernel/trap.c:29). There’s a window of time when the kernel has
started executing but stvec is still set to uservec, and it’s crucial that no device interrupt occur
during that window. Luckily the RISC-V always disables interrupts when