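The Linux page-fault handler must distinguish faults caused by programming errors from faults caused by a reference to a page that belongs to the process address space but has not yet been assigned a physical page frame. On x86 (IA-32) the fault is handled by do_page_fault. The function differs between kernel versions; the version analyzed here is 2.6.32.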
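/*
regs: holds the values of the CPU registers at the time the exception occurred
error_code: pushed onto the stack by the control unit when the exception occurs;
five bits are examined here:
- bit 0 clear: the fault was caused by an access to a page that is not present;
  set: the fault was caused by invalid access permissions;
- bit 1 clear: the fault was caused by a read or execute access;
  set: the fault was caused by a write access;
- bit 2 clear: the processor was in kernel mode when the fault occurred;
  set: the processor was in user mode;
- bit 3 set: use of a reserved bit was detected;
- bit 4 set: the fault occurred during an instruction fetch
*/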
dotraplinkage void __kprobes
do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
struct vm_area_struct *vma;
struct task_struct *tsk;
unsigned long address;
struct mm_struct *mm;
int write;
int fault;
/* Get the process descriptor of the process currently running on this CPU,
then get that process's memory descriptor */
tsk = current;
mm = tsk->mm;
/* Get the faulting address: */
/* On x86 the CR2 register holds the linear address that caused the fault */
address = read_cr2();
/*
* Detect and handle instructions that would cause a page fault for
* both a tracked kernel page and a userspace page.
*/
if (kmemcheck_active(regs))
kmemcheck_hide(regs);
prefetchw(&mm->mmap_sem);
if (unlikely(kmmio_fault(regs, address)))
return;
/*
* We fault-in kernel-space virtual memory on-demand. The
* 'reference' page table is init_mm.pgd.
*
* NOTE! We MUST NOT take any locks for this case. We may
* be in an interrupt or a critical region, and should
* only copy the information from the master page table,
* nothing more.
*
* This verifies that the fault happens in kernel space
* (error_code & 4) == 0, and that the fault was not a
* protection error (error_code & 9) == 0.
*/
/* The faulting address lies in kernel space */
if (unlikely(fault_in_kernel_space(address))) {
/* The flags confirm a kernel-mode access to a not-present page, with no reserved-bit error */
if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {
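/* If this is an access to the non-contiguous (vmalloc) area of kernel space,
copy the entry from the master kernel page table into the page table
of the current process.
If the master kernel entry is itself empty, the kernel has a bug and
vmalloc_fault returns -1.
In other words, the entry for address in init_mm is copied into this
process's page table so that the access stops faulting.
*/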
if (vmalloc_fault(address) >= 0)
return;
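/* kmemcheck only does anything when its config macro is set; it is
not set in this build, so this call can be skipped when reading.
It checks that the CPU was not in vm86 mode and whether the
read/write access was valid */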
if (kmemcheck_fault(regs, address, error_code))
return;
}
/* Can handle a stale RO->RW TLB: */
/* Kernel-space address: check the write and execute permissions in the page-table entry */
if (spurious_fault(error_code, address))
return;
/* kprobes don't want to hook the spurious faults: */
if (notify_page_fault(regs))
return;
/*
* Don't take the mm semaphore here. If we fixup a prefetch
* fault we could otherwise deadlock:
*/
/* If none of the checks above resolved the fault, go straight to the invalid-access handler */
bad_area_nosemaphore(regs, error_code, address);
return;
}
/* kprobes don't want to hook the spurious faults: */
if (unlikely(notify_page_fault(regs)))
return;
/*
* It's safe to allow irq's after cr2 has been saved and the
* vmalloc fault has been handled.
*
* User-mode registers count as a user access even for any
* potential system fault or CPU buglet:
*/
if (user_mode_vm(regs)) {
local_irq_enable();
error_code |= PF_USER;
} else {
if (regs->flags & X86_EFLAGS_IF)
local_irq_enable();
}
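if (unlikely(error_code & PF_RSVD)) /* a reserved bit was used */
/* Dump the CPU registers, the kernel-mode stack and the related
page-table information to the console and to the system message
buffer, then call do_exit() to kill the current process */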
pgtable_bad(regs, error_code, address);
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, 0, regs, address);
/*
* If we're in an interrupt, have no user context or are running
* in an atomic region then we must not take the fault:
*/
/* Running in interrupt context, with no user context (mm is NULL),
or inside an atomic region */
if (unlikely(in_atomic() || !mm)) {
bad_area_nosemaphore(regs, error_code, address);
return;
}
/*
* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in
* the kernel and should generate an OOPS. Unfortunately, in the
* case of an erroneous fault occurring in a code path which already
* holds mmap_sem we will deadlock attempting to validate the fault
* against the address space. Luckily the kernel only validly
* references user space from well defined areas of code, which are
* listed in the exceptions table.
*
* As the vast majority of faults will be valid we will only perform
* the source reference check when there is a possibility of a
* deadlock. Attempt to lock the address space, if we cannot we then
* validate the source. If this is invalid we can skip the address
* space check, thus avoiding the deadlock:
*/
/* At this point the faulting address is known to be in user space */
if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
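/* The lock could not be taken immediately.
If the fault happened in kernel mode, consult the exception table:
any kernel instruction that may legitimately fault on a user address
has its address listed there.
If the instruction address is found in the table, the fault may be
demand paging or an invalid access, so go on and wait for the lock.
If it is not in the table, this is certainly a kernel bug.
*/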
if ((error_code & PF_USER) == 0 &&
!search_exception_tables(regs->ip)) {
bad_area_nosemaphore(regs, error_code, address);
return;
}
down_read(&mm->mmap_sem);
} else {
/*
* The above down_read_trylock() might have succeeded in
* which case we'll have missed the might_sleep() from
* down_read():
*/
might_sleep();
}
/* Look up the vma for address (find_vma returns the first vma whose vm_end is above address) */
vma = find_vma(mm, address);
/* No vma ends above address, so this is certainly an invalid access */
if (unlikely(!vma)) {
bad_area(regs, error_code, address);
return;
}
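/* 1. If vma->vm_start <= address, the address lies inside the vma: jump straight to the valid-access stage
2. If vma->vm_start > address, the fault may still be legitimate: a user push onto the stack */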
if (likely(vma->vm_start <= address))
goto good_area;
/* For a stack push, the vma must be flagged as growing downward */
if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
bad_area(regs, error_code, address);
return;
}
/* The fault occurred in user mode */
if (error_code & PF_USER) {
/*
* Accessing the stack below %sp is always a bug.
* The large cushion allows instructions like enter
* and pusha to work. ("enter $65535, $31" pushes
* 32 pointers and then decrements %sp by 65535.)
*/
/* Check the faulting address against the stack pointer sp */
if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
bad_area(regs, error_code, address);
return;
}
}
/* Grow the stack vma downward so that it covers address */
if (unlikely(expand_stack(vma, address))) {
bad_area(regs, error_code, address);
return;
}
/*
* Ok, we have a good vm_area for this memory access, so
* we can handle it..
*/
good_area:
write = error_code & PF_WRITE;
/* Check the access permissions again, this time against the vma */
if (unlikely(access_error(error_code, write, vma))) {
bad_area_access_error(regs, error_code, address);
return;
}
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault:
*/
/* Handle the fault: allocate a new page frame */
fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
if (unlikely(fault & VM_FAULT_ERROR)) {
mm_fault_error(regs, error_code, address, fault);
return;
}
if (fault & VM_FAULT_MAJOR) {
tsk->maj_flt++;
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, 0,
regs, address);
} else {
tsk->min_flt++;
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, 0,
regs, address);
}
check_v8086_mode(regs, address, tsk);
up_read(&mm->mmap_sem);
}
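To make the error_code bits described in the comment above do_page_fault easier to follow, here is a small user-space sketch that decodes them. The PF_* values mirror the enum in arch/x86/mm/fault.c as I understand it for 2.6.32; describe_error_code and is_vmalloc_candidate are hypothetical helper names made up for this illustration and are not kernel functions.

```c
#include <stdio.h>
#include <string.h>

/* Page-fault error code bits (assumed to match the 2.6.32 x86 enum). */
enum {
    PF_PROT  = 1 << 0,  /* 0: page not present, 1: protection violation */
    PF_WRITE = 1 << 1,  /* 0: read or execute,  1: write */
    PF_USER  = 1 << 2,  /* 0: kernel mode,      1: user mode */
    PF_RSVD  = 1 << 3,  /* 1: reserved bit use detected */
    PF_INSTR = 1 << 4,  /* 1: fault during an instruction fetch */
};

/* Hypothetical helper: render an error code as readable text in buf. */
static int describe_error_code(unsigned long ec, char *buf, size_t len)
{
    return snprintf(buf, len, "%s, %s, %s mode%s%s",
                    (ec & PF_PROT)  ? "protection violation" : "page not present",
                    (ec & PF_WRITE) ? "write" : "read or execute",
                    (ec & PF_USER)  ? "user" : "kernel",
                    (ec & PF_RSVD)  ? ", reserved bit set" : "",
                    (ec & PF_INSTR) ? ", instruction fetch" : "");
}

/* Hypothetical helper: the same test do_page_fault applies before it
 * tries vmalloc_fault() on a kernel-space address. */
static int is_vmalloc_candidate(unsigned long ec)
{
    return !(ec & (PF_RSVD | PF_USER | PF_PROT));
}
```

For example, calling describe_error_code(PF_WRITE | PF_USER, buf, sizeof buf) fills buf with "page not present, write, user mode", the typical code for a user write to a page that has not been faulted in yet.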
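The overall flow is roughly as follows.
Address in kernel space:
1. If the address is in kernel space, the access came from kernel mode, and the address is in the non-contiguous (vmalloc) area, copy the corresponding entry from init_mm into the current process's page table to repair the mapping;
2. For a kernel-space address, check the access permissions in the page table;
3. If steps 1 and 2 do not resolve the fault, go to invalid-access handling (analyzed in detail below);
Address in user space:
4. If a reserved bit was used, print diagnostic information and kill the current process;
5. If running in interrupt context, in an atomic region, or without a user context, go straight to invalid-access handling;
6. If the fault occurred in kernel mode, consult the exception table and act accordingly;
7. Look up the vma for the address; if none is found, go straight to invalid-access handling; if the address lies inside the vma, jump to good_area;
8. If vma->vm_start > address, the stack may be too small, so expand it;
9. At good_area, check the permissions again;
10. Once the permissions check out, allocate the new page frame, page tables, and so on.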
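Steps 7 and 8 can be modeled outside the kernel. The sketch below is only an illustration of the decision logic, not kernel code: struct vma_model, find_vma_model, classify_user_fault and the flag value VM_GROWSDOWN_MODEL are all names invented here, and the region list is a plain sorted array rather than the kernel's real vma data structures.

```c
#include <stddef.h>

#define VM_GROWSDOWN_MODEL 0x0100UL   /* illustrative flag bit for "stack-like" regions */

struct vma_model {
    unsigned long vm_start;   /* first address of the region */
    unsigned long vm_end;     /* first address after the region */
    unsigned long vm_flags;
};

enum verdict { GOOD_AREA, EXPAND_STACK, BAD_AREA };

/* Like find_vma(): the first region whose vm_end is above addr, or NULL.
 * The array must be sorted by address. */
static const struct vma_model *
find_vma_model(const struct vma_model *v, size_t n, unsigned long addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr < v[i].vm_end)
            return &v[i];
    return NULL;
}

/* Decide what the handler would do with a faulting user-space address. */
static enum verdict
classify_user_fault(const struct vma_model *v, size_t n,
                    unsigned long addr, unsigned long sp, int user_mode)
{
    const struct vma_model *vma = find_vma_model(v, n, addr);

    if (!vma)
        return BAD_AREA;                  /* nothing ends above addr */
    if (vma->vm_start <= addr)
        return GOOD_AREA;                 /* addr is inside the region */
    if (!(vma->vm_flags & VM_GROWSDOWN_MODEL))
        return BAD_AREA;                  /* a hole below a non-stack region */
    /* Same cushion as the listing: 64 KiB plus 32 pointers below sp. */
    if (user_mode && addr + 65536 + 32 * sizeof(unsigned long) < sp)
        return BAD_AREA;
    return EXPAND_STACK;                  /* the kernel would call expand_stack() */
}
```

Note that a non-NULL result from the lookup does not mean the address is mapped. It only means some region ends above it, which is why the vm_start comparison follows.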
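Invalid accesses found by the page-fault handler are dealt with by bad_area, which distinguishes the following cases:
1. If the access came from user mode, send SIGSEGV directly;
2. If the access came from kernel mode, there are two cases:
1) the address is a bad system-call argument: run the fixup code (typically this sends SIGSEGV or ends the system call with an error code);
2) otherwise, kill the process and print the kernel oops message;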
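Case 1 can be observed from user space. The following is a minimal sketch, assuming a Linux system with POSIX signals: it catches SIGSEGV and reports the si_code delivered with it. As far as I know, the bad_area path reports SEGV_MAPERR (no mapping at the address) and the bad_area_access_error path reports SEGV_ACCERR (mapping exists, permissions forbid the access), but that detail is not part of the listing above. All function names here (probe_read, map_page, fault_code_unmapped, fault_code_protected) are invented for the illustration.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_jmp;                 /* where the handler resumes */
static volatile sig_atomic_t last_si_code;   /* si_code seen by the handler */

static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    last_si_code = info->si_code;
    siglongjmp(fault_jmp, 1);                /* skip the faulting load */
}

/* Read one byte at addr; return the SIGSEGV si_code, or 0 if no fault. */
static int probe_read(const volatile char *addr)
{
    struct sigaction sa, old;
    volatile int code = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    if (sigsetjmp(fault_jmp, 1) == 0)
        (void)*addr;                         /* may fault */
    else
        code = last_si_code;

    sigaction(SIGSEGV, &old, NULL);          /* restore the previous handler */
    return code;
}

/* Map one anonymous page with the given protection; NULL on failure. */
static char *map_page(int prot)
{
    void *p = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE), prot,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : (char *)p;
}

/* Fault on an address with no mapping at all. */
static int fault_code_unmapped(void)
{
    char *p = map_page(PROT_READ);
    if (!p)
        return -1;
    munmap(p, (size_t)sysconf(_SC_PAGESIZE));
    return probe_read(p);
}

/* Fault on an address that is mapped but not readable. */
static int fault_code_protected(void)
{
    char *p = map_page(PROT_NONE);
    int code;
    if (!p)
        return -1;
    code = probe_read(p);
    munmap(p, (size_t)sysconf(_SC_PAGESIZE));
    return code;
}
```

Comparing the two return values against SEGV_MAPERR and SEGV_ACCERR shows which branch of the handler rejected the access. Reading a normally mapped byte through probe_read returns 0 because no signal is raised.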
static void
__bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
unsigned long address, int si_code)
{
struct task_struct *tsk = current;
/* User mode accesses just cause a SIGSEGV */
/* The fault came from user mode */
if (error_code & PF_USER) {
/*
* It's possible to have interrupts off here:
*/
local_irq_enable();
/*
* Valid to do another page fault here because this one came
* from user space:
*/
if (is_prefetch(regs, error_code, address))
return;
if (is_errata100(regs, address))
return;
if (unlikely(show_unhandled_signals))
show_signal_msg(regs, error_code, address, tsk);
/* Kernel addresses are always protection faults: */
tsk->thread.cr2 = address;
tsk->thread.error_code = error_code | (address >= TASK_SIZE);
tsk->thread.trap_no = 14;
/* Send the SIGSEGV signal */
force_sig_info_fault(SIGSEGV, si_code, address, tsk);
return;
}
if (is_f00f_bug(regs, address))
return;
/* The access came from kernel mode */
no_context(regs, error_code, address);
}
For an access from kernel mode, no_context takes over:
static noinline void
no_context(struct pt_regs *regs, unsigned long error_code,
unsigned long address)
{
struct task_struct *tsk = current;
unsigned long *stackend;
unsigned long flags;
int sig;
/* Are we prepared to handle this kernel fault? */
/* The address was a system-call argument: if the faulting instruction
has an entry in the exception table, resume at its fixup code */
if (fixup_exception(regs))
return;
/*
* 32-bit:
*
* Valid to do another page fault here, because if this fault
* had been triggered by is_prefetch fixup_exception would have
* handled it.
*
* 64-bit:
*
* Hall of shame of CPU/BIOS bugs.
*/
if (is_prefetch(regs, error_code, address))
return;
if (is_errata93(regs, address))
return;
/*
* Oops. The kernel tried to access some bad page. We'll have to
* terminate things with extreme prejudice:
*/
/* The code below prints the oops information and kills the
current process */
flags = oops_begin();
show_fault_oops(regs, error_code, address);
stackend = end_of_stack(tsk);
if (*stackend != STACK_END_MAGIC)
printk(KERN_ALERT "Thread overran stack, or stack corrupted\n");
tsk->thread.cr2 = address;
tsk->thread.trap_no = 14;
tsk->thread.error_code = error_code;
sig = SIGKILL;
if (__die("Oops", regs, error_code))
sig = 0;
/* Executive summary in case the body of the oops scrolled away */
printk(KERN_EMERG "CR2: %016lx\n", address);
oops_end(flags, regs, sig);
}
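Both search_exception_tables in do_page_fault and fixup_exception in no_context rely on the exception table. The sketch below models the idea in user space under simplifying assumptions: the table is a sorted array of (faulting instruction address, fixup address) pairs, and a hit redirects the saved instruction pointer to the fixup. struct extable_entry, struct regs_model, search_extable_model and fixup_exception_model are names invented for this illustration; the real kernel code also searches per-module tables, which is left out here.

```c
#include <stddef.h>

/* One table entry: if the instruction at insn faults, resume at fixup. */
struct extable_entry {
    unsigned long insn;
    unsigned long fixup;
};

/* Stand-in for the saved register state; only the instruction pointer matters here. */
struct regs_model {
    unsigned long ip;
};

/* Binary search over a table sorted by insn; NULL if ip is not listed. */
static const struct extable_entry *
search_extable_model(const struct extable_entry *t, size_t n, unsigned long ip)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (t[mid].insn == ip)
            return &t[mid];
        if (t[mid].insn < ip)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}

/* Returns 1 and redirects regs->ip when a fixup exists, 0 otherwise
 * (the 0 case is where the real handler goes on to print an oops). */
static int fixup_exception_model(const struct extable_entry *t, size_t n,
                                 struct regs_model *regs)
{
    const struct extable_entry *e = search_extable_model(t, n, regs->ip);

    if (!e)
        return 0;
    regs->ip = e->fixup;
    return 1;
}
```

A return value of 0 leaves regs->ip untouched. In the real handler this is the point where the kernel has no safe place to resume and falls through to the oops path.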