Finding the Faulting Code Line and Stack Error Information from an Oops


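Finding the faulting code line from Oops information

Category: Linux Related. 2011-04-18 16:15

(1) Start from the place where the Oops crash occurred, and first find the line of code that performed the bad pointer access.

a) When rebuilding the kernel, under "Kernel hacking" enable "Kernel debugging" and "Compile the kernel with debug info" so that the kernel image contains debugging information.

b) Then locate the "PC is at ..." line in the Oops output:

    # Unable to handle kernel paging request at virtual address 000c0604   // illegal pointer address
    pgd = 40004000
    [000c0604] *pgd=00000000
    Internal error: Oops: 817 [#1]
    Modules linked in:
    CPU: 0    Not tainted  (2.6.27.18 #221)
    PC is at free_block+0x78/0x168             // address of the current instruction
    LR is at release_console_sem+0x19c/0x1b8   // function return address
    # System.map gives the address of free_block as 0x40097ac0; adding 0x78 yields 0x40097B38

c) In the kernel source root directory, run arm-wrs-linux-gnueabi-armv6jel_vfp-uclibc_small-gdb vmlinux to obtain the faulting line:

    [root@kqyang-hikvision linux-2.6.27_svn_quyong]# arm-wrs-linux-gnueabi-armv6jel_vfp-uclibc_small-gdb vmlinux
    GNU gdb (Wind River Linux Sourcery G++ 4.3-85) 6.8.50.20080821-cvs
    Copyright (C) 2008 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "--host=i686-pc-linux-gnu --target=arm-wrs-linux-gnueabi".
    For bug reporting instructions, please see:
    .
    (gdb) l *0x40097B38
    0x40097b38 is in free_block (include/linux/list.h:93).
    88       * the prev/next entries already!
    89       */
    90      #include
    91      static inline void __list_del(struct list_head *prev, struct list_head *next)
    92      {
    93              next->prev = prev;
    94              prev->next = next;
    95      }
    96
    97      /*
    (gdb)

Original article: "linux 内核的 oops 信息" (Linux kernel Oops information). Author: XINU

An Oops can be regarded as a kernel-level (privileged-level) segmentation fault. When an ordinary user-level application performs an illegal memory access (an invalid address, an access without permission, and so on) or executes an illegal instruction, it receives a segmentation fault signal. The usual outcome is a core dump, although the application may also catch the signal and handle it itself. When the kernel itself faults, it prints Oops information instead.

The kernel prints Oops information through the following steps:

1. do_page_fault() (arch/i386/mm/fault.c). If the kernel makes an illegal access, this function prints EIP, PDE and related information, for example:

    Unable to handle kernel paging request at virtual address f899b670
    printing eip:
    c01de48c
    *pde = 00737067

It then calls die("Oops", regs, error_code). If the system is still alive at this point (at least two conditions must hold: 1. the fault happened in process context; 2. panic_on_oops is not set), the kernel kills the current process instead of bringing down the whole system; otherwise the machine hangs.

2. die() (arch/i386/kernel/traps.c). This function begins by printing:

    Oops: 0002 [#1]

Here 0002 is the error code and #1 is the number of Oopses that have occurred so far. The bits of error_code have the following meaning:

    * bit 0: 0 means no page found, 1 means protection fault
    * bit 1: 0 means read, 1 means write
    * bit 2: 0 means kernel, 1 means user-mode
    * bit 3: 0 means data, 1 means instruction

Next, show_registers(regs) is called to print the registers, the current process, the stack, the instruction bytes and other information needed for diagnosis.

When a kernel panic occurs, the Linux kernel prints Oops information, including the current register state, the stack contents and the complete call trace, to help us locate the error. Below is an example that demonstrates a NULL pointer dereference:

    01 #include <linux/init.h>
    02 #include <linux/module.h>
    03
    04 static int __init hello_init(void)
    05 {
    06         int *p = 0;
    07
    08         *p = 1;
    09         return 0;
    10 }
    11
    12 static void __exit hello_exit(void)
    13 {
    14         return;
    15 }
    16
    17 module_init(hello_init);
    18 module_exit(hello_exit);
    19
    20 MODULE_LICENSE("GPL");

From the code above it is easy to see that the faulty statement is on line 08. When we build it as a *.ko module and load it into the kernel with insmod, an Oops is reported.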
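The error code printed on the "Oops:" line can be decoded mechanically. The following is a minimal user-space sketch of the i386 bit layout listed above; the names pf_error, decode_pf_error and describe_pf_error are made up for this illustration and are not kernel APIs, and other architectures (such as the ARM logs in this document) use different encodings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Decoded view of the i386 page-fault error_code (hypothetical helper type). */
struct pf_error {
    bool protection;   /* bit 0: 0 = no page found, 1 = protection fault */
    bool write;        /* bit 1: 0 = read,          1 = write            */
    bool user_mode;    /* bit 2: 0 = kernel,        1 = user-mode        */
    bool instr_fetch;  /* bit 3: 0 = data,          1 = instruction      */
};

/* Split the numeric error code into its four documented flag bits. */
struct pf_error decode_pf_error(unsigned long code)
{
    struct pf_error e;

    e.protection  = (code & 0x1UL) != 0;
    e.write       = (code & 0x2UL) != 0;
    e.user_mode   = (code & 0x4UL) != 0;
    e.instr_fetch = (code & 0x8UL) != 0;
    return e;
}

/* Write a human-readable summary into buf; returns snprintf's result. */
int describe_pf_error(unsigned long code, char *buf, size_t len)
{
    struct pf_error e = decode_pf_error(code);

    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%s, %s, %s, %s",
                    e.protection  ? "protection fault" : "no page found",
                    e.write       ? "write" : "read",
                    e.user_mode   ? "user-mode" : "kernel",
                    e.instr_fetch ? "instruction" : "data");
}
```

For the value 0002 this describes a write, from kernel mode, to a data address for which no page was found, which is what a store through a NULL pointer inside a module's init function would be expected to produce.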

The Oops message is as follows:

    [  100.243737] BUG: unable to handle kernel NULL pointer dereference at (null)
    [  100.244985] IP: hello_init+0x5/0x11 [hello]
    [  100.262266] *pde = 00000000
    [  100.288395] Oops: 0002 [#1] SMP
    [  100.305468] last sysfs file: /sys/devices/virtual/sound/timer/uevent
    [  100.325955] Modules linked in: hello(+) vmblock vsock vmmemctl vmhgfs acpiphp snd_ens1371 gameport snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device ppdev psmouse serio_raw fbcon tileblit font bitblit softcursor snd parport_pc soundcore snd_page_alloc vmci i2c_piix4 vga16fb vgastate intel_agp agpgart shpchp lp parport floppy pcnet32 mii mptspi mptscsih mptbase scsi_transport_spi vmxnet
    [  100.472178]
    [  100.494931] Pid: 1586, comm: insmod Not tainted (2.6.32-21-generic #32-Ubuntu) VMware Virtual Platform
    [  100.540018] EIP: 0060: EFLAGS: 00010246 CPU: 0
    [  100.562844] EIP is at hello_init+0x5/0x11 [hello]
    [  100.584351] EAX: 00000000 EBX: fffffffc ECX: f82cf040 EDX: 00000001
    [  100.609358] ESI: f82cf040 EDI: 00000000 EBP: f1b9ff5c ESP: f1b9ff5c
    [  100.631467]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
    [  100.657664] Process insmod (pid: 1586, ti=f1b9e000 task=f137b340 task.ti=f1b9e000)
    [  100.706083] Stack:
    [  100.731783]  f1b9ff88 c0101131 f82cf040 c076d240 fffffffc f82cf040 0072cff4 f82d2000
    [  100.759324]  fffffffc f82cf040 0072cff4 f1b9ffac c0182340 f19638f8 f137b340 f19638c0
    [  100.811396]  00000004 09cc9018 09cc9018 00020000 f1b9e000 c01033ec 09cc9018 00015324
    [  100.891922] Call Trace:
    [  100.916257]  ? do_one_initcall+0x31/0x190
    [  100.943670]  ? hello_init+0x0/0x11 [hello]
    [  100.970905]  ? sys_init_module+0xb0/0x210
    [  100.995542]  ? syscall_call+0x7/0xb
    [  101.024087] Code: 05 00 00 00 00 01 00 00 00 5d c3 00 00 00 00 00 00 00 00 00 00
    [  101.079592] EIP: hello_init+0x5/0x11 [hello] SS:ESP 0068:f1b9ff5c
    [  101.134682] CR2: 0000000000000000
    [  101.158929] ---[ end trace e294b69a66d752cb ]---

The Oops describes the type of bug and points to its location, namely "IP: hello_init+0x5/0x11 [hello]". At this point we need the objdump tool, which can disassemble the object file, to help analyze the problem. The command is:

    objdump -S hello.o

The disassembly, which interleaves C source and assembly, is as follows:

    01 hello.o:     file format elf32-i386
    02
    03
    04 Disassembly of section .init.text:
    05
    06 00000000:
    07 #include <linux/init.h>
    08 #include <linux/module.h>
    09
    10 static int __init hello_init(void)
    11 {
    12    0:   55                      push   %ebp
    13         int *p = 0;
    14
    15         *p = 1;
    16
    17         return 0;
    18 }
    19    1:   31 c0                   xor    %eax,%eax
    20 #include <linux/init.h>
    21 #include <linux/module.h>
    22
    23 static int __init hello_init(void)
    24 {
    25    3:   89 e5                   mov    %esp,%ebp
    26         int *p = 0;
    27
    28         *p = 1;
    29    5:   c7 05 00 00 00 00 01    movl   $0x1,0x0
    30    c:   00 00 00
    31
    32         return 0;
    33 }
    34    f:   5d                      pop    %ebp
    35   10:   c3                      ret
    36
    37 Disassembly of section .exit.text:
    38
    39 00000000:
    40
    41 static void __exit hello_exit(void)
    42 {
    43    0:   55                      push   %ebp
    44    1:   89 e5                   mov    %esp,%ebp
    45    3:   e8 fc ff ff ff          call   4
    46         return;
    47 }
    48    8:   5d                      pop    %ebp
    49    9:   c3                      ret

Comparing this with the Oops message, we can clearly see that the assembly at the faulting location hello_init+0x5 is:

    29    5:   c7 05 00 00 00 00 01    movl   $0x1,0x0

This instruction stores the value 1 at address 0, which is of course illegal. We can also see that the corresponding source line is:

    28         *p = 1;

With the help of the Oops information, the problem is located quickly. This example did not hang the machine, so the complete error message can be viewed with the dmesg command. Very often, however, the machine does hang and the messages span several screens. In that case we can use the kernel dump tool kdump to save the memory and CPU register contents at the time of the Oops to a file, and then analyze the problem with gdb.

1 What is an Oops

"Oops" is a common colloquial expression among Americans. It conveys mild surprise, or that something unexpected or sudden has happened. An "Oops" is not very serious: in the lyrics of the Britney Spears song "Oops I Did It Again" it is likewise a casual understatement, sometimes with a note of apology.

For the Linux kernel, an Oops means that the kernel has hit an exception. The kernel prints the CPU state at the time of the exception, the address of the faulting instruction, the data address and the other registers, the function call sequence, and even the contents of the stack. It then decides what to do next according to the severity of the exception: kill the process that caused it, or halt the system.
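The faulting instruction address is reported in the form symbol+offset/size, for example free_block+0x78/0x168 or hello_init+0x5/0x11 in the logs above. As a minimal user-space sketch (the names oops_loc, parse_oops_loc and oops_abs_addr are hypothetical, introduced only for this illustration), the field can be parsed and combined with the symbol's base address from System.map to obtain the absolute address that is handed to gdb:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One "symbol+offset/size" location as printed in an Oops (hypothetical type). */
struct oops_loc {
    char symbol[64];        /* function name, e.g. "free_block"        */
    unsigned long offset;   /* byte offset of the faulting instruction */
    unsigned long size;     /* total size of the function in bytes     */
};

/* Parse text such as "free_block+0x78/0x168".
 * Returns 0 on success, -1 if the text does not have the expected shape. */
int parse_oops_loc(const char *text, struct oops_loc *out)
{
    assert(text != NULL && out != NULL);

    /* %63[^+] reads the name up to '+'; %lx accepts an optional 0x prefix. */
    if (sscanf(text, "%63[^+]+%lx/%lx",
               out->symbol, &out->offset, &out->size) != 3)
        return -1;

    /* The offset must lie inside the function. */
    if (out->offset >= out->size)
        return -1;
    return 0;
}

/* Absolute address = symbol base (looked up in System.map) + offset. */
unsigned long oops_abs_addr(unsigned long symbol_base, const struct oops_loc *loc)
{
    assert(loc != NULL);
    return symbol_base + loc->offset;
}
```

With the base 0x40097ac0 quoted from System.map in the ARM example, parsing free_block+0x78/0x168 and adding the offset gives 0x40097b38, the address passed to "l *0x40097B38" in gdb. For a loadable module such as hello.ko the offset alone is usually enough, because it can be matched directly against the offsets in the objdump listing.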

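The most typical exception is a reference to an illegal address in kernel mode, usually through an uninitialized wild pointer or a NULL pointer. This causes a page fault and ultimately triggers an Oops. Linux is robust enough to respond sensibly to many kinds of exceptions. An exception usually kills the current process while the system keeps running, but the system is then in an unstable state and may fail at any time. Exceptions in interrupt context, or corruption of critical system resources, usually hang the kernel so that it no longer responds to any event.

2 Kernel exception levels

2.1 Bug

A Bug is a condition that violates the kernel's normal design but that the kernel can detect and that does not affect system operation, such as sleeping in atomic context. For example:

    BUG: scheduling while atomic: insmod/826/0x00000002
    Call Trace:
    [ef12f700] [c00081e0] show_stack+0x3c/0x194 (unreliable)
    [ef12f730] [c0019b2c] __schedule_bug+0x64/0x78
    [ef12f750] [c0350f50] schedule+0x324/0x34c
    [ef12f7a0] [c03515c0] schedule_timeout+0x68/0xe4
    [ef12f7e0] [c027938c] fsl_elbc_run_command+0x138/0x1c0
    [ef12f820] [c0275820] nand_do_read_ops+0x130/0x3dc
    [ef12f880] [c0275ebc] nand_read+0xac/0xe0
    [ef12f8b0] [c0262d98] part_read+0x5c/0xe4
    [ef12f8c0] [c017bcac] jffs2_flash_read+0x68/0x254
    [ef12f8f0] [c0170550] jffs2_read_dnode+0x60/0x304
    [ef12f940] [c017088c] jffs2_read_inode_range+0x98/0x180
    [ef12f970] [c016e610] jffs2_do_readpage_nolock+0x94/0x1ac
    [ef12f990] [c016ee04] jffs2_write_begin+0x2b0/0x330
    [ef12fa10] [c005144c] generic_file_buffered_write+0x11c/0x8d0
    [ef12fab0] [c0051e48] __generic_file_aio_write_nolock+0x248/0x500
    [ef12fb20] [c0052168] generic_file_aio_write+0x68/0x10c
    [ef12fb50] [c007ca80] do_sync_write+0xc4/0x138
    [ef12fc10] [f107c0dc] oops_log+0xdc/0x1e8 [oopslog]
    [ef12fe70] [f3087058] oops_log_init+0x58/0xa0 [oopslog]
    [ef12fe80] [c00477bc] sys_init_module+0x130/0x17dc
    [ef12ff40] [c00104b0] ret_from_syscall+0x0/0x38
    --- Exception: c01 at 0xff29658
        LR = 0x10031300

2.2 Oops

An Oops occurs when code running in kernel mode enters an exceptional condition, such as a data abort caused by dereferencing an illegal pointer or an instruction fetch fault caused by an array overrun. The exception handling mechanism catches the exception and prints key system information to the serial console. Under normal circumstances the Oops message is also recorded in the system log.

When an Oops occurs, the process is in kernel mode. It may well be accessing critical system resources and holding some locks. When the process exits abnormally because of the Oops, it cannot release the resources it has acquired, so other processes that need those resources hang and normal operation is affected. In this situation the system is usually unstable and may well crash.

2.3 Panic

When an Oops occurs in interrupt context or in process 0 or 1, the system hangs completely, because an interrupt service routine cannot recover after an exception. This situation is called a kernel panic. In addition, when the system's panic flag is set, an Oops leads to a kernel panic whether it occurs in interrupt context or in process context. Because the system no longer schedules after a panic in an interrupt service routine, syslogd no longer runs. In this case the Oops message is only printed to the serial console and is not recorded in the system log.

Kernel panic debugging example:

    [  242.788019] bluesleep_outgoing_data: tx was sleeping
    [  244.012224] *host_wake is 1
    [  245.234647] Disable_key_during_touch=0
    [  245.237802] huqiao_button-code=139,state=1
    [  245.414640] Disable_key_during_touch=0
    [  245.417542] huqiao_button-code=139,state=0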

    [  245.821424] *host_wake is 0
    [  245.823708] bluesleep_hostwake_isr: Iwaking up.
    [  245.823713]
    [  245.830155] bluesleep_hostwake_task: bluesleep_hostwake_task is called
    [  245.838356] Unable to handle kernel NULL pointer dereference at virtual address 00000008
    [  245.845678] pgd = c0004000
    [  245.848188] [00000008] *pgd=00000000
    [  245.851751] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
    [  245.857122] Modules linked in:
    [  245.860080] CPU: 0    Tainted: G        W    (3.4.0-perf-svn874 #1)
    [  245.866444] PC is at sco_connect_cfm+0x380/0x4e8
    [  245.871106] LR is at 0xd880
    [  245.873800] pc : lr : psr: 40000013
    [  245.873805] sp : dbe55e78  ip : 00000000  fp : d7d95c00
    [  245.885246] r10: d8643998  r9 : d8e5b80d  r8 : d8643830
    [  245.890529] r7 : dbe54000  r6 : d9e5b600  r5 : cae27c80  r4 : d8643800
    [  245.896968] r3 : 00000008  r2 : 00000000  r1 : d7d96016  r0 : 00000000
    [  245.903552] Flags: nZcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
    [  245.910772] Control: 10c5787d  Table: 5a47406a  DAC: 00000015
    [  245.916576]
    [  245.916579] PC: 0xc0744640:
    [  245.920751] 4640  e3310000 1afffffa f57ff04f e320f004 e5973004 e2433001 e5873004 e5973000
    [  245.928910] 4660  ea000042 e59f0190 e300332a e19030b3 e3130004 0a000004 e2800fc6 e59f1198

As shown above, a kernel panic prints this kind of register and stack information. From the line "[  245.866444] PC is at sco_connect_cfm+0x380/0x4e8" we know that the problem occurred in the function sco_connect_cfm. In general, LR (the link register) tells us the caller; here the function above is called from hci_proto_connect_cfm.

The message "Unable to handle kernel NULL pointer dereference at virtual address 00000008" tells us that the function referenced an illegal address. On Linux, the highest 1 GB of the virtual address space (from 0xC0000000 to 0xFFFFFFFF) is reserved for the kernel and is called "kernel space". The lower 3 GB (from 0x00000000 to 0xBFFFFFFF) is available to processes and is called "user space".
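The address split described above can be expressed as a one-line check. This is a minimal sketch that assumes the 32-bit 3 GB / 1 GB layout given in the text; the macro KERNEL_SPACE_START and the function names are invented for this illustration, and kernels configured with a different split would use a different boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed boundary: the 3 GB / 1 GB split described in the text. */
#define KERNEL_SPACE_START 0xC0000000u

/* True if a 32-bit virtual address falls in the kernel's 1 GB region. */
bool is_kernel_address(uint32_t vaddr)
{
    return vaddr >= KERNEL_SPACE_START;
}

/* Self-check using two addresses taken from the log above. */
void check_log_addresses(void)
{
    assert(!is_kernel_address(0x00000008u)); /* faulting data address     */
    assert(is_kernel_address(0xc0744640u));  /* PC dump base in the trace */
}
```

The faulting address 00000008 lies far below the boundary, while the code address 0xc0744640 printed in the dump lies above it, which matches the observation that kernel code tried to read from what is nominally a user-space address.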

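Here the kernel has illegally used a user-space address, which is why the fault occurs.

Kernel panics are generally hard to reproduce, so I decided to simulate the phenomenon with my own code in the kernel. The original function is:

    static inline void hci_proto_connect_cfm(struct hci_conn *conn, __u8 status)
    {
            register struct hci_proto *hp;

            hp = hci_proto[HCI_PROTO_L2CAP];
            if (hp && hp->connect_cfm)
                    hp->connect_cfm(conn, status);

            hp = hci_proto[HCI_PROTO_SCO];
            if (hp && hp->connect_cfm)
                    hp->connect_cfm(conn, status);

            if (conn->connect_cfm_cb)
                    conn->connect_cfm_cb(conn, status);
    }

When I changed the function to:

    static inline void hci_proto_connect_cfm(struct hci_conn *conn, __u8 status)
    {
            register struct hci_proto *hp;

            hp = hci_proto[HCI_PROTO_L2CAP];
            if (hp && hp->connect_cfm)
                    hp->connect_cfm(conn, status);

            conn = NULL-21; // Simulation this phenomenon

            hp = hci_proto[HCI_PROTO_SCO];
            if (hp && hp->connect_cfm)
                    hp->connect_cfm(conn, status);

            if (conn->connect_cfm_cb)
                    conn->connect_cfm_cb(conn, status);
    }

the phenomenon was reproduced exactly. From the definition of struct hci_conn we know that the address of hcon->type is 00000008. It follows that in the original code, by the time sco_connect_cfm was called, the conn pointer passed in had been changed to NULL-21; Yet the earlier call hp->connect_cfm(conn, status) ran without any problem, and the conn address passed into it had not changed. This puzzled me: why would the address suddenly become illegal? After searching online, I found that it may be a hardware problem that caused a transient error in one address. With that explanation, the analysis of this bug was concluded.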
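The link between a bad base pointer and a small fault address such as 00000008 can be illustrated with offsetof. The structure below is a hypothetical stand-in, not the real struct hci_conn: its first two 32-bit fields are simply chosen so that type lands at offset 8, matching the address quoted in the text. The real layout depends on the kernel version and configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a connection object; NOT the real struct hci_conn. */
struct demo_conn {
    uint32_t list_next;  /* offset 0: placeholder for a 32-bit list pointer */
    uint32_t list_prev;  /* offset 4: placeholder for a 32-bit list pointer */
    uint8_t  type;       /* offset 8: the member being accessed             */
};

/* Address the CPU would touch for base->member, computed without
 * dereferencing anything: base address plus the member's offset. */
uintptr_t member_fault_address(uintptr_t base, size_t member_offset)
{
    return base + member_offset;
}

/* Self-check: with a NULL (zero) base, the fault address equals the offset. */
void check_null_member_address(void)
{
    assert(offsetof(struct demo_conn, type) == 8);
    assert(member_fault_address(0, offsetof(struct demo_conn, type)) == 0x8u);
}
```

Under these assumptions, reading type through a zero base pointer touches address 8, so the fault address reported in an Oops hints at which member was being accessed when the pointer went bad.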
