
Analyzing Kernel Crash On Red Hat


Analyzing a kernel crash on Red Hat 5.4
This is a "quick start" extracted from the following places:

* http://www.dedoimedo.com/computers/crash-book.html
* The PDF document attached here (which was found online here, but perhaps won't be there forever)
* GRO (Generic Receive Offload): http://lwn.net/Articles/358910/ and http://lwn.net/Articles/358221/
* How to turn GRO off using udev: http://www.centos.org/modules/newbb/viewtopic.php?topic_id=23922&forum=40
* Network driver update notes (from the Red Hat 5.4 technical notes):

Generic Receive Offload (GRO) support has been implemented in this update. The GRO system increases the performance of inbound network connections by reducing the amount of processing done by the Central Processing Unit (CPU). GRO implements the same technique as the Large Receive Offload (LRO) system, but can be applied to a wider range of transport layer protocols. GRO support has also been added to several network device drivers, including the igb driver for Intel Gigabit Ethernet Adapters and the ixgbe driver for Intel 10 Gigabit PCI Express network devices.
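As a quick sketch of how you would inspect and toggle GRO with ethtool (the interface name is illustrative, and the sample output line below is hard-coded here so the parsing can be shown without real hardware; on the server itself you would run the ethtool commands as root):

```shell
# Query all offload settings for an interface (requires the interface to exist):
#   ethtool -k eth0
# A sample (hypothetical) output line for GRO looks like this:
sample_output="generic-receive-offload: on"

# Extract the on/off state from that line:
gro_state=$(echo "$sample_output" | awk -F': ' '/generic-receive-offload/ {print $2}')
echo "GRO is $gro_state"

# To disable GRO at runtime (note: capital -K sets, lowercase -k queries):
#   ethtool -K eth0 gro off
```

A runtime `ethtool -K` change does not survive a reboot, which is why the udev rule approach described later in this note is used for persistence.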

The case that prompted us to go through a kernel analysis had the following characteristics:

1. We made some changes to sysctl.conf on the server in question to allow for better performance (or simply to make sure HS5 would work). The changes were the following:

   net.ipv4.tcp_rmem = 4096 131072 262140
   net.ipv4.tcp_wmem = 4096 131072 262140
   net.ipv4.tcp_sack = 0
   net.ipv4.tcp_timestamps = 0
   net.ipv4.tcp_window_scaling = 0
   net.ipv4.tcp_keepalive_time = 60000
   net.ipv4.tcp_keepalive_intvl = 15000
   net.ipv4.tcp_fin_timeout = 30

2. After those changes, the server began to crash every so often (every two or three days?).
3. The client enabled the kernel debugging (crash dump) option.
4. As a result, we got hold of a vmcore file after one of the panic crashes.

OK, now to the troubleshooting itself. Before you can invoke crash on a vmcore, you need to install the associated kernel debuginfo packages (kernel-debuginfo-version.arch.rpm and kernel-debuginfo-common-version.arch.rpm); the vmlinux kernel debug information is stored in a separate debuginfo file. To confirm which architecture and variant/flavor of the kernel you have, simply check the vmcore file itself (this will show you the kernel architecture):

# file ./vmcore

./vmcore: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'vmlinux'

# strings vmcore | fgrep -m1 'Linux ' <-- will show you the kernel variant

Linux version 2.6.9-22.EL (bhcompile@porky.build.redhat.com) (gcc version 3.4.4

20050721 (Red Hat 3.4.4-2)) #1 Mon Sep 19 18:20:28 EDT 2005

Install the debuginfo packages and make sure you see a vmlinux file under /usr/lib/debug/lib/modules/yourversion.ARCH/:

# rpm -ivh kernel-debuginfo-2.6.9-22.EL.i686.rpm

Preparing...                ########################################### [100%]
   1:kernel-debuginfo       ########################################### [100%]

# rpm -ivh kernel-debuginfo-common-2.6.9-22.EL.i686.rpm

Preparing...                ########################################### [100%]
   1:kernel-debuginfo-common########################################### [100%]

# ls /usr/lib/debug/lib/modules/ -l

total 24

drwxr-xr-x 3 root root 4096 May  2 10:41 2.6.9-22.EL
drwxr-xr-x 3 root root 4096 May  2 10:41 2.6.9-22.ELhugemem
drwxr-xr-x 3 root root 4096 May  2 10:41 2.6.9-22.ELsmp

# ls /usr/lib/debug/lib/modules/2.6.9-22.EL -l

total 32848
drwxr-xr-x 9 root root     4096 May  1 19:50 kernel
-rwxr-xr-x 1 root root 33583473 Sep 20  2005 vmlinux

Take note that:

* You should use -ivh rather than -Uvh when installing the kernel package. This preserves the older version of the installed kernel so that you can revert to a known working version should you encounter any problems with the new one.
* The kernel-debuginfo package for an older kernel can safely remain installed when installing a newer version.
* The kernel-debuginfo must match the kernel version, variant, and architecture that created the vmcore. See the "file ./vmcore" and "strings vmcore | fgrep -m1 'Linux '" commands in the output above.

Now run crash. Two arguments are required:

* The vmlinux file associated with the running kernel, typically found under the /usr/lib/debug/lib/modules/ directory.
* The kernel crash dump file, vmcore.

For example:

# crash /usr/lib/debug/lib/modules/yourversion.ARCH/vmlinux /var/crash/127.0.0.1-2007-04-30-21\:38/vmcore

Once in the crash program, check the PANIC and COMMAND lines to see if they display the command that caused the crash:

It may or may not be possible to assess what happened just by looking at those two lines, so proceed with the next steps. Issue the "bt" (backtrace) command to list what was going on just before the crash:

The bt output is sorted in descending order, from the last function that was running (at the moment of the crash) down through its callers (thus #0 is the frame at the moment of the crash, #1 is the function that called it, and so on). Pay attention to two things: the function that triggered the panic and the CS register value. Here is the output of bt in our real-life case:

crash> bt
PID: 0      TASK: ffff81043fcae7a0  CPU: 11  COMMAND: "swapper"
 #0 [ffff81023fcb3b90] crash_kexec at ffffffff800ac5b9
 #1 [ffff81023fcb3c50] __die at ffffffff80065127
 #2 [ffff81023fcb3c90] die at ffffffff8006bc51
 #3 [ffff81023fcb3cc0] do_general_protection at ffffffff8006556f
 #4 [ffff81023fcb3d00] error_exit at ffffffff8005dde9
    [exception RIP: skb_gro_reset_offset+53]
    RIP: ffffffff8022714a  RSP: ffff81023fcb3db8  RFLAGS: 00010246
    RAX: ffff8101e9541010  RBX: ffff81043d4e16c0  RCX: 0000000000000008
    RDX: 2030202020202020  RSI: ffff810196642d18  RDI: ffff810196642cc0
    RBP: ffffc20010087300   R8: 0000000000000000   R9: ffff8101e9541880
    R10: 0000000000000013  R11: 00000000ffff8104  R12: ffff810196642cc0
    R13: ffff81043d4e1500  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffff81023fcb3db8] napi_gro_receive at ffffffff80228638
 #6 [ffff81023fcb3dd8] netxen_process_rcv at ffffffff881a7a23
 #7 [ffff81023fcb3e38] netxen_process_rcv_ring at ffffffff881a7b2b
 #8 [ffff81023fcb3eb8] netxen_nic_poll at ffffffff881a5966
 #9 [ffff81023fcb3ef8] net_rx_action at ffffffff8000c845
#10 [ffff81023fcb3f38] __do_softirq at ffffffff8001235a
#11 [ffff81023fcb3f68] call_softirq at ffffffff8005e2fc
#12 [ffff81023fcb3f80] do_softirq at ffffffff8006cb14
#13 [ffff81023fcb3f90] do_IRQ at ffffffff8006c99c
--- <IRQ stack> ---
#14 [ffff81023fcaddf8] ret_from_intr at ffffffff8005d615
    [exception RIP: acpi_safe_halt+37]
    RIP: ffffffff801973ac  RSP: ffff81023fcadea0  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: ffff81043fcb28a0  RCX: 0000000000000004
    RDX: 0000000000000000  RSI: 0000000000000046  RDI: ffffffff804b5e2c
    RBP: 0000000000000000   R8: ffff81023fcac000   R9: ffffffff804b5e2c
    R10: 0000000000000046  R11: 0000000000000046  R12: 0000000000000000
    R13: 0000000000000001  R14: 0000000000000003  R15: 0000000100000000
    ORIG_RAX: ffffffffffffff35  CS: 0010  SS: 0018
#15 [ffff81023fcadea0] acpi_processor_idle at ffffffff801975a2
#16 [ffff81023fcadef0] cpu_idle at ffffffff8004939e

Analysis of the output:

The kernel function causing the panic is skb_gro_reset_offset (SKB = socket buffer, GRO = Generic Receive Offload). The CS register at panic time is 0010, which means the crash occurred in kernel mode (as opposed to user mode). What can be deduced from this is: it's not end-user software causing the error, but rather hardware or a system function.

This bug talks about a panic crash caused by a NIC with GRO enabled, so we needed to check what kind of NIC bonding and/or features were enabled on the machine in question, and whether we could disable GRO (if it was indeed enabled). To see if GRO is enabled on a network interface:

# ethtool -k <ifname>

e.g.:

# ethtool -k eth0

Some network drivers may be buggy when it comes to GRO. Another consideration was to look into the NIC of the server and see if there were any firmware patches available from its manufacturer.
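As a small illustration of the CS check above (the arithmetic here is standard x86 selector decoding, not anything specific to this crash): the low two bits of a segment selector hold the privilege level, so a CS of 0010 decodes to ring 0.

```shell
# Decode the privilege level (ring) from the CS selector seen in the bt output.
# The low two bits of a segment selector are the privilege level:
# ring 0 = kernel mode, ring 3 = user mode.
cs=0x0010        # CS value at panic time, taken from the crash bt output
ring=$(( cs & 3 ))
echo "CS=$(printf '%04x' $cs) -> ring $ring"
```

A typical user-mode x86_64 Linux CS such as 0033 would decode to ring 3 by the same arithmetic, which is how you can tell at a glance whether a crash happened in the kernel or in a user process.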

If further analysis is needed, issue the log command (within crash) to get more information that may shed some light on the issue. From crash> log:

...
mtrr: type mismatch for e8000000,4000000 old: uncachable new: write-combining
cciss 0000:0e:00.0: a power on or device reset detected
cciss 0000:0e:00.0: unknown unit attention detected
cciss 0000:0e:00.0: unknown unit attention detected
cciss 0000:0e:00.0: unknown unit attention detected
cciss 0000:0e:00.0: unknown unit attention detected
cciss 0000:0e:00.0: unknown unit attention detected
general protection fault: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
CPU 11
Modules linked in: st autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac ipv6 xfrm_nalgo crypto_api parport_pc lp parport joydev sr_mod cdrom sg pcspkr shpchp hpilo serio_raw netxen_nic dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata cciss(U) sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Tainted: G      2.6.18-164.el5 #1
RIP: 0010:[<ffffffff8022714a>]  [<ffffffff8022714a>] skb_gro_reset_offset+0x35/0xa2
RSP: 0018:ffff81023fcb3db8  EFLAGS: 00010246
RAX: ffff8101e9541010 RBX: ffff81043d4e16c0 RCX: 0000000000000008
RDX: 2030202020202020 RSI: ffff810196642d18 RDI: ffff810196642cc0
RBP: ffffc20010087300 R08: 0000000000000000 R09: ffff8101e9541880
R10: 0000000000000013 R11: 00000000ffff8104 R12: ffff810196642cc0
R13: ffff81043d4e1500 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff810107eb8440(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b676431524c CR3: 0000000000201000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff81023fcac000, task ffff81043fcae7a0)
Stack:  ffffffff80228638 a920002100000000 ffff810196642cc0 000000000000a920
 ffffffff881a7a23 0000000000000000 ffff810037c1fa00 0000000000000000
 ffff81043d4e1000 ffff81043d4e1a08 ffff810037c1fa00 ffff810037c10000
Call Trace:
 <IRQ>  [<ffffffff80228638>] napi_gro_receive+0x15/0x2f
 [<ffffffff881a7a23>] :netxen_nic:netxen_process_rcv+0x2de/0x311
 [<ffffffff881a7b2b>] :netxen_nic:netxen_process_rcv_ring+0xd5/0x2fe
 [<ffffffff881a5966>] :netxen_nic:netxen_nic_poll+0x4a/0xf2
 [<ffffffff8000c845>] net_rx_action+0xac/0x1e0
 [<ffffffff8001235a>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cb14>] do_softirq+0x2c/0x85
 [<ffffffff8006c99c>] do_IRQ+0xec/0xf5
 [<ffffffff8019741b>] acpi_processor_idle+0x0/0x463
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff801973ac>] acpi_safe_halt+0x25/0x36
 [<ffffffff801975a2>] acpi_processor_idle+0x187/0x463
 [<ffffffff8019741b>] acpi_processor_idle+0x0/0x463
 [<ffffffff8019741b>] acpi_processor_idle+0x0/0x463
 [<ffffffff8004939e>] cpu_idle+0x95/0xb8
 [<ffffffff80076e23>] start_secondary+0x45a/0x469
Code: 48 8b 0a 48 c1 e9 33 48 89 c8 48 c1 e8 09 48 8b 04 c5 80 5b
RIP  [<ffffffff8022714a>] skb_gro_reset_offset+0x35/0xa2
 RSP <ffff81023fcb3db8>

So, the conclusion was that this panic was likely because the TCP changes made were creating a problem for the receive buffer and GRO. We opted to turn GRO off on the server in question through the use of a udev rule. Create a file under /etc/udev/rules.d called 50-ethtool.rules with the following content:

ACTION=="add", SUBSYSTEM=="net", KERNEL=="eth0", RUN+="/sbin/ethtool -K eth0 gro off"

The command (udev rule) will then be executed upon boot (or when running /sbin/start_udev) and will set the GRO option on eth0 to off.
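To sanity-check the rule after creating it, this sketch writes it out and greps for the RUN key (the rule is written to a temporary file here for illustration; on the real server the path is /etc/udev/rules.d/50-ethtool.rules, and the final verification command has to be run on the box itself after udev has fired):

```shell
# Write the udev rule (temp file here; real path is /etc/udev/rules.d/50-ethtool.rules).
rulefile=$(mktemp)
cat > "$rulefile" <<'EOF'
ACTION=="add", SUBSYSTEM=="net", KERNEL=="eth0", RUN+="/sbin/ethtool -K eth0 gro off"
EOF

# Confirm the RUN key is present and targets the right interface.
grep -q 'RUN+="/sbin/ethtool -K eth0 gro off"' "$rulefile" && echo "rule OK"

# After the next boot (or /sbin/start_udev), verify on the real machine with:
#   ethtool -k eth0 | grep generic-receive-offload   # expect: ... off
```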
