[Beowulf] NFS+XFS+SMP on kernel 2.6

Joe Landman landman at scalableinformatics.com
Wed Jun 15 14:48:16 PDT 2005


Eeek... nfsd crashed atop XFS.  You are running 2.6.8.1 with SuSE 
9.1.  Try upgrading to 9.3.  The 2.6.11 kernel seems to have fixed many 
bugs on AMD64.

Also, run xfs_check against that file system device (with the file 
system unmounted, or it will report spurious problems).  I had lots of 
problems with SuSE 9.1 crashing in general.

Note also that you are using tg3.  I have seen a fair number of 
tg3-initiated oopses on other machines.  The bcm5700 driver has seemed 
more stable to me.
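
If you want a quick way to see which modules turn up in an oops like the 
one quoted below, here is a minimal Python sketch.  It assumes the 
2.6-era log format shown in this thread; the function name 
summarize_oops and both regular expressions are my own illustration, not 
part of any kernel tool.  A module frame in the trace does not prove 
that module caused the crash (it may be stale stack data), so treat the 
output as a hint only.

```python
import re

# Matches frames such as <ffffffffa0013327>{:tg3:tg3_enable_ints+23}
# (module-tagged) and <ffffffff80153b08>{find_get_page+24} (core kernel).
FRAME_RE = re.compile(r"<[0-9a-f]{16}>\{(?::(\w+):)?(\w+)\+\d+\}")
MODULES_RE = re.compile(r"Modules linked in:\s*(.*)")


def summarize_oops(text):
    """Return (loaded_modules, module_frames) from an oops dump.

    loaded_modules: names listed on the "Modules linked in:" line.
    module_frames: (module, symbol) pairs for frames that live in a module.
    """
    modules = []
    m = MODULES_RE.search(text)
    if m:
        modules = m.group(1).split()
    # Core-kernel frames have an empty module group, so they are skipped.
    frames = [(mod, sym) for mod, sym in FRAME_RE.findall(text) if mod]
    return modules, frames


if __name__ == "__main__":
    sample = (
        "kernel: Modules linked in: e1000 tg3 subfs dm_mod\n"
        "kernel: <ffffffff804819fd>{netif_receive_skb+381} "
        "<ffffffffa0013327>{:tg3:tg3_enable_ints+23}\n"
    )
    mods, frames = summarize_oops(sample)
    print(mods)
    print(frames)
```

Feed it the relevant chunk of /var/log/messages as one string; the 
mail-quoting line wraps do not matter because each frame is a single 
token.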

Joe



I don't think this is a 4k page issue.

Suvendra Nath Dutta wrote:
> /var/log/messages
> 
> 
> Jun 14 16:39:48 sauron kernel: ----------- [cut here ] --------- [please 
> bite here ] ---------
> Jun 14 16:39:48 sauron kernel: Kernel BUG at debug:106
> Jun 14 16:39:48 sauron kernel: invalid operand: 0000 [1] SMP
> Jun 14 16:39:48 sauron kernel: CPU 1
> Jun 14 16:39:48 sauron kernel: Modules linked in: e1000 tg3 subfs dm_mod
> Jun 14 16:39:48 sauron kernel: Pid: 10070, comm: nfsd Not tainted 
> 2.6.8.1-suse91-osmp
> Jun 14 16:39:48 sauron kernel: RIP: 0010:[cmn_err+278/299] 
> <ffffffff802c9456>{cmn_err+278}
> Jun 14 16:39:48 sauron kernel: RIP: 0010:[<ffffffff802c9456>] 
> <ffffffff802c9456>{cmn_err+278}
> Jun 14 16:39:48 sauron kernel: RSP: 0018:00000100791d17b8  EFLAGS: 00010246
> Jun 14 16:39:48 sauron kernel: RAX: 0000000000000050 RBX: 
> 0000000000000000 RCX: ffffffff805b4ae8
> Jun 14 16:39:48 sauron kernel: RDX: ffffffff805b4ae8 RSI: 
> 0000000000000001 RDI: 000001006e6aab30
> Jun 14 16:39:48 sauron kernel: RBP: 0000010033f47ac0 R08: 
> 0000000000000001 R09: 0000000000000001
> Jun 14 16:39:50 sauron kernel: R10: 0000000000000000 R11: 
> 0000000000000000 R12: 0000010033f47af0
> Jun 14 16:39:50 sauron kernel: R13: 0000000098ee8d60 R14: 
> 000001007e169000 R15: 000001007cf53a38
> Jun 14 16:39:50 sauron kernel: FS:  0000002a9588d6e0(0000) 
> GS:ffffffff806f5040(0000) knlGS:0000000062693bb0
> Jun 14 16:39:50 sauron kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
> 000000008005003b
> Jun 14 16:39:51 sauron kernel: CR2: 0000002a9558c000 CR3: 
> 0000000037eca000 CR4: 00000000000006e0
> Jun 14 16:39:51 sauron kernel: Process nfsd (pid: 10070, threadinfo 
> 00000100791d0000, task 000001006e6aab30)
> Jun 14 16:39:51 sauron kernel: Stack: 0000000000000001 0000000000000293 
> 0000003000000020 00000100791d18a8
> Jun 14 16:39:51 sauron kernel:        00000100791d17e8 ffffffff80153b08 
> 0000000000001000 ffffffff8017677a
> Jun 14 16:39:51 sauron kernel:        0000010078a8d080 0000010033f47ac0
> Jun 14 16:39:51 sauron kernel: Call 
> Trace:<ffffffff80153b08>{find_get_page+24} 
> <ffffffff8017677a>{__find_get_block_slow+74}
> Jun 14 16:39:51 sauron kernel:        <ffffffff802c8ef8>{vn_purge+328} 
> <ffffffff80177e98>{unmap_underlying_metadata+8}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff802c7c99>{linvfs_alloc_inode+41} 
> <ffffffff8018e6a6>{iget_locked+230}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff802c91ec>{vn_initialize+124} <ffffffff802a02b6>{xfs_iget+358}
> Jun 14 16:39:51 sauron kernel:        <ffffffff802c8fe4>{vn_remove+68} 
> <ffffffff802b6b73>{xfs_vget+51}
> Jun 14 16:39:51 sauron kernel:        <ffffffff802c87d8>{vfs_vget+40} 
> <ffffffff802a9e41>{xlog_write+1057}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff802c77eb>{linvfs_get_dentry+59} 
> <ffffffff802186f0>{find_exported_dentry+64}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff8021bdf0>{nfsd_acceptable+0} 
> <ffffffff8047b011>{sock_alloc_send_pskb+113}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff80491b88>{rt_hash_code+56} 
> <ffffffff80493c10>{__ip_route_output_key+48}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff804819fd>{netif_receive_skb+381} 
> <ffffffffa0013327>{:tg3:tg3_enable_ints+23}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff8049a319>{ip_append_data+809} <ffffffff8048f783>{qdisc_restart+35}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff8022084e>{exp_find_key+126} 
> <ffffffff80218d7b>{export_decode_fh+123}
> Jun 14 16:39:51 sauron kernel:        <ffffffff8021bc31>{fh_verify+961} 
> <ffffffff80135230>{autoremove_wake_function+0}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff80135230>{autoremove_wake_function+0} 
> <ffffffff8021d6d8>{nfsd_open+56}
> Jun 14 16:39:51 sauron kernel:        <ffffffff8021da3b>{nfsd_write+107} 
> <ffffffff8036e63f>{scsi_end_request+223}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff8036e84c>{scsi_io_completion+492} 
> <ffffffff8015b99e>{cache_flusharray+110}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff80504bd2>{ip_map_lookup+306} 
> <ffffffff805053a5>{svcauth_unix_accept+597}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff802252d1>{nfsd3_proc_write+241} 
> <ffffffff80218f60>{nfsd_dispatch+256}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff80501123>{svc_process+947} <ffffffff80219220>{nfsd+0}
> Jun 14 16:39:51 sauron kernel:        <ffffffff80219465>{nfsd+581} 
> <ffffffff801332ee>{schedule_tail+14}
> Jun 14 16:39:51 sauron kernel:        <ffffffff801102a7>{child_rip+8} 
> <ffffffff80219220>{nfsd+0}
> Jun 14 16:39:51 sauron kernel:        <ffffffff80219220>{nfsd+0} 
> <ffffffff8011029f>{child_rip+0}
> Jun 14 16:39:51 sauron kernel:
> Jun 14 16:39:51 sauron kernel:
> Jun 14 16:39:51 sauron kernel: Code: 0f 0b cc 63 53 80 ff ff ff ff 6a 00 
> 48 81 c4 e0 00 00 00 5b
> Jun 14 16:39:51 sauron kernel: RIP <ffffffff802c9456>{cmn_err+278} RSP 
> <00000100791d17b8>
> 
> On Jun 15, 2005, at 10:57 AM, Paul Nowoczynski wrote:
> 
>> What kernel bug did you run into?  Was it a page_allocation failure?
>> paul
>>
>> Suvendra Nath Dutta wrote:
>>
>>> We set up a 160 node cluster with a dual processor head node with 2GB 
>>> RAM. The head node also has two RAID devices attached to two SCSI 
>>> cards. These have a XFS filesystem on them and are NFS exported to 
>>> the cluster. The head node runs very low on memory (7-8 MB). And 
>>> today I ran into a kernel bug that crashed the system. Google 
>>> suggests that I should upgrade to kernel 2.6.11, but that sounds very 
>>> unpleasant. I am thinking of putting the raid boxes on a different 
>>> box. Will separating the file-server and the head node give me back 
>>> stability on the head node?
>>>
>>> Suvendra.
>>>
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org
>>> To change your subscription (digest mode or unsubscribe) visit 
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
> 

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615


