[Beowulf] NFS+XFS+SMP on kernel 2.6 (Update)

Suvendra Nath Dutta sdutta at cfa.harvard.edu
Tue Jun 21 07:10:21 PDT 2005


Update on this:

I upgraded the kernel to 2.6.11 and the machine is a lot less sluggish. 
A lot more memory is available now.
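
For what it's worth, a quick way to see how much of "used" memory is 
really just page cache (on a 2.6 box the "-/+ buffers/cache" row of 
free(1) is the number that matters, since heavy NFS/XFS traffic keeps 
the raw "free" column near zero even on a healthy machine):

    free -m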

Thanks for a much cheaper (and better) solution than buying a new 
machine.

Suvendra.

On Jun 15, 2005, at 5:48 PM, Joe Landman wrote:

> Eeek... NFS crashed atop XFS. You are running 2.6.8.1 with SuSE 9.1. 
> Try upgrading to 9.3: 2.6.11 seems to have fixed many bugs on 
> AMD64.
>
> Also, run xfs_check against that file system device.  I had lots of 
> problems with SuSE 9.1 crashing in general.
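>
> For example (mount point and device name below are placeholders; 
> xfs_check wants the filesystem unmounted, or at least mounted 
> read-only):
>
>     umount /raid1
>     xfs_check /dev/sdb1
>
> If it reports damage, xfs_repair on the same unmounted device is the 
> usual follow-up.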
>
> Note also that you are using tg3. I have seen a fair number of 
> tg3-initiated oopses on other machines. The bcm5700 driver seemed 
> more stable to me.
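>
> A quick way to try the swap (module and interface names assumed; 
> bcm5700 is Broadcom's out-of-tree driver and has to be built and 
> installed separately):
>
>     rmmod tg3
>     modprobe bcm5700
>
> and, to make it stick across reboots, repoint the alias in 
> /etc/modprobe.conf:
>
>     alias eth0 bcm5700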
>
> Joe
>
>
>
> I don't think this is a 4k page issue.
>
> Suvendra Nath Dutta wrote:
>> /var/log/messages
>> Jun 14 16:39:48 sauron kernel: ----------- [cut here ] --------- [please bite here ] ---------
>> Jun 14 16:39:48 sauron kernel: Kernel BUG at debug:106
>> Jun 14 16:39:48 sauron kernel: invalid operand: 0000 [1] SMP
>> Jun 14 16:39:48 sauron kernel: CPU 1
>> Jun 14 16:39:48 sauron kernel: Modules linked in: e1000 tg3 subfs dm_mod
>> Jun 14 16:39:48 sauron kernel: Pid: 10070, comm: nfsd Not tainted 2.6.8.1-suse91-osmp
>> Jun 14 16:39:48 sauron kernel: RIP: 0010:[cmn_err+278/299] <ffffffff802c9456>{cmn_err+278}
>> Jun 14 16:39:48 sauron kernel: RIP: 0010:[<ffffffff802c9456>] <ffffffff802c9456>{cmn_err+278}
>> Jun 14 16:39:48 sauron kernel: RSP: 0018:00000100791d17b8  EFLAGS: 00010246
>> Jun 14 16:39:48 sauron kernel: RAX: 0000000000000050 RBX: 0000000000000000 RCX: ffffffff805b4ae8
>> Jun 14 16:39:48 sauron kernel: RDX: ffffffff805b4ae8 RSI: 0000000000000001 RDI: 000001006e6aab30
>> Jun 14 16:39:48 sauron kernel: RBP: 0000010033f47ac0 R08: 0000000000000001 R09: 0000000000000001
>> Jun 14 16:39:50 sauron kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000010033f47af0
>> Jun 14 16:39:50 sauron kernel: R13: 0000000098ee8d60 R14: 000001007e169000 R15: 000001007cf53a38
>> Jun 14 16:39:50 sauron kernel: FS:  0000002a9588d6e0(0000) GS:ffffffff806f5040(0000) knlGS:0000000062693bb0
>> Jun 14 16:39:50 sauron kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> Jun 14 16:39:51 sauron kernel: CR2: 0000002a9558c000 CR3: 0000000037eca000 CR4: 00000000000006e0
>> Jun 14 16:39:51 sauron kernel: Process nfsd (pid: 10070, threadinfo 00000100791d0000, task 000001006e6aab30)
>> Jun 14 16:39:51 sauron kernel: Stack: 0000000000000001 0000000000000293 0000003000000020 00000100791d18a8
>> Jun 14 16:39:51 sauron kernel:        00000100791d17e8 ffffffff80153b08 0000000000001000 ffffffff8017677a
>> Jun 14 16:39:51 sauron kernel:        0000010078a8d080 0000010033f47ac0
>> Jun 14 16:39:51 sauron kernel: Call Trace:<ffffffff80153b08>{find_get_page+24} <ffffffff8017677a>{__find_get_block_slow+74}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff802c8ef8>{vn_purge+328} <ffffffff80177e98>{unmap_underlying_metadata+8}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff802c7c99>{linvfs_alloc_inode+41} <ffffffff8018e6a6>{iget_locked+230}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff802c91ec>{vn_initialize+124} <ffffffff802a02b6>{xfs_iget+358}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff802c8fe4>{vn_remove+68} <ffffffff802b6b73>{xfs_vget+51}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff802c87d8>{vfs_vget+40} <ffffffff802a9e41>{xlog_write+1057}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff802c77eb>{linvfs_get_dentry+59} <ffffffff802186f0>{find_exported_dentry+64}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff8021bdf0>{nfsd_acceptable+0} <ffffffff8047b011>{sock_alloc_send_pskb+113}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff80491b88>{rt_hash_code+56} <ffffffff80493c10>{__ip_route_output_key+48}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff804819fd>{netif_receive_skb+381} <ffffffffa0013327>{:tg3:tg3_enable_ints+23}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff8049a319>{ip_append_data+809} <ffffffff8048f783>{qdisc_restart+35}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff8022084e>{exp_find_key+126} <ffffffff80218d7b>{export_decode_fh+123}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff8021bc31>{fh_verify+961} <ffffffff80135230>{autoremove_wake_function+0}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff80135230>{autoremove_wake_function+0} <ffffffff8021d6d8>{nfsd_open+56}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff8021da3b>{nfsd_write+107} <ffffffff8036e63f>{scsi_end_request+223}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff8036e84c>{scsi_io_completion+492} <ffffffff8015b99e>{cache_flusharray+110}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff80504bd2>{ip_map_lookup+306} <ffffffff805053a5>{svcauth_unix_accept+597}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff802252d1>{nfsd3_proc_write+241} <ffffffff80218f60>{nfsd_dispatch+256}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff80501123>{svc_process+947} <ffffffff80219220>{nfsd+0}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff80219465>{nfsd+581} <ffffffff801332ee>{schedule_tail+14}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff801102a7>{child_rip+8} <ffffffff80219220>{nfsd+0}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff80219220>{nfsd+0} <ffffffff8011029f>{child_rip+0}
>> Jun 14 16:39:51 sauron kernel:
>> Jun 14 16:39:51 sauron kernel:
>> Jun 14 16:39:51 sauron kernel: Code: 0f 0b cc 63 53 80 ff ff ff ff 6a 00 48 81 c4 e0 00 00 00 5b
>> Jun 14 16:39:51 sauron kernel: RIP <ffffffff802c9456>{cmn_err+278} RSP <00000100791d17b8>
>>
>> On Jun 15, 2005, at 10:57 AM, Paul Nowoczynski wrote:
>>> What kernel bug did you run into?  Was it a page_allocation failure?
>>> paul
>>>
>>> Suvendra Nath Dutta wrote:
>>>
>>>> We set up a 160-node cluster with a dual-processor head node with 
>>>> 2 GB RAM. The head node also has two RAID devices attached to two 
>>>> SCSI cards. These have an XFS filesystem on them and are 
>>>> NFS-exported to the cluster. The head node runs very low on memory 
>>>> (7-8 MB free), and today I ran into a kernel bug that crashed the 
>>>> system. Google suggests that I should upgrade to kernel 2.6.11, 
>>>> but that sounds very unpleasant. I am instead thinking of moving 
>>>> the RAID boxes to a separate machine. Will separating the file 
>>>> server from the head node give me back stability on the head node?
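>>>>
>>>> (For the shape of the setup, not the real entries: a typical 
>>>> /etc/exports line for this kind of configuration, with paths and 
>>>> subnet as placeholders, looks like
>>>>
>>>>     /raid1  10.0.0.0/255.255.255.0(rw,sync)
>>>>     /raid2  10.0.0.0/255.255.255.0(rw,sync)
>>>>
>>>> and "exportfs -v" shows what nfsd actually picked up.)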
>>>>
>>>> Suvendra.
>>>>
>
> -- 
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web  : http://www.scalableinformatics.com
> phone: +1 734 786 8423
> fax  : +1 734 786 8452
> cell : +1 734 612 4615