[Beowulf] NFS+XFS+SMP on kernel 2.6 (Update)
Suvendra Nath Dutta
sdutta at cfa.harvard.edu
Tue Jun 21 07:10:21 PDT 2005
Update on this:
I upgraded the kernel to 2.6.11 and the machine is a lot less sluggish.
A lot more memory is available now.
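For anyone who hits the same crash: the filesystem check Joe suggests
below is worth running before putting the arrays back under load. A
rough sketch, with placeholder device and mount-point names standing in
for the actual RAID volumes:

    exportfs -ua                # stop serving the NFS exports first
    umount /export/raid1        # placeholder mount point
    xfs_check /dev/sdb1         # placeholder device; read-only consistency check
    xfs_repair -n /dev/sdb1     # optional dry run; drop -n to actually repair
    mount /export/raid1
    exportfs -a                 # put the exports back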
Thanks for a much cheaper (and better) solution than buying a new
machine.
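(On Joe's tg3 vs. bcm5700 note below: a quick way to confirm which
driver is actually bound to the NFS-facing interface, with the
interface name as a placeholder, is something like

    /sbin/ethtool -i eth0       # reports driver name, version, bus info
    /sbin/lsmod | grep tg3      # is the tg3 module loaded at all?

before deciding whether swapping drivers is worth trying.)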
Suvendra.
On Jun 15, 2005, at 5:48 PM, Joe Landman wrote:
> Eeek... nfs crashed ... atop xfs. You are running 2.6.8.1 with SuSE
> 9.1. Try upgrading to 9.3. 2.6.11 seems to have fixed many bugs on
> AMD64.
>
> Also, run xfs_check against that file system device. I had lots of
> problems with SuSE 9.1 crashing in general.
>
> Note also that you are using tg3. I have seen a fair number of
> tg3-initiated oopses on other machines. The bcm5700 driver seemed more
> stable to me.
>
> Joe
>
>
>
> I don't think this is a 4k page issue.
>
> Suvendra Nath Dutta wrote:
>> /var/log/messages
>> Jun 14 16:39:48 sauron kernel: ----------- [cut here ] --------- [please bite here ] ---------
>> Jun 14 16:39:48 sauron kernel: Kernel BUG at debug:106
>> Jun 14 16:39:48 sauron kernel: invalid operand: 0000 [1] SMP
>> Jun 14 16:39:48 sauron kernel: CPU 1
>> Jun 14 16:39:48 sauron kernel: Modules linked in: e1000 tg3 subfs dm_mod
>> Jun 14 16:39:48 sauron kernel: Pid: 10070, comm: nfsd Not tainted 2.6.8.1-suse91-osmp
>> Jun 14 16:39:48 sauron kernel: RIP: 0010:[cmn_err+278/299] <ffffffff802c9456>{cmn_err+278}
>> Jun 14 16:39:48 sauron kernel: RIP: 0010:[<ffffffff802c9456>] <ffffffff802c9456>{cmn_err+278}
>> Jun 14 16:39:48 sauron kernel: RSP: 0018:00000100791d17b8 EFLAGS: 00010246
>> Jun 14 16:39:48 sauron kernel: RAX: 0000000000000050 RBX: 0000000000000000 RCX: ffffffff805b4ae8
>> Jun 14 16:39:48 sauron kernel: RDX: ffffffff805b4ae8 RSI: 0000000000000001 RDI: 000001006e6aab30
>> Jun 14 16:39:48 sauron kernel: RBP: 0000010033f47ac0 R08: 0000000000000001 R09: 0000000000000001
>> Jun 14 16:39:50 sauron kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000010033f47af0
>> Jun 14 16:39:50 sauron kernel: R13: 0000000098ee8d60 R14: 000001007e169000 R15: 000001007cf53a38
>> Jun 14 16:39:50 sauron kernel: FS: 0000002a9588d6e0(0000) GS:ffffffff806f5040(0000) knlGS:0000000062693bb0
>> Jun 14 16:39:50 sauron kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> Jun 14 16:39:51 sauron kernel: CR2: 0000002a9558c000 CR3: 0000000037eca000 CR4: 00000000000006e0
>> Jun 14 16:39:51 sauron kernel: Process nfsd (pid: 10070, threadinfo 00000100791d0000, task 000001006e6aab30)
>> Jun 14 16:39:51 sauron kernel: Stack: 0000000000000001 0000000000000293 0000003000000020 00000100791d18a8
>> Jun 14 16:39:51 sauron kernel:        00000100791d17e8 ffffffff80153b08 0000000000001000 ffffffff8017677a
>> Jun 14 16:39:51 sauron kernel:        0000010078a8d080 0000010033f47ac0
>> Jun 14 16:39:51 sauron kernel: Call Trace:<ffffffff80153b08>{find_get_page+24} <ffffffff8017677a>{__find_get_block_slow+74}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff802c8ef8>{vn_purge+328} <ffffffff80177e98>{unmap_underlying_metadata+8}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff802c7c99>{linvfs_alloc_inode+41} <ffffffff8018e6a6>{iget_locked+230}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff802c91ec>{vn_initialize+124} <ffffffff802a02b6>{xfs_iget+358}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff802c8fe4>{vn_remove+68} <ffffffff802b6b73>{xfs_vget+51}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff802c87d8>{vfs_vget+40} <ffffffff802a9e41>{xlog_write+1057}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff802c77eb>{linvfs_get_dentry+59} <ffffffff802186f0>{find_exported_dentry+64}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff8021bdf0>{nfsd_acceptable+0} <ffffffff8047b011>{sock_alloc_send_pskb+113}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff80491b88>{rt_hash_code+56} <ffffffff80493c10>{__ip_route_output_key+48}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff804819fd>{netif_receive_skb+381} <ffffffffa0013327>{:tg3:tg3_enable_ints+23}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff8049a319>{ip_append_data+809} <ffffffff8048f783>{qdisc_restart+35}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff8022084e>{exp_find_key+126} <ffffffff80218d7b>{export_decode_fh+123}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff8021bc31>{fh_verify+961} <ffffffff80135230>{autoremove_wake_function+0}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff80135230>{autoremove_wake_function+0} <ffffffff8021d6d8>{nfsd_open+56}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff8021da3b>{nfsd_write+107} <ffffffff8036e63f>{scsi_end_request+223}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff8036e84c>{scsi_io_completion+492} <ffffffff8015b99e>{cache_flusharray+110}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff80504bd2>{ip_map_lookup+306} <ffffffff805053a5>{svcauth_unix_accept+597}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff802252d1>{nfsd3_proc_write+241} <ffffffff80218f60>{nfsd_dispatch+256}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff80501123>{svc_process+947} <ffffffff80219220>{nfsd+0}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff80219465>{nfsd+581} <ffffffff801332ee>{schedule_tail+14}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff801102a7>{child_rip+8} <ffffffff80219220>{nfsd+0}
>> Jun 14 16:39:51 sauron kernel:        <ffffffff80219220>{nfsd+0} <ffffffff8011029f>{child_rip+0}
>> Jun 14 16:39:51 sauron kernel:
>> Jun 14 16:39:51 sauron kernel:
>> Jun 14 16:39:51 sauron kernel: Code: 0f 0b cc 63 53 80 ff ff ff ff 6a 00 48 81 c4 e0 00 00 00 5b
>> Jun 14 16:39:51 sauron kernel: RIP <ffffffff802c9456>{cmn_err+278} RSP <00000100791d17b8>
>> On Jun 15, 2005, at 10:57 AM, Paul Nowoczynski wrote:
>>> What kernel bug did you run into? Was it a page_allocation failure?
>>> paul
>>>
>>> Suvendra Nath Dutta wrote:
>>>
>>>> We set up a 160-node cluster with a dual-processor head node with
>>>> 2GB of RAM. The head node also has two RAID devices attached to two
>>>> SCSI cards. These have an XFS filesystem on them and are NFS-exported
>>>> to the cluster. The head node runs very low on memory (7-8 MB free).
>>>> Today I ran into a kernel bug that crashed the system. Google
>>>> suggests that I should upgrade to kernel 2.6.11, but that sounds
>>>> very unpleasant. I am thinking of moving the RAID boxes to a
>>>> separate machine. Will separating the file server from the head node
>>>> give me back stability on the head node?
>>>>
>>>> Suvendra.
>>>>
>>>> _______________________________________________
>>>> Beowulf mailing list, Beowulf at beowulf.org
>>>> To change your subscription (digest mode or unsubscribe) visit
>>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web : http://www.scalableinformatics.com
> phone: +1 734 786 8423
> fax : +1 734 786 8452
> cell : +1 734 612 4615