[Beowulf] Network Filesystems performance
joe landman
landman at scalableinformatics.com
Thu Aug 23 15:27:21 PDT 2007
I wonder what the LV impact is here. MD is the fastest I have seen on these units, with LV losing quite a bit of performance (20 percent or so, as I recall).
Regards
Joe
---
joe landman
landman at scalableinformatics.com
+1 734 612 4615
(sent from cell phone ... please excuse brevity and typos)
-----Original Message-----
From: "Glen Dosey" <doseyg at r-networks.net>
To: landman at scalableinformatics.com
Cc: "Jeff Blasius" <jeff.blasius at yale.edu>; "Beowulf" <beowulf at beowulf.org>
Sent: 8/23/2007 6:09 PM
Subject: Re: [Beowulf] Network Filesystems performance
On Thu, 2007-08-23 at 15:53 -0400, Joe Landman wrote:
<snip>
> Since you indicated RHEL4, it's possible that something in the kernel is
> causing problems. RHEL4 is not known to be a speed demon.
All the current testing is on RHEL5 actually, 64-bit. It offered better
performance than RHEL4. Everything in here refers to GigE and not
InfiniBand (since we want to keep that for MPI).
Modified entries in sysctl include:
net.ipv4.tcp_window_scaling = 1
sunrpc.tcp_slot_table_entries = 128
net.core.netdev_max_backlog = 2500
net.core.wmem_max = 83886080
net.core.rmem_max = 83886080
net.core.wmem_default = 6553600
net.core.rmem_default = 6553600
net.ipv4.tcp_rmem = 4096 6553600 83886080
net.ipv4.tcp_wmem = 4096 6553600 83886080
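For readers comparing these buffer sizes with the iperf output further down, here is a minimal Python sketch that converts the byte values to MiB. The helper names (`parse_sysctl`, `to_mib`) are made up for illustration; only the numbers come from the list above.

```python
# Buffer-related settings copied from the sysctl list above (values in bytes).
SETTINGS = """
net.core.wmem_max = 83886080
net.core.rmem_max = 83886080
net.core.wmem_default = 6553600
net.core.rmem_default = 6553600
net.ipv4.tcp_rmem = 4096 6553600 83886080
net.ipv4.tcp_wmem = 4096 6553600 83886080
"""

MIB = 1024 * 1024


def parse_sysctl(text):
    """Return {key: [int, ...]} for each 'key = v1 v2 ...' line."""
    parsed = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        parsed[key.strip()] = [int(v) for v in value.split()]
    return parsed


def to_mib(n_bytes):
    """Convert a byte count to MiB."""
    return n_bytes / MIB


settings = parse_sysctl(SETTINGS)
for key, values in settings.items():
    print(key, [f"{to_mib(v):g} MiB" for v in values])
```

The 6553600-byte default works out to 6.25 MiB, which is the same figure iperf reports as its default TCP window size later in this message.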
> What about the usual suspects
>
> cat /proc/interrupts
[root at storage1 ~]# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:  176708034  176467911  178788831  178782166   IO-APIC-edge   timer
  1:        167        112          0        246   IO-APIC-edge   i8042
  8:          0          0          0          0   IO-APIC-edge   rtc
  9:          0          0          0          0   IO-APIC-level  acpi
 12:        189        185         51         57   IO-APIC-edge   i8042
 50:      69688     963247    1227309     270953   IO-APIC-level  qla2xxx
 58:      15112      96722      96347       7613   IO-APIC-level  qla2xxx
 66:   47398161          0          0          0   IO-APIC-level  eth0
 74:          5          0      21502          0   IO-APIC-level  eth1
217:          0          0          0          0   IO-APIC-level  ohci_hcd:usb1, libata
225:          1          0          0          0   IO-APIC-level  ehci_hcd:usb2
233:      30917     188965     204419      94202   IO-APIC-level  libata
NMI:       2933       2685       2485       1792
LOC:  710654972  710659916  710660939  710658940
ERR:          0
MIS:          0
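One thing the table above makes easy to check is how a device's interrupts are spread across CPUs. The following is a rough Python sketch, not a standard tool: `per_cpu_share` is a hypothetical helper, and the sample text is a trimmed copy of the output above. On a live system the same function could be fed the contents of /proc/interrupts.

```python
# A few rows copied from the /proc/interrupts output above.
SAMPLE = """\
 CPU0 CPU1 CPU2 CPU3
 50: 69688 963247 1227309 270953 IO-APIC-level qla2xxx
 66: 47398161 0 0 0 IO-APIC-level eth0
 74: 5 0 21502 0 IO-APIC-level eth1
 NMI: 2933 2685 2485 1792
 ERR: 0
"""


def per_cpu_share(text, device):
    """Return the fraction of a device's interrupts handled by each CPU.

    Only the first row naming the device is used. Returns None if the
    device does not appear in the text.
    """
    lines = text.strip().splitlines()
    n_cpus = len(lines[0].split())  # header row lists one name per CPU
    for line in lines[1:]:
        fields = line.split()
        # Rows such as NMI:, ERR: and MIS: carry no device name; skip them.
        if len(fields) < n_cpus + 2:
            continue
        if device in fields[n_cpus + 1:]:
            counts = [int(c) for c in fields[1:1 + n_cpus]]
            total = sum(counts)
            return [c / total if total else 0.0 for c in counts]
    return None


print("eth0:", per_cpu_share(SAMPLE, "eth0"))
print("qla2xxx:", per_cpu_share(SAMPLE, "qla2xxx"))
```

With the numbers above, every eth0 interrupt lands on CPU0, while the qla2xxx interrupts are spread over all four CPUs.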
> blockdev --getra /dev/sda
We're using logical volumes, with an 8192 sector read ahead on the lv
and disk.
> ...
> lspci -v
>
> Is your gigabit sharing a 100/133 MB/s old PCI bus with your RAID card?
> On older motherboards, the gigabit NICs were put on an old PCI branch,
> typically 100 MB/s max. If there is a PCI RAID card in the same slot,
> or, as also often happened on these older MB's, the SATA ports were
> hanging off the same old/slow PCI bus, well, it could explain your results.
We're running Altus 1300 systems. There is just a QLA242 in the system
on the PCI-X bus. There is no RAID card; the storage is handled externally
via the FC. Here's the output from lspci -tv:
[root at storage1 rules.d]# lspci -tv
-+-[0000:06]-+-01.0-[0000:07]--
 |           +-01.1  Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC
 |           +-02.0-[0000:08]--+-01.0  QLogic Corp. ISP2312-based 2Gb Fibre Channel to PCI-X HBA
 |           |                 \-01.1  QLogic Corp. ISP2312-based 2Gb Fibre Channel to PCI-X HBA
 |           \-02.1  Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC
 \-[0000:00]-+-00.0  nVidia Corporation CK804 Memory Controller
             +-01.0  nVidia Corporation CK804 ISA Bridge
             +-01.1  nVidia Corporation CK804 SMBus
             +-02.0  nVidia Corporation CK804 USB Controller
             +-02.1  nVidia Corporation CK804 USB Controller
             +-06.0  nVidia Corporation CK804 IDE
             +-07.0  nVidia Corporation CK804 Serial ATA Controller
             +-08.0  nVidia Corporation CK804 Serial ATA Controller
             +-09.0-[0000:01]----07.0  ATI Technologies Inc Rage XL
             +-0b.0-[0000:02]----00.0  Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express
             +-0c.0-[0000:03]--
             +-0d.0-[0000:04]----00.0  Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express
             +-0e.0-[0000:05]--
The systems have 4GB RAM and are dual Opteron 285.
The externally attached Xyratex 5200 storage is connected via 2Gbit
fibre via a Qlogic Switch to a 12 disk array using a hardware raid
controller configured for 10+1 raid 5 with a hot spare and 128K chunks
for a total 1280K stripe. The ext3 filesystem was created with a stride
of 32. The partition table and volume labels were each offset by 128MB
to account for disk alignment with stripe writes. The disks are 500GB
Seagate SATA drives, model ST3500641NS. The array controller has
read-ahead disabled and a 256MB writeback cache enabled. This is the only
system utilizing the array/enclosure/controller. The filesystem is 4.9TB
in size.
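The stripe and stride figures above follow from simple arithmetic, sketched here in Python. The function name `raid5_geometry` is hypothetical, and the 4 KiB filesystem block size is an assumption: it is not stated in this message, but it is what a stride of 32 implies for 128K chunks.

```python
def raid5_geometry(total_disks, hot_spares, chunk_kib, fs_block_kib):
    """Return (data_disks, full_stripe_kib, stride) for one RAID 5 set.

    RAID 5 spends one disk's worth of capacity on parity, so the number
    of data disks is the set size minus hot spares minus one.
    """
    data_disks = total_disks - hot_spares - 1
    full_stripe_kib = data_disks * chunk_kib   # data written per full stripe
    stride = chunk_kib // fs_block_kib         # filesystem blocks per chunk
    return data_disks, full_stripe_kib, stride


# 12 disks, 1 hot spare, 128K chunks, assumed 4K ext3 blocks.
data_disks, stripe_kib, stride = raid5_geometry(12, 1, 128, 4)
print(f"{data_disks} data disks, {stripe_kib}K full stripe, stride {stride}")
```

This reproduces the 10+1 layout, the 1280K stripe, and the stride of 32 quoted above.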
Here's a set of Bonnie++ numbers if it matters (sorry for the formatting,
copied from an HTML file)
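Ext3, size 8G
                            K/sec   %CP
  Seq. output, per chr      57304    90
  Seq. output, block        92685    34
  Seq. output, rewrite      52007    12
  Seq. input, per chr       66123    90
  Seq. input, block        178088    19
                             /sec   %CP
  Random seeks              401.9     0
Files 16:786432:0/16
                             /sec   %CP
  Seq. create, create          47     5
  Seq. create, read           112     4
  Seq. create, delete        1782    19
  Random create, create        49     6
  Random create, read          41     1
  Random create, delete       378     5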
or the ever popular (but totally unrealistic) series of dd tests
Read on NFS server
[root at storage1 ~]# dd if=/mnt/array3/file.dd of=/dev/null bs=4k
510933+0 records in
510932+0 records out
2092777472 bytes (2.1 GB) copied, 12.6766 seconds, 165 MB/s
(disk was unmounted on server to clear cache)
Read from NFS client
[root at wopr1 ~]# dd if=/mnt/array3/file.dd of=/dev/null bs=4k
418341+0 records in
418340+0 records out
1713520640 bytes (1.7 GB) copied, 30.2718 seconds, 56.6 MB/s
Write on NFS client
[root at wopr1 ~]# dd if=/dev/zero of=/mnt/array3/file.dd bs=4k count=256000
256000+0 records in
256000+0 records out
1048576000 bytes (1.0 GB) copied, 10.1124 seconds, 104 MB/s
Now we unmount the NFS share, recreate the file on the server, and remount the share. This clears the client cache but leaves the file cached on the server.
[root at wopr1 ~]# dd if=/mnt/array3/file.dd of=/dev/null bs=4k
524287+0 records in
524287+0 records out
2147479552 bytes (2.1 GB) copied, 18.5161 seconds, 116 MB/s
Since our NFS is over TCP, here are the iperf test results, which basically confirm the above dd results.
[root at wopr1 ~]# ./iperf -c server
------------------------------------------------------------
Client connecting to server, TCP port 5001
TCP window size: 6.25 MByte (default)
------------------------------------------------------------
[ 3] local client port 37325 connected with server port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.10 GBytes 941 Mbits/sec
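The dd and iperf figures above can be cross-checked with a few lines of Python. This is only a sanity-check sketch: the run labels and the helper `mb_per_s` are made up here, and the byte counts, times, and iperf rate are copied from the output above. dd reports decimal megabytes (10^6 bytes), so the same unit is used for the iperf conversion.

```python
# (bytes copied, elapsed seconds) from each dd run above.
DD_RUNS = {
    "server local read": (2092777472, 12.6766),
    "client read, nothing cached": (1713520640, 30.2718),
    "client write": (1048576000, 10.1124),
    "client read, cached on server": (2147479552, 18.5161),
}

IPERF_MBIT_PER_S = 941  # iperf TCP result above


def mb_per_s(n_bytes, seconds):
    """Throughput in decimal megabytes per second, as dd reports it."""
    return n_bytes / seconds / 1e6


for label, (n_bytes, seconds) in DD_RUNS.items():
    print(f"{label}: {mb_per_s(n_bytes, seconds):.1f} MB/s")

# 8 bits per byte: the TCP rate iperf measured, expressed in MB/s.
iperf_mb_per_s = IPERF_MBIT_PER_S / 8
print(f"iperf: {iperf_mb_per_s:.1f} MB/s")
```

The server-cached NFS read comes out within a couple of MB/s of the iperf rate, while the uncached NFS read reaches roughly half of it.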
iftop confirms the basic numbers I've been talking about. Additionally I
have been graphing per port utilization on the Qlogic FC switch and it
confirms the numbers I've been seeing on the disk side of things and
helps determine if the file is in cache or not (or partially).
atop shows basically the same as iostat does, which is that on the initial
read the FC disk is about 85 percent utilized and the network is about
50 percent utilized. No other resource seems to be close to its limit. On
subsequent reads the disk is not touched and the network is 100 percent
utilized.
I have never used dstat before. I will read up on it and see if it
reveals anything interesting.
>
> Which MB do you have? Which BIOS rev, ... Which RAID card, how much
> RAM, 32 or 64 bit, yadda yadda yadda (all the details you didn't give
> before).
>
> Joe
>