[Beowulf] Network Filesystems performance

Thu Aug 23 15:27:21 PDT 2007

I wonder what the LV impact is here.  MD is the fastest I have seen on these units, with LV losing quite a bit of performance (20 percent or so, as I recall).

Regards

Joe
---
joe landman
landman at scalableinformatics.com 
+1 734 612 4615
(sent from cell phone ... please excuse brevity and typos)

-----Original Message-----
From: "Glen Dosey" <doseyg at r-networks.net>
To: landman at scalableinformatics.com
Cc: "Jeff Blasius" <jeff.blasius at yale.edu>; "Beowulf" <beowulf at beowulf.org>
Sent: 8/23/2007 6:09 PM
Subject: Re: [Beowulf] Network Filesystems performance

On Thu, 2007-08-23 at 15:53 -0400, Joe Landman wrote:
<snip>
> Since you indicated RHEL4, it's possible that something in the kernel is
> causing problems.  RHEL4 is not known to be a speed demon.

All the current testing is on RHEL5 actually, 64-bit. It offered better
performance than RHEL4. Everything in here refers to GigE and not
InfiniBand (since we want to keep that for MPI).

Modified entries in sysctl include:
net.ipv4.tcp_window_scaling = 1
sunrpc.tcp_slot_table_entries = 128
net.core.netdev_max_backlog = 2500
net.core.wmem_max = 83886080
net.core.rmem_max = 83886080
net.core.wmem_default = 6553600
net.core.rmem_default = 6553600
net.ipv4.tcp_rmem = 4096 6553600 83886080
net.ipv4.tcp_wmem = 4096 6553600 83886080
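
To put those buffer sizes in perspective, here is a minimal sketch that converts them to MiB and compares them with a bandwidth-delay product for GigE. The 200 microsecond round-trip time is purely an assumed figure for illustration; nothing in this thread states the actual RTT. The 6.25 MiB default does match the "TCP window size: 6.25 MByte (default)" that iperf reports later in this message.

```python
# Buffer sizes in bytes, copied from the sysctl list above.
RMEM_DEFAULT = 6553600    # net.core.rmem_default and the middle tcp_rmem value
RMEM_MAX = 83886080       # net.core.rmem_max and the last tcp_rmem value

MIB = 1024 * 1024


def to_mib(nbytes):
    """Convert a byte count to MiB."""
    return nbytes / MIB


def bdp_bytes(link_bits_per_sec, rtt_us):
    """Bandwidth-delay product: bytes in flight needed to keep the link full."""
    return link_bits_per_sec * rtt_us / 8 / 1_000_000


if __name__ == "__main__":
    # ASSUMPTION: 200 microseconds is a made-up LAN round-trip time,
    # not a value measured on these systems.
    bdp = bdp_bytes(1_000_000_000, 200)
    print(f"default buffer: {to_mib(RMEM_DEFAULT):.2f} MiB")
    print(f"max buffer:     {to_mib(RMEM_MAX):.2f} MiB")
    print(f"GigE BDP at 200 us RTT: {bdp:.0f} bytes")
```

Under that assumed RTT, the defaults are far larger than what a single GigE stream would need to stay full, so buffer size by itself is unlikely to be the limit here.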

> What about the usual suspects
> 
> 	cat /proc/interrupts

[root@storage1 ~]# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3       
  0:  176708034  176467911  178788831  178782166    IO-APIC-edge  timer
  1:        167        112          0        246    IO-APIC-edge  i8042
  8:          0          0          0          0    IO-APIC-edge  rtc
  9:          0          0          0          0   IO-APIC-level  acpi
 12:        189        185         51         57    IO-APIC-edge  i8042
 50:      69688     963247    1227309     270953   IO-APIC-level  qla2xxx
 58:      15112      96722      96347       7613   IO-APIC-level  qla2xxx
 66:   47398161          0          0          0   IO-APIC-level  eth0
 74:          5          0      21502          0   IO-APIC-level  eth1
217:          0          0          0          0   IO-APIC-level  ohci_hcd:usb1, libata
225:          1          0          0          0   IO-APIC-level  ehci_hcd:usb2
233:      30917     188965     204419      94202   IO-APIC-level  libata
NMI:       2933       2685       2485       1792 
LOC:  710654972  710659916  710660939  710658940 
ERR:          0
MIS:          0
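
For reading a table like this at a glance, a small parser can show how each device's interrupts are spread across the CPUs. This is only a sketch: the function names are made up for illustration, and the sample embeds just the four device rows of interest from the output above. It shows that every eth0 interrupt was serviced by CPU0, while the qla2xxx interrupts are spread over all four CPUs; whether that matters for throughput is not established here.

```python
# A few rows copied from the /proc/interrupts output above.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
 50:      69688     963247    1227309     270953   IO-APIC-level  qla2xxx
 58:      15112      96722      96347       7613   IO-APIC-level  qla2xxx
 66:   47398161          0          0          0   IO-APIC-level  eth0
 74:          5          0      21502          0   IO-APIC-level  eth1
"""


def parse_interrupts(text):
    """Return {irq: (per_cpu_counts, description)} from /proc/interrupts text."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())          # header row lists one name per CPU
    rows = {}
    for line in lines[1:]:
        irq, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Leading numeric fields are the per-CPU counters.
        while fields and len(counts) < ncpu and fields[0].isdigit():
            counts.append(int(fields.pop(0)))
        rows[irq.strip()] = (counts, " ".join(fields))
    return rows


def cpu_shares(counts):
    """Fraction of an IRQ's total handled by each CPU."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


if __name__ == "__main__":
    for irq, (counts, desc) in parse_interrupts(SAMPLE).items():
        shares = ", ".join(f"{s:.0%}" for s in cpu_shares(counts))
        print(f"IRQ {irq:>3} {desc:<24} {shares}")
```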

> 	blockdev --getra /dev/sda

We're using logical volumes, with an 8192 sector read ahead on the lv
and disk.

> ...
> 	lspci -v
> 
> Is your gigabit sharing a 100/133 MB/s old PCI bus with your RAID card?
> On older motherboards, the gigabit NICs were put on an old PCI branch,
> typically 100 MB/s max.  If there is a PCI RAID card in the same slot,
> or, as also often happened on these older MB's, the SATA ports were
> hanging off the same old/slow PCI bus, well, it could explain your results.

We're running Altus 1300 systems. There is just a QLA242 in the system
on the PCI-X bus. There is no RAID, the storage is handled externally via
the FC.  Here's the output from lspci -tv:

[root@storage1 rules.d]# lspci -tv
-+-[0000:06]-+-01.0-[0000:07]--
 |           +-01.1  Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC
 |           +-02.0-[0000:08]--+-01.0  QLogic Corp. ISP2312-based 2Gb Fibre Channel to PCI-X HBA
 |           |                 \-01.1  QLogic Corp. ISP2312-based 2Gb Fibre Channel to PCI-X HBA
 |           \-02.1  Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC
 \-[0000:00]-+-00.0  nVidia Corporation CK804 Memory Controller
             +-01.0  nVidia Corporation CK804 ISA Bridge
             +-01.1  nVidia Corporation CK804 SMBus
             +-02.0  nVidia Corporation CK804 USB Controller
             +-02.1  nVidia Corporation CK804 USB Controller
             +-06.0  nVidia Corporation CK804 IDE
             +-07.0  nVidia Corporation CK804 Serial ATA Controller
             +-08.0  nVidia Corporation CK804 Serial ATA Controller
             +-09.0-[0000:01]----07.0  ATI Technologies Inc Rage XL
             +-0b.0-[0000:02]----00.0  Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express
             +-0c.0-[0000:03]--
             +-0d.0-[0000:04]----00.0  Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express
             +-0e.0-[0000:05]--

The systems have 4GB RAM and are dual Opteron 285.

The externally attached Xyratex 5200 storage is connected via 2Gbit
fibre via a QLogic switch to a 12-disk array using a hardware RAID
controller configured for 10+1 RAID 5 with a hot spare and 128K chunks,
for a total 1280K stripe. The ext3 filesystem was created with a stride
of 32. The partition table and volume labels were each offset by 128MB
to account for disk alignment with stripe writes. The disks are 500GB
Seagate SATA drives, model ST3500641NS. The array controller has
read-ahead disabled and a 256MB writeback cache enabled. This is the only
system utilizing the array/enclosure/controller. The filesystem is 4.9TB
in size.
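
The stripe and stride figures above follow from simple arithmetic, sketched below. One assumption is needed: a 4 KiB ext3 block size, which the message does not state but which is what a stride of 32 with 128K chunks implies. The read-ahead conversion relies on blockdev reporting read-ahead in 512-byte sectors.

```python
CHUNK_KIB = 128            # RAID chunk size per disk, from the text
DATA_DISKS = 10            # 10+1 RAID 5: ten data disks plus one parity
FS_BLOCK_KIB = 4           # ASSUMPTION: ext3 block size, not stated in the text
READAHEAD_SECTORS = 8192   # read-ahead set on the LV and the disk
SECTOR_BYTES = 512         # blockdev reports read-ahead in 512-byte sectors

# Full stripe: one chunk on each data disk.
stripe_kib = CHUNK_KIB * DATA_DISKS

# ext3 stride: filesystem blocks per RAID chunk.
stride = CHUNK_KIB // FS_BLOCK_KIB

# Read-ahead expressed in KiB and in full stripes.
readahead_kib = READAHEAD_SECTORS * SECTOR_BYTES // 1024
readahead_stripes = readahead_kib / stripe_kib

if __name__ == "__main__":
    print(f"full stripe: {stripe_kib} KiB")
    print(f"ext3 stride: {stride} blocks")
    print(f"read-ahead:  {readahead_kib} KiB ({readahead_stripes:.1f} stripes)")
```

So the 8192-sector read-ahead is 4 MiB, a little over three full stripes.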

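Here's a set of Bonnie++ numbers if it matters (copied from an html
file; columns follow Bonnie++'s standard output order)

Ext3, 8G file size
                       K/sec   %CPU
Seq output, per chr    57304     90
Seq output, block      92685     34
Seq output, rewrite    52007     12
Seq input, per chr     66123     90
Seq input, block      178088     19
Random seeks (/sec)    401.9      0

File tests, 16:786432:0/16
                        /sec   %CPU
Seq create                47      5
Seq read                 112      4
Seq delete              1782     19
Random create             49      6
Random read               41      1
Random delete            378      5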

Or the ever popular (but totally unrealistic) series of dd tests:

Read on NFS server
[root@storage1 ~]# dd if=/mnt/array3/file.dd of=/dev/null bs=4k
510933+0 records in
510932+0 records out
2092777472 bytes (2.1 GB) copied, 12.6766 seconds, 165 MB/s

(disk was unmounted on server to clear cache)

Read from NFS client
[root@wopr1 ~]# dd if=/mnt/array3/file.dd of=/dev/null bs=4k
418341+0 records in
418340+0 records out
1713520640 bytes (1.7 GB) copied, 30.2718 seconds, 56.6 MB/s

Write on NFS client
[root@wopr1 ~]# dd if=/dev/zero of=/mnt/array3/file.dd bs=4k count=256000
256000+0 records in
256000+0 records out
1048576000 bytes (1.0 GB) copied, 10.1124 seconds, 104 MB/s

Now we unmount the NFS share, recreate the file on the server, and remount the share, which clears the client cache but leaves the file cached on the server:

[root@wopr1 ~]# dd if=/mnt/array3/file.dd of=/dev/null bs=4k
524287+0 records in
524287+0 records out
2147479552 bytes (2.1 GB) copied, 18.5161 seconds, 116 MB/s
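
The four dd results can be cross-checked from the byte counts and elapsed times alone. The sketch below recomputes each rate (dd reports decimal megabytes, 10^6 bytes) and confirms that the byte counts equal the "records out" figures times the 4 KiB block size. The run labels are descriptive names chosen here, not anything dd prints.

```python
BLOCK_BYTES = 4096  # bs=4k in every dd invocation above

# (label, records out, bytes copied, seconds, MB/s as reported by dd)
DD_RUNS = [
    ("read on NFS server, cold cache",      510932, 2092777472, 12.6766, 165.0),
    ("read from NFS client, cold caches",   418340, 1713520640, 30.2718, 56.6),
    ("write from NFS client",               256000, 1048576000, 10.1124, 104.0),
    ("read from client, cached on server",  524287, 2147479552, 18.5161, 116.0),
]


def mb_per_s(nbytes, seconds):
    """Throughput in decimal megabytes per second, the unit dd prints."""
    return nbytes / seconds / 1e6


if __name__ == "__main__":
    for label, records, nbytes, seconds, reported in DD_RUNS:
        rate = mb_per_s(nbytes, seconds)
        size_ok = records * BLOCK_BYTES == nbytes
        print(f"{label:<36} {rate:7.1f} MB/s "
              f"(dd said {reported}, size consistent: {size_ok})")
```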

Since our NFS is over TCP, here are the iperf test results, which basically confirm the above dd results.

[root@wopr1 ~]# ./iperf -c server
------------------------------------------------------------
Client connecting to server, TCP port 5001
TCP window size: 6.25 MByte (default)
------------------------------------------------------------
[  3] local client port 37325 connected with server port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.10 GBytes   941 Mbits/sec

iftop confirms the basic numbers I've been talking about. Additionally I
have been graphing per port utilization on the Qlogic FC switch and it
confirms the numbers I've been seeing on the disk side of things and
helps determine if the file is in cache or not (or partially).

atop shows basically the same thing iostat does, which is that on the
initial read the FC disk is about 85 percent utilized and the network is
about 50 percent utilized. No other resource seems to be close to its
limit. On subsequent reads the disk is not touched and the network is
100 percent utilized.
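
Those utilization figures line up with the dd and iperf numbers earlier in this message. As a rough sketch, taking the iperf result as the practical ceiling of the link and assuming iperf's Mbits are decimal (10^6 bits), the cold-cache and server-cached NFS reads work out as follows:

```python
IPERF_MBITS = 941.0        # iperf result above, Mbits/sec
COLD_READ_MB_S = 56.6      # NFS client read, nothing cached
WARM_READ_MB_S = 116.0     # NFS client read, file cached on the server


def mbits_to_mb_per_s(mbits):
    """Convert megabits/sec to megabytes/sec (both decimal units)."""
    return mbits / 8


def utilization(observed_mb_s, ceiling_mb_s):
    """Observed throughput as a fraction of the measured link ceiling."""
    return observed_mb_s / ceiling_mb_s


if __name__ == "__main__":
    ceiling = mbits_to_mb_per_s(IPERF_MBITS)
    print(f"link ceiling from iperf: {ceiling:.1f} MB/s")
    print(f"cold read: {utilization(COLD_READ_MB_S, ceiling):.0%} of the link")
    print(f"warm read: {utilization(WARM_READ_MB_S, ceiling):.0%} of the link")
```

The cold read uses roughly half of what the wire can carry, while the server-cached read nearly saturates it.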

I have never used dstat before. I will read up on it and see if it
reveals anything interesting.

> 
> Which MB do you have?  Which BIOS rev, ...  Which RAID card, how much
> RAM, 32 or 64 bit, yadda yadda yadda (all the details you didn't give
> before).
> 
> Joe
>