[Beowulf] Network Filesystems performance

Glen Dosey doseyg at r-networks.net
Thu Aug 23 17:33:13 PDT 2007


At this point I think it is negligible. Initially we were seeing about a
15% decrease in performance between the block device and the logical
volume. Note that in this case we just have a PV with a VG and an LV, no
striping or anything else. We don't really need LVM, but we were using it
with GFS to move data around on different-sized arrays on the fly.
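For reference, the LVM stack here is the trivial one, i.e. roughly the
following (the device and volume names are illustrative, not the actual ones):

pvcreate /dev/sdX                      # the whole FC block device as one PV
vgcreate vg_array /dev/sdX             # a single VG on that PV
lvcreate -L 4.9T -n lv_array vg_array  # a single LV, no striping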

Once I aligned the PV data area with the underlying RAID stripes on the
array, the performance hit went away. Basically I just defined the PV
metadata size to be 128MB and verified the alignment by dd'ing the device
to a file, od'ing it, and looking at the offsets.
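Something along these lines (the device name is illustrative; LVM rounds
the metadata area up, so it is worth confirming where the data area
actually starts):

pvcreate --metadatasize 128M /dev/sdX   # push the PV data area out to a 128MB boundary
pvs -o +pe_start /dev/sdX               # confirm the data area start is stripe-aligned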

Currently I can sequentially read data off the LV at the FC wire speed of
2Gb/s (about 200MB/s), the same as from the underlying block device.
Putting ext3, GFS or XFS on the LV and reading sequentially from a file
brings that down to about 160-180MB/s, depending on the filesystem.
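In other words, the comparison is basically these two reads (the LV path
here is illustrative; the actual numbers are in the dd tests further down):

dd if=/dev/vg_array/lv_array of=/dev/null bs=1M count=4096   # raw sequential read from the LV
dd if=/mnt/array3/file.dd of=/dev/null bs=1M                 # same data read through the filesystem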

Keeping in mind that the external array does all the RAID, the biggest
performance benefit I've seen was disabling read-ahead on the array
controller and letting the OS make all the read-ahead decisions. That was
probably worth 30% and put us up against the bandwidth limit of the fibre
when reading without a filesystem.
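The OS-side read-ahead was then turned up with blockdev; something like
this, with the LV path illustrative (8192 sectors is 4MB):

blockdev --setra 8192 /dev/vg_array/lv_array   # set read-ahead to 8192 sectors (4MB)
blockdev --getra /dev/vg_array/lv_array        # verify the current setting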

What really gets me is that while my NFS reads are around 50MB/s, the
writes are basically at wire speed, slowing down to and holding at about
90MB/s once we exceed the 4GB file size. That would seem to indicate the
server has no problem dealing with a saturated NIC and reasonably high
I/O on the QLA2342 at the same time. And clearly the server can read from
the disk faster than 50MB/s. So why can't it read from the disk faster
than 50MB/s when it's NFS that's doing the requests?
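For what it's worth, the NFS mount options the client actually negotiated
(rsize/wsize, protocol, etc.) can be double-checked with something like:

grep nfs /proc/mounts   # shows the rsize=, wsize=, proto= values actually in use
nfsstat -m              # per-mount NFS flags and statistics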





On Thu, 2007-08-23 at 18:27 -0400, joe landman wrote:
> I wonder what the LV impact is here.  MD is the fastest I have seen on these units, with LV losing quite a bit of performance (20 percent or so, as I recall).
> 
> Regards
> 
> Joe
> ---
> joe landman
> landman at scalableinformatics.com 
> +1 734 612 4615
> (sent from cell phone ... please excuse brevity and typos)
> 
> -----Original Message-----
> From: "Glen Dosey" <doseyg at r-networks.net>
> To: landman at scalableinformatics.com
> Cc: "Jeff Blasius" <jeff.blasius at yale.edu>; "Beowulf" <beowulf at beowulf.org>
> Sent: 8/23/2007 6:09 PM
> Subject: Re: [Beowulf] Network Filesystems performance
> 
> On Thu, 2007-08-23 at 15:53 -0400, Joe Landman wrote:
> <snip>
> > Since you indicated RHEL4, it's possible that something in the kernel is
> > causing problems.  RHEL4 is not known to be a speed demon.
> 
> All the current testing is actually on RHEL5, 64-bit. It offered better
> performance than RHEL4. Everything in here refers to GigE and not
> InfiniBand (since we want to keep that for MPI).
> 
> modified entries in sysctl include:
> net.ipv4.tcp_window_scaling = 1
> sunrpc.tcp_slot_table_entries = 128
> net.core.netdev_max_backlog = 2500
> net.core.wmem_max = 83886080
> net.core.rmem_max = 83886080
> net.core.wmem_default = 6553600
> net.core.rmem_default = 6553600
> net.ipv4.tcp_rmem = 4096 6553600 83886080
> net.ipv4.tcp_wmem = 4096 6553600 83886080
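> These were applied with sysctl, i.e. something along the lines of (the
> value shown is just one of the entries above):
> 
> sysctl -w sunrpc.tcp_slot_table_entries=128   # apply a single setting immediately
> sysctl -p                                     # or reload everything from /etc/sysctl.conf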
> 
> > What about the usual suspects
> > 
> > 	cat /proc/interrupts
> 
> [root at storage1 ~]# cat /proc/interrupts 
>            CPU0       CPU1       CPU2       CPU3       
>   0:  176708034  176467911  178788831  178782166    IO-APIC-edge  timer
>   1:        167        112          0        246    IO-APIC-edge  i8042
>   8:          0          0          0          0    IO-APIC-edge  rtc
>   9:          0          0          0          0   IO-APIC-level  acpi
>  12:        189        185         51         57    IO-APIC-edge  i8042
>  50:      69688     963247    1227309     270953   IO-APIC-level  qla2xxx
>  58:      15112      96722      96347       7613   IO-APIC-level  qla2xxx
>  66:   47398161          0          0          0   IO-APIC-level  eth0
>  74:          5          0      21502          0   IO-APIC-level  eth1
> 217:          0          0          0          0   IO-APIC-level  ohci_hcd:usb1, libata
> 225:          1          0          0          0   IO-APIC-level  ehci_hcd:usb2
> 233:      30917     188965     204419      94202   IO-APIC-level  libata
> NMI:       2933       2685       2485       1792 
> LOC:  710654972  710659916  710660939  710658940 
> ERR:          0
> MIS:          0
> 
> 
> > 	blockdev --getra /dev/sda
> 
> We're using logical volumes, with an 8192-sector read-ahead set on both
> the LV and the underlying disk.
> 
> 
> > ...
> > 	lspci -v
> > 
> > Is your gigabit sharing a 100/133 MB/s old PCI bus with your RAID card?
> > On older motherboards, the gigabit NICs were put on an old PCI branch,
> > typically 100 MB/s max.  If there is a PCI RAID card in the same slot,
> > or, as also often happened on these older motherboards, the SATA ports were
> > hanging off the same old/slow PCI bus, well, it could explain your results.
> 
> We're running Altus 1300 systems. There is just a QLA2342 in the system,
> on the PCI-X bus. There is no RAID card; the storage is handled
> externally via FC.  Here's the output from lspci -tv:
> 
> [root at storage1 rules.d]# lspci -tv
> -+-[0000:06]-+-01.0-[0000:07]--
>  |           +-01.1  Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC
>  |           +-02.0-[0000:08]--+-01.0  QLogic Corp. ISP2312-based 2Gb Fibre Channel to PCI-X HBA
>  |           |                 \-01.1  QLogic Corp. ISP2312-based 2Gb Fibre Channel to PCI-X HBA
>  |           \-02.1  Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC
>  \-[0000:00]-+-00.0  nVidia Corporation CK804 Memory Controller
>              +-01.0  nVidia Corporation CK804 ISA Bridge
>              +-01.1  nVidia Corporation CK804 SMBus
>              +-02.0  nVidia Corporation CK804 USB Controller
>              +-02.1  nVidia Corporation CK804 USB Controller
>              +-06.0  nVidia Corporation CK804 IDE
>              +-07.0  nVidia Corporation CK804 Serial ATA Controller
>              +-08.0  nVidia Corporation CK804 Serial ATA Controller
>              +-09.0-[0000:01]----07.0  ATI Technologies Inc Rage XL
>              +-0b.0-[0000:02]----00.0  Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express
>              +-0c.0-[0000:03]--
>              +-0d.0-[0000:04]----00.0  Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express
>              +-0e.0-[0000:05]--
> 
> The systems have 4GB RAM and are dual Opteron 285.
> 
> The externally attached Xyratex 5200 storage is connected over 2Gbit
> fibre, through a QLogic switch, to a 12-disk array using a hardware RAID
> controller configured as 10+1 RAID 5 with a hot spare and 128K chunks,
> for a 1280K full stripe. The ext3 filesystem was created with a stride
> of 32. The partition table and volume labels were each offset by 128MB
> to keep disk alignment with stripe writes. The disks are 500GB Seagate
> SATA drives, model ST3500641NS. The array controller has read-ahead
> disabled and a 256MB write-back cache enabled. This is the only system
> using the array/enclosure/controller. The filesystem is 4.9TB in size.
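> The stride of 32 just follows from the 128K chunk over 4K blocks
> (128K / 4K = 32), i.e. the filesystem was created with something along
> these lines (the LV path is illustrative):
> 
> mkfs.ext3 -b 4096 -E stride=32 /dev/vg_array/lv_array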
> 
> 
> Here's a set of Bonnie++ numbers if it matters (copied from an html file):
> 
> Ext3, 8G test size:
>   Sequential output: per-char  57304 K/s (90% CPU), block  92685 K/s (34% CPU), rewrite 52007 K/s (12% CPU)
>   Sequential input:  per-char  66123 K/s (90% CPU), block 178088 K/s (19% CPU)
>   Random seeks:      401.9/s (0% CPU)
>   File tests (16:786432:0/16):
>     sequential create 47/s (5% CPU), read 112/s (4% CPU), delete 1782/s (19% CPU)
>     random create 49/s (6% CPU), read 41/s (1% CPU), delete 378/s (5% CPU)
> 
> Or the ever-popular (but totally unrealistic) series of dd tests:
> 
> Read on NFS server
> [root at storage1 ~]# dd if=/mnt/array3/file.dd of=/dev/null bs=4k
> 510933+0 records in
> 510932+0 records out
> 2092777472 bytes (2.1 GB) copied, 12.6766 seconds, 165 MB/s
> 
> (the filesystem was unmounted on the server to clear its cache)
> 
> Read from NFS client
> [root at wopr1 ~]# dd if=/mnt/array3/file.dd of=/dev/null bs=4k
> 418341+0 records in
> 418340+0 records out
> 1713520640 bytes (1.7 GB) copied, 30.2718 seconds, 56.6 MB/s
> 
> Write on NFS client
> [root at wopr1 ~]# dd if=/dev/zero of=/mnt/array3/file.dd bs=4k count=256000
> 256000+0 records in
> 256000+0 records out
> 1048576000 bytes (1.0 GB) copied, 10.1124 seconds, 104 MB/s
> 
> Now we unmount the NFS share, recreate the file on the server, and remount the share, to clear the client cache but leave the file cached on the server.
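> Roughly this sequence (the server-side path is illustrative, and the
> client mount is assumed to be in fstab):
> 
> umount /mnt/array3                                             # client: drop the client-side cache
> dd if=/dev/zero of=/export/array3/file.dd bs=4k count=524288   # server: recreate the ~2GB file (it stays in the server's page cache)
> mount /mnt/array3                                              # client: remount before re-reading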
> 
> [root at wopr1 ~]# dd if=/mnt/array3/file.dd of=/dev/null bs=4k
> 524287+0 records in
> 524287+0 records out
> 2147479552 bytes (2.1 GB) copied, 18.5161 seconds, 116 MB/s
> 
> 
> 
> Since our NFS is over TCP, here are the iperf test results, which basically confirm the dd results above.
> 
> [root at wopr1 ~]# ./iperf -c server
> ------------------------------------------------------------
> Client connecting to server, TCP port 5001
> TCP window size: 6.25 MByte (default)
> ------------------------------------------------------------
> [  3] local client port 37325 connected with server port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  1.10 GBytes   941 Mbits/sec
> 
> 
> 
> iftop confirms the basic numbers I've been talking about. Additionally I
> have been graphing per-port utilization on the QLogic FC switch, and it
> confirms the numbers I've been seeing on the disk side of things and
> helps determine whether the file is in cache or not (or only partially).
> 
> atop shows basically the same thing iostat does, which is that on the
> initial read the FC disk is about 85% utilized and the network is about
> 50% utilized. No other resource seems to be close to its limit. On
> subsequent reads the disk is not touched and the network is 100%
> utilized.
> 
> I have never used dstat before. I will read up on it and see if it
> reveals anything interesting.
> 
> 
> > 
> > Which MB do you have?  Which BIOS rev, ...  Which RAID card, how much
> > RAM, 32 or 64 bit, yadda yadda yadda (all the details you didn't give
> > before).
> > 
> > Joe
> > 
> 



