[Beowulf] [jak at uiuc.edu: Re: [APPL:Xgrid] [Xgrid] Re: megaFlops per Dollar? real world requirements]
Eugen Leitl
eugen at leitl.org
Sun May 15 04:28:03 PDT 2005
----- Forwarded message from "Jay A. Kreibich" <jak at uiuc.edu> -----
From: "Jay A. Kreibich" <jak at uiuc.edu>
Date: Sat, 14 May 2005 01:08:51 -0500
To: xgrid-users at lists.apple.com
Subject: Re: [APPL:Xgrid] [Xgrid] Re: megaFlops per Dollar? real world
requirements
User-Agent: Mutt/1.4.2.1i
Reply-To: jak at uiuc.edu
On Thu, May 12, 2005 at 01:45:45PM -0500, Jay A. Kreibich scratched on the wall:
> IPoFW performance is very very low. Expect 100Mb Ethernet (yes,
> that's "one hundred") to provide better performance than 400Mb FW.
> There was a big discussion about this many months ago that led to
> Apple removing any references to IPoFW from their Xserve and cluster
> web pages. The utilization difference is that big.
It appears that there are members on this list who disagree with me
and would rather cuss at me in private than have an intelligent,
rational discussion with the whole group. Since they chose harsh
language over running a few simple bandwidth tests, I ran the tests
myself (numbers below), and will direct a few comments at the group
as a whole. Maybe others can contribute some meaningful comments.
If you disagree with me, at least do it in public.
> While the raw bandwidth numbers for FireWire are higher, the FireWire
> MAC layer is designed around block transfers from a disk, tape, or
> similar device.
First off, let's be sure we're all on the same page. The original
question was about the use of Xgrid over FireWire based networks.
Since Xgrid runs on top of BEEP over TCP/IP, the question really
boils down to one of the performance of IP over FireWire-- i.e., IPoFW.
It is important to understand that this is not an encapsulation of
an Ethernet stream on the FireWire link, or some other more traditional
networking technology, but actually running FireWire as the Layer-2
transport for IP. RFC-2734 explains how this is done.
<http://www.rfc-editor.org/rfc/rfc2734.txt>
The problem with IPoFW is that FireWire is designed as an
infrastructure interconnect, not a networking system. It has a lot
more in common with systems like SCSI, HiPPI, and Fibre Channel
than it does with systems like Ethernet.
Since every major networking technology of the last 30 years has been
frame/packet or cell based (and even cell is getting more and more
rare), it shouldn't be a big shock that most traditional networking
protocols (e.g. IP) are designed and tuned with these types of physical
transport layers in mind. While FireWire is much better at large
bulk transfers, it is not so hot at moving lots of very small data
segments around, such as individual IP packets.
In many ways, it is like the difference between a fleet of large
trucks and a train of piggy-back flat cars. Both are capable of
transporting the same basic unit of data, but each is designed around
a different set of requirements. Each has its strengths and weaknesses,
depending on what you are trying to do. If you're trying to move
data en masse from a disk (or video camera) to a host system, the
train model will serve you much better. The connection setup is
expensive, but the per-unit costs are low assuming a great number of
units. If, on the other hand, you're trying to download data from
the web, the truck model is a better deal. The per-unit costs are a
bit higher, but the system remains fairly efficient with lower
numbers of units, since the connection setup costs much less.
So if you hook two machines together with a FireWire cable, put
one of those machines into "target disk" mode, and start to copy
files back and forth, I would expect you to get really good performance.
In fact, even though GigE has well over twice the bandwidth of
FireWire 400 (GigE = 1000Mbps, FW400 = 400Mbps), I would expect the
FireWire to outperform any network based file protocol, like NFS or
AFP, running over GigE, in operations such as a copy. This is exactly
the type of operation that FireWire is designed to do, so it is no
shock that it does it extremely efficiently. When used in something
like target disk mode, it is also operating at a very low level in
the kernel (on the host side), with a great deal of hardware
assistance. NFS or AFP, on the other hand, are layered on top of
the entire networking stack (on both the "disk" side and the "host"
side) and have to deal with a great number of abstractions. Also,
because of the hardware design (largely having to do with the size
of the packets/frames) it is difficult for most hardware to fully
utilize a GigE connection, so the full 1000Mb can't be used by
anything (this limit isn't specific to file protocols). So it isn't a big
shock that a network file protocol doesn't work very efficiently
and that the slower transport can do a better job-- it is designed
to do a better job, and you aren't using the technologies in the
same way. A more valid comparison might be between FW and iSCSI
over Ethernet so that the two transport technologies are at least
working at the same level (and even then, I would still expect FW
to win, although not by as much).
This is, however, a two-way street. If we return to the question of
IPoFW, where you are moving IP packets rather than disk blocks, it
should be no shock that a transport technology specifically designed
to move network packets can outperform one that was designed around
block copies. Ethernet is a very light-weight protocol (which is
both good and bad, like the trucks) and deals with frame based
network data extremely well. Even if we assume that FireWire can run
with a high efficiency, it would be normal to expect GigE to
outperform it, just because it has 2.5x the bandwidth. But because
you're asking FireWire to do something it isn't all that great at,
the numbers are much worse.
So here's what I did. I hooked my TiBook to my dual 1.25
QuickSilver. On each I created a new Network Location with just the
FireWire physical interface, and assigned each one an address on the
10 net. There were no other active interfaces. I then ran a
series of 60 second tests using "iperf" from NLANR, forcing a
bi-directional data stream over the IPoFW-400 link. I used the TCP
tests, because this is the only way to have the system do direct
bandwidth measurements. This adds overhead to the transaction and
reduces the results (which are indicated as payload bytes only), but
since I ran the test the same way in all cases, that shouldn't make a
huge difference.
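For anyone who wants to reproduce a rough version of this without
installing iperf, here is a minimal Python sketch of the same kind of
measurement: one side counts bytes received, the other pushes data for
a fixed interval. The address, port, and timings are placeholders
rather than the exact values I used, and it only measures one direction
at a time (for the bi-directional runs you'd start a sender and a
receiver on each host at once).

  # tcpbench.py -- rough one-direction TCP throughput test, iperf-style.
  # Run "python tcpbench.py server" on one host, then
  # "python tcpbench.py client 10.0.0.2" on the other.
  import socket, sys, time

  PORT = 5001            # placeholder port
  DURATION = 60          # seconds, same length as the iperf runs
  CHUNK = b"x" * 65536   # 64KB writes

  def server():
      srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
      srv.bind(("", PORT))
      srv.listen(1)
      conn, addr = srv.accept()
      total = 0
      start = time.time()
      while True:
          data = conn.recv(65536)
          if not data:
              break
          total += len(data)
      elapsed = time.time() - start
      print("received %.1f Mbps" % (total * 8.0 / elapsed / 1e6))

  def client(host):
      sock = socket.create_connection((host, PORT))
      sent = 0
      start = time.time()
      while time.time() - start < DURATION:
          sock.sendall(CHUNK)
          sent += len(CHUNK)
      sock.close()
      print("sent %.1f Mbps" % (sent * 8.0 / DURATION / 1e6))

  if __name__ == "__main__":
      if sys.argv[1] == "server":
          server()
      else:
          client(sys.argv[2])

The payload-only accounting matches what iperf reports, so the same
TCP overhead caveat applies.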
Anyway, with the bi-directional test, I was able to get roughly
90Mbps (yes, "ninety megabits per second") upstream, and 30Mbps
downstream using the IPoFW-400 link. It seems there were a lot of
contention issues when data was pushed both ways at the same time,
and one side seemed to always gain the upper hand. That's not a
very good thing for a network to do, and points to self-generated
congestion issues.
If I only pushed data in one direction, I could get it up to about
125Mbps. I'll grant you that's better than 100baseTX, but I'm not
sure I consider half-duplex speeds all that interesting. As was
clear from the other test, when you add data going the other way,
performance drops considerably.
Just to be sure I was doing the test correctly, I ran the same tests
with a point-to-point Ethernet cable between the machines. Both
machines have GigE, so it ran nicely around 230Mbps in both
directions. That may sound a bit low, but the TiBook is an older
machine and the processor isn't super fast. In fact, running 460Mbps
of data through the TCP stack isn't too bad for an 800MHz machine
that isn't running jumbo frames (that's one payload byte per 14 CPU
cycles, which is pretty darn good!).
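In case the cycles-per-byte figure isn't obvious, the back of the
envelope looks like this (using the approximate numbers above):

  # ~230Mbps each direction at once on an 800MHz machine
  cpu_hz = 800e6
  payload_bytes_per_sec = (2 * 230e6) / 8      # 460Mbps total, in bytes/s
  print(cpu_hz / payload_bytes_per_sec)        # roughly 14 cycles per byte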
Speed aside, it is also important to point out that the up-stream and
down-stream numbers were EXACTLY the same. The network seemed to
have no contention issues, and both sides were able to run at the
maximum speed the end-hosts could sustain.
Just for kicks, I manually set both sides to 100Mb/full-duplex and
ran the test. The numbers worked out to about 92Mbps, both ways.
A bit lower than you might expect, but given the known overhead of
TCP it isn't too bad. Again, both sides were able to sustain the
same rates. It is also worth noting that the CPU loads on the systems
seemed to be considerably less for this test than the FireWire test,
even though the amount of data being moved was slightly higher.
I also ran a few UDP tests. In this case, you force iperf to
transmit at a specific rate. If the system or network is unable to
keep up, packets are simply dropped. In a uni-directional test the
IPoFW-400 link could absorb 130 Mbps well enough, and was able to
provide that kind of data rate. When pushed to 200Mbps, the actual
transmitted data dropped to an astounding *20*Mbps or less. It seems
that if a FireWire link gets the least bit congested, it totally freaks
out and all performance hits the floor. This isn't a big surprise given
the upstream/downstream difference in the other tests. These types of
operating characteristics are extremely undesirable for a network
transport protocol.
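The UDP side of iperf is easy enough to approximate as well. Here is
a similar Python sketch that paces datagrams at a target rate on one
side and counts what actually arrives on the other. The address, port,
packet size, and duration are placeholders, and a Python sender won't
sustain anything close to the rates iperf can, so treat it as an
illustration of the method rather than a substitute for the real tool.

  # udpbench.py -- rough UDP offered-vs-delivered rate test.
  # Run "python udpbench.py server" on one host, then
  # "python udpbench.py client 10.0.0.2 200" (target Mbps) on the other.
  import socket, sys, time

  PORT = 5002
  PKT = b"x" * 1400        # stay under a typical MTU
  DURATION = 10            # seconds of sending

  def server():
      sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      sock.bind(("", PORT))
      sock.settimeout(3)   # give up a few seconds after the last packet
      total = 0
      first = last = None
      try:
          while True:
              data, _ = sock.recvfrom(2048)
              last = time.time()
              if first is None:
                  first = last
              total += len(data)
      except socket.timeout:
          pass
      if first and last > first:
          print("delivered %.1f Mbps" % (total * 8.0 / (last - first) / 1e6))

  def client(host, mbps):
      sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      interval = len(PKT) * 8 / (mbps * 1e6)   # seconds between packets
      start = time.time()
      next_send = start
      sent = 0
      while time.time() - start < DURATION:
          sock.sendto(PKT, (host, PORT))
          sent += len(PKT)
          next_send += interval
          delay = next_send - time.time()
          if delay > 0:
              time.sleep(delay)
      print("offered %.1f Mbps" % (sent * 8.0 / DURATION / 1e6))

  if __name__ == "__main__":
      if sys.argv[1] == "server":
          server()
      else:
          client(sys.argv[2], float(sys.argv[3]))

The interesting number is the gap between the offered rate and the
delivered rate; on a healthy link they should track each other until
you hit the link's actual capacity.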
This wasn't a serious or rigorous test, but it should provide some
"back of the envelope" numbers to think about. I encourage others
to run similar tests using various network profiling tools if you
wish to get better numbers.
So call it BS if you want, but if we're talking about moving IP
packets around, I stand by the statement that one should "Expect
100Mb Ethernet to provide better performance than 400Mb FW." I'll
admit the raw numbers are close, and in the case of a nice smooth
uni-directional data stream, the FW400 link actually out-performed
what a 100Mb link could deliver-- but the huge performance
degradation caused by congestion gives me serious pause for a more
generalized traffic pattern. Regardless, it definitely isn't
anything near GigE speeds.
There are also more practical limits to the use of a FireWire network
vs Ethernet. For starters, from what I understand of FireWire
"hubs", they are usually repeater based, and not switch based, at
least in terms of a more traditional Ethernet network. So while
the bandwidth numbers are close for a single point-to-point link, I
would expect the FireWire numbers to drop off drastically when you
started to link five or six machines together. There is also the
issue of port density. You can get 24 port non-blocking GigE
switches for a few thousand bucks. I'm not even sure if a 24 port
FireWire hub exists. If you start to link multiple smaller hubs
together (even with switch-style data isolation), your cluster's
bisection bandwidth sucks, and your performance is going to suffer.
Beyond that, FireWire networks are limited to only 63 devices,
although I would expect that to not be a serious limitation for
most clusters.
In short, while running something over FireWire is possible, I see
very little motivation to do so, especially with the low-cost
availability of high-performance Ethernet interfaces and switches.
-j
--
Jay A. Kreibich | CommTech, Emrg Net Tech Svcs
jak at uiuc.edu | Campus IT & Edu Svcs
<http://www.uiuc.edu/~jak> | University of Illinois at U/C
----- End forwarded message -----