[Beowulf] [jak at uiuc.edu: Re: [APPL:Xgrid] [Xgrid] Re: megaFlops per Dollar? real world requirements]

Eugen Leitl eugen at leitl.org
Sun May 15 04:28:03 PDT 2005


----- Forwarded message from "Jay A. Kreibich" <jak at uiuc.edu> -----

From: "Jay A. Kreibich" <jak at uiuc.edu>
Date: Sat, 14 May 2005 01:08:51 -0500
To: xgrid-users at lists.apple.com
Subject: Re: [APPL:Xgrid] [Xgrid] Re: megaFlops per Dollar? real world
	requirements
User-Agent: Mutt/1.4.2.1i
Reply-To: jak at uiuc.edu

On Thu, May 12, 2005 at 01:45:45PM -0500, Jay A. Kreibich scratched on the wall:

>   IPoFW performance is very very low.  Expect 100Mb Ethernet (yes,
>   that's "one hundred") to provide better performance than 400Mb FW.
>   There was a big discussion about this many months ago that led to
>   Apple removing any references to IPoFW from their Xserve and cluster
>   web pages.  The utilization difference is that big.

  It appears that there are members on this list who disagree with me
  and would rather cuss at me in private than have an intelligent,
  rational discussion with the whole group.  Since they chose harsh
  language over running a few simple bandwidth tests, I ran the tests
  myself (numbers below) and will direct a few comments at the group
  as a whole.  Maybe others can contribute something meaningful.  If
  you disagree with me, at least do it in public.

>   While the raw bandwidth numbers for FireWire are higher, the FireWire
>   MAC layer is designed around block transfers from a disk, tape, or
>   similar device. 

  First off, let's be sure we're all on the same page.  The original
  question was about the use of Xgrid over FireWire based networks.
  Since Xgrid runs on top of BEEP over TCP/IP, the question really
  boils down to the performance of IP over FireWire-- i.e., IPoFW.
  It is important to understand that this is not an encapsulation of
  an Ethernet stream on the FireWire link, or some other more traditional
  networking technology, but actually running FireWire as the Layer-2
  transport for IP.  RFC-2734 explains how this is done.

  <http://www.rfc-editor.org/rfc/rfc2734.txt>

  The problem with IPoFW is that FireWire is designed as an
  infrastructure interconnect, not a networking system.  It has a lot
  more in common with systems like SCSI, HiPPI, and Fibre Channel
  than it does with systems like Ethernet.

  Since every major networking technology of the last 30 years has been
  frame/packet or cell based (and even cells are getting more and more
  rare), it shouldn't be a big shock that most traditional networking
  protocols (e.g. IP) are designed and tuned with these types of physical
  transport layers in mind.  While FireWire is much better at large
  bulk transfers, it is not so hot at moving lots of very small data
  segments around, such as individual IP packets.
  
  In many ways, it is like the difference between a fleet of large
  trucks and a train of piggy-back flat cars.  Both are capable of
  transporting the same basic unit of data, but each is designed around
  a different set of requirements.  Each has its strengths and
  weaknesses, depending on what you are trying to do.  If you're trying
  to move data en masse from a disk (or video camera) to a host system,
  the train model will serve you much better.  The connection setup is
  expensive, but the per-unit costs are low assuming a great number of
  units.  If, on the other hand, you're trying to download data from
  the web, the truck model is a better deal.  The per-unit costs are a
  bit higher, but the system remains fairly efficient with lower
  numbers of units since the connection setup is much cheaper.
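
  If you want to play with that trade-off, here's a toy cost model.
  The numbers are completely made up for illustration; this is just
  "fixed setup cost vs. per-unit cost," not a model of either bus.

      # Toy model: total cost = setup cost + units * per-unit cost.
      # All costs are arbitrary illustrative units, not measurements.
      def transfer_cost(units, setup_cost, per_unit_cost):
          return setup_cost + units * per_unit_cost

      for units in (10, 1000, 100000):
          # "train" (FireWire-like): expensive setup, cheap per unit
          train = transfer_cost(units, setup_cost=500, per_unit_cost=1)
          # "truck" (Ethernet-like): cheap setup, pricier per unit
          truck = transfer_cost(units, setup_cost=10, per_unit_cost=2)
          print(units, "units:", "train", train, "vs. truck", truck)

  At 10 units the truck wins easily; by 100,000 units the train's
  expensive setup has long since paid for itself.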

  So if you hook two machines together with a FireWire cable, put
  one of those machines into "target disk" mode, and start to copy
  files back and forth, I would expect you to get really good
  performance.  In fact, even though GigE has over twice the bandwidth
  of FireWire 400 (GigE = 1000Mbps, FW400 = 400Mbps), I would expect the
  FireWire to outperform any network based file protocol, like NFS or
  AFP, running over GigE, in operations such as a copy.  This is exactly
  the type of operation that FireWire is designed to do, so it is no
  shock that it does it extremely efficiently.  When used in something
  like target disk mode, it is also operating at a very low level in
  the kernel (on the host side), with a great deal of hardware
  assistance.  NFS or AFP, on the other hand, are layered on top of
  the entire networking stack (on both the "disk" side and the "host"
  side) and have to deal with a great number of abstractions.  Also,
  because of the hardware design (largely having to do with the size
  of the packets/frames) it is difficult for most hardware to fully
  utilize a GigE connection, so the full 1000Mb can't be used (and
  this limit isn't specific to file protocols).  So it isn't a big
  shock that a network file protocol doesn't work very efficiently
  and that the slower transport can do a better job-- it is designed
  to do a better job, and you aren't using the technologies in the
  same way.  A more valid comparison might be between FW and iSCSI
  over Ethernet so that the two transport technologies are at least
  working at the same level (and even then, I would still expect FW
  to win, although not by as much).
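
  To put a rough number on that frame-size issue, here's the standard
  back-of-the-envelope math for GigE (theoretical ceiling only, using
  the usual framing overheads and plain 20-byte IP/TCP headers; these
  are not numbers from my test):

      # Theoretical TCP payload rate and frame rate on GigE for
      # standard (1500-byte) vs. jumbo (9000-byte) MTUs.  Per-frame
      # overhead = preamble (8) + Ethernet header (14) + FCS (4) +
      # inter-frame gap (12) bytes.
      LINK_BPS = 1_000_000_000
      FRAME_OVERHEAD = 8 + 14 + 4 + 12
      IP_TCP_HEADERS = 20 + 20

      for mtu in (1500, 9000):
          frames_per_sec = LINK_BPS / ((mtu + FRAME_OVERHEAD) * 8)
          payload_mbps = frames_per_sec * (mtu - IP_TCP_HEADERS) * 8 / 1e6
          print(mtu, round(frames_per_sec), round(payload_mbps))
          # 1500: ~81,274 frames/s, ~949 Mbps of TCP payload
          # 9000: ~13,830 frames/s, ~991 Mbps of TCP payload

  The raw efficiency gain from jumbo frames looks small, but note the
  frame rate: what most host hardware of this class chokes on is
  handling tens of thousands of frames per second, not the bits
  themselves.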

  This is, however, a two-way street.  If we return to the question of
  IPoFW, where you are moving IP packets rather than disk blocks, it
  should be no shock that a transport technology specifically designed
  to move network packets can outperform one that was designed around
  block copies.  Ethernet is a very lightweight protocol (which is
  both good and bad, like the trucks) and deals with frame-based
  network data extremely well.  Even if we assume that FireWire can run
  with a high efficiency, it would be normal to expect GigE to
  outperform it, just because it has 2.5x the bandwidth.  But because
  you're asking FireWire to do something it isn't all that great at,
  the numbers are much worse.

  So here's what I did.  I hooked my TiBook to my dual 1.25
  QuickSilver.  On each I created a new Network Location with just the
  FireWire physical interface, and assigned each one an address on the
  10 net.  There were no other active interfaces.  I then ran a
  series of 60 second tests using "iperf" from NLANR, forcing a
  bi-directional data stream over the IPoFW-400 link.  I used the TCP
  tests, because this is the only way to have the system do direct
  bandwidth measurements.  This adds overhead to the transaction and
  reduces the results (which are indicated as payload bytes only), but
  since I ran the test the same way in all cases, that shouldn't make a
  huge difference.
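
  For anyone who wants to reproduce the flavor of the test without
  hunting down iperf, a bare-bones one-way TCP throughput probe looks
  something like the sketch below.  To be clear, this is not the tool
  I ran, just a minimal stand-in; the port, buffer sizes, and addresses
  are arbitrary placeholders.

      # Minimal one-way TCP throughput probe (Python 3.8+).  Run
      # "server" on one box and "client <server-ip>" on the other.
      import socket, sys, time

      PORT = 5001                 # same default port iperf happens to use
      CHUNK = 64 * 1024           # 64KB writes
      TOTAL = 256 * 1024 * 1024   # push 256MB per run

      def server():
          with socket.create_server(("", PORT)) as srv:
              conn, _ = srv.accept()
              with conn:
                  while conn.recv(CHUNK):   # drain until sender closes
                      pass

      def client(host):
          buf = b"\0" * CHUNK
          sent, start = 0, time.time()
          with socket.create_connection((host, PORT)) as sock:
              while sent < TOTAL:
                  sock.sendall(buf)
                  sent += len(buf)
          elapsed = time.time() - start
          print(round(sent * 8 / elapsed / 1e6, 1), "Mbps of payload")

      if __name__ == "__main__":
          server() if sys.argv[1] == "server" else client(sys.argv[2])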

  Anyway, with the bi-directional test, I was able to get roughly
  90Mbps (yes, "ninety megabits per second") upstream, and 30Mbps
  downstream using the IPoFW-400 link.  There seemed to be a lot of
  contention when data was pushed both ways at the same time,
  and one side seemed to always gain the upper hand.  That's not a
  very good thing for a network to do, and points to self-generated
  congestion issues.

  If I only pushed data in one direction, I could get it up to about
  125Mbps.  I'll grant you that's better than 100baseTX, but I'm not
  sure I consider half-duplex speeds all that interesting.  As was
  clear from the other test, when you add data going the other way,
  performance drops considerably.

  Just to be sure I was doing the test correctly, I ran the same tests
  with a point-to-point Ethernet cable between the machines.  Both
  machines have GigE, so it ran nicely around 230Mbps in both
  directions.  That may sound a bit low, but the TiBook is an older
  machine and the processor isn't super fast.  In fact, running 460Mbps
  of data through the TCP stack isn't too bad for an 800MHz machine
  that isn't running jumbo frames (that's one payload byte per 14 CPU
  cycles, which is pretty darn good!).
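
  (For the curious, the arithmetic behind that cycles-per-byte figure:

      # ~230Mbps each way = ~460Mbps of payload through the TCP stack
      print(800e6 / (460e6 / 8))   # ~13.9 CPU cycles per payload byte

  which rounds to the "one byte per 14 cycles" quoted above.)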
  
  Speed aside, it is also important to point out that the up-stream and
  down-stream numbers were EXACTLY the same.  The network seemed to
  have no contention issues, and both sides were able to run at the
  maximum speed the end-hosts could sustain.

  Just for kicks, I manually set both sides to 100Mb/full-duplex and
  ran the test.  The numbers worked out to about 92Mbps, both ways.
  A bit lower than you might expect, but given the known overhead of
  TCP it isn't too bad.  Again, both sides were able to sustain the
  same rates.  It is also worth noting that the CPU loads on the systems
  seemed to be considerably less for this test than the FireWire test,
  even though the amount of data being moved was slightly higher.
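
  (For reference, with standard 1500-byte frames the theoretical
  ceiling works out to about 100Mbps * 1460/1538 = ~94.9Mbps of TCP
  payload-- the same framing arithmetic as the GigE sketch above-- so
  92Mbps is within a few percent of the wire maximum.)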

  I also ran a few UDP tests.  In this case, you force iperf to
  transmit at a specific rate.  If the system or network is unable to
  keep up, packets are simply dropped.  In a uni-directional test the
  IPoFW-400 link could absorb 130 Mbps well enough, and was able to
  provide that kind of data rate.  When pushed to 200Mbps, the actual
  transmitted data dropped to an astounding *20*Mbps or less.  It seems
  that if a FireWire link gets the least bit congested, it totally freaks
  out and all performance hits the floor.  This isn't a big surprise given
  the upstream/downstream difference in the other tests.  These types of
  operating characteristics are extremely undesirable for a network
  transport protocol.
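
  If you want to mimic the UDP test without iperf, the idea is simply
  "offer a fixed load, then count what actually arrives."  Here's a
  crude sketch-- emphatically not iperf, and the sleep()-based pacing
  is too coarse to hold high rates precisely, but it shows the shape
  of the test.  Port, packet size, and addresses are placeholders.

      # Crude offered-load vs. delivered-load probe (Python 3).
      # "recv" on one box, "send <receiver-ip> <Mbps>" on the other.
      import socket, sys, time

      PORT, PKT = 5002, 1470      # one datagram per Ethernet-sized frame

      def sender(host, target_mbps, seconds=10):
          sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
          gap = PKT * 8 / (target_mbps * 1e6)   # seconds between packets
          payload, end = b"\0" * PKT, time.time() + seconds
          while time.time() < end:
              sock.sendto(payload, (host, PORT))
              time.sleep(gap)                   # crude pacing

      def receiver(seconds=12):
          sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
          sock.bind(("", PORT))
          sock.settimeout(1.0)
          got, end = 0, time.time() + seconds
          while time.time() < end:
              try:
                  got += len(sock.recv(PKT))
              except socket.timeout:
                  pass
          print(round(got * 8 / seconds / 1e6, 1), "Mbps delivered")

      if __name__ == "__main__":
          if sys.argv[1] == "recv":
              receiver()
          else:
              sender(sys.argv[2], float(sys.argv[3]))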

  This wasn't a serious or rigorous test, but it should provide some
  "back of the envelope" numbers to think about.  I encourage others
  to run similar tests using various network profiling tools if you
  wish to get better numbers.

  So call it BS if you want, but if we're talking about moving IP
  packets around, I stand by the statement that one should "Expect
  100Mb Ethernet to provide better performance than 400Mb FW."  I'll
  admit the raw numbers are close, and in the case of a nice smooth 
  uni-directional data stream, the FW400 link actually out-performed
  what a 100Mb link could deliver-- but the huge performance
  degradation caused by congestion gives me serious pause for a more
  generalized traffic pattern.  Regardless, it definitely isn't
  anything near GigE speeds.


  There are also more practical limits to the use of a FireWire network
  vs Ethernet.  For starters, from what I understand of FireWire
  "hubs", they are usually repeater based, and not switch based, at
  least in terms of a more traditional Ethernet network.  So while
  the bandwidth numbers are close for a single point-to-point link, I
  would expect the FireWire numbers to drop off drastically when you
  started to link five or six machines together.  There is also the
  issue of port density.  You can get 24 port non-blocking GigE
  switches for a few thousand bucks.  I'm not even sure if a 24 port
  FireWire hub exists.  If you start to link multiple smaller hubs
  together (even with switch-style data isolation) your cluster's
  bi-section bandwidth sucks, and your performance is going to suffer.
  Beyond that, FireWire networks are limited to only 63 devices,
  although I would expect that to not be a serious limitation for
  most clusters.
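
  As a rough illustration of the bi-section point (idealized
  topologies, plugging in the numbers above):

      # Idealized bi-section bandwidth: 24 hosts sharing one repeated
      # 400Mb FireWire segment vs. a non-blocking 24-port GigE switch
      # where any 12 ports can talk to the other 12 at full rate.
      hosts = 24
      fw_shared_mbps = 400
      gige_bisection_mbps = (hosts // 2) * 1000
      print(fw_shared_mbps, gige_bisection_mbps)   # 400 vs. 12000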

  In short, while running something over FireWire is possible, I see
  very little motivation to do so, especially with the low-cost
  availability of high-performance Ethernet interfaces and switches.

   -j

-- 
                     Jay A. Kreibich | CommTech, Emrg Net Tech Svcs
                        jak at uiuc.edu | Campus IT & Edu Svcs
          <http://www.uiuc.edu/~jak> | University of Illinois at U/C

----- End forwarded message -----