problems with 3com and intel 100MB cards
Robert G. Brown
rgb at phy.duke.edu
Wed Oct 9 10:24:15 PDT 2002
On 9 Oct 2002, Marcin Kaczmarski wrote:
> It is a proven case that happened at a university in Germany: a
> newly bought super dual-alpha linux cluster with 3com NICs (I do not
> know the model of the cards in this case) simply failed to operate
> while running very demanding scientific calculations in materials
> science, just because of the cards. After replacing them with
> 3-year-old dec tulip cards everything ran fine. I am highly convinced that a
When was this? Cards and drivers are constantly in (r)evolution. Four
or five years ago I think that this experience was common -- real
Digital tulip cards were among the best NICs there were and amazingly
cheap besides, and I personally had endless trouble with 3coms, even on
Intel. However, Digital became Compaq, the tulip was cloned (two or
three times) and sold to Intel besides, every vendor known to man
started adding their own proprietary crap on top of the basic tulip (or
clone, with the clones adding their own intermediate layer), AND 3com
cleaned up its design and Don's drivers started to work quite well
indeed with the cards.
Finally, there is the alpha issue -- don't assume that just because
hardware works on Intel with the Intel (or AMD) kernels that it or its
drivers will work on alphas or anything else. I imagine that companies
like e.g. Scyld spend a LOT of time making sure that their kernels and
drivers do indeed work across hardware architectures for the simple
reason that a lot of the time they don't, initially.
These days, I see 3coms consistently outperform tulip clones (and don't
even want to talk about RTLs), and agree that 3com or eepro (with PXE)
are the NICs of choice for clusters and workstations alike, for at least
Intel and AMD based systems at 100BT. Gigabit cards add yet another
layer of driver and hardware compatibility questions -- you really have
to start looking at the gigabit chip being used to build the NIC and who
actually makes it.
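(If you want to see exactly which chip a given NIC is actually built
around, something like the following works on most reasonably current
linux boxes -- the exact output depends on your pciutils version, so
treat it as a sketch rather than gospel:)

    lspci | grep -i ethernet

That lists the Ethernet controllers the kernel can see, including the
chip vendor, which is usually more informative than whatever name is
silkscreened on the card.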
> server NIC which runs excellently in servers may be completely
> unsuitable for a cluster that runs calculations, because you cannot
> compare the network load that you have on servers with the network
> load that appears while running in a cluster; in the case of a cluster
> it is very much higher. I'm sure of that. We had other reports on the
> cpmd mailing list in September about a 10-node dual alpha linux
> cluster with 3com cards that hangs calculations. I do not believe
> that they have low-price 3com cards in such a cluster.
This is the sort of conclusion that is very dangerous, as it is based on
a fairly small sample (N of one? two?) and hence is pretty much
anecdotal and not necessarily reflective of everybody's general
experience. It may well be that 3com cards have problems in alpha
clusters. It might also be that SOME 3com cards have had problems in
SOME alpha clusters using SOME kernels -- in the past -- and are now
fueling anecdotal reports of failure that might or might not be in the
process of being fixed or have already been fixed in current kernels.
There is, after all, a kernel mailing list and device specific mailing
lists for all the major NIC drivers (I'm still on the driver lists for
some of the primary cards like eepro, 3com and tulip) and if someone
DOES have trouble with a given card on a given architecture, they should
by all means communicate with these lists and hence with the primary
kernel/driver maintainers. Sometimes that is still Don Becker (revered
by all for his work over years on network drivers, beowulfery and more),
sometimes not.
You might find that the "fix" is just a matter of changing a line in
e.g. /etc/modules.conf to ensure that the right driver is being loaded
instead of the wrong one, or upgrading the kernel to a more current one
because of a bug in the particular kernel snapshot you are using. I
personally don't think that it is likely to be because of any
fundamental flaw in the 3com design, as they work pretty well on tens
to hundreds of machines here (stable under all loads, some of the best
bandwidth/latency numbers when benchmarked with netperf, netpipe, or
lmbench). On Intel/AMD, of course, and a variety of kernels from 2.2
on (not so much under 2.0 kernels).
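For what it's worth, a minimal sketch of the sort of thing I mean (the
interface name, driver and hostname below are purely illustrative, and
this assumes a 2.2/2.4-style setup that reads /etc/modules.conf and a
3c905-family card):

    # in /etc/modules.conf: force eth0 onto the 3Com 3c59x/3c905 driver
    # instead of whatever generic module happens to claim the card first
    alias eth0 3c59x

    # quick point-to-point sanity check between two nodes
    # ("node2" is a made-up hostname; netserver must already be running there)
    netperf -H node2 -t TCP_STREAM

If a card/kernel combination turns in numbers far below what 100BT
should deliver, that is a good time to go digging in the driver list
archives for that card/architecture combination.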
rgb
>
> kind regards
> Marcin Kaczmarski
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu