[Beowulf] RHEL7 kernel update for L1TF vulnerability breaks RDMA

Jörg Saßmannshausen sassy-work at sassy.formativ.net
Tue Aug 21 03:08:21 PDT 2018

Dear all,

> All complex systems have flaws. It's more a matter of deciding which flaws
> are acceptable and which aren't, which is driven by economic factors for
> the most part - the cost of fixing the flaw (and potentially introducing a
> new one) vs the cost of damage from the flaw.

I agree with this. 

> I'd find it hard to believe that Intel's CPU designers sat around
> implementing deliberate flaws ( the Bosch engine controller for VW model).

There is the famous example of NIST standardising a deliberately weakened 
encryption algorithm (the Dual_EC_DRBG random-number generator), which came 
to light not so long ago. 

> I'd not find it hard to believe that someone, somewhere raised a speculation
> about a potential flaw, among many others.  That one just didn't happen to
> get resources applied to it, others did.  Picking which ones to attack and
> spend resources on is a difficult question, and often gets answered based
> on totally irrelevant factors.
> That's not negligence - that's just "it is impossible to discover and fix
> all possible bugs"

My understanding of the recent CPU 'problems' is that researchers were 
looking into them some time ago (I believe they were from TU Graz in 
Austria, but I might be wrong here). My hunch is that there is some 'common 
wisdom' about how to design a CPU, and maybe that sometimes does not get 
questioned enough, or in the detail we need. As a scientist friend of mine 
once told me: never jeopardize your results by running a second experiment. I 
totally disagree with this, but these days it seems to be common practice, not 
only in IT. 

> This is not unusual even in MUCH simpler chips-I have some 8 bit wide level
> shifters (from 2.5 to 3.3V logic) that have an obscure behavior with the
> rate at which the two power supplies come up that causes them not to pass
> data (preventing the system in which they are installed from booting).
> About 1 out of 500 times. The mfr's response is "yeah, we think we can
> duplicate that, but we've moved on to a newer version of that chip, why
> don't you replace the chips with the new ones".  This isn't an necessarily
> an issue of the chip not performing to the datasheet specs (essentially,
> the data sheet is silent on this).

And that is exactly the problem: instead of understanding why it behaves 
like this, there is a patch and we move on. Why bother? It only costs money, 
means less profit for the company, and shareholders like to see high profits. 
So we never understand what caused it in the first place; we don't gain 
in-depth knowledge, but we somehow fixed it. Lesson learned? None. 
Again, wearing my gentleman-scientist hat: if we understood this problem, we 
might not need to patch it; we could learn from it and *fix* it properly. 
Hell, we might even improve our design! Oh, hang on, that would require 
putting resources towards it. Sorry folks! :-)

> The Errata and Notes lists for complex parts (like CPUs and large FPGAs)
> runs to hundreds of pages, and continuously grows as people find more odd
> behaviors.

No doubt about that; the same is true in my subject, chemistry. 

> Therefore - one should assume your system has unknown flaws and design your
> software and operational procedures accordingly.

So in a nutshell: we simply have to accept that bridges might collapse, so we 
issue everybody a safety cable when they want to cross the bridge. Can that 
be the solution?

Don't get me wrong. I am deliberately playing devil's advocate here, with the 
aim of illustrating the underlying problem. 
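Playing it straight for a moment: the quoted advice ("assume your system has 
unknown flaws and design your software and operational procedures 
accordingly") can be sketched as a toy defensive pattern. This is a minimal 
sketch of my own, not anything from the thread; the function names and the 
"flaky" stand-in computation are hypothetical:

```python
def run_with_redundancy(compute, attempts=3):
    """Run `compute` twice and accept the result only when both runs agree.

    A defensive pattern for hardware or software with unknown flaws:
    rather than trusting a single execution, cross-check independent runs
    and retry on disagreement.
    """
    for _ in range(attempts):
        a, b = compute(), compute()
        if a == b:
            return a
    raise RuntimeError("no pair of runs agreed; suspect a flaky component")


# A stand-in "flaky" computation: returns garbage on its first call
# (simulated transient corruption), then behaves correctly.
_calls = {"n": 0}

def flaky_square():
    _calls["n"] += 1
    if _calls["n"] == 1:
        return -1  # simulated one-off bit flip / bad read
    return 7 * 7

print(run_with_redundancy(flaky_square))  # first pair disagrees, second agrees: 49
```

Of course this only masks the flaw at extra cost, which is rather the point 
of the discussion: it is the security cable, not the fixed bridge.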

Added: see also Chris' email, which arrived whilst I was composing this one. 

All the best from a sunny London!

