
On Monday, Linux kernel creator Linus Torvalds vented his frustration about the lack of Error Correcting Code (ECC) RAM in consumer PCs and laptops.
… the wrong and backwards “consumers don’t need ECC” policy [made] the ECC memory market go away.
The arguments against ECC have always been complete rubbish. Now even memory manufacturers are starting to do ECC internally, because they have finally admitted that it is absolutely necessary.
If you’re not familiar with ECC RAM, it’s probably because you don’t build or specify dedicated servers using server-class CPUs and motherboards – which, unfortunately, is practically the only place you actually find ECC. In short, ECC RAM includes a small amount of extra memory used for error detection and correction.
Memory errors and probability
In most modern implementations, this means that for each 64-bit word stored in RAM, there are eight check bits. A single-bit error – a 0 flipped to 1 or a 1 flipped to 0 – can be detected and corrected automatically. Two flipped bits in the same word can be detected, but not corrected. Three or more flipped bits in the same word will probably be detected, but detection is not guaranteed.
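The 64-plus-8 layout can be illustrated with a textbook extended Hamming code, sketched below. This is only an assumption-laden teaching example: real memory controllers use their own (often different) codes, and the function names `encode`, `decode` and `flip` are made up for this sketch. It stores seven Hamming parity bits plus one overall parity bit alongside 64 data bits, corrects any single flipped bit, and flags any double flip as uncorrectable.

```python
# Textbook extended Hamming (72,64) SECDED sketch. Not any vendor's actual code.
CODE_LEN = 72                                  # 64 data bits + 8 check bits
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]    # Hamming parity bits (powers of two)
# Position 0 holds the overall parity bit; every other position holds data.
DATA_POSITIONS = [p for p in range(1, CODE_LEN) if p not in PARITY_POSITIONS]


def encode(word):
    """Return a 72-element list of bits protecting a 64-bit integer."""
    bits = [0] * CODE_LEN
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (word >> i) & 1
    for p in PARITY_POSITIONS:
        # Parity bit p covers every position whose index has bit p set.
        bits[p] = sum(bits[pos] for pos in range(1, CODE_LEN) if pos & p) % 2
    bits[0] = sum(bits) % 2                    # overall parity over all 72 bits
    return bits


def decode(bits):
    """Return (status, word); word is None when the error is uncorrectable."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, CODE_LEN):
        if bits[pos]:
            syndrome ^= pos                    # XOR of the positions of all set bits
    overall_ok = sum(bits) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: treat it as a single flip at position `syndrome`
        # (syndrome 0 means the overall parity bit itself was hit).
        if syndrome >= CODE_LEN:
            return "uncorrectable", None       # points outside the word
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome but even overall parity: an even number of flips.
        return "uncorrectable", None
    word = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return status, word


def flip(bits, *positions):
    """Return a copy of `bits` with the given positions inverted."""
    out = list(bits)
    for p in positions:
        out[p] ^= 1
    return out


if __name__ == "__main__":
    stored = encode(0xDEADBEEFCAFEF00D)
    print(decode(stored)[0])                   # ok
    print(decode(flip(stored, 5))[0])          # corrected
    print(decode(flip(stored, 5, 40))[0])      # uncorrectable
```

Three simultaneous flips can produce a syndrome that looks like a valid single-bit error, in which case this sketch would "correct" the wrong bit. That is why detection beyond two bits is likely but not guaranteed.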
Bit flips can happen for several reasons, from cosmic ray impacts to simple hardware failure. A large-scale study of Google’s servers found that about 32 percent of all servers (and 8 percent of all DIMMs) in Google’s fleet experience at least one memory error per year. But the vast majority of those are single-bit errors – and since Google uses server CPUs and ECC RAM, the machines in question keep running.
On consumer machines, even these single-bit errors – which, according to Google’s data, are more than 40 times as likely as multi-bit errors – go undetected and can introduce system instability and data corruption.
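For a rough feel of what the per-DIMM figure could mean for a single machine, the back-of-the-envelope sketch below compounds the 8 percent annual rate across several DIMMs and years. It rests on two loud assumptions that the study does not make: that the rate measured on Google's servers carries over to other machines, and that errors on different DIMMs and in different years are independent. The function name `p_at_least_one_error` is hypothetical.

```python
def p_at_least_one_error(per_dimm_annual_rate, dimms, years=1):
    """Chance that at least one DIMM sees at least one error.

    Assumes (unrealistically) that every DIMM-year is an independent trial
    with the same error probability.
    """
    p_all_clean = (1.0 - per_dimm_annual_rate) ** (dimms * years)
    return 1.0 - p_all_clean


if __name__ == "__main__":
    # Two DIMMs for one year at the 8 percent per-DIMM figure quoted above.
    print(round(p_at_least_one_error(0.08, dimms=2), 4))   # 0.1536
```

Treat the result as an order-of-magnitude illustration only; real error rates vary widely with hardware age, workload and environment.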
Bit changes are not always accidental
Not every RAM error is the result of a hardware failure or stray electromagnetic interference. In recent years, researchers have developed increasingly practical physics-based side-channel attacks, which use rapid, controlled bit flips in areas of RAM an application can access to infer or modify data in adjacent areas of RAM that it should not be able to reach.
Although ECC RAM cannot mitigate RAMBleed-style attacks that deduce the values of adjacent memory, it can usually stop Rowhammer attacks – where rapid, repeated accesses to one area of RAM cause bits in an adjacent area to flip.
Even when ECC cannot actively prevent a Rowhammer attack from affecting the system – for example, when the attack flips several bits in one word – it can at least alert the system to the problem and, in most cases, keep the attack from doing anything worse than causing downtime. (Most ECC systems are configured to halt the entire machine if an uncorrectable error is detected.)
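On Linux, one place this alerting shows up is the kernel's EDAC subsystem, which exposes per-memory-controller counters of corrected and uncorrected errors under `/sys/devices/system/edac/mc/`. The sketch below reads the `ce_count` and `ue_count` files. The function name `read_edac_counts` and its `root` parameter are illustrative choices for this example, and the sketch assumes an EDAC driver is loaded for the platform.

```python
from pathlib import Path

# Default sysfs location of the EDAC memory-controller entries on Linux.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"corrected": n, "uncorrected": m}}.

    Returns an empty dict when the directory is missing, for example on
    machines without ECC RAM or without an EDAC driver loaded.
    """
    counts = {}
    root = Path(root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue                      # skip entries we cannot read or parse
        counts[mc.name] = {"corrected": corrected, "uncorrected": uncorrected}
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(name, "corrected:", c["corrected"], "uncorrected:", c["uncorrected"])
```

A steadily rising corrected count on an otherwise healthy machine is the kind of signal a non-ECC system never gets to see.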
Torvalds blames Intel
And memory makers say it’s because of economics and lower power consumption. And they are lying bastards – let me again point to Rowhammer, and how these problems have been around for generations, but these idiots sold broken hardware to consumers and claimed it was an “attack”, when it was always “we are cutting corners.”
How many times has a Rowhammer-like bit flip happened just out of bad luck on real, non-attack workloads? We will never know. Because Intel was pushing shit on consumers.
Torvalds takes the bold position that the lack of ECC RAM in consumer technology is Intel’s fault, a result of the company’s artificial market segmentation. Intel has an interest in pushing deeper-pocketed businesses toward its most expensive – and most profitable – server-class CPUs rather than letting them make do with lower-margin consumer parts.
Removing ECC RAM support from CPUs that do not directly target the server world is one of the ways Intel has kept these markets strongly segmented. Torvalds’ argument is that Intel’s refusal to support ECC RAM in its consumer-oriented parts – along with its near-monopoly in that space – is the real reason ECC is almost unavailable outside the server space.
The usual argument for why ECC is absent from consumer technology revolves around cost, but we suspect Torvalds is right. Although ECC RAM is a hard-to-find specialty part, it usually costs only about 20 percent more per DIMM than non-ECC RAM at retail. The real problem is that without a motherboard and processor that support it, ECC RAM won’t do you any good.