Single-bit error on RAID device?

Question: For the first time since I’ve owned a PC (30 years), I have experienced an undetected, uncorrected single-bit disk error, and in a RAID array at that. The sequence of events was:

1. Upload a collection of digital images (Camera Raw files) from a CF card

2. Do some editing in Lightroom (which does not update the original file)

3. Back up everything to an external archive disk (using Retrospect)

…time passes (about 1 week)…

4. Open the file again in Lightroom: it’s corrupted (a big square magenta blotch)

5. Restore a copy from the archive disk: the restored copy is NOT corrupted

6. Compare the two files. There is only a single-bit difference: a byte that was originally 0x34 is now 0xB4
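A byte-level comparison like the one in the last step can be sketched in Python. This is only an illustration of the idea, not the tool the poster used; the function names `diff_bytes` and `diff_files` are made up for this example. XOR-ing the two byte values shows which bit changed: 0x34 XOR 0xB4 is 0x80, so only the top bit (bit 7) differs.

```python
def diff_bytes(a: bytes, b: bytes):
    """Return (offset, byte_a, byte_b, flipped_bits) for each differing byte.

    flipped_bits lists the bit positions (0 = least significant) that differ.
    """
    if len(a) != len(b):
        raise ValueError("inputs differ in length")
    diffs = []
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            xor = x ^ y  # set bits mark the positions that changed
            bits = [i for i in range(8) if (xor >> i) & 1]
            diffs.append((offset, x, y, bits))
    return diffs


def diff_files(path_a, path_b):
    """Compare two files of equal size (hypothetical helper for this sketch)."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return diff_bytes(fa.read(), fb.read())


# The byte values from the question: 0x34 became 0xB4.
print(diff_bytes(bytes([0x12, 0x34, 0x56]), bytes([0x12, 0xB4, 0x56])))
# [(1, 52, 180, [7])]
```

A single entry with a single bit position in the result corresponds to the one-bit difference described above.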

The online device is a pair of 2TB drives in RAID-1 on a hardware RAID card (3WARE 9560SE-4LPML).

Given the above sequence, the error was clearly introduced sometime after step 3: the archived copy was not corrupted, so it couldn’t have occurred during the original write. The file is a Canon CR2 raw file, and Lightroom never updates original RAW files; they are considered “digital negatives”. Instead, it saves all edits as sidecar XMP files containing the sequence of edits applied. The file date/time are unmodified from the original.

Clearly somehow the bit error occurred and was propagated by the RAID hardware without producing a warning. I’ve checked the RAID error logs and there’s nothing noteworthy for the last 18 months (since I last upgraded the software and firmware).

To summarize:

The data was originally written correctly

It was then read correctly when it was copied to the backup.

Sometime after that the bit got flipped on the disk (since nothing rewrote the file).

The RAID hardware is set to run a “verify” once a week. It did not detect the error.

That’s just freaky. I would expect a miscompare error in the RAID hardware logs.

Also I can rule out a failing disk since the SMART data on both drives shows zero for all the applicable error attributes:

ID# ATTRIBUTE_NAME           FLAG    VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
196 Reallocated_Event_Count  0x0032  100   100   000    Old_age  Always   -           0
197 Current_Pending_Sector   0x0022  100   100   000    Old_age  Always   -           0
198 Offline_Uncorrectable    0x0008  100   100   000    Old_age  Offline  -           0
199 UDMA_CRC_Error_Count     0x000a  200   200   000    Old_age  Always   -           0

and everything else is nominal as well.

Anyone have a scenario under which this would happen undetected?

Answer: One possibility is a random bit flip in RAM or in the controller during the read in step 4. If the data was corrupted on that read, you would see the corruption in step 4, and if the corrupt data was still cached you would also see it in step 6 when comparing the files, since the comparison might be served from the cache rather than from the disk.

To test this case, power cycle all of your hardware to ensure the caches are cleared, then try opening the file (and running the comparison with the backup) again. If all is well, then this was the problem. There’s no way to know at what stage of the read the bit flip occurred, so you’ll just have to chalk it up as an unsolved mystery.
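After the power cycle, one way to run the comparison against the backup is to hash both copies and compare the digests. The following is a minimal sketch using Python's standard `hashlib`; the function names and the example paths are hypothetical, and any byte-for-byte comparison tool would serve the same purpose.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large raw files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have identical contents (by SHA-256 digest)."""
    return sha256_of(path_a) == sha256_of(path_b)


# Example call (hypothetical paths, not from the original post):
# same_content("D:/photos/IMG_0001.CR2", "E:/restore/IMG_0001.CR2")
```

If the freshly re-read file now matches the restored copy, that suggests the copy on disk was intact and the flipped bit lived only in cached data.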

Failing this, a second, even unluckier possibility is a random bit flip on write in step 1, either in RAM or (more likely, based on your description) on the RAID controller, with steps 2 and 3 operating on a good cached copy even though a corrupt copy already existed on disk. A week later, when you accessed the data again, you of course re-read it from the disk and ended up with the corrupt data that had been written originally. This makes many assumptions and relies on a bit of bad luck. If this is the case, you’ll just have to restore the backup file and move on.

Those are the only two things I can think of, really. It doesn’t sound like an issue with the drives themselves. In any case, since there’s no way to tell where in the hardware the error occurred, I recommend running a full memory diagnostic just to be safe, although more likely the cause was unfortunate EMI or cosmic rays. As Canadian Luke mentioned in his answer, ECC RAM, if your motherboard supports it, will protect against this type of event, at least on the RAM side. Bit flips like this are actually not uncommon at all.

Note: the first case (a bit flip on read, cleared by power cycling) ended up being the OP’s problem, rather than the second possibility.
