Question: Is it likely that a physical degradation of a hard disk could cause bits to ‘flip’ in file contents without the OS ‘noticing’ and telling you about it when reading the file? e.g. could a ‘p’ in an ASCII text file (binary 01110000) change to a ‘q’ (01110001) and then a user (me) be able to open the file and see ‘q’ without being aware that a failure has occurred?

I’m interested in answers relating to FAT, NTFS, or ReFS… if it makes a difference.

I want to know if the OS protects me from this, or if I should be checking my data for invariance between copies/over time.

Answer: Yes, there is a thing called bit rot.

But no, it won’t affect you unnoticed.

When a drive writes a sector to the platters, it doesn’t just write the bits the way they’re stored in RAM. It uses an encoding that avoids overly long runs of the same bit, and it adds ECC codes that allow it to repair errors affecting a few bits and to detect errors affecting more than a few bits.
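
The idea behind such ECC can be illustrated with a toy single-error-correcting code. This is only a sketch — real drives use far stronger codes over whole sectors than the classic Hamming(7,4) shown here, and the function names are mine, not any real firmware API:

```python
# Toy Hamming(7,4) code: 4 data bits are protected by 3 parity bits,
# so any single flipped bit in the 7-bit codeword can be located and repaired.
# Illustrative only; real drives use much stronger codes over whole sectors.

def hamming74_encode(d):
    """Encode 4 data bits (0/1) into a 7-bit codeword, positions 1..7."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Return (data_bits, error_position); position 0 means no error found."""
    c = list(c)
    p1, p2, d1, p3, d2, d3, d4 = c
    s1 = p1 ^ d1 ^ d2 ^ d4
    s2 = p2 ^ d1 ^ d3 ^ d4
    s3 = p3 ^ d2 ^ d3 ^ d4
    pos = s1 + 2 * s2 + 4 * s3  # syndrome = 1-based index of the flipped bit
    if pos:
        c[pos - 1] ^= 1  # repair the single flipped bit
    return [c[2], c[4], c[5], c[6]], pos
```

Flipping any single one of the seven bits and decoding recovers the original four data bits. (Detecting larger errors takes more parity, e.g. an extra overall parity bit.)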

When the drive reads the sector back, it checks these ECC codes and repairs the data if necessary and possible. What happens next depends on the circumstances and on the drive’s firmware, which in turn depends on what the drive is marketed for.

  • If a sector can be read and has no ECC problems, it’s passed to the OS
  • If a sector can be repaired easily, the repaired version may be written to disk, read back, and verified, to determine if the error was a random one (cosmic rays …) or if there is a systematic error with the media
  • If the drive determines there is an error with the media, it reallocates the sector
  • If a sector can be neither read nor corrected after a few read attempts, on a drive that’s designated as a RAID drive, the drive will give up, reallocate the sector, and tell the controller there was a problem. It relies on the RAID controller to reconstruct the sector from the other RAID members, and write it back to the failed drive, which then stores it in the reallocated sector that hopefully doesn’t have the problem.
  • If a sector can’t be read or corrected on a desktop drive, the drive will make many more attempts to read it. Depending on the quality of the drive, this might involve repositioning the head, checking whether any bits flip when read repeatedly, checking which bits are the weakest, and a few other things. If any of these attempts succeed, the drive will reallocate the sector and write back the repaired data.

(This is one of the main differences between drives that are sold as “Desktop”, “NAS/RAID” or “Video surveillance” drives. A RAID drive can just give up quickly and make the controller repair the sector to avoid latency on the user side. A desktop drive will retry again and again, because having the user wait a few seconds is probably better than telling them the data is lost. And a Video drive values constant data rate more than error recovery, as a damaged frame typically won’t even be noticed.)
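
The decision flow described above can be sketched in a few lines of Python. The status values and profile names here are purely illustrative, not anything a real drive exposes:

```python
# Illustrative sketch of how a drive's firmware might react to a sector read,
# depending on its market profile. All names and strings are invented.

def handle_sector_read(ecc_status, profile):
    """ecc_status: 'ok', 'correctable' or 'uncorrectable';
    profile: 'desktop', 'raid' or 'video'."""
    if ecc_status == "ok":
        return "pass data to OS"
    if ecc_status == "correctable":
        # Repaired in flight; write back and verify to decide whether the
        # error was random (cosmic rays ...) or the media is bad, and
        # reallocate the sector in the latter case.
        return "return repaired data, verify media, maybe reallocate"
    # Uncorrectable after a few attempts:
    if profile == "raid":
        # Fail fast and let the RAID controller rebuild the sector.
        return "reallocate sector, report error to controller"
    if profile == "video":
        # Constant data rate beats recovery; a damaged frame goes unnoticed.
        return "give up quickly, keep streaming"
    # Desktop: retry aggressively (reposition head, re-read weak bits, ...).
    return "retry many times, reallocate and write back on success"
```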

Anyway, the drive will know if there has been bit rot and will typically recover from it; if it can’t, it will tell the controller, which will in turn tell the driver, which will tell the OS. Then it’s up to the OS to present this error to the user and act on it. This is why cybernard says:

“I have never witnessed a single bit error myself, but I have seen plenty of hard drives where entire sectors have failed.”

The drive will know there’s something wrong with the sector, but it won’t know which bits have failed. (A single bit that has failed will always be caught by the ECC.)
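
Checksums of this kind make a single-bit flip impossible to miss. As an illustration (a CRC32 stands in here for the drive’s much stronger ECC), the question’s ‘p’-to-‘q’ flip is guaranteed to change the checksum:

```python
import zlib

# A 512-byte "sector" whose first byte is 'p' (binary 01110000).
sector = bytearray(b"p" + b"\x00" * 511)
original_crc = zlib.crc32(bytes(sector))

sector[0] ^= 0b00000001        # flip the lowest bit: 'p' becomes 'q'
assert sector[:1] == b"q"

# A CRC, like any such code, detects every single-bit error.
assert zlib.crc32(bytes(sector)) != original_crc
```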

Please note that chkdsk, and automatically repairing filesystems, do not address repairing data within files. They are targeted at corruption within the structure of the filesystem, like a file size that differs between the directory entry and the number of allocated blocks. The self-healing feature of NTFS will detect structural damage and prevent it from affecting your data further, but it will not repair any data that is already damaged.

There are, of course, other reasons why data may become damaged. For example, bad RAM on a controller may alter data before it’s even sent to the drive. In that case, no mechanism on the drive will detect or repair the corruption, and this may be one reason why the structure of a filesystem gets damaged. Other reasons include plain software bugs, a power failure while writing to the disk (although this is addressed by filesystem journaling), and bad filesystem drivers (the NTFS driver on Linux defaulted to read-only for a long time, because NTFS had been reverse-engineered rather than documented, and the developers didn’t trust their own code).

I had this scenario once, where an application would save all its files to two different servers in two different data centres, to keep a working copy of the data under all circumstances. After a few months, we noticed that in one of the copies, about 0.1% of all files didn’t match the MD5 sum that the application stored in its database. It turned out to be a faulty fibre cable between the server and the SAN.
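
Catching this kind of silent corruption is exactly what an end-to-end checksum pass does. Here is a minimal sketch in Python, assuming you have the expected MD5 digests stored somewhere (as the application above did in its database); the function names are mine:

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Stream the file through MD5 so large files don't have to fit in RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_mismatches(expected):
    """expected: {path: stored_md5_hexdigest}; returns paths whose content changed."""
    return [path for path, digest in expected.items()
            if md5_of_file(path) != digest]
```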

These other failure modes are why some filesystems, like ZFS, keep additional checksum information to detect errors. They’re designed to protect you from far more than just bit rot on the platter.