Question: Why does a hard drive which is known to have bad blocks (verified in HDTune and HDDScan) freeze my entire system?
It is not the OS drive; it is attached to another SATA port, and I’m trying to copy files from it to another healthy drive.
I have experienced this issue with almost every damaged hard drive and every Windows PC.
I would expect to see freezing only in the program I’m using to copy the files (Windows Explorer, etc.), but instead my entire PC gets jerky, and I cannot browse the web or watch movies while copying files from the damaged drive.
The long story.
I live in a rural area where there are problems with electricity (brownouts, etc.). I use a UPS myself and my own hard drives are perfectly fine. But my neighbors often ask for help with their PC issues, and I often find that their hard drives are damaged, most probably because of the electricity issues. Of course, after replacing the damaged drive I suggest that my neighbors buy a UPS.
I have always wondered why my PC freezes entirely while retrieving data from damaged drives. Is it a hardware issue? Is it caused by the way the OS reads data? Is it something Windows-specific that I won’t experience on *nix?
Anyway, from now on I will use dedicated software (such as Roadkil’s Unstoppable Copier) instead of Windows Explorer, although I’m not sure whether it will behave any differently and avoid freezing the entire PC.
This is not a request for help; it is more for educational purposes, so that I know why things work this way.
Answer: This is one of those areas where SATA is suboptimal. The problem is at the storage device interconnect protocol level, and thus not related to what software you are running. Using another file copier or another operating system won’t magically make things better, except that it might try to set different timeout values to reduce the impact of the problem (which may or may not be possible depending on the hardware and firmware; see below).
There are a few important points here:

1. SATA handles a drive that has stopped responding to commands rather poorly.
2. Consumer drive firmware will typically keep retrying a failing read for a very long time before giving up, and usually offers no way to shorten that (no configurable error recovery control).
3. While the drive is busy retrying, it does not service any other commands, so every other I/O request to it has to wait.
Point #1 is one of the main selling points for SAS on servers; SAS has significantly better error handling than SATA. Point #2 is a drive firmware limitation, and #3 becomes a problem really only because of #2.
So what happens is that the OS issues a “read sectors” command to the disk, and the particular sectors are somehow damaged. The disk then goes into retry mode to try to get the data off the platters, reading again and again until it gets data that is good enough for the disk’s own error correction (FEC) to correct the remaining errors. If you are unlucky, this never happens, but the drive will keep trying for a fairly long period of time before deciding that the read isn’t going to succeed.
Because the operating system is waiting for the read, this will at the very least slow down the copying process to a crawl, and depending on the exact OS architecture can cause the OS to become jerky or even freeze for the duration. The disk, at this point, is busy with the original read and won’t respond to further read commands until the one that is currently executing ends (successfully or unsuccessfully), and other software generally won’t do better than the operating system it is running on.
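To make that last point a bit more concrete, here is a rough sketch (in Python, with made-up paths and block size, not the actual implementation of any particular tool) of what recovery-oriented copiers such as Roadkil’s Unstoppable Copier or ddrescue conceptually do: read block by block, and when a block cannot be read, note it and skip past it instead of aborting the whole copy. Each failing read still blocks for as long as the drive keeps retrying; the tool merely keeps going afterwards.

```python
import os

SRC = "/dev/sdb1"            # assumed: the damaged source (a device node or a file)
DST = "recovered-image.bin"  # assumed: destination image on a healthy drive
BLOCK = 64 * 1024            # assumed block size

src = os.open(SRC, os.O_RDONLY)
dst = os.open(DST, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)

offset = 0
bad = []
while True:
    os.lseek(src, offset, os.SEEK_SET)
    try:
        data = os.read(src, BLOCK)
        if not data:                 # reached the end of the source
            break
    except OSError:
        # Unreadable region: this read may have blocked for a long time
        # while the drive retried. Remember it, pad with zeros, move on.
        bad.append(offset)
        data = b"\0" * BLOCK
    os.lseek(dst, offset, os.SEEK_SET)
    os.write(dst, data)
    offset += len(data)

os.close(src)
os.close(dst)
print(f"done, {len(bad)} unreadable block(s) skipped")
```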
The upshot is that anything that triggers a read (ideally, only a read from the damaged drive) is going to have to wait in line until the damaged drive either successfully reads the sector in question or determines that it cannot be read. Because of SATA’s less-than-optimal handling of nonresponsive drives, however, it may not be only the drive you are copying from that has its I/O delayed. This can very easily cause other software to become slow or unresponsive as well, as that software waits for an unrelated I/O request to finish, even when the operating system itself is able to cope.
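As for the timeout values mentioned earlier: on Linux, the time the kernel will wait for a single command to complete before starting its own error recovery (resetting the device or link) is exposed per device in sysfs, and it can be lowered so that a hung read is given up on sooner. A minimal sketch, assuming the damaged drive shows up as /dev/sdb and that you have root:

```python
from pathlib import Path

# Assumed device name; adjust to the damaged drive. Writing requires root.
timeout_file = Path("/sys/block/sdb/device/timeout")

print("current command timeout:", timeout_file.read_text().strip(), "seconds")

# Shorten the time the kernel waits for a single command (default is 30 s)
# so that a hung read is abandoned sooner. The drive may still keep retrying
# internally; that part is controlled by the drive's own firmware, not here.
timeout_file.write_text("7\n")
```

Note that this only changes how long the host waits; how long the drive itself keeps retrying internally is a separate, drive-level setting (ERC, which comes up again below).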
It’s also important to note here that disk I/O can happen even though you aren’t explicitly accessing any files on disk. The two main causes for this would be load-on-demand executable code, and swap. Since swap is sometimes used even when the system is not under memory pressure, and load-on-demand executable code is common on modern systems and with modern executable file formats, unintended disk read activity during normal use is a very real possibility.
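A tiny sketch of the load-on-demand part, since it is easy to underestimate: memory-mapping a file (which is essentially how executables and shared libraries are loaded) does not read its contents up front; each page is fetched from disk the first time it is touched. The file name here is just an example.

```python
import mmap

# Assumed file name; any large file on the suspect disk works as an example.
with open("some-large-file.bin", "rb") as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Mapping the file reads (almost) nothing; the actual disk I/O happens
    # when a page is first touched. If that page lives on a bad sector,
    # this innocent-looking access stalls until the drive succeeds or gives up.
    first = m[0]
    somewhere_else = m[len(m) // 2]
    m.close()
```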
As pointed out in a comment to the question by Matteo Italia, one mitigation strategy is to use a different storage interconnect, which is a complicated way of saying “put the disk in a USB enclosure”. By going through the USB mass storage protocol, this isolates the problematic SATA portion from the rest of your system, which means that, in theory, only I/O on that specific disk should be affected by I/O problems on that disk.
As a bit of an aside, this is pretty much why SATA (particularly SATA without drive-level ERC) is often discouraged for RAID (especially RAID levels with redundancy, which among the standard levels is everything except RAID 0); the long timeouts and poor error handling can easily cause a whole drive to be thrown out of the array over a single bad sector, something the RAID controller could handle just fine if redundancy exists and the storage interconnect simply told it that this was the problem. SAS was designed for large storage arrays, and thus with the expectation that there will occasionally be problems on various drives, which led to it being designed to handle the case of a single problematic drive or I/O request gracefully even if the drive itself doesn’t. Problematic disks are not very common in consumer systems simply because those tend not to have many disks installed, and the ones that are installed virtually never have redundancy; since SATA aimed to replace PATA/IDE rather than SCSI (the latter being the niche SAS aimed for), it is likely that its error handling features and guarantees were considered adequate for its intended use case.
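On the drive-level ERC mentioned above: on drives that support it, the SCT Error Recovery Control timeout (marketed as TLER or CCTL) can be queried and set through smartmontools. A minimal sketch, assuming smartctl is installed, the drive is /dev/sdb and you are running as root; many consumer drives simply do not support it, or forget the setting on a power cycle:

```python
import subprocess

DEVICE = "/dev/sdb"  # assumed device name

# Query the drive's current SCT Error Recovery Control setting, if any.
subprocess.run(["smartctl", "-l", "scterc", DEVICE], check=True)

# Ask the drive to give up after 7.0 seconds on reads and writes
# (the values are in units of 100 ms). Harmlessly reports an error
# on drives that do not support SCT ERC at all.
subprocess.run(["smartctl", "-l", "scterc,70,70", DEVICE], check=False)
```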