Question: First of all, I think everyone knows that hard drives fail a lot more often than the manufacturers would like to admit. Google did a study indicating that certain raw attribute values reported in a drive’s S.M.A.R.T. status can correlate strongly with the future failure of the drive.
“We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever.”
Seagate seems to be trying to obscure this information about its drives by claiming that only its own software can accurately determine the status of a drive, and, by the way, that software will not tell you the raw values for the S.M.A.R.T. attributes. Western Digital has made no such claim to my knowledge, but its status reporting tool does not appear to report raw values either.
I’ve been using HD Tune and smartctl from smartmontools to gather the raw values for each attribute. I’ve found that I am indeed comparing apples to oranges when it comes to certain attributes. For example, most Seagate drives report many millions of read errors, while Western Digital drives show 0 read errors 99% of the time. Seagate drives also report many millions of seek errors, while Western Digital drives always seem to report 0.
Q: How do I normalize this data? Is Seagate really producing millions of errors while Western Digital is producing none? Wikipedia’s article on S.M.A.R.T. says that manufacturers report this data in different ways.
Here is my hypothesis:
I think I found a way to normalize (is that the right term?) the data.
Seagate drives have an additional attribute that Western Digital drives do not have (Hardware ECC Recovered). When you subtract the Raw Read Error Rate value from the Hardware ECC Recovered value, you’ll probably end up with 0. That difference seems to be equivalent to Western Digital’s reported “Read Error” count. If so, Western Digital only reports read errors that it cannot correct, while Seagate counts up all read errors and separately tells you how many of those it was able to fix.
I had a Seagate drive where the Read error count was less than the ECC Recovered count and I noticed that many of my files were becoming corrupt. This is how I came up with my hypothesis. The millions of seek errors that Seagate produces are still a mystery to me.
Please confirm or correct my hypothesis if you have additional information.
Here is the SMART status of my Western Digital drive, just so you can see what I’m talking about:
james@ubuntu:~$ sudo smartctl -a /dev/sda
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD1001FALS-00E3A0
Serial Number:    WD-WCATR0258512
Firmware Version: 05.01D05
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu Jun 10 19:52:28 2010 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   179   175   021    Pre-fail  Always       -       4033
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       270
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1468
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       262
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       46
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       223
194 Temperature_Celsius     0x0022   105   102   000    Old_age   Always       -       42
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
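For comparing drives side by side, it can help to pull the attribute rows out of the smartctl text into a structure. The following is only a rough sketch in Python: it assumes the ten-column attribute table layout shown in the output above, the sample rows are copied from that output, and the `parse_attributes` helper name is made up for illustration.

```python
import re

# A few rows copied from the smartctl attribute table above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1468
194 Temperature_Celsius     0x0022   105   102   000    Old_age   Always       -       42
"""

# One attribute row: ID, name, flag, three normalized numbers,
# three text columns, then the raw value (assumed to start with digits).
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flag>0x[0-9a-fA-F]+)\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+(?P<raw>\d+)"
)


def parse_attributes(text):
    """Return {attribute_name: {...}} for every attribute row found in text."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # header, blank line, or anything that is not a row
        attrs[m["name"]] = {
            "id": int(m["id"]),
            "value": int(m["value"]),
            "worst": int(m["worst"]),
            "thresh": int(m["thresh"]),
            "raw": int(m["raw"]),
        }
    return attrs


attrs = parse_attributes(SAMPLE)
for name, a in attrs.items():
    print(f"{a['id']:>3} {name:<24} value={a['value']:>3} raw={a['raw']}")
```

Feeding it the full text captured from `sudo smartctl -a /dev/sda` should work the same way, as long as the table has the same columns.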
Edit: Here is the Seagate drive that I was talking about that was causing data corruption. This data is from HDTune.
HD Tune: ST3250623A Health

ID                               Current  Worst  Threshold  Data       Status
(01) Raw Read Error Rate         45       38     6          77882492   Ok
(03) Spin Up Time                99       98     0          0          Ok
(04) Start/Stop Count            100      100    20         640        Ok
(05) Reallocated Sector Count    100      100    36         0          Ok
(07) Seek Error Rate             85       60     30         359872048  Ok
(09) Power On Hours Count        94       94     0          6028       Ok
(0A) Spin Retry Count            100      100    97         0          Ok
(0C) Power Cycle Count           100      100    20         689        Ok
(C2) Temperature                 25       55     0          25         Ok
(C3) Hardware ECC Recovered      50       47     0          201555081  Ok
(C5) Current Pending Sector      100      100    0          0          Ok
(C6) Offline Uncorrectable       100      100    0          0          Ok
(C7) Ultra DMA CRC Error Count   200      199    0          1          Ok
(C8) Write Error Rate            100      253    0          0          Ok
(CA) TA Counter Increased        100      253    0          0          Ok

Power On Time : 6028
Health Status : Ok
The fact that Hardware ECC Recovered is larger than Raw Read Error Rate is counterintuitive in my opinion.
This is what I’ve found to be a “normal” Seagate drive, where the Hardware ECC Recovered value matches the Raw Read Error Rate:
HD Tune: ST380011A Health

ID                               Current  Worst  Threshold  Data       Status
(01) Raw Read Error Rate         62       46     6          79986164   Ok
(03) Spin Up Time                98       98     0          0          Ok
(04) Start/Stop Count            100      100    20         6          Ok
(05) Reallocated Sector Count    100      100    36         0          Ok
(07) Seek Error Rate             83       60     30         210309663  Ok
(09) Power On Hours Count        93       93     0          6516       Ok
(0A) Spin Retry Count            100      100    97         0          Ok
(0C) Power Cycle Count           99       99     20         1325       Ok
(C2) Temperature                 25       52     0          25         Ok
(C3) Hardware ECC Recovered      62       46     0          79986164   Ok
(C5) Current Pending Sector      100      100    0          0          Ok
(C6) Offline Uncorrectable       100      100    0          0          Ok
(C7) Ultra DMA CRC Error Count   200      188    0          18         Ok
(C8) Write Error Rate            100      253    0          0          Ok
(CA) TA Counter Increased        100      253    0          0          Ok

Power On Time : 6516
Health Status : Ok
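To make the hypothesis concrete, here is the subtraction applied to the two HD Tune reports above, as a small Python sketch. The raw values are copied from those reports; the dictionary keys and the `ecc_minus_read` helper are made-up names, and whether the difference really counts uncorrected reads depends on how Seagate encodes these raw fields, which is not known here.

```python
# Raw "Data" values copied from the two HD Tune reports above.
drives = {
    "ST3250623A": {"raw_read_error_rate": 77882492,
                   "hardware_ecc_recovered": 201555081},
    "ST380011A": {"raw_read_error_rate": 79986164,
                  "hardware_ecc_recovered": 79986164},
}


def ecc_minus_read(drive):
    """Hypothesized 'normalized' figure: ECC Recovered minus Raw Read Error Rate."""
    return drive["hardware_ecc_recovered"] - drive["raw_read_error_rate"]


for model, drive in drives.items():
    print(f"{model}: {ecc_minus_read(drive)}")
# ST3250623A: 123672589
# ST380011A: 0
```

The drive that was corrupting files comes out far from 0, while the “normal” drive comes out at exactly 0. That is consistent with the hypothesis, but two drives are not enough to confirm it.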
EDIT:
I want to clarify that I know that Google generally considers S.M.A.R.T. useless. I know that everyone should back up their data. I am, however, in the business of fixing other people’s computers. Most people have neither backups nor RAID. It is not cost effective for corporations to troubleshoot hard drives, so they just run them in a RAID until they die. I find it useful in my line of work to check the SMART status of the hard drive; it takes about 30 seconds. If I am lucky enough for a bad drive to show a hint of failure such as scan errors or reallocated sectors, I know to get the drive the heck out of there. If no such hint exists, I’ll probably spend many hours troubleshooting slowness and data corruption until I finally find that the hard drive is bad.
I’m just trying to fine tune this process.
Answer: It does appear that different manufacturers use SMART values for sometimes radically different things, as you can see here:
“My hard disk(s) in ReadyNAS is reporting high SMART Raw Read Error Rate, Seek Error Rate, and Hardware ECC Recovered. What should I do?”
“Seagate uses these SMART fields for internal counts, so this is a known issue with Seagate disks. Look for abnormal counts in other fields, especially Reallocated Sector Ct and ATA Error Count.”
So when it comes to your actual question …
“If I am lucky enough for a bad drive to show a hint of failure such as scan errors or reallocated sectors, I know to get the drive the heck out of there. If no such hint exists, I’ll probably spend many hours troubleshooting slowness and data corruption until I finally find that the hard drive is bad.”
I’d say a good rule of thumb is that you can only expect SMART values to be comparable within the same drive manufacturer, and maybe only within the same drive model!
So when you’re looking at diagnosing those SMART counts, keep that in mind… one manufacturer’s “read error retry count” may mean something totally different than another manufacturer’s. Sad but true. 🙁
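As a rough illustration of that rule of thumb, a quick triage pass could skip the raw counters that the quoted FAQ says Seagate uses for internal counts and only flag the fields that the FAQ and the question treat as warning signs. This is a sketch in Python, not a real diagnostic: the `WATCH` and `VENDOR_SPECIFIC` lists and the `triage` function are assumptions made for illustration, the first sample uses numbers from the ST3250623A report in the question, and the second sample is entirely hypothetical.

```python
# Raw values above 0 here are treated as a warning sign.
# ASSUMPTION: this list is illustrative, not a vendor-approved set.
WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

# Per the quoted FAQ, Seagate uses these for internal counts, so their
# raw values are not compared at all.
VENDOR_SPECIFIC = ("Raw_Read_Error_Rate", "Seek_Error_Rate", "Hardware_ECC_Recovered")


def triage(attrs):
    """attrs maps attribute name -> (value, thresh, raw). Returns warning strings."""
    warnings = []
    for name, (value, thresh, raw) in attrs.items():
        # The normalized value is the drive's own scale, so comparing it
        # with the drive's own threshold is safe across vendors.
        if thresh > 0 and value <= thresh:
            warnings.append(f"{name}: value {value} is at or below threshold {thresh}")
        if name in VENDOR_SPECIFIC:
            continue  # raw value is not comparable between manufacturers
        if name in WATCH and raw > 0:
            warnings.append(f"{name}: raw count is {raw}")
    return warnings


# Numbers taken from the ST3250623A report in the question.
seagate = {
    "Raw_Read_Error_Rate": (45, 6, 77882492),
    "Seek_Error_Rate": (85, 30, 359872048),
    "Hardware_ECC_Recovered": (50, 0, 201555081),
    "Reallocated_Sector_Ct": (100, 36, 0),
    "Current_Pending_Sector": (100, 0, 0),
    "Offline_Uncorrectable": (100, 0, 0),
}

# Hypothetical drive with reallocated sectors, for contrast only.
hypothetical = {"Reallocated_Sector_Ct": (98, 36, 12)}

print(triage(seagate))       # []
print(triage(hypothetical))  # ['Reallocated_Sector_Ct: raw count is 12']
```

Note that the Seagate drive from the question passes this check cleanly even though it was corrupting files, which matches the Google finding that many failed drives show no SMART error signals at all.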