Question: I have an LVM setup spanning several hard drives, and one of the drives seems to be failing, or at least something strange is going on. Every time the logical volume named series sees heavy write activity, the running program (rTorrent most of the time) crashes, and dmesg reports:
ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x1810000 action 0xe frozen
ata6.00: irq_stat 0x00400000, PHY RDY changed
ata6: SError: { PHYRdyChg LinkSeq TrStaTrns }
ata6.00: failed command: FLUSH CACHE EXT
ata6.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 40/00:2c:ff:e3:e3/00:00:39:00:00/40 Emask 0x10 (ATA bus error)
ata6.00: status: { DRDY }
ata6: hard resetting link
ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata6.00: configured for UDMA/133
end_request: I/O error, dev sdf, sector 0
ata6: EH complete
I/O error in filesystem ("dm-3") meta-data dev dm-3 block 0x640092a ("xlog_iodone") error 5 buf count 32768
xfs_force_shutdown(dm-3,0x2) called from line 1043 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8119b919
Filesystem "dm-3": Log I/O Error Detected.  Shutting down filesystem: dm-3
Please umount the filesystem, and rectify the problem(s)
xfs_force_shutdown(dm-3,0x2) called from line 811 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8119ccfb
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
... and so on
The volume itself:
  --- Logical volume ---
  LV Name                /dev/storage/series
  VG Name                storage
  LV UUID                sF6I3A-Ttt5-PEml-BY5i-edOV-43ha-5P75Z3
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                2.86 TiB
  Current LE             748800
  Segments               29
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:3
I then unmount all the LVM volumes and try to run xfs_check on one of them (all the logical volumes use XFS). It says:
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_check.  If you are unable to mount the filesystem, then use
the xfs_repair -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
so I go ahead and mount it, which works fine, then unmount again so I can run the check. This runs for a while, until it is killed for using too much memory.
# xfs_check /dev/storage/series
/usr/sbin/xfs_check: line 31: 14350 Killed    xfs_db$DBOPTS -F -i -p xfs_check -c "check$OPTS" $1
dmesg then reports:
xfs_db invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
xfs_db cpuset=/ mems_allowed=0
Pid: 14350, comm: xfs_db Tainted: P 2.6.32-gentoo-r7 #1
Call Trace: [
The memory problems are most likely unrelated, though I don’t know why xfs_check should need that much.
smartctl has this to say about the drive:
# smartctl -a /dev/sdf
smartctl 5.39.1 2010-01-28 r3054 [x86_64-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Blue Serial ATA family
Device Model:     WDC WD5000AAKS-00YGA0
Serial Number:    WD-WCAS80682099
Firmware Version: 12.01C02
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue May 17 23:17:17 2011 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (13200) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 154) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   226   181   021    Pre-fail  Always       -       3675
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       33
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   061   061   000    Old_age   Always       -       28688
 10 Spin_Retry_Count        0x0012   100   253   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       32
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       19
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       35
194 Temperature_Celsius     0x0022   112   095   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     28541         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SMART seems to think there’s not much wrong, but obviously something is happening. Unfortunately, I’m not sure what I should try now. I’d like to avoid switching cables or replacing the drive until I know for sure it’s needed, but any suggestions are welcome.
Update
As suggested by @Zoredache, I ran badblocks on the drive.
# badblocks -s /dev/sdf
Checking for bad blocks (read-only test): done
As far as I understand, this is supposed to output a list of bad blocks, so the empty list means it didn’t find any.
Answer: Try turning off NCQ for the problematic drive by setting its queue depth to 1 (replace sdX with the drive's device name, which looks like sdf in your case):
echo 1 > /sys/block/sdX/device/queue_depth
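If you want to confirm that the change actually took effect, something along these lines might help. Treat it as a sketch only: set_queue_depth is a made-up helper name, the optional third argument (an alternative sysfs root) is there purely so the function can be tried against a scratch directory, and against the real /sys it has to run as root. Note that the setting does not survive a reboot.

```shell
# Hypothetical helper (not part of any tool): set a drive's queue depth and
# print the value before and after. A depth of 1 effectively disables NCQ.
# Usage: set_queue_depth DEVICE DEPTH [SYSFS_ROOT]
set_queue_depth() {
    dev="$1"
    depth="$2"
    sysroot="${3:-/sys}"   # assumption: only overridden for dry runs
    f="$sysroot/block/$dev/device/queue_depth"

    if [ ! -w "$f" ]; then
        echo "cannot write $f (wrong device name, or not root?)" >&2
        return 1
    fi

    echo "before: $(cat "$f")"
    echo "$depth" > "$f"
    echo "after: $(cat "$f")"
}
```

For the drive in the question that would be called as: set_queue_depth sdf 1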
You might also try changing out the SATA cable to the drive, because a weak/borderline electrical connection might also cause those kinds of errors.
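One way to judge whether the cable is to blame: the UDMA_CRC_Error_Count attribute (ID 199, raw value 1 in your smartctl output) counts transfer errors on the link between drive and controller, so if it climbs after a bout of heavy writes, the cable or connector becomes the more likely suspect. A rough sketch for pulling that number out of the attribute table follows. crc_error_count is an invented name, and it assumes the ten-column table layout shown in your output.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the raw
# value of attribute 199 (UDMA_CRC_Error_Count).
# Assumes the 10-column attribute table, with the raw value in column 10.
crc_error_count() {
    awk '$1 == 199 { print $10 }'
}
```

Run it as: smartctl -A /dev/sdf | crc_error_count, once before and once after a heavy write session, and compare the two numbers.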
As for your memory problem when running xfs_check, you just need more RAM and/or swap space. That’s a pretty big filesystem, so I’m not surprised that xfs_check needs a lot of memory.
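To see how much memory xfs_check has to work with, and to give it more for the duration of the check, a sketch like the following may do. Both function names are invented here, and the swap file path and size are placeholders you would pick yourself. add_temp_swap must run as root, and the swap file has to live on a filesystem other than the one being checked.

```shell
# Hypothetical helper: print MemTotal + SwapTotal (in kB, as reported) from a
# /proc/meminfo-style file. Defaults to the real /proc/meminfo.
ram_plus_swap_kib() {
    awk '/^(MemTotal|SwapTotal):/ { sum += $2 } END { print sum + 0 }' "${1:-/proc/meminfo}"
}

# Hypothetical helper: create and enable a temporary swap file.
# Usage: add_temp_swap PATH SIZE_MIB   (run as root)
add_temp_swap() {
    swapfile="$1"
    size_mib="$2"
    dd if=/dev/zero of="$swapfile" bs=1M count="$size_mib" &&
    chmod 600 "$swapfile" &&
    mkswap "$swapfile" &&
    swapon "$swapfile"
}
```

For example, add_temp_swap /root/xfs_check.swap 8192 would add 8 GiB of swap (path and size are only examples). When the check is done, swapoff the file and delete it.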