Troublesome hard drive in LVM: is it broken?

Question: I have an LVM volume set up with several hard drives, and one of them seems to be failing, or at least something strange is going on. Every time the logical volume "series" sees heavy write activity, the running program (rTorrent most of the time) crashes, and dmesg reports:

ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x1810000 action 0xe frozen
ata6.00: irq_stat 0x00400000, PHY RDY changed
ata6: SError: { PHYRdyChg LinkSeq TrStaTrns }
ata6.00: failed command: FLUSH CACHE EXT
ata6.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 40/00:2c:ff:e3:e3/00:00:39:00:00/40 Emask 0x10 (ATA bus error)
ata6.00: status: { DRDY }
ata6: hard resetting link
ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata6.00: configured for UDMA/133
end_request: I/O error, dev sdf, sector 0
ata6: EH complete
I/O error in filesystem ("dm-3") meta-data dev dm-3 block 0x640092a
       ("xlog_iodone") error 5 buf count 32768
xfs_force_shutdown(dm-3,0x2) called from line 1043 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8119b919
Filesystem "dm-3": Log I/O Error Detected.  Shutting down filesystem: dm-3
Please umount the filesystem, and rectify the problem(s)
xfs_force_shutdown(dm-3,0x2) called from line 811 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8119ccfb
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
... and so on

The volume itself:

--- Logical volume ---
LV Name                /dev/storage/series
VG Name                storage
LV UUID                sF6I3A-Ttt5-PEml-BY5i-edOV-43ha-5P75Z3
LV Write Access        read/write
LV Status              available
# open                 1
LV Size                2.86 TiB
Current LE             748800
Segments               29
Allocation             inherit
Read ahead sectors     auto
- currently set to     256
Block device           253:3
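The "Block device 253:3" entry has minor number 3, which matches the dm-3 node named in the XFS errors. To confirm which physical devices sit underneath that node, one option is to look at its slaves directory in sysfs. The following is a minimal sketch, not a definitive tool: dm_slaves is a hypothetical helper name, and the optional second argument (an alternative sysfs root) exists only so the function can be tried against a scratch directory.

```shell
# Sketch: list the block devices underneath a device-mapper node.
# dm_slaves is a hypothetical helper name; the optional second argument
# points it at a different sysfs root (defaults to /sys).
dm_slaves() {
    node="$1"
    sysroot="${2:-/sys}"
    dir="$sysroot/block/$node/slaves"
    if [ ! -d "$dir" ]; then
        echo "no such device-mapper node: $node" >&2
        return 1
    fi
    # Each entry in slaves/ is named after an underlying block device.
    ls "$dir"
}

# Example: dm_slaves dm-3
```

If sdf (or one of its partitions) shows up in the list, the logical volume really does depend on the drive that dmesg is complaining about.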

I then unmounted all the LVM volumes and tried to run xfs_check on one (all the logical volumes use XFS). It says:

ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_check.  If you are unable to mount the filesystem, then use
the xfs_repair -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

So I mounted it, which worked fine, then unmounted it again so I could run the check. The check ran for a while, until it was killed for using too much memory.

# xfs_check /dev/storage/series
/usr/sbin/xfs_check: line 31: 14350 Killed    xfs_db$DBOPTS -F -i -p xfs_check -c "check$OPTS" $1

dmesg then reports

xfs_db invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
xfs_db cpuset=/ mems_allowed=0
Pid: 14350, comm: xfs_db Tainted: P           2.6.32-gentoo-r7 #1
Call Trace:
 [] ? 0xffffffff81067aec
 [] 0xffffffff8107a848
 [] ? 0xffffffff8104ee2c
 [] 0xffffffff8107ac83
 [] 0xffffffff8107adf1
 [] 0xffffffff8107d460
 [] ? 0xffffffff8129d69e
 [] 0xffffffff8108a40d
 [] 0xffffffff8108bd67
 [] 0xffffffff810258ff
 [] 0xffffffff8140290f
Mem-Info:
DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
CPU    1: hi:    0, btch:   1 usd:   0
DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd: 103
CPU    1: hi:  186, btch:  31 usd: 177
Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd:  35
CPU    1: hi:  186, btch:  31 usd: 155
active_anon:717606 inactive_anon:271926 isolated_anon:0
 active_file:155 inactive_file:217 isolated_file:0
 unevictable:0 dirty:0 writeback:48 unstable:0
 free:6959 slab_reclaimable:1102 slab_unreclaimable:4133
 mapped:156 shmem:0 pagetables:3644 bounce:0
DMA free:15888kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15272kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2999 4009 4009
DMA32 free:10020kB min:6052kB low:7564kB high:9076kB active_anon:2377112kB inactive_anon:594248kB active_file:252kB inactive_file:268kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3071904kB mlocked:0kB dirty:0kB writeback:16kB mapped:196kB shmem:0kB slab_reclaimable:1620kB slab_unreclaimable:3980kB kernel_stack:56kB pagetables:3636kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:800 all_unreclaimable? yes
lowmem_reserve[]: 0 0 1010 1010
Normal free:1928kB min:2036kB low:2544kB high:3052kB active_anon:493312kB inactive_anon:493456kB active_file:368kB inactive_file:600kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1034240kB mlocked:0kB dirty:0kB writeback:176kB mapped:428kB shmem:0kB slab_reclaimable:2788kB slab_unreclaimable:12552kB kernel_stack:1008kB pagetables:10940kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2872 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
DMA: 0*4kB 0*8kB 3*16kB 3*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15888kB
DMA32: 459*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 10020kB
Normal: 482*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1928kB
2990 total pagecache pages
2626 pages in swap cache
Swap cache stats: add 129611, delete 126985, find 334/869
Free swap  = 0kB
Total swap = 498004kB
1048560 pages RAM
34218 pages reserved
1846 pages shared
1006066 pages non-shared
Out of memory: kill process 14350 (xfs_db) score 105765 or a child
Killed process 14350 (xfs_db)

The memory problems are most likely unrelated, though I don’t know why xfs_check should need that much.

smartctl has this to say about the drive:

# smartctl -a /dev/sdf
smartctl 5.39.1 2010-01-28 r3054 [x86_64-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Blue Serial ATA family
Device Model:     WDC WD5000AAKS-00YGA0
Serial Number:    WD-WCAS80682099
Firmware Version: 12.01C02
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue May 17 23:17:17 2011 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (13200) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 154) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   226   181   021    Pre-fail  Always       -       3675
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       33
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   061   061   000    Old_age   Always       -       28688
 10 Spin_Retry_Count        0x0012   100   253   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       32
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       19
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       35
194 Temperature_Celsius     0x0022   112   095   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     28541         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SMART seems to think there’s not much wrong, but obviously something is happening. Unfortunately, I’m not sure what I should try now. I’d like to avoid switching cables or replacing the drive until I know for sure it’s needed, but any suggestions are welcome.

Update

As suggested by @Zoredache, I ran badblocks on the drive.

# badblocks -s /dev/sdf
Checking for bad blocks (read-only test): done

From what I understand, badblocks is supposed to output a list of any bad blocks it finds, so this means it didn’t find any.

Answer: Try turning off NCQ for the problematic drive by setting its queue depth to 1 (reference: this page and this page):

echo 1 > /sys/block/sdX/device/queue_depth
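To see what the queue depth was before the change and confirm that the new value took effect, the write can be wrapped in a small function. This is only a sketch: set_queue_depth is a hypothetical helper name, the optional third argument (an alternative sysfs root) is there purely so it can be tried against a scratch directory, and on a real system it has to run as root.

```shell
# Sketch: show the current queue depth, set a new one, and confirm it.
# set_queue_depth is a hypothetical helper; writing to sysfs needs root.
set_queue_depth() {
    dev="$1"                 # e.g. sdf
    depth="$2"               # 1 effectively turns NCQ off
    sysroot="${3:-/sys}"     # assumption: sysfs is mounted at /sys
    f="$sysroot/block/$dev/device/queue_depth"
    if [ ! -w "$f" ]; then
        echo "cannot write $f (wrong device name, or not root?)" >&2
        return 1
    fi
    echo "before: $(cat "$f")"
    echo "$depth" > "$f"
    echo "after:  $(cat "$f")"
}

# Example: set_queue_depth sdf 1
```

The setting does not survive a reboot, so it is easy to undo if it makes no difference.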

You might also try changing out the SATA cable to the drive, because a weak/borderline electrical connection might also cause those kinds of errors.
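The SMART output in the question shows a raw UDMA_CRC_Error_Count (attribute 199) of 1. That counter generally reflects transfer errors on the link between drive and controller, so it is worth noting the value now and checking whether it climbs after the next round of errors. A minimal sketch for pulling out just that number, assuming the usual smartctl attribute table layout (ID in the first column, raw value in the tenth); crc_errors is a hypothetical helper name:

```shell
# Sketch: print the raw value of SMART attribute 199 (UDMA_CRC_Error_Count)
# from `smartctl -A` output read on stdin. crc_errors is a hypothetical name.
crc_errors() {
    # In the attribute table, field 1 is the ID and field 10 the raw value.
    awk '$1 == 199 { print $10 }'
}

# Example (as root): smartctl -A /dev/sdf | crc_errors
```

A count that keeps rising while reallocated and pending sectors stay at 0 would point more toward the cable or port than toward the platters.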

As for your memory problem when running xfs_check: you just need more RAM and/or swap space. That’s a pretty big filesystem, so I’m not surprised that xfs_check needs a lot of memory.
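For a sense of how much memory xfs_db had to work with before it was killed, the figures in the OOM report can be added up. The snippet below is a rough back-of-the-envelope calculation, assuming 4 KiB pages (the usual size on x86_64, but an assumption here); the two input numbers are copied from the "pages RAM" and "Total swap" lines of the report.

```shell
# Figures copied from the OOM report in the question.
pages_ram=1048560      # "1048560 pages RAM"
swap_kb=498004         # "Total swap = 498004kB"
page_kb=4              # assumption: 4 KiB pages (usual on x86_64)

ram_kb=$((pages_ram * page_kb))
total_mb=$(( (ram_kb + swap_kb) / 1024 ))

echo "RAM:   $((ram_kb / 1024)) MiB"
echo "swap:  $((swap_kb / 1024)) MiB"
echo "total: $total_mb MiB"
```

That works out to roughly 4 GiB of RAM plus a little under 0.5 GiB of swap, and the report's "Free swap = 0kB" line shows the swap was completely used up when the process was killed.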
