SMART SSD Warnings

So one of my workstations has started popping up SMART notifications:

It’s neat that KDE has this all integrated now. If you don’t have SMART Status in your KDE menu, install plasma-disks.

Now, why is that SSD reported as failing – it’s fairly new, and I don’t use that machine very much …

Click on the ‘Detailed Information’ button and see what it says. There may be errors logged. Look for anything that doesn’t appear normal, such as reallocated sector counts (should be 0 or very close to it). Something has triggered the warning, and it should be evident from the detailed info. I’ve never had an SSD fail, so I’m not sure what kinds of errors these drives report when failure is impending.

@jeffsFOM here is the log:

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.19.8-arch1-1] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 870 EVO 500GB
Serial Number:    S6PXNZ0RB29592Z
LU WWN Device Id: 5 002538 fc1b1f566
Firmware Version: SVT01B6Q
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Oct 16 14:34:53 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x53) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  85) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   006   006   010    Pre-fail  Always   FAILING_NOW 576
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       6574
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       7
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       11
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   006   006   010    Pre-fail  Always   FAILING_NOW 576
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   006   006   010    Pre-fail  Always   FAILING_NOW 576
187 Uncorrectable_Error_Cnt 0x0032   099   099   000    Old_age   Always       -       675
190 Airflow_Temperature_Cel 0x0032   071   056   000    Old_age   Always       -       29
195 ECC_Error_Rate          0x001a   199   199   000    Old_age   Always       -       675
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       3
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       9244365198

SMART Error Log Version: 1
ATA Error Count: 678 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 678 occurred at disk power-on lifetime: 6551 hours (272 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 f8 50 d8 8d 40  Error: UNC at LBA = 0x008dd850 = 9295952

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 f8 50 d8 8d 40 1f   9d+17:21:05.731  READ FPDMA QUEUED
  47 00 01 00 00 00 40 1c   9d+17:21:05.731  READ LOG DMA EXT
  47 00 01 30 06 00 40 1c   9d+17:21:05.731  READ LOG DMA EXT
  47 00 01 30 00 00 40 1c   9d+17:21:05.731  READ LOG DMA EXT
  47 00 01 00 00 00 40 1c   9d+17:21:05.731  READ LOG DMA EXT

Error 677 occurred at disk power-on lifetime: 6551 hours (272 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 e0 50 d8 8d 40  Error: UNC at LBA = 0x008dd850 = 9295952

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 e0 50 d8 8d 40 1c   9d+17:21:05.547  READ FPDMA QUEUED
  61 08 a0 78 97 ed 40 14   9d+17:21:05.547  WRITE FPDMA QUEUED
  47 00 01 00 00 00 40 1b   9d+17:21:05.547  READ LOG DMA EXT
  47 00 01 30 06 00 40 1b   9d+17:21:05.547  READ LOG DMA EXT
  47 00 01 30 00 00 40 1b   9d+17:21:05.547  READ LOG DMA EXT

Error 676 occurred at disk power-on lifetime: 6551 hours (272 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 d8 50 d8 8d 40  Error: UNC at LBA = 0x008dd850 = 9295952

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 d8 50 d8 8d 40 1b   9d+17:21:05.346  READ FPDMA QUEUED
  61 50 18 48 bd 7d 40 03   9d+17:21:05.346  WRITE FPDMA QUEUED
  61 48 10 00 b8 7d 40 02   9d+17:21:05.346  WRITE FPDMA QUEUED
  61 38 08 c8 25 25 40 01   9d+17:21:05.346  WRITE FPDMA QUEUED
  61 80 00 48 20 25 40 00   9d+17:21:05.346  WRITE FPDMA QUEUED

Error 675 occurred at disk power-on lifetime: 6550 hours (272 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 d8 d8 a6 2f 40  Error: WP at LBA = 0x002fa6d8 = 3122904

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 08 d8 d8 a6 2f 40 1b   9d+16:07:42.488  WRITE FPDMA QUEUED
  61 08 08 18 ad eb 40 01   9d+16:07:42.488  WRITE FPDMA QUEUED
  61 08 90 40 27 70 40 12   9d+16:07:42.488  WRITE FPDMA QUEUED
  60 08 30 50 d8 8d 40 06   9d+16:07:42.488  READ FPDMA QUEUED
  61 08 28 68 ad eb 40 05   9d+16:07:42.488  WRITE FPDMA QUEUED

Error 674 occurred at disk power-on lifetime: 6550 hours (272 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 d0 68 ad eb 40  Error: WP at LBA = 0x00ebad68 = 15445352

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 08 d0 68 ad eb 40 1a   9d+16:07:42.291  WRITE FPDMA QUEUED
  61 08 c8 48 27 70 40 19   9d+16:07:42.291  WRITE FPDMA QUEUED
  61 08 c0 e0 27 70 40 18   9d+16:07:42.291  WRITE FPDMA QUEUED
  61 08 b8 d8 3c 70 40 17   9d+16:07:42.291  WRITE FPDMA QUEUED
  60 08 b0 e0 6c 94 40 16   9d+16:07:42.291  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  256        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
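For anyone who wants to pick the failing rows out of a log like this without reading every column, here is a minimal Python sketch. It assumes the attribute table layout shown above and uses the usual smartmontools rule that an attribute has failed when its normalized VALUE is at or below a non-zero THRESH. The names `parse_attributes` and `failing` are made up for illustration, and the sample rows are copied from the log.

```python
import re

# A few rows copied from the attribute table in the log above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   006   006   010    Pre-fail  Always   FAILING_NOW 576
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       6574
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   006   006   010    Pre-fail  Always   FAILING_NOW 576
183 Runtime_Bad_Block       0x0013   006   006   010    Pre-fail  Always   FAILING_NOW 576
187 Uncorrectable_Error_Cnt 0x0032   099   099   000    Old_age   Always       -       675
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
"""

# One table row: ID, name, flag, VALUE, WORST, THRESH, type, updated,
# WHEN_FAILED, raw value (only plain integer raw values are handled here).
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+(?P<raw>\d+)"
)


def parse_attributes(text):
    """Return one dict per attribute row found in smartctl-style text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m["id"]),
                "name": m["name"],
                "value": int(m["value"]),
                "thresh": int(m["thresh"]),
                "type": m["type"],
                "raw": int(m["raw"]),
            })
    return rows


def failing(rows):
    """Attributes whose normalized value is at or below a non-zero threshold."""
    return [r for r in rows if r["thresh"] > 0 and r["value"] <= r["thresh"]]


if __name__ == "__main__":
    for r in failing(parse_attributes(SAMPLE)):
        print(r["id"], r["name"], "value", r["value"], "thresh", r["thresh"], "raw", r["raw"])
```

On this log the three rows marked FAILING_NOW (5, 179, 183) are the ones the rule selects; attributes with a threshold of 000, such as 187, can never trip it regardless of their raw count.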

One post mentioned updating drive firmware, so I started down that process:

[cbrake@ariel ~]$ sudo fwupdmgr get-devices
[sudo] password for cbrake: 
WARNING: UEFI capsule updates not available or enabled in firmware setup
  See https://github.com/fwupd/fwupd/wiki/PluginFlag:capsules-unsupported for more information.
Gigabyte Technology Co., Ltd. To be filled by O.E.M.
│
├─SSD 870 EVO 500GB:
│     Device ID:          602b0a6cc821d155208724f0e22f8d111542b74c
│     Summary:            ATA drive
│     Current version:    SVT01B6Q
│     Vendor:             Samsung (ATA:0x144D, OUI:002538)
│     Serial Number:      S6PXNZ0RB29592Z
│     GUIDs:              97fe3b7f-7d9c-553d-b3bc-365d46ba8a0b ← IDE\Samsung_SSD_870_EVO_500GB_______________SVT01B6Q
│                         c8b943ea-8532-519a-afa8-55d93323a07d ← IDE\0Samsung_SSD_870_EVO_500GB_______________
│                         d86b4839-1599-56b3-99b5-203ea834e2c2 ← Samsung SSD 870 EVO 500GB
│     Device Flags:       • Internal device
│                         • Updatable
│                         • System requires external power source
│                         • Needs a reboot after installation
│                         • Device is usable for the duration of the update

But then it gave me:

[cbrake@ariel ~]$ sudo fwupdmgr update
WARNING: UEFI capsule updates not available or enabled in firmware setup
  See https://github.com/fwupd/fwupd/wiki/PluginFlag:capsules-unsupported for more information.
Devices with no available firmware updates: 
 • SSD 870 EVO 500GB
 • ST2000VX000-1CU164
 • ST3000NC000
 • ST32000645NS
 • ST4000DM006-2G5107
No updatable devices

Anyone have thoughts on what brand SSDs to get? I’m glad to pay a little more for something that is reliable (thought I was doing that with Samsung …). I purchased this drive in January 2022, so it’s not even a year old.
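To put "not even a year old" next to the numbers in the SMART log, here is a rough back-of-the-envelope sketch in Python. Two things are assumptions, not facts from the thread: that Total_LBAs_Written counts 512-byte units (matching the logical sector size in the log), and the purchase day of 1 January 2022, which is a placeholder since only the month was given.

```python
from datetime import date

POWER_ON_HOURS = 6574            # attribute 9, raw value from the log
TOTAL_LBAS_WRITTEN = 9244365198  # attribute 241, raw value from the log
SECTOR_BYTES = 512               # assumption: one LBA = 512 bytes

PURCHASED = date(2022, 1, 1)     # placeholder day; only "January 2022" is known
LOG_DATE = date(2022, 10, 16)    # "Local Time" line in the smartctl output


def powered_days(hours):
    """Convert power-on hours to days."""
    return hours / 24


def bytes_written(lbas, sector_bytes=SECTOR_BYTES):
    """Total bytes written, under the 512-byte-per-LBA assumption."""
    return lbas * sector_bytes


if __name__ == "__main__":
    owned = (LOG_DATE - PURCHASED).days
    print(f"powered on: {powered_days(POWER_ON_HOURS):.1f} days of at most {owned} owned")
    print(f"written:    {bytes_written(TOTAL_LBAS_WRITTEN) / 1e12:.2f} TB")
```

Under those assumptions the drive has been powered for roughly 274 days and has taken about 4.7 TB of writes, so it has been on nearly continuously since purchase even if the machine itself saw little interactive use.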

I have had Toshiba, Samsung EVO and Western Digital drives, and so far they have all held up well. The Toshiba holds the rootfs on my builder and is 5+ years old now. The Samsung EVO 850 (1TB) is the scratch disk and has been for 5+ years; it’s on a SATA interface, so a bit slower, but it works reliably. The WD Black (1TB) is the latest one (3 years old) and is the fastest since it’s NVMe. I think Samsung EVO or WD are good value for money. BTW, I use f2fs for the filesystem.

@cbrake you should be able to warranty that drive back to Samsung and get a replacement with so few hours. Those EVO 870 disks should have a few years of warranty on them.

If you see READ FPDMA QUEUED type errors, sometimes these can be caused by poor signal quality on the SATA lines (I’m assuming this is a SATA drive). If you still see these kinds of errors when using a different drive, you might try a new SATA cable.
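As a way of looking at the logged errors more closely, the following Python sketch tallies the five entries from the smartctl error log by error type and LBA. It also rebuilds the LBA from the register bytes, assuming the 28-bit layout (SN, CL, CH, plus the low nibble of DH); that layout is an assumption on my part, though it does reproduce the LBAs smartctl printed. The function names are illustrative only.

```python
import re
from collections import Counter

# The five "registers were" lines copied from the error log above.
SAMPLE = """\
  40 51 f8 50 d8 8d 40  Error: UNC at LBA = 0x008dd850 = 9295952
  40 51 e0 50 d8 8d 40  Error: UNC at LBA = 0x008dd850 = 9295952
  40 51 d8 50 d8 8d 40  Error: UNC at LBA = 0x008dd850 = 9295952
  40 51 d8 d8 a6 2f 40  Error: WP at LBA = 0x002fa6d8 = 3122904
  40 51 d0 68 ad eb 40  Error: WP at LBA = 0x00ebad68 = 15445352
"""

# Seven register bytes (ER ST SC SN CL CH DH), then smartctl's decoded summary.
ERR = re.compile(
    r"^\s*((?:[0-9a-f]{2}\s+){7})Error: (\w+) at LBA = 0x[0-9a-f]+ = (\d+)"
)


def lba_from_registers(regs):
    """Rebuild a 28-bit LBA from ER ST SC SN CL CH DH (assumed layout)."""
    _er, _st, _sc, sn, cl, ch, dh = regs
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def summarize(text):
    """Count logged errors per (error type, LBA)."""
    tally = Counter()
    for line in text.splitlines():
        m = ERR.match(line)
        if not m:
            continue
        regs = [int(b, 16) for b in m.group(1).split()]
        lba = int(m.group(3))
        # Sanity check: the register bytes should decode to the printed LBA.
        if lba_from_registers(regs) != lba:
            raise ValueError(f"register/LBA mismatch in: {line.strip()}")
        tally[(m.group(2), lba)] += 1
    return tally


if __name__ == "__main__":
    for (kind, lba), count in summarize(SAMPLE).items():
        print(f"{kind:4} LBA {lba:>10}  x{count}")
```

In this log the three UNC entries all point at the same LBA, while the two WP entries hit different ones. The tally only organizes what the drive reported; it does not by itself distinguish a cable problem from a drive problem.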

Given some very recent disappointment with Samsung 800-series SATA SSD drives, I’m very hesitant to recommend them. But I have found Micron 5200 drives to be quite good. The 5200 series has been replaced by the 5300 series, which I presume are also quite good even though I’ve not had any personal experience with them. I’ve also had a very good experience with Intel S3710 drives, although Solidigm doesn’t seem that interested in SATA SSDs anymore.

I’m a convert to using datacenter grade SSDs now. The performance is simply good and solid. Consumer grade stuff costs less, but you definitely get what you pay for, especially if you care about write endurance or reliable write performance of small files.

Thanks for the ideas!

The SMART warnings were timely, but not much warning – the disk just died completely and does not even detect on boot. Impressive that they gave me some warning before it died, but not much.

@cbrake hopefully you have backups or haven’t lost too much important data! :frowning:

Yesterday I learned that the Micron 5400 series drives are actually quite competitively priced. If you only need 500GB or 1TB they don’t command that much of a premium over a consumer grade SATA SSD. Sometimes price is what matters, but if you can swing the money I’d definitely consider a datacenter-grade drive going forwards.

Yeah, I have backups for static files. Most project files are in Git, mail is on an IMAP server, etc.

Where do you buy Micron 5400s? It seems Amazon is geared more toward consumer grade stuff. BTW, has Amazon completely taken over the computer parts market? Places like Newegg, which I used to use in the past, seem to have completely lost it and have become. BH Photo seems like a good source for some computer gear, but they again are geared toward consumer.

This is interesting – installed new drive, and attempted to install Arch – first time, got HD I/O errors pretty quickly. Tried re-seating the SATA cable at the MB, and it went much further, but eventually failed. So, may have a cable/MB problem here – drive may still be good … to be continued …

It could be a misdiagnosed problem; it happens all the time :slight_smile: Interested to see what you find.

Hooked the dead SSD up to other workstation – can’t get much to happen. Initially it shows up in lsblk, but as soon as I do something like fdisk -l /dev/sdb, it shows size of 0 and I get tons of kernel errors. Perhaps the motherboard caused enough errors that the drive shut itself down or something. Could also be a power supply issue where the supply to the drive is not stable. My best guess is the MB caused the failure though. If I could run a factory reset on the drive or something, may be able to recover it, but a quick search is not showing anything. Hopefully I did not roast the new drive …
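The "shows up in lsblk but reports a size of 0" symptom can also be checked without sending any I/O to the drive, by reading the size the kernel exposes in sysfs. A minimal Python sketch follows; `/sys/block/<dev>/size` is reported in 512-byte units, while the function name and the `sdb` device name are just illustrative.

```python
from pathlib import Path


def reported_size_bytes(dev, sysfs_root="/sys/block"):
    """Size the kernel currently reports for a block device, in bytes.

    Returns None if the device is not present (or the value is unreadable).
    The sysfs 'size' file is always in 512-byte units.
    """
    size_file = Path(sysfs_root) / dev / "size"
    try:
        sectors = int(size_file.read_text().strip())
    except (OSError, ValueError):
        return None
    return sectors * 512


if __name__ == "__main__":
    size = reported_size_bytes("sdb")  # hypothetical device name
    if size is None:
        print("sdb: not present")
    elif size == 0:
        print("sdb: present but reporting zero capacity")
    else:
        print(f"sdb: {size / 1e9:.1f} GB")
```

A device that is present but reports zero capacity matches what fdisk showed here; it says the drive is answering on the bus but not presenting its media, not why.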

Unless there was some kind of power supply problem on the motherboard, I doubt the drive got roasted. It is probably just generally corrupted, and hopefully it will still be usable after repartitioning and creating new filesystems. That doesn’t help with recovery of the data on board, unfortunately, but that might be toasted anyway. I have had USB memory sticks destroyed by desktops, but almost assuredly that was due to ESD when inserting the sticks or other power anomalies related to hot plugging. I think it might be possible for rogue software to wreck an SSD, but I doubt this is what happened to you.

CDW has them in stock: Solid State Drives | CDW

The product brief with info about write endurance and the different types of 5400 drives: https://media-www.micron.com/-/media/client/global/documents/products/product-flyer/5400_product_brief.pdf?la=en&rev=c11ea20b11fd4391b5b1aae897b79e8b

I’ve had this kind of failure happen with a USB connected flash drive. Sometimes I’m able to write data to it but always within a short time it goes bonkers and there’s tons of kernel errors reported for it. Wiping it, blkdiscarding it, repartitioning, nothing helps, it’s dead. You may have a similar kind of failure. Sadly there’s very little info publicly available to understand how/why managed flash devices die.
