Checking Hard Drives from the Linux Command Line

When buying new hard drives it’s always a good idea to check them for bad blocks. If a drive has bad blocks it can be an indication it’s not long for this world, especially if the number of bad blocks grows over a short period of time. In this article I’ll show you how to quickly check the health of a hard drive. I should point out that none of these tests guarantee that your drive is fine; all they show is that nothing could be detected at the time of the test. For that reason, always back up anything you care about.

NOTE: in the examples below I’m working as root since this is just a demo machine I put together. Some commands (e.g. fdisk -l) will usually require sudo to be prepended (e.g. sudo fdisk -l).

Discovering Disks

The first thing you need to do is find out what disks your system knows about. The easiest way to do that is with the fdisk command.

root@pm1:~# fdisk -l

Disk /dev/sdb: 298.09 GiB, 320072933376 bytes, 625142448 sectors
Disk model: WDC WD3200AAKX-0
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: F891A5A3-B02B-084D-ACC8-8BD998ABF9CF

Device         Start       End   Sectors   Size Type
/dev/sdb1       2048 625125375 625123328 298.1G Solaris /usr & Apple ZFS
/dev/sdb9  625125376 625141759     16384     8M Solaris reserved 1


Disk /dev/sda: 149.01 GiB, 160000000000 bytes, 312500000 sectors
Disk model: ST3160812AS     
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 200C55F7-350D-5C46-9A32-39A5EABCFD2C

Device         Start       End   Sectors  Size Type
/dev/sda1       2048 312481791 312479744  149G Solaris /usr & Apple ZFS
/dev/sda9  312481792 312498175     16384    8M Solaris reserved 1


Disk /dev/sdc: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: ST2000DL003-9VT1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 77070F1F-38F8-41D4-B39B-D921AD600CFA

Device       Start        End    Sectors  Size Type
/dev/sdc1       34       2047       2014 1007K BIOS boot
/dev/sdc2     2048    1050623    1048576  512M EFI System
/dev/sdc3  1050624 3907029134 3905978511  1.8T Linux LVM


Disk /dev/sdd: 149.01 GiB, 160000000000 bytes, 312500000 sectors
Disk model: ST3160812AS     
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: A835D014-F977-8345-A677-99B78387A342

Device         Start       End   Sectors  Size Type
/dev/sdd1       2048 312481791 312479744  149G Solaris /usr & Apple ZFS
/dev/sdd9  312481792 312498175     16384    8M Solaris reserved 1


Disk /dev/mapper/pve-swap: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/mapper/pve-root: 96 GiB, 103079215104 bytes, 201326592 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

This shows you a lot of information about the drives in your system. Each drive starts with a “Disk” line that indicates where the disk can be found, e.g. /dev/sda. It also tells you the size, model and a bunch of other information about the disk. As you can see I have four physical disks installed in this system. If a disk is plugged in correctly but missing from this list, then either it’s hidden in the BIOS or the disk is faulty.
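
If you just want a quick list of the physical disks without all the partition detail, lsblk gives a more compact view – a minimal example, assuming a reasonably recent lsblk (the available columns vary slightly between versions):

lsblk -d -o NAME,MODEL,SIZE,SERIAL

The -d flag suppresses the partition entries so you only see whole disks.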

Checking for SMART Errors

Modern hard drives have the ability to monitor their own health using a feature called SMART (Self-Monitoring, Analysis and Reporting Technology). If you keep an eye on the SMART information you’ll often get a warning that a disk is about to fail. In my experience the warning is usually quite short, maybe a few days, but it’s better than nothing. To read the SMART information from a drive you need the smartmontools package installed; if it’s not already installed you can install it like this:

sudo apt-get install smartmontools
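
SMART is normally enabled out of the box, but if smartctl reports it as available yet disabled you can switch it on – a quick sketch, with /dev/sda standing in for whichever device you’re checking:

sudo smartctl -i /dev/sda    # identity info, including whether SMART is available/enabled
sudo smartctl -s on /dev/sda # enable SMART on the drive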

Note that SMART was originally designed for spinning disks but SSDs also have SMART reporting. Some of the original SMART fields, such as spin-up time, don’t make sense for SSDs but will often still be reported. What is discussed here is aimed at HDDs but much of it is also relevant to SSDs.

Checking the health of a disk is a one-line command with a simple pass/fail output, as shown below. Note: it’s -H, not -h – the latter prints help for the smartctl command.

root@pm1:~# smartctl -H /dev/sda

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.74-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

As you can see this drive passes the SMART check. If you want more information you can replace -H with -a which prints all the information for a drive as shown here.

root@pm1:~# smartctl -a /dev/sda

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.74-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.9
Device Model:     ST3160812AS
Serial Number:    5LS9ADML
Firmware Version: 3.ADJ
User Capacity:    160,000,000,000 bytes [160 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7 (minor revision not indicated)
Local Time is:    Thu May  4 10:05:17 2023 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  54) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   116   088   006    Pre-fail  Always       -       112686697
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       468
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   079   060   030    Pre-fail  Always       -       99514528
  9 Power_On_Hours          0x0032   080   080   000    Old_age   Always       -       17808
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       476
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   063   045    Old_age   Always       -       30 (Min/Max 18/32)
194 Temperature_Celsius     0x0022   030   040   000    Old_age   Always       -       30 (0 14 0 0 0)
195 Hardware_ECC_Recovered  0x001a   053   046   000    Old_age   Always       -       129752706
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     17185         -
# 2  Extended offline    Completed without error       00%     14683         -
# 3  Extended offline    Completed without error       00%         1         -
# 4  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

This gives a lot of information but the section we’re really interested in is in the middle, where we have access to the raw values the drive is reporting. The first one to look at is ID#9 Power_On_Hours: this drive shows 17808 hours, which is just over two years of use. The drive is about 16 years old, so that tells us that most of the time it’s been sitting around not doing anything. Now look at ID#12 Power_Cycle_Count, which this drive reports as 476. That works out to a power cycle roughly every day and a half of powered-on time, which indicates it probably wasn’t in a server farm – drives in a server farm are power cycled quite rarely. To put these numbers into perspective, my current NAS has drives with power-on hours of around 76500 (8.7 years) and power cycle counts of 29.
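
If you don’t fancy doing the hours-to-years conversion in your head, a small awk one-liner over the attribute table will do it for you – this assumes the usual ten-column layout shown above, where the raw value is the last field:

smartctl -A /dev/sda | awk '$2 == "Power_On_Hours" { printf "%d hours = %.1f years\n", $10, $10 / 8760 }'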

Now it’s time to look at the factors that give an indication of a possibly failing drive.

  • ID#5 Reallocated Sectors Count
  • ID#187 Reported Uncorrectable Errors
  • ID#188 Command Timeout
  • ID#197 Current Pending Sector Count
  • ID#198 Uncorrectable Sector Count or Offline Uncorrectable

All of these, if present, should ideally have a raw value of zero. Notice that this drive doesn’t report ID#188; not all drives report all statistics. These five measures are generally considered to be the ones that point to a possible issue. According to Backblaze, 77% of drives that fail are reporting at least one of these values as non-zero at the time of failure (while 4% of working drives have one non-zero value). Multiple non-zero values are definitely cause for concern. In my (limited) experience drives can run for a very long time with one of these values being non-zero. One of my NAS drives has had a non-zero Current_Pending_Sector value for years, while all the other values it reports are zero. The key point is that the value isn’t changing – a value that keeps growing is much more of a concern.
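
Rather than scanning the whole attribute table by eye, you can pull out just these five counters – a rough one-liner using the attribute names smartmontools normally uses (drives that don’t report a given attribute simply won’t produce a matching line):

smartctl -A /dev/sda | grep -E 'Reallocated_Sector_Ct|Reported_Uncorrect|Command_Timeout|Current_Pending_Sector|Offline_Uncorrectable'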

SMART Self Testing

The smartctl application can not only read the SMART data but also trigger a number of self tests on the disk. To perform a quick test you’d use a command such as:

sudo smartctl -t short /dev/sdd

This will run a short test to look for obvious errors; a test like this should last no more than ten minutes – on modern drives it seems to take just a few seconds. I follow that up with a “conveyance” test, which is designed to look for any shipping damage, and then I run a long test if I really want to give the drive a workout. Just to give you an idea of how long a long test will take, on the 16TB drives I use it’s about 24 hours.
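
The conveyance and long tests are started the same way as the short test, and because the tests run inside the drive itself the commands return immediately – you check on progress and results afterwards with the self-test log:

sudo smartctl -t conveyance /dev/sdd  # check for shipping damage
sudo smartctl -t long /dev/sdd        # extended (long) self-test
sudo smartctl -l selftest /dev/sdd    # show progress and past results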

Now for the bad news…

If you are buying used drives the SMART records might not be reliable initially, as it is possible to reset the SMART values on at least some drives (Seagate drives seem to be particularly susceptible, from a quick search). The tools needed don’t seem to be particularly easy to get hold of and I doubt this is a widespread problem, but it’s worth keeping in mind if you come across a used disk that has raw values of zero across the board. See below for what you can do about this. The ongoing SMART data the drive collects once you have it should be reliable.

Checking for Bad Blocks

Before SMART was widely used the best way to check a hard drive was a scan for bad blocks. The problem with this method is that it’s slow, especially for large modern drives, and a write scan is destructive to the data on the drive. Considering how good SMART is now, a bad block scan is usually unnecessary, but there is one time you might consider it: if you have a used drive and suspect someone might have tampered with the SMART data, a bad block check will turn up any issues. You might not actually get a report of bad blocks, as the drive may silently remap them, but you will see something reported in the SMART statistics.

Personally, I don’t think running badblocks is worth it. On a decent sized modern drive, let’s say 16TB, you might expect a full destructive badblocks scan to take a week. Why so long? It completely fills the drive and then reads it back, four times over. All of this data has to go over the HBA and through the processor, and that takes time. Add in latency from seeks and so on and you can see why it takes so long. Additionally, if badblocks turns up a bad block the drive is finished, as that indicates SMART was unable to remap the block. By the time badblocks is finding issues, SMART should have been screaming at you for a while.

The badblocks utility can be run in a number of different ways but it basically boils down to whether you want to run a write test. If you really want to find bad blocks you need to write to the disk; the only real downside to performing a write test is time – it takes ages. For a typical destructive write test you need to be able to sacrifice all the data and structure on the disk. badblocks will then perform four passes, writing and reading different data patterns and checking they match. On the 160GB drive we’ve been using in this article, a single write and read of one pattern took about 90 minutes! Obviously for a large modern drive you’re probably talking several days.
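
If you can’t sacrifice the data on the disk there are gentler modes – a sketch, with /dev/sdd standing in for your device: a pure read-only scan, and a non-destructive read-write test that saves each block, writes test patterns, then restores the original data (even slower than the destructive test):

badblocks -sv /dev/sdd     # read-only scan, data untouched
badblocks -nsv /dev/sdd    # non-destructive read-write test

On drives larger than about 4TB you may also need to add -b 4096, otherwise badblocks can refuse to run because its default 1024-byte block size overflows its 32-bit block counter.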

The command I use for a destructive write test is:

badblocks -wsv /dev/disk/by-id/<device_id>

Note: generally you’d use the device node such as /dev/sda. I have shown the by-id path because I run a ZFS raidz1 array, where you should really use the ID. I actually offlined a disk for this article just to capture the output of badblocks.

Conclusion

If neither smartctl nor badblocks turns up any problems with the drive then it’s probably good to use. Backblaze has released plenty of analyses of drive failures and their conclusion is that if a drive doesn’t show any SMART warnings it’s probably good. Around 20% of drives just die out of the blue without giving any warning, but no amount of testing is going to find those.