
I have a 320 GB HDD with bad sectors; the disk is failing with read errors, as reported by smartctl. To save as much data as possible, I wanted to image the disk with dd/ddrescue.

ddrescue had an abysmally slow read speed with default settings (300 kB/s) from the get-go, and I had no time to experiment with its options, so I went with dd. Let's skip the topic of improving ddrescue speed for now.

I used this dd command, copying to an SSD with more than 1 TB of free space:

dd if=/dev/sda of=recovery.img conv=noerror,sync iflag=fullblock status=progress

The problem is that the longer dd runs, the slower it becomes. It started at 10 MB/s, quickly fell to 5 MB/s, and keeps slowing down. It is understandable that reading a bad block gives a slower speed and a read error, but the speed never recovers and never goes up, even when there are no errors for many gigabytes. Example output:

36571460608 bytes (37 GB, 34 GiB) copied, 7521 s, 4.9 MB/s 
dd: error reading '/dev/sda': Input/output error
71428640+0 records in
71428640+0 records out
36571463680 bytes (37 GB, 34 GiB) copied, 7522.87 s, 4.9 MB/s

[a lot of read errors here]

163873310720 bytes (164 GB, 153 GiB) copied, 55200 s, 3.0 MB/s
dd: error reading '/dev/sda': Input/output error
320065087+1 records in
320065088+0 records out
163873325056 bytes (164 GB, 153 GiB) copied, 55202.2 s, 3.0 MB/s

[a lot of read errors here]

180528095744 bytes (181 GB, 168 GiB) copied, 105746 s, 1.7 MB/s
dd: error reading '/dev/sda': Input/output error
352593785+152 records in
352593937+0 records out
180528095744 bytes (181 GB, 168 GiB) copied, 105748 s, 1.7 MB/s
184232141312 bytes (184 GB, 172 GiB) copied, 115892 s, 1.6 MB/s

184232509952 bytes (184 GB, 172 GiB) copied, 115893 s, 1.6 MB/s
192463561216 bytes (192 GB, 179 GiB) copied, 138368 s, 1.4 MB/s
211374223872 bytes (211 GB, 197 GiB) copied, 190337 s, 1.1 MB/s
dd: error reading '/dev/sda': Input/output error
412840143+153 records in
412840296+0 records out
211374231552 bytes (211 GB, 197 GiB) copied, 190342 s, 1.1 MB/s
211374232064 bytes (211 GB, 197 GiB) copied, 190342 s, 1.1 MB/s
dd: error reading '/dev/sda': Input/output error
412840143+154 records in
412840297+0 records out
211374232064 bytes (211 GB, 197 GiB) copied, 190344 s, 1.1 MB/s

In the above example, between 181 GB and 211 GB there were no read errors, so many of those sectors should be fine, yet the speed never climbed back toward the initial ~10 MB/s; it kept dropping. There were also no read errors for the first 37 GB (hence the missing output), but there the slowdown is understandable, with the cache running out and the disk failing.

hdparm shows the drive is using optimal settings. iostat reports the disk as 100% utilized:

r/s     rkB/s   rrqm/s  %rrqm   r_await rareq-sz Device
89.50    358.0k     0.00   0.0%   11.12     4.0k sda
w/s     wkB/s   wrqm/s  %wrqm   w_await wareq-sz Device
0.00      0.0k     0.00   0.0%    0.00     0.0k sda
d/s     dkB/s   drqm/s  %drqm d_await dareq-sz Device
0.00      0.0k     0.00   0.0%    0.00     0.0k sda
f/s f_await  aqu-sz  %util Device
0.00    0.00    0.99  99.5% sda

My question is: why does this happen? How can the drive's read speed keep getting slower and slower over time, even without read errors from bad blocks? Why does the speed never go back up?

Second part of the question: is it possible for ddrescue to achieve a better speed than dd on the good sectors?

2 Answers


Why does a failing HDD's speed keep going down during a dd operation?

The "speed" number that you seem to be looking at does not represent what you think it does. The numbers that you seem to be scrutinizing are an overall throughput value; it's the quotient of the total bytes transferred divided by the total elapsed time. These values are cumulative, and not instantaneous.

For example:

36571463680 bytes (37 GB, 34 GiB) copied, 7522.87 s, 4.9 MB/s

36571 MB / 7523 sec = 4.86 MB/sec

And

211374232064 bytes (211 GB, 197 GiB) copied, 190344 s, 1.1 MB/s  

211374 MB / 190344 sec = 1.11 MB/sec


These calculations use cumulative values to report an average. A read error causes retries of the read operation; that adds to the time component and reduces the throughput. Referring to these throughput numbers as "speed" can therefore be deceiving: they do not indicate the actual rate of data transfer over any interface at any given moment.
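
For contrast, the instantaneous rate between two adjacent reports can be roughly estimated from their deltas. Taking the report at 190342 s (211374231552 bytes) and the final one at 190344 s (211374232064 bytes) from the output quoted in the question:

(211374232064 − 211374231552) bytes / (190344 − 190342) s = 512 bytes / 2 s ≈ 256 bytes/s

So the drive was effectively stalled at that point, even though the cumulative figure still reads 1.1 MB/s.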


How can the drive's read speed keep getting slower and slower over time ...

You're not analyzing these results properly; the output is not periodic at all. The first 164 GB of the transfer produces only 2 reports, but the next 48 GB produces 11 reports, and the last 10 GB (of that 48 GB) generates 5 of those 11. The average throughput goes down because read retries and/or bad blocks are occurring more and more often!


Addendum

The problem I have is that even when there are ~30 GB without errors, the speed does not increase (and we are already in the kB/s range). I would expect the average speed to increase over seemingly good sectors, but that does not seem to be the case.

Perhaps what you think are "good sectors" are not actually good. The mere absence of "errors" is not a good or reliable indicator of data integrity or drive health. "Errors" in storage devices (which employ ECC) are not necessarily black and white; there are correctable errors as well as uncorrectable errors.

A read request is typically deemed in error (i.e. failed) only after the drive has performed N retries and the data is still uncorrectable. It is these repeated attempts to read the sector that add operational time (at least one revolution of the platter per retry) and reduce the throughput, which is why the concept of "speed" can be inappropriate here.
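
As a rough illustration (assuming a 5400 RPM drive; the retry count is purely illustrative):

60 s / 5400 rev ≈ 11.1 ms per revolution
20 retries × 11.1 ms ≈ 0.22 s to return one 512-byte sector ≈ 2.3 kB/s

Even a few such sectors per megabyte are enough to drag the throughput down by an order of magnitude.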

When one of these read retries succeeds, or (more likely) returns errors that are correctable using the ECC, the read operation is deemed complete and successful. The host is given status indicating whether the sector data had to be corrected, but I'm not aware of the retry count being reported. Regardless, dd does not report any error for a read request that returns valid sector data, no matter how long it took to obtain that data.


So a deflated throughput number can be held low (or further reduced) by "slow" reads that required retries and did not generate any error messages.


You could obtain a better picture of your drive's health and throughput capabilities with a (much) smaller transfer. Instead of dding the entire drive, read just a small, select range of blocks:

dd if=/dev/sda of=/dev/null skip=<start> count=<len>

where

  • <start> is the LBA ("sector" number) from which the transfer begins
  • <len> is the number of blocks/sectors to transfer, e.g. 16 or even 1
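
For example, a hypothetical spot check of 16 sectors starting at LBA 400000000 (the LBA is just a placeholder; iflag=direct asks GNU dd to bypass the page cache so the timing reflects the drive itself):

dd if=/dev/sda of=/dev/null skip=400000000 count=16 iflag=direct

The summary dd prints at the end gives the bytes copied and the elapsed time for just that range, which is a far more meaningful "speed" for that spot on the disk.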


Also consult your drive's SMART attributes for the corrected and uncorrected read-error statistics.
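
A minimal sketch with smartmontools (attribute names vary somewhat by vendor):

smartctl -A /dev/sda
smartctl -l error /dev/sda

The first command prints the attribute table (look at Raw_Read_Error_Rate, Reallocated_Sector_Ct, Current_Pending_Sector and Offline_Uncorrectable); the second prints the drive's log of uncorrectable errors.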

sawdust

Three factors:

  • As others have pointed out, failure is a compounding process: the more you exercise the drive trying to save it, the more likely it is to fail outright. Your best bet is either commercial data recovery (if critical data is on the drive) or a single pass to get the data off onto a good drive.
  • It could be thermal: on a failing drive the internals can run hot and start to seize up.
  • Since physical drives spin at a constant rate, their transfer rate is higher at the outer edge of the platter than near the center, so throughput naturally drops as a sequential copy moves toward the inner tracks (see the sketch after this list).
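
A rough way to see that zoning effect (a sketch; the offsets are arbitrary, and iflag=direct bypasses the page cache):

dd if=/dev/sda of=/dev/null bs=1M count=100 iflag=direct
dd if=/dev/sda of=/dev/null bs=1M count=100 skip=300000 iflag=direct

The first command reads ~100 MiB from the start of the LBA space (outer tracks); the second reads ~100 MiB near the end of a 320 GB disk (inner tracks). On a healthy drive the first read is normally noticeably faster.
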
DavidT