Badblocks reports "weird value (4294967295) in do_write" when writing patterns

Question

This is the second time I have received this error from badblocks, roughly two years after the first. Almost everything has changed since then, from hardware (cables, etc.) to software (the operating system installation itself). The only relevant common factors are Cygwin and the badblocks program itself, so the issue most likely lies between those two.

When running badblocks in destructive mode (i.e. with the -w switch), I get the error:

Weird value (4294967295) in do_writerrors

...at each stage of writing the patterns to the drive.

As far as I can tell, I get this error only when the last block I specify is the value reported by fdisk -l (1953525168 sectors):

$ fdisk -l /dev/sda
Disk /dev/sda: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

$ badblocks -b 512 -vws /dev/sda 1953525168 1953525168
Checking for bad blocks in read-write mode
From block 1953525168 to 1953525168
Testing with pattern 0xaa: Weird value (4294967295) in do_writerrors)
done
Reading and comparing: 1953525168ne, 0:00 elapsed. (0/0/0 errors)
done
Testing with pattern 0x55: Weird value (4294967295) in do_writerrors)
done
Reading and comparing: done
Testing with pattern 0xff: Weird value (4294967295) in do_writerrors)
done
Reading and comparing: done
Testing with pattern 0x00: Weird value (4294967295) in do_writerrors)
done
Reading and comparing: done
Pass completed, 1 bad blocks found. (1/0/0 errors)

$ badblocks -b 512 -vws /dev/sda 1953525168 1950000000
Checking for bad blocks in read-write mode
From block 1950000000 to 1953525168
Testing with pattern 0xaa: Weird value (4294967295) in do_writerrors)
done
Reading and comparing: 1953525168ne, 0:49 elapsed. (0/0/0 errors)
done
Testing with pattern 0x55: Weird value (4294967295) in do_writerrors)
done
Reading and comparing: done
Testing with pattern 0xff: Weird value (4294967295) in do_writerrors)
done
Reading and comparing: done
Testing with pattern 0x00: Weird value (4294967295) in do_writerrors)
done
Reading and comparing: done
Pass completed, 1 bad blocks found. (1/0/0 errors)

As can be seen, this also produces a false positive: one bad block is reported, yet CrystalDiskInfo shows no trace of it.

At this point the drive has been zeroed multiple times and had badblocks write to its last few blocks tens of times, so there's been plenty of opportunity for the SMART values to have picked up a bad sector in block 1953525168 if one existed.

What do these errors actually mean, and what could be causing them?

Tom Yan · Accepted Answer · 2019-12-02T14:19:45.650

Although harrymc might have given you the core of my answer (that 4294967295 is -1 as unsigned int), he didn't further explain why badblocks doesn't simply "recognize" it as -1 (i.e. why the "weird value" error with a Cygwin build of it on Windows).

I took a look into the code of badblocks and Cygwin:

https://github.com/tytso/e2fsprogs/blob/v1.45.4/misc/badblocks.c#L463

https://github.com/cygwin/cygwin/tree/01c253a4c58b6c1da01615431bdc4c88fcba48ea/newlib/libc/syscalls/syswrite.c

https://github.com/cygwin/cygwin/tree/01c253a4c58b6c1da01615431bdc4c88fcba48ea/newlib/libc/reent/writer.c

And I came up with this:

[tom@archlinux ~]$ cat test.c 
#include <stdio.h>

unsigned int eh() {
  return -1;
}

int main() {
  long got;
  got = eh();
  printf("%ld\n", got);
  got = (long) eh();
  printf("%ld\n", got);
  got = (int) eh();
  printf("%ld\n", got);
}
[tom@archlinux ~]$ cc test.c 
[tom@archlinux ~]$ ./a.out 
4294967295
4294967295
-1
[tom@archlinux ~]$

Basically, if you want to interpret an unsigned value (one that may intentionally be used to store a signed value) as signed, you should interpret it at its own size, not at the size of the variable you are about to store it in.

I am not exactly familiar with programming, but as you can see, the (_ssize_t) type casting in reent/writer.c is probably wrong. If we assume _write() is of the int type (or any signed type), such type casting is redundant. If we assume _write() is of the unsigned int type, then the type casting it needs should be (int). (For the record, it is needed only because we are "expanding" its value to a _ssize_t (i.e. ret). Comparison like (an_unsigned_int == -1) could work just fine, AFAIK.)

Though I have to say this is merely a guess, as I don't really know the _write() that Cygwin uses (like, whether it has anything to do with this, and if so, whether the documentation is just crap). But I think it is a valid case for a bug report, which might help you find out more.

Update: This could be the commit that introduced the "regression". As you can see, _ssize_t would be based on __SIZE_TYPE__ (which is essentially size_t, according to the commit message). It would likely end up being unsigned long when Cygwin is 64-bit, based on this and this, so I'm betting you won't be able to reproduce the problem with 32-bit Cygwin (even on 64-bit Windows, that is). It might be worth mentioning that an even earlier commit probably once "fixed" it, which is why I call it a "regression".

Update 2: And yes, I'm right. Perhaps now I should get Visual Studio and check _write() (and maybe write()) for a bit...

P.S. You shouldn't bump into the "weird value" error if you are doing a read-only test on "last block + 1": when it tries to read at end of file (the end of the drive), _read() returns 0, unlike _write(), which returns -1 and sets errno to ENOSPC.

harrymc · Answer 2 · 2019-12-02T08:31:36.930

The decimal value 4294967295, in hex FFFFFFFF, is simply -1 depicted as an unsigned 32-bit integer. This is a common API error code and has no other meaning. The utility badblocks is very basic, written decades ago by Linus Torvalds, which only writes out data and reads it back.

Uncorrectable Sector Count denotes the number of bad sectors that the disk firmware has detected but has not been able to relocate to good sectors because these sectors could not be read. The firmware has given up on trying to relocate these sectors.

So there are 459 uncorrectable sectors that the firmware has detected but is not able to remap.

The disk is undoubtedly in a terminal phase.

If you wish to salvage the disk and don't care about its contents, you could try to deep format it, to rewrite and renew all the good sectors while marking as bad the sectors that the firmware cannot touch. A utility from the manufacturer is preferable here. Cygwin is best avoided, as its Linux utilities are not guaranteed to integrate well with Windows.

The DiamondMax Support page suggests the quite recent disk utility DiscWizard Version: 23.0.17160, which could perhaps be able to do the deep format. This is a Windows utility.

If the disk in question is the Windows system disk, you might need to execute the utility from a Windows PE boot disk or from such a rescue disk as Bob.Omb’s Modified Win10PEx64. You might also use a Bootable Windows PE-Based Recovery Disc such as Hiren’s BootCD PE. In a pinch you could try to format the disk from a Linux Live boot.

(Addition for the rewritten post)

The above answer was apparently accepted by the poster two years before it was written and the disk was replaced. This part is about the new disk.

The new disk is in perfect shape and with zero defects, yet badblocks gives one error message.

Badblocks is an ancient utility, written by Linus Torvalds, perhaps even before Linux existed. All it does is create a temporary file, write to it until end-of-space is encountered, then re-read the data. As a disk test it is abysmal, and only "tests" the free space on the disk.

In addition, it is being run on Cygwin and not even on Windows, so its understanding of Windows returned error codes is extremely doubtful. It cannot even report the real error code, instead always reporting a -1 as error code. There is no way to imagine what would be the result of Cygwin trying to translate a Windows API error code to what it imagines is the equivalent Linux error code.

Quite frankly, I would ignore this one spurious error as meaningless, probably just coming from the misunderstanding of the "no-more-space" return code, misunderstood by either badblocks or Cygwin. The data returned by the SMART firmware is much more to the point.

In the post Equivalent of badblocks on Windows or DOS several suggestions were offered, all of them much better than badblocks, as they test the entire disk and not only the free space.

One good alternative is chkdsk /r, which uses the Windows utility chkdsk to locate bad sectors and recover readable information, analyzing physical disk errors on the entire disk.
