32

We have a Linux system that uses an SSD (4 TB Samsung 860 Pro) which we power on for about 10 minutes every hour to write data to, then power off again, 24/7 for about six months. We manually turn on power to the drive and wait for the OS to see the drive and mount it. This usually takes between 12 and 22 seconds. We consider it a failure to mount if the drive hasn't shown up after 30 seconds of waiting. The first time we did this, everything worked fine. We did a second round with the same drives, but the drives stopped mounting within 30 seconds after about 1 to 3 months across the 5 systems we ran.

Basically, in the first round each drive would have been powered on and off at least 4,320 times. With the drives failing to mount consistently during the second test round, the failures seem to set in somewhere between 5,000 and 7,000 total power cycles. All the drives still work if you wait more than 30 seconds, but as far as our system is concerned they no longer mount reliably.

I can't seem to find any SSD specifications regarding power cycling or whether there's a limit to it. The 4 TB 860 Pro was very expensive when we bought it (>$1k) and is supposedly very reliable, with a very high Program/Erase (P/E) cycle rating. However, there are no specs on power cycling.

Is frequent power cycling a bad thing for an SSD? I know that most people probably don't do this, and a drive probably doesn't get power cycled more than once a day. We basically ran 12 years' worth of once-a-day power cycling in 6 months.


Edit 1 (additional info from comments): We are running on batteries so power usage is very limited.


Edit 2 (additional info from comments): The SSD is connected to a Raspberry Pi 2B v1.2 using a modified USB 3 to SATA cable. We have external power control to turn power to the cable on and off. Basically, the Pi turns on power to the SSD, monitors that the SSD shows up on a specific USB port, and then attempts to mount the drive. This is done via a bash script that runs a mounting loop with a 1-second delay until the SSD can be accessed. We give it up to 30 iterations (1-second delay after each failed mount attempt).
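For reference, a minimal sketch of what such a retry loop might look like; the device path /dev/sda1, the mount point /mnt/ssd, and the power_on_ssd helper are placeholders for illustration, not our actual script:

    #!/bin/bash
    # Sketch of the mount-retry loop described above (run as root).
    power_on_ssd                 # assumed helper that switches power to the USB/SATA cable

    for i in $(seq 1 30); do
        if mount /dev/sda1 /mnt/ssd 2>/dev/null; then
            echo "Mounted after $i attempt(s)"
            break
        fi
        sleep 1
    done

    if ! mountpoint -q /mnt/ssd; then
        echo "SSD did not mount within 30 seconds" >&2
    fi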


Edit 3 (additional info from comments): The unmounting procedure is to umount the drive and then turn off the power. We verify that the data is completely written before unmounting and powering off. The data is a compressed file, typically around 1.2 GB to 1.6 GB. It's normally just a single file per hour, and it takes about 10 minutes to compress the file from the raw data on an SD card and transfer it to the SSD, so the SSD is on for 10-12 minutes before being turned off.
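For clarity, the shutdown side of that procedure is roughly the following (paths and the power_off_ssd helper are again placeholders):

    # Sketch of the unmount-then-power-off sequence described above.
    sync                      # flush the OS write caches
    umount /mnt/ssd           # cleanly unmount the filesystem
    # Note: this only flushes the OS side; it does not tell the drive to
    # finish its own background work (see the answers below).
    power_off_ssd             # assumed helper that cuts power to the cable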

Edit 4: After checking more drives, I have found one that already has over 13,000 power cycles and it's still mounting the way we want. I'm waiting to get the failed drives back to see what the counts are on them. We know we have used them in at least 2 prior runs, so I'm expecting to see over 10k power cycles on each of them.

Edit 5: The file system on the SSD is ext4.

9 Answers

25

Rather than answer your question, I suggest you reevaluate how you control power to the drive(s). Have you factored in the added hardware cost and the parasitic power consumption of the circuitry needed to directly control the power?

SoCs save power by disabling the clock to a device rather than disabling power to it. Instead of denying it power, the device is put to sleep, and it responds by consuming (demanding) less power. So rather than turning off power to the drive, see if you can put the drive to sleep; see Device Sleep (DevSleep). Using the drive's low-power mode(s) eliminates any external power-switching hardware and transfers the responsibility of conserving power to the drive itself. Presumably such a drive can sustain repeated sleep-wake cycles.

The need to consume less power and provide extended battery life is a critical part of today’s mobile devices. To meet the ever more aggressive power/battery life requirements in this new environment, the SATA interface is evolving. DevSleep is a new addition to the SATA specification, which enables SATA-based storage solutions to reach a new level of low power operation.

The DevSleep specification does not state what power levels a device will reach while in the DevSleep state, but SSDs are targeting 5mW or less.
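If you want to experiment with the drive's built-in low-power states from Linux, hdparm can issue the classic ATA standby/sleep commands. To be clear, these are not DevSleep itself (which is negotiated by the SATA host controller), and a USB-to-SATA bridge may not pass them through, but they are a cheap thing to try; /dev/sda is a placeholder:

    sudo hdparm -C /dev/sda   # query the current power state
    sudo hdparm -y /dev/sda   # STANDBY IMMEDIATE (low power, wakes on next access)
    sudo hdparm -Y /dev/sda   # SLEEP (lowest ATA state; needs a reset to wake)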

sawdust
  • 18,591
15

Yes, power cycles are a wear factor for SSDs, and are tracked as "Power Cycle Count" in the drive's internal S.M.A.R.T. monitoring. Only the manufacturer can say how much is too much, but enterprise-level drives are designed to be powered on 24/7, at a consistent temperature, and with a clean power supply. The farther outside those bounds you go, the less reliable your drives may be.
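You can read that counter (and the other wear indicators) with smartctl; /dev/sda is a placeholder, and a drive behind a USB bridge may need the -d sat option:

    sudo smartctl -A /dev/sda                                   # add "-d sat" if the USB bridge needs it
    sudo smartctl -A /dev/sda | grep -iE 'power_cycle|power_on|wear'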

That said, longer mount times are not really a common symptom of SSD wear, unless matched with read/write errors. If the SSD is working normally once it's mounted, then it's much more likely that something at the OS level is causing the mount operation to take longer - though the causes can vary based on OS, firmware, filesystem, etc.

Cpt.Whale
  • 10,914
14

No, there's no good reason for your SSD to be wearing out from just 7,000 power cycles.

But, if it takes 12-22 seconds to mount when empty, it could easily take twice as long to mount when full (it's hard to say what the drive needs to do to report itself as ready, but that activity could easily scale with the file count, for example). You haven't mentioned how you're filling your drive up over time, but you could try saving the mount time vs boot count for each drive. I am guessing you will then see a gradual increase of mount time with each boot, and further details should give clues to help explain this better.
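A minimal way to collect that data would be to time each mount attempt and append it to a log; device, mount point and log path below are placeholders:

    # Sketch: record how long each mount takes, one CSV line per power cycle.
    start=$(date +%s)
    until mount /dev/sda1 /mnt/ssd 2>/dev/null; do
        sleep 1
        [ $(( $(date +%s) - start )) -ge 30 ] && break   # give up after 30 s
    done
    elapsed=$(( $(date +%s) - start ))
    echo "$(date -Is),${elapsed}" >> /var/log/ssd-mount-times.csv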

bobuhito
  • 653
10

Turning on an electrical device amounts to creating a power surge as power goes from zero to 100 percent. Power-on is the most dangerous operation for electronic equipment, which is why hardware problems are often detected when turning on the computer.

So yes, there is a negative impact, but for a good-quality SSD it would take a very large number of power cycles to see an effect.

SSDs are protected from power outages by either hardware or firmware PLP (Power Loss Protection). PLP within SSDs has improved over the years, so the newer the drive, the more likely it is to be protected by the latest PLP technology. The Samsung 860 Pro seems to have come out in 2018, so it is not the latest technology.

I don't believe that any SSD company publishes a rating for the maximum number of power cycles, although all manufacturers test their SSDs to ensure a certain resiliency.

For example, I found that ATP SSDs undergo a testing scheme that is described in the article Using Four-Corner, Temperature Cycling, and Power Cycling Tests to Verify SSD Resistance to Extreme Operating Conditions, wherein a disk passes if it can withstand 4,000 such cycles. Divided by 365 days, this would mean a lifetime of more than 10 years for a typical consumer computer that is turned on once a day.

Your disk undergoes many more power cycles than the 4,000 that ATP uses as its pass/fail threshold, so you're basically in uncharted territory.

harrymc
  • 498,455
9

First, it's important to recognise the three different layers at which "damage" can be happening here:

  1. Hardware: some physical component getting damaged. This makes sense for a spinning disk, and it's why power cycle count is a S.M.A.R.T. metric, but this isn't a spinning disk. We can't say anything for certain about whether or not power cycles are bad for the hardware of the SSD, but based on my experience working with electronics, I would call it extremely unlikely. An SSD is made of solid state components and they (mostly) don't care how many times you power cycle them. Resistors don't care. The impact on transistors and capacitors is negligible. Inductors can create voltage spikes when you cut power to them suddenly, but any good design will account for this.
  2. Firmware-level device state: stuff like bad sector relocations. SSDs have become complex. The firmware is performing all kinds of tricks behind your back, and SSD firmware is notoriously buggy. It is possible, for example, that your SSD is somehow marking sectors as bad if it happens to be in the middle of a write when you cut power. Lots of SSDs also have tiered storage, where writes are persisted to a small buffer that is invisible to the OS. This allows the SSD to re-order writes and to report writes as "durably stored" faster. Maybe something in that system is getting confused by all the power cycles. If that is what's happening, you might be able to fix it with an "ATA Secure Erase" or "NVMe Secure Erase" (this does delete everything on the drive). That said, I think this is unlikely to be the problem.
  3. Software-level device state: AKA the filesystem. Mounting a filesystem on an SSD should take ~1 second, not 12-22 seconds. This suggests that the filesystem might not be getting cleanly unmounted. Many filesystems have to take some sort of recovery action when mounting a device that was not cleanly unmounted. This often involves "walking" the filesystem to make sure everything is valid, which gets slower as there is more data on the filesystem. Other filesystems keep a "journal" of what they were doing, so they only have to check the parts of the filesystem they were working on if something doesn't get cleanly unmounted. Still other filesystems have (basically) no safeguards in place and mount very quickly, even when corrupted.

I think your issue exists at #3. There are several ways to test this:

  • If you re-format the drive, is it fast again? If so, you have a #3 problem.
  • If you byte-for-byte copy the partition from a "bad" drive to a new one (without mounting it and allowing the OS to clean anything up) is it still slow? If so, you have a #3 problem.
  • If you monitor disk I/O while you are mounting the drive, do you see a lot of activity? If so, you probably have a #3 problem. (A rough sketch of the last two checks follows this list.)
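A rough sketch of those checks, with device names as placeholders (the dd command overwrites the target drive, so double-check the names):

    # Byte-for-byte copy of the suspect partition to a spare drive, without
    # mounting it first (destroys whatever is on /dev/sdY1!):
    sudo dd if=/dev/sdX1 of=/dev/sdY1 bs=4M status=progress conv=fsync

    # In another terminal, watch disk activity while the slow mount is running
    # (iostat is in the sysstat package):
    iostat -x 1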
Toby Speight
  • 5,213
9072997
  • 641
6

With regard to the number of acceptable power cycles: I cannot find data on this.

But I doubt it matters. I'm inclined to believe that any sudden power interruption potentially harms the device at some level.

SSDs are hardly ever doing nothing

You being done writing does not mean the SSD is done writing: as others have already suggested, SSDs tend to perform all kinds of background tasks (garbage collection, wear leveling, scrubbing) in "idle time". Pulling the plug may therefore leave the FTL in an inconsistent state.

Pulling the plug does do harm at some level

So far it seems you haven't answered the question of how you disable power to the SSD or how you 'shut it down'. If you just "pull the plug" or "flip the switch", you could indeed be damaging the SSD at some level. These are claims that can be backed up by research:

This paper examines one aspect of that data integrity by measuring the types of errors that occur when power fails during a flash memory operation. Our findings demonstrate that power failure can lead to several non-intuitive behaviors.

Apart from damage on the FTL level, file systems are not invulnerable to power interruptions either. I suppose every PC user knows this from personal experience.

The drive not mounting in x seconds does not mean it has failed

Just as the OS attempts to recover from an unclean shutdown, or at least checks the 'dirty' file system, we may assume an SSD's firmware does something similar. These checks take time. Some manufacturers, for example, suggest giving the SSD 5 minutes or so to perform them.

Whether the drive is visible or not, let it sit in this state for a minimum of five minutes to allow the SSD to rebuild its mapping tables, then reboot the system and see if the drive is restored.

In the data recovery industry it's a known fact that a 'bricked' SSD may self-recover if you let it sit for a while with power connected and the data lines disconnected. I know of extreme cases where an SSD came back to life after being connected to power for 24 hours. But there are also cases where the firmware has failed to the degree that the controller cannot even access the NAND. At some point the controller has to read the firmware from the NAND itself, and if that is too corrupt, the drive typically comes back to life but with reduced capacity.

No information about actual failure mode

Your drive not mounting in x seconds does not by definition mean the SSD has terminally failed. It also tells us very little about the failure mode: is it a file system issue, a firmware issue, or a hardware issue?

Back to SD Cards?

It is kind of 'funny' that the SD cards you were using previously handle sudden power loss better than the (in many ways) more sophisticated SSDs. If you require a system where you can just flip the switch, your options may be to switch back to SD cards or to move to more expensive SSDs with physical power-loss protection in the form of an array of 'supercapacitors'.

Silent data corruption is what you perhaps should be worried about

In the end, every sudden power loss is bad and can potentially cripple the SSD without any actual hardware component failing; and even if the unit doesn't fail, it may corrupt your data, which, if that goes unnoticed, may be a far more serious issue.

Bit corruption hit 3 devices; 3 had shorn writes; 8 had serializability errors; one device lost 1/3 of its data; and 1 SSD bricked. The low-end hard drive had some unserializable writes, while the high-end drive had no power fault failures (tested: 15 drives)


EDIT because of edits to question.

"We are running on batteries so power usage is very limited."

I think it's worth investigating whether this is the source of the problem, so test the same setup but with wall power. EDIT: This was investigated and is not the issue.

"The unmounting procedure is to do umount of the drive and then turn off the power. We verified that the data is completely written before unmounting and powering off."

I am not convinced this is the proper way, as unmounting does not tell the SSD to stop its background processing; it may still be writing, and such a sudden power loss may corrupt the FTL. But I'm not a Pi or Linux person. For inspiration, see this answer.

"I have found one that already has over 13,000 power cycles and it's still mounting the way we want"

That isn't very useful information: one drive may fail after n power cycles, another after m power cycles, and the next one the first time. Yet another may fail for entirely different reasons. And then we have brands, models, firmware revisions and whatnot to account for.


EDIT in reaction to comment: "Sounds like this could be the answer to the unsafe power off: echo 1 | sudo dd of=/sys/block/sdX/device/delete"

Based on my experience with SSDs in different contexts, I am inclined to believe that this is what you should be exploring: graceful power-down of the SSD.

Other than sending direct ATA commands, some tool may exist that can do this for you; this was the purpose of my 'inspirational link'. A graceful unmount isn't enough: it needs to be a command that tells the drive to power down and stop its internal housekeeping activities.

An extra hurdle can be the USB-to-SATA conversion: sending the proper commands does not necessarily mean the USB bridge will pass them on to the SATA drive. Again from experience, it seems to me the best chance of the USB-to-SATA adapter passing on the command is if it is based on an ASMedia controller (ASM1153, ASM1051).
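As a hedged sketch (not a tested procedure), a more graceful power-down sequence on the Pi might look like this; /dev/sda, /mnt/ssd and the power_off_ssd helper are placeholders, and whether each command actually reaches the drive depends on the bridge:

    sync                                      # flush the OS write caches
    sudo umount /mnt/ssd                      # cleanly unmount the filesystem
    # Ask the kernel to flush and stop the device, then detach it
    # (equivalent to the command quoted in the comment above):
    echo 1 | sudo tee /sys/block/sda/device/delete
    # Alternative: let udisks spin the drive down and power off the USB port:
    #   udisksctl power-off -b /dev/sda
    sleep 2                                   # give the drive a moment to settle
    power_off_ssd                             # assumed helper that finally cuts power to the cable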

2

TL;DR

Read about Thermal Cycling Failure in Electronics and check out this cool image of thermal fatigue in solder.



We have a Linux system that uses an SSD (4 TB Samsung 860 Pro) which we power on for about 10 minutes every hour to write data to, then power off again, 24/7 for about six months.

You're only power cycling the SSD, right? Not the entire system?

We manually turn on power to the drive and wait for the OS to see the drive and mount it. This usually takes between 12 and 22 seconds. We consider it a failure to mount if the drive hasn't shown up after 30 seconds of waiting.

Manually?? Do you value your own time that little?

MonkeyZeus
  • 9,841
1

This is entirely expected. Modern SSDs do wear levelling, which means they move logical blocks around physically. This is usually done as a low-priority background task in firmware, when the OS isn't writing. Because of this wear levelling, SSDs need to store a logical-to-physical block mapping. That is stored in flash as well.

Flash also requires 250 ms of stable power when writing a cell. This is hidden by the firmware, and in a sequence of writes this means you only need to have power on for 250 ms after the last physical write - but this does include the block mapping.

Since you turn off the device without warning, you risk corrupting the block mapping. Depending on the firmware, the SSD may be able to recover part or all of that mapping. But each time you turn off the SSD while it's doing wear levelling, you risk a total disk failure.

A factory reset may allow the firmware to discard the entire block mapping and generate a new one. If this is the case, all you lose is a bit of capacity from the flash blocks that were destroyed by the power-offs.
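If you want to try such a factory reset, the usual route is an ATA Secure Erase via hdparm. This is a hedged sketch only: it destroys all data, it frequently does not work through USB-to-SATA bridges, and the drive must not be in the "frozen" security state; /dev/sdX and the temporary password are placeholders:

    sudo hdparm -I /dev/sdX | grep -A8 Security        # check "frozen" state and erase support
    sudo hdparm --user-master u --security-set-pass tmp /dev/sdX
    sudo hdparm --user-master u --security-erase tmp /dev/sdX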

MSalters
  • 8,283
0

Some experience relating to the question:

  • The particular pattern of power cycling may or may not be bad, depending on the design of the power bus. It is usually not bad.

  • If something fails because of the power cycling itself, it does not fail gradually or gracefully. It fails, period.

  • SSDs have a lot of housekeeping work that they do when left powered on and idle. This includes, but is not limited to, erasing the blocks whose data is no longer valid (i.e. overwritten or trimmed) and moving recently written data from the buffering SLC to permanent-storage MLC blocks. There may be other background tasks as well. Failing that, SSDs do show reduced performance.

  • (may be related to your mount times) We have observed SSDs from different reputable brands reducing their performance by 3-5 orders of magnitude for both reads and writes, following prolonged use. We were not able to determine the particular usage pattern that leads to this loss of performance, but it is not big sequential writes for sure. In regard to reading, the disk develops "slow spots" at particular LBA ranges and it gets painful to salvage the data from it. On the other hand, no data has been lost so far. On the third hand, the disk at least temporarily recovers its performance after being "security erased enhanced" and then left alone (powered on) for the time advertised for the "security erase enhanced" command.

fraxinus
  • 1,262