Transforming a QLC SSD into an SLC SSD

2 years ago (theoverclockingpage.com)

You don't need to go through all that trouble to use most cheap DRAMless SSDs in pSLC mode. You can simply under-provision them by using only 25-33% of their capacity.

Most low-end DRAMless controllers run in full-disk caching mode. In other words, they first write *everything* in pSLC mode until all cells are written; only when there are no cells left do they go back and rewrite/group some cells as TLC/QLC to free up some space. And they do it only when necessary, they don't go and do that in the background to free up more space.

So, if you simply create a partition 1/3 (for TLC) or 1/4 (for QLC) the size of the disk, and make sure the remaining empty space is TRIMmed and never used, the drive will always be writing in pSLC mode.
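
As a dry-run sketch (the 480GB capacity and /dev/sdX are placeholders; nothing below touches a disk), the sizing arithmetic and the TRIM command look roughly like this:

```shell
# Dry-run sketch: size a pSLC partition at 1/4 of a hypothetical 480 GB QLC
# drive and print the command that would TRIM the untouched remainder.
# /dev/sdX is a placeholder; nothing here writes to a device.
DISK_BYTES=$((480 * 1000 * 1000 * 1000))
PART_BYTES=$((DISK_BYTES / 4))            # 1/4 for QLC (use 1/3 for TLC)
REST_BYTES=$((DISK_BYTES - PART_BYTES))
echo "partition the first $PART_BYTES bytes, leave the rest unallocated"
echo "blkdiscard -o $PART_BYTES -l $REST_BYTES /dev/sdX"
```

Drop the echo (and triple-check the device name) to actually issue the discard.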

You can verify that the SSD you're interested in runs in this mode by searching for "HD Tune" full-drive write benchmark results for it. If the write speed is fast for the first 1/3-1/4 of the drive, then dips to abysmal speeds for the rest, you can be sure the drive is using the full-drive caching mode. As I said, most of these low-end DRAMless Silicon Motion/Phison/Maxio controllers do, but of course the manufacturer might've modified the firmware to use a smaller cache (like Crucial did for the test subject BX500).

  • How can I verify that things stay this way?

    Partitioning off a small section of the drive feels very 160 GB SCSI "Let's only use the outer sectors".

    • Even keeping the drive always 75% empty would be enough, but partitioning off is the easiest way to make sure it never exceeds 25-33% full (assuming the drive behaves like that in the first place).

      To verify that the drive uses all of itself as a cache, you can run a full-drive sequential write test (like the one in HD Tune Pro) and analyze the speed graph. If, say, a 480GB drive writes at full speed for the first 120GB and then the write speed drops for the remaining 360GB, the drive is suitable for this kind of use.

      I think controllers might be doing some GC in the background to always keep some cells ready for pSLC use, but it should be a few GBs at most and shouldn't affect the use case depicted here.
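
      A toy version of that graph analysis, with an invented speed trace, just to show the shape of the check:

```shell
# Toy check for full-drive SLC caching: scan a sequential-write speed trace
# (MB/s per sample, numbers invented) for a sharp cliff, here defined as a
# sample dropping below a third of the previous one.
trace="505 500 495 500 498 80 75 78 80 77"
printf '%s\n' $trace |
  awk 'NR>1 && $1 < prev/3 {print "cliff at sample " NR; exit} {prev=$1}'
```

On a real drive you'd feed in the benchmark's exported data points instead of the made-up list.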

    • > Partitioning off a small section of the drive feels very 160 GB SCSI "Let's only use the outer sectors".

      In that it was very reliable at accomplishing the goal?

      1 reply →

  • That is what an ideal FTL would do if only a fraction of the LBAs are accessed, but as you say some manufacturers will customise the firmware to do otherwise, while this mod basically guarantees that the whole space is used as SLC.

  • Scroll down to the section called “SLC CACHING” towards the end of TFA. Your approach will work for only 45GB even though the actual SLC cache is 120GB in size, because the process kicks in before the SLC is fully consumed (so it can page in and out of it).

    If you don’t need 66% of the drive’s SLC capacity, the small partition approach is indeed easier and safer.

    • Oh, I already did. You'll see that if you scroll up and read the last sentence of my comment above.

  • How do you make sure the empty space is trimmed? Can you trim a portion of a disk?

    • The literal answer is yes, an ATA TRIM, SCSI UNMAP, or NVMe Deallocate command can cover whatever range on a device you feel like issuing it for. (The device, in turn, can clear all, none, or part of it.) On Linux, blkdiscard accepts the -o, --offset and -l, --length options (in bytes) that map more or less exactly to that. Making a(n unformatted) partition for the empty space and then trimming it is a valid workaround as well.

      But you’re most probably doing this on a device with nothing valuable on it, so you should be able to just trim the whole thing and then allocate and format whatever piece of it that you are planning to use.

    • AFAIK Windows runs TRIM when you format a partition. So you can create a dummy partition and format it. Then you can either delete or simply not use it.

      On Linux, blkdiscard can be used in the same manner (create a dummy partition and run blkdiscard on it, e.g. *blkdiscard /dev/sda2*).

      7 replies →

  • :mind-blown:

    i knew about "preconditioning" for SSDs when it comes to benchmarking, etc. didn't realize this was the why.

    thanks!

  • ssd firmwares are a mistake. they saw how easy it is to sell crap, with non ecc (i.e. bogus ram) being sold as the default and ran (pun intended) with it.

    so if under provisioned now they work as pSLC, giving you more data resilience in short term but wasting more write cycles because they're technically writing 1111111 instead of 1. every time. if you fill them up then they have less data resilience.

    and the best part, there's no way you can control any of it based on your needs.

    • > giving you more data resilience in short term but wasting more write cycles because they're technically writing 1111111 instead of 1. every time.

      No, that's not how it works. SLC caches are used primarily for performance reasons, and they're faster precisely because they aren't doing the equivalent of writing four ones (and especially not seven!?) to a QLC cell.

      1 reply →

This hack seems to take a 480GB SSD and transform it into a 120GB SSD.

However the write endurance (the amount of data you can write to the SSD before expecting failures) increases from 120TB to 4000TB which could be a very useful tradeoff, for example if you were using the disk to store logs.

I've never seen this offered by the manufacturers though (maybe I haven't looked in the right place), I wonder why not?

  • There are companies selling SLC SSDs (often TLC or QLC flash run in pSLC mode rather than native SLC) for industrial applications, for example Swissbit.

    • But they cost far more than what SLC should be expected to cost (4x the price of QLC or 3x the price of TLC.) The clear answer to the parent's question is planned obsolescence.

      1 reply →

  • I don't understand how the author goes from a 3.8 WAF (Write Amplification Factor) to a 2.0 WAF and gets a 30x increase in endurance. I'd expect about 2x from that.

    From what I can see, he seems to be taking the 120TBW that the OEM warranties the drive for as the initial result, but then using the NAND's P/E cycle spec for the final result, which seems suspicious.

    The only thing that I could be missing is that the NAND going to pSLC mode somehow increases the P/E cycles drastically, like requiring a massively lower voltage to program the cells. But I think that would be included in the WAF measure.

    What am I missing?

    • QLC memory cells need to store and read back the voltage much more precisely than SLC memory cells. You get far more P/E cycles out of SLC because answering "is this a zero or a one?" remains fairly easy long after the cells are too worn to reliably distinguish between sixteen different voltage levels.
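
      For a rough sanity check of the numbers in the thread, with assumed P/E cycle counts (QLC ~1K, pSLC ~60K; these are illustrative, not the article's measurements):

```shell
# Back-of-the-envelope TBW: capacity_GB * P/E_cycles / WAF / 1000.
# The P/E figures are assumptions; WAFs (3.8 -> 2.0) come from the thread.
awk 'BEGIN {
  printf "QLC : %.0f TBW\n", 480 * 1000  / 3.8 / 1000   # near the 120TBW rating
  printf "pSLC: %.0f TBW\n", 120 * 60000 / 2.0 / 1000   # near the claimed 4000TB
}'
```

So the 30x comes almost entirely from the jump in P/E cycles, not from the WAF change.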

      7 replies →

  • I wonder if it would be useful as a cache disk for ZFS or Synology (with further tinkering)?

    • To dive slightly into that: You don't necessarily want to sacrifice space for a read cache disk: having more space can reduce writes as you do less replacement.

      But where you want endurance is for a ZIL SLOG (the write cache, effectively). Optane was great for this because of really high endurance and very low latency persistent writes, but, ... Farewell, dear optane, we barely knew you.

      The 400GB optane card had an endurance of 73 PB written. Pretty impressive, though at almost $3/GB it was really expensive.

      This would likely work but as a sibling commenter noted, you're probably better off with a purpose-built, high endurance drive. Since it's a write cache, just replace it a little early.

      1 reply →

    • Under-provisioning has been the standard recommendation for ZFS SSD cache/log/L2ARC drives ever since those special types were a thing.

  • Manufacturers offer that, in the form of TLC drives. Which are supported, unlike this hack which might cause data loss.

    This gives you 120GB with 4000TB write endurance, but you can buy a 4TB TLC drive with 3000TB write endurance for $200.
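
    In full-drive-write terms, using the figures above:

```shell
# Full-drive writes = rated TBW / drive capacity, using the numbers quoted
# in this comment (120GB @ 4000TBW pSLC vs 4TB @ 3000TBW TLC).
awk 'BEGIN {
  printf "pSLC 120GB @ 4000TBW: %.0f drive writes\n", 4000e12 / 120e9
  printf "TLC  4TB  @ 3000TBW:  %.0f drive writes\n", 3000e12 / 4000e9
}'
```

The pSLC drive survives vastly more rewrites of its (much smaller) capacity, which is exactly the log-storage trade-off.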

  • data longevity depends on the implementation in the firmware, into which you have zero visibility. most consumer drives will lower longevity.

What isn't prominently mentioned in the article is that endurance and retention are highly related --- flash cells wear out by becoming leakier with each cycle, and so the more cycles one goes through, the faster it'll lose its charge. The fact that SLC only requires distinguishing between two states instead of 16 for QLC means that the drive will also hold data for (much) longer in SLC mode for the same number of cycles.

In other words, this mod doesn't only mean you get extreme endurance, but also retention. This is usually specified by manufacturers as N years after M cycles; early SLC was rated for 10 years after 100K cycles, but this QLC might be 1 year after 900 cycles, or 1 year after 60K cycles in SLC mode; if you don't actually cycle the blocks that much, the retention will be much higher.

I'm not sure if the firmware will still use the stronger ECC that's required for QLC vs. SLC even for SLC mode blocks, but if it does, that will also add to the reliability.

About ten years ago I got my hands on some of the last production FusionIO SLC cards for benchmarking. The software was an in-memory database that a customer wanted to use with expanded capacity. I literally just used the fusion cards as swap.

After a few minutes of loading data, the kernel calmed down and it worked like a champ. Millions of transactions per second across billions of records, on a $500 computer... and a card that cost more than my car.

Definitely wouldn't do it that way these days, but it was an impressive bit of kit.

  • I worked at a place where I can say FusionIO saved the company. We had a single Postgres database which powered a significant portion of the app. We tried to kick off a horizontal scaling project around it, to little success - turns out that partitioning is hard on a complex, older codebase.

    Somehow we ended up with a FusionIO card in tow. We went from something like 5,000 read QPS to 300k read QPS on pgbench using the cheapest 2TB card.

    Ever since then, I’ve always thought that reaching for vertical scale is more tenable than I originally thought. It turns out hardware can do a lot more than we think.

    • The slightly better solution for these situations is to set up a reverse proxy that sends all GET requests to a read replica and the server with the real database gets all of the write traffic.

      But the tricky bit there is that you may need to set up the response to contain the results of the read that is triggered by a successful write. Otherwise you have to solve lag problems on the replica.

    • You can get up to, I think, half a thousand cores in a single server, with multiple terabytes of RAM. You could run the entirety of Wikipedia's or Stack Overflow's or Hacker News's business logic in RAM on one server, though you'd still want replicas for bandwidth scaling and failover. Vertical scaling should certainly get back in vogue.

      Not to mention that individual servers, no matter how expensive, cost a tiny fraction of the equivalent cloud.

      Remember the LMAX Disruptor hype? Their pattern was essentially to funnel all the data for the entire business logic onto one core, and make sure that core doesn't take any bullshit - write the fastest L1-cacheable nonblocking serial code with input and output in ring buffers. Pipelined business processes can use one core per pipeline stage. They benchmarked 20 million transactions per second with this pattern - in 2011. They ran a stock exchange on it.

  • Back when the first Intel SSDs were coming out, I worked with an ISP that had an 8 drive 10K RAID-10 array for their mail server, but it kept teetering on the edge of not being able to handle the load (lots of small random IO).

    As an experiment, I sent them a 600GB Intel SSD in laptop drive form factor. They took down the secondary node, installed the SSD, and brought it back up. We let DRBD sync the arrays, and then failed the primary node over to this SSD node. I added the SSD to the logical volume, then did a "pvmove" to move the blocks from the 8 drive array to the SSD, and over the next few hours the load steadily dropped down to nothing.

    It was fun to replace 8x 3.5" 10K drives with something that fit comfortably in the palm of my hand.

  • In the nineties they used battery-backed RAM that cost more than a new car for WAL data on databases that desperately needed to scale higher.

I’d also recommend this if you’re using eMMC in embedded devices. On a Linux system, you can use the `mmc` command from `mmc-utils` to configure your device in pSLC mode. It can also be done in U-Boot but the commands are a bit more obtuse. (It’s one-time programmable, so once set it’s irreversible.)
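
A sketch of what that looks like with mmc-utils, printed as a dry run since the change is irreversible; the device node and area size are placeholders for your part, and the exact syntax should be double-checked against your mmc-utils version:

```shell
# Sketch: put an eMMC's user area into pSLC ("enhanced") mode with mmc-utils.
# enh_area is one-time programmable, so the commands are only echoed here.
DEV=/dev/mmcblk0      # assumed device node
LEN_KIB=3866624       # assumed whole user area of a hypothetical ~7.4 GiB part
echo "mmc extcsd read $DEV                       # inspect before committing"
echo "mmc enh_area set -y 0 $LEN_KIB $DEV        # irreversible!"
```

Read MAX_ENH_SIZE_MULT from the ext_csd dump first to find the real maximum enhanced-area size for your chip.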

In mass-production quantities, programming houses can preconfigure this and any other eMMC settings for you.

I wish this kind of deep dive with bus transfer rates was more common. It would be great to have a block diagram that lists every important IC model number / working clock frequency + bus width / working clock rate between these ICs for every SSD.

Some Kingston SSDs allow you to manage over-provisioning (i.e. to choose the capacity-endurance tradeoff) by using a manufacturer-provided software tool.

  • I don't think that would change how many bits are stored per cell, though? If you, say, set overprovisioning to 80%, then that's going to be 80% of the QLC capacity, and it's going to use the remaining 20% still in QLC mode, it's not going to recognize that it can use SLC with 20% of the SLC overprovisioned.

    • Yeah, all over-provisioning does is give the controller more spare cells to play with. The cells will still wear at the same rate as if you didn't over-provision; however, depending on how the controller does wear leveling, it could further improve the life of the drive because each cell is being used less often.

      This mod (I only just skimmed the post) provides a longer life not by using the cells less often (or keeping more in reserve), but by extending each cell's life: it relaxes the charge tolerance needed to store the cell's state, in return decreasing the bits that can be stored in the cell and so decreasing the capacity.

It'd be nice if manufacturers provided a way to downgrade an SSD to SLC via some driver setting.

  • While SSDs do not, all flash chips do. So if you were ever going to try building your own SSD, or simply connect some flash directly up to your SoC via some extra pins, you would be able to program them this way. I imagine extending NVMe to offer this is possible if there were enough popular demand.

  • The great thing about disks is that they don't require drivers at all. The driver-settings Windows app is not going to be open sourced if such a thing were to exist.

Wild! I had assumed this was a hardware-level distinction.

  • How many bits a particular NAND chip can store per cell is presumably hardware-level, but I believe it's possible to achieve SLC on all of them anyway, even if they support TLC or QLC.

    Hell, the Silicon Power NVMe SSD I have in my machine right now will use SLC for writes, then (presumably) move that data later to TLC during periods of inactivity. Running the NAND in SLC mode is a feature of these drives, it's called "SLC caching".

  • Of course it is trivial to just write 000 for zero and 111 for one in the cells of a TLC SSD to turn it into effectively a SLC SSD, but that in itself doesn't explain why it's so much faster to read and write compared to TLC.

    For example, if it had been DRAM where the data is stored as charge on a capacitor, then one could imagine using a R-2R ladder DAC to write the values and a flash ADC to read the values. In that case there would be no speed difference between how many effective levels was stored per cell (ignoring noise and such).

    From what I can gather, the reason the pseudo-SLC mode is faster is down to how flash is programmed and read, and relies on the analog nature of flash memory.

    Like DRAM there's still a charge that's being used to store the value, however it's not just in a plain capacitor but in a double MOSFET gate[1].

    The amount of charge changes the effective threshold voltage of the transistor. Thus to read, one needs to apply different voltages to see when the transistor starts to conduct[2].

    To program a cell, one has to inject some amount of charge that puts the threshold voltage to a given value depending on which bit pattern you want to program. Since one can only inject charge, one must be careful not to overshoot. Thus one uses a series of brief pulses and then do a read cycle to see if the required level has been reached or not[3], repeating as needed. Thus the more levels per cell, the shorter pulses are needed and more read cycles to ensure the required amount of charge is reached.

    When programming the multi-level cell in single-level mode, you can get away with just a single, larger charge injection[4]. And when reading the value back, you just need to determine if the transistor conducts at a single level or not.

    So to sum up, pseudo-SLC does not require changes to the multi-level cells as such, but it does require changes to how those cells are programmed and read. So most likely it requires changing those circuits somewhat, meaning you can't implement this just in firmware.

    [1]: https://en.wikipedia.org/wiki/Flash_memory#Floating-gate_MOS...

    [2]: https://dr.ntu.edu.sg/bitstream/10356/80559/1/Read%20and%20w...

    [3]: https://people.engr.tamu.edu/ajiang/CellProgram.pdf

    [4]: http://nyx.skku.ac.kr/publications/papers/ComboFTL.pdf
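
    The pulse-and-verify loop described above can be mimicked with a toy model (step sizes and thresholds invented):

```shell
# Toy ISPP model: inject charge in fixed pulses, verify after each, stop at
# the target threshold. One coarse level (SLC) needs few pulse+verify cycles;
# tightly packed levels (QLC) need many fine ones.
program() {  # $1 = target charge, $2 = step per pulse
  charge=0 pulses=0
  while [ "$charge" -lt "$1" ]; do
    charge=$((charge + $2))
    pulses=$((pulses + 1))
  done
  echo "$pulses"
}
echo "SLC-style coarse steps: $(program 100 50) pulse+verify cycles"
echo "QLC-style fine steps:   $(program 100 5) pulse+verify cycles"
```

    Real ISPP also ramps the pulse voltage and verifies against several reference levels at once, but the cost asymmetry is the same.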

    • > but it does require changes to how those cells are programmed and read. So most likely it requires changing those circuits somewhat, meaning you can't implement this just in firmware

      Fortunately everyone shipping TLC/QLC disks needs to use a pSLC cache for performance reasons, so that hardware is already there.

Could this be used to extend the lifetime of an already worn-out SSD? I wonder if there's some business in china taking those and reflashing them as "new".

  • The only rejuvenation process that I know of is heat, either long-term exposure to 250°C or short-term at higher temperature (800°C).

    https://m.hexus.net/tech/news/storage/48893-making-flash-mem...

    https://m.youtube.com/watch?v=H4waJBeENVQ

    • That first article was from 12 years ago, when MLC was the norm and had 10k endurance.

      Macronix have known about the benefits of heating for a long time but previously used to bake NAND chips in an oven at around 250C for a few hours to anneal them – that’s an expensive and inconvenient thing to do for electronic components!

      I wonder if the e-waste recycling operations in China may be doing that to "refurbish" worn out NAND flash and resell it. They already do harvesting of ICs so it doesn't seem impossible... and maybe this effect was first noticed by someone heating the chips to desolder them.

  • Technically, QLC NAND that is no longer able to distinguish at QLC levels should certainly still be suitable as MLC for a while longer, and SLC, for all practical intents and purposes, forever.

    • Yes, but certainly no consumer or even enterprise ssd firmware has bothered to integrate that functionality.

I thought QLC and TLC memory chips were different at the physical level, not that it's just a matter of firmware.

  • There are physical differences: QLC requires more precise hardware, since you need to distinguish between more charge levels. But you can display a low-quality picture on a high-definition screen, or average 4 physical pixels in a camera sensor to get a virtual one; same thing here, you combine together some charge levels for increased reliability.

    Put another way, you can turn a QLC into a TLC, but not the other way around.

  • The memory cells are identical. The peripheral circuitry for accessing the memory array gets more complicated as you support more bits per cell, and the SRAM page buffers have to get bigger to hold the extra bits. But everyone designs their NAND chips to support operating with fewer bits per cell.

    Sometimes TLC and QLC chips will be made in different sizes, so that each has the requisite number of memory cells to provide a capacity that's a power of two. But it's just as common for some of the chips to have an odd size, eg. Micron's first 3D NAND was sold as 256Gbit MLC or 384Gbit TLC (literally the same die), and more recently we've seen 1Tbit TLC and 1.33Tbit QLC parts from the same generation.

Is it possible for SSD firmware to act “progressively” from SLC to MLC to TLC and to QLC (and maybe PLC in the future)? E.g. for a 1TB QLC SSD, it would act as SLC for usage under 256GB, then MLC under 512GB, then TLC under 768GB, and then QLC under 1TB (and PLC under 1280GB).

  • It's theoretically possible, but in practice when a drive is getting close to full what makes sense is to compact data from the SLC cache into the densest configuration you're willing to allow, without any intermediate steps.

DIWhy type stuff. Still, a fun hack. TLC media has plenty of endurance. We see approximately 1.3-1.4x NAND write amplification in production workloads at ~35% fill rate with decent TRIMming.

It mentions the required tool being available from um... interesting places.

Doing a Duck Duck Go search on the "SMI SM2259XT2 MPTool FIMN48 V0304A FWV0303B0" string in the article shows this place has the tool for download:

https://www.usbdev.ru/files/smi/sm2259xt2mptool/

The screenshot in the article looks to be captured from that site even. ;)

Naturally, be careful with anything downloaded from there.

  • There were several instances where I saw an interesting tool for manipulating SSDs and SD cards available only from strange Russian websites. This one at least has an English UI ... A lot of this research seems concentrated there and I wonder why it didn't catch the same level of interest in the west.

    • These are genuine factory tools supplied by chip vendors such as Silicon Motion, supposedly under NDA, leaked and passed around loosely among Chinese factories. These things are sometimes repacked with malware installers, so blindly running them on your dev machine with AWS keys might not be the best idea. Trying to run them on Linux or macOS, or rewriting them in Rust, might not be great either.

      It doesn't happen in the West because manufacturing happens in China in Chinese language. I suppose it's easier for Russian guys to (figuratively) walk into their smoke room and ask for the USB key.

    • > and I wonder why it did not catch the same level of interest in the west.

      Because people in the west are too scared of IP laws.

    • Yeah. That site has a lot of info for a huge number of flash controllers/chipsets/etc.

      Wish I had a bunch of spare time to burn on stuff like this. :)

      1 reply →

  • I was unable to find the source code, so it is important to be careful. In my case it sounds like a leap of faith I'm not willing to take (my apologies to the developers).

    In any case, this is a feature that manufacturers should provide. I wonder how it could be obtained.

  • In countries where people have been less conditioned to be mindless sheep, you can more easily find lots of truth that doesn't toe the official line.

    Spreading xenophobic FUD only serves to make that point clearer: you can't argue with the facts, so you can only sow distrust.