Comment by ksec

8 hours ago

>Which is weird....

It isn't weird at all. I would have been surprised if it had ever succeeded in the first place.

The cost was far too high. Intel wouldn't share the tech with anyone other than Micron, and Micron wasn't committed to it either; since Intel paid for unused capacity at the fab regardless, Micron didn't care. There was no long-term solution or strategy to bring the cost down, and neither Intel nor Micron had a vision for it. No one wanted another Intel-only tech lock-in. And despite the high price, it barely made any profit per unit compared to NAND and DRAM, which at the time were earning historically high profits. Once the NAND and DRAM cycle turned down again, Optane's cost/performance wasn't as attractive. Samsung even made a form of SLC NAND that performed similarly to Optane but cheaper, and even they ended up halting its development for lack of interest.

A ways back, I wrote a sort of database that was memory-mapped-file backed (a mistake, but I didn't know that at the time), and I would have paid top dollar for even a few GB of NVDIMMs that could be put in an ordinary server and somewhat straightforwardly mounted as a DAX filesystem. I even tried to do some of the kernel work. But the hardware and firmware were such a mess that it was basically a lost cause. And none of the tech ever seemed to turn into an actual purchasable product. I'm a bit suspicious that Intel never found product-market fit in part because they never had a credible product on the NVDIMM side.
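The mmap-backed store pattern described above can be sketched in a few lines. This is a minimal illustration, not the commenter's actual code: on an ordinary filesystem, durability requires an explicit flush (`msync` under the hood); on a DAX-mounted pmem filesystem, the same mapping would hit persistent memory directly, and a `MAP_SYNC` mapping would replace the flush with CPU cache-line writeback. File name and sizes are made up for the example.

```python
# Minimal sketch of a memory-mapped-file-backed record store.
# On a DAX filesystem the mapping goes straight to persistent memory;
# here we rely on mmap.flush() (msync) for the durability point.
import mmap

STORE_SIZE = 4096  # one page; illustrative only


def write_record(path: str, payload: bytes) -> bytes:
    """Write payload at offset 0 of a memory-mapped file, flush, read back."""
    # Create (or reset) the backing file at a fixed size.
    with open(path, "wb") as f:
        f.truncate(STORE_SIZE)
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), STORE_SIZE) as m:
            m[: len(payload)] = payload
            # Durability point. With DAX + MAP_SYNC this would instead be
            # a cache-line flush (clwb) plus a store fence.
            m.flush()
            return bytes(m[: len(payload)])


if __name__ == "__main__":
    print(write_record("store.db", b"hello"))
```

The appeal of NVDIMMs for this design is exactly that the flush step becomes a couple of CPU instructions instead of a syscall and a block-device round trip.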

Somewhere I still have some actual battery-backed DIMMs (DRAM plus FPGA interposer plus awkward little supercapacitor bundle) in a drawer. They were not made by Intel, but Intel was clearly using them as a stepping stone toward the broader NVDIMM ecosystem. They worked on exactly one SuperMicro board, kind of, and not at all if you booted using UEFI. Rebooting without doing the magic handshake over SMBUS [0] first took something like 15 minutes, which was not good for those nines of availability.

[0] You can find my SMBUS host driver for exactly this purpose on the LKML archives. It was never merged, in part, because no one could ever get all the teams involved in the Xeon memory controller to reach any sort of agreement as to who owned the bus or how the OS was supposed to communicate without, say, defeating platform thermal management or causing the refresh interval to get out of sync with the DIMM temperature, thus causing corruption.

I’m suspicious that everything involved in Optane development was like this.

I worked at Micron in the SSD division when Optane (originally 3D XPoint, pronounced "crosspoint") was being made. In my mind, there was never a really serious push to productize it. But it's not clear to me whether that was due to unattractive terms of the joint venture or a lack of clear product fit.

There was certainly a time when it seemed they were shopping for engineers' opinions on what to do with it, but I think they quickly determined it would be a much smaller market than SSDs anyway and didn't end up pushing it too hard. I could be wrong, though; it's a big company, and my corner was manufacturing, not product development.

  • I worked at Intel for a while and might be able to explain this.

    There were/are often projects that come down from management that nobody thinks are worth pursuing. When I say nobody, that might include not just engineers but even one or two people in management, who then do a shit rollout. There are a lot of layers at Intel, and if even one layer in the Intel sandwich drags its feet, it can kill an entire project. I saw it happen a few times in my time there. That one specific node Intel dropped the ball on, for example, traced back to 2-3 people in one specific department.

    Optane was a minute before I got there, but having been excited about it at the time and followed it somewhat, that's the vibe I get from Optane: it had a lot of potential, but someone screwed it up, and that killed the momentum.

  • A friend was working at Micron on a rackmount network server with a lot of flash memory; I didn't ask at the time what kind of flash it used. The project was cancelled when it was nearly finished.

The cost was fantastically cheap if you take into account that Optane will live well over 10x longer than an SSD.

For a lot of bulk storage, yes, your data doesn't change frequently. But for databases or caches under heavy load, Optane was not only far faster but, on a life-cycle cost basis, far cheaper.

  • Optane was in the market during a time when the mainstream trend in the SSD industry was all about sacrificing endurance to get higher capacity. It's been several years, and I'm not seeing a lot of regrets from folks who moved to TLC and QLC NAND, and those products are more popular than ever.

    The niche that could actually make use of Optane's endurance was small and shrinking, and Intel had no roadmap to significantly improve Optane's $/GB which was unquestionably the technology's biggest weakness.

  • Drive write endurance is measured in TBW (terabytes written), and TLC flash kept adding 3D layers quickly enough, and staying cheap enough, that Optane never really beat its pricing per TBW by a margin that made for a practical product.

    I have to wonder whether, at this point, it would be usable for some kind of specialized AI workload that benefits from extremely low-latency reads but isn't written often. Perhaps integrated on a GPU board.

    • Optane's practical TBW endurance is far higher than even TLC flash's, never mind the QLC that is now standard in consumer NAND hardware, or emerging PLC. It even seems to go well beyond what's stated on the spec sheet. However, Optane excels at write-heavy workloads (not read-heavy ones, where NAND actually performs very well), and those are also power-hungry, which is a limitation for modern AI workloads.

    • The extra capacity of modern SSDs is a good point, especially now that we have 100TB+ drives.

      But Optane still offered 100 DWPD (drive writes per day) at up to 3.2TB. That's still vastly more DWPD than flash SSDs: a Kioxia CM8V, for example, does 3 DWPD at 12TB. The net TBW is still roughly 10x apart.

      You can get back to high endurance with SLC drives like the Solidigm D7-P5810, but you're back down to 1.6TB and 50 DWPD, i.e. a quarter of the Intel P5800X's endurance, with worse latencies. I strongly suspect the model number here is an homage, and despite being much newer and very expensive, the original is still so much better in so many ways. https://www.solidigm.com/content/solidigm/us/en/products/dat...

      You also end up paying for what I assume is a six-figure drive if you substitute more capacity than you need for DWPD. There's something elegant about being able to keep using your cells, versus overbuying cells with the intent of ripping through them relatively quickly.
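For what it's worth, the endurance gap quoted in this thread checks out arithmetically. A quick sketch, taking the capacity and DWPD figures from the comments above (not independently verified against spec sheets) and assuming a typical 5-year enterprise warranty window:

```python
# Lifetime-write comparison: DWPD x capacity x warranty days = total TBW.
WARRANTY_DAYS = 5 * 365  # 5-year warranty period, assumed


def lifetime_writes_tb(capacity_tb: float, dwpd: float,
                       days: int = WARRANTY_DAYS) -> float:
    """Total terabytes written over the warranty period."""
    return capacity_tb * dwpd * days


optane_p5800x = lifetime_writes_tb(3.2, 100)  # ~584,000 TBW
kioxia_tlc = lifetime_writes_tb(12.0, 3)      # ~65,700 TBW
solidigm_slc = lifetime_writes_tb(1.6, 50)    # ~146,000 TBW

print(f"Optane P5800X 3.2TB @ 100 DWPD: {optane_p5800x:,.0f} TBW")
print(f"TLC 12TB @ 3 DWPD:              {kioxia_tlc:,.0f} TBW")
print(f"SLC 1.6TB @ 50 DWPD:            {solidigm_slc:,.0f} TBW")
print(f"Optane vs TLC ratio:            {optane_p5800x / kioxia_tlc:.1f}x")
```

The Optane-to-TLC ratio comes out to roughly 9x, consistent with the "10x apart" claim, and the SLC drive lands at about a quarter of the P5800X's lifetime writes, as stated.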

  • So instead of replacing drives every 5 years, you replace them... every 5 years, because if you need that level of performance you're replacing the servers every 5 years anyway.