Comment by chao-

20 hours ago

Crazy to think that my first personal computer's entire storage (was 160MB IIRC?) could fit into the L3 of a single consumer CPU!

It's probably not possible architecturally, but it would be amusing to see an entire early 90's OS running entirely in the CPU's cache.

https://github.com/coreboot/coreboot/blob/main/src/soc/intel...

  • Context: Early in the firmware boot process the memory controller isn't configured yet so the firmware uses the cache as RAM. In this mode cache lines are never evicted since there's no memory to evict them to.

    • I remember from the talk about Wii/Wii U hacking that they intentionally kept the early boot code in cache so that the memory couldn't be sniffed or modified on the RAM bus, which was external to the CPU and thus glitchable.

    • There may be server workloads for which the L3 cache is sufficient. It would be interesting if it made sense to build boards, at scale, with just the CPU and no memory.

      I imagine for such a workload you can always solder a small memory chip to avoid having to waste L3 on unused memory and a non-standard booting process so probably not.


In my case it began with 16K (yes, 16 × 1024 bytes) and 90K (yes, 90 × 1024 bytes) 5.25" floppy disks (although the floppies were a few months after the computer). Eventually upgraded to 48K RAM and 180K double density floppy disks. The computer: Atari 800.

  • I'll see your Atari 800 and raise you my Atari 2600 with its whopping 128 bytes of RAM. Bytes with a B. I can kinda sorta call it a computer because you could buy a BASIC cartridge for it (I didn't and stand by that decision - it was pretty bad).
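For a sense of scale, here is a quick back-of-the-envelope sketch of those sizes against the 192MB of L3 mentioned elsewhere in this thread. It assumes binary units (1K = 1024 bytes, 1MB = 1024 × 1024 bytes); the variable names are just for illustration.

```python
# Rough size comparison, assuming binary units throughout.
KIB = 1024
MIB = 1024 * KIB

atari_800_ram = 16 * KIB   # 16K of RAM
floppy_90k = 90 * KIB      # 90K single-density floppy
atari_2600_ram = 128       # 128 bytes of RAM
l3_total = 192 * MIB       # L3 figure quoted in this thread

print(atari_800_ram)               # bytes in 16K
print(floppy_90k)                  # bytes in 90K
print(l3_total // atari_800_ram)   # Atari 800 RAM images that fit in the L3
print(l3_total // atari_2600_ram)  # Atari 2600 RAM images that fit in the L3
```

So the L3 alone is roughly twelve thousand times the Atari 800's original RAM, and about a million and a half times the 2600's.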

Maybe in 50 years the cache of CPUs and GPUs will be 1TB. Enough to run multiple LLMs (a model entirely run for each task). Having robots like in the movies would need LLMs much much faster than what we see today.

KolibriOS would fit in there, even with the data in memory. You cannot load it into the cache directly, but when the cache capacity is larger than all the data you read there should be no cache eviction and the OS and all data should end up in the cache more or less entirely. In other words it should be really, really fast, which KolibriOS already is to begin with.

  • Unless you lay everything out contiguously in memory, you'll still get cache eviction due to associativity, depending on the eviction strategy of the CPU. But certainly DOS or even early Windows 95 could conceivably just run out of the cache.

    • Windows 95 only needed 4MB RAM and 50 MB disk, so that's certainly doable. The trick is to have a hypervisor spread that allocation across cache lines.

    • Yeah, cache eviction is the reason I was assuming it is "probably not possible architecturally", but I also figured there could be features beyond my knowledge that might make it possible.

      Edit: Also this 192MB of L3 is spread across two Zen CCDs, so it's not as simple as "throw it all in L3" either, because any given core would only have access to half of that.

    • Well, yeah, reality strikes again. All you need is an exploit in the microcode to gain access to AMD's equivalent to the ME and now you can just map the cache as memory directly. Maybe. Can microcode do this or is there still hardware that cannot be overcome by the black magic of CPU microcode?

  • That assumes KolibriOS or any major component is pinned to one core and one cache slice instead of getting dragged between CCDs or losing memory affinity. Throw actual users, IO, and interrupts at it and you get traffic across chiplets, or at least across L3 groups, so the nice 'everything lives in cache' story falls apart fast.

    Nice demo, bad model. The funny part is that an entire OS can fit in cache now. The hard part is making the rest of the system act like that matters.
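To illustrate the associativity point made above, here is a minimal sketch of a toy set-associative cache with LRU replacement. Everything in it is an assumption for illustration: the `simulate` function, the 32 KiB / 8-way / 64-byte-line geometry, and the simple modulo set indexing. Real L3 caches index on physical addresses, often hash the index, and may not use plain LRU, so treat this as a model of the idea rather than of any actual CPU.

```python
from collections import OrderedDict

def simulate(addresses, cache_bytes, ways, line_bytes=64, passes=2):
    """Return the number of misses on the final pass over `addresses`
    in a toy LRU set-associative cache (hypothetical model)."""
    num_sets = cache_bytes // (ways * line_bytes)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for _ in range(passes):
        misses = 0  # only the last pass is reported
        for addr in addresses:
            line = addr // line_bytes
            cache_set = sets[line % num_sets]
            if line in cache_set:
                cache_set.move_to_end(line)        # mark as most recently used
            else:
                misses += 1
                if len(cache_set) >= ways:
                    cache_set.popitem(last=False)  # evict least recently used
                cache_set[line] = True
    return misses

CACHE = 32 * 1024  # 32 KiB, 8 ways, 64-byte lines -> 64 sets

# 16 KiB of data, half the cache capacity, laid out two different ways.
contiguous = [i * 64 for i in range(256)]   # lines spread evenly over all sets
strided = [i * 4096 for i in range(256)]    # every line maps to the same set

print(simulate(contiguous, CACHE, ways=8))  # 0 misses once warmed up
print(simulate(strided, CACHE, ways=8))     # 256 misses: every access evicts
```

Both working sets are the same size and both are half the cache capacity, yet the strided one thrashes a single set while the contiguous one stays fully resident after the first pass.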

You had ~160,000 times more storage than I did for my first personal computer.

I wonder how much faster DOS would boot, especially with floppy seek times...

  • Instantly.

    If you run a VM on a CPU like this, using a baremetal hypervisor, you can get very close to "everything in cache".

  • You can get close with a VM, but there's overhead in device emulation that slows things down.

    Consider a VM where that kind of stuff has been removed, like the Firecracker hypervisor used for AWS Lambda. You're talking milliseconds.