Comment by fsckboy

2 years ago

> The Linux kernel has been accidentally hardcoded to a maximum of 8 cores for the past 15 years and nobody noticed

I can understand having a bug like that, but unnoticed for 15 years? More than 8 cores was rare 15 years ago, and as a percentage of chips sold it's still rare, but presumably people with Threadrippers ran benchmarks, optimized, etc.? It just doesn't seem possible.

The article is extremely confusingly worded, sometimes close to nonsensical (EPYC as baseline? In what damn world?), and definitely clickbaity.

From what I understand, what was limited to 8 cores is the scaling of the preemption delay (min_granularity / min_slice). Again from what I understand, this is the window during which a process cannot be preempted, so it's only relevant when the scheduler has more tasks to run than available slices (the system is heavily loaded or overloaded).

I would assume well-administered systems where this would be relevant:

1. Are not overloaded

2. Have the important tasks pinned to avoid migrations

3. Have priorities configured to avoid preempting / descheduling their primary workloads

As such, on a well-administered system this would mostly translate to possibly over-preempting low-priority tasks (and most likely not preempting anything, because the machine is configured with capacity for those ancillary/transient low-priority tasks). It may show up during transient overloads and worsen an already bad situation, but it probably wouldn't show up during normal operation.

It also doesn’t seem accidental: the maintainers literally slapped a `min(8, …)` on there, so they explicitly designed the scaling to have an upper bound. Maybe it’s a mistake, maybe it’s too low, maybe it should be a tunable, but I’d think it makes sense to not allow the preemption delay to grow infinitely.
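A minimal sketch of that kind of capped heuristic (a toy model, not the actual kernel source; the base value here is illustrative):

```python
import math

def scaling_factor(ncpus: int) -> int:
    # Toy model of a log-scaled granularity heuristic: the CPU count
    # is clamped to 8 before deriving the factor, so the slice stops
    # growing once you pass 8 cores -- the min(8, ...) under discussion.
    cpus = min(ncpus, 8)
    return 1 + int(math.log2(cpus))

BASE_GRANULARITY_NS = 750_000  # illustrative base value, not a kernel default

for n in (1, 2, 4, 8, 16, 128):
    ns = scaling_factor(n) * BASE_GRANULARITY_NS
    print(f"{n:>3} cores -> min granularity {ns} ns")
```

Past 8 cores the factor stays pinned at 4, so with this base value the minimum slice tops out at 3 ms no matter how many cores the box has.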

Because it’s a completely misleading headline.

The number of cores in the heuristic used to calculate task switch frequency was capped to 8.

This is a reasonable thing to do as a heuristic, because you don’t want your time slice to grow indefinitely with core count on an interactive system.

  • Exactly. The code is adjusting for responsiveness. With fewer CPUs you need a smaller minimum slice. As you get more CPUs you can increase the slice and still schedule the same number of processes per second.

    E.g. a 1 ms slice with 1 core = 1000 process switches per second. With 2 cores you can increase the slice to 2 ms and still maintain the same number of switches per second for the system, while reducing the switches per second on each core to 500. This reduces the overhead for the scheduler.

    It seems like at around 8× the slice, efficiency starts to go the other way, so they’ve limited it. Seems reasonable, but scheduler math is crazy.

    Note that this has nothing to do with the scheduler assignments per core, which have clearly been working, or people would’ve noticed!
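    The arithmetic above can be sketched directly (a toy model of the comment's example, not kernel code):

    ```python
    def switches_per_second(cores: int, slice_ms: float) -> float:
        # System-wide context switches per second if every core runs
        # back-to-back slices of slice_ms.
        return cores * (1000.0 / slice_ms)

    # 1 core with a 1 ms slice: 1000 switches/s for the system.
    print(switches_per_second(1, 1.0))
    # 2 cores with a 2 ms slice: still 1000 switches/s system-wide,
    # but each core now only switches 500 times per second.
    print(switches_per_second(2, 2.0))
    print(switches_per_second(2, 2.0) / 2)
    ```

    Doubling the slice as you double the cores holds the system-wide switch rate constant while halving the per-core scheduler overhead.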

It's possible: the kernel itself not using more than 8 cores would be imperceptible except on things like network devices where there's no userland<->kernel barrier. In those kinds of high-throughput applications you usually offload the kernel's work to hardware anyway.

You'll lose your mind once you realise that Windows NT handles a lot of things single-threaded. I had a situation once where I was handling a few million packets per second of TCP on Windows and it pinned only a single core.

Though that's not what TFA is looking at. In this case you're not actually limited to 8 cores; you're limited to slicing your executions into 8 parts per CPU per "tick".

This is a known way of scheduling: you only get 1/8th of a tick with a fair scheduler.

Yeah, how come people running massive computers didn’t notice the limit?

  • Because it's a misleading headline and doesn't mean Linux only used 8 cores for computing for 15 years.

  • Because the framing is wrong and clickbait... As anyone with many cores can tell you: those are used.

    The issue is more subtle: "[the minimum granularity] is supposed to allow tasks to run for a minimum amount of [3ms] when the system is overloaded".

    That's supposed to scale with the number of cores, but the scaling is limited to 8 cores. However, IMHO that's not even necessarily a bad thing. It's a trade-off between responsiveness and throughput in overload situations. You don't want slices to become too tiny or too large...

  • If I skimmed this correctly, it's a malus on performance, not a complete cliff. I guess people just thought "ho-hum, there's got to be some overhead in scheduling".

    • Yes, and when you run in parallel and see all >8 cores nicely pegging at 100%, why assume something is wrong?

      Still, even after rereading the article I don't get what the malus is; it must be small, by that logic? Because you definitely saw linear scaling with parallelizable problems on >8 cores, otherwise people would have noticed.
