Comment by m_ke

6 days ago

We really need new hardware optimized for sparse compute. Deep Learning models would work way better with much higher dimensional sparse vectors, but current hardware only excels at dense GEMMs and structured sparsity.
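
To make the distinction concrete, here's a quick numpy sketch (illustrative only; the function names are mine, not any vendor's API) of the difference between the 2:4 structured sparsity that current accelerators support and the unstructured sparsity that they don't:

```python
import numpy as np

def prune_2_to_4(w: np.ndarray) -> np.ndarray:
    """Keep the 2 largest-magnitude weights in every contiguous group of 4
    (the '2:4' structured pattern recent accelerators can exploit).
    Assumes w.size is divisible by 4."""
    groups = w.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]  # 2 smallest |w| per group
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)

def prune_unstructured(w: np.ndarray, sparsity: float = 0.9) -> np.ndarray:
    """Keep the top (1 - sparsity) fraction of weights by magnitude, wherever
    they happen to be -- the pattern today's hardware handles poorly."""
    k = max(1, int(round(w.size * (1.0 - sparsity))))
    thresh = np.sort(np.abs(w), axis=None)[-k]
    return np.where(np.abs(w) >= thresh, w, 0.0)

w = np.random.randn(8, 16).astype(np.float32)
print((prune_2_to_4(w) == 0).mean())        # 50% zeros, in a regular, hardware-friendly pattern
print((prune_unstructured(w) == 0).mean())  # ~90% zeros, scattered arbitrarily
```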

For what it's worth, we think it's unfortunately quite unlikely that frontier models will ever be trained with extreme unstructured sparsity, even with custom sparsity optimized hardware. Our main hope is that understanding sub-frontier models can still help a lot with ensuring safety of frontier models; an interpretable GPT-3 would be a very valuable object to have. It may also be possible to adapt our method to only explaining very small but important subsets of the model.

  • Yeah, it's not happening anytime soon, especially with the whole economy betting trillions of dollars on brute-force scaling of transformers on Manhattan-sized GPU farms that will use more energy than most Midwestern states.

    Brains do it somehow, so sparsely / locally activated architectures are probably the way to go long term, but we're decades away from that being commercially viable.

  • As the lead author, why do you think so?

    • I'm not an expert at hardware, so take this with a grain of salt, but there are two main reasons:

      - Discrete optimization is always going to be harder than continuous optimization. Learning the right sparsity mask is fundamentally a very discrete operation, so even just matching fully continuous dense models in optimization efficiency is likely to be difficult. Perhaps we can take some hope from the fact that MoE is similarly fundamentally discrete and works in practice (we can think of MoE as incurring some penalty from imperfect gating, which is more than offset by the systems benefit of not having to run all the experts on every forward pass). The optimization problem also gets harder when the backward pass needs to be entirely sparsified computation (see appendix B). There's a toy sketch of what masked training can look like right after this list.

      - Dense matmuls are just fundamentally nicer to implement in hardware. Systolic arrays have predictable, highly local data flows. A sparse matmul with the same number of flops nominally needs only (up to a multiplicative factor) the same memory bandwidth as the equivalent dense matmul, but it has to be able to route data from any memory unit to any vector compute unit. The locality of dense matmuls means that computing each output tile only requires a small slice of both input matrices, so we only need to load those slices into shared memory (and because GPU-to-GPU transfers are far slower, when we op-shard matmuls we replicate whatever data is needed). Sparse matmuls would need either more replication within each compute die or more all-to-all internal bandwidth, which means spending a lot of die space on big crossbars and routing. Thankfully, crossbars consume much less power than actual compute, so this could perhaps match dense in energy efficiency and not make thermals worse. The second sketch below tries to illustrate the access-pattern difference.
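
      To make the first point concrete, here is a toy PyTorch sketch (my own illustration, not our actual training setup) of why mask learning is awkward: the forward pass keeps only the top-k weights by magnitude, and since that hard selection has no useful gradient, a straight-through estimator passes dense gradients to every weight anyway.

      ```python
      import torch

      class TopKLinear(torch.nn.Module):
          """Toy weight-sparse linear layer: the forward pass uses only the
          top-k weights by magnitude; the hard top-k selection is not
          differentiable, so gradients are passed straight through it."""
          def __init__(self, in_features, out_features, k):
              super().__init__()
              self.weight = torch.nn.Parameter(torch.randn(out_features, in_features) * 0.02)
              self.k = k

          def forward(self, x):
              flat = self.weight.abs().flatten()
              thresh = torch.topk(flat, self.k).values.min()
              mask = (self.weight.abs() >= thresh).float()
              # Straight-through estimator: forward sees the masked weights,
              # backward acts as if the mask were identity, so dense gradients flow.
              w_sparse = self.weight * mask + (self.weight - self.weight.detach()) * (1 - mask)
              return x @ w_sparse.t()

      layer = TopKLinear(256, 256, k=256 * 256 // 20)  # ~5% of weights active
      x = torch.randn(8, 256)
      loss = layer(x).pow(2).mean()
      loss.backward()
      print(layer.weight.grad.abs().mean())  # even masked-out weights receive gradient
      ```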
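
      And for the second point, a toy numpy sketch of the access-pattern difference (nothing like a real kernel, just to show where the locality goes): the dense tiled version only ever touches contiguous slices of both operands, while the CSR version gathers rows of B at data-dependent positions, which is exactly the any-to-any routing that costs die area.

      ```python
      import numpy as np
      from scipy.sparse import random as sparse_random

      def dense_tiled_matmul(A, B, tile=32):
          """Each (i, j) output tile only needs a contiguous row-slice of A and
          column-slice of B, so those slices can sit in local/shared memory."""
          M, K = A.shape
          _, N = B.shape
          C = np.zeros((M, N), dtype=A.dtype)
          for i in range(0, M, tile):
              for j in range(0, N, tile):
                  for k in range(0, K, tile):
                      C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
          return C

      def csr_matmul(indptr, indices, data, B):
          """Same flops per nonzero, but the inner loop reads rows of B at
          data-dependent positions -- no locality to exploit."""
          C = np.zeros((len(indptr) - 1, B.shape[1]), dtype=B.dtype)
          for row in range(len(indptr) - 1):
              for p in range(indptr[row], indptr[row + 1]):
                  C[row] += data[p] * B[indices[p]]  # scattered reads of B
          return C

      A_sparse = sparse_random(128, 128, density=0.05, format="csr")
      B = np.random.randn(128, 64)
      reference = A_sparse.toarray() @ B
      assert np.allclose(dense_tiled_matmul(A_sparse.toarray(), B), reference)
      assert np.allclose(csr_matmul(A_sparse.indptr, A_sparse.indices, A_sparse.data, B), reference)
      ```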

      It also seems very likely that once we create the interpretable GPT-1 (or 2, or 3) we will find that making everything unstructured sparse was overkill, and there are much more efficient pretraining constraints we can apply to models to 80/20 the interpretability. In general, a lot of my hope routes through learning things like this from the intermediate artifact (interpretable GPT-n).

      To be clear, it doesn't seem literally impossible that, with great effort, we could create custom hardware, vastly improve the optimization algorithms, and so on, such that weight-sparse models end up vaguely close in performance to weight-dense models. It's plausible that with better optimization the win from arbitrary connectivity patterns might offset the hardware difficulties, and I could be overlooking something that would make the cost lower than I expect. But this would require immense effort and investment merely to match current models, so it seems quite unrealistic compared to learning something from an interpretable GPT-3 that helps us understand GPT-5.


Yes! I've been advocating for this inside the industry for a decade, but it's an uphill battle. Researchers can't easily publish that kind of work (even Google researchers), because you don't have hardware that can realistically train decently large models. The hardware companies don't want to take the risk of rethinking the CPU or accelerator architecture for sparse compute, because there are no large existing customers.

There also need to be tools that can author that code!

I'm starting to dust off some ideas I developed over a decade ago to build such a toolkit. I recently realized, “egads, my stuff can express almost every major GPU/CPU optimization that's relevant for modern deep learning… I need to do a new version with an eye towards adoption in that area.” Plus every flavor of sparse.

I also need to figure out whether some of the open-core ideas I have in mind would be attractive to early-stage investors who focus on the so-called deep tech end of the space. It definitely looks like I'll have to take ye olde “ask friends and acquaintances if they can point me to those folks” approach, since cold outreach has historically been full of fail.

Deep Learning models would work way better with much higher dimensional sparse vectors

Citations?

  • There has been plenty of evidence over the years. I don't have my bibliography handy right now, but you can find the papers by searching for sparse training or lottery ticket hypothesis work.

    The intuition is that ANNs make better predictions in higher dimensions, that sparse weights let you train the sparsity pattern along with the weights themselves, that the effective part of a dense model is actually sparse (cf. pruning/sparsification research), and that dense models grow too quickly in compute cost to push model dimensions any further. There's a toy sketch of the pruning point below.
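
    Not a citation, just a toy PyTorch illustration of the pruning point (my own sketch; how small the gap stays depends heavily on the model and sparsity level): train a small dense net, zero out 90% of the weights by global magnitude, and compare the fit before and after.

    ```python
    import torch

    torch.manual_seed(0)
    X = torch.randn(2048, 32)
    y = torch.sin(X @ torch.randn(32, 1)).squeeze(-1)

    # Small dense regression net on toy data.
    model = torch.nn.Sequential(
        torch.nn.Linear(32, 256), torch.nn.ReLU(),
        torch.nn.Linear(256, 256), torch.nn.ReLU(),
        torch.nn.Linear(256, 1),
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(2000):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    print("dense loss: ", loss.item())

    # Global magnitude pruning: keep only the largest 10% of weights (biases untouched).
    with torch.no_grad():
        weights = [m.weight for m in model if isinstance(m, torch.nn.Linear)]
        threshold = torch.quantile(torch.cat([w.abs().flatten() for w in weights]), 0.9)
        for w in weights:
            w.mul_((w.abs() >= threshold).float())
    print("pruned loss:", torch.nn.functional.mse_loss(model(X).squeeze(-1), y).item())
    ```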

    • If you can share that bibliography, I'd love to read it. I have the same intuition, and a few papers seem to support it, but more of them, and more explicit ones, would be much better.

My last dive into matrix computations was years ago, but the need was the same back then. We could sparsify matrices pretty easily, but the infrastructure was lacking. Some things never change.