Comment by zackmorris

6 hours ago

Just wanted to mention genetic algorithms (GAs), popularized by John Koza and others.

The post uses a 4 instruction program as an example having about 256^4 or 4 billion combinations. Most interesting programs are 10, 100, 1000+ instructions long, which is too large of a search space to explore by brute force.

So GAs use a number of tricks to investigate the search space via hill climbing without getting stuck at local optima. They do that by treating the search space as a bit string, then randomly flipping bits (mutation) or swapping bits (sexual reproduction) to hop to related hills in the search space. Then the bit string is converted back to instructions and tested to see if it performs the desired algorithm.

The bit string usually encodes the tree form of a Lisp program to minimize syntax. We can think of it as if every token is encoded in bits (like Huffman encoding inspired by Morse code) For example, the tokens in a (+ 1 2) expression might have the encoding 00, 01 and 10, so the bit string would be 000110, and we can quickly explore all 2^3 = 8 permutations (2^6 = 64 if we naively manipulate an uncompressed bit string whose encoded token sizes vary).

Note that many of the bit strings like (+ + 1) or (2 1 +) don't run. So guard rails can be added to reduce the search space, for example by breaking out early when bit strings throw a compiler exception, or using SAT solvers or caching to weed out nonviable bit strings.

We could build a superoptimizer with GAs, then transpile between MOS 6502 assembly and Lisp (or even run the MOS 6502 assembly directly in a sandbox) and not have to know anything about how the processor works. To me, this is the real beauty of GAs, because they allow us to solve problems without training, at the cost of efficiency.

I don't think that LLMs transpile to Lisp when they're designing algorithms. So it's interesting that they can achieve high complexity and high efficiency via training, without even having verification built-in. Although LLMs trained on trillions of parameters running on teraflops GPUs with GBs of memory may or may not be viewed as "efficient".

I suspect that someday GAs may be incorporated into backpropagation to drastically reduce learning time by finding close approximations to the matrix math of gradient descent. GAs were just starting to be used to pseudorandomly produce the initial weights of neural nets around 2000 when I first learned about them.

Also quantum computing (QC) could perform certain matrix math in a fraction of the time, or even preemptively filter out bit strings which aren't runnable. I suspect that AI will get an efficiency boost around 2030 when QC goes mainstream. Which will probably lead us to a final candidate learning algorithm that explains how quantum uncertainty and emergent behavior allow a physical mind to tune into consciousness and feel self-aware, but I digress.

Because modern compilers don't do any of this, and we aren't accustomed to multicore computing, then from a sheer number of transistors perspective, we're only getting a tiny fraction of the computing power that we might otherwise have if we designed chips from scratch using modern techniques. This is why I often say that computers today run thousands of times slower than they should for their transistor budgets.