Comment by Aurornis
1 year ago
The machine learning model didn’t discover something that humans didn’t know about. It exploited quirks specific to that individual chip that could not be reproduced in production, or even on other chips or other configurations of the same chip.
That is a common problem with fully free-form machine learning solutions: they can stumble upon something that technically works on their training setup, but that any human who understood the full system would never actually use, because of the other problems that come with it.
> The quadratic region (before the "on") is far more energy efficient
Take a look at the structure of something like CMOS and you’ll see why running transistors in anything other than “on” or “off” is definitely not energy efficient. In fact, the transitions are where the energy usage largely goes. We try to get through that transition period as rapidly as possible because minimal current flows when the transistors reach the on or off state.
There are other logic arrangements, but I don’t understand what you’re getting at by suggesting circuits would be more efficient. Are you referring to the reduced gate charge?
> Take a look at the structure of something like CMOS and you’ll see why running transistors in anything other than “on” or “off” is definitely not energy efficient. In fact, the transitions are where the energy usage largely goes. We try to get through that transition period as rapidly as possible because minimal current flows when the transistors reach the on or off state.
Sounds like you might be thinking of power electronic circuits rather than CMOS. In a CMOS logic circuit, current does not flow from Vdd to ground as long as either the p-type or the n-type transistor is fully switched off. The circuit under discussion was operated in subthreshold mode, in which one transistor in a complementary pair is partially switched on and the other is fully switched off. So it still only uses power during transitions, and the energy consumed in each transition is lower than in the normal mode because less voltage is switched at the transistor gate.
> In a CMOS logic circuit, current does not flow from Vdd to ground as long as either the p-type or the n-type transistor is fully switched off.
Right, but how do you get the transistor fully switched off? Think about what happens during the time when it’s transitioning between on and off.
You can run the transistors from the previous stage in a different part of the curve, but that’s not an isolated effect. Everything that impacts switching speed and reduces the current flowing to turn the next gate on or off will also impact power consumption.
There might be some theoretical optimization where the transistors are driven differently, but at what cost in extra silicon? And how delicate is the balance between squeezing out a little more efficiency and operating so close to the margin that minor manufacturing variations become outsized problems?
Seems like this overfitting problem could have been trivially fixed by running it on more than one chip, no?
Unfortunately not. This is analogous to writing a C program that relies on undefined behavior specific to the architecture and CPU of your development machine. It’s not portable.
The behavior could change from one manufacturing run to another. The behavior could disappear altogether in a future revision of the chip.
The behavior could even disappear if you change some other part of the design such that the logic gets relocated to a different set of cells on the chip. This was noted in the experiment, where certain behavior depended on logic being placed in a specific location, which produced particular timings.
If you rely on anything other than the behavior defined by the specifications, you’re at risk of it breaking. This is a problem with arriving at empirical solutions via guess and check, too.
Ideally you’d do everything in simulation rather than on-chip where possible. The simulator would only function in ways supported by the specifications of the chip without allowing undefined behavior.
> The behavior could change from one manufacturing run to another. The behavior could disappear altogether in a future revision of the chip.
That's the overfitting they were referring to. Relying on the individual behaviour is the overfit. Running on multiple chips (at learning time) reduces the benefit of using an improvement that is specific to one chip.
You are correct that simulation is the better solution, but you have to do more than just limit the simulation to the specified operating range of the components: you also have to introduce variances comparable to the specified manufacturing tolerances. If the simulator assumed that two similar components behaved absolutely identically, then within-tolerance manufacturing differences could be magnified into failures.
The previous poster was probably thinking about very low power analog circuits or extremely slow digital circuits (like those used in wrist watches), where the on-state of the MOS transistors is in the subthreshold conduction region (while the off state is the same off state as in any other CMOS circuits, ensuring a static power consumption determined only by leakage).
Such circuits are useful for something powered by a battery that must have a lifetime measured in years, but they cannot operate at high speeds.
In other words, optimization algorithms in general are prone to overfitting. Fortunately there are techniques to deal with that. The catch is that once you find a solution that generalizes across different chips, it probably won't be as small as the chip-specific one.