Comment by fajitaforce5

7 hours ago

I was an Intel CPU architect when Transmeta started making claims. We were baffled by those claims. We were pushing the limit of our pipelines to get incremental gains and they were claiming to beat a dedicated arch on the fly! None of their claims made sense to ANYONE with a shred of CPU arch experience. I think your summary has rose-colored lenses, or reflects the layman’s perspective.

I think this is a classic hill-climbing dilemma. If you start in the same place, and one org has worked very hard and spent a lot of money optimizing the system, they will probably come out on top. But if you start in a different place, reimagining the problem from first principles, you may or may not find yourself with a taller hill to climb. Decisions made very early on in your hill-climbing process lock you into a path, and then the people tasked with optimizing the system later can't fight the organizational inertia to backtrack and pick a different path. But a new startup can.

It's worth noting that Google actually did succeed with a wildly different architecture a couple years later. They figured "Well, if CPU performance is hitting a wall - why use just one CPU? Why not put together thousands of commodity CPUs that individually are not that powerful, and then use software to distribute workloads across those CPUs?" And the obvious objection to that is "If we did that, it won't be compatible with all the products out there that depend upon x86 binary compatibility", and Google's response was the ultimate in hubris: "Well we'll just build new products then, ones that are bigger and better than the whole industry." Miraculously it worked, and made a multi-trillion-dollar company (multiple multi-trillion-dollar companies, if you now consider how AWS, Facebook, TSMC, and NVidia revenue depends upon the cloud).

Transmeta's mistake was that they didn't re-examine enough assumptions. They assumed they were building a CPU rather than an industry. If they'd backed up even farther they would've found that there actually was fertile territory there.

  • > It's worth noting that Google actually did succeed with a wildly different architecture a couple years later. They figured "Well, if CPU performance is hitting a wall - why use just one CPU? Why not put together thousands of commodity CPUs that individually are not that powerful, and then use software to distribute workloads across those CPUs?" And the obvious objection to that is "If we did that, it won't be compatible with all the products out there that depend upon x86 binary compatibility", and Google's response was the ultimate in hubris: "Well we'll just build new products then, ones that are bigger and better than the whole industry." Miraculously it worked, and made a multi-trillion-dollar company (multiple multi-trillion-dollar companies, if you now consider how AWS, Facebook, TSMC, and NVidia revenue depends upon the cloud).

    Except "the cloud" at that point was just a large number of normal desktop-architecture machines: not a new ISA or machine type, but ordinary hardware running an entirely normal OS and libraries. At no point did Google or Amazon or Microsoft make people port/rewrite all of their software for cloud deployment.

    At the point that Google's "bunch of cheap computers" was new, CPU performance was still rapidly improving. The competition was traditional "big iron" or mainframe systems, and the novelty was in achieving high reliability through distribution, rather than building on fault-tolerant hardware. By the time the rate of CPU performance improvement was slowing in the mid-2000s, large clusters of smaller machines were omnipresent in supercomputing and HPC applications.

    The real "new architecture(s)" of this century are GPUs, but much of their development and success is the result of many iterations and a lot of convergent evolution.

    • > At the point that Google's "bunch of cheap computers" was new

      It wasn't even new; people just don't know the history. Inktomi and HotBot were based on a fleet of commodity PC servers with low reliability, whereas other large web properties of the time were buying big iron like the Sun E10K. And of course Beowulf clusters were a thing.

      And as far as I know, Google's early ethos didn't come from some far-sighted strategy, but from the practical reality of Page and Brin building the first versions of their search engine on borrowed/scavenged hardware as grad students and then continuing on that trajectory.


  • That’s revisionist. Transmeta set out to write a software-like CPU core. That will always lose to dedicated hardware.

  • > Well we'll just build new products then, ones that are bigger and better than the whole industry.

    With blackjack, and hookers!

Even the people on comp.arch at the time were baffled. No one believed it.

  • The discussions on comp.arch from that era are a gold mine. There were lead architects from the P4 team, from the Alpha team, Linus himself during his Transmeta days... all talking very frankly about the concerns of computer architecture at the time.

Not completely baffling. Intel made an attempt at the time to create a Transmeta-like hybrid software/hardware architecture on one of their "VLIW" processors. It was an expensive experiment that didn't work out.

  • I think the i860 originally had such a mode. But then they went ahead and doubled down on Itanium.

The Itanium felt like Intel making the same bet - move the speculation and analysis logic into the compiler and off the CPU. But where it differed was that it tried to leave some internal implementation details of that decoding process exposed so the compiler could target it directly, in a way that Transmeta didn't manage.

I wonder how long before we try it again.

I recall one of the biggest concerns around that time was that OOOE techniques would not continue scaling in width or depth, and that other techniques would be needed. This turned out to be true, but it was not some fringe idea -- the entire industry turned on this. Intel designed the narrow and less "brainy" Pentium 4 and hoped to achieve performance with frequency, and with HP they designed the in-order Itanium lines. AMD worked on the speed-demon K9. IBM did the in-order POWER6 that got performance from high frequency and runahead speculative execution. Nvidia did a similar thing to Transmeta too, quite a while later IIRC.

All failures. Everybody went back to more conventional out-of-order designs and found ways to keep scaling those.

I'm sure there were some people at all these companies who were always OOOE proponents and disagreed with these other approaches, but I think your summary has poop-colored lenses :) It's a little uncharitable to say their ideas were nonsense. The reality is that this was a very uncertain and exploratory time, and many people with large shreds of CPU arch experience all did wildly different things, and many went down the wrong roads (with hindsight).

Wasn't Intel trying to do something similar with Itanium, i.e. use software to translate code into VLIW instructions to exploit many parallel execution units? Only they wanted the C++ compiler to do it rather than a dynamic recompiler. At least some people at Intel thought that was a good idea.

I wonder if the x86 teams at Intel were similarly baffled by that.

  • Itanium wasn’t really focusing on running x86 code. Intel wanted native Itanium software, and x86 execution was a bonus.

  • Adjacent but not the same bet.

    EPIC, aka Itanium, was conceived around trace-optimizing compilers being able to find enough instruction-level parallelism to pack operations into VLIW bundles, as this would eliminate the increasingly complex and expensive machinery necessary to do out-of-order superscalar execution.

    This wasn't a proven idea at the time, but it also wasn't considered trivially wrong.

    What happened is that the combination of OoO speculation, branch predictors, and fat caches ended up working a lot better than anticipated. In particular, branch predictors went from fairly naive initial designs to shockingly good predictions on real-world code.

    The result is that conventional designs increasingly trounced Itanium while the latter was still baking in the oven. By the time it was shipping it was clear the concept was off target, but at that point Intel/HP et al. had committed so much that they tried to just bully the market into making it work. The later versions of Itanium ended up adding branch prediction and more cache capacity as a capitulation to reality, but that wasn't enough to save the platform.

    Transmeta was making a slightly different bet, which was that x86 code could be dynamically translated to run efficiently on a VLIW CPU. The goal here was twofold:

    First, to sidestep IP issues around shipping an x86-compatible chip. There's a reason AMD and Cyrix were the only companies to have shipped Intel alternatives in volume in that era. Transmeta didn't have the legal cover they did, so this dynamic translation approach sidestepped a lot of potential litigation.

    Second, dynamic translation to VLIW could in theory be more power efficient than a conventional architecture. VLIW at the hardware level is kinda like if a CPU just didn't have a decoder. Everything being statically scheduled also reduces design pressure on register file ports, etc. This is why VLIW is quite successful in embedded DSP-style stuff. In theory, because the dynamic translation pays the cost of compiling a block once and then calls that block many times, you could get a net efficiency gain despite the cost of the initial translation (a toy sketch of this translate-once, run-many idea appears at the end of this comment). Additionally, having access to dynamic profiling information could in theory counterbalance the problems EPIC/Itanium ran into.

    So this also wasn't a trivially bad idea at the time. Transmeta specifically targeted x86-compatible laptops, as that was a bit of a sore point in the Wintel world at the time and the potential power efficiency benefits could motivate sales even if absolute performance was still inferior to Intel's.

    From what I recall hearing from people who had them at the time, the Transmeta hardware wasn't bad, but it had the sort of random compatibility issues you'd expect and otherwise wasn't compelling enough to win in the market against Intel. Note this was also before ARM rose to dominate low-power mobile computing.

    Transmeta ultimately failed, but some of their technical concepts live on in how language JITs and GPU shader IRs work today, and in how Apple used translation to migrate off both PowerPC and x86 in turn.

    In the case of both Itanium and Transmeta, I'd say it's historically inaccurate to claim they were obviously or trivially wrong at the time people made these bets.
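
    To make the translate-once, run-many argument above concrete, here is a minimal, purely illustrative sketch in Python. Everything in it is invented for illustration (the 3-slot bundle, the dict-based ops, the translate/run_block names); it is not Transmeta's actual Code Morphing Software or any real ISA. It greedily packs independent ops into fixed-width bundles, roughly the job an EPIC compiler or a dynamic translator targeting a VLIW core has to do, and caches the result per block so the translation cost is paid only on the first execution:

        from collections import namedtuple

        # Hypothetical 3-slot VLIW bundle: ops with no data dependence on each other may share one.
        Bundle = namedtuple("Bundle", ["slots"])

        def depends_on(op, earlier):
            # True if op reads a register that an earlier op in the same bundle writes.
            return any(op["src"] & e["dst"] for e in earlier)

        def translate(block):
            # Greedy scheduler: pack independent ops into bundles of up to 3 slots.
            bundles, current = [], []
            for op in block:
                if len(current) == 3 or depends_on(op, current):
                    bundles.append(Bundle(tuple(current)))
                    current = []
                current.append(op)
            if current:
                bundles.append(Bundle(tuple(current)))
            return bundles

        translation_cache = {}  # block address -> translated bundles

        def run_block(addr, block):
            # Pay the translation cost only on the first visit; hot blocks amortize it.
            if addr not in translation_cache:
                translation_cache[addr] = translate(block)
            return translation_cache[addr]

        # Three independent adds share one bundle; the dependent mul starts a new one.
        block = [
            {"op": "add", "dst": {"r1"}, "src": {"r2", "r3"}},
            {"op": "add", "dst": {"r4"}, "src": {"r5", "r6"}},
            {"op": "add", "dst": {"r7"}, "src": {"r8", "r9"}},
            {"op": "mul", "dst": {"r10"}, "src": {"r1", "r4"}},
        ]
        print(run_block(0x1000, block))

    The point of the cache is the amortization: a hot loop body gets translated once and then re-dispatched many times, which is also where a dynamic translator can fold in profiling information that a static EPIC compiler never sees.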

It was risky.

From my perspective it was more exciting to the programming systems and compiler community than to the computer architecture community.