Working on complex systems: What I learned working at Google

3 days ago (thecoder.cafe)

One of my pet peeves with the use of "complex(ity)" outside the traditional time/space sense in computer science is that, most of the time, the authors of articles around the internet do not make the distinction between bounded/arbitrary complexity, where the person usually has most of the control over what is being implemented, and domain/accidental/environmental complexity, which is wide open and carries a lot of intrinsic and mostly unsolvable constraints.

Yes, they are Google; yes, they have a great pool of talent; yes, they do a lot of hard stuff; but most of the time when I read these articles, I miss those kinds of distinctions.

I'm not lowballing the folks at Google, they do amazing stuff, but some domains of domain/accidental/environmental complexity (e.g. sea logistics, manufacturing, industry), where most of the time you do not have the data, are, I believe, way more complex and harder than most of the problems they deal with.

  • I’d wager 90% of the time spent at Google is fighting incidental organizational complexity, which is virtually unlimited.

    • The phrase thrown around was “collaboration headwind”: the idea was that if project success depends on 1 person with a 95% chance of success, the project also has a 95% chance of success. But if 10 people each need to succeed with a 95% chance, the project’s likelihood of success suddenly drops to about 60%…

      In reality, lazy domain owners layered on processes, meetings, documents, and multiple approvals until it took 6 months to change the text on a button, ugh
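The compounding in the “collaboration headwind” figure above works out as 0.95^10 ≈ 0.60. A quick sketch, assuming each person’s success is independent (the function name is mine, not from the article):

```python
# Probability a project succeeds when it depends on n people,
# each of whom must independently succeed with probability p.
def project_success(p: float, n: int) -> float:
    return p ** n

print(round(project_success(0.95, 1), 3))   # one person: 0.95
print(round(project_success(0.95, 10), 3))  # ten people: 0.599
```

The independence assumption is generous; correlated failures (shared deadlines, shared approvers) usually make the real number worse.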


    • And when you’re at a smaller company, 90% of your time is fighting societal complexity, the limit of which also approaches infinity, but at a steeper angle.

      No “true Scotsman” can tell you that reality is surprisingly complex; sometimes you have the resources to organize and fight it, and sometimes you use those resources more wisely than another group of people and can share the lessons. Sometimes you just have no idea whether your lesson is even useful. Let’s judge the story on its merits and learn what we can from it.


  • I think this is addressed by the complex vs. complicated intro. Most problems with uncontrolled/uncontrollable variables will be approached with an incremental solution: you restrict those variables, voluntarily or involuntarily, and let issues be solved organically/manually, or the automation will plainly and simply be abandoned.

    This qualifies as complicated. Delving into complicated problems is mostly driven by business opportunity, always has limited scaling, and tends to be discarded by big players.

    • I don't think it is, because the intro gets it wrong. If a problem's time or space complexity increases from O(n^2) to O(n^3), there's nothing necessarily novel about that; it's just... more.

      Complicated on the other hand, involves the addition of one or more complicating factors beyond just "the problem is big". It's a qualitative thing, like maybe nobody has built adequate tools for the problem domain, or maybe you don't even know if the solution is possible until you've already invested quite a lot towards that solution. Or maybe you have to simultaneously put on this song and dance regarding story points and show continual progress even though you have not yet found a continuous path from where you are to your goal.

      Climate change is both, doing your taxes is (typically) merely complex. As for complicated-but-not-complex, that's like realizing that you don't have your wallet after you've already ordered your food: qualitatively messy, quantitatively simple.

      To put it differently, complicated is about the number of different domains you have to consider; complex is about how difficult the considerations within a given domain are.

      Perhaps the author's usage is common enough in certain audiences, but it's not consistent with how we discuss computational complexity. Which is a shame since they are talking about solving problems with computers.

    • I don't think this is adequately addressed by the "complicated vs. complex" framing—especially not when the distinction is made using reductive examples like taxes (structured, bureaucratic, highly formalized) versus climate change (broad, urgent, signaling-heavy).

      That doesn’t feel right.

      Let me bring a non-trivial, concrete example—something mundane: “ePOD,” which refers to Electronic Proof of Delivery.

      ePOD, in terms of technical implementation, can be complicated to design for all the logistics companies out there, like Flexport, Amazon, DHL, UPS, and so on.

      The implementation itself (e.g., the box with a signature drawing field and a "confirm" button) can be as complicated as they want from a pure technical perspective.

      Now comes, for me at least, the complex part: in some logistics companies, the ePOD adoption rate is circa 46%. In other words, for 54% of all deliveries you have no real-time way (nothing sooner than 36–48 hours) to know and track whether the person received the goods. Unsurprisingly, most of those are still done on paper. And we have:

      - Truck drivers are often independent contractors.

      - Rural or low-tech regions lack infrastructure.

      - Incentive structures don’t align.

      - Digitization workflows involve physical paper handoffs, WhatsApp messages, or third-party scans.

      So the real complexity isn't only the technical implementation of ePOD, but: "given the ePOD, how do we maximize its adoption/coverage with so much uncertainty, fragmentation, and human unpredictability on the ground?"

      That’s not just complicated, it’s complex, because we have:

      - Socio-technical constraints,

      - Behavioral incentives,

      - Operational logistics,

      - Fragmented accountability,

      - And incomplete or delayed data.

      We left the highly controlled scenario (an arbitrarily bounded technical implementation) that could be considered complicated (if we want to be reductionist, as the OP has done), and now we’re navigating uncertainty and N things that can go wrong.

  • If you consider their history of killing well-loved products and foisting unwarranted products such as Google Plus onto customers, Google is, for lack of a better word, just plain stupid. Google is like a person with an IQ of 200 who would get run over by oncoming traffic because they have zero common sense.

  • I've not seen "accidental" complexity used to mean "domain" (or "environmental" or "inherent") complexity before. It usually means "the complexity you created for yourself and isn't fundamental to the problem you're solving"

  • Also, anything you do with enterprise (cloud) customers. People like to talk about scale a lot and data people tend to think about individual (distributed) systems that can go webscale. A single system with many users is still a single system. In enterprise you have two additional types of scale:

    1) scale of application variety (10k different apps with different needs and history)

    2) scale of human capability (ingenuity), this scale starts from sub-zero and can go pretty high (but not guaranteed)

  • I'm a HW engineer and don't really understand "complexity" as far as this article describes it. I didn't read it in depth, but it doesn't really give any good examples with specifics. Can someone give a detailed example of what the author is really talking about?

> My immediate reaction in my head was: "This is impossible". But then, a teammate said: "But we're Google, we should be able to manage it!".

Google, where the impossible stuff is reduced to merely hard, and the easy stuff is raised to hard.

  • Or "How many MDB groups do I need to get approved to join over multiple days/weeks, before I can do the 30 second thing I need to do?"

    Do not miss

  • “The difficult we do immediately. The impossible takes a little longer.” (WW2 US Army Corps of Engineers)

    • >“the difficult we do immediately. The impossible takes a little longer”

      This was posted in my front office when I started my company over 30 years ago.

      It was a no-brainer, same thing I was doing for my employer beforehand. Experimentation.

      By the author's own distinction in terminology, the complexity relative to the complications in something like Google's technology is on a different scale from the absolute chaos you're left with when you apply the same lens to natural science.

      I learned how to do what I do directly from people who did it in World War II.

      And that was when I was over 40 years younger, and I'm not done yet. Still carrying the baton in an industrial environment where the institutions have a pseudo-military hierarchy and bureaucracy, which I'm very comfortable working around ;)

      Well, the army is a massive mainstream corp.

      There are always some things that corps don't handle very well, but generals don't always care; if they have overwhelming force to apply, lots of different kinds of objectives can be overcome.

      Teamwork, planning, military-style discipline & chain-of-command/org-chart, strength in numbers, all elements which are hallmarks of effective armies over the centuries.

      The engineers are an elite team among them. Traditionally like the technology arm, engaged to leverage the massive resources even more effectively.

      The bigger the objective, the stronger these elements will be brought to bear.

      Even in an unopposed maneuver, as the army steam-rolls all easily recognized obstacles more and more effectively and ups the ante, bigger and bigger unscoped problems accumulate at the same time, exactly the kind that cannot be solved with teamwork and planning (since those are often completely forbidden). Then there must be extreme individual ability far beyond that, and it must emanate from the top decision-maker or have "equivalent" access to the top individual decision-maker. In other words, such a person might as well not even be "in" the org chart, since it's just a few individuals directly attached to the top square; nobody is working for further promotions or recognition beyond that point.

      That's when military discipline in practice is simply not enough discipline, and not exactly the kind that's needed, by a long shot.

      That's why even in the military there are a few Navy Seals here and there, because sometimes there are serious problems that are the kind of impossible that a whole army cannot solve ;)

> My immediate reaction in my head was: "This is impossible". But then, a teammate said: "But we're Google, we should be able to manage it!".

"We can do it!" confidence can be mostly great. (Though you might have to allow for the possibility of failure.)

What I don't have a perfect rule for is how to avoid that twisting into arrogance and exceptionalism.

Like, "My theory is correct, so I can falsify this experiment."

Or "I have so much career potential, it's to everyone's advantage for me to cheat to advance."

Or "Of course we'll do the right thing with grabbing this unchecked power, since we're morally superior."

Or "We're better than those other people, and they should be exterminated."

Maybe part of the solution is to respect the power of will, effort, perseverance, processes, etc., but to be concerned when people don't also respect the power and truth of humility, and start thinking of individual/group selves as innately superior?

  • Sorry to say, but this sounds a bit like a fantasy. I think the vast majority of Google employees don't see themselves as particularly brilliant or special. Even there, lots of people have imposter syndrome.

    Actually, I've found this is a constant in life: whatever you achieve, you end up in a situation where you're pretty average among your peers. You may feel proud of getting into Google for a few months, and then you're quickly humbled.

There is a certain amount of irony when the cookie policy agreement is buggy on a story about complicated & complex systems.

Clicking on "Only Necessary" causes the cookie policy agreement to reappear.

I think there are two myths applicable here. Probably more.

One myth is that complex systems are inherently bad. Armed forces are incredibly complex; that's why it can take 10 or more rear-echelon staff to support one fighting soldier. Supply-chain logistics and materiel are complex. Middle ages wars stopped when gunpowder supplies ran out.

Another myth is that simple systems are always better and remain simple. They can be, yes. After all, DNA exists. But some beautiful things demand complexity built up from simple things. We still don't entirely understand how DNA and environment combine. Much is hidden in this simple system.

I do believe one programming language might be a rational simplification, if you exclude all the DSLs which people implement to tune it.

  • > Middle ages wars stopped when gunpowder supplies ran out.

    The arquebus was the first mass gunpowder weapon, and it didn't see large-scale use until around the 1480s, at the very, very tail end of the Middle Ages (the exact end date people use varies by topic and region, but 1500 is a good, round date for the end).

    In Medieval armies, your limiting factor is generally that food is being provided by ransacking the local area for food and that a decent portion of your army is made up of farmers who need to be back home in the harvest season. A highly competent army might be able to procure food without acting as a plague on all the local farmlands, but most Medieval states lacked sufficient state capacity to manage that (in Europe, essentially only the Byzantines could do that).

  • Following the definition from the article, the armed forces seem like a complicated system, not a complex one. There is a structured, repeatable solution for armed forces. They don't exhibit the hallmark characteristics of complex systems listed in the article, like emergent behaviors.

    • Not a fan of the article for this reason alone. Good points are made, but there's no reason to redefine perfectly good words when we already have words that work fine.

  • Agreed. The problem is not complexity. Every system must process a certain amount of information, and the system's complexity must be able to match that amount. The fundamental problem is designing systems that can manage complexity, especially runaway complexity.

  • > Middle ages wars stopped when gunpowder supplies ran out

    • Ukraine would have been conquered by russia rather quickly if russians weren't so hilariously incompetent at these complex tasks, war logistics being the king of them. Remember that 64 km queue of heavy machinery [1] just sitting still? That was 2022, and we're talking about fuel and food, the basics of logistics support.

    [1] https://en.wikipedia.org/wiki/Russian_Kyiv_convoy

I think the definitions of complex/complicated get muddled with the question of whether something is truly a closed system. Oftentimes something is called "complex" when all people mean is that their model doesn't incorporate the externalities. But I don't know if I've come across a description of a truly closed system that has "emergent behavior". I don't know if LLMs qualify.

This mostly overlaps with the definition of a 'complex system' at:

https://en.wikipedia.org/wiki/Complex_system

although as I understood it, the key to a system being complex (as opposed to complicated) is having a large number of types of interaction. So a system with a large number of parts is not enough; those parts have to interact in a number of different ways for the system to exhibit emergent effects.

Something like that. I remember reading a lot of books about this kind of thing a while ago :)

There are typos and rough grammar in the first few paragraphs and I am actually very happy about that because I know I'm not reading LLM slop.

  "This is one possible characteristic of complex systems: they behave in ways that can hardly be predicted just by looking at their parts, making them harder to debug and manage."

To be honest, this doesn't sound too different from many small and medium-sized internet projects I've worked on, because of the asynchronous nature of the web: promises, timing issues, and race conditions lead to weirdness that's pretty hard to debug, because you have to "play back" the cascading randomness of request timing, responses, encoding, browser/server shenanigans, etc.

Except computers attempt to model mathematics in an ideal world.

Unless your problem comes from side effects on a computer that can't be modeled mathematically, there is nothing technically stopping you from modeling the problem as a mathematical problem and then solving it via mathematics.

The output of an LLM, for example, can't be modeled; we literally do not understand it. Are the problems faced by SREs exactly the same? You give a system an input of B and you can't mathematically predict the output A? It doesn't even have to be a single equation; a simulation can do it.

  • I think the vast majority of SRE problems are in the “side effects” category. But higher level than the hardware-level side effects of the computer that you might be imagining.

    The core problem is building a high enough fidelity model to simulate enough of the real world to make the simulation actually useful. As soon as you have some system feedback loops, the complexity of building a useful model skyrockets.

    Even in “pure” functions, the supporting infrastructure can be hard to simulate and critical in affecting the outputs.

    Even doing something simple like adding two numbers requires an unimaginable amount of hidden complexity under the hood. It is almost impossible for these things to not have second-order effects and emergent behaviour under enough scale.

This is all exacerbated by a ton of the ML stack being in Python, for some god-forsaken reason.

  • How is the choice of language the cause of anything complex/complicated?

    Both Python and Rust (for instance) are Turing complete, and equally capable.

Let's add a post scriptum:

Whatever you're working on, your project is not likely to be at Google's scale and very unlikely to be a "complex system".

  • Let's add a post post scriptum :)

    Just because your project might not be at Google's scale doesn't mean it is therefore also not complex [1]

    Example: I'd say plenty of games fit the author's definition of "complex systems". Even the well-engineered ones (and even some which could fit on a floppy disc)

    [1]: https://en.m.wikipedia.org/wiki/Affirming_the_consequent

  • IMO even a more interesting observation is that even Google itself doesn't necessarily work on large scale, e.g. many regionalised services in Google Cloud don't have _that_ many requests in each region, allowing for a much simpler architecture compared to behemoths like GMail or Maps

  • Complex is orthogonal to large. Some small-to-medium-scale systems address an incredibly complex problem space. Some large systems solve relatively simple problems. Of course, I do agree that size introduces its own complexity.

  • IMO, what we term "complex" tends to be whatever the current setup/system struggles to deal with or manage. Relatively speaking, Google has much, much higher complexity, but it doesn't matter as much, because even in simpler cases we are dealing with a huge amount of variety and possible states, and the principles of managing that remain the same regardless of scale.

  • For a small scale one can build a simple system, but I see many trying to copy FAANG architecture anyway. IMHO it's a fallacy: people think that if they copy the architecture used by Google, their company will be successful like Google. I think it's the other way around: Google has to build complex systems because it has many users.

    • Yes, it's called "cargo cult" and it applies to a lot of architecture and processes decisions in IT :)

    • It’s an infectious disease among developers. Some people would spend weeks making a simple landing page, and it would require at least 3 different cloud services.