The Entropy of Complex Systems
Part 1 of Engineering Judgment
Complex systems do not usually fail because one engineer made one mistake.
They fail because the number of possible states has grown faster than the organization's ability to understand, test, and control them.
This is the part of engineering we tend to underestimate. A new feature is easy to describe. A new branch in the state space is harder to see. A compatibility path, a retry rule, a migration exception, a temporary fallback that becomes permanent - each one looks small. Together they create a system whose behavior no single person can fully hold in their head.
That is what I mean by entropy in software.
Not disorder as a vague metaphor. Not code that looks messy. Software entropy is the growth of possible system behaviors, especially the behaviors that emerge from combinations nobody explicitly designed.
The dangerous thing is that entropy often arrives disguised as progress.
Functionality Grows Additively. State Space Does Not.
Product roadmaps usually count features one by one.
Engineering systems do not experience them that way.
If a system has several modules, and each module can be in multiple states, the number of possible system states is bounded by the product of the per-module state counts, not their sum. A small feature can add new states to one module, but it can also add interactions with every other module that depends on it.
That is why a change that looks local in the code review can become global in production.
The feature itself may be small. The new behavior surface is not.
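A rough calculation makes the gap concrete. The sketch below is an upper bound, assuming modules vary independently; the module names and state counts are hypothetical, chosen only for illustration.

```python
from math import prod

def state_space_size(states_per_module):
    """Upper bound on joint system states: the product of per-module counts.

    Assumes every combination is reachable, which real systems
    usually restrict, so treat the result as a ceiling.
    """
    return prod(states_per_module.values())

# Hypothetical system with four modules.
modules = {"api": 3, "scheduler": 4, "storage": 5, "recovery": 2}
before = state_space_size(modules)

# A "small" feature adds a single new state to the storage module.
after = state_space_size({**modules, "storage": 6})

print(before, after, after - before)  # 120 144 24
```

One new local state became 24 new joint states, because it combines with every state of every other module.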
This distinction matters because most planning discussions reason about implementation size. How many files change? How many APIs are touched? How many weeks will it take?
Those questions are useful, but incomplete. A more important question is:
How much new state space does this introduce?
If a feature adds one clean capability inside a well-isolated boundary, the cost may be reasonable. If it creates new interactions across the core execution path, persistence layer, recovery path, and operational tooling, the feature is not small, even if the diff is.
Many production incidents begin with this mismatch. The implementation looked bounded. The behavior was not.
Accidents Are Often Normal
Engineers naturally want to find the bug.
The bug matters. But in complex, tightly coupled systems, the bug is often the final expression of a deeper condition: the system has accumulated enough hidden interactions that some combination will eventually escape the design.
Charles Perrow's Normal Accident Theory is uncomfortable because it removes the comfort of blaming only local mistakes. In systems that are both interactively complex and tightly coupled, accidents are not exceptional. They are normal outcomes of the system's structure.
This does not mean engineering quality does not matter. It means quality has to include the control of complexity itself.
A blockchain network halt, a distributed database split-brain, a cascading service outage, a failed migration rollback - these are usually explained afterward as specific bugs. That explanation is true, but too narrow.
The deeper pattern is often the same:
- a local change touched a global assumption
- a fallback path behaved differently from the primary path
- an old compatibility rule interacted with a new feature
- recovery logic was less tested than forward execution
- the team optimized the happy path and underestimated the state space around failure
The incident is visible. The entropy was already there.
Coupling Turns Small Failures Into System Failures
Complexity is not just the number of parts. It is the way parts depend on each other.
Early systems are usually understandable because their boundaries are still sharp. There is one team, one architecture, one deployment path, one mental model. When something breaks, the search space is small.
Then the system succeeds.
More teams arrive. More features arrive. More integrations arrive. The original boundaries become historical suggestions. A module that used to have one job now has three responsibilities and several exception paths. A migration that was supposed to remove old behavior keeps both paths alive because one customer still depends on the old one. A "temporary" compatibility layer becomes part of the platform.
Eventually, local reasoning stops working.
A change in one place modifies timing in another. A new cache changes consistency assumptions. A retry rule multiplies load during partial failure. A schema evolution breaks an operational script nobody remembered.
This is where systems become dangerous: not when they are obviously chaotic, but when they still look modular while the real coupling has moved underneath the abstractions.
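The retry case is the easiest of these to quantify. The sketch below assumes a hypothetical stack in which each layer retries independently and every attempt fails, which is the worst case during a partial outage; the layer names and attempt counts are illustrative.

```python
from math import prod

def worst_case_attempts(attempts_per_layer):
    """Worst-case calls reaching the lowest dependency for one user request.

    Assumes every layer retries independently and every attempt fails,
    so the attempt counts of stacked layers multiply.
    """
    return prod(attempts_per_layer)

# One layer allowing 3 attempts.
single_layer = worst_case_attempts([3])

# Hypothetical stack: client, gateway, service, each allowing 3 attempts.
stacked = worst_case_attempts([3, 3, 3])

print(single_layer, stacked)  # 3 27
```

Each retry rule looks reasonable in isolation. Stacked, they turn one failing request into 27 calls against the dependency that is already struggling.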
Abstractions Hide Complexity Until They Leak
Abstractions are necessary. Without them, no large system can be built.
But abstractions do not delete complexity. They move it.
Good abstractions move complexity to a place where it can be controlled. Bad abstractions hide complexity until the moment when control matters most.
This is especially visible in infrastructure systems. A clean API may hide persistence semantics. A simple transaction interface may hide scheduler behavior. A runtime abstraction may hide memory, state access, or recovery constraints. Most of the time, the abstraction holds. Then a rare production condition appears and the hidden layer becomes the only layer that matters.
At that point, engineers have to reason through multiple levels at once:
- the public interface
- the implementation details
- the historical exceptions
- the operational environment
- the failure mode currently unfolding
The problem is not that abstractions are bad. The problem is forgetting that every abstraction has a maintenance cost, and that cost grows when the system evolves.
The Largest Source of Entropy Is Often Organizational
Technical entropy and organizational entropy reinforce each other.
A system may begin with one small team and one clear objective. Years later, it has several teams, overlapping ownership, multiple roadmaps, and layers of compatibility promises. The codebase now reflects not just technical decisions, but organizational history.
This is why some systems become hard to simplify even when everyone agrees they are too complex.
Removing a feature may require negotiating with another team. Deleting an old path may require changing support commitments. Simplifying an architecture may require admitting that previous decisions no longer make sense. The technical work is real, but the organizational cost is often the blocker.
When ownership is unclear, entropy wins.
Nobody is directly responsible for removing old complexity. Everyone is incentivized to add the new requirement they need. The system grows because growth has owners. Simplification does not.
This is how organizations accidentally build systems that nobody would intentionally design.
Entropy Reduction Is Engineering Work
The natural direction of a successful system is toward more complexity.
Entropy reduction has to be deliberate.
The best engineering teams are not the ones that add the most features. They are the ones that preserve the ability to understand and change the system after years of growth.
That requires practices that often look unglamorous.
Complexity budgets help. Before adding a feature, ask what long-term complexity it introduces. Does it add a new state, a new mode, a new operational path, a new ownership boundary, a new compatibility promise? If yes, who is paying for that complexity later?
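The questions above can be turned into a lightweight review check. This is a minimal sketch, not a prescribed process: the question keys and the function name unowned_costs are hypothetical.

```python
# The five complexity questions from the paragraph above, as hypothetical keys.
COST_QUESTIONS = (
    "new_state",
    "new_mode",
    "new_operational_path",
    "new_ownership_boundary",
    "new_compatibility_promise",
)

def unowned_costs(answers, owner=None):
    """Return the complexity costs a proposal adds without a named owner.

    answers maps question keys to True/False; missing keys count as False.
    An empty result means either no new complexity or someone has agreed
    to pay for it later.
    """
    costs = [q for q in COST_QUESTIONS if answers.get(q, False)]
    return [] if owner else costs

proposal = {"new_mode": True, "new_compatibility_promise": True}
print(unowned_costs(proposal))                   # ['new_mode', 'new_compatibility_promise']
print(unowned_costs(proposal, owner="storage"))  # []
```

The point of the check is not the tooling. It forces the "who is paying" question to be answered before the feature ships.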
Rollback-first design helps. If a change cannot be safely rolled back, the team should treat that as a risk signal, not an implementation detail. Rollback is not just an operations feature. It is evidence that the system is still controllable.
Deletion helps more than teams expect. Most organizations are skilled at adding functionality. Very few are disciplined about removing obsolete modules, dead flags, old compatibility paths, and redundant abstractions. But deletion is one of the few direct ways to reduce state space.
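Dead flags show the effect most clearly. The sketch below enumerates every configuration of a set of independent boolean flags; the flag names are hypothetical.

```python
from itertools import product

def flag_combinations(flags):
    """Enumerate every on/off configuration of independent boolean flags."""
    return [
        dict(zip(flags, values))
        for values in product([False, True], repeat=len(flags))
    ]

# Hypothetical flag set: two live flags and two that no caller still needs.
flags = ["new_cache", "async_writes", "legacy_auth", "old_export_format"]
with_dead_flags = len(flag_combinations(flags))

# Deleting the two dead flags.
after_deletion = len(flag_combinations(flags[:2]))

print(with_dead_flags, after_deletion)  # 16 4
```

Each boolean flag removed halves the number of configurations the system can be in, whether or not anyone was testing them.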
Simplicity over feature parity helps. "Competitors have this" is not enough. The better question is whether the feature solves a critical problem and whether its long-term value exceeds the complexity it introduces.
The hardest engineering decision is often not what to build.
It is what to stop carrying.
The Real Competition
In the short term, systems compete on capability.
In the long term, they compete on their ability to remain understandable.
A system that accumulates features faster than it accumulates clarity will eventually slow down. Changes become risky. Incidents become harder to diagnose. Engineers become afraid to touch certain paths. The organization starts working around the system instead of improving it.
That is entropy becoming strategy.
The job of engineering leadership is not only to accelerate delivery. It is to keep the system within the range where delivery remains possible.
That means treating complexity as a first-class cost. It means protecting iteration speed. It means creating ownership for deletion. It means asking, again and again, whether the system is still understandable enough to evolve.
Doing less is not always discipline. Sometimes it is avoidance.
But in complex systems, choosing not to add something can be one of the highest forms of engineering responsibility.
Because a system survives not by accumulating every possible feature, but by accumulating complexity more slowly than its ability to control it.