Modern IT landscapes do not break because people are careless.
They break because the systems they operate grow in complexity faster than teams can understand them.
Over the years, across insurance, finance, logistics and public-sector environments, the same patterns appear again and again. Different organizations, different architectures, different tools — and yet the pain is nearly identical.
This article outlines the most common pain points that shape the daily reality of operations teams. Not the superficial ones, but the structural forces that make modern infrastructure hard to understand, hard to predict, and sometimes hard to trust.
It is the foundation for the rest of this series.
1. Drift Everywhere
Drift is not an exception. Drift is the baseline. It accumulates from everyday operations:
- Hotfixes applied under pressure
- Manual fixes during incidents
- Side effects from patches
- New defaults after an update
- Differing histories between servers
- Different privileges or data sets per environment
Once two machines diverge even slightly, they continue drifting apart.
This happens silently and continuously, and most teams only notice it when a deployment behaves differently “for no obvious reason”.
Drift is not a bug.
It is the natural evolution of long-lived systems.
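A minimal sketch of how such divergence can be surfaced: capture a configuration snapshot per machine and diff them. The hostnames, keys, and values below are invented for illustration, not taken from any real tool.

```python
# Minimal drift check: diff two captured config snapshots.
# Hostnames and keys are hypothetical examples.

def diff_configs(a: dict, b: dict) -> dict:
    """Return keys whose values differ, or that exist on only one side."""
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

web01 = {"tls": "1.2", "timeout": 30, "hotfix_2024_11": True}
web02 = {"tls": "1.3", "timeout": 30}

drift = diff_configs(web01, web02)
# drift == {'tls': ('1.2', '1.3'), 'hotfix_2024_11': (True, None)} (key order may vary)
```

Two machines that were identical at install time now disagree on TLS version, and one carries a hotfix the other never received. Nothing alerted on either change.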
2. Defaults Nobody Sees
Defaults sound harmless, but in real environments they shape more behavior than explicit configuration.
A default:
- may be different per OS version
- may change with a patch
- may come from a library
- may be inherited from a global config
- may only apply if a value is missing
- may be deprecated but still active
- may be undocumented entirely
Defaults are a hidden configuration layer — and often the most influential one.
Ignoring defaults means ignoring part of the runtime architecture.
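The "applies only if a value is missing" behavior can be sketched with two hypothetical library versions that ship different built-in defaults. The version numbers and values are invented for illustration.

```python
# Illustrative only: a default applies exactly when the explicit value is
# missing, so upgrading a library (which ships its own DEFAULTS) silently
# changes behavior even though no config file changed.
from collections import ChainMap

DEFAULTS_V1 = {"timeout": 60, "retries": 3}   # hypothetical: shipped with v1
DEFAULTS_V2 = {"timeout": 30, "retries": 3}   # hypothetical: v2 lowered the timeout

explicit = {"retries": 5}                      # only what the operator wrote down

effective_v1 = ChainMap(explicit, DEFAULTS_V1)
effective_v2 = ChainMap(explicit, DEFAULTS_V2)

print(effective_v1["timeout"])  # 60
print(effective_v2["timeout"])  # 30 -- no file changed, yet behavior did
```

The explicit `retries` value survives the upgrade; the unset `timeout` quietly does not.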
3. Configuration Layering That Nobody Fully Tracks
Most people think configuration lives in the file they edit.
But in reality, configuration is distributed across layers:
- local configs
- global configs
- inherited configs
- machine settings
- environment variables
- fallback logic
- secrets stores
- registry
- application-level runtime state
Each layer merges into the next.
Precedence rules are often undocumented or only understood by one senior engineer.
And the effective configuration — the one that actually runs — may not appear in any file.
This layering is powerful, but also where most surprises come from.
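One way to picture the merge, assuming a precedence order of environment variables over local config over global config over built-in defaults. The layer names, file paths, and the `APP_` prefix are illustrative, not taken from any particular tool.

```python
# Sketch of layered precedence (all names hypothetical): later layers win.
# The effective config exists in no single file -- it only exists merged.
import os

builtin_defaults = {"log_level": "WARN", "timeout": 30}
global_config    = {"log_level": "INFO"}               # e.g. /etc/app/config
local_config     = {"timeout": 10}                     # e.g. ./app.conf
env_overrides    = {k[4:].lower(): v for k, v in os.environ.items()
                    if k.startswith("APP_")}           # e.g. APP_TIMEOUT=5

effective = {**builtin_defaults, **global_config, **local_config, **env_overrides}
```

Note that editing the "obvious" file (`./app.conf`) cannot change `log_level` here, and an exported environment variable silently beats everything on disk.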
4. Documentation Lag (CMDB, Wikis, Tickets)
Documentation is always written in the past tense — it reflects what someone believed was true when they wrote it.
By the time it is consumed:
- new changes were applied
- a shortcut was taken
- a workaround stayed in place
- a hotfix was forgotten
- a version bumped a default
- a dependency changed its behavior
But the drift is not only technical.
Organizational factors accelerate the problem:
- responsibility changes
- team boundaries shift
- ownership becomes unclear
- documentation tasks lose priority
- processes grow faster than people can maintain them
Even when documentation exists, it rarely reflects reality.
This raises a deeper question: what is the real intent of documentation?
In theory, documentation should describe the system as it is.
In practice, most documentation describes the system as people believed it was, intended it to be, or want it to be.
The longer the distance between documentation and runtime behavior, the more documentation becomes opinion rather than fact.
5. Too Many Tools, Not Enough Understanding
Most organizations react to complexity with tools:
- more monitoring
- more dashboards
- more logs
- more alerts
- more pipelines
- more scanners
And yet the problem remains:
Tools observe symptoms, not causes.
A dashboard shows that the application slowed down.
It does not show that a default timeout changed.
Or that one machine drifted.
Or that a config layer overrode another.
Or that an implicit fallback kicked in after a dependency updated.
A dashboard cannot visualize what no tool captures in the first place.
6. The Human Factor: Tribal Knowledge
Companies rely heavily on unwritten knowledge:
- “This server is special because…”
- “We never upgrade that one component…”
- “Production uses a different default…”
- “That setting must never be touched…”
This knowledge is often:
- undocumented
- incomplete
- outdated
- lost when key people leave
- contradicted by reality
Operational understanding is fragmented, and very few people can explain the system end to end.
Not because they lack skill — but because the system has grown beyond individual comprehension.
7. Slow Feedback Loops
Many problems in infrastructure last longer than they should because feedback loops are slow:
- A change is applied
- Nobody knows the immediate effect
- The real effect becomes visible weeks later
- Symptoms appear in a different area
- Teams assume the cause is elsewhere
Complex systems don’t break directly.
They break through delayed interactions.
A weak feedback loop allows small errors to accumulate silently.
8. Regulatory Pressure and Process Saturation
In regulated environments, processes accumulate over time:
- mandatory approvals
- mandatory documentation
- mandatory change tickets
- mandatory checks
Every item is reasonable in isolation.
But together they create:
- process bottlenecks
- reduced speed
- reduced visibility
- unmaintained documentation
- people working “around” the process
Complexity moves faster than processes can adapt.
9. The Illusion of Stable Systems
Teams often assume that systems remain in the state they were installed or deployed in.
This is never fully true.
Long-lived systems accumulate:
- decisions
- defaults
- drift
- patches
- dependencies
- exceptions
- workarounds
However — and this is important — the degree of drift correlates inversely with the degree of automation.
The more a system relies on:
- repeatable pipelines
- declarative configuration
- consistent provisioning
- automated reconciliation
- automated rollout and rollback
- enforced defaults
- standardized base images
…the less room there is for unintentional divergence between environments.
Automation does not eliminate drift.
But it constrains it.
Manual processes, emergency fixes and local variations introduce entropy.
Automation reduces the surface area where entropy can enter.
Fully automated systems drift slowly.
Semi-automated systems drift continuously.
Manually operated systems drift immediately.
Understanding this relationship is key:
Drift is not a mystery — it is the natural side effect of how changes enter a system.
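The constraining effect of automation can be sketched as a toy reconciliation loop: divergence can still appear between runs, but each run pulls actual state back toward the declared state. The package names and version strings below are invented for illustration.

```python
# Toy reconciliation: compare declared state with observed state and emit
# the actions needed to converge. All names are hypothetical.
desired = {"nginx": "1.24", "openssl": "3.0"}
actual  = {"nginx": "1.22", "openssl": "3.0", "debug_tool": "manual-install"}

def reconcile(desired: dict, actual: dict) -> dict:
    """Return the actions needed to converge actual state onto desired state."""
    actions = {}
    for key, value in desired.items():
        if actual.get(key) != value:
            actions[key] = f"set {key} -> {value}"
    for key in actual.keys() - desired.keys():
        actions[key] = f"remove {key}"   # unmanaged additions are drift too
    return actions

print(reconcile(desired, actual))
# e.g. {'nginx': 'set nginx -> 1.24', 'debug_tool': 'remove debug_tool'}
```

Run continuously, the loop bounds how far drift can wander before it is corrected; run never, every manual change becomes permanent divergence.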
Closing
The pain landscape of modern infrastructure is not caused by individual mistakes.
It is caused by the natural behavior of complex systems: drift, defaults, layering, undocumented interactions, and constantly shifting realities.
This complexity cannot be eliminated.
But it can be understood — if we look at the underlying mechanics instead of just the symptoms.
This series continues by exploring those mechanics one by one, starting with how systems behave differently than intended.
Follow-Up Questions
These questions emerged during writing and will be addressed in later articles:
- How does the operating model influence drift? For example: do Windows Server Core systems drift less than GUI servers because GUI comfort encourages manual changes?
- How can effective configuration be extracted reliably across all layers (file, registry, default, runtime)?
- How can we measure drift objectively and compare environments?
- What telemetry sources reflect real system behavior?
- How does automation change the geometry of drift over time?
Get in touch
Email me: starttalking@sh-soft.de