The TTD Trap: Why Your Dashboard is Lying to You (and Your Engineers are Tired)

A few weeks ago, I wrote a piece about the obsession with measurement. While that text might have drifted into the slightly abstract, life—in its infinite wisdom—decided to provide me with a very concrete sequel.

The inspiration came from a casual office chat about incident detection. I work in a large organization where we count engineers in the hundreds, which means complexity isn’t just a challenge; it’s the default state of existence. But let’s not dive into that abyss today.

Instead, let’s talk about everyone’s favorite acronym soup: DORA metrics.

Specifically, let’s look at MTTR (Mean Time to Recover). In the post-2023 world, some people insist on calling it FDRT (Failed Deployment Recovery Time) to feel more precise, but we’re not here to split hairs. MTTR is vital for understanding IT efficiency, but as a single number, it’s about as useful as a weather forecast that says “it might rain or it might not.” It’s influenced by too many factors to be actionable.

Naturally, we break it down into smaller, bite-sized pieces. One of those pieces is TTD: Time to Detect.

On paper, TTD is a simple, logical operational indicator: the time between the occurrence of a problem and its detection. But in reality? It’s a vanity metric that management worships because it’s easy to pull from a dashboard.
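To show just how easy it is to pull, here is a minimal sketch of the calculation. The `Incident` record and its field names are my own invention for illustration; they are not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass
class Incident:
    # Hypothetical record; the field names are assumptions for this sketch.
    started_at: datetime   # best guess at when the problem began
    detected_at: datetime  # when an alert fired or a human noticed


def time_to_detect(incident: Incident) -> timedelta:
    """TTD: time between the occurrence of a problem and its detection."""
    return incident.detected_at - incident.started_at


def mean_ttd(incidents: Sequence[Incident]) -> timedelta:
    """Average TTD across incidents, i.e. the number that lands on the dashboard."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((time_to_detect(i) for i in incidents), timedelta())
    return total / len(incidents)
```

Notice what the calculation never asks: how many alerts fired that were not incidents at all, and whether anyone acted on the ones that were.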

And I’m here to argue that we should stop obsessing over it.

The Myth of the “Occurrence”

First, for the sake of completeness, let’s address the elephant in the room: pinpointing exactly when a problem “starts” is often a fool’s errand.

We like to imagine there’s a cinematic moment—a “big bang”—where the system breaks. Sometimes there is. But more often, we’re dealing with “slow burns”: memory leaks that creep up like a mid-life crisis, or silent failures where the code isn’t throwing exceptions but is quietly massacring your database records.

We could spend weeks debating when the problem actually “began.” Usually, we just give up and point to the moment a faulty commit hit production—assuming, of course, the problem is code and not a dying hard drive (which, believe it or not, still happens in the cloud era).

But that’s a minor headache compared to the cult of “Detection.”

The Seismograph in the Techno Club

As an industry, we are obsessed with “optimizing” the moment of detection. Management loves to stroke their egos over a low TTD.

“It took too long to find this bug? Add an alert!”
“A human reported this instead of a bot? Add an alert!”

Enter our old friend, Goodhart’s Law: When a measure becomes a target, it ceases to be a good measure. If the goal is to keep TTD low, we flood the system with alarms. We don’t care about false positives because, hey, as long as we catch the “real” one in seconds, the report looks beautiful.

The result? Your alerting system starts looking like a NYSE stock ticker. It’s like placing a high-sensitivity seismograph in the middle of a techno club. It will certainly catch every tremor in the ground… but good luck distinguishing a tectonic shift from the bass drop of whatever the kids are calling “music” these days. Everyone eventually stops looking at the screen.
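If you want a number for the techno club problem, one candidate is alert precision: the share of fired alerts that pointed at a real incident. A rough sketch follows; the function name and the idea of counting "real" alerts by hand are assumptions of mine, not a standard from any tool.

```python
def alert_precision(real_incident_alerts: int, total_alerts_fired: int) -> float:
    """Fraction of fired alerts that corresponded to a real incident.

    Both counts are assumed to come from a manual review of the alert log.
    """
    if total_alerts_fired <= 0:
        raise ValueError("total_alerts_fired must be positive")
    if not 0 <= real_incident_alerts <= total_alerts_fired:
        raise ValueError("real_incident_alerts must be between 0 and total_alerts_fired")
    return real_incident_alerts / total_alerts_fired
```

With purely illustrative numbers, one real incident among 400 alerts gives `alert_precision(1, 400)`, a quarter of one percent. A low TTD report says nothing about this figure.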

The Fatal Silence of Alarm Fatigue

This leads us to Alarm Fatigue. Our attention is a finite resource. More alarms don’t make us more attentive; they make us more deaf.

The most haunting example of this isn’t in IT, but in medicine—a field I covered in Fuckup Almanac vol. 2. In many ICUs, the logic follows a deadly loop: More alarms → less attention → delayed response → increased sensitivity → even more alarms.

There was a documented case at Massachusetts General Hospital where a patient’s heart rate slowed down over a period of 20 minutes. The monitoring system wailed the entire time. There were ten nurses on duty. No one reacted. The 89-year-old patient’s heart didn’t wait for the dashboard to be cleaned up.

In IT, the stakes aren’t usually life and death, but the psychological mechanism is identical. We learn to ignore the noise to stay sane. I once saw a post-mortem that contained the sentence: “The alarm had been firing for 20 months but was not considered significant.”

If an alarm fires for 20 months and nobody cares, it’s not an alarm—it’s background noise.

A Modest Proposal: Measure the Reaction

I don’t often offer solutions—I prefer complaining—but today I’ll make an exception. Let’s stop talking about TTD (Time to Detect) and start talking about TTR: Time to React.

This might seem like a semantic tweak, but it’s a cultural revolution.

TTD is a “Tooling” metric. You can “fix” it by throwing money at monitoring systems until your dashboard looks like an Oprah Winfrey special: “You get an alert! You get an alert! EVERYONE GETS AN ALERT!”

TTR is a “Cultural” metric. Improving it requires changing how we work, setting priorities, and—worst of all—admitting that humans have limited bandwidth.

A high TTR often means the team doesn’t trust the monitoring. They know that when the system screams, it’s probably just “crying wolf” again. To lower TTR, you might actually need to increase TTD.

Hear me out: if the system only generated high-value, 100% accurate alerts, the TTD might be slightly higher (because the system took a moment to verify the failure), but the TTR would drop to near zero. Why? Because the engineer would know: “If that phone rings, something is actually on fire.”
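One simple way a system can "take a moment to verify" is to page only after several consecutive failed checks. The sketch below assumes a hypothetical health check that runs at a fixed interval and returns True when it fails; the function name and the threshold parameter are illustrative, and real alerting tools expose this idea under their own settings.

```python
from typing import Optional, Sequence


def first_alert_index(check_failed: Sequence[bool], required_failures: int) -> Optional[int]:
    """Return the index of the check at which the alert fires, or None.

    The alert fires only after `required_failures` consecutive failing checks,
    so a single blip never pages anyone.
    """
    if required_failures < 1:
        raise ValueError("required_failures must be at least 1")
    streak = 0
    for index, failed in enumerate(check_failed):
        streak = streak + 1 if failed else 0  # any passing check resets the streak
        if streak >= required_failures:
            return index
    return None


blip = [False, True, False, False]          # one transient failure
outage = [False, True, True, True, True]    # a real, sustained failure

print(first_alert_index(blip, 1), first_alert_index(outage, 1))
print(first_alert_index(blip, 3), first_alert_index(outage, 3))
```

With a threshold of 1, both the blip and the outage page at check 1. With a threshold of 3, the blip never pages and the outage pages at check 3: detection is two check intervals slower, and the false alarm is gone.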

Conclusion

I’d rather know about a crash after 5 minutes and fix it in 2 than know in 1 second and ignore it for an hour because I was busy deleting 400 “information” alerts.
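The arithmetic, for the skeptics. This is a back-of-the-envelope sketch with two assumptions of mine: in the quiet system the reaction is immediate (TTR of zero), and the fix itself takes the same 2 minutes in both systems.

```python
from datetime import timedelta


def time_until_fixed(ttd: timedelta, ttr: timedelta, fix_time: timedelta) -> timedelta:
    """Total time from the crash to the fix: detect, then react, then repair."""
    return ttd + ttr + fix_time


# Quiet system: slow to detect, trusted, acted on immediately (assumed TTR of 0).
quiet = time_until_fixed(timedelta(minutes=5), timedelta(0), timedelta(minutes=2))

# Noisy system: instant detection, ignored for an hour, same assumed 2-minute fix.
noisy = time_until_fixed(timedelta(seconds=1), timedelta(hours=1), timedelta(minutes=2))

print(quiet)  # 0:07:00
print(noisy)  # 1:02:01
```

The system with the "worse" TTD gets the crash fixed in 7 minutes. The one with the beautiful 1-second TTD takes over an hour.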

Which system would you prefer? The one that wakes you up 10 times a night with false alarms, starting at 2:00:01 AM (TTD: 1s)? Or the one that wakes you up once, at 2:05:00 AM, because the building is actually burning down?
