What we actually monitor in production

In production, “up” is not the same as usable.

A lot of teams monitor infrastructure health. Far fewer monitor user pain.

After running hundreds of blockchain nodes, we learned that green dashboards can be deeply misleading.

CPU looks fine. Memory looks fine. Disk looks fine. And the product is already degrading.

The signals that matter most are different.

p95 latency, not the average. Because averages hide the users already having a bad experience.
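A quick way to see the gap between the two numbers is the sketch below. The latency values are made up for illustration, and `percentile` is a hypothetical helper using the nearest-rank method, not a specific monitoring library's API.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 90 fast requests, 10 slow ones.
latencies_ms = [100] * 90 + [2000] * 10

mean_ms = statistics.mean(latencies_ms)
p95_ms = percentile(latencies_ms, 95)

print(f"mean={mean_ms} ms, p95={p95_ms} ms")
```

With this sample the mean comes out at 290 ms while p95 is 2000 ms. One in ten requests takes two seconds, and the average barely hints at it.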

Error rate by critical path, not globally. Because one broken endpoint matters more than a pretty overall number.
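The same effect shows up with error rates. The sketch below assumes a simplified request log of `(path, status)` pairs; the endpoint names and traffic mix are hypothetical, and any status of 500 or above is counted as an error.

```python
from collections import defaultdict


def error_rates(requests):
    """Return the share of 5xx responses per path."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for path, status in requests:
        totals[path] += 1
        if status >= 500:
            errors[path] += 1
    return {path: errors[path] / totals[path] for path in totals}


# Hypothetical traffic: mostly cheap health checks, plus one critical
# endpoint that fails half the time.
requests = (
    [("/health", 200)] * 900
    + [("/balance", 200)] * 80
    + [("/send_transaction", 200)] * 10
    + [("/send_transaction", 500)] * 10
)

global_rate = sum(1 for _, status in requests if status >= 500) / len(requests)
per_path = error_rates(requests)

print(f"global: {global_rate:.1%}")
for path, rate in per_path.items():
    print(f"{path}: {rate:.1%}")
```

Here the global error rate is 1%, which looks healthy, while the hypothetical `/send_transaction` path fails 50% of the time.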

Lag and restart patterns, not just uptime. Because a service can be technically alive and still fail in practice.
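As a rough sketch, a usability check can combine liveness with lag and restart counts. Everything here is an assumption for illustration: the `NodeSample` fields, the choice of block height as the lag measure, and the thresholds would all depend on the chain and the service.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    process_running: bool     # what a plain uptime check sees
    local_height: int         # latest block the node has processed
    network_height: int       # latest block known to the network
    restarts_last_hour: int   # restart count in a rolling window


def is_usable(sample, max_lag_blocks=5, max_restarts=3):
    """Alive is necessary but not sufficient: also require low lag and few restarts.

    The default thresholds are placeholders, not recommendations.
    """
    if not sample.process_running:
        return False
    lag = sample.network_height - sample.local_height
    return lag <= max_lag_blocks and sample.restarts_last_hour <= max_restarts


healthy = NodeSample(True, local_height=1000, network_height=1001, restarts_last_hour=0)
lagging = NodeSample(True, local_height=900, network_height=1001, restarts_last_hour=0)
flapping = NodeSample(True, local_height=1000, network_height=1001, restarts_last_hour=7)

for name, sample in [("healthy", healthy), ("lagging", lagging), ("flapping", flapping)]:
    print(name, "up:", sample.process_running, "usable:", is_usable(sample))
```

All three samples pass an uptime check. Only the first one passes the usability check.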

That is the real monitoring trap: teams watch machines, while users feel systems.

The right monitoring question is not "is it up?" It is "is it still usable?"

In production, that difference matters more than most teams think.