Monitoring and Alerting: The Art of Knowing When Things Break

Jun 18, 2025

Monitoring and alerting are the nervous system of modern infrastructure. Done well, they provide early warning of issues and enable rapid response. Done poorly, they create alert fatigue, erode trust, and turn on-call rotations into a dreaded experience that burns out engineers.

After years of being woken up by meaningless alerts at 3 AM and watching teams struggle with unsustainable on-call practices, I've learned that effective alerting is more art than science. It requires understanding not just the technical aspects, but the human cost of every alert you send.

The Philosophy of Good Alerting

Alert on Symptoms, Not Causes

The cardinal rule of alerting: alert on user-impacting symptoms, not internal system states. Your users don't care if your database CPU is at 85%—they care if their requests are slow or failing.

Good alerts:

  • "API response time exceeds 500ms for 95th percentile"
  • "Error rate above 1% for the last 5 minutes"
  • "Service availability below 99.9% SLO"

Bad alerts:

  • "Database CPU above 80%"
  • "Disk space at 70%"
  • "Memory usage high"

The difference is profound. Symptom-based alerts tell you something is wrong from the user's perspective. Cause-based alerts might fire when everything is working fine, or worse, miss real problems entirely.
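
To make the difference concrete, here is a minimal sketch of a symptom-based check in Python: the decision hinges entirely on the user-visible error rate over the evaluation window, never on CPU or memory readings. The 1% threshold and 5-minute window echo the example alerts above and are illustrative, not prescriptive.

from dataclasses import dataclass

@dataclass
class WindowCounts:
    """Request counts observed over the evaluation window (5 minutes here)."""
    total: int
    errors: int

ERROR_RATE_THRESHOLD = 0.01  # 1%; illustrative, tune to your SLO

def should_alert(window: WindowCounts) -> bool:
    """Fire only when users are actually affected."""
    if window.total == 0:
        return False  # no traffic, nothing user-visible to alert on
    return (window.errors / window.total) > ERROR_RATE_THRESHOLD

# Example: 620 errors out of 12,000 requests is roughly a 5.2% error rate
print(should_alert(WindowCounts(total=12_000, errors=620)))  # True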

Every Alert Must Be Actionable

If you can't define a clear action someone should take when an alert fires, don't send the alert. Period.

Before creating any alert, ask:

  • What should the person receiving this alert do immediately?
  • What's the worst-case scenario if this alert is ignored for 30 minutes?
  • Is this truly urgent enough to wake someone up?

If you can't answer these questions clearly, you need either better runbooks or fewer alerts.

Alert Fatigue is a Reliability Risk

Alert fatigue isn't just an annoyance—it's a systemic reliability risk. When engineers are bombarded with false positives and low-priority alerts, they start ignoring all alerts, including the critical ones.

The psychology is simple: if 90% of your alerts turn out to be non-issues, people will assume the next alert is also a non-issue. This is how real outages get missed.

The Hierarchy of Alerting

Not all problems require the same response urgency. Effective alerting systems use multiple channels with different escalation policies:

Page-Worthy Alerts (Immediate Response Required)

  • Service completely down
  • Error rates above critical thresholds
  • Security breaches
  • Data corruption events

These should wake people up. They represent immediate user impact or business risk.

Ticket-Worthy Alerts (Response Within Hours)

  • Performance degradation below SLO
  • Capacity warnings (before they become critical)
  • Non-critical service failures
  • Backup failures

These go to ticketing systems or chat channels. They need attention but not at 3 AM.

Dashboard-Only Metrics (Informational)

  • Resource utilization trends
  • Business metrics
  • Performance baselines
  • Historical comparisons

These inform but don't alert. They're valuable for investigation and trend analysis.
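
One way to keep these tiers honest is to encode the routing as data. The sketch below maps each tier to a delivery channel; the channel names (and the rule that dashboard-only metrics are never delivered at all) are illustrative assumptions, not any particular vendor's API.

from enum import Enum
from typing import Optional

class Tier(Enum):
    PAGE = "page"            # immediate response required
    TICKET = "ticket"        # response within hours
    DASHBOARD = "dashboard"  # informational only

# Hypothetical routing table mirroring the hierarchy above.
ROUTES = {
    Tier.PAGE: "pager",         # wakes the primary on-call
    Tier.TICKET: "team-queue",  # lands in the ticketing system or chat channel
    Tier.DASHBOARD: None,       # recorded for dashboards, never delivered
}

def route(tier: Tier) -> Optional[str]:
    """Return the delivery target for an alert of the given tier."""
    return ROUTES[tier]

print(route(Tier.PAGE))       # pager
print(route(Tier.DASHBOARD))  # None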

Crafting Effective Alert Messages

A good alert message answers three questions immediately:

  1. What is happening?
  2. Where is it happening?
  3. What should I do about it?

Bad alert:

CRITICAL: High error rate detected

Good alert:

CRITICAL: API error rate 5.2% (threshold: 1%) on production cluster
Impact: User login failures
Runbook: https://wiki.company.com/api-errors
Dashboard: https://monitoring.company.com/api-health

Include direct links to runbooks, dashboards, and relevant documentation. The person responding to the alert shouldn't have to hunt for information.
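
One way to enforce that structure is to build every message from the same required fields, so an alert physically cannot go out without its impact statement and links. A minimal sketch, with field names of my own choosing:

from dataclasses import dataclass

@dataclass
class Alert:
    severity: str
    summary: str       # what is happening (with the measured value and threshold)
    location: str      # where it is happening
    impact: str        # what users are experiencing
    runbook_url: str   # what to do about it
    dashboard_url: str

    def render(self) -> str:
        return (
            f"{self.severity}: {self.summary} on {self.location}\n"
            f"Impact: {self.impact}\n"
            f"Runbook: {self.runbook_url}\n"
            f"Dashboard: {self.dashboard_url}"
        )

print(Alert(
    severity="CRITICAL",
    summary="API error rate 5.2% (threshold: 1%)",
    location="production cluster",
    impact="User login failures",
    runbook_url="https://wiki.company.com/api-errors",
    dashboard_url="https://monitoring.company.com/api-health",
).render())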

On-Call Rotations: Building Sustainable Practices

On-call rotations are where alerting philosophy meets human reality. Poorly designed on-call practices can destroy team morale and create a culture of fear around production systems.

Rotation Structure and Fairness

Primary/Secondary Model:

  • Primary on-call handles all initial alerts
  • Secondary provides backup and escalation
  • Clear escalation criteria and timelines

Follow-the-Sun Model:

  • Different teams handle on-call based on time zones
  • Reduces after-hours burden
  • Requires good handoff processes

Rotation Length:

  • Too short (1-2 days): Constant context switching
  • Too long (2+ weeks): Burnout and resentment
  • Sweet spot: 1-week rotations for most teams (a scheduling sketch follows this list)
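
To make the primary/secondary model and the weekly cadence concrete, here is a small scheduling sketch; the roster and start date are made up, and a real rotation also needs swap and override handling.

from datetime import date, timedelta

TEAM = ["alice", "bob", "carol", "dave"]  # hypothetical roster

def rotation(start: date, weeks: int):
    """Yield (week start, primary, secondary) for consecutive one-week shifts."""
    for week in range(weeks):
        primary = TEAM[week % len(TEAM)]
        secondary = TEAM[(week - 1) % len(TEAM)]  # last week's primary backs up
        yield start + timedelta(weeks=week), primary, secondary

for monday, primary, secondary in rotation(date(2025, 6, 23), 4):
    print(f"{monday}: primary={primary}, secondary={secondary}")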

On-Call Compensation and Recognition

On-call work is an additional responsibility that should be compensated appropriately:

  • Monetary compensation for being on-call (even if not paged)
  • Additional compensation for after-hours responses
  • Time off following major incidents
  • Recognition for handling difficult situations

Teams that don't compensate on-call work fairly will struggle with participation and morale.

Escalation Policies

Clear escalation paths prevent single points of failure:

  1. Primary on-call (5-minute response time)
  2. Secondary on-call (if primary doesn't respond in 15 minutes)
  3. Team lead/manager (if secondary doesn't respond in 15 minutes)
  4. Director/VP (for major outages)

Document these policies clearly and test them regularly. Nothing is worse than an escalation path that doesn't work during a real incident.
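
Expressed as data plus a loop, the chain above might look like the sketch below. The notification and acknowledgement calls are stand-ins; in a real paging tool you would configure this as a policy rather than code it yourself.

import time

# (role, minutes to wait for an acknowledgement before escalating)
ESCALATION_CHAIN = [
    ("primary on-call", 15),
    ("secondary on-call", 15),
    ("team lead/manager", 15),
    ("director/VP", None),  # end of the chain
]

def notify(role: str) -> None:
    print(f"notifying {role}")  # stand-in for a real paging integration

def acknowledged(role: str) -> bool:
    return False  # stand-in: poll your paging system for an ack here

def escalate() -> None:
    for role, wait_minutes in ESCALATION_CHAIN:
        notify(role)
        if wait_minutes is None:
            return  # nobody left to escalate to
        time.sleep(wait_minutes * 60)  # in practice, schedule a check instead of sleeping
        if acknowledged(role):
            return  # someone owns the incident; stop escalating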

Common Alerting Anti-Patterns

The "Everything is Critical" Problem

When everything is marked critical, nothing is critical. Use severity levels sparingly and consistently:

  • Critical: Service down, immediate user impact
  • Warning: Degraded performance, potential future impact
  • Info: Informational, no action required

Threshold Tuning Hell

Constantly adjusting alert thresholds to reduce noise is a sign of deeper problems. Instead of tweaking thresholds:

  • Focus on user-impacting metrics
  • Use dynamic baselines instead of static thresholds (see the sketch after this list)
  • Implement smart alerting with machine learning
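
As a sketch of what a dynamic baseline can look like, the snippet below flags a value only when it falls well outside the metric's recent rolling behaviour; the window length and the three-sigma band are arbitrary starting points, not recommendations.

from collections import deque
from statistics import mean, stdev

class DynamicBaseline:
    def __init__(self, window: int = 60, sigmas: float = 3.0):
        self.history = deque(maxlen=window)
        self.sigmas = sigmas

    def is_anomalous(self, value: float) -> bool:
        anomalous = False
        if len(self.history) >= 10:  # wait for enough history to be meaningful
            baseline = mean(self.history)
            spread = stdev(self.history) or 1e-9  # avoid a zero-width band
            anomalous = abs(value - baseline) > self.sigmas * spread
        self.history.append(value)
        return anomalous

baseline = DynamicBaseline()
for latency_ms in [120, 118, 125, 122, 119, 121, 117, 124, 120, 123, 480]:
    if baseline.is_anomalous(latency_ms):
        print(f"latency {latency_ms}ms deviates from the recent baseline")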

The "Check Everything" Mentality

More monitoring doesn't automatically mean better reliability. Focus on:

  • Golden signals (latency, traffic, errors, saturation)
  • Business-critical workflows end-to-end
  • Dependencies that can cause cascading failures

Alert Storms

When one failure triggers dozens of related alerts, you get an alert storm that overwhelms responders. Implement:

  • Alert correlation to group related issues
  • Dependency mapping to suppress downstream alerts (sketched after this list)
  • Circuit breakers to prevent cascade failures
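
A minimal sketch of downstream suppression, assuming a hand-maintained dependency map with invented service names: when an upstream dependency is already firing, alerts from the services behind it are swallowed so responders see the root alert first.

# service -> services it depends on (hypothetical topology)
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["database"],
    "inventory": ["database"],
    "database": [],
}

def suppress_downstream(firing: set[str]) -> set[str]:
    """Keep only alerts whose upstream dependencies are healthy."""
    kept = set()
    for service in firing:
        upstream_also_firing = any(dep in firing for dep in DEPENDS_ON.get(service, []))
        if not upstream_also_firing:
            kept.add(service)
    return kept

# A database failure took checkout, payments, and inventory down with it,
# but responders only need to see the root alert.
print(suppress_downstream({"checkout", "payments", "inventory", "database"}))
# {'database'}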

Building Alert Runbooks

Every alert should have an associated runbook that provides:

Initial Assessment Steps

  1. Check service dashboard for overall health
  2. Verify alert is not a false positive
  3. Assess user impact and scope
  4. Determine if this requires immediate escalation

Investigation Procedures

  1. Check recent deployments
  2. Review error logs for patterns
  3. Examine resource utilization
  4. Test key user workflows

Resolution Steps

  1. Immediate mitigation options
  2. Rollback procedures
  3. Emergency contacts
  4. Communication templates

Post-Incident Actions

  1. Verify resolution
  2. Update stakeholders
  3. Schedule postmortem
  4. Document lessons learned

Monitoring Tools and Technology

The Modern Monitoring Stack

Metrics Collection:

  • Prometheus for infrastructure metrics
  • Application-specific metrics (custom instrumentation; sketched below)
  • Business metrics from databases/analytics
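
For the application-specific piece, a minimal sketch using the prometheus_client Python library (assuming Python and Prometheus are your stack); the metric names, labels, and port are placeholders.

from prometheus_client import Counter, Histogram, start_http_server

# Placeholder metric names and labels; adapt to your service.
REQUESTS = Counter("app_requests_total", "Requests handled", ["route", "status"])
LATENCY = Histogram("app_request_seconds", "Request latency in seconds", ["route"])

def handle_login() -> None:
    with LATENCY.labels(route="/login").time():  # records the duration on exit
        # ... real handler work goes here ...
        REQUESTS.labels(route="/login", status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    handle_login()
    # a real service's main loop keeps the process (and /metrics) alive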

Log Aggregation:

  • Centralized logging (ELK stack, Splunk, or similar)
  • Structured logging with consistent formats
  • Log correlation with trace IDs (sketched below)
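
A sketch of structured, correlatable logging using only Python's standard library; the field names are a convention to agree on with whatever aggregation pipeline you run, not a requirement of any specific tool.

import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

trace_id = uuid.uuid4().hex  # in practice, propagated from the incoming request
logger.info("payment authorized", extra={"trace_id": trace_id})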

Distributed Tracing:

  • Jaeger or Zipkin for microservices (instrumentation sketched below)
  • Request flow visualization
  • Performance bottleneck identification
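
A minimal tracing sketch, assuming the OpenTelemetry Python SDK as the instrumentation layer (it can export to Jaeger or Zipkin; the console exporter below just prints spans); the service, span, and attribute names are invented.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to stdout; swap in a Jaeger/Zipkin exporter in practice.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # invented service name

with tracer.start_as_current_span("place-order") as span:
    span.set_attribute("order.id", "12345")  # invented attribute
    with tracer.start_as_current_span("charge-card"):
        pass  # the nested span appears as a child, showing the request flow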

Alerting Platforms:

  • PagerDuty, VictorOps, or similar for escalation
  • Slack/Teams integration for team coordination
  • SMS/phone for critical alerts

Alert Delivery Channels

Different types of alerts need different delivery mechanisms:

  • SMS/Phone: Critical production issues only
  • Push notifications: High-priority alerts during business hours
  • Email: Medium-priority issues and summaries
  • Chat (Slack/Teams): Team coordination and non-urgent alerts
  • Ticketing systems: Issues requiring investigation but not immediate response

Measuring Alerting Effectiveness

Track metrics to continuously improve your alerting:

Alert Quality Metrics

  • True positive rate: Percentage of alerts that require action (computed in the sketch after this list)
  • Mean time to acknowledge: How quickly alerts are acknowledged
  • Mean time to resolution: How quickly issues are resolved
  • Alert volume trends: Are you getting better or worse over time?
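
A small computation sketch for the first three metrics, assuming your paging tool can export alert records with fired/acknowledged/resolved timestamps and a flag for whether the alert was actionable; the record fields here are invented.

from datetime import datetime, timedelta

alerts = [  # hypothetical export from a paging tool
    {"actionable": True,
     "fired": datetime(2025, 6, 1, 3, 2), "acked": datetime(2025, 6, 1, 3, 7),
     "resolved": datetime(2025, 6, 1, 3, 40)},
    {"actionable": False,
     "fired": datetime(2025, 6, 2, 14, 0), "acked": datetime(2025, 6, 2, 14, 3),
     "resolved": datetime(2025, 6, 2, 14, 5)},
]

def minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60

true_positive_rate = sum(a["actionable"] for a in alerts) / len(alerts)
mtta = sum(minutes(a["acked"] - a["fired"]) for a in alerts) / len(alerts)
mttr = sum(minutes(a["resolved"] - a["fired"]) for a in alerts) / len(alerts)

print(f"true positive rate: {true_positive_rate:.0%}")  # 50%
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")    # 4.0 min, 21.5 min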

On-Call Health Metrics

  • Pages per week: Track alert volume per person
  • After-hours pages: Measure impact on work-life balance
  • Escalation frequency: How often do alerts escalate?
  • Burnout indicators: Team satisfaction surveys, turnover rates

The Human Side of Alerting

Psychological Safety in On-Call

Create an environment where people feel safe to:

  • Escalate when uncertain
  • Make mistakes without blame
  • Ask for help during incidents
  • Suggest improvements to processes

Training and Preparation

  • Shadow rotations for new team members
  • Game days to practice incident response
  • Runbook reviews and updates
  • Tool training for monitoring systems

Continuous Improvement

  • Regular retrospectives on alerting effectiveness
  • Feedback loops from on-call engineers
  • Alert tuning based on real-world performance
  • Process refinement based on incident learnings

The Future of Monitoring and Alerting

AI-Powered Alerting

Machine learning is transforming alerting:

  • Anomaly detection for dynamic thresholds
  • Alert correlation to reduce noise
  • Predictive alerting to prevent issues
  • Natural language incident summaries

Observability-Driven Development

Modern development practices integrate observability from the start:

  • Metrics as code alongside application code
  • SLO-driven development with built-in monitoring
  • Chaos engineering to test alerting systems
  • Continuous profiling for performance insights

Key Takeaways

Effective monitoring and alerting is about balance:

  1. Alert on user impact, not system metrics
  2. Make every alert actionable with clear runbooks
  3. Design sustainable on-call practices that don't burn out your team
  4. Continuously improve based on real-world feedback
  5. Remember the human element—alerting affects people, not just systems

The goal isn't perfect monitoring—it's reliable systems operated by healthy, sustainable teams. Sometimes that means accepting some risk to preserve team well-being. Sometimes it means investing heavily in automation to reduce manual burden.

The best alerting systems are invisible when everything is working and invaluable when things go wrong. They enable teams to sleep peacefully while maintaining confidence that they'll know immediately when users are impacted.

Remember: every alert you send is an interruption to someone's day (or night). Make it count.