Mastering Prometheus Alertmanager: Routing, Grouping, Inhibition, and Smart Alert Delivery to Telegram and PagerDuty

The Alert Noise Problem Nobody Warns You About

You set up Prometheus, wire up a few rules, and everything looks great, until a single network blip fires 200 alerts at 3 AM. Your phone explodes. You silence everything just to get back to sleep. The next morning, you learn that a real incident was buried in the noise and you missed it.

This is the most common Alertmanager failure pattern in production. The monitoring stack is technically correct, but alerting gets configured as an afterthought. A bare alertmanager.yml with a single receiver and no routing logic destroys your on-call rotation within a week.

The fix is not fewer alerts — it’s smarter delivery. Alertmanager gives you three core tools: routing, grouping, and inhibition. Together, they transform raw alert volume into actionable, context-rich notifications that wake the right person for the right reason.

Approach Comparison: Naive vs. Structured Alertmanager

Before touching config, it helps to see what you’re actually choosing between.

Naive Setup — Single Receiver, No Logic

route:
  receiver: 'slack-default'
receivers:
  - name: 'slack-default'
    slack_configs:
      - api_url: 'https://hooks.slack.com/...'

Routes every alert to one channel. Takes five minutes to set up. Falls apart the first time you have a busy week.

Structured Setup — Routing Tree with Grouped Delivery

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-critical'
    - matchers:
        - team="platform"
      receiver: 'telegram-platform'
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'cluster']

Alerts are grouped, deduplicated, and routed by label, and warnings are suppressed while a critical with the same alertname and cluster is firing. This is what production looks like.

Pros and Cons of Each Approach

Naive — Pro: zero config overhead. Con: alert storms, no priority separation, on-call fatigue within days.
Structured — Pro: noise reduction, team-based routing, clear escalation paths. Con: requires upfront planning and label discipline across your alert rules.

Bad routing doesn’t just cause annoyance — it erodes trust. Once engineers learn that alerts are noisy, they stop checking them. At that point your monitoring stack is theater, not safety.

Recommended Setup: Labels-First Design

Alertmanager routes on labels. Those labels come from your Prometheus alerting rules — so get them right first, before you touch Alertmanager config at all.

# In your Prometheus rules file
groups:
  - name: app.rules
    rules:
      - alert: HighErrorRate
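        # Ratio of 5xx responses to all responses, so 0.05 means 5%
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
          team: backend
          service: api
        annotations:
          # $labels holds only the labels returned by expr, so name the service directly
          summary: "High error rate on the api service"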
          description: "Error rate is {{ $value | humanizePercentage }}"

With severity, team, and service consistently set on every rule, your routing tree becomes predictable and maintainable. Skip these labels and routing becomes guesswork.
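One way to put that label discipline to work is to define the same alert at two severities. The sketch below assumes a hypothetical warning tier at 1% beside the critical tier at 5%. The thresholds, hold times, and the ratio expression are illustrative and not taken from a real system.

```yaml
groups:
  - name: app.rules
    rules:
      # Warning tier: hypothetical 1% threshold, longer hold time
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: warning
          team: backend
          service: api

      # Critical tier: same alertname, higher threshold
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
          team: backend
          service: api
```

Because both tiers share alertname, team, and service, any route or inhibition rule that compares those labels can treat them as the same problem at different severities.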

Implementation Guide

Step 1 — Install and Start Alertmanager

# Download
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar xvf alertmanager-0.27.0.linux-amd64.tar.gz
cd alertmanager-0.27.0.linux-amd64

# Start with config
./alertmanager --config.file=alertmanager.yml --storage.path=/var/lib/alertmanager

Then tell Prometheus where to find it by adding this to prometheus.yml:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

Step 2 — Configure Routing

A routing tree evaluates sibling routes top-to-bottom and stops at the first match; alerts that match no child route go to the top-level receiver. Set continue: true on a route to keep evaluating later siblings and fan out to multiple receivers. Keep the most specific matchers at the top.

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h

  routes:
    # Critical alerts go to PagerDuty for immediate escalation
    - matchers:
        - severity="critical"
      receiver: 'pagerduty'

    # Platform team gets Telegram notifications
    - matchers:
        - team="platform"
      receiver: 'telegram-platform'

    # Database alerts stay separate
    - matchers:
        - service=~"postgres|mysql|redis"
      receiver: 'telegram-db'
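To make the continue behavior concrete, here is a minimal sketch in which critical alerts page PagerDuty and still fall through to the team route. It uses the matchers list syntax, which newer Alertmanager releases prefer over the deprecated match and match_re fields. The receiver names are the ones used in this guide.

```yaml
route:
  receiver: 'default-receiver'
  routes:
    # Matches critical alerts, then keeps evaluating the sibling routes below
    - matchers:
        - severity="critical"
      receiver: 'pagerduty'
      continue: true

    # A critical alert that also has team=platform lands here as well
    - matchers:
        - team="platform"
      receiver: 'telegram-platform'
```

Without continue: true, a critical platform alert would stop at the first route and never reach Telegram. Running amtool config routes test against this file with severity=critical team=platform should report both receivers.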

Step 3 — Set Up Grouping

Grouping is what stops the 3 AM alert storm. When a node goes down and 50 instances of the same alert fire at once, Alertmanager batches everything that shares the same group_by label values into one notification instead of sending 50 separate pages.

Three settings control this behavior:

group_wait (30s): How long to buffer before sending the first notification for a new group. This gives related alerts time to arrive together. The 30s default works well in practice.
group_interval (5m): How long to wait before sending a follow-up when alerts are added to or resolved in an existing group. Five minutes is the default and a reasonable starting point for most teams.
repeat_interval (12h): How often to re-notify about an already-firing alert. Use 1h for critical, 4–12h for warnings. Repeats are only sent when a group is flushed, so a 1-minute repeat re-notifies on every group_interval tick until someone fixes it. Avoid this.
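These timers can also be overridden per route, which is one way to apply the 1h-for-critical suggestion. A sketch, assuming the same receiver names as in Step 2; the 10s group_wait is a hypothetical value. Child routes inherit any timing value they do not set themselves.

```yaml
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h        # applies to every route that does not override it

  routes:
    - matchers:
        - severity="critical"
      receiver: 'pagerduty'
      group_wait: 10s         # hypothetical: send the first page sooner
      repeat_interval: 1h     # re-notify hourly while still firing
      # group_by and group_interval are inherited from the parent route
```

Keeping repeat_interval a multiple of group_interval makes the actual re-notification times easier to predict.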

Step 4 — Configure Inhibition Rules

Inhibition suppresses notifications for lower-priority alerts while a higher-severity alert for the same system is firing. It’s the biggest noise-reduction lever Alertmanager has.

inhibit_rules:
  # Suppress warnings when critical is active for the same service
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'cluster', 'service']

  # Suppress all app alerts when the entire node is down
  - source_matchers:
      - alertname="NodeDown"
    target_matchers:
      - alertname=~".+"
    equal: ['instance']

That second rule is the one people always wish they’d added earlier. When NodeDown fires for a host, every alert with the same instance label value (disk full, high CPU, service down) gets suppressed automatically. Fix the node, everything clears. No manual silencing required.
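The NodeDown inhibition only works if a NodeDown alert exists and carries the same instance value as the alerts it should suppress. A minimal sketch of such a rule, assuming a node_exporter scrape job named node; the job name, the 1m hold, and the team label are assumptions.

```yaml
groups:
  - name: node.rules
    rules:
      - alert: NodeDown
        # up is 0 when Prometheus fails to scrape the target
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Node {{ $labels.instance }} is unreachable"
```

The instance label usually includes the scrape port (for example host:9100), so alerts scraped from a different port on the same host will not satisfy equal: ['instance'] unless you relabel targets to share a common host label and compare on that instead.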

Step 5 — Integrate with Telegram

Start by creating a bot via @BotFather in Telegram and grabbing the token. Then find your chat ID: send any message to the bot, hit https://api.telegram.org/bot<TOKEN>/getUpdates, and look for the chat.id field in the JSON response. Groups have negative IDs.

receivers:
  - name: 'telegram-platform'
    telegram_configs:
      - bot_token: 'YOUR_BOT_TOKEN'
        chat_id: -1001234567890
        message: |
          {{ range .Alerts }}
          *[{{ .Status | toUpper }}]* {{ .Labels.alertname }}
          Service: {{ .Labels.service }}
          {{ .Annotations.summary }}
          {{ end }}
        parse_mode: 'Markdown'

The message field uses Go templating. Keep it concise: Telegram caps messages at 4096 characters, Alertmanager truncates anything longer, and dense alert dumps are unreadable on mobile anyway. One line per alert with severity and service name gives enough context to act on.
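A compact variant along those lines, with one line per alert carrying status, severity, and service, might look like the sketch below. It assumes the same placeholder token and chat ID as above. HTML parse mode is used here because legacy Markdown can reject messages when label values contain unbalanced underscores or asterisks.

```yaml
receivers:
  - name: 'telegram-platform'
    telegram_configs:
      - bot_token: 'YOUR_BOT_TOKEN'
        chat_id: -1001234567890
        parse_mode: 'HTML'
        # "-}}" trims the newline after range, so each alert renders as one line
        message: |
          {{ range .Alerts -}}
          <b>{{ .Status | toUpper }}</b> [{{ .Labels.severity }}] {{ .Labels.alertname }} on {{ .Labels.service }}
          {{ end }}
```

Firing a test alert with amtool alert add is a quick way to check how the message renders on a phone before relying on it.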

Step 6 — Integrate with PagerDuty

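PagerDuty handles critical on-call escalation. In PagerDuty, go to your service → Integrations tab → Add an integration → select Events API v2. Copy the integration key it generates; Alertmanager expects it as routing_key. If you pick the Prometheus integration type instead, the key goes in service_key.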

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY'
        description: '{{ .GroupLabels.alertname }} — {{ .GroupLabels.cluster }}'
        severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
          resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'

PagerDuty deduplicates using the dedup_key derived from your group labels. A firing alert and its resolved state link automatically — no orphaned incidents sitting open after the issue clears.

Step 7 — Verify with amtool

Never push a new alertmanager.yml without validating it first. The bundled amtool catches routing mistakes before they reach production.

# Validate config syntax
amtool check-config alertmanager.yml

# Test routing — see which receiver an alert would hit
amtool config routes test --config.file=alertmanager.yml \
  severity=critical team=backend service=api

# Fire a test alert manually
amtool alert add alertname=TestAlert severity=critical service=api \
  --annotation=summary="Test alert from amtool" \
  --alertmanager.url=http://localhost:9093

Common Mistakes to Avoid

Skipping group_by: All alerts collapse into one massive notification. Always group by at least alertname and one topology label like cluster or instance.
Setting repeat_interval too low: A value at or below group_interval re-notifies on every group flush, which is every five minutes with the settings above, until someone fixes the alert. 4–12 hours is the sane range for non-critical alerts.
Forgetting equal in inhibit rules: Without it, a critical on one host silences every matching warning everywhere, including unrelated hosts and clusters. Always scope inhibitions with matching labels.
One receiver for everything: Database on-call and application on-call are different people with different context. Route them separately from day one.

The Full alertmanager.yml

global:
  resolve_timeout: 5m

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  routes:
    - matchers:
        - severity="critical"
      receiver: 'pagerduty'
    - matchers:
        - team="platform"
      receiver: 'telegram-platform'
    - matchers:
        - service=~"postgres|mysql|redis"
      receiver: 'telegram-db'

inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'cluster', 'service']
  - source_matchers:
      - alertname="NodeDown"
    target_matchers:
      - alertname=~".+"
    equal: ['instance']

receivers:
  - name: 'default-receiver'
    telegram_configs:
      - bot_token: 'YOUR_BOT_TOKEN'
        chat_id: -1001234567890
        message: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'

  - name: 'telegram-platform'
    telegram_configs:
      - bot_token: 'YOUR_BOT_TOKEN'
        chat_id: -1009876543210
        message: |
          {{ range .Alerts }}
          *{{ .Status | toUpper }}* — {{ .Labels.alertname }}
          {{ .Annotations.summary }}
          {{ end }}
        parse_mode: 'Markdown'

  - name: 'telegram-db'
    telegram_configs:
      - bot_token: 'YOUR_BOT_TOKEN'
        chat_id: -1001111111111
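        # instance is not in group_by, so GroupLabels.instance would render empty
        message: '[DB] {{ .GroupLabels.alertname }} on {{ .GroupLabels.service }} ({{ .GroupLabels.cluster }})'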

  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY'
        description: '{{ .GroupLabels.alertname }} — {{ .GroupLabels.cluster }}'

The real test is your first incident after this goes live. Instead of 200 individual notifications arriving in 90 seconds, you get one grouped message: 14 alerts firing, cluster=prod, severity=critical. The right person sees it. Everyone else sleeps through it.