Active Network Service Monitoring with Prometheus Blackbox Exporter: HTTP, TCP, DNS, and ICMP Without Agents

Networking tutorial - IT technology blog

Six months ago, I got paged at 2 AM. A critical API endpoint had silently stopped responding — but our monitoring only checked whether the process was running, not whether it was actually reachable from outside. That incident pushed me toward proper black-box monitoring, and Prometheus Blackbox Exporter has been in production ever since.

Unlike agent-based monitoring that instruments applications from the inside, Blackbox Exporter checks your services the same way your users do — from the outside. No code changes. No per-service agent deployment. Point it at a URL, a TCP socket, a DNS resolver, or a host to ping, and it reports exactly what an external observer sees.

Quick Start: Up and Running in 5 Minutes

The fastest path is Docker:

docker run -d \
  --name blackbox_exporter \
  -p 9115:9115 \
  prom/blackbox-exporter:latest

You now have a working exporter at http://localhost:9115. Test it immediately:

# Probe google.com via HTTP
curl "http://localhost:9115/probe?target=https://google.com&module=http_2xx"

The response is raw Prometheus metrics — probe_success (1 = up, 0 = down), response time, TLS certificate expiry, and HTTP status code. Everything you need without touching application code.

For production, mount a config file:

# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200, 201, 204]
      follow_redirects: true
      preferred_ip_protocol: "ip4"

  tcp_connect:
    prober: tcp
    timeout: 5s

  icmp:
    prober: icmp
    timeout: 5s

  dns_check:
    prober: dns
    timeout: 5s
    dns:
      query_name: "example.com"
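      query_type: "A"

Then start the container with the config mounted:

# Remove the quick-start container first if it is still running:
#   docker rm -f blackbox_exporter
# An absolute host path avoids bind-mount errors on older Docker versions.
docker run -d \
  --name blackbox_exporter \
  -p 9115:9115 \
  -v "$(pwd)/blackbox.yml:/config/blackbox.yml:ro" \
  prom/blackbox-exporter:latest \
  --config.file=/config/blackbox.yml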

Deep Dive: Understanding Each Probe Type

This walkthrough covers the four probe types in the title; the exporter also ships a gRPC prober, which is out of scope here. Knowing when to reach for each one matters more than you’d think.

HTTP Probe

The HTTP probe is where I spend roughly 80% of my configuration time. It does a lot more than check whether port 80 is open — response code validation, time-to-first-byte measurement, TLS chain verification, and POST requests with body regex matching are all on the table.

modules:
  http_post_json:
    prober: http
    timeout: 10s
    http:
      method: POST
      headers:
        Content-Type: application/json
        Authorization: "Bearer your_token_here"
      body: '{"ping": "check"}'
      valid_status_codes: [200]
      fail_if_body_not_matches_regexp:
        - '"status":"ok"'

This catches something TCP-level checks will never surface: a service that returns HTTP 200 but serves an error page in the body. That scenario is more common than it sounds — misconfigured load balancers, half-broken deploys, stale health endpoints.

TLS certificate monitoring deserves special mention. The metric probe_ssl_earliest_cert_expiry gives a Unix timestamp for the nearest expiring certificate in the chain. One Prometheus alert rule covers it:

groups:
  - name: ssl_expiry
    rules:
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL cert expiring in <30 days: {{ $labels.instance }}"

TCP Probe

TCP probes cover non-HTTP services — databases, SMTP, LDAP, custom binary protocols. Tracking both probe_success and probe_duration_seconds catches latency spikes even when the port is technically accepting connections.

modules:
  tcp_postgres:
    prober: tcp
    timeout: 5s
    tcp:
      preferred_ip_protocol: "ip4"

  tcp_smtp_starttls:
    prober: tcp
    timeout: 10s
    tcp:
      query_response:
        - expect: "^220 "
        - send: "EHLO prober\r"   # the exporter appends the trailing "\n" itself
        - expect: "^250-STARTTLS"

The query/response feature is genuinely useful — you can verify that your SMTP server speaks the correct protocol, not just that port 25 is accepting connections.

ICMP Probe

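Classic ping, integrated into your existing Prometheus stack. I use ICMP probes for network infrastructure: routers, firewalls, VPN tunnel endpoints. On Linux, the exporter needs permission to send ICMP: root, the CAP_NET_RAW capability, or a group inside net.ipv4.ping_group_range. Two common ways to grant it: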

# For a direct binary install
sudo setcap cap_net_raw+ep blackbox_exporter

# For Docker, use --privileged (scope carefully)
docker run -d --privileged \
  -p 9115:9115 \
  prom/blackbox-exporter:latest

DNS Probe

DNS monitoring is what most teams skip until a CDN propagation issue or an accidentally deleted record causes an outage. The DNS probe checks that specific records resolve to what you expect:

modules:
  dns_a_record:
    prober: dns
    timeout: 5s
    dns:
      transport_protocol: "udp"
      preferred_ip_protocol: "ip4"
      query_name: "api.yourapp.com"
      query_type: "A"
      validate_answer_rrs:
        fail_if_not_matches_regexp:
          - "api.yourapp.com.\t.*\tIN\tA\t203.0.113."

This validates not just that DNS responds, but that it returns the correct IP range — catching misconfigurations before users encounter them.

Advanced Usage: Wiring Into Prometheus

Standalone, Blackbox Exporter is useful. Wired into Prometheus with relabeling, it gets powerful. The trick is passing the target URL as a query parameter to the exporter rather than as the scrape address:

# prometheus.yml
scrape_configs:
  - job_name: "blackbox_http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://yourapp.com/health
          - https://api.yourapp.com/v1/status
          - https://yourapp.com/login
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox_exporter:9115

  - job_name: "blackbox_tcp"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - postgres-primary:5432
          - redis-cluster:6379
          - kafka-broker:9092
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox_exporter:9115

These relabel rules do three things: copy the target address into the target query parameter, set the instance label to that original target, and rewrite the scrape address to point at the exporter. Because each result carries the original target URL as its instance label, your dashboards stay readable.
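The same relabel rules carry over to the other probers; only the module name and the meaning of the target change. Here is a minimal sketch, assuming the dns_check and icmp modules from the config file earlier. The addresses are placeholders from the documentation ranges, not real hosts. For the DNS prober the target is the resolver to query (the record name lives in the module), and for ICMP it is the host to ping.

```yaml
# prometheus.yml: extra jobs to append under the existing scrape_configs
scrape_configs:
  - job_name: "blackbox_dns"
    metrics_path: /probe
    params:
      module: [dns_check]          # query_name and query_type come from the module
    static_configs:
      - targets:
          - 192.0.2.53             # hypothetical resolver address
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox_exporter:9115

  - job_name: "blackbox_icmp"
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - 192.0.2.1              # hypothetical router or gateway to ping
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox_exporter:9115
```

Keeping one job per module means the job label doubles as a probe-type filter in queries.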

Useful PromQL Queries for Grafana

Community Grafana dashboard ID 7587 is a solid starting point. Here are the queries I always add manually:

# Uptime percentage over 24h
avg_over_time(probe_success{job="blackbox_http"}[24h]) * 100

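# HTTP probe duration p95 over the last hour
# (probe_duration_seconds is a gauge, so use quantile_over_time rather than histogram_quantile)
quantile_over_time(0.95, probe_duration_seconds{job="blackbox_http"}[1h])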

# Days until SSL cert expiry
(probe_ssl_earliest_cert_expiry - time()) / 86400

Practical Tips From Six Months in Production

Running this in production for six months meant learning some things the documentation doesn’t spell out. Here’s what I’d tell myself at the start.

Set timeouts shorter than your scrape interval. If both are 30 seconds, a slow target can eat the whole interval, leaving no headroom before the next scrape is due. Keep probe timeouts at 60 to 70% of your scrape interval.
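As a sketch of that ratio, assuming a 30-second scrape interval and the http_2xx module from earlier: the module timeout sits at about two thirds of the interval, and the Prometheus scrape_timeout sits between the two, so the exporter gives up first and still returns probe_success 0. The numbers are illustrative, not recommendations for every setup.

```yaml
# prometheus.yml
scrape_configs:
  - job_name: "blackbox_http"
    scrape_interval: 30s
    scrape_timeout: 25s      # must not exceed scrape_interval
    metrics_path: /probe
    params:
      module: [http_2xx]
    # static_configs and relabel_configs as in the job definition above

# blackbox.yml (separate file, shown here for comparison)
modules:
  http_2xx:
    prober: http
    timeout: 20s             # about two thirds of the 30s scrape interval
```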

Monitor from multiple locations. One Blackbox Exporter instance tells you if a service appears down from that location. Two or three instances in different regions tell you whether it’s a routing issue, a regional outage, or a genuine service failure. Add a probe_location label to each scrape config — from there, one label selector in Grafana separates them cleanly.
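One way to attach that label is to set it under static_configs, so every series from the job carries it. This sketch assumes a single Prometheus scraping two exporters; the job names, location names, and exporter hostnames are hypothetical.

```yaml
# prometheus.yml: one job per probe location
scrape_configs:
  - job_name: "blackbox_http_eu"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://yourapp.com/health
        labels:
          probe_location: eu-west            # hypothetical location name
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-eu.example.internal:9115   # hypothetical exporter address

  - job_name: "blackbox_http_us"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://yourapp.com/health
        labels:
          probe_location: us-east            # hypothetical location name
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-us.example.internal:9115   # hypothetical exporter address
```

With this in place, a selector such as probe_success{probe_location="eu-west"} isolates one vantage point. Comparing values across locations for the same instance shows whether a failure is regional.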

Don’t probe too aggressively. Early on I used 10-second probe intervals across 50 endpoints — that’s 300 external HTTP requests per minute, from the monitoring layer alone. For most endpoints, 30–60 second intervals are enough.

Use the debug endpoint during setup. When a probe behaves unexpectedly, append &debug=true:

curl "http://localhost:9115/probe?target=https://yourapp.com&module=http_2xx&debug=true"

This returns a verbose trace of every step — DNS resolution time, TCP handshake duration, TLS negotiation, HTTP redirect chain. You’ll fix issues in minutes instead of hours.

Alert on probe duration, not just probe_success. A login page responding in 8 seconds is technically “up” but functionally broken. Alerting on probe_duration_seconds > 3 catches degraded performance before it becomes a reported outage.
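A rule along these lines would cover it. The 3-second threshold comes from the example above; the alert name, job label, and 5-minute hold are assumptions to adjust.

```yaml
groups:
  - name: probe_latency
    rules:
      - alert: ProbeSlow                      # hypothetical alert name
        expr: probe_duration_seconds{job="blackbox_http"} > 3
        for: 5m                               # ignore one-off slow probes
        labels:
          severity: warning
        annotations:
          summary: "Probe took more than 3s: {{ $labels.instance }}"
```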

Use two-stage SSL expiry alerts. Alert at 30 days and again at 7 days. A missed first alert won’t directly cause an outage, but a missed 7-day alert can.
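A minimal sketch of the two stages, reusing the expiry expression from the HTTP probe section. The alert names and severities are assumptions. Both rules fire once a certificate drops under 7 days, so you may want Alertmanager to suppress the warning while the critical alert is active.

```yaml
groups:
  - name: ssl_expiry_two_stage
    rules:
      - alert: SSLCertExpiresIn30Days         # hypothetical alert name
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL cert expiring in <30 days: {{ $labels.instance }}"

      - alert: SSLCertExpiresIn7Days          # hypothetical alert name
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "SSL cert expiring in <7 days: {{ $labels.instance }}"
```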

Adding a new service takes under a minute: one target line in the Prometheus scrape config. It’s immediately monitored for availability, response time, and certificate validity. No agent installation. No application changes. No coordination with another team.
