When OpenClaw runs as a 24/7 gateway on rented remote Mac capacity, “it worked yesterday” is not an observability strategy. This HowTo gives a minimal reproducible path: prove the gateway is alive with documented health URLs, pull or derive time series in a Prometheus-compatible way (and optionally forward the same signals through an OpenTelemetry Collector), set threshold alerts, and tie alarms to log lines your team can action. Assume you already completed base install and bind modes from our Docker deploy and hardening guide; for exposure patterns and credential rotation, cross-check Tailscale, Funnel, and token rotation and skill sandbox egress allowlists.
Mental model: pull metrics, push traces, same incidents
Prometheus traditionally scrapes HTTP endpoints on an interval. OpenTelemetry typically exports from apps or collectors to OTLP backends, but you can bridge the two: run an OTel Collector whose prometheus receiver scrapes your gateway (or a sidecar exporter), then ship the result to your vendor over OTLP. For the smallest blast radius on a rental Mac, start with three signals: (1) synthetic availability via /healthz and /readyz, (2) host-level CPU, memory, and disk from node_exporter or your provider's agent, (3) structured gateway logs for causality when a probe turns red.
If upstream OpenClaw later publishes a first-class /metrics page or native OTLP, keep the same job labels (env, instance, service_name=openclaw-gateway) so dashboards and alerts do not churn.
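If you take the Collector bridge route, a minimal config looks like the sketch below. It is a sketch, not a drop-in file: the scrape target assumes that hypothetical native /metrics page, and the OTLP endpoint is a placeholder for your vendor's ingest URL.

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: openclaw_gateway
          scrape_interval: 30s
          static_configs:
            - targets: ["127.0.0.1:18789"]   # assumes the gateway serves /metrics here
exporters:
  otlphttp:
    endpoint: https://otlp.example-vendor.com   # placeholder: your vendor's OTLP/HTTP endpoint
processors:
  batch: {}
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlphttp]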
Step 1 — Gateway health endpoints (reproducible curls)
The gateway listener (commonly 18789) exposes lightweight probes. Verify from the rental host shell first—no network path ambiguity:
# Liveness: process up and HTTP stack responding
curl -fsS http://127.0.0.1:18789/healthz
# Readiness: dependencies satisfied enough to accept work (semantics per release)
curl -fsS http://127.0.0.1:18789/readyz
From your laptop, mirror the same checks through an SSH tunnel (recommended when the gateway binds to loopback):
ssh -N -L 18789:127.0.0.1:18789 user@rental-host
curl -fsS http://127.0.0.1:18789/healthz
Record the expected body and status code in your runbook so on-call knows “green.” If readyz fails while healthz passes, treat it as a degraded state: the process is up but not safe for new sessions—correlate with disk full, stuck subprocess, or upstream API outages visible in logs.
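For triage, a tiny shell check that separates "down" from "degraded" is handy in the runbook; this is a sketch, assuming the default port and loopback bind:

#!/bin/sh
# Sketch: one pass that distinguishes hard-down from degraded
if ! curl -fsS --max-time 5 http://127.0.0.1:18789/healthz >/dev/null; then
  echo "DOWN: healthz failed"; exit 2
elif ! curl -fsS --max-time 5 http://127.0.0.1:18789/readyz >/dev/null; then
  echo "DEGRADED: healthz ok, readyz failed"; exit 1
else
  echo "OK"
fi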
Step 2 — Metric scrape interval (Prometheus job sketch)
Point Prometheus (or the OTel Collector’s prometheus receiver) at the same URLs. Two common patterns:
- Native scrape — If the gateway serves Prometheus text format on a path such as /metrics, scrape that target directly.
- Blackbox-style probe — Use the blackbox_exporter module http_2xx against /healthz and /readyz when you only have HTTP probes.
Interval guidelines: 15s for interactive control planes where minutes of blind spot matter; 30s as a balanced default; 60s when the Mac is oversubscribed or the scrape crosses a high-latency tailnet path. Set Prometheus global.scrape_interval and per-job scrape_interval consistently—your evaluation_interval should be less than or equal to the finest alert window you intend to enforce.
scrape_configs:
  - job_name: openclaw_gateway_health
    scrape_interval: 30s
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://127.0.0.1:18789/healthz
          - http://127.0.0.1:18789/readyz
        labels:
          service: openclaw-gateway
          env: prod-rental-mac
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]   # keep the probed URL as the instance label;
        target_label: instance            # the readiness alert in Step 3 matches on it
      - target_label: __address__
        replacement: 127.0.0.1:9115       # blackbox_exporter listen address
On Docker installs, replace 127.0.0.1:18789 with the host-published address and port (or the bridge IP) that your compose file documents. The goal is a single source of truth for "can we reach the gateway?" before you invest in fancier RED metrics.
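The http_2xx module referenced above lives in the blackbox_exporter's own config (typically blackbox.yml). A minimal definition, relying on the exporter's default of accepting any 2xx status:

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      # valid_status_codes defaults to 2xx, which is what /healthz and /readyz should return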
Step 3 — Alert rule examples (PromQL-style)
Below, metric names assume probe_success from blackbox; swap for your histograms if you scrape native exposition.
groups:
  - name: openclaw_gateway
    interval: 30s
    rules:
      - alert: OpenClawHealthzDown
        # scoped to the healthz target so a readiness-only failure stays a ticket, not a page
        expr: probe_success{job="openclaw_gateway_health", instance=~".*healthz.*"} == 0
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "OpenClaw health probe failing"
          description: "Target {{ $labels.instance }} has been down for >2m."
      - alert: OpenClawReadyzDegraded
        expr: probe_success{job="openclaw_gateway_health", instance=~".*readyz.*"} == 0
        for: 5m
        labels:
          severity: ticket
        annotations:
          summary: "OpenClaw readiness failing (work may be unsafe)"
      - alert: OpenClawProbeSlow
        expr: probe_duration_seconds{job="openclaw_gateway_health"} > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway probes slower than 2s"
When you add request counters or latency histograms (via gateway-native metrics or an Envoy/Caddy sidecar), extend with classic SLO rules: error rate over 5xx or tool failures, and p95 latency over a budget. Keep label cardinality low on shared rental hosts—exploding labels per user id will hurt both Prometheus and your wallet.
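As a sketch of where that goes, assuming hypothetical native metric names openclaw_http_requests_total (with a code label) and openclaw_request_duration_seconds_bucket, two rules you could append to the rules list above:

- alert: OpenClawHighErrorRate
  expr: |
    sum(rate(openclaw_http_requests_total{code=~"5.."}[5m]))
      / sum(rate(openclaw_http_requests_total[5m])) > 0.05
  for: 10m
  labels:
    severity: ticket
  annotations:
    summary: "5xx ratio above 5% for 10m"
- alert: OpenClawP95LatencyHigh
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(openclaw_request_duration_seconds_bucket[5m]))) > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "p95 request latency above 1s"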
Step 4 — Align alerts with OpenClaw log fields
Metrics tell you when; logs tell you why. Standardize fields so any alert deep-link resolves in one jump:
- Timestamp — RFC3339 with timezone; keep the rental Mac's clock synced (sntp) so timestamps line up across hosts.
- level — info, warn, error; route error to paging policies.
- request_id or trace_id — Correlate multi-hop tool calls.
- channel / session — Which surface hit the gateway (console, messaging bridge, CI webhook).
- tool or skill_pack — Names the code path that failed.
- upstream — Model vendor, HTTP host, or local loopback target.
- duration_ms — Pairs with probe latency spikes.
- error_class — Timeout, 429, auth, sandbox denial—maps to remediation playbooks.
In Grafana Loki or Elastic, build a saved query template: service="openclaw-gateway" AND level="error" with the same env label you use in Prometheus. For budget-related storms, pair this observability stack with per-project API caps in your gateway layer so alerts on 429s do not fight spend policies.
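In Loki's query language, that saved template could look like the line below; the field names come from the list above, and the env value is whatever you standardized on in Prometheus:

{service="openclaw-gateway", env="prod-rental-mac"} | json | level="error"
  | line_format "{{.request_id}} {{.tool}} {{.error_class}} {{.duration_ms}}ms"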
Production notes: gateway tokens and outbound restrictions
Tokens. The OPENCLAW_GATEWAY_TOKEN (or equivalent) is not a Prometheus metric label. Store it in launchd environment files, Docker secrets, or a vault sidecar; rotate it on the same calendar as your Tailscale or API keys. Never append secrets to scrape URLs: scrape configs and exporter logs are high-leak surfaces. If probes must authenticate, terminate TLS or mTLS on a local reverse proxy and keep scrapes on loopback.
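One way to keep the token out of the image and the scrape path is a Docker secret, sketched below; the image name and secret path are placeholders, and your entrypoint still has to read the mounted file into the environment:

services:
  openclaw-gateway:
    image: openclaw/gateway:latest          # placeholder image name
    secrets:
      - openclaw_gateway_token              # mounted at /run/secrets/openclaw_gateway_token
secrets:
  openclaw_gateway_token:
    file: ./secrets/gateway_token           # chmod 600, excluded from version control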
Egress. Rental Macs used as agent islands often run domain allowlists for skills. Your observability stack (OTLP endpoint, Grafana Cloud, vendor-hosted Prometheus remote_write) is another outbound destination: add it explicitly to the allowlist or run the collector on a management network. Otherwise “metrics stopped flowing” looks like a gateway outage when it is actually firewall policy doing its job.
Rate and cardinality. High-frequency scrapes plus verbose debug logs can starve the same CPU budget your agents need. Prefer sampling for debug, default info in production, and cap label dimensions—especially on multi-tenant gateways.
FAQ
Should alerts run on the rental Mac itself? Prefer an external evaluator (managed Prometheus, cloud monitoring) so a dead host can still page you. Keep a minimal watchdog script only as a last resort.
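If you do keep a local watchdog, keep it tiny. A sketch, where WEBHOOK_URL is a placeholder for your paging or chat webhook and the schedule comes from launchd or cron:

#!/bin/sh
# Last-resort watchdog: run every minute; pages only when the local probe fails.
if ! curl -fsS --max-time 5 http://127.0.0.1:18789/healthz >/dev/null; then
  curl -fsS -X POST -H 'Content-Type: application/json' \
    -d '{"text":"openclaw-gateway healthz failing on prod-rental-mac"}' \
    "$WEBHOOK_URL" || true
fi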
What if I only have logs today? Ship logs to your stack, derive error counts with log-based metrics, and add Prometheus probes when ready—same incident workflow, different signal quality.
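In Loki, for example, a log-derived error count per tool could be a metric query like the following, reusing the same labels and fields as the saved error query above:

sum by (tool) (
  count_over_time({service="openclaw-gateway", env="prod-rental-mac"}
    | json | level="error" [5m])
)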
Summary
A rented remote Mac running OpenClaw becomes operable when /healthz and /readyz are on a documented scrape path, intervals match your SLO appetite, PromQL-style rules encode liveness and readiness, and structured logs close the loop. Harden tokens and egress so observability never weakens your security posture.
For public pricing, purchase, and support—no login wall—use the site pages below when you are ready to pin this stack to dedicated Mac capacity.