This runbook shows how to run repeatable batch inference with Ollama on a rented remote Mac while keeping OpenClaw nearby for agents and automation. You will install both services with sane defaults, route APIs through a small reverse proxy instead of exposing raw ports, cap concurrency at the queue and server layers, and add degradation plus retries so overnight jobs survive transient overload. For MacCompute capacity and access patterns, start from Home, the notes index, or help.
## Goal and recommended layout
Batches fail when parallelism overshoots VRAM, models thrash, or clients quit on the first timeout. On a remote Mac, use:
- Ollama — 127.0.0.1:11434 (inference only).
- OpenClaw Gateway — 127.0.0.1:18789 (agents; see official Docker / CLI docs).
- Edge — Caddy/Nginx for one TLS hostname on a VPN, or ssh -L from your laptop.
- Queue worker — caps in-flight jobs and adds backoff.
## Install Ollama on macOS (remote Mac)
Install Ollama and pull the model you will batch. On macOS, use Homebrew or the app from ollama.com; the curl install.sh script targets Linux:

```bash
brew install ollama      # or download the macOS app from ollama.com
ollama pull llama3.2
```
Keep the default loopback bind unless you intentionally set OLLAMA_HOST. For reboot-safe settings, put OLLAMA_NUM_PARALLEL in a profile or launchd plist. Smoke test: curl -fsS http://127.0.0.1:11434/api/tags. OpenAI-style clients can use /v1/chat/completions on the same port.
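For a headless CLI install, one reboot-safe shape for those settings is a launchd user agent. A sketch, assuming a Homebrew binary at /opt/homebrew/bin/ollama and a label invented here; the Ollama desktop app manages its own agent, so only use this when running the CLI build:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>local.ollama.serve</string>
  <key>ProgramArguments</key>
  <array>
    <string>/opt/homebrew/bin/ollama</string>
    <string>serve</string>
  </array>
  <key>EnvironmentVariables</key>
  <dict>
    <!-- Cap concurrent decodes; keep the queue's MAX_JOBS at or below this. -->
    <key>OLLAMA_NUM_PARALLEL</key><string>2</string>
  </dict>
  <key>RunAtLoad</key><true/>
  <key>KeepAlive</key><true/>
</dict>
</plist>
```

Save as ~/Library/LaunchAgents/local.ollama.serve.plist and load with launchctl bootstrap gui/$(id -u) plus the file path; the service then survives reboots with the same parallelism cap.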
## OpenClaw Gateway: install points that matter for co-hosting
Two supported paths: a global CLI install (Node 24, or 22.16+) or Docker via ./scripts/docker/setup.sh on a clone of openclaw/openclaw. A typical Mac setup: npm install -g openclaw@latest, then openclaw onboard --install-daemon, then openclaw gateway --port 18789 --verbose; or run the Docker path with bind-mounted config and workspace. Keep Ollama native so inference uses Metal, and let OpenClaw tools call 127.0.0.1:11434. Verify with curl -fsS http://127.0.0.1:18789/healthz. More hardening: OpenClaw deploy & remote Mac workflows.
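Before a batch starts, the two health checks can be folded into one gate. A sketch; check_stack and its injectable fetch command are names coined here (the fetcher defaults to curl but can be swapped out for dry runs):

```shell
# Gate a batch on both loopback services answering.
check_stack() {
  local fetch="${1:-curl}"
  "$fetch" -fsS http://127.0.0.1:11434/api/tags >/dev/null &&
    "$fetch" -fsS http://127.0.0.1:18789/healthz >/dev/null &&
    echo "stack-ok"
}
```

Typical use: check_stack && ./batch.sh prompts.txt, so the queue never launches against a half-up stack.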
## API routing: one hostname, two upstreams
For VPN clients that need HTTPS to both stacks, terminate TLS once and split by path (conceptual Caddy):
```caddy
inference.internal.example.com {
    route /v1/* {
        reverse_proxy 127.0.0.1:11434
    }
    route /openclaw/* {
        reverse_proxy 127.0.0.1:18789
    }
}
```
Add proxy rate limits (limit_req or Caddy rate-limit) to absorb client bursts. No public hostname? Use ssh -L for 11434 and 18789 instead of opening the firewall.
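On the Nginx side, the limit_req idea looks like this (conceptual; the zone name, rate, and burst are placeholders to tune against your batch profile):

```nginx
# Shared-memory zone keyed by client address; 5 req/s steady state.
limit_req_zone $binary_remote_addr zone=ollama_api:10m rate=5r/s;

server {
    listen 443 ssl;
    server_name inference.internal.example.com;

    location /v1/ {
        # Queue up to 20 burst requests instead of erroring immediately.
        limit_req zone=ollama_api burst=20 nodelay;
        proxy_pass http://127.0.0.1:11434;
    }
    location /openclaw/ {
        limit_req zone=ollama_api burst=20 nodelay;
        proxy_pass http://127.0.0.1:18789;
    }
}
```

The proxy layer absorbs client bursts so they never reach Ollama's decode queue at full force.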
## Routing decisions at a glance
| Traffic | Target | Why |
|---|---|---|
| Batch /api/generate or OpenAI shim | 127.0.0.1:11434 | Lowest latency; keep behind loopback or authenticated proxy. |
| Gateway UI and WS control plane | 127.0.0.1:18789 | OpenClaw health on /healthz; tunnel when debugging. |
| Untrusted internet | None by default | Require VPN, SSH, or mutual-TLS before publishing either port. |
## Queue script: concurrency cap and JSON-safe payloads
Store one prompt per line in prompts.txt. The worker below uses python3 only to build JSON safely. Parallelism uses a small bash job pool (works on stock macOS bash) instead of GNU xargs -P.
```bash
#!/usr/bin/env bash
set -euo pipefail

OLLAMA_URL="${OLLAMA_URL:-http://127.0.0.1:11434}"
MODEL="${MODEL:-llama3.2}"
MAX_JOBS="${MAX_JOBS:-2}"
PROMPTS="${1:?path to prompts.txt}"

mkdir -p out failed

run_one() {
  local i="$1" line="$2"
  local body try=0 delay=1
  # Build the payload with python3 so quotes and newlines stay JSON-safe.
  body="$(python3 -c 'import json,sys; print(json.dumps({"model":sys.argv[1],"prompt":sys.argv[2],"stream":False}))' "$MODEL" "$line")"
  while (( try < 4 )); do
    if curl -fsS --max-time 600 -H 'Content-Type: application/json' \
        -d "$body" "$OLLAMA_URL/api/generate" -o "out/resp-$i.json"; then
      return 0
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))   # exponential backoff: 1s, 2s, 4s
    try=$(( try + 1 ))
  done
  printf '%s\n' "$line" >> "failed/prompts-$i.txt"
  return 1
}

i=0
while IFS= read -r line || [ -n "${line-}" ]; do
  i=$((i+1))
  # Job pool: block until a slot frees up.
  while (( $(jobs -rp | wc -l | tr -d ' ') >= MAX_JOBS )); do
    sleep 0.2
  done
  ( run_one "$i" "$line" || true ) &   # background, so the pool cap applies
done < "$PROMPTS"
wait
```
Keep MAX_JOBS within OLLAMA_NUM_PARALLEL and unified memory; on 16 GB Macs start at 1 for 7B–8B models, then tune while watching memory pressure.
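After a run, a quick tally of out/ versus failed/ shows whether the current MAX_JOBS held up. A sketch; batch_tally is a helper name coined here, and it assumes the out/ and failed/ layout the worker writes:

```shell
# Count responses written vs. prompts dead-lettered by the queue worker.
batch_tally() {
  local ok bad
  ok=$(ls out/resp-*.json 2>/dev/null | wc -l | tr -d ' ')
  bad=$(cat failed/prompts-*.txt 2>/dev/null | wc -l | tr -d ' ')
  echo "ok=$ok failed=$bad"
}
batch_tally
```

A rising failed count under a fixed workload is the signal to lower MAX_JOBS or move to a smaller model before tuning anything else.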
## Resource limits: Ollama, macOS, and the batch driver
- Ollama — OLLAMA_NUM_PARALLEL; optional OLLAMA_MAX_LOADED_MODELS if you swap models often.
- Queue — MAX_JOBS ≤ server parallelism after reserving headroom for cron and agents.
- macOS — optional launchd SoftResourceLimits/HardResourceLimits on RAM.
- OpenClaw — stagger automations so agent bursts do not align with Ollama peaks.
## Degradation and retries
The worker retries HTTP errors with exponential backoff. Add model-level steps on top: primary → smaller fallback → truncated prompt → a failed/ dead-letter file. Cap attempts (four in the script) so one bad line cannot block the run.
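That fallback chain can be sketched as a small wrapper. degrade and its injectable runner are names coined here; in a real batch the runner would be run_one bound to a given model, kept injectable so the chain is testable without a live server:

```shell
# Walk a fallback chain until one attempt succeeds.
degrade() {
  local runner="$1"; shift
  local model
  for model in "$@"; do
    if "$runner" "$model"; then
      echo "served-by=$model"
      return 0
    fi
  done
  return 1   # chain exhausted: caller appends the prompt to failed/
}
```

Typical use: degrade call_model llama3.2 llama3.2:1b || printf '%s\n' "$line" >> failed/prompts.txt, so the dead letter only happens after every model has had its capped retries.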
## FAQ
Should Ollama listen on all interfaces? Only if you fully understand the exposure. Prefer loopback plus SSH or VPN, and put auth in front of any routable address.
Where should concurrency limits live? At Ollama (OLLAMA_NUM_PARALLEL), the reverse proxy (rate limits), and the queue (MAX_JOBS). All three together prevent silent overload.
How does OpenClaw relate to Ollama here? OpenClaw runs the Gateway and agents; Ollama serves local LLM HTTP. They coexist on one Mac but should not share one process namespace.
What if jobs time out? Use curl --max-time, backoff, optional smaller model, and a dead-letter file—never infinite loops.
How do I verify both services? curl to /api/tags and /healthz; from off-box, tunnel ports with SSH instead of opening the firewall.
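The off-box tunnel in that last answer looks like this in practice. A sketch; the hostname is a placeholder for your rented Mac, and the echo is left in so the command can be inspected before running:

```shell
# Forward both loopback services to the laptop; local clients then use
# http://127.0.0.1:11434 and http://127.0.0.1:18789 exactly as on the Mac.
REMOTE="${REMOTE:-user@remote-mac.example.com}"   # placeholder host
FORWARDS="-L 11434:127.0.0.1:11434 -L 18789:127.0.0.1:18789"
echo "ssh -N $FORWARDS $REMOTE"   # drop the echo to open the tunnel
```

With the tunnel up, the same curl checks against /api/tags and /healthz work from the laptop with no firewall change on the Mac.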