Compute renters batching Stable Diffusion img2img on Core ML in Hong Kong, Singapore, Japan, Korea, or US West share one ceiling: unified memory holds weights, graphs, your queue, and hot compile directories together. You get a decision matrix, executable parameters, and links to Core ML compile, ORT Core ML, queue timeouts, and region TCO; pricing and purchase stay public until checkout.
Pain points on a remote M4
- Compile spikes look like slow throughput. First-run Core ML builds and mlmodelc caches share unified memory with tensors; a second session before warm caches often spikes p95 more than mean rate.
- One timeout blends failures. A single wall clock for wait plus steps mislabels WAN staging as model error and thrashes disk cache with retries.
- Object store scratch. Pulling frames over TLS for each item pins CPU in crypto while batch inference looks idle; fix NVMe prefixes before scaling hosts.
Decision matrix
Rows are profiles; tune batch size, sessions, disk cache, and split timeouts (Wq for queue wait, Wc for compute) using the sections below. Reprofile after OS, Xcode, or checkpoint changes.
| Profile | Batch inference shape | Concurrent sessions | Disk cache stance | Timeouts Wq / Wc |
|---|---|---|---|---|
| Overnight bulk img2img | Batch up until resident bytes inflect; fixed resolution ladders | 16 GB: one lane; 24 GB: two if swap stays flat | Local mlmodelc and tile prefix; archive cold bundles off host | Wide Wq; Wc covers compile plus diffusion p95 |
| Low-latency API | Batch one to two; pin steps | Semaphore second lane; 24 GB if compile plus serve co-host | Warm deploy cache; evict cold bundles to secondary disk | Tight Wq; modest Wc; track compile apart from serve |
| Multi-tenant rental slice | Per-tenant batch and resolution caps | Account concurrency cap; export queue depth | Per-tenant TMPDIR on APFS | Shrink batch before widening Wc |
No universal images-per-second claims. ANE and GPU routing depend on ops, precision, and build—treat published tables as guardrails, not SLAs.
Model conversion and batch size
Convert UNet and VAE to mlprogram or supported mlpackage; pin converter version to checkpoint. Raise batch inference until unified memory or planner warnings inflect; cut steps before batch when tails blow out.
See Core ML compile and ORT Core ML matrix if you mix runtimes.
Concurrent session caps
Each queue worker holds graphs and decode state together. Add a second lane only when vm.swapusage stays flat and memory pressure clears across two passes. Mirror the WhisperKit split: Wq bounds queue wait, Wc bounds compute.
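The swap half of that gate could be scripted roughly as follows. This is a minimal sketch, not a tested production check: it assumes `sysctl vm.swapusage` prints the usual macOS form (`total = ...M  used = ...M  free = ...M`), the 1 MB tolerance is an arbitrary placeholder, and `run_pass` in the usage comment is a hypothetical function that executes one batch pass. Memory pressure still has to be checked separately.

```sh
#!/bin/sh
# Sketch: allow a second lane only if swap "used" does not grow between samples.
# Assumption: `sysctl vm.swapusage` prints "... used = 123.45M ..." (macOS format).

# Print swap-used megabytes as a bare number, e.g. 123.45.
swap_used_mb() {
  sysctl vm.swapusage | awk '{
    for (i = 1; i <= NF; i++)
      if ($i == "used") { v = $(i + 2); sub(/M$/, "", v); print v; exit }
  }'
}

# Return 0 when swap grew by no more than tol_mb between two samples.
# tol_mb defaults to 1, which is a placeholder rather than a recommendation.
swap_is_flat() {
  before=$1; after=$2; tol_mb=${3:-1}
  awk -v b="$before" -v a="$after" -v t="$tol_mb" \
    'BEGIN { exit !((a - b) <= t) }'
}

# Usage (run_pass is hypothetical):
#   s0=$(swap_used_mb); run_pass; s1=$(swap_used_mb); run_pass; s2=$(swap_used_mb)
#   if swap_is_flat "$s0" "$s1" && swap_is_flat "$s1" "$s2"; then
#     export SD_MAX_CONCURRENT_SESSIONS=2
#   fi
```

Sampling before and after each of two passes keeps a one-off spike from unlocking the second lane.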
Node selection (Hong Kong, Singapore, Japan, Korea, US West)
Co-locate with your weight bucket: Tokyo / Seoul for northeast Asia; Singapore / Hong Kong for SEA or Greater Bay storage; US West for Pacific artifacts. Benchmark one TLS pull before trusting default queue timeouts.
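One way to benchmark a single TLS pull is with curl's `--write-out` timers, sketched below. The function name `tls_pull_probe` and the URL in the usage comment are placeholders; point it at a representative object in your own bucket.

```sh
#!/bin/sh
# Sketch: time one TLS object pull so Wq can be sized from a measurement
# rather than a default. The body is discarded; only timings are printed.
tls_pull_probe() {
  url=$1
  curl -sS -o /dev/null \
    -w 'connect=%{time_connect} tls=%{time_appconnect} total=%{time_total}\n' \
    "$url"
}

# Usage (placeholder URL):
#   tls_pull_probe "https://your-bucket.example/weights/unet.bin"
```

`time_appconnect` marks the end of the TLS handshake and `time_total` covers the whole transfer, so comparing them gives a rough view of handshake cost versus transfer time from that region.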
Regional pages: Hong Kong, Singapore, Japan, South Korea, US West, purchase; compare packages first.
Cost
Hourly rent plus egress, cold compile time, and retries from collapsed timeouts. Co-locate M4 with storage before raising batch inference without disk cache rules. Re-read region TCO after moves.
Executable parameters
Paste into bootstrap; bands are triage hints only.
# Unified memory anchors and swap (read-only)
sysctl -n hw.memsize
sysctl -n hw.perflevel0.physicalcpu
sysctl vm.swapusage
# Keep Core ML scratch and decode temps off crowded home dirs
export TMPDIR="/Users/Shared/scratch/coreml-sd/${JOB_ID:?set JOB_ID first}"
mkdir -p "$TMPDIR"
# Example knobs—tune per matrix row
export SD_MAX_BATCH=2
export SD_MAX_CONCURRENT_SESSIONS=1
export SD_WQ_SEC=120
export SD_WC_SEC=900
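The two timeout knobs are only useful if failures are labeled against them separately. A minimal sketch of that labeling, assuming whole-second measurements and that SD_WQ_SEC bounds queue wait while SD_WC_SEC bounds compile plus diffusion; `classify_job` and its label strings are illustrative names, not part of any tool:

```sh
#!/bin/sh
# Sketch: label a job by which budget it exceeded, so WAN staging delays
# are not reported as model failures.
# Args: $1 = seconds spent waiting in queue, $2 = seconds spent computing.
classify_job() {
  wait_s=$1; compute_s=$2
  wq=${SD_WQ_SEC:-120}; wc=${SD_WC_SEC:-900}
  if [ "$wait_s" -gt "$wq" ]; then
    echo queue_timeout
  elif [ "$compute_s" -gt "$wc" ]; then
    echo compute_timeout
  else
    echo ok
  fi
}

# Usage:
#   classify_job 130 10    # wait exceeded SD_WQ_SEC
#   classify_job 5 901     # compute exceeded SD_WC_SEC
```

Emitting distinct labels lets dashboards alert on queue wait and compute independently.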
Runbook: five steps before adding hosts
- Pin checkpoint, converter, macOS digest on the host record.
- Warm compile once; label cold start in dashboards.
- Binary-search batch at fixed resolution until p95 or swap ticks.
- Split queue wait versus diffusion metrics; alert separately.
- Re-profile after region moves; RTT is not RAM.
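The batch binary search in the runbook can be sketched as below. `probe_batch` is hypothetical: it should run one fixed-resolution pass at the given batch size and return 0 only if p95 and swap stayed within your limits. The search also assumes that once a batch size fails, every larger size fails too.

```sh
#!/bin/sh
# Sketch: largest batch in [lo, hi] for which probe_batch succeeds.
# Prints 0 when no batch in the range passes.
# Note: plain sh variables are global, so probe_batch should not reuse
# the names lo, hi, mid, or best.
find_max_batch() {
  lo=$1; hi=$2; best=0
  while [ "$lo" -le "$hi" ]; do
    mid=$(( (lo + hi) / 2 ))
    if probe_batch "$mid"; then
      best=$mid; lo=$(( mid + 1 ))   # passed: try larger batches
    else
      hi=$(( mid - 1 ))              # failed: try smaller batches
    fi
  done
  echo "$best"
}

# Usage (after defining probe_batch for your pipeline):
#   export SD_MAX_BATCH=$(find_max_batch 1 8)
```

Each probe is a full pass, so a range of 1 to 8 costs at most four passes instead of eight.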
Citable signals
- Resident bytes per lane versus hw.memsize for sixteen versus twenty-four gigabyte SKUs.
- Share of jobs near Wc over rolling windows for quantization or IO drift.
- NVMe read MB/s versus GPU use to spot cache misses before raising concurrency.
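The share of jobs near Wc could be computed from a job log roughly as follows. The input format (one compute duration in seconds per line) and the 0.9 threshold are assumptions for illustration; feed the function whatever rolling window you keep, for example the last few hundred lines.

```sh
#!/bin/sh
# Sketch: fraction of jobs whose compute time reached at least `ratio` of Wc.
# stdin: one compute duration in seconds per line (hypothetical log format).
# Args: $1 = Wc in seconds (defaults to SD_WC_SEC or 900), $2 = ratio (default 0.9).
share_near_wc() {
  wc=${1:-${SD_WC_SEC:-900}}; ratio=${2:-0.9}
  awk -v wc="$wc" -v r="$ratio" '
    NF { n++; if ($1 >= wc * r) near++ }
    END { if (n) printf "%.2f\n", near / n; else print "0.00" }'
}

# Usage (durations.log is a placeholder file name):
#   tail -n 500 durations.log | share_near_wc "$SD_WC_SEC" 0.9
```

A rising share at a fixed batch size suggests drift in quantization or IO before jobs actually start timing out.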
FAQ
External SSD? Archives only; keep hot Core ML artifacts on internal NVMe.
Does lower RTT fix OOM? No. Placement helps staging only.
Summary
Stable Diffusion img2img on rented M4 needs conversion and batch discipline, session caps, metro placement, and cost that counts disk cache plus WAN—not headline concurrency. Slug: 2026-rent-remote-mac-m4-stable-diffusion-coreml-batch-unified-memory.html.
Ship next to your weights? Apply the matrix, wire sysctl and split timeouts into metrics, then open purchase pages—no login until checkout.