2026 cross-region rented remote Mac M4: ONNX Runtime CoreML execution provider batch inference sessions, thread counts, and unified memory queue timeout decision matrix

Apr 14, 2026 · ~9 min · MacCompute Team · Guide

Teams renting a Mac mini M4 in Singapore, Tokyo, Seoul, Hong Kong, or US West often run ONNX Runtime with the CoreML execution provider for batch scoring. Here is one decision matrix for InferenceSession counts, thread caps, batch size, IO, and split queue timeouts, plus Bash exports and a concurrency checklist—no fixed speedup claims. Pair timeouts with the PyTorch MPS vs MLX matrix and native stacks with the Core ML mlmodelc note. Public pricing and purchase pages need no login.

Pain points on a remote M4

  1. Session sprawl. Each InferenceSession holds graph plans, weights, and EP caches in unified memory. Per-request sessions look cheap until memory pressure climbs and ANE compile artifacts contend with NVMe reads.
  2. Thread double booking. ORT intra-op pools, OpenMP, and Accelerate kernels can each claim P-cores. Misaligned caps raise tail latency while CPU still looks busy.
  3. Single-timeout blind spots. One timer mixes queue backlog with CoreML compute, encouraging retries that evict hot caches.
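The split-timeout idea in point 3 can be sketched as a worker step that times the queue wait (Wq) and the compute step (Wc) separately, so backlog and slow EP surface as different statuses. A minimal sketch; `run_batch`, `infer_fn`, and the default budgets are illustrative names, not part of ORT:

```python
import queue
import threading
import time

def run_batch(q, infer_fn, wq_s=0.05, wc_s=2.0):
    """Dequeue under Wq, then run compute under a separate Wc budget.

    Returns (status, payload) so callers can tell queue backlog apart from
    slow CoreML compute instead of retrying blindly.
    """
    try:
        batch = q.get(timeout=wq_s)           # Wq: how long we wait for work
    except queue.Empty:
        return ("queue_timeout", None)        # backlog signal, not a compute failure

    result = {}
    t = threading.Thread(target=lambda: result.update(out=infer_fn(batch)))
    t.start()
    t.join(timeout=wc_s)                      # Wc: compute budget for this batch
    if t.is_alive():
        return ("compute_timeout", None)      # slow-EP signal; do not re-enqueue hot
    return ("ok", result["out"])
```

Emitting distinct statuses is what lets the alerting in the runbook treat backlog and slow compute as separate problems.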

Decision matrix

Use the rows as guardrails, then sweep B and session count against memory pressure and disk queues. Cost hints are qualitative—verify on pricing and region buy vs rent.

| Profile | Warm sessions | Intra / inter threads | Batch B | Built-in NVMe IO | Timeouts Wq / Wc | Day vs monthly hint |
| --- | --- | --- | --- | --- | --- | --- |
| Steady API with warm weights | 16 GB: 1 primary; 24 GB: 1–2 with hard cap | Start intra 2–4, inter 1; lower if OMP oversubscribes | Raise B until p95 bends, then step back | Prefetch once; avoid parallel full-model copies | Short Wq; Wc to batch p95 plus slack | Monthly fits flat QPS |
| Burst jobs from CI or nightly scoring | Reuse one session per model hash; rebuild only on hash change | Keep inter 1 unless graphs are embarrassingly parallel | Moderate B; favor stable latency | Stage ONNX and external data to local NVMe before loops | Tight Wq; wider Wc during profiling only | Day-rate windows for short spikes |
| Shared tenant host | Global semaphore per model family | Prefer fewer threads per tenant; fairness over saturation | Small B per tenant plus admission control | Isolate scratch prefixes; watch shared write spikes | Expose both timeouts to metrics; degrade before hard kill | Mid monthly if isolation needs a larger SKU |

No speedup guarantees. CoreML EP depends on opset coverage, precision, and ANE versus GPU routing—reprofile on the rented image.
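One way to keep sweeps reproducible is to encode the matrix rows as explicit starting points. Every number below is an illustrative assumption to overwrite after profiling, and the profile keys are just this article's row labels, not any ORT concept:

```python
# Illustrative per-profile defaults (assumptions, not benchmarks); the keys
# mirror the decision-matrix rows above.
PROFILE_DEFAULTS = {
    "steady_api":    {"warm_sessions": 1, "intra": 2, "inter": 1, "wq_s": 0.05, "wc_s": 2.0},
    "burst_jobs":    {"warm_sessions": 1, "intra": 2, "inter": 1, "wq_s": 0.02, "wc_s": 5.0},
    "shared_tenant": {"warm_sessions": 1, "intra": 1, "inter": 1, "wq_s": 0.05, "wc_s": 1.0},
}

def session_options_for(profile):
    """Map a profile row onto onnxruntime.SessionOptions thread caps."""
    import onnxruntime as ort  # imported lazily so the table is usable standalone
    d = PROFILE_DEFAULTS[profile]
    so = ort.SessionOptions()
    so.intra_op_num_threads = d["intra"]
    so.inter_op_num_threads = d["inter"]
    return so
```

Committing a table like this to the repo makes the "re-benchmark after region moves" step a diff instead of tribal knowledge.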

Executable environment exports and concurrency checklist

Use in systemd, CI shell, or container entrypoints; tune after profiling. Defaults are conservative.

# macOS worker shell — tune after profiling
export OMP_NUM_THREADS="${OMP_NUM_THREADS:-2}"
export OMP_WAIT_POLICY="${OMP_WAIT_POLICY:-PASSIVE}"
export VECLIB_MAXIMUM_THREADS="${VECLIB_MAXIMUM_THREADS:-2}"
export ORT_LOG_SEVERITY_LEVEL="${ORT_LOG_SEVERITY_LEVEL:-3}"  # 0 verbose .. 4 fatal

Python SessionOptions keep threads explicit; the session construction below is a minimal sketch with a placeholder model path:

import onnxruntime as ort

so = ort.SessionOptions()
so.intra_op_num_threads = 2
so.inter_op_num_threads = 1
# CoreML first, CPU fallback; fill provider options explicitly per deploy
providers = [("CoreMLExecutionProvider", {}), "CPUExecutionProvider"]
sess = ort.InferenceSession("model.onnx", sess_options=so, providers=providers)

When you toggle CoreML session options—precision, ANE-only-style flags, or extra provider keys—log the resolved provider string on each deploy. Different ORT minor versions can shift defaults, so behavior can change even when your Bash exports stay the same.
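A minimal sketch of that deploy-time log. `log_resolved_providers` is a hypothetical helper; `get_providers()` is the InferenceSession call that returns the resolved order, which is the string worth pinning in logs:

```python
import hashlib
import logging

def log_resolved_providers(sess, model_path):
    """Log the provider order ORT actually resolved, keyed by a model hash.

    `sess` is an onnxruntime.InferenceSession (or anything exposing
    get_providers()); the short sha256 digest ties the log line to the
    exact ONNX file that was loaded.
    """
    with open(model_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()[:12]
    providers = sess.get_providers()
    logging.info("model=%s providers=%s", digest, providers)
    return providers
```

Run it once per deploy and diff the line across ORT upgrades; a silent fallback to CPUExecutionProvider shows up immediately.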

  • Cap warm sessions before raising B.
  • One dominant OpenMP team per process; split cores across processes deliberately.
  • Log session id, model hash, providers, and batch wall time.
  • On overload: shrink B, then adjust Wc, then add nodes—not blind retries.
  • Surface failures with signed DLQ or summary hooks instead of silent drops.
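The first bullet and the shared-host row can share one guard: a process-wide bounded semaphore per model family, so raising B never multiplies in-flight sessions. The names and the `max_inflight` default are illustrative assumptions:

```python
import threading

# Hypothetical per-model-family caps; one bounded semaphore per family keeps
# a shared host fair without building per-tenant thread pools.
_FAMILY_SEMAPHORES = {}
_REGISTRY_LOCK = threading.Lock()

def family_slot(family, max_inflight=2):
    """Return the singleton semaphore for a model family (created on first use)."""
    with _REGISTRY_LOCK:
        return _FAMILY_SEMAPHORES.setdefault(
            family, threading.BoundedSemaphore(max_inflight)
        )

def run_with_cap(family, fn, *args):
    """Run fn under the family cap; callers block instead of spawning sessions."""
    with family_slot(family):
        return fn(*args)
```

Note that `max_inflight` only applies when a family's semaphore is first created; changing the cap later means rebuilding the registry deliberately, which is the point on a shared host.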

Runbook: five steps before you scale concurrency

  1. Pin ORT wheel and EP build; record pip freeze and image digest.
  2. Warm once; track p50 and p95 and exclude cold start from SLAs if needed.
  3. Binary-search B; stop when latency variance or memory pressure inflects.
  4. Split Wq and Wc; alert backlog separately from slow EP.
  5. Re-benchmark after region moves; lower RTT does not add cores.
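Step 3 can be automated as a doubling sweep that stops when per-item latency bends past a knee. `search_batch_size` and the `knee_ratio` threshold are assumptions, and `measure_p95` stands in for whatever benchmark harness returns p95 batch latency on the rented image:

```python
def search_batch_size(measure_p95, b_min=1, b_max=256, knee_ratio=1.3):
    """Grow B by doubling while p95-per-item improves; back off at the knee.

    measure_p95(b) must return p95 batch wall time (seconds) at batch size b.
    knee_ratio caps the tolerated per-item regression before stopping.
    """
    best_b = b_min
    best_per_item = measure_p95(b_min) / b_min
    b = b_min * 2
    while b <= b_max:
        per_item = measure_p95(b) / b
        if per_item > best_per_item * knee_ratio:
            break                      # latency bent past the knee; keep best_b
        if per_item < best_per_item:
            best_b, best_per_item = b, per_item
        b *= 2
    return best_b
```

Feed it real p95 measurements only after warm-up, per step 2, or cold-start compiles will masquerade as the knee.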

Signals worth exporting to metrics

  • Resident bytes per session versus compile-heavy peaks.
  • Wall time per item (batch time divided by B).
  • Share of batches near Wc in a ten-minute window.
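The last signal can be computed with a small sliding window over batch wall times. `WcProximityWindow` and its `near_frac` threshold are illustrative choices, not an existing metrics API; export the ratio as a gauge:

```python
import collections
import time

class WcProximityWindow:
    """Track the share of batches whose wall time landed near the Wc budget."""

    def __init__(self, wc_s, window_s=600, near_frac=0.8):
        self.wc_s = wc_s
        self.window_s = window_s        # ten-minute window by default
        self.near_frac = near_frac      # "near" = within 80% of Wc (assumption)
        self.events = collections.deque()  # (timestamp, was_near_wc)

    def record(self, batch_wall_s, now=None):
        now = time.monotonic() if now is None else now
        self.events.append((now, batch_wall_s >= self.wc_s * self.near_frac))
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()       # drop samples older than the window

    def share_near_wc(self):
        if not self.events:
            return 0.0
        return sum(1 for _, near in self.events if near) / len(self.events)
```

A rising share is the early warning to shrink B or widen Wc before hard kills start, matching the overload order in the checklist.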

Pick region and package on public pages

Co-locate workers with weights and features to shorten NVMe staging. Compare nodes without login: Singapore, Japan, South Korea, Hong Kong, US West, plus purchase and blog index.

FAQ

Keep CPUExecutionProvider? Often yes as fallback; measure both paths.

More intra-op threads? Not always—profile instead of assuming linear gains.

Image packages? Check support for the toolchain manifest.

Summary

ORT CoreML EP on a rented M4 needs session reuse, aligned thread caps, measured batch size, and split timeouts under unified memory. Export OMP and vecLib limits, set SessionOptions in code, and re-profile after region changes.

Choose region and plan