Can a 35B local model write your unit tests? I made it try.
Twelve dispatches across five local OSS models on a 128 GB MacBook Pro M5 Max. The result reframes most of what people argue about when they argue about local LLMs.
The 2026 question — the one that nudged me into running this experiment — is whether local OSS models have finally caught up to the point where they can do real autonomous engineering work on consumer-grade hardware. Not “answer questions about code.” Not “draft a function.” Genuinely do the kind of bounded ticket a junior engineer might be assigned: read the implementation, understand the existing tests, fill in the gaps, run the suite, and stop when the work is done.
I gave that exact task — fill missing unit tests in a security-sensitive refresh-token rotation feature — to five local models. Same machine, same harness, same byte-for-byte prompt. Each model ran twice: once with explicit TODO markers showing it where the gaps were, and once with the markers stripped and only “review the M6 changes and write the tests you’d want before approving the PR.” Twelve dispatches in total once you count the reruns as well.
What came back wasn’t the comparison I expected. The headline finding wasn’t about parameter counts or quantization. It was about something almost embarrassingly mundane.
Why local matters now
Three things converged in 2025–2026 that made this experiment worth running. First, hardware caught up. 128 GB of unified memory on an M-series Apple Silicon laptop comfortably runs 70B-class dense models at Q5/Q6 and MoE 100B+ at MXFP4 to Q8. The “but it won’t fit on your laptop” objection is over for almost everything except the genuine frontier-class.
Second, OSS lineage matured. Qwen 3.x, gpt-oss-120b, Hermes 4 — these are 2025/2026 releases, all explicitly fine-tuned for agentic workloads. Native tool-call support, long-context handling, multi-turn coherence. The 2023-era “the OSS model gets confused after three turns” critique no longer applies.
Third, compliance pressure. Anthropic’s OAT-related account-ban risk in early 2026 made running long-lived autonomous agents on a hosted API a non-starter for some company workloads. Local-only became values-aligned, not just frugal.
The framing isn’t “local replaces frontier.” It’s hybrid: frontier-tier (Claude Code, in my case) for genuinely hard problems where capability matters more than cost or latency. Local OSS for the ongoing agentic loops that benefit from being free, cached, fast, and auditable. The experiment in this post measures the second half of that stack.
The setup
The hardware: MacBook Pro M5 Max, 128 GB unified memory, macOS 26.4. The model server: LMStudio with its OpenAI-compatible endpoint. The agent runtime: OpenClaw — manages workspace isolation, tool routing, and prompt assembly. The codebase under test: a real production refresh-token rotation service with hashed storage, replay detection, pessimistic-write rotation locks, and a boot-time guard against shipping the dev placeholder pepper to non-dev environments.
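For orientation, here is roughly what the wiring looks like: any OpenAI-compatible client can drive LM Studio's local server, which listens on port 1234 by default. This is a minimal sketch, not OpenClaw's actual internals; the model id and the prompt are placeholders.

import OpenAI from "openai";

// LM Studio ignores the API key, but the client library requires a non-empty string.
const client = new OpenAI({
  baseURL: "http://localhost:1234/v1",
  apiKey: "lm-studio",
});

const response = await client.chat.completions.create({
  model: "qwen3.6-35b-a3b", // whatever identifier the local server exposes
  messages: [
    { role: "user", content: "Fill the TODO markers in refreshTokenService.test.ts" },
  ],
});

console.log(response.choices[0].message.content);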
Five models in the lineup:
Model                    Total / Active params   Quant           Resident RAM
─────                    ─────────────────────   ─────           ────────────
gpt-oss-120b             120B / ~5B (MoE)        MXFP4 (4-bit)   63 GB
qwen3.6-27b-mlx          27B / 27B (dense)       MLX 4-bit       35 GB
qwen3-coder-next-mlx     ~80B / ~3B (MoE)        MLX 8-bit       85 GB
qwen3.6-35b-a3b @ Q8     35B / ~3B (MoE)         Q8 (8-bit)      38 GB
Qwen3.5-122B-A10B        122B / ~10B (MoE)       MXFP4 (4-bit)   60 GB

Each one had to fill four specific test gaps in the M6 codebase. One of those gaps was deliberately tricky — a describe("assertRefreshTokenPepperConfigured", ...) block that requires per-test mocking of the config module using await import("config") plus mockImplementationOnce, while juggling process.env.NODE_ENV. The other three were simple boundary cases. Baseline was 77 passing tests across the affected files; the canonical “expert with full context” answer hits 87.
Two prompt variants. The TODO-driven version had explicit hints — “fill these specific gaps.” The open-ended version stripped the markers and reframed it: “review the M6 changes, identify coverage gaps, write tests where you see fit, stop when you’d approve the PR.” Same surface area, different scope of judgment required.
Beyond the model selection, every variable was frozen. Prompt byte-identical. Base commit identical. Persona identical. Skills identical. Only the model varied.
Headline finding #1 — Harness > model
The single starkest data point in the whole experiment doesn’t compare two models. It compares one model to itself.
When I first tried running gpt-oss-120b on a related-but-broader task in an earlier harness, it produced zero tests, no commit, and a 30-minute timeout. Same hardware, same model, same weights. The thing that changed wasn’t the model. It was the harness:
Same gpt-oss-120b @ MXFP4, broad task vs focused harness:

                  Baseline (broad)             Final (focused)
                  ──────────────────────────   ──────────────────────────────
Task scope        Write whole test suite       Fill 4 markers
Persona           "Do NOT modify code"         "Test-file auth permitted"
Skill context     51 KB (browser bloat)        18 KB
Test iteration    Full suite (1–3 min/run)     Single file (1.5 s/run)
Result            ★ 0 added, 30-min TIMEOUT    ★ 84/87 added, 2.4 min
Tool calls        51 (thrashing)               ~10 (focused)
Session JSONL     758 KB                       ~50 KB

The model already had the capability the whole time. The harness was preventing it from being expressed. Bounded scope, persona alignment, irrelevant-context pruning, fast iteration loops — these are the difference between “this model is useless” and “this model is genuinely productive.”
This generalizes. Anyone evaluating a local model against an open-ended “does it work?” prompt is mostly testing their own harness design, not the model. The local-OSS-model discourse over-indexes on parameter counts and quantization choices. The leverage is somewhere else entirely.
Headline finding #2 — The diminishing-returns curve isn’t what you’d expect
Walk through the lineup ordered by RAM footprint, looking only at the bounded TODO task:
Model (by RAM footprint)    Wall clock   Tests passed
────────────────────────    ──────────   ────────────
27B dense                   7.25 min     87/87
35B-A3B Q8 MoE              2.5 min      88/87 ★
120B MoE @ MXFP4            2.4 min      84/87
122B-A10B MXFP4 text        2.85 min     88/87 ★
80B coder-next 8-bit MoE    36.6 min ⚠   87/87

Bigger ≠ better. The two fastest performers — gpt-oss-120b at MXFP4 and qwen3.6-35b-a3b at Q8 — both hit at-or-above-canonical scores while running 14–15× faster than the 80B coder-next. The 80B model isn’t worse at the task. It’s just, on this hardware, paying a real cost for its size that isn’t matched by a quality gain.
The shape of the curve is non-monotonic: the comfortable middle (35–63 GB resident, MoE with low active params) wins on both speed and quality. For daily agentic work, the biggest model your machine can technically load isn’t the most useful one. Pick the comfortable middle and you get the speed of small + the quality of large.
That’s not a finding I expected going in. The default cultural assumption — “more parameters, better outputs” — turns out to be undermined the moment you measure end-to-end task latency rather than benchmark accuracy.
The pepper-config discriminator
The four gaps weren’t all created equal. Three of them were boundary conditions any model could write blindfolded. The fourth was the real test:
// TODO(qa-review-2): Write a full describe("assertRefreshTokenPepperConfigured", ...)
// block. Each case needs to override the `config` mock per-test using:
// await import("config")
// mockImplementationOnce
// while also juggling process.env.NODE_ENV.
// Cover ~7 cases: dev placeholder in dev env (allowed), dev placeholder in prod (refused),
// missing pepper, empty string, whitespace-only, valid pepper, NODE_ENV unset.

This pattern is unusual. Most Jest test setups use jest.mock(...) at the top of the file. This codebase uses ESM, jest.unstable_mockModule(...), and per-test mock overrides via dynamic imports. A model has to recognize the existing convention from the surrounding file, not invent a new one.
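For readers who haven’t met this convention, here is a minimal sketch of the shape being asked for. The module name ("config"), the export (getConfig), the service path, and the pepper values are hypothetical stand-ins; the real files re-import the mocked module per test via await import("config"), while this sketch keeps a direct handle to the mock for brevity.

import { afterEach, describe, expect, it, jest } from "@jest/globals";

// Register the ESM mock before the module under test is imported (hypothetical names).
const getConfig = jest.fn(() => ({ refreshTokenPepper: "a-real-pepper" }));
jest.unstable_mockModule("config", () => ({ getConfig }));

// Dynamic import so the service resolves against the mocked config module.
const { assertRefreshTokenPepperConfigured } = await import("../src/refreshTokenService");

describe("assertRefreshTokenPepperConfigured", () => {
  const originalEnv = process.env.NODE_ENV;

  afterEach(() => {
    process.env.NODE_ENV = originalEnv;
  });

  it("refuses the dev placeholder pepper outside dev", () => {
    process.env.NODE_ENV = "production";
    // The per-test override the TODO marker asks for.
    getConfig.mockImplementationOnce(() => ({ refreshTokenPepper: "dev-placeholder" }));
    expect(() => assertRefreshTokenPepperConfigured()).toThrow();
  });
});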
Every model in the lineup got this gap right. But they got it right in different ways.
gpt-oss-120b consolidated the 7 cases into a single it.each([...]) block — terse, idiomatic.
qwen3.6-27b-mlx wrote 7 separate verbatim it() blocks — verbose but correct.
qwen3-coder-next-mlx wrote them out longhand and added a comment explaining its understanding of the pepper-config policy.
qwen3.6-35b-a3b at Q8 wrote 8 cases instead of 7, slipping in an extra defensive boundary check.
Qwen3.5-122B-A10B did similar — 8 cases including a “literal zero pepper-string” defensive variant nobody else covered.
What none of them did was the failure mode I half-expected: invent a top-level jest.mock and not engage with the existing pattern. That’s the failure shape you’d see from a less-mature OSS model. Every model in this lineup recognized the convention from context. The difference between them was style, not capability.
That distinction — capability vs style — became the central thread once the open-ended prompt entered the picture.
Headline finding #3 — Personality emerges with autonomy
The TODO-driven prompt measures obedience. The open-ended prompt measures taste. The same five models, given autonomy instead of a checklist, produced visibly different artifacts. The metric isn’t which one wrote the most tests — it’s which voice each one had.
gpt-oss-120b — the surgical senior
Identified four gaps. All critical or important. No padding.
It filed its work in a new separate file — refreshTokenService.additional.test.ts — preserving the original test layout and making the diff trivial to review. It included an explicit “what I decided NOT to test” section: DB transaction failures (out of unit scope), happy-path (already covered), perf (wrong tool). Highest confidence in its sign-off line:
“No further testable gaps remain without venturing into integration or DB-level testing.”
Fastest, cleanest, most opinionated. The model behaved like a senior engineer who knows when to stop.
qwen3.6-27b-mlx — the eager junior
Identified eleven-plus gaps in elaborate Risk-graded markdown tables (Critical / Medium / Low). Strong threat-modeling vocabulary:
“prod ships with dev pepper = everyone can forge tokens.”
Wrote 33 tests across the auth and refresh files (didn’t get to userService).
It also hit the 60-minute cap with seven tests still failing and never wrote the wrap-up reply.
Ambitious without time-management. Quality-aware in vocabulary; undisciplined in scope. A junior engineer with senior-sounding language but not the maturity to know when to ship.
qwen3-coder-next-mlx — the methodical perfectionist
Identified seven specific gaps in a numbered analysis — including a couple nobody else found:
getLifetimeMs() lifetime configuration
“Concurrent rotation (pessimistic write lock) — not testable with current mocks but the lock is critical” (senior-level scope honesty about the limits of a unit-test-only approach)
Showed its own debugging journey in the reply:
“There are still failures. Let me check the full output: ... The issue is that Jest caches the config module.”
It was the only model in the lineup to volunteer actual coverage percentage in its reply (85.07% statements), referencing the AGENTS.md target of 80%+ for agent-authored code. Modified all three canonical files; all 22 tests pass.
qwen3.6-35b-a3b @ Q8 — the rushed generalist
Identified five gaps — concise, mostly correct. The most concrete description of the replay-detection scenario in any reply: “Reusing a token after successful rotation (first rotate succeeds, then present the same original → reuse detection).”
Wrote 20 tests across all three files. Three failed. Brief reply, no explicit sign-off statement.
Balanced but rushed. Fast and broad in coverage; didn’t catch its own failures before stopping.
Qwen3.5-122B-A10B — the methodical + self-debugging range model
Identified seven specific gaps in a numbered list with risk language. One unique find: rotateRefreshToken with literal zero expiresAt — defensive against the falsiest input. Skipped authenticationService entirely as a judgment call, deciding it didn’t need additional coverage.
Real-time self-debugging visible in the reply (same trait as coder-next; same architectural family qwen3_5_moe):
“I see two issues — let me read the implementation to check what string the dev placeholder actually is.”
“mockTrxUpdate was called with 3 arguments. Let me fix my test.”
21 new tests, all passing. 29 minutes wall clock — the longest of the open-ended runs that finished cleanly. The price of doing the work right.
Headline finding #4 — The “range” model
Most models in the lineup specialized:
gpt-oss-120b: fast and bounded. Great at TODO, minimal on open-ended.
qwen3.6-27b-mlx: thorough but slow. Good at TODO, ran out of clock on open-ended.
coder-next: methodical, only really good when given autonomy. Slow on TODO.
qwen3.6-35b-a3b @ Q8: quick and shallow. Great TODO, rushed open-ended.
Qwen3.5-122B-A10B is the only model that did both well. TODO at 88/87 in 2.85 minutes; open-ended at 98/0 (98 tests, zero failing) in 29 minutes. The same model that finished the bounded task in under three minutes with one extra defensive case became the model that identified seven distinct gaps and debugged its own failing tests live across half an hour of autonomous work.
The unique signal — visible in the session transcript — is the real-time self-debugging. Watching the model in autonomous mode notice that one of its own tests was failing, hypothesize the cause, read the implementation again, and rewrite the test. Only Qwen3.5-122B-A10B and qwen3-coder-next showed this behavior cleanly. They share the qwen3 MoE architectural family.
If your daily workload is unpredictable — sometimes bounded, sometimes open-ended — and you have the RAM headroom for ~60 GB resident, this is the model. If your workload is mostly bounded, qwen3.6-35b-a3b at Q8 is faster and lighter and gets the same TODO score. If your workload is mostly judgment-heavy, coder-next-mlx at MLX 8-bit is more methodical at the cost of speed.
Headline finding #5 — Cache stability is the dominant cost driver
This is the finding that surprised me most, and it pointed me at the bigger one I didn’t see in time for this article.
Conventional wisdom for local LLMs is “Q4 is the sweet spot — it’s smaller and faster than Q8.” That’s true for one-shot generation — write me a poem, summarize this article, single forward pass. The Q4 model produces roughly the same quality output, faster.
It is not true for sustained agentic loops with growing context. Two runs of the same model at different quants:
Same qwen3.6-35b-a3b model at two quantizations:

                       Q8 (8-bit)   TurboQuant MXFP4 (4-bit)
                       ──────────   ────────────────────────
Cache full-miss rate   2.3%         26.5% ⚠
TODO wall clock        2.5 min      3.75 min (50% slower)
Tests                  88/87        85/87

Same model. Same hardware. Same task. Wall clock 50% worse on the lower-precision quant.
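A rough back-of-envelope shows why full misses dominate wall clock in an agentic loop. Every number below is an illustrative assumption, not a measurement from these runs:

// Hypothetical figures: context size, prefill speed, and turn count are illustrative only.
const promptTokens = 20_000;        // accumulated context late in a session
const prefillTokensPerSec = 1_200;  // assumed local prefill throughput
const turnsInSession = 20;          // assumed tool-call turns

// A full cache miss means re-prefilling the entire accumulated context from scratch.
const secondsPerFullMiss = promptTokens / prefillTokensPerSec; // ≈ 17 s

for (const missRate of [0.023, 0.265]) {
  const wastedSeconds = turnsInSession * missRate * secondsPerFullMiss;
  console.log(
    `${(missRate * 100).toFixed(1)}% full-miss rate ≈ ${Math.round(wastedSeconds)} s of redundant prefill per session`
  );
}

On those assumed numbers, moving from 2.3% to 26.5% costs on the order of a minute and a half of pure re-prefill per session, which is roughly the wall-clock gap the table shows.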
One OpenClaw config change — one I didn’t yet know to try during these runs — drops the same TurboQuant model’s full-miss rate from 26.5% to 4.5%, and wall clock falls by another 33%. That’s a bigger lever than quantization. I’ll cover it in Part 2, where the focus shifts to optimization.
What this means for hybrid AI stacks
If you’re building anything that runs agents on a local M-series box, here’s what the data actually says, in priority order:
The harness is the leverage. Bounded scope, persona aligned to the task, irrelevant context pruned, fast iteration loops. Spend a day on this before you spend a week comparing models.
The comfortable middle wins. ~38–63 GB resident, MoE with low active params, Q8 if you have RAM. The biggest model you can technically load is rarely the most useful one for daily work.
Pick by task shape, not headline benchmark score. Bounded tasks reward fast and obedient. Open-ended tasks reward slow and self-debugging. One-size-fits-all is the wrong frame.
Cache stability is the dominant cost driver on agentic workloads with growing context. Bigger effect than quant choice, model size, or sampler config. Part 2 names the lever.
One model in the current generation actually has range. Qwen3.5-122B-A10B handles both task shapes cleanly. Most others specialize.
The bigger reframe is about what local models are for in a hybrid stack. They’re not “frontier on a budget.” They’re a different beast: free at inference time, fully auditable, can run on a plane or in a SCIF, and — once you’ve put the harness work in — competitive on a meaningful slice of agentic work. That’s enough to anchor the ongoing-loop layer of an AI stack while the frontier-tier handles the genuinely hard one-shot reasoning.
Run inventory
Part 1’s data: 14 dispatches across the lead five models (one TODO + one open-ended per model + one Q8-vs-MXFP4 comparison + one pre-experiment baseline + one headroom rerun on qwen3-coder-next-mlx for the 47-minute datapoint).
The full series, Parts 1–3 combined: 66 dispatches across 12 models, on a 128 GB Apple Silicon M5 Max, all on the same 4-marker test-coverage task or its open-ended variant. ~11 MB of artifacts across 66 run folders. Each folder contains qa-reply.json (full JSON envelope), runner.log (timestamps), npm-test-output.txt (pass/fail counts), agent-output/*.test.ts (the agent’s diff vs canonical), and the OpenClaw session jsonl trajectory.
Models tested across the series:
gpt-oss-120b @ MXFP4 MoE
qwen3.6-27b-mlx dense
qwen3-coder-next-mlx MoE
qwen3.6-35b-a3b @ Q8 MoE
qwen3.6-35b-a3b @ TurboQuant MXFP4 MoE
qwen3.5-122B-A10B @ MXFP4 text MoE
hermes-4-70b dense
llama-3.3-70b @ Q4 dense
qwen3-32b @ Q8 dense

What is NOT claimed in this article:
That single-dispatch numbers are statistically representative. Per-turn output length is stochastic — identical configs produced 90s and 196s wall clocks on the same task in two consecutive runs. Treat single-shot numbers as directional, not precise.
That findings transfer to non-Apple-Silicon hardware. MLX is the inference path here. Behavior on CUDA / ROCm / llama.cpp on x86 is not measured.
That quantization causes cache invalidation. I had that hypothesis. Part 2 disproves it with a fuller dataset and identifies the actual lever.
Caveats, in plain sight
Single-point measurements have run-to-run variance: roughly 10–15% on wall clock, larger on cache miss rate. Findings here rest on differences large enough — typically more than 2× — to exceed that variance. Single-percent comparisons would need multiple runs.
qwen3.6-27b-mlx is technically 27B, but the Qwen 3.5/3.6 family bundles vision-encoder parameters into the count. Effective language-reasoning capacity is around 25B, with ~2B vision encoder dead weight on a text-only task. The performance numbers stand; the headline param count flatters the model slightly.
qwen3-coder-next-mlx’s 36.6-minute TODO time was originally attributed to RAM pressure (apps open, headroom tight). The headroom rerun with all apps closed came in slower — 47 minutes. Something else is going on. That story is in Part 2.
All runs were at 64K context, parallelism 1. Raising context to 100K probably would have improved cache stability for the more ambitious dispatches. The numbers in this article are conservative on that axis.
The Hermes 4 70B model isn’t in the Part 1 lineup. The reason is interesting enough to deserve its own treatment in Part 2.
What’s in Part 2
The story so far is the clean version. Five models that completed both task shapes without major asterisks, on a deliberately consistent harness that was left unoptimized for all runs.
Part 2 leads with the cache-stability story — the one teased above. What changed, what it cost, and what 26.5% → 4.5% miss rate actually looks like once it propagates across the whole lineup. Most of the per-model findings in this piece get an asterisk; some get rewritten outright.
After that, the messy stuff. The headroom paradox (closing apps made one model slower, against every expectation). The model that violated stated constraints once you swapped its quant — except, on second look, the constraint violation didn’t really live in the quantization at all. The model that was fully tool-capable at the API level but actively refused to engage inside the agent harness, replying only:
“Good luck! 🍀”.
The interoperability casualties that turn out not to be interoperability problems at all, but behavioural ones.
Part 2 is also where the bridge to your own setup gets practical: how to measure cache stability on a model you’re considering, how to read the LMStudio server logs for the signal hidden in the noise, what to do when a model talks instead of acts.
Methodology, raw artifacts, and every per-dispatch transcript are preserved: five models, two task shapes, twelve dispatches in total once you count the reruns. Each transcript is the byte-for-byte agent reply. Let me know in the comments if you’re interested in any specifics.
Part 2: “Cache is the bottleneck you didn’t see — and four other things the local-model story isn’t telling you.” Coming next.

