June 19, 2026 · 8 min read

By Ayush Chaturvedi · Independent Entrepreneur · Research by Morpheus

We Tested 5 Non-Frontier AI Models on Founder Work. Qwen Won Everything.

We ran Qwen3.7 Plus, GLM 5.2, DeepSeek V4 Flash, Kimi K2.6, and Mimo V2.5 Pro through three real founder tasks: positioning, pricing, and customer interview synthesis.

Key takeaways

Qwen3.7 Plus won all three non-frontier founder tasks: positioning, pricing, and customer interview synthesis.
GLM 5.2 finished second overall and looks like the safest backup for founder strategy work.
DeepSeek V4 Flash had useful instincts, but the positioning task exposed a dangerous hallucination pattern.
The benchmark was run through OpenCode Go, not Claude Code, Codex, or Grok.
The lesson is not that one model should do everything. The lesson is that founders need a routing table for model work.

Frontier models get most of the attention. Founders often care about a different question: which cheaper, less obvious model is good enough for real operating work?

So we ran a second Founder Model Arena batch. This one removed the celebrity models from the field. No Claude Code. No Codex. No Grok. All five candidates were called through OpenCode Go and tested on work a non-technical founder might actually do this week: positioning, pricing, and customer interview synthesis.

The result was not close. Qwen3.7 Plus won every task. But the useful lesson is not brand worship in a new costume. The useful lesson is that non-frontier models are now strong enough that founders should build a routing table instead of defaulting to the most famous model for every job.

The setup: five non-frontier models, three founder jobs

The candidates were Qwen3.7 Plus, GLM 5.2, DeepSeek V4 Flash, Kimi K2.6, and Mimo V2.5 Pro. They were scored on strategic clarity, specificity and insight, editorial quality, founder usefulness, groundedness, and publishability.
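Six criteria and a 60-point maximum suggest each criterion was scored out of 10. That scale is an assumption here, not something the benchmark write-up states, and the class and field names below are illustrative. Under that assumption, a scorecard could be sketched as:

```python
from dataclasses import dataclass, fields

# Assumption: each criterion is scored 0-10, which yields the 60-point maximum.
MAX_PER_CRITERION = 10


@dataclass(frozen=True)
class Scorecard:
    """One judge scorecard for one model on one task (hypothetical structure)."""

    strategic_clarity: int
    specificity_and_insight: int
    editorial_quality: int
    founder_usefulness: int
    groundedness: int
    publishability: int

    def __post_init__(self):
        # Reject any criterion score outside the assumed 0-10 range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0 <= value <= MAX_PER_CRITERION:
                raise ValueError(f"{f.name} must be 0-{MAX_PER_CRITERION}, got {value}")

    def total(self) -> int:
        # Task score is the plain sum of the six criteria.
        return sum(getattr(self, f.name) for f in fields(self))

    @classmethod
    def max_total(cls) -> int:
        return MAX_PER_CRITERION * len(fields(cls))
```

Keeping groundedness as its own field makes it easy to inspect separately, since an invented customer name can hide behind strong prose in a single total.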

Positioning and homepage rewrite

Winner: Qwen3.7 Plus. Winning score: 53.

Qwen chose the clearest ICP, rewrote the hero with usable conversion logic, and avoided fake customer names or fake metrics. That restraint mattered as much as the copy quality.

Pricing and packaging decision

Winner: Qwen3.7 Plus. Winning score: 54.

Qwen framed the real problem as segmentation, not price level. It also proposed the best zero-build pricing experiment instead of treating pricing as spreadsheet theater.

Customer interview synthesis

Winner: Qwen3.7 Plus. Winning score: 54.

Qwen used confidence labels, separated signal from speculation, and turned messy interview notes into a concrete low-build product and messaging experiment.

One naming note: this is a non-frontier benchmark, not a strict open-source-only benchmark. Qwen3.7 Plus belongs in this batch because the question was practical founder utility outside the usual frontier defaults.

How the test was run

A Hermes agent orchestrated the benchmark: it generated isolated benchmark specs, called each candidate through OpenCode Go, saved raw outputs, produced judge packets, created scorecards, and verified final artifacts.

The final scored window was roughly 24 minutes, with 19 minutes and 4 seconds of measured model-generation time across the 15 final candidate calls. Every final run saved raw candidate outputs, metadata, scorecards, reports, and benchmark definitions for internal review.
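The actual Hermes harness is not published, so the following is only a minimal sketch of the loop described above. `call_model` is a hypothetical callable standing in for whatever invokes a candidate through OpenCode Go, and the directory layout, file names, and task keys are assumptions.

```python
import json
import time
from pathlib import Path
from typing import Callable

MODELS = ["Qwen3.7 Plus", "GLM 5.2", "DeepSeek V4 Flash", "Kimi K2.6", "Mimo V2.5 Pro"]
TASKS = ["positioning", "pricing", "interview_synthesis"]  # hypothetical task keys


def run_batch(
    call_model: Callable[[str, str], str],  # hypothetical: (model, prompt) -> raw output
    prompts: dict,
    out_dir: Path,
) -> float:
    """Run every model on every task, save raw output and metadata,
    and return the summed generation latency in seconds."""
    total_latency = 0.0
    for task in TASKS:
        for model in MODELS:
            start = time.perf_counter()
            output = call_model(model, prompts[task])
            latency = time.perf_counter() - start
            total_latency += latency

            # One directory per (task, model) pair keeps runs isolated for review.
            run_dir = out_dir / task / model.replace(" ", "_")
            run_dir.mkdir(parents=True, exist_ok=True)
            (run_dir / "raw_output.txt").write_text(output, encoding="utf-8")
            (run_dir / "metadata.json").write_text(
                json.dumps({"model": model, "task": task, "latency_s": latency}),
                encoding="utf-8",
            )
    return total_latency
```

Five models across three tasks gives the 15 final calls. For scale, 19 minutes and 4 seconds is 1,144 seconds, or roughly 76 seconds per call on average.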

Models tested: Qwen3.7 Plus, GLM 5.2, DeepSeek V4 Flash, Kimi K2.6, Mimo V2.5 Pro
Founder tasks: positioning, pricing, and customer interview synthesis
Provider surface: OpenCode Go (no Claude Code, Codex, or Grok competitors in this batch)
Scored window: ~24m, from the first final run directory timestamp to the final archive timestamp
Measured model time: 19m 4s, the sum of candidate-generation latency across the 15 final model calls

The final scoreboard

| Rank | Model | Avg score | Task scores | Wins | Readout |
|---|---|---|---|---|---|
| 1 | Qwen3.7 Plus | 53.7 | 53 / 54 / 54 | 3 | Best practical founder output across positioning, pricing, and synthesis. |
| 2 | GLM 5.2 | 48.3 | 47 / 50 / 48 | 0 | Strong reasoning and structure; safest backup when Qwen is unavailable. |
| 3 | DeepSeek V4 Flash | 43.3 | 34 / 49 / 47 | 0 | Useful instincts, but hallucinated proof in the positioning task. |
| 4 | Kimi K2.6 | 39.7 | 36 / 43 / 40 | 0 | Clean in places, but thinner on evidence discipline and confidence labels. |
| 5 | Mimo V2.5 Pro | 38.7 | 41 / 41 / 34 | 0 | Energetic and bold, but too overconfident for final founder decisions. |

Qwen won with an average score of 53.7 out of 60 and three task wins. GLM finished second at 48.3. DeepSeek, Kimi, and Mimo had useful fragments, but each exposed enough risk that they should be supervised more tightly for business-critical output.
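The averages and win counts follow directly from the per-task scores. A short sketch that recomputes them, with scores ordered as positioning, pricing, synthesis (the function name is illustrative):

```python
# Task scores in the order: positioning, pricing, synthesis.
TASK_SCORES = {
    "Qwen3.7 Plus": (53, 54, 54),
    "GLM 5.2": (47, 50, 48),
    "DeepSeek V4 Flash": (34, 49, 47),
    "Kimi K2.6": (36, 43, 40),
    "Mimo V2.5 Pro": (41, 41, 34),
}


def scoreboard(task_scores):
    """Return (model, average, wins) rows sorted by average, highest first."""
    n_tasks = len(next(iter(task_scores.values())))
    wins = {model: 0 for model in task_scores}
    for i in range(n_tasks):
        # A tie would go to the first model listed; no task here is tied at the top.
        best = max(task_scores, key=lambda m: task_scores[m][i])
        wins[best] += 1
    rows = [
        (model, round(sum(scores) / len(scores), 1), wins[model])
        for model, scores in task_scores.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for rank, (model, avg, won) in enumerate(scoreboard(TASK_SCORES), start=1):
    print(rank, model, avg, won)
```

Recomputing the table this way is a cheap check that a scorecard was not mis-transcribed before anyone acts on the ranking.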

The pattern was clear: the top models did not simply sound smarter. They were better at staying inside the evidence. That is the hidden requirement for founder work.

What each result tells founders

Qwen3.7 Plus was the best founder operator

It did not merely write better prose. It showed better judgment about what not to invent. In a founder workflow, that is the difference between a useful draft and a liability.

GLM 5.2 is the credible fallback

GLM was structurally strong and economically thoughtful. It was not as sharp as Qwen in this batch, but it consistently stayed close enough to deserve a slot in a serious founder stack.

DeepSeek V4 Flash is useful upstream, risky downstream

DeepSeek had a strong pricing showing and one excellent customer-research insight. But it also invented company names and impossible proof in the homepage task. Use it for exploration, not final claims.

Kimi and Mimo need tighter supervision for business work

Both had usable fragments. Neither produced enough grounded, founder-ready judgment to beat the top three. The risk is not that they are useless. The risk is that they sound more certain than the evidence allows.

Sample inputs and winning-output summaries

Input: homepage positioning from scattered founder context

A browser extension records Loom-style walkthroughs, trims silence, generates a transcript, and turns the video into a help article. Current positioning says AI video documentation for everyone. Strongest users are small B2B SaaS teams using it to answer repeated support questions.

Representative winning output summary

Qwen picked small B2B SaaS teams as the ICP, translated the pain into support questions answered once, and wrote a hero that could be shipped without inventing fake metrics or fake customer proof.

Input: pricing and packaging from messy usage economics

The founder needs to decide whether churn is caused by price, packaging, usage limits, or poor segmentation. The output must recommend a practical package and a zero-build experiment.

Representative winning output summary

Qwen diagnosed the issue as segmentation before price level. The winning experiment tested willingness to commit with a targeted package offer, rather than changing the entire pricing page and hoping the data would explain itself.

Input: customer interviews with contradictory churn signals

Synthesize messy interview notes into what users really want, what the founder should not build yet, and what low-build experiment should run next.

Representative winning output summary

Qwen separated high-confidence patterns from weaker signals, gave three messaging variants, and proposed a concrete low-build experiment tied to the exact churn shape in the notes.

The founder routing playbook

Use Qwen for founder-facing strategy drafts

Positioning, pricing, messaging, synthesis, and decision memos are all good fits based on this batch. Still verify facts before publishing.

Use GLM when you want a second opinion

GLM was close enough to be a strong challenger. Use it to review Qwen output or force a different economic frame before you commit.

Use DeepSeek Flash for cheap exploration, not proof

It can generate angles and alternatives quickly, but any names, numbers, testimonials, or customer claims need aggressive verification.

Keep the benchmark close to real work

The point is not model fandom. The point is building a routing table from tasks you actually do every week.
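A routing table can be as small as a dictionary. The sketch below encodes the playbook above; the task keys, field names, and the choice to fail on unknown tasks are assumptions for illustration, not part of the benchmark.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    primary: str               # model that writes the first draft
    reviewer: Optional[str]    # model asked for a second opinion, if any
    ok_for_final_claims: bool  # may the output feed public claims after fact-checking?


# Hypothetical task keys; replace them with the work you actually do each week.
ROUTING_TABLE = {
    "positioning": Route("Qwen3.7 Plus", "GLM 5.2", True),
    "pricing": Route("Qwen3.7 Plus", "GLM 5.2", True),
    "interview_synthesis": Route("Qwen3.7 Plus", "GLM 5.2", True),
    "exploration": Route("DeepSeek V4 Flash", None, False),
}


def route(task: str) -> Route:
    """Look up the route for a task; unknown tasks fail loudly instead of
    silently defaulting to a model that was never tested on them."""
    try:
        return ROUTING_TABLE[task]
    except KeyError:
        raise KeyError(f"No route for {task!r}; benchmark it before routing") from None
```

Failing on unknown tasks is a deliberate choice in this sketch: it forces a small benchmark before a new kind of work gets a default model.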

The real result: cheap model stacks need governance, not hype

The blue-pill version of this result is easy: Qwen won, so use Qwen. The red-pill version is more useful: non-frontier models are now good enough that your workflow becomes the differentiator.

A founder using a weaker model with a task-specific rubric, raw-output review, and fact-checking loop can beat a founder using a famous model lazily. The system matters more than the logo.
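One piece of that fact-checking loop can be automated crudely. The sketch below flags numbers and multi-word capitalized names in a model's output that never appear in the source context. It is a rough heuristic written for illustration, not the check used in this benchmark, and it will miss paraphrased or single-word inventions.

```python
import re

# Numbers such as 43, 1,200, 3.5, or 43%.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")
# Two or more consecutive capitalized words, e.g. a company or person name.
PROPER_NOUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def unsupported_claims(source_context: str, model_output: str) -> list:
    """Return numbers and names in the output that do not occur in the source."""
    source = source_context.lower()
    candidates = NUMBER.findall(model_output) + PROPER_NOUN.findall(model_output)
    flagged, seen = [], set()
    for candidate in candidates:
        key = candidate.lower()
        if key not in source and key not in seen:
            seen.add(key)
            flagged.append(candidate)
    return flagged


context = "Strongest users are small B2B SaaS teams answering repeated support questions."
draft = "Trusted by Acme Robotics, cutting support tickets by 43%."
print(unsupported_claims(context, draft))  # ['43%', 'Acme Robotics']
```

Anything the function returns still needs a human decision. The check only narrows where to look, which is the part of raw-output review that is easiest to skip.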

That is why these small benchmarks matter. You are not trying to settle the internet argument about model rankings. You are trying to decide which model gets to touch your positioning, pricing, research synthesis, product roadmap, and public claims.

Caveats before you overfit

This was a three-task founder benchmark, not a universal model ranking.
The judge was model-based, so the result should be treated as directional until a human review pass confirms the most important calls.
Token and cost data are not included because comparable billing data was not captured reliably across the provider surface.

Want the raw artifacts?

We are not publishing the full artifact folder publicly because it contains complete prompts, raw model outputs, and harness details that are easy to misread without context. If you want to inspect the scorecards or sample outputs, reach out and I will share the relevant subset.

