
Updated: 9 May 2026 · Re-evaluated quarterly

Methodology: How we pick the AI behind StarReview review replies

We tested 13 frontier AI models against 53 real Swiss Google reviews in 6 languages. Here's what happened, which models met our bar, and which did not.

Key findings

  • 10 of 13 models tested cleared our rule-based compliance threshold (0 forbidden phrases, 0 em-dashes across 53 reviews); 9 remained after the manual safety pass.
  • mistral-large disqualified — em-dashes in 42% (22/53) of replies.
  • gpt-5 (Chat Completions) added operational details not present in the review — a real liability for a public reply tool.
  • Latency range of compliant models: gpt-4.1-mini (1.2s P50) to gemini-2.5-pro (13.3s P50).
  • We weight safety + grounding above raw speed — the production choice is based on the complete safety, quality, and latency profile.

Why most AI replies on Google reviews are bad

Most review-reply tools use a default LLM without showing how they picked it. The result: generic corporate phrases. "Thank you for your kind words, your satisfaction is our top priority." Replies like that hurt the brand more than they help.

A good reply to a Google review needs three things: grounding in what the customer actually wrote (no invented details), a natural human voice in the correct language and register, and safety on legally sensitive reviews (no admission of liability, no privacy leaks). Which models deliver this isn't visible in datasheets. You have to test.

Our test setup

  • 53 real reviews from Swiss SMB profiles, anonymized
  • 6 languages: French (12), English (25), German (3), Italian (11), Portuguese (1), Chinese (1)
  • 13 frontier models — all current top models from OpenAI, Anthropic, Google, Mistral
  • Rule-based scoring: 37 forbidden-phrase patterns per language, em-dash detection, word count, latency (see the sketch after this list)
  • Manual safety pass on negative DE/FR reviews, because rule-based tests can't measure grounding or tone
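
To make the scoring concrete, here is a minimal sketch of the rule-based pass, in Python. The pattern lists are illustrative stand-ins built from the clichés quoted in this article; the production filters (37 patterns per language) are not public, and latency is measured separately as wall-clock time around the API call.

```python
import re

# Illustrative subset only; the production lists hold 37 patterns per language.
FORBIDDEN_PATTERNS = {
    "en": [r"thank you for your kind words", r"your satisfaction is our top priority"],
    "de": [r"wir wissen, wie wichtig"],
    "fr": [r"votre satisfaction est notre priorité"],
}
DASH_RE = re.compile(r"[\u2014\u2013]")  # em-dash and en-dash, both counted

def score_reply(reply: str, lang: str) -> dict:
    """Apply the rule-based checks to one generated reply."""
    text = reply.lower()
    return {
        "forbidden": sum(1 for p in FORBIDDEN_PATTERNS.get(lang, []) if re.search(p, text)),
        "em_dashes": len(DASH_RE.findall(reply)),
        "words": len(reply.split()),
    }
```

A model is rule-compliant only if every one of its 53 replies scores 0 on both forbidden phrases and em-dashes.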

Sample caveat: the fixture is restaurant-heavy and contains only 3 German-language reviews. The next quarterly evaluation will rebalance toward a Swiss-DE-majority sample.

Results depend on the prompt, API mode, test date, and evaluation set. Rule compliance alone does not prove production suitability — it's a minimum bar.

The 13 models we tested

Sorted by median latency. Lower latency, fewer forbidden phrases, and zero em-dashes are better.

| Model | Latency P50 | Latency P95 | Words (median) | Forbidden phrases | Em-dashes | Status |
|---|---|---|---|---|---|---|
| gpt-4.1-mini | 1.2s | 1.9s | 37 | 0 | 0 | clean |
| gpt-4.1 | 1.3s | 3.0s | 32 | 0 | 0 | clean |
| claude-haiku-4-5 | 1.6s | 2.0s | 37 | 0 | 1 | em-dash in DE |
| mistral-large | 1.9s | 6.0s | 42 | 0 | 22 | em-dash spam (42%) |
| gpt-5-codex | 2.3s | 4.0s | 40 | 0 | 0 | clean |
| gpt-5.2-codex | 2.3s | 5.1s | 36 | 0 | 0 | clean |
| claude-sonnet-4-6 | 3.1s | 4.2s | 43 | 0 | 0 | clean |
| claude-opus-4-7 | 3.3s | 5.4s | 41 | 0 | 0 | clean |
| gemini-2.5-flash | 4.7s | 7.0s | 51 | 1 | 0 | forbidden phrase in DE |
| o4-mini | 5.3s | 12.3s | 36 | 0 | 0 | clean |
| gemini-3-pro-preview | 10.1s | 14.7s | 40 | 0 | 0 | clean |
| gpt-5 | 11.6s | 22.7s | 44 | 0 | 0 | fabricated details (see below) |
| gemini-2.5-pro | 13.3s | 19.9s | 37 | 0 | 0 | clean |
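
For reference, the two latency columns are plain percentiles over the per-reply wall-clock timings. A minimal sketch, assuming one latency sample per reply (the function name is ours, not a published API):

```python
from statistics import quantiles

def latency_percentiles(latencies_s: list[float]) -> tuple[float, float]:
    """Return (P50, P95) in seconds from per-reply latencies.

    With only 53 samples per model, P95 is an interpolated estimate
    rather than an exact order statistic.
    """
    qs = quantiles(latencies_s, n=100, method="inclusive")  # 99 cut points
    return qs[49], qs[94]  # 50th and 95th percentiles
```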

What the data shows — failure modes worth naming

mistral-large: em-dash spam

22 of 53 replies (42%) contained em-dashes (—) or en-dashes (–). That's a classic AI tell that immediately breaks the "sounds like a real person" effect. Disqualified.

gemini-2.5-flash: forbidden phrase in DE

Gemini Flash produced one of our forbidden cliché phrases ("we know how important it is...") in one of three German replies. With only 3 DE tests, a single hit is a 33% failure rate; that thin German sample is exactly what the next quarterly run will rebalance.

claude-haiku-4-5: em-dash in DE

One German reply contained an em-dash. Marginal, but enough to drop Haiku out of the compliance top group.

A reasoning model: fabricated operational details

In our run this affected gpt-5 via Chat Completions. The model produced rule-compliant replies but added details that weren't in the review. On a FR 2★ review about a brunch, it replied: "Nous avons revu la formule à 32 CHF: pain artisanal, jambon de meilleure coupe, omelette plus généreuse..." ("We have reworked the 32 CHF set menu: artisanal bread, better-cut ham, a more generous omelette..."). None of those items were in the review. For a public review-reply tool, that's a real risk: the owner can't publicly claim to have done things they never did. This is not a general verdict on the model; it's a result for this prompt, this API setup, and this task.
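
Our grounding check is manual, but part of it can be pre-flagged automatically. A heuristic sketch (ours, not the production check): extract numeric claims such as prices from the reply and flag any that never appear in the review. It would catch the invented "32 CHF" above, though not invented items like "pain artisanal".

```python
import re

# Numbers with optional decimals and an optional CHF/percent suffix.
FIGURE_RE = re.compile(r"\b\d+(?:[.,]\d+)?\s*(?:chf|fr\.|%)?", re.IGNORECASE)

def ungrounded_figures(review: str, reply: str) -> list[str]:
    """Return numeric claims in the reply that the review never mentions.

    A pre-filter for the human pass, not a grounding proof.
    """
    seen = {m.strip().lower() for m in FIGURE_RE.findall(review)}
    return [m.strip() for m in FIGURE_RE.findall(reply)
            if m.strip().lower() not in seen]
```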

Only a small subset of models met our bar

Of 13 models tested, 10 cleared the rule-based compliance bar (0 forbidden phrases, 0 em-dashes). The manual safety pass, which tests grounding, tone, and truthfulness, removed gpt-5 for fabricated details and narrowed the set further.

We do not publicly disclose which model we run in production. That's part of our competitive position. What is public is our methodology, our criteria, and our data — verifiable and reproducible.

Our selection criteria

We weight models against these criteria, in this order:

  1. Safety: no fabrication, no admission of liability, no privacy leaks about other customers or staff
  2. Grounding in the review: replies pick up what the customer actually wrote. No generic phrases, no invented details.
  3. Swiss multilingual depth: consistent quality in DE-CH, FR-CH, IT-CH, with correct registers (Sie / vous / lei) and Swiss idioms
  4. Tone fit by industry: warm for restaurants, de-escalating on negative reviews, professional-empathetic in healthcare
  5. Latency: under 5 seconds P95 — good UX in the dashboard when the owner clicks "generate reply"
  6. Cost: scalable at SMB volume (hundreds to thousands of reviews per month)

We do not optimize for raw speed. The fastest model in our test (gpt-4.1-mini, 1.2s P50) is roughly 2.5× faster than our pick, but its replies were generic and shallow on grounding. We trade speed for quality.

What we don't disclose — and why

We don't name the model we run in production. We re-evaluate quarterly, and the answer to "which model is best?" changes when new ones launch. We want to switch without rewriting marketing claims. The real advantage is in the ongoing eval process: criteria, test cases, manual review, and the willingness to switch models. Methodology and criteria are public; the implementation is ours.

What you get with StarReview

In our May 2026 eval run, the model configuration behind StarReview met our threshold: 0/53 forbidden phrases, 0 em-dashes, and 0 fabricated operational details; P95 latency under 5 seconds across all 6 tested languages. Concretely:

  • Replies that sound like a real owner — 37 language-specific filters block AI clichés before publication
  • Grounded in the actual review — 0/53 fabricated details on the manual safety pass; several non-chosen models added operational details
  • Swiss multilingual tone in DE-CH, FR-CH, IT-CH, and English — tested on 53 real reviews
  • Forbidden-phrase filters with 37 patterns per language, built from identified AI clichés in real Swiss reviews
  • Quarterly re-evaluation — next run August 2026; we switch if a new model exceeds our threshold

How we tune for industry specifics

Review replies carry industry-specific risks. For a doctor's practice, it is about Swiss professional secrecy (StGB Art. 321) and what the reply may reveal about the patient relationship. For a garage, it is about repair promises made without a written agreement. For a restaurant, it is about blanket hygiene assurances given without a concrete corrective measure.

StarReview is tuned differently per industry:

  • Industry-specific system prompts with built-in caution rules
  • Industry-specific phrase filters that route typical risk wording to human approval or have it regenerated before publication (sketched below)
  • In healthcare, every reply requires human approval regardless of star rating
  • Smoke tests on every model update against typical failure modes per industry
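
To illustrate the routing described above: healthcare always goes to a human, and elsewhere a risk-pattern hit blocks auto-publication (here it triggers regeneration). The patterns and route names below are invented examples, not our production rules.

```python
import re
from enum import Enum

class Route(Enum):
    PUBLISH = "publish"
    REGENERATE = "regenerate"
    HUMAN_APPROVAL = "human_approval"

# Invented examples; the real per-industry patterns are not public.
RISK_PATTERNS = {
    "garage": [r"\bwe (will|can) fix (it|this) (for free|at no cost)\b"],
    "restaurant": [r"\bour kitchen is (always|completely) (clean|spotless)\b"],
}

def route_reply(industry: str, reply: str) -> Route:
    """Route a draft reply: healthcare always needs human approval;
    a risk-pattern hit elsewhere sends the draft back for regeneration."""
    if industry == "healthcare":
        return Route.HUMAN_APPROVAL
    for pattern in RISK_PATTERNS.get(industry, []):
        if re.search(pattern, reply, re.IGNORECASE):
            return Route.REGENERATE
    return Route.PUBLISH
```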

We do not publish the exact patterns and prompts; they are part of our competitive stack.

Sign up and get 2 months free with StarReview

While we wait for Google API approval, we're collecting signups.

Sign up by 31 May: 2 free months from June for new reviews

Sign up now →