When Do Several Models Repeat the Same Mistake?

Three systems can agree on a business description and still be leaning on the same crooked signpost: an ambiguous name, an old listing, or a category copied across platforms.

In a composite scenario, the same restaurant query was put to several AI search systems. Each returned the same group name. Each placed the recommended venue in the neighbouring province rather than Bangkok. Two answers linked to a booking page for the provincial branch; another displayed the Bangkok branch’s map listing beside a sentence describing the provincial location.

The scenario is assembled from recurring patterns around Study object B: a regional restaurant group with inconsistent branch names across maps, booking platforms, directories, and social pages. A similarly named venue also occupies the same discovery category. The systems reached matching conclusions, but the visible evidence beneath those conclusions was neither uniform nor clean.

Agreement feels like corroboration

People routinely use repetition as a test of reliability. One uncertain statement is checked against another source. When both say the same thing, confidence rises. That habit works reasonably well when the sources are independent, clearly identified, and capable of supporting the claim.

Generated answers complicate the arrangement. Two models may display different prose while drawing from overlapping pages, replicated directory records, or the same platform-generated category. Even where the systems have different retrieval components, the public material available to them may contain one widely repeated mistake. Agreement can therefore be real at the answer level while remaining weak as confirmation.

The laboratory treats cross-model agreement as an observation of matching outputs, not as confirmation of independent evidence or correct entity identification.

The distinction matters because “several models said it” is often used as though it were a source check. It is not. The laboratory records the models, prompts, languages, observation dates, visible citations, and relevant conditions. It then examines which parts of the answers agree and whether the displayed material supports those parts.
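The record described above can be sketched as a small data structure. This is a minimal illustration, not the laboratory's actual tooling: the `Observation` class, its field names, the `shared_claims` helper, the example URLs, and the dates are all hypothetical. The helper reports only which stated values match across answers; it makes no judgement about whether any displayed source supports them.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Observation:
    """One recorded answer from one system (all field names are illustrative)."""
    model: str
    prompt: str
    language: str
    observed_on: date
    claims: dict[str, str]            # claim type -> value stated in the answer
    citations: tuple[str, ...] = ()   # URLs visibly displayed with the answer
    conditions: str = ""              # any other relevant conditions, free text


def shared_claims(observations: list[Observation]) -> dict[str, str]:
    """Return the claim types on which every observation states the same value.

    This describes matching outputs only. It does not check sources.
    """
    if not observations:
        return {}
    first = observations[0].claims
    return {
        key: value
        for key, value in first.items()
        if all(obs.claims.get(key) == value for obs in observations[1:])
    }


# Hypothetical runs modelled on the composite restaurant scenario.
PROMPT = "restaurant recommendation query"
runs = [
    Observation("system-a", PROMPT, "en", date(2024, 1, 10),
                {"entity": "restaurant group", "location": "neighbouring province",
                 "kind": "branch"},
                ("https://booking.example/provincial-branch",)),
    Observation("system-b", PROMPT, "en", date(2024, 1, 10),
                {"entity": "restaurant group", "location": "neighbouring province",
                 "kind": "separate company"},
                ("https://booking.example/provincial-branch",)),
    Observation("system-c", PROMPT, "en", date(2024, 1, 12),
                {"entity": "restaurant group", "location": "neighbouring province",
                 "kind": "venue inside a hotel"},
                ("https://maps.example/bangkok-branch",)),
]

agreed = shared_claims(runs)
print(agreed)
```

In this made-up data the systems agree on entity and location but not on what kind of business it is, and the third answer's citation points at a different branch from the one its location claim describes. Those are exactly the details a bare "the models agreed" would hide.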

Sometimes agreement is broad. Several systems identify the same business, category, and location. Sometimes it is oddly selective. They may choose the same entity while disagreeing about whether it is a branch, a separate company, or a venue inside a hotel. One may provide citations; another may offer none. A third may cite a page that identifies the business correctly but cannot carry the geographic claim attached to it.

This unevenness is useful. It prevents the phrase “the models agreed” from flattening several different claim-source relationships into a single verdict.

The public web can manufacture consensus

The composite restaurant group has a branch page whose English title omits the province. A booking platform adds a geographic label automatically. A directory repeats the same label but uses a photograph from the Bangkok location. The group’s social page refers to both branches under one English name and posts temporary opening information without consistently identifying the location.

None of these pages needs to contain a spectacular error. The problem accumulates through small joins. A title loses a branch marker. A platform supplies a category. A directory imports an address and an image from different records. By the time an AI system assembles an answer, the mistaken identity already has several public surfaces.

Several models may then repeat the same mistake because the ambiguity exists before generation begins. The answer is the final room in the house, not necessarily the place where the crooked wall was built.

This explanation remains provisional unless several parts of the observation support it. The laboratory cannot see every source consulted internally, and apparent retrieval paths are reconstructions rather than confirmed system traces. Still, the visible pages can show that a shared error is compatible with the public evidence environment. If the same wrong province appears in directories, booking records, snippets, and branch titles, matching model outputs no longer need a mysterious explanation.

The reverse situation is revealing. Suppose the public sources distinguish the branches clearly, yet several systems still merge them. The team would look more closely at query interpretation, transliteration, category wording, and the relationship between the cited pages and the claims produced. Agreement remains observable, but the likely source of it becomes less obvious.

A consensus among models may therefore accompany a well-supported fact, a replicated public error, a shared ambiguity, or separate mistakes that happen to converge. The outputs alone cannot tell those cases apart.

One shared claim, four source relationships

The laboratory reads matching answers claim by claim using the Four Source Relationships typology. The classification prevents cross-model agreement from becoming a shortcut around source inspection.

Direct support occurs when the visible source supports the claim as stated. If several systems place the restaurant in the neighbouring province and each displays the correct branch page with that address, the location claim may be directly supported even if the original requester meant the Bangkok branch. The remaining problem would be entity identification rather than source support.

Stretched support appears when the cited material carries only a narrower fact. A group page may confirm that the restaurant operates in both locations, while the answer uses it to assign the provincial address to the group generally. The source is relevant. It simply does not support that broad formulation.

Borrowed identity becomes central when one branch or similarly named venue supplies attributes for another. A photograph, reputation claim, opening status, category, or address may cross the entity boundary. Several models can repeat the transfer when the same mismatched listing is prominent or duplicated in the visible public material.

Unsupported arrival covers a claim that none of the visible sources in the observation supports. This often appears in recommendation language. Three systems may call a venue “the leading local choice” or “particularly popular with residents” while displaying sources that provide menus, addresses, and booking details only. Repetition does not manufacture support.

These classifications are qualitative. They do not rank one model above another or convert agreement into a score. Their purpose is narrower: to describe how each shared claim relates to the material shown beside it.

The result can be untidy. One answer may identify the correct restaurant through direct support, stretch the source for its category, borrow a branch address, and add unsupported praise in the same paragraph. A second model may produce a shorter answer containing only two of those three failures. Saying that both are “wrong” loses the pattern; saying that they agree loses even more.
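A claim-by-claim reading of this kind can be written down without turning it into a score. The sketch below is illustrative only: the enum member names, the claim keys, and the `compare_readings` helper are assumptions made for the example, and the labels assigned to each answer are invented to mirror the untidy case just described.

```python
from enum import Enum
from typing import Optional


class SourceRelationship(Enum):
    """The four relationships between a claim and the material shown beside it."""
    DIRECT_SUPPORT = "direct support"
    STRETCHED_SUPPORT = "stretched support"
    BORROWED_IDENTITY = "borrowed identity"
    UNSUPPORTED_ARRIVAL = "unsupported arrival"


Reading = dict[str, SourceRelationship]

# Hypothetical readings: one long answer, one shorter answer.
answer_one: Reading = {
    "entity": SourceRelationship.DIRECT_SUPPORT,
    "category": SourceRelationship.STRETCHED_SUPPORT,
    "address": SourceRelationship.BORROWED_IDENTITY,
    "praise": SourceRelationship.UNSUPPORTED_ARRIVAL,
}
answer_two: Reading = {
    "entity": SourceRelationship.DIRECT_SUPPORT,
    "address": SourceRelationship.BORROWED_IDENTITY,
    "praise": SourceRelationship.UNSUPPORTED_ARRIVAL,
}


def compare_readings(
    a: Reading, b: Reading
) -> dict[str, tuple[Optional[SourceRelationship], Optional[SourceRelationship]]]:
    """Pair the relationship recorded for each claim in two answers.

    A value of None means that answer did not make the claim at all.
    No ranking or numeric score is produced.
    """
    return {claim: (a.get(claim), b.get(claim)) for claim in sorted(set(a) | set(b))}


comparison = compare_readings(answer_one, answer_two)
for claim, (first, second) in comparison.items():
    print(claim, "|", first.value if first else "not claimed",
          "|", second.value if second else "not claimed")
```

Keeping the output as paired labels rather than a count preserves the pattern: the two answers share a borrowed address and unsupported praise, while only the first stretches a source for the category.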

Shared mistakes can locate the weak seam

Although agreement is not confirmation, it can still be diagnostically useful. When several systems repeatedly attach the same wrong attribute to an entity, the laboratory gains a clearer place to inspect.

Study object A provides a second composite example. It is an independent Bangkok wellness clinic whose Thai name has several English transliterations and resembles the name of a larger medical facility. Its own website describes treatments. Map and directory pages apply broader categories inconsistently.

Imagine that several systems call it a hospital. One cites the clinic’s service page, another displays a directory category, and a third names no source. The shared hospital label points toward a weak seam in the entity’s public representation. It may involve category inflation, transliteration overlap, or a mistaken link to the larger facility.

The models have not proven the clinic is a hospital. They have shown that the hospital version can be reconstructed in several generated environments under the recorded conditions.

That distinction gives the business owner something more useful than a mention count. The repeated error suggests where public descriptions may fail to maintain a boundary. The clinic may need clearer English naming, a consistent category across controlled profiles, or stronger separation from the similarly named facility. Those are practical hypotheses, not automatic prescriptions. The laboratory would still inspect which sources recur and which claims they support.

Cross-model comparison can also reveal that the weakness lies outside the business’s own site. A company may describe itself precisely while directories, maps, and booking platforms preserve an older or broader identity. Generated answers can expose this disagreement by placing fragments from several sources into one compact description. The compression makes the contradiction visible, though it may also conceal where each fragment came from.

There is a rough irony here. The more fluent the shared answer sounds, the easier it is to mistake convergence for verification.

Independence is difficult to establish

The laboratory does not assume that models are independent witnesses. Their training material may overlap. Their retrieval systems may surface similar pages. Public directories can syndicate one record across many domains, creating the appearance of multiple sources where there is only one underlying entry.

Visible citations do not solve the problem completely. Two answers can show different URLs that repeat the same imported business record. Another system may display no citations while still producing an identical claim. Without access to private retrieval infrastructure and hidden intermediate steps, the laboratory cannot establish full independence.
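One partial check on visible citations is to group them by the business record they display rather than by URL. The sketch below assumes each cited page has already been reduced by hand to a name and address; the `record_fingerprint` and `group_by_record` helpers, the normalisation rules, and the example data are all hypothetical. The normalisation is deliberately crude and is not tuned for Thai script or transliteration variants.

```python
import re
from collections import defaultdict


def record_fingerprint(name: str, address: str) -> tuple[str, str]:
    """Crude normalisation: lowercase, drop punctuation, collapse whitespace."""
    def norm(text: str) -> str:
        text = re.sub(r"[^\w\s]", "", text.lower())
        return " ".join(text.split())
    return norm(name), norm(address)


def group_by_record(citations: list[dict[str, str]]) -> dict[tuple[str, str], list[str]]:
    """Group cited URLs by the fingerprint of the record each one displays."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for citation in citations:
        key = record_fingerprint(citation["name"], citation["address"])
        groups[key].append(citation["url"])
    return dict(groups)


# Hypothetical citations: three URLs, two of which repeat one imported record.
cited = [
    {"url": "https://directory-one.example/listing/1",
     "name": "Example Restaurant, Riverside", "address": "12 Example Rd."},
    {"url": "https://directory-two.example/biz/abc",
     "name": "example restaurant riverside", "address": "12  Example Rd"},
    {"url": "https://booking.example/venue/9",
     "name": "Example Restaurant", "address": "88 Sample St."},
]

grouped = group_by_record(cited)
for record, urls in grouped.items():
    print(record, "->", len(urls), "URL(s)")
```

Matching fingerprints suggest, but do not prove, a single underlying entry. Different fingerprints do not establish independence either, since the same record can be reworded as it is syndicated. The grouping narrows what to inspect by hand and nothing more.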

Prompt design introduces another complication. A query that names a disputed province or category may steer several systems toward the same mistaken entity. Agreement in that case partly reflects the prompt conditions. The observation remains valid, but its interpretation changes.

Timing matters as well. Cross-model runs conducted on different observation dates may encounter changed listings, updated indexes, or altered system behaviour. The team records those dates rather than treating the outputs as though they occurred in one frozen environment.

The laboratory is also careful with absence. If one model avoids the shared mistake, that does not automatically make it the reliable one. It may have selected another wrong entity, declined to answer, or produced an uncited claim that happens to match the correct location. Accuracy still requires claim-level checking.

What the comparison can support

A well-preserved cross-model comparison can establish that several systems returned the same entity, category, location, or attribution under described conditions. It can show whether their visible sources overlap, conflict, or fail to support the shared claim. It can also identify public ambiguities compatible with the repeated result.

The method cannot prove that the systems used the same internal route. It cannot reveal hidden ranking logic or every source consulted. Nor can a small set of observations justify a general rate for all prompts, users, or future model versions.

When several explanations fit, the laboratory leaves them open. A shared mistake may be compatible with replicated public information, similar query interpretation, overlapping retrieval, common training material, or some combination. The record can narrow the plausible field without pretending to see behind the system wall.

Cross-model agreement becomes valuable once it is demoted from verdict to evidence. It shows that a mistaken version of the business is stable enough to appear in more than one generated environment. For a reader checking the claim, that is the beginning of the investigation, not the end.