
A Simple Arithmetic Puzzle That Even GPT-4.5 With Chain-of-Thought Gets Wrong

Mar 27, 2025

Have I stumbled upon an upgraded version of Chain-of-Thought?

While testing large models day to day, I noticed something interesting. A seemingly simple question, "一只小船能载 2 人,一共有 7 人要过河,问要来回几次才能把所有人都送到对岸?" ("A small boat can carry 2 people, and 7 people in total need to cross a river. How many trips back and forth does it take to get everyone to the other side?"), stumps the general-purpose models even with the CoT hint "Think Step by Step": GPT-4o, GPT-4.5, and Claude 3.5 Sonnet all get it wrong, and Claude 3 Opus, Gemini 2.0 Flash, and Gemini 2.0 fail as well.

DeepSeek V3 and Grok 3 (without Think mode), along with every reasoning model I tried, such as o1, answer it correctly.

But if I change the CoT hint for the general-purpose models to "Think step by step and write the detail analysis", GPT-4.5 and Claude 3.5 Sonnet get it right.

The strange part: GPT-4o is still wrong!

This completely caught me off guard. These days even GPT-4o, let alone GPT-4.5, can answer most of the reasoning questions I used to test models with, such as finding a meeting time that works for everyone, or even harder ones, without any CoT hint at all. So why does a question that looks this simple become a stumbling block?

So I ran a unified test across every model I had on hand and compiled a full comparison table and conclusions. This post walks through what I found.

If you prefer to read the English write-up directly, start from the section below, or download the PDF from my Zenodo record: 10.5281/zenodo.15092746


Prompt Specificity in River-Crossing Logic: A Comparative Study of General-Purpose vs. Reasoning-Optimized LLMs

Abstract

This paper investigates how different prompting strategies affect large language models (LLMs) when solving a seemingly straightforward puzzle: transporting 7 people across a river in a 2-person boat. We compare "general-purpose" models (GPT-4o, GPT-4.5, Claude 3.5 Sonnet, Gemini 2.0 Flash, DeepSeek V3, Grok 3) and "reasoning-optimized" models (GPT o1, GPT o3-mini-high, Claude 3.7 Sonnet Extended Thinking, Gemini 2.5 Pro, Gemini 2.0 Flash Thinking, DeepSeek R1, Grok 3 Thinking) under three prompt conditions:

  1. No Chain-of-Thought (CoT),
  2. A simple CoT hint ("Think step by step"),
  3. A more detailed CoT prompt ("Think step by step and write the detail analysis").

We observe that some general-purpose models fail unless given a very explicit prompt—sometimes contradicting expectations of their advanced reasoning abilities—whereas others (including newly updated "general" versions) succeed even without CoT. Notably, GPT-4o, an ostensibly reliable GPT-4 variant, consistently fails under all prompt types. Meanwhile, all tested reasoning-optimized models solve the puzzle correctly in every scenario. We discuss how these findings reveal surprising vulnerabilities in older or baseline models, question whether prompt hints merely trigger existing logic modules, and highlight how "general vs. reasoning" boundaries are increasingly blurred.


1. Introduction

Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks requiring knowledge retrieval, text generation, and limited forms of logical reasoning. However, small yet precise logic puzzles—like the classic "boat can carry two, 7 people crossing" problem—can expose unexpected weaknesses. One particularly striking example is the consistent failure of GPT-4o, despite GPT-4 being widely considered "relatively reliable" in logical tasks. This highlights the reality that some older or baseline LLM versions still struggle with seemingly trivial puzzles, even though more advanced or newly fine-tuned models handle them with ease.

To explore why such failures occur, we investigate:

  1. How do standard vs. detailed Chain-of-Thought (CoT) prompts affect puzzle-solving accuracy in LLMs?
  2. Why do certain older variants (e.g., GPT-4o) fail systematically, while many reasoning-optimized or newly updated "general" models succeed?
  3. Does "Think step by step" truly switch on an internal logic module, or is it insufficient for some LLMs without an even more elaborate directive?

By comparing both "general-purpose" and "reasoning-optimized" models under three prompt conditions (no CoT, simple CoT, and detailed CoT), we shed light on the interplay of prompt specificity and model architecture. In doing so, we illustrate a potentially "shocking" fact: some older or established models can repeatedly fail a basic puzzle—a cautionary reminder that vendor or version labels do not guarantee robust reasoning for every scenario.


2. Related Work

Chain-of-Thought prompting (CoT) has been shown to improve multi-step reasoning by prompting the model to "show its work" [1,2]. Meanwhile, commercial LLM deployments often introduce specialized "reasoning" or "extended thinking" variants, further complicating the landscape [3,4]. Although prior research highlights that the same model can perform significantly better or worse under slightly different prompts [1,2], less attention has been paid to consistently failing older or baseline versions that do not respond well to any form of CoT.

Our study contributes by:

  • Demonstrating that certain well-known older variants (GPT-4o) can exhibit surprising failures,
  • Systematically comparing "general" vs. "reasoning" solutions on a single puzzle with minimal vs. detailed CoT,
  • Discussing how the puzzle's language ambiguity ("来回几次?") influences numeric answers, further stressing the importance of clear puzzle definitions.

3. Methodology

3.1 Puzzle Description

We use the puzzle:

"A boat can carry exactly 2 people at a time. There are 7 people who need to cross a river. How many trips back and forth does it take to get everyone across?"

Key constraints:

  • The boat cannot move by itself; at least 1 person must row.
  • A minimal solution typically yields 11 single crossings (or 5 complete round trips + 1 final one-way); see the simulation sketch after this list.
  • The Chinese phrasing "来回几次" can cause confusion, producing answers from 5 to 13.
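
To make the arithmetic concrete, the following is a minimal Python sketch (not part of the study's materials) that simulates the standard strategy of ferrying two people over and sending one back, counting every one-way crossing:

```python
def single_crossings(n: int) -> int:
    """One-way crossings needed to ferry n people (n >= 2) across a river
    in a 2-person boat, using the standard strategy: 2 cross, 1 rows back."""
    remaining = n    # people still on the starting side
    crossings = 0
    while remaining > 2:
        crossings += 2   # two row across, one rows the boat back
        remaining -= 1   # net gain of one person on the far side
    return crossings + 1  # the last two cross together; no return needed

print(single_crossings(7))             # -> 11 one-way crossings
print((single_crossings(7) - 1) // 2)  # -> 5 complete round trips before the final leg
```

Counting round trips instead of single crossings is exactly where the ambiguity discussed in Section 5.4 comes from.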

3.2 Prompt Types

Each model was tested under three prompt conditions:

  1. No CoT — No special instructions, just the puzzle question.
  2. Simple CoT — Appended "Think step by step."
  3. Detailed CoT — Appended "Think step by step and write the detail analysis."

3.3 Models: General vs. Reasoning

We group models into:

  • General-Purpose: GPT-4o, GPT-4.5, Claude 3.5 Sonnet, Gemini 2.0 Flash, DeepSeek V3, Grok 3
  • Reasoning-Optimized: GPT o1, GPT o3-mini-high, Claude 3.7 Sonnet Extended Thinking, Gemini 2.5 Pro, Gemini 2.0 Flash Thinking, DeepSeek R1, Grok 3 Thinking

Notable Surprise: GPT-4o, historically aligned with GPT-4’s robust reputation, performed poorly here. Meanwhile, some "general" versions (DeepSeek V3) show strong built-in logic, emphasizing that these category labels are no longer reliable proxies for actual performance.

3.4 Data Collection

  • Test Setup: Each model was queried in an isolated session, avoiding prior context contamination.
  • Evaluation: We labeled an answer "Correct" if it either yields the minimal crossing count (11 single trips) or clearly explains the 5–6 round-trip logic (a grading sketch follows this list).
  • Repeated Trials: If a model gave inconsistent answers across attempts, we recorded the first stable or majority result.
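
The study does not publish its grading scripts; as an illustration, here is a minimal sketch of how the "Correct" label described above could be applied automatically. The helper name query_model and the keyword heuristic are assumptions, not the authors' actual rubric:

```python
import re

PUZZLE = ("A boat can carry exactly 2 people at a time. There are 7 people "
          "who need to cross a river. How many trips back and forth does it "
          "take to get everyone across?")

def is_correct(answer: str) -> bool:
    """Hypothetical heuristic: accept an answer that reports 11 single
    crossings, or that spells out 5 round trips plus a final one-way leg."""
    text = answer.lower()
    if re.search(r"\b11\b", text):
        return True
    five_round_trips = bool(re.search(r"\b5\b", text)) and "round" in text
    final_leg = any(word in text for word in ("final", "last", "one-way"))
    return five_round_trips and final_leg

# Usage with a hypothetical query_model() helper, one fresh session per prompt:
# for suffix in ("", " Think step by step.",
#                " Think step by step and write the detail analysis"):
#     print(repr(suffix), is_correct(query_model("gpt-4.5", PUZZLE + suffix)))
```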

4. Results

Table 1 summarizes correctness under no CoT, simple CoT, and detailed CoT prompts.

Table 1. Performance Across Prompts

| Model | Type | No CoT | Simple CoT | Detailed CoT | Comment |
|---|---|---|---|---|---|
| GPT-4o | General | Fail | Fail | Fail | Unexpectedly fails in all conditions—see Sec. 5.1 |
| GPT-4.5 | General | Fail | Fail | Correct | Needs highly explicit "detailed analysis" |
| Claude 3.5 Sonnet | General | Fail | Fail | Correct | Similar pattern to GPT-4.5 |
| Gemini 2.0 Flash | General | Fail | Fail | Fail | All tested prompts yield incorrect answers |
| DeepSeek V3 | General | Correct | Correct | Correct | Built-in stepwise logic, no special CoT needed |
| Grok 3 | General | Correct | Correct | Correct | Also robust without explicit CoT |
| GPT o1 | Reasoning | Correct | Not tested | Not tested | Provided direct formula (2n−3 = 11) |
| GPT o3-mini-high | Reasoning | Correct | Not tested | Not tested | Also used the minimal crossing formula |
| Claude 3.7 Ext Thinking | Reasoning | Correct | Not tested | Not tested | Detailed logic chain yields 11 |
| Gemini 2.5 Pro | Reasoning | Correct | Not tested | Not tested | Enumerates 11 one-way trips consistently |
| Gemini 2.0 Flash Thinking | Reasoning | Correct | Not tested | Not tested | Some numeric confusion but final answer correct |
| DeepSeek R1 | Reasoning | Correct | Not tested | Not tested | Clarifies 5 round-trips vs. 11 single-crossings |
| Grok 3 Thinking | Reasoning | Correct | Not tested | Not tested | Step-by-step approach, final answer 11 |
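
The closed-form count that GPT o1 reports in Table 1 follows directly from the crossing strategy in Section 3.1: each of the first n − 2 passengers costs a full round trip (2 one-way crossings), and the last 2 cross together with no return, giving

2(n − 2) + 1 = 2n − 3,

so for n = 7 the minimal number of one-way crossings is 2 × 7 − 3 = 11.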

5. Discussion

5.1 GPT-4o’s Persistent Failure

Given GPT-4’s reputation for advanced reasoning, one might expect GPT-4o—an older or baseline variant—to at least succeed under a detailed CoT prompt. Surprisingly, it repeatedly fails, often providing "9" or "13" as final answers. Possible factors include:

  • Incomplete version or training data: GPT-4o may not incorporate the same updates/fine-tuning as GPT-4.5.
  • Semantic confusion in "来回几次": GPT-4o may systematically misinterpret "round trip" vs. single crossing.
  • Fixed flawed internal heuristic: It might rely on a baked-in formula or short-circuited logic not easily overridden by user prompts.

This phenomenon challenges the assumption that "triggering a logic chain" (via CoT) is always enough; if the model’s internal representation or heuristic is off, even explicit step-by-step instructions might not correct it.

5.2 Prompt Specificity vs. Internal Logic

The fact that GPT-4.5 and Claude 3.5 remain incorrect under "Think step by step" yet become correct with "Think step by step and write the detail analysis" raises questions:

  • Is a short CoT hint insufficient to override default shortcuts?
  • Does a more elaborate prompt forcibly "uncork" a deeper reasoning process?

Such outcomes indicate that prompt engineering is not merely about enabling any chain of thought, but about ensuring the model invests in verifying each step. Meanwhile, updated "general" models like DeepSeek V3 or Grok 3 and all "reasoning-optimized" models appear to do so automatically.

5.3 "Some Older Models Consistently Fail on Simple Tasks"

Our findings highlight a broader reality in the LLM ecosystem:

"Some established or older LLM variants still fail at seemingly trivial logic."

This has major implications:

  1. User Trust: Non-technical users might assume a big-name model is infallible; these results show that baseline or older versions can disappoint on tasks requiring precise counting.
  2. Upgrades Necessity: Vendors must continually patch or re-fine-tune older releases to keep them competitive, especially on logic tasks that appear "too simple" to remain broken.
  3. Research & Testing: Even "classic" problems can be revealing. The mismatch between marketing and actual capability underscores why thorough tests on simple puzzles still matter.

5.4 Language Ambiguity and Multiple Answers

Many answers included "5," "6," or "9," sometimes referencing the difference between a full round trip and single crossings. The puzzle’s wording "来回几次" can be read as "how many round trips?" or "how many one-way trips total?" Models that enumerated every crossing typically arrived at 11, while others only counted complete back-and-forth cycles. This discrepancy is further complicated by the final crossing not needing a return trip.

Hence, puzzle specification is crucial: If the question clarifies "count each single crossing," or "count only how many times the boat leaves side A," ambiguity is minimized.
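
As a concrete illustration of how the three readings diverge, the following sketch (a toy example, not drawn from any model's output) counts the same minimal 11-leg schedule under each convention:

```python
# The minimal schedule as a list of one-way legs:
# "over" = boat leaves side A, "back" = boat returns to side A.
schedule = ["over", "back"] * 5 + ["over"]   # 5 round trips + final crossing

single_crossings = len(schedule)            # 11: every one-way leg counted
departures_from_a = schedule.count("over")  # 6: times the boat leaves side A
full_round_trips = schedule.count("back")   # 5: only completed back-and-forth cycles

print(single_crossings, departures_from_a, full_round_trips)  # 11 6 5
```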


6. Conclusion and Future Work

We presented a comparative study on a classic 2-person boat puzzle across multiple LLMs and prompt strategies. Key observations:

  1. Older or baseline models (GPT-4o) can shockingly fail even under explicit CoT prompts. This calls into question the assumption that "older GPT-4 is automatically strong in logic."
  2. Prompt specificity is highly influential. Some LLMs need fully spelled-out instructions ("write the detail analysis") to correct prior mistakes.
  3. Reasoning-optimized models prove robust in all tested conditions, but certain updated "general" models also exhibit strong "implicit" CoT.
  4. Language ambiguity significantly impacts numeric answers, underscoring the importance of puzzle clarity.

Future directions include:

  • Expanding to additional micro-puzzles (e.g., "Wolf-Goat-Cabbage," "Missionaries and Cannibals") and testing how older vs. updated versions of popular LLMs fare.
  • Investigating internal heuristics behind an LLM’s repeated miscalculation—could it be overwritten with dynamic CoT instructions or require further fine-tuning?
  • Large-scale prompt sensitivity studies focusing on multiple languages and more nuanced instructions to see how LLMs adapt or fail.

We hope our findings serve as a cautionary tale: even "big names" can falter on "simple" tasks, and subtle changes in prompt wording can be the difference between a correct solution and a baffling error.


References

[1] Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
[2] Kojima, T., et al. (2022). Large Language Models are Zero-Shot Reasoners.
[3] Anthropic, OpenAI, DeepMind, etc. (2023–2025). Public model release notes.
[4] Nye, M., et al. (2022). Show Your Work: Scratchpads for Intermediate Computation with Language Models.


Appendix: Sample Outputs

The screenshots listed in this appendix showcase examples of outputs from different models under three prompting strategies: No CoT, Simple CoT, and Detailed CoT. Unless otherwise specified, all images are taken from actual conversation results.

  • GPT-4o: No CoT, Simple CoT, Detailed CoT
  • GPT-4.5: No CoT, Simple CoT, Detailed CoT
  • Claude 3.5 Sonnet: No CoT, Simple CoT, Detailed CoT
  • DeepSeek V3: No CoT
  • Gemini 2.5 Pro: No CoT
