OpenAI recently updated GPT-5.4 with a feature that has the tech world buzzing: Extended Thinking mode. While the base model is already lightning-fast, this new toggle allows the AI to “ruminate”—running internal simulations and self-correcting before it ever types a single word of its response.

The result is a staggering 94% success rate on the ARC-AGI-1 reasoning benchmark, finally surpassing the 92.8% score held by human experts in the same category.

So if you’re still using ChatGPT for simple summaries, you’re essentially driving a Ferrari to the grocery store. That said, even on a Plus plan, usage limits are tied to how complex your prompts are. Heavy tasks like large code audits can hit system limits quickly, and in some cases, responses may fall back to faster, less capable models.


Hardest Logic Puzzle Ever. Before providing the answer, show your work and explain where other models typically fail this specific prompt.”


The results: We’ve all seen AI fail the “Strawberry” test or the “Three Gods” riddle. Those days are over. Using its “Mid-Response Course Correction,” GPT-5.4 recognized when it was heading toward a wrong answer and pivoted mid-thought. What’s interesting is not just the answer but the metacognition: the model didn’t guess. It identified the “Key Lemma” (the logical tool used to solve the problem), essentially building its own “logic translation layer” to crack the puzzle.
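For context, the “Strawberry” test simply asks a model how many times the letter “r” appears in the word — something trivial to verify in code. Here is a minimal Python ground-truth check (this is just verification on our end, not anything the model runs internally):

```python
# Ground truth for the classic "Strawberry" test:
# count how many times "r" appears in "strawberry".
word = "strawberry"
r_count = word.count("r")
print(f'"{word}" contains {r_count} occurrences of "r"')  # 3
```

Models historically stumbled here because they process tokens rather than individual letters, not because the arithmetic is hard.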

USPTO patent database for the last two years. List any existing patents that might represent ‘Prior Art’ and explain the legal risk level.”

The results: Planning a new invention? GPT-5.4 can tell you if someone beat you to it. Here, the model used its 1-million token context window to “read” an entire database of patent descriptions in one go, spotting overlaps in abstract concepts.

As someone with plenty of ideas, I have used this prompt before to make sure my ideas are uniquely mine, with no overlap with what’s already on the market. This prompt is worth filing away if you are a big thinker or solopreneur. Will I be launching my cupcake laundry basket prototype anytime soon? Probably not, but it’s good to know I could.
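To make the long-context idea concrete, here is a hedged Python sketch of how you might pack a corpus of patent abstracts into a single prior-art prompt before sending it to a long-context model. The function name, sample abstracts, and character budget are all illustrative assumptions, not a real USPTO integration, and the actual API call is deliberately omitted:

```python
# Hypothetical sketch: assemble one long prior-art prompt from many
# patent abstracts. The corpus, budget, and helper name are
# illustrative; a real workflow would pull abstracts from USPTO data.

def build_prior_art_prompt(invention: str, abstracts: list[str],
                           char_budget: int = 3_000_000) -> str:
    """Build a single prompt: the invention description followed by
    as many numbered patent abstracts as fit within the budget."""
    header = (
        "You are a patent analyst. Compare my invention against the "
        "patent abstracts below. List any that might represent prior "
        "art and explain the legal risk level.\n\n"
        f"Invention: {invention}\n\nPatent abstracts:\n"
    )
    parts, used = [header], len(header)
    for i, abstract in enumerate(abstracts, start=1):
        entry = f"[{i}] {abstract}\n"
        if used + len(entry) > char_budget:
            break  # stop before overflowing the context window
        parts.append(entry)
        used += len(entry)
    return "".join(parts)

prompt = build_prior_art_prompt(
    "A laundry basket shaped like a cupcake",
    ["A decorative container resembling a baked good.",
     "A collapsible fabric hamper with a rigid rim."],
)
# The assembled prompt would then go to whatever long-context chat
# API you use; that call is omitted here.
```

The point of the budget check is simply that even a 1-million-token window is finite, so any real pipeline has to decide what to drop when the corpus doesn’t fit.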

While Gemini 3.1 Pro is the king of “doing” (agentic automation), GPT-5.4 is the undisputed king of “thinking.” Its new Extended Thinking mode allows it to pause and simulate outcomes before responding, resulting in a 94% ARC-AGI-1 reasoning score that edges out human experts.

It is slower and significantly more expensive than the standard model, but for high-importance tasks like cybersecurity auditing, legal cross-referencing or complex coding, GPT-5.4 Thinking is currently the most capable “brain” on the planet. The chatbot is dead. Long live the reasoning engine. Give it a try and let me know what you think in the comments.

Follow Tom’s Guide on Google News and add us as a preferred source to get our up-to-date news, analysis, and reviews in your feeds.