Pierson, B. Lawsuit claims UnitedHealth AI wrongfully denies elderly extended care. Reuters https://www.reuters.com/legal/lawsuit-claims-unitedhealth-ai-wrongfully-denies-elderly-extended-care-2023-11-14 (2023).

Schreiber, M. New AI tool counters health insurance denials decided by automated algorithms. The Guardian https://www.theguardian.com/us-news/2025/jan/25/health-insurers-ai (2025).

Au Yeung, J. et al. AI chatbots not yet ready for clinical use. Front. Digit. Health 5, 1161098 (2023).

Azaria, A., Azoulay, R. & Reches, S. ChatGPT is a remarkable tool — for experts. Data Intell. 6, 240–296 (2024).

Meyrowitsch, D. W., Jensen, A. K., Sørensen, J. B. & Varga, T. V. AI chatbots and (mis)information in public health: impact on vulnerable communities. Front. Public Health 11, 1226776 (2023).

Kim, J. H. et al. When ChatGPT gives incorrect answers: the impact of inaccurate information by generative AI on tourism decision-making. J. Travel Res. 64, 51–73 (2023).

Sartori, G. & Orrù, G. Language models and psychological sciences. Front. Psychol. 14, 1279317 (2023).

Dillion, D., Tandon, N., Gu, Y. & Gray, K. Can AI language models replace human participants? Trends Cogn. Sci. 27, 597–600 (2023).

Kaddour, J. et al. Challenges and applications of large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2307.10169 (2023).

Griewing, S. et al. Evolution of publicly available large language models for complex decision-making in breast cancer care. Arch. Gynecol. Obstet. 310, 537–550 (2024).

Jimenez, C. E. et al. SWE-bench: can language models resolve real-world GitHub issues? Preprint at arXiv https://doi.org/10.48550/arXiv.2310.06770 (2024).

Jusman, I. A., Ausat, A. M. A. & Sumarna, A. Application of ChatGPT in business management and strategic decision making. J. Minfo Polgan 12, 1688–1697 (2023).

Lee, P., Bubeck, S. & Petro, J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N. Engl. J. Med. 388, 1233–1239 (2023).

Basir, A., Puspitasari, E. D., Aristarini, C. C., Sulastri, P. D. & Ausat, A. M. A. Ethical use of ChatGPT in the context of leadership and strategic decisions. J. Minfo Polgan 12, 1239–1246 (2023).

Gloria, B., Melsbach, J., Bienert, S. & Schoder, D. Real-GPT: efficiently tailoring LLMs for informed decision-making in the real estate industry. J. Real Estate Portf. Manag. 31, 56–72 (2024).

Binz, M. & Schulz, E. Using cognitive psychology to understand GPT-3. Proc. Natl Acad. Sci. USA 120, e2218523120 (2023).

Demszky, D. et al. Using large language models in psychology. Nat. Rev. Psychol. 2, 688–701 (2023).


Hagendorff, T. Machine psychology: investigating emergent capabilities and behavior in large language models using psychological methods. Preprint at arXiv https://doi.org/10.48550/arXiv.2303.13988 (2023).

White, J. et al. A prompt pattern catalog to enhance prompt engineering with ChatGPT. Preprint at arXiv https://doi.org/10.48550/arXiv.2302.11382 (2023).

Abbate, F. Natural and artificial intelligence: a comparative analysis of cognitive aspects. Minds Mach. 33, 791–815 (2023).

Rich, A. S. & Gureckis, T. M. Lessons for artificial intelligence from the study of natural stupidity. Nat. Mach. Intell. 1, 174–180 (2019).

Shiffrin, R. & Mitchell, M. Probing the psychology of AI models. Proc. Natl Acad. Sci. USA 120, e2300963120 (2023).

Suri, G., Slater, L. R., Ziaee, A. & Nguyen, M. Do large language models show decision heuristics similar to humans? A case study using GPT-3.5. J. Exp. Psychol. Gen. 153, 1066–1075 (2024).

Qu, Y. et al. Promoting interactions between cognitive science and large language models. Innovation 5, 100579 (2024).

Thaler, R. H. Behavioral economics: past, present, and future. Am. Econ. Rev. 106, 1577–1600 (2016).

Kahneman, D. Maps of bounded rationality: psychology for behavioral economics. Am. Econ. Rev. 93, 1449–1475 (2003).

Evans, J. S. & Stanovich, K. E. Dual-process theories of higher cognition: advancing the debate. Persp. Psychol. Sci. 8, 223–241 (2013).

Evans, J. S. Dual-processing accounts of reasoning, judgment, and social cognition. Annu. Rev. Psychol. 59, 255–278 (2008).

Kahneman, D. & Frederick, S. in Heuristics and Biases: The Psychology of Intuitive Judgment (eds Gilovich, T. et al.) 49–81 (Cambridge Univ. Press, 2002).

Gigerenzer, G. & Brighton, H. Homo heuristicus: why biased minds make better inferences. Top. Cogn. Sci. 1, 107–143 (2009).

Kahneman, D. & Klein, G. Conditions for intuitive expertise: a failure to disagree. Am. Psychol. 64, 515–526 (2009).

Smolensky, P. Connectionist AI, symbolic AI, and the brain. Artif. Intell. Rev. 1, 95–109 (1987).

Goel, A. Looking back, looking ahead: symbolic versus connectionist AI. AI Mag. 42, 83–85 (2022).


Bellini-Leite, S. C. Dual process theory: embodied and predictive; symbolic and classical. Front. Psychol. 13, 805386 (2022).

Bellini-Leite, S. C. Dual process theory for large language models: an overview of using psychology to address hallucination and reliability issues. Adapt. Behav. 32, 329–343 (2023).

Clark, A. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behav. Brain Sci. 36, 181–204 (2013).

Newell, A. Physical symbol systems. Cogn. Sci. 4, 135–183 (1980).


Marcus, G. The next decade in AI: four steps towards robust artificial intelligence. Preprint at arXiv https://doi.org/10.48550/arXiv.2002.06177 (2020).

Hagendorff, T., Fabi, S. & Kosinski, M. Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT. Nat. Comput. Sci. 3, 833–838 (2023).

Ma, D., Zhang, T. & Saunders, M. Is ChatGPT humanly irrational? Preprint at Res. Sq. https://doi.org/10.21203/rs.3.rs-3220513/v1 (2023).

Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 35, 24824–24837 (2022).


Gigerenzer, G. & Gaissmaier, W. Heuristic decision making. Annu. Rev. Psychol. 62, 451–482 (2011).

Tversky, A. & Kahneman, D. Judgment under uncertainty: heuristics and biases. Science 185, 1124–1131 (1974).

Frederick, S. Cognitive reflection and decision making. J. Econ. Perspect. 19, 25–42 (2005).

Erickson, T. D. & Mattson, M. E. From words to meaning: a semantic illusion. J. Verbal Learn. Verbal Behav. 20, 540–551 (1981).

Toplak, M. E., West, R. F. & Stanovich, K. E. The cognitive reflection test as a predictor of performance on heuristics-and-biases tasks. Mem. Cogn. 39, 1275–1289 (2011).

Pennycook, G., Cheyne, J. A., Koehler, D. J. & Fugelsang, J. A. Is the cognitive reflection test a measure of both reflection and intuition? Behav. Res. Meth. 48, 341–348 (2016).

Chen, Y., Kirshner, S. N., Ovchinnikov, A., Andiappan, M. & Jenkin, T. A manager and an AI walk into a bar: does ChatGPT make biased decisions like we do? Manuf. Serv. Oper. Manag. 27, 339–678 (2025).

Jones, E. & Steinhardt, J. Capturing failures of large language models via human cognitive biases. Adv. Neural Inf. Process. Syst. 35, 11785–11799 (2022).


Pantana, G., Castello, M. & Torre, I. Examining cognitive biases in ChatGPT 3.5 and ChatGPT 4 through human evaluation and linguistic comparison. In Proc. 16th Conf. Assoc. Mach. Transl. Am. (eds Knowles, R., Eriguchi, A. & Goel, S.) 250–260 (AMTA, 2024).

Ryu, J., Kim, J. & Kim, J. A study on the representativeness heuristics problem in large language models. IEEE Access 12, 147958–147966 (2024).

Tversky, A. & Kahneman, D. The framing of decisions and the psychology of choice. Science 211, 453–458 (1981).

Nickerson, R. S. Confirmation bias: a ubiquitous phenomenon in many guises. Rev. Gen. Psychol. 2, 175–220 (1998).

Lou, J. & Sun, Y. Anchoring bias in large language models: an experimental study. Preprint at arXiv https://doi.org/10.48550/arXiv.2412.06593 (2024).

Talboy, A. N. & Fuller, E. Challenging the appearance of machine intelligence: cognitive bias in LLMs and best practices for adoption. Preprint at arXiv https://doi.org/10.48550/arXiv.2304.01358 (2023).

Azaria, A. ChatGPT: more human-like than computer-like, but not necessarily in a good way. In 2023 IEEE 35th Int. Conf. Tools Artif. Intell. (eds Esposito, A., Yang, M. & Cordasco, G.) 468–473 (IEEE, 2023).

Acerbi, A. & Stubbersfield, J. M. Large language models show human-like content biases in transmission chain experiments. Proc. Natl Acad. Sci. USA 120, e2313790120 (2023).

Schramowski, P., Turan, C., Andersen, N., Rothkopf, C. A. & Kersting, K. Large pre-trained language models contain human-like biases of what is right and wrong to do. Nat. Mach. Intell. 4, 258–268 (2022).

Gallegos, I. O. et al. Bias and fairness in large language models: a survey. Comput. Linguist. 50, 1097–1179 (2024).

Wang, P., Xiao, Z., Chen, H. & Oswald, F. L. Will the real Linda please stand up…to large language models? Examining the representativeness heuristic in LLMs. Preprint at arXiv https://doi.org/10.48550/arXiv.2404.01461 (2024).

Nguyen, J. K. Human bias in AI models? Anchoring effects and mitigation strategies in large language models. J. Behav. Exp. Finance 43, 100971 (2024).

Tversky, A. & Kahneman, D. Extensional versus intuitive reasoning: the conjunction fallacy in probability judgment. Psychol. Rev. 90, 293–315 (1983).

Tversky, A. & Kahneman, D. Availability: a heuristic for judging frequency and probability. Cogn. Psychol. 5, 207–232 (1973).

Ariely, D., Loewenstein, G. & Prelec, D. “Coherent arbitrariness”: stable demand curves without stable preferences. Q. J. Econ. 118, 73–106 (2003).

Newell, A. & Simon, H. A. Computer science as empirical inquiry: symbols and search. Commun. ACM 19, 113–126 (1976).

Yao, S. et al. Tree of thoughts: deliberate problem solving with large language models. Adv. Neural Inf. Process. Syst. 36, 11809–11822 (2023).


Du, M. Machine vs. human, who makes a better judgment on innovation? Take GPT-4 for example. Front. Artif. Intell. 6, 1206516 (2023).

Ziegler, D. M. et al. Fine-tuning language models from human preferences. Preprint at arXiv https://doi.org/10.48550/arXiv.1909.08593 (2019).

Ouyang, L. et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 35, 27730–27744 (2022).


Su, D., Sukhbaatar, S., Rabbat, M., Tian, Y. & Zheng, Q. Dualformer: controllable fast and slow thinking by learning with randomized reasoning traces. Preprint at arXiv https://doi.org/10.48550/arXiv.2410.09918 (2025).

Shang, Y., Li, Y., Xu, F. & Li, Y. DefInt: a default-interventionist framework for efficient reasoning with hybrid large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2402.02563 (2024).

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large language models are zero-shot reasoners. Adv. Neural Inf. Process. Syst. 35, 22199–22213 (2022).


Weston, J. & Sukhbaatar, S. System 2 Attention (is something you might need too). Preprint at arXiv https://doi.org/10.48550/arXiv.2311.11829 (2023).

Zhang, H., Huang, J., Li, Z., Naik, M. & Xing, E. Improved logical reasoning of language models via differentiable symbolic programming. In Find. Assoc. Comput. Linguist. (eds Rogers, A., Boyd-Graber, J. & Okazaki, N.) 3062–3077 (ACL, 2023).

Zhu, X. et al. Solving math word problems via cooperative reasoning induced language models. In Proc. 61st Ann. Meet. Assoc. Comput. Linguist. (eds Rogers, A., Boyd-Graber, J. & Okazaki, N.) 4471–4485 (ACL, 2023).

Raoelison, M., Thompson, V. A. & De Neys, W. The smart intuitor: cognitive capacity predicts intuitive rather than deliberate thinking. Cognition 204, 104381 (2020).

De Neys, W. & Pennycook, G. Logic, fast and slow: advances in dual-process theorizing. Curr. Dir. Psychol. Sci. 28, 503–509 (2019).

Reyna, V. F. & Brainerd, C. J. Numeracy, gist, literal thinking and the value of nothing in decision making. Nat. Rev. Psychol. 2, 421–439 (2023).

Bubeck, S. et al. Sparks of artificial general intelligence: early experiments with GPT-4. Preprint at arXiv https://doi.org/10.48550/arXiv.2303.12712 (2023).

Sclar, M., Choi, Y., Tsvetkov, Y. & Suhr, A. Quantifying language models’ sensitivity to spurious features in prompt design or: how I learned to start worrying about prompt formatting. Preprint at arXiv https://doi.org/10.48550/arXiv.2310.11324 (2023).

Pezeshkpour, P. & Hruschka, E. Large language models sensitivity to the order of options in multiple-choice questions. In Find. Assoc. Comput. Linguist. (eds Duh, K., Gomez, H. & Bethard, S.) 2006–2017 (ACL, 2024).

Loya, M., Sinha, D. A. & Futrell, R. Exploring the sensitivity of LLMs’ decision-making capabilities: insights from prompt variation and hyperparameters. In Find. Assoc. Comput. Linguist. (eds Bouamor, H., Pino, J. & Bali, K.) 3711–3716 (ACL, 2023).

Barez, F. et al. Chain-of-thought is not explainability. Preprint at alphaXiv https://www.alphaxiv.org/overview/2025.02v3 (2025).

Zhang, Z. et al. Multimodal chain-of-thought reasoning in language models. Trans. Mach. Learn. Res. https://openreview.net/forum?id=y1pPWFVfvR (2024).

Chakraborty, N., Ornik, M. & Driggs-Campbell, K. Hallucination detection in foundation models for decision-making: a flexible definition and review of the state of the art. ACM Comput. Surv. 57, 188:1–188:35 (2025).

Stella, M., Hills, T. T. & Kenett, Y. N. Using cognitive psychology to understand GPT-like models needs to extend beyond human biases. Proc. Natl Acad. Sci. USA 120, e2312911120 (2023).

Huang, L. et al. A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 43, 42 (2025).

Smith, A. L., Greaves, F. & Panch, T. Hallucination or confabulation? Neuroanatomy as metaphor in large language models. PLoS Digital Health 2, e0000388 (2023).

Singh, A. K., Lamichhane, B., Devkota, S., Dhakal, U. & Dhakal, C. Do large language models show human-like biases? Exploring confidence–competence gap in AI. Information 15, 92 (2024).

Borji, A. A categorical archive of ChatGPT failures. Preprint at arXiv https://doi.org/10.48550/arXiv.2302.03494 (2023).

Kopelman, M. D. Varieties of confabulation and delusion. Cogn. Neuropsychiatry 15, 14–37 (2010).

Johnson, D. D. P. & Fowler, J. H. The evolution of overconfidence. Nature 477, 317–320 (2011).

McCoy, R. T., Yao, S., Friedman, D., Hardy, M. & Griffiths, T. L. Embers of autoregression: understanding large language models through the problem they are trained to solve. Preprint at arXiv https://doi.org/10.48550/arXiv.2309.13638 (2023).

Perković, G., Drobnjak, A. & Botički, I. Hallucinations in LLMs: understanding and addressing challenges. In 47th MIPRO ICT Electron. Conv. (ed. Babic, S.) 2084–2088 (IEEE, 2024).

Sun, F., Li, N., Wang, K. & Goette, L. Large language models are overconfident and amplify human bias. Preprint at arXiv https://doi.org/10.48550/arXiv.2505.02151 (2025).

Casper, S. et al. Open problems and fundamental limitations of reinforcement learning from human feedback. Trans. Mach. Learn. Res. https://doi.org/10.3929/ethz-b-000651806 (2023).

Zhou, K., Hwang, J. D., Ren, X. & Sap, M. Relying on the unreliable: the impact of language models’ reluctance to express uncertainty. Preprint at arXiv https://doi.org/10.48550/arXiv.2401.06730 (2024).

Zhou, K., Jurafsky, D. & Hashimoto, T. Navigating the grey area: how expressions of uncertainty and overconfidence affect language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2302.13439 (2023).

Dunning, D. in Advances in Experimental Social Psychology Vol. 44 (eds Olson, J. M. & Zanna, M. P.) 247–296 (Academic, 2011).

Ji, Z. et al. Towards mitigating LLM hallucination via self reflection. In Find. Assoc. Comput. Linguist. (eds Bouamor, H., Pino, J. & Bali, K.) 1827–1843 (ACL, 2023).

O’Leary, D. E. Confirmation and specificity biases in large language models: an explorative study. IEEE Intell. Syst. 40, 63–68 (2025).

Wei, J. et al. Emergent abilities of large language models. Trans. Mach. Learn. Res. https://openreview.net/forum?id=yzkSU5zdwD (2022).

McCoy, R. T., Pavlick, E. & Linzen, T. Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. In Proc. 57th Ann. Meet. Assoc. Comput. Linguist. (eds Korhonen, A., Traum, D. & Màrquez, L.) 3428–3448 (ACL, 2019).

Webson, A. & Pavlick, E. Do prompt-based models really understand the meaning of their prompts? In Proc. 2022 Conf. North Am. Chapter Assoc. Comput. Linguist. Hum. Lang. Technol. (eds Carpuat, M., de Marneffe, M.-C. & Meza Ruiz, I. V.) 2300–2344 (ACL, 2022).

Lampinen, A. K. et al. Language models, like humans, show content effects on reasoning tasks. PNAS Nexus 3, 233 (2024).

Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).


Zheng, H. & Zhan, H. ChatGPT in scientific writing: a cautionary tale. Am. J. Med. 136, 725–726.e6 (2023).

Mitchell, M. & Krakauer, D. C. The debate over understanding in AI’s large language models. Proc. Natl Acad. Sci. USA 120, e2215907120 (2023).

Nelson, A. B. & Shiffrin, R. M. The co-evolution of knowledge and event memory. Psychol. Rev. 120, 356–394 (2013).

Bender, E. M., Gebru, T., McMillan-Major, A. & Shmitchell, S. On the dangers of stochastic parrots: can language models be too big? In Proc. 2021 ACM Conf. Fairness Account. Transpar. (eds Elish, M. C., Isaac, W. & Zemel, R.) 610–623 (ACM, 2021).

Webb, T., Holyoak, K. J. & Lu, H. Emergent analogical reasoning in large language models. Nat. Hum. Behav. 7, 1526–1541 (2023).

Heinlein, R. A. Stranger in a Strange Land (Putnam, 1969).

Power, A., Burda, Y., Edwards, H., Babuschkin, I. & Misra, V. Grokking: generalization beyond overfitting on small algorithmic datasets. Preprint at arXiv https://doi.org/10.48550/arXiv.2201.02177 (2022).

Seligman, M. E. P., Railton, P., Baumeister, R. F. & Sripada, C. Navigating into the future or driven by the past. Persp. Psychol. Sci. 8, 119–141 (2013).

Liu, S. et al. Using AI-generated suggestions from ChatGPT to optimize clinical decision support. J. Am. Med. Inform. Assoc. 30, 1237–1245 (2023).

Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3, 2 (2021).


Silva, G. A. Can AI understand what it’s telling you? Forbes https://www.forbes.com/sites/gabrielasilva/2025/04/23/can-ai-understand-the-chinese-room-argument-says-no-but-is-it-right (2025).

Ananthaswamy, A. New theory suggests chatbots can understand text. Quanta Magazine https://www.quantamagazine.org/new-theory-suggests-chatbots-can-understand-text-20240122/ (2024).

Moskvichev, A. K., Odouard, V. V. & Mitchell, M. The ConceptARC benchmark: evaluating understanding and generalization in the ARC domain. Trans. Mach. Learn. Res. https://openreview.net/forum?id=8ykyGbtt2q (2023).

Dijkstra, E. W. On IPW’s. Univ. Texas Austin https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD867.html (1983).

Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).

Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of GPT-4 on medical challenge problems. Preprint at arXiv https://doi.org/10.48550/arXiv.2303.13375 (2023).

Nori, H. et al. Sequential diagnosis with language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2506.22405 (2025).

Nie, Y. et al. A survey of large language models for financial applications: progress, prospects and challenges. Preprint at arXiv https://doi.org/10.48550/arXiv.2406.11903 (2024).

Aydın, Ö. & Karaarslan, E. OpenAI ChatGPT generated literature review: digital twin in healthcare. Emerg. Comput. Technol. 2, 22–31 (2022).


Ke, L., Tong, S., Cheng, P. & Peng, K. Exploring the frontiers of LLMs in psychological applications: a comprehensive review. Artif. Intell. Rev. 58, 305 (2025).

Hua, S., Jin, S. & Jiang, S. The limitations and ethical considerations of ChatGPT. Data Intell. 6, 201–239 (2024).

Chuma, E. L. & de Oliveira, G. G. Generative AI for business decision-making: a case of ChatGPT. Manag. Sci. Bus. Decis. 3, 5–11 (2023).


Eloundou, T., Manning, S., Mishkin, P. & Rock, D. GPTs are GPTs: labor market impact potential of LLMs. Science 384, 1306–1308 (2024).

Weidinger, L. et al. Ethical and social risks of harm from language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2112.04359 (2021).

Chen, J., Liu, L., Ruan, S., Li, M. & Yin, C. Are different versions of ChatGPT’s ability comparable to the clinical diagnosis presented in case reports? A descriptive study. J. Multidisc. Healthc. 16, 3825–3831 (2023).

OpenAI. Introducing ChatGPT. OpenAI https://openai.com/blog/chatgpt (2022).

An, J., Huang, D., Lin, C. & Tai, M. Measuring gender and racial biases in large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2403.15281 (2024).

Liang, P. P., Wu, C., Morency, L.-P. & Salakhutdinov, R. Towards understanding and mitigating social biases in language models. In Proc. 38th Int. Conf. Mach. Learn. (eds Meila, M. & Zhang, T.) 6565–6576 (PMLR, 2021).

Peters, D., Vold, K., Robinson, D. & Calvo, R. A. Responsible AI — two frameworks for ethical design practice. IEEE Trans. Technol. Soc. 1, 34–47 (2020).

Lake, B. M., Ullman, T. D., Tenenbaum, J. B. & Gershman, S. J. Building machines that learn and think like people. Behav. Brain Sci. 40, e253 (2017).

Liang, J. T., Lin, M., Rao, N. & Myers, B. A. Prompts are programs too! Understanding how developers build software containing prompts. Proc. ACM Softw. Eng. 2, 1591–1614 (2025).

De Neys, W. Bias and conflict: a case for logical intuitions. Persp. Psychol. Sci. 7, 28–38 (2012).

De Neys, W. Advancing theorizing about fast-and-slow thinking. Behav. Brain Sci. 46, e111 (2023).

Yao, S. et al. ReAct: synergizing reasoning and acting in language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2210.03629 (2023).

Wang, G. et al. Voyager: an open-ended embodied agent with large language models. Trans. Mach. Learn. Res. https://openreview.net/forum?id=ehfRiF0R3a (2023).

Nathani, D. et al. MLGym: a new framework and benchmark for advancing AI research agents. Preprint at arXiv https://doi.org/10.48550/arXiv.2502.14499 (2025).

Paglieri, D. et al. BALROG: benchmarking agentic LLM and VLM reasoning on games. In Proc. Int. Conf. Learn. Represent. (ICLR, 2025).

Liu, X. et al. AgentBench: evaluating LLMs as agents. In Proc. Int. Conf. Learn. Represent. (ICLR, 2024).

Huang, J. et al. How far are we on the decision-making of LLMs? Evaluating LLMs’ gaming ability in multi-agent environments. Preprint at arXiv https://doi.org/10.48550/arXiv.2403.11807 (2025).

Leng, Y. & Yuan, Y. Do LLM agents exhibit social behavior? Preprint at arXiv https://doi.org/10.48550/arXiv.2312.15198 (2024).

Costarelli, A. et al. GameBench: evaluating strategic reasoning abilities of LLM agents. In Lang. Gamific. NeurIPS 2024 Workshop (NeurIPS, 2024).

Trencsenyi, V., Mensfelt, A. & Stathis, K. Approximating human strategic reasoning with LLM-enhanced recursive reasoners leveraging multi-agent hypergames. Preprint at arXiv https://doi.org/10.48550/arXiv.2502.07443 (2025).

Dwivedi, Y. K. et al. Opinion paper: “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. Int. J. Inform. Manag. 71, 102642 (2023).


Vaugrante, L., Niepert, M. & Hagendorff, T. A looming replication crisis in evaluating behavior in language models? Evidence and solutions. Preprint at arXiv https://doi.org/10.48550/arXiv.2409.20303 (2024).

Koo, R. et al. Benchmarking cognitive biases in large language models as evaluators. In Find. Assoc. Comput. Linguist. (eds Ku, L.-W., Martins, A. & Srikumar, V.) 517–545 (ACL, 2024).

Wang, Y., Cai, Y., Chen, M., Liang, Y. & Hooi, B. Primacy effect of ChatGPT. In Proc. 2023 Conf. Empir. Methods Nat. Lang. Process. (eds Bouamor, H., Pino, J. & Bali, K.) 108–115 (ACL, 2023).

Mina, M., Ruiz-Fernández, V., Falcão, J., Vasquez-Reina, L. & Gonzalez-Agirre, A. Cognitive biases, task complexity, and result interpretability in large language models. In Proc. 31st Int. Conf. Comput. Linguist. (eds Rambow, O. et al.) 1767–1784 (ACL, 2025).

Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).

Radford, A. et al. Language models are unsupervised multitask learners. OpenAI Blog 1, 9 (2019).


OpenAI. Learning to reason with LLMs. OpenAI https://openai.com/index/learning-to-reason-with-llms (2024).

Pfau, J., Merrill, W. & Bowman, S. R. Let’s think dot by dot: hidden computation in transformer language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2404.15758 (2024).

Guo, D. et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (2025).

Kosinski, M. Evaluating large language models in theory of mind tasks. Proc. Natl Acad. Sci. USA 121, e2405460121 (2024).

Turpin, M., Michael, J., Perez, E. & Bowman, S. Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting. Adv. Neural Inf. Process. Syst. 36, 74952–74965 (2023).


Ameisen, E. et al. Circuit tracing: revealing computational graphs in language models. Transformer Circuits Thread https://transformer-circuits.pub/2025/attribution-graphs/methods.html (2025).

Peter, S., Riemer, K. & West, J. D. The benefits and dangers of anthropomorphic conversational agents. Proc. Natl Acad. Sci. USA 122, e2415898122 (2025).

Guthrie, S. E. Faces in the Clouds: A New Theory of Religion (Oxford Univ. Press, 1995).

Jones, C. R. & Bergen, B. K. Large language models pass the Turing test. Preprint at arXiv https://doi.org/10.48550/arXiv.2503.23674 (2025).

Singhal, K. et al. Toward expert-level medical question answering with large language models. Nat. Med. 31, 943–950 (2025).

Zerkouk, M., Mihoubi, M. & Chikhaoui, B. A comprehensive review of AI-based intelligent tutoring systems: applications and challenges. Preprint at arXiv https://doi.org/10.48550/arXiv.2507.18882 (2025).

Colombatto, C. & Fleming, S. M. Folk psychological attributions of consciousness to large language models. Neurosci. Conscious. 2024, niae013 (2024).

Aldahoul, N. et al. Large language models are often politically extreme, usually ideologically inconsistent, and persuasive even in informational contexts. Preprint at arXiv https://doi.org/10.48550/arXiv.2505.04171 (2025).

Lawrence, H. R. et al. The opportunities and risks of large language models in mental health. JMIR Ment. Health 11, e59479 (2024).

Zhang, Y., Zhao, D., Hancock, J. T., Kraut, R. & Yang, D. The rise of AI companions: how human-chatbot relationships influence well-being. Preprint at arXiv https://doi.org/10.48550/arXiv.2506.12605 (2025).

Akbulut, C., Weidinger, L., Manzini, A., Gabriel, I. & Rieser, V. All too human? Mapping and mitigating the risk from anthropomorphic AI. In Proc. AAAI/ACM Conf. AI Ethics Soc. Vol. 7 (eds Das, S. et al.) 13–26 (AAAI, 2024).

Shanahan, M. Talking about large language models. Commun. ACM 67, 68–79 (2024).

Zador, A. et al. Catalyzing next-generation artificial intelligence through neuroAI. Nat. Commun. 14, 1597 (2023).

Moser, E. I., Kropff, E. & Moser, M.-B. Place cells, grid cells, and the brain’s spatial representation system. Annu. Rev. Neurosci. 31, 69–89 (2008).

Hassabis, D., Kumaran, D., Summerfield, C. & Botvinick, M. Neuroscience-inspired artificial intelligence. Neuron 95, 245–258 (2017).

Mirzadeh, S. I. et al. GSM-symbolic: understanding the limitations of mathematical reasoning in large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2410.05229 (2024).

Kumaran, D., Hassabis, D. & McClelland, J. L. What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends Cogn. Sci. 20, 512–534 (2016).

Connell, L. & Lynott, D. What can language models tell us about human cognition? Curr. Dir. Psychol. Sci. 33, 181–189 (2024).

Martínez, L., Ruan, D. & Herrera, F. Computing with words in decision support systems: an overview on models and applications. Int. J. Comput. Intell. Syst. 3, 382–395 (2010).


Perlis, R. H., Goldberg, J. F., Ostacher, M. J. & Schneck, C. D. Clinical decision support for bipolar depression using large language models. Neuropsychopharmacology 49, 1412–1416 (2024).

Chiriatti, M., Ganapini, M., Panai, E., Ubiali, M. & Riva, G. The case for human–AI interaction as system 0 thinking. Nat. Hum. Behav. 8, 1829–1830 (2024).

Essel, H. B., Vlachopoulos, D., Essuman, A. B. & Amankwa, J. O. ChatGPT effects on cognitive skills of undergraduate students: receiving instant responses from AI-based conversational large language models (LLMs). Comput. Educ. Artif. Intell. 6, 100198 (2024).

Rasmequan, S. & Russ, S. Cognitive artefacts for decision support. In SMC 2000 Proc. Int. Conf. Syst. Man Cybernet. (eds von Keutz, S. et al.) 651–656 (IEEE, 2000).

Vicente, L. & Matute, H. Humans inherit artificial intelligence biases. Sci. Rep. 13, 15737 (2023).

Treiman, L. S., Ho, C.-J. & Kool, W. The consequences of AI training on human decision-making. Proc. Natl Acad. Sci. USA 121, e2408731121 (2024).

Frentz, T. S. Memory, myth, and rhetoric in Plato's Phaedrus. Rhetor. Soc. Q. 36, 243–262 (2006).

Heersmink, R. Extended mind and cognitive enhancement: moral aspects of cognitive artifacts. Phenomenol. Cogn. Sci. 16, 17–32 (2017).

Carr, N. The Shallows: What the Internet Is Doing to Our Brains (W. W. Norton, 2020).

Tanil, C. T. & Yong, M. H. Mobile phones: the effect of its presence on learning and memory. PLoS ONE 15, e0219233 (2020).

Clemenson, G. D., Maselli, A., Fiannaca, A. J., Miller, A. & Gonzalez-Franco, M. Rethinking GPS navigation: creating cognitive maps through auditory clues. Sci. Rep. 11, 7764 (2021).

Dahmani, L. & Bohbot, V. D. Habitual use of GPS negatively impacts spatial memory during self-guided navigation. Sci. Rep. 10, 6310 (2020).

Bai, L., Liu, X. & Su, J. ChatGPT: the cognitive effects on learning and memory. Brain-X 1, e30 (2023).

Costello, T. H., Pennycook, G. & Rand, D. G. Durably reducing conspiracy beliefs through dialogues with AI. Science 385, eadq1814 (2024).

Heersmink, R. Use of large language models might affect our cognitive skills. Nat. Hum. Behav. 8, 805–806 (2024).
