Pierson, B. Lawsuit claims UnitedHealth AI wrongfully denies elderly extended care. Reuters https://www.reuters.com/legal/lawsuit-claims-unitedhealth-ai-wrongfully-denies-elderly-extended-care-2023-11-14 (2023).
Schreiber, M. New AI tool counters health insurance denials decided by automated algorithms. The Guardian https://www.theguardian.com/us-news/2025/jan/25/health-insurers-ai (2025).
Au Yeung, J. et al. AI chatbots not yet ready for clinical use. Front. Digit. Health 5, 1161098 (2023).
Azaria, A., Azoulay, R. & Reches, S. ChatGPT is a remarkable tool — for experts. Data Intell. 6, 240–296 (2024).
Meyrowitsch, D. W., Jensen, A. K., Sørensen, J. B. & Varga, T. V. AI chatbots and (mis)information in public health: impact on vulnerable communities. Front. Public Health 11, 1226776 (2023).
Kim, J. H. et al. When ChatGPT gives incorrect answers: the impact of inaccurate information by generative AI on tourism decision-making. J. Travel Res. 64, 51–73 (2023).
Sartori, G. & Orrù, G. Language models and psychological sciences. Front. Psychol. 14, 1279317 (2023).
Dillion, D., Tandon, N., Gu, Y. & Gray, K. Can AI language models replace human participants? Trends Cogn. Sci. 27, 597–600 (2023).
Kaddour, J. et al. Challenges and applications of large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2307.10169 (2023).
Griewing, S. et al. Evolution of publicly available large language models for complex decision-making in breast cancer care. Arch. Gynecol. Obstet. 310, 537–550 (2024).
Jimenez, C. E. et al. SWE-bench: can language models resolve real-world GitHub issues? Preprint at arXiv https://doi.org/10.48550/arXiv.2310.06770 (2024).
Jusman, I. A., Ausat, A. M. A. & Sumarna, A. Application of ChatGPT in business management and strategic decision making. J. Minfo Polgan 12, 1688–1697 (2023).
Lee, P., Bubeck, S. & Petro, J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N. Engl. J. Med. 388, 1233–1239 (2023).
Basir, A., Puspitasari, E. D., Aristarini, C. C., Sulastri, P. D. & Ausat, A. M. A. Ethical use of ChatGPT in the context of leadership and strategic decisions. J. Minfo Polgan 12, 1239–1246 (2023).
Gloria, B., Melsbach, J., Bienert, S. & Schoder, D. Real-GPT: efficiently tailoring LLMs for informed decision-making in the real estate industry. J. Real Estate Portf. Manag. 31, 56–72 (2024).
Binz, M. & Schulz, E. Using cognitive psychology to understand GPT-3. Proc. Natl Acad. Sci. USA 120, e2218523120 (2023).
Demszky, D. et al. Using large language models in psychology. Nat. Rev. Psychol. 2, 688–701 (2023).
Hagendorff, T. Machine psychology: investigating emergent capabilities and behavior in large language models using psychological methods. Preprint at arXiv https://doi.org/10.48550/arXiv.2303.13988 (2023).
White, J. et al. A prompt pattern catalog to enhance prompt engineering with ChatGPT. Preprint at arXiv https://doi.org/10.48550/arXiv.2302.11382 (2023).
Abbate, F. Natural and artificial intelligence: a comparative analysis of cognitive aspects. Minds Mach. 33, 791–815 (2023).
Rich, A. S. & Gureckis, T. M. Lessons for artificial intelligence from the study of natural stupidity. Nat. Mach. Intell. 1, 174–180 (2019).
Shiffrin, R. & Mitchell, M. Probing the psychology of AI models. Proc. Natl Acad. Sci. USA 120, e2300963120 (2023).
Suri, G., Slater, L. R., Ziaee, A. & Nguyen, M. Do large language models show decision heuristics similar to humans? A case study using GPT-3.5. J. Exp. Psychol. Gen. 153, 1066–1075 (2024).
Qu, Y. et al. Promoting interactions between cognitive science and large language models. Innovation 5, 100579 (2024).
Thaler, R. H. Behavioral economics: past, present, and future. Am. Econ. Rev. 106, 1577–1600 (2016).
Kahneman, D. Maps of bounded rationality: psychology for behavioral economics. Am. Econ. Rev. 93, 1449–1475 (2003).
Evans, J. S. & Stanovich, K. E. Dual-process theories of higher cognition: advancing the debate. Persp. Psychol. Sci. 8, 223–241 (2013).
Evans, J. S. Dual-processing accounts of reasoning, judgment, and social cognition. Annu. Rev. Psychol. 59, 255–278 (2008).
Kahneman, D. & Frederick, S. in Heuristics and Biases: The Psychology of Intuitive Judgment (eds Gilovich, T. et al.) 49–81 (Cambridge Univ. Press, 2002).
Gigerenzer, G. & Brighton, H. Homo heuristicus: why biased minds make better inferences. Top. Cogn. Sci. 1, 107–143 (2009).
Kahneman, D. & Klein, G. Conditions for intuitive expertise: a failure to disagree. Am. Psychol. 64, 515–526 (2009).
Smolensky, P. Connectionist AI, symbolic AI, and the brain. Artif. Intell. Rev. 1, 95–109 (1987).
Goel, A. Looking back, looking ahead: symbolic versus connectionist AI. AI Mag. 42, 83–85 (2022).
Bellini-Leite, S. C. Dual process theory: embodied and predictive; symbolic and classical. Front. Psychol. 13, 805386 (2022).
Bellini-Leite, S. C. Dual process theory for large language models: an overview of using psychology to address hallucination and reliability issues. Adapt. Behav. 32, 329–343 (2023).
Clark, A. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behav. Brain Sci. 36, 181–204 (2013).
Newell, A. Physical symbol systems. Cogn. Sci. 4, 135–183 (1980).
Marcus, G. The next decade in AI: four steps towards robust artificial intelligence. Preprint at arXiv https://doi.org/10.48550/arXiv.2002.06177 (2020).
Hagendorff, T., Fabi, S. & Kosinski, M. Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT. Nat. Comput. Sci. 3, 833–838 (2023).
Ma, D., Zhang, T. & Saunders, M. Is ChatGPT humanly irrational? Preprint at Res. Sq. https://doi.org/10.21203/rs.3.rs-3220513/v1 (2023).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 35, 24824–24837 (2022).
Gigerenzer, G. & Gaissmaier, W. Heuristic decision making. Annu. Rev. Psychol. 62, 451–482 (2011).
Tversky, A. & Kahneman, D. Judgment under uncertainty: heuristics and biases. Science 185, 1124–1131 (1974).
Frederick, S. Cognitive reflection and decision making. J. Econ. Perspect. 19, 25–42 (2005).
Erickson, T. D. & Mattson, M. E. From words to meaning: a semantic illusion. J. Verbal Learn. Verbal Behav. 20, 540–551 (1981).
Toplak, M. E., West, R. F. & Stanovich, K. E. The cognitive reflection test as a predictor of performance on heuristics-and-biases tasks. Mem. Cogn. 39, 1275–1289 (2011).
Pennycook, G., Cheyne, J. A., Koehler, D. J. & Fugelsang, J. A. Is the cognitive reflection test a measure of both reflection and intuition? Behav. Res. Meth. 48, 341–348 (2016).
Chen, Y., Kirshner, S. N., Ovchinnikov, A., Andiappan, M. & Jenkin, T. A manager and an AI walk into a bar: does ChatGPT make biased decisions like we do? Manuf. Serv. Oper. Manag. 27, 339–678 (2025).
Jones, E. & Steinhardt, J. Capturing failures of large language models via human cognitive biases. Adv. Neural Inf. Process. Syst. 35, 11785–11799 (2022).
Pantana, G., Castello, M. & Torre, I. Examining cognitive biases in ChatGPT 3.5 and ChatGPT 4 through human evaluation and linguistic comparison. In Proc. 16th Conf. Assoc. Mach. Transl. Am. (eds Knowles, R., Eriguchi, A. & Goel, S.) 250–260 (AMTA, 2024).
Ryu, J., Kim, J. & Kim, J. A study on the representativeness heuristics problem in large language models. IEEE Access 12, 147958–147966 (2024).
Tversky, A. & Kahneman, D. The framing of decisions and the psychology of choice. Science 211, 453–458 (1981).
Nickerson, R. S. Confirmation bias: a ubiquitous phenomenon in many guises. Rev. Gen. Psychol. 2, 175–220 (1998).
Lou, J. & Sun, Y. Anchoring bias in large language models: an experimental study. Preprint at arXiv https://doi.org/10.48550/arXiv.2412.06593 (2024).
Talboy, A. N. & Fuller, E. Challenging the appearance of machine intelligence: cognitive bias in LLMs and best practices for adoption. Preprint at arXiv https://doi.org/10.48550/arXiv.2304.01358 (2023).
Azaria, A. ChatGPT: more human-like than computer-like, but not necessarily in a good way. In 2023 IEEE 35th Int. Conf. Tools Artif. Intell. (eds Esposito, A., Yang, M. & Cordasco, G.) 468–473 (IEEE, 2023).
Acerbi, A. & Stubbersfield, J. M. Large language models show human-like content biases in transmission chain experiments. Proc. Natl Acad. Sci. USA 120, e2313790120 (2023).
Schramowski, P., Turan, C., Andersen, N., Rothkopf, C. A. & Kersting, K. Large pre-trained language models contain human-like biases of what is right and wrong to do. Nat. Mach. Intell. 4, 258–268 (2022).
Gallegos, I. O. et al. Bias and fairness in large language models: a survey. Comput. Linguist. 50, 1097–1179 (2024).
Wang, P., Xiao, Z., Chen, H. & Oswald, F. L. Will the real Linda please stand up…to large language models? Examining the representativeness heuristic in LLMs. Preprint at arXiv https://doi.org/10.48550/arXiv.2404.01461 (2024).
Nguyen, J. K. Human bias in AI models? Anchoring effects and mitigation strategies in large language models. J. Behav. Exp. Finance 43, 100971 (2024).
Tversky, A. & Kahneman, D. Extensional versus intuitive reasoning: the conjunction fallacy in probability judgment. Psychol. Rev. 90, 293–315 (1983).
Tversky, A. & Kahneman, D. Availability: a heuristic for judging frequency and probability. Cogn. Psychol. 5, 207–232 (1973).
Ariely, D., Loewenstein, G. & Prelec, D. “Coherent arbitrariness”: stable demand curves without stable preferences. Q. J. Econ. 118, 73–106 (2003).
Newell, A. & Simon, H. A. Computer science as empirical inquiry: symbols and search. Commun. ACM 19, 113–126 (1976).
Yao, S. et al. Tree of thoughts: deliberate problem solving with large language models. Adv. Neural Inf. Process. Syst. 36, 11809–11822 (2023).
Du, M. Machine vs. human, who makes a better judgment on innovation? Take GPT-4 for example. Front. Artif. Intell. 6, 1206516 (2023).
Ziegler, D. M. et al. Fine-tuning language models from human preferences. Preprint at arXiv https://doi.org/10.48550/arXiv.1909.08593 (2019).
Ouyang, L. et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 35, 27730–27744 (2022).
Su, D., Sukhbaatar, S., Rabbat, M., Tian, Y. & Zheng, Q. Dualformer: controllable fast and slow thinking by learning with randomized reasoning traces. Preprint at arXiv https://doi.org/10.48550/arXiv.2410.09918 (2025).
Shang, Y., Li, Y., Xu, F. & Li, Y. DefInt: a default-interventionist framework for efficient reasoning with hybrid large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2402.02563 (2024).
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large language models are zero-shot reasoners. Adv. Neural Inf. Process. Syst. 35, 22199–22213 (2022).
Weston, J. & Sukhbaatar, S. System 2 Attention (is something you might need too). Preprint at arXiv https://doi.org/10.48550/arXiv.2311.11829 (2023).
Zhang, H., Huang, J., Li, Z., Naik, M. & Xing, E. Improved logical reasoning of language models via differentiable symbolic programming. In Find. Assoc. Comput. Linguist. (eds Rogers, A., Boyd-Graber, J. & Okazaki, N.) 3062–3077 (ACL, 2023).
Zhu, X. et al. Solving math word problems via cooperative reasoning induced language models. In Proc. 61st Ann. Meet. Assoc. Comput. Linguist. (eds Rogers, A., Boyd-Graber, J. & Okazaki, N.) 4471–4485 (ACL, 2023).
Raoelison, M., Thompson, V. A. & De Neys, W. The smart intuitor: cognitive capacity predicts intuitive rather than deliberate thinking. Cognition 204, 104381 (2020).
De Neys, W. & Pennycook, G. Logic, fast and slow: advances in dual-process theorizing. Curr. Dir. Psychol. Sci. 28, 503–509 (2019).
Reyna, V. F. & Brainerd, C. J. Numeracy, gist, literal thinking and the value of nothing in decision making. Nat. Rev. Psychol. 2, 421–439 (2023).
Bubeck, S. et al. Sparks of artificial general intelligence: early experiments with GPT-4. Preprint at arXiv https://doi.org/10.48550/arXiv.2303.12712 (2023).
Sclar, M., Choi, Y., Tsvetkov, Y. & Suhr, A. Quantifying language models’ sensitivity to spurious features in prompt design or: how I learned to start worrying about prompt formatting. Preprint at arXiv https://doi.org/10.48550/arXiv.2310.11324 (2023).
Pezeshkpour, P. & Hruschka, E. Large language models sensitivity to the order of options in multiple-choice questions. In Find. Assoc. Comput. Linguist. (eds Duh, K., Gomez, H. & Bethard, S.) 2006–2017 (ACL, 2024).
Loya, M., Sinha, D. A. & Futrell, R. Exploring the sensitivity of LLMs’ decision-making capabilities: insights from prompt variation and hyperparameters. In Find. Assoc. Comput. Linguist. (eds Bouamor, H., Pino, J. & Bali, K.) 3711–3716 (ACL, 2023).
Barez, F. et al. Chain-of-thought is not explainability. Preprint at alphaXiv https://www.alphaxiv.org/overview/2025.02v3 (2025).
Zhang, Z. et al. Multimodal chain-of-thought reasoning in language models. Trans. Mach. Learn. Res. https://openreview.net/forum?id=y1pPWFVfvR (2024).
Chakraborty, N., Ornik, M. & Driggs-Campbell, K. Hallucination detection in foundation models for decision-making: a flexible definition and review of the state of the art. ACM Comput. Surv. 57, 188:1–188:35 (2025).
Stella, M., Hills, T. T. & Kenett, Y. N. Using cognitive psychology to understand GPT-like models needs to extend beyond human biases. Proc. Natl Acad. Sci. USA 120, e2312911120 (2023).
Huang, L. et al. A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 43, 42 (2025).
Smith, A. L., Greaves, F. & Panch, T. Hallucination or confabulation? Neuroanatomy as metaphor in large language models. PLOS Digit. Health 2, e0000388 (2023).
Singh, A. K., Lamichhane, B., Devkota, S., Dhakal, U. & Dhakal, C. Do large language models show human-like biases? Exploring confidence–competence gap in AI. Information 15, 92 (2024).
Borji, A. A categorical archive of ChatGPT failures. Preprint at arXiv https://doi.org/10.48550/arXiv.2302.03494 (2023).
Kopelman, M. D. Varieties of confabulation and delusion. Cogn. Neuropsychiatry 15, 14–37 (2010).
Johnson, D. D. P. & Fowler, J. H. The evolution of overconfidence. Nature 477, 317–320 (2011).
McCoy, R. T., Yao, S., Friedman, D., Hardy, M. & Griffiths, T. L. Embers of autoregression: understanding large language models through the problem they are trained to solve. Preprint at arXiv https://doi.org/10.48550/arXiv.2309.13638 (2023).
Perković, G., Drobnjak, A. & Botički, I. Hallucinations in LLMs: understanding and addressing challenges. In 47th MIPRO ICT Electron. Conv. (ed. Babic, S.) 2084–2088 (IEEE, 2024).
Sun, F., Li, N., Wang, K. & Goette, L. Large language models are overconfident and amplify human bias. Preprint at arXiv https://doi.org/10.48550/arXiv.2505.02151 (2025).
Casper, S. et al. Open problems and fundamental limitations of reinforcement learning from human feedback. Trans. Mach. Learn. Res. https://doi.org/10.3929/ethz-b-000651806 (2023).
Zhou, K., Hwang, J. D., Ren, X. & Sap, M. Relying on the unreliable: the impact of language models’ reluctance to express uncertainty. Preprint at arXiv https://doi.org/10.48550/arXiv.2401.06730 (2024).
Zhou, K., Jurafsky, D. & Hashimoto, T. Navigating the grey area: how expressions of uncertainty and overconfidence affect language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2302.13439 (2023).
Dunning, D. in Advances in Experimental Social Psychology Vol. 44 (eds Olson, J. M. & Zanna, M. P.) 247–296 (Academic, 2011).
Ji, Z. et al. Towards mitigating LLM hallucination via self reflection. In Find. Assoc. Comput. Linguist. (eds Bouamor, H., Pino, J. & Bali, K.) 1827–1843 (ACL, 2023).
O’Leary, D. E. Confirmation and specificity biases in large language models: an explorative study. IEEE Intell. Syst. 40, 63–68 (2025).
Wei, J. et al. Emergent abilities of large language models. Trans. Mach. Learn. Res. https://openreview.net/forum?id=yzkSU5zdwD (2022).
McCoy, R. T., Pavlick, E. & Linzen, T. Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. In Proc. 57th Ann. Meet. Assoc. Comput. Linguist. (eds Korhonen, A., Traum, D. & Màrquez, L.) 3428–3448 (ACL, 2019).
Webson, A. & Pavlick, E. Do prompt-based models really understand the meaning of their prompts? In Proc. 2022 Conf. North Am. Chapter Assoc. Comput. Linguist. Hum. Lang. Technol. (eds Carpuat, M., de Marneffe, M.-C. & Meza Ruiz, I. V.) 2300–2344 (ACL, 2022).
Lampinen, A. K. et al. Language models, like humans, show content effects on reasoning tasks. PNAS Nexus 3, 233 (2024).
Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
Zheng, H. & Zhan, H. ChatGPT in scientific writing: a cautionary tale. Am. J. Med. 136, 725–726.e6 (2023).
Mitchell, M. & Krakauer, D. C. The debate over understanding in AI’s large language models. Proc. Natl Acad. Sci. USA 120, e2215907120 (2023).
Nelson, A. B. & Shiffrin, R. M. The co-evolution of knowledge and event memory. Psychol. Rev. 120, 356–394 (2013).
Bender, E. M., Gebru, T., McMillan-Major, A. & Shmitchell, S. On the dangers of stochastic parrots: can language models be too big? In Proc. 2021 ACM Conf. Fairness Account. Transpar. (eds Elish, M. C., Isaac, W. & Zemel, R.) 610–623 (ACM, 2021).
Webb, T., Holyoak, K. J. & Lu, H. Emergent analogical reasoning in large language models. Nat. Hum. Behav. 7, 1526–1541 (2023).
Heinlein, R. A. Stranger in a Strange Land (Putnam, 1969).
Power, A., Burda, Y., Edwards, H., Babuschkin, I. & Misra, V. Grokking: generalization beyond overfitting on small algorithmic datasets. Preprint at arXiv https://doi.org/10.48550/arXiv.2201.02177 (2022).
Seligman, M. E. P., Railton, P., Baumeister, R. F. & Sripada, C. Navigating into the future or driven by the past. Persp. Psychol. Sci. 8, 119–141 (2013).
Liu, S. et al. Using AI-generated suggestions from ChatGPT to optimize clinical decision support. J. Am. Med. Inform. Assoc. 30, 1237–1245 (2023).
Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3, 2 (2021).
Silva, G. A. Can AI understand what it’s telling you? Forbes https://www.forbes.com/sites/gabrielasilva/2025/04/23/can-ai-understand-the-chinese-room-argument-says-no-but-is-it-right (2025).
Ananthaswamy, A. New theory suggests chatbots can understand text. Quanta Magazine https://www.quantamagazine.org/new-theory-suggests-chatbots-can-understand-text-20240122/ (2024).
Moskvichev, A. K., Odouard, V. V. & Mitchell, M. The ConceptARC benchmark: evaluating understanding and generalization in the ARC domain. Trans. Mach. Learn. Res. https://openreview.net/forum?id=8ykyGbtt2q (2023).
Dijkstra, E. W. On IPW’s. Univ. Texas Austin https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD867.html (1983).
Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).
Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of GPT-4 on medical challenge problems. Preprint at arXiv https://doi.org/10.48550/arXiv.2303.13375 (2023).
Nori, H. et al. Sequential diagnosis with language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2506.22405 (2025).
Nie, Y. et al. A survey of large language models for financial applications: progress, prospects and challenges. Preprint at arXiv https://doi.org/10.48550/arXiv.2406.11903 (2024).
Aydın, Ö. & Karaarslan, E. OpenAI ChatGPT generated literature review: digital twin in healthcare. Emerg. Comput. Technol. 2, 22–31 (2022).
Ke, L., Tong, S., Cheng, P. & Peng, K. Exploring the frontiers of LLMs in psychological applications: a comprehensive review. Artif. Intell. Rev. 58, 305 (2025).
Hua, S., Jin, S. & Jiang, S. The limitations and ethical considerations of ChatGPT. Data Intell. 6, 201–239 (2024).
Chuma, E. L. & de Oliveira, G. G. Generative AI for business decision-making: a case of ChatGPT. Manag. Sci. Bus. Decis. 3, 5–11 (2023).
Eloundou, T., Manning, S., Mishkin, P. & Rock, D. GPTs are GPTs: labor market impact potential of LLMs. Science 384, 1306–1308 (2024).
Weidinger, L. et al. Ethical and social risks of harm from language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2112.04359 (2021).
Chen, J., Liu, L., Ruan, S., Li, M. & Yin, C. Are different versions of ChatGPT’s ability comparable to the clinical diagnosis presented in case reports? A descriptive study. J. Multidiscip. Healthc. 16, 3825–3831 (2023).
OpenAI. Introducing ChatGPT. OpenAI https://openai.com/blog/chatgpt (2022).
An, J., Huang, D., Lin, C. & Tai, M. Measuring gender and racial biases in large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2403.15281 (2024).
Liang, P. P., Wu, C., Morency, L.-P. & Salakhutdinov, R. Towards understanding and mitigating social biases in language models. In Proc. 38th Int. Conf. Mach. Learn. (eds Meila, M. & Zhang, T.) 6565–6576 (PMLR, 2021).
Peters, D., Vold, K., Robinson, D. & Calvo, R. A. Responsible AI — two frameworks for ethical design practice. IEEE Trans. Technol. Soc. 1, 34–47 (2020).
Lake, B. M., Ullman, T. D., Tenenbaum, J. B. & Gershman, S. J. Building machines that learn and think like people. Behav. Brain Sci. 40, e253 (2017).
Liang, J. T., Lin, M., Rao, N. & Myers, B. A. Prompts are programs too! Understanding how developers build software containing prompts. Proc. ACM Softw. Eng. 2, 1591–1614 (2025).
De Neys, W. Bias and conflict: a case for logical intuitions. Persp. Psychol. Sci. 7, 28–38 (2012).
De Neys, W. Advancing theorizing about fast-and-slow thinking. Behav. Brain Sci. 46, e111 (2023).
Yao, S. et al. ReAct: synergizing reasoning and acting in language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2210.03629 (2023).
Wang, G. et al. Voyager: an open-ended embodied agent with large language models. Trans. Mach. Learn. Res. https://openreview.net/forum?id=ehfRiF0R3a (2023).
Nathani, D. et al. MLGym: a new framework and benchmark for advancing AI research agents. Preprint at arXiv https://doi.org/10.48550/arXiv.2502.14499 (2025).
Paglieri, D. et al. BALROG: benchmarking agentic LLM and VLM reasoning on games. In Proc. Int. Conf. Learn. Represent. (ICLR, 2025).
Liu, X. et al. Agentbench: evaluating LLMs as agents. In Proc. Int. Conf. Learning Represent. (ICLR, 2024).
Huang, J. et al. How far are we on the decision-making of LLMs? Evaluating LLMs’ gaming ability in multi-agent environments. Preprint at arXiv https://doi.org/10.48550/arXiv.2403.11807 (2025).
Leng, Y. & Yuan, Y. Do LLM agents exhibit social behavior? Preprint at arXiv https://doi.org/10.48550/arXiv.2312.15198 (2024).
Costarelli, A. et al. GameBench: evaluating strategic reasoning abilities of LLM agents. In Lang. Gamific. NeurIPS 2024 Workshop (NeurIPS, 2024).
Trencsenyi, V., Mensfelt, A. & Stathis, K. Approximating human strategic reasoning with LLM-enhanced recursive reasoners leveraging multi-agent hypergames. Preprint at arXiv https://doi.org/10.48550/arXiv.2502.07443 (2025).
Dwivedi, Y. K. et al. Opinion paper: “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. Int. J. Inform. Manag. 71, 102642 (2023).
Vaugrante, L., Niepert, M. & Hagendorff, T. A looming replication crisis in evaluating behavior in language models? Evidence and solutions. Preprint at arXiv https://doi.org/10.48550/arXiv.2409.20303 (2024).
Koo, R. et al. Benchmarking cognitive biases in large language models as evaluators. In Find. Assoc. Comput. Linguist. (eds Ku, L.-W., Martins, A. & Srikumar, V.) 517–545 (ACL, 2024).
Wang, Y., Cai, Y., Chen, M., Liang, Y. & Hooi, B. Primacy effect of ChatGPT. In Proc. 2023 Conf. Empir. Methods Nat. Lang. Process. (eds Bouamor, H., Pino, J. & Bali, K.) 108–115 (ACL, 2023).
Mina, M., Ruiz-Fernández, V., Falcão, J., Vasquez-Reina, L. & Gonzalez-Agirre, A. Cognitive biases, task complexity, and result interpretability in large language models. In Proc. 31st Int. Conf. Comput. Linguist. (eds Rambow, O. et al.) 1767–1784 (ACL, 2025).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Radford, A. et al. Language models are unsupervised multitask learners. OpenAI Blog 1, 9 (2019).
OpenAI. Learning to reason with LLMs. OpenAI https://openai.com/index/learning-to-reason-with-llms (2024).
Pfau, J., Merrill, W. & Bowman, S. R. Let’s think dot by dot: hidden computation in transformer language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2404.15758 (2024).
Guo, D. et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (2025).
Kosinski, M. Evaluating large language models in theory of mind tasks. Proc. Natl Acad. Sci. USA 121, e2405460121 (2024).
Turpin, M., Michael, J., Perez, E. & Bowman, S. Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting. Adv. Neural Inf. Process. Syst. 36, 74952–74965 (2023).
Ameisen, E. et al. Circuit tracing: revealing computational graphs in language models. Transformer Circuits Thread https://transformer-circuits.pub/2025/attribution-graphs/methods.html (2025).
Peter, S., Riemer, K. & West, J. D. The benefits and dangers of anthropomorphic conversational agents. Proc. Natl Acad. Sci. USA 122, e2415898122 (2025).
Guthrie, S. E. Faces in the Clouds: A New Theory of Religion (Oxford Univ. Press, 1995).
Jones, C. R. & Bergen, B. K. Large language models pass the Turing test. Preprint at arXiv https://doi.org/10.48550/arXiv.2503.23674 (2025).
Singhal, K. et al. Toward expert-level medical question answering with large language models. Nat. Med. 31, 943–950 (2025).
Zerkouk, M., Mihoubi, M. & Chikhaoui, B. A comprehensive review of AI-based intelligent tutoring systems: applications and challenges. Preprint at arXiv https://doi.org/10.48550/arXiv.2507.18882 (2025).
Colombatto, C. & Fleming, S. M. Folk psychological attributions of consciousness to large language models. Neurosci. Conscious. 2024, niae013 (2024).
Aldahoul, N. et al. Large language models are often politically extreme, usually ideologically inconsistent, and persuasive even in informational contexts. Preprint at arXiv https://doi.org/10.48550/arXiv.2505.04171 (2025).
Lawrence, H. R. et al. The opportunities and risks of large language models in mental health. JMIR Ment. Health 11, e59479 (2024).
Zhang, Y., Zhao, D., Hancock, J. T., Kraut, R. & Yang, D. The rise of AI companions: how human-chatbot relationships influence well-being. Preprint at arXiv https://doi.org/10.48550/arXiv.2506.12605 (2025).
Akbulut, C., Weidinger, L., Manzini, A., Gabriel, I. & Rieser, V. All too human? Mapping and mitigating the risk from anthropomorphic AI. In Proc. AAAI/ACM Conf. AI Ethics Soc. Vol. 7 (eds Das, S. et al.) 13–26 (AAAI, 2024).
Shanahan, M. Talking about large language models. Commun. ACM 67, 68–79 (2024).
Zador, A. et al. Catalyzing next-generation artificial intelligence through neuroAI. Nat. Commun. 14, 1597 (2023).
Moser, E. I., Kropff, E. & Moser, M.-B. Place cells, grid cells, and the brain’s spatial representation system. Annu. Rev. Neurosci. 31, 69–89 (2008).
Hassabis, D., Kumaran, D., Summerfield, C. & Botvinick, M. Neuroscience-inspired artificial intelligence. Neuron 95, 245–258 (2017).
Mirzadeh, S. I. et al. GSM-symbolic: understanding the limitations of mathematical reasoning in large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2410.05229 (2024).
Kumaran, D., Hassabis, D. & McClelland, J. L. What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends Cogn. Sci. 20, 512–534 (2016).
Connell, L. & Lynott, D. What can language models tell us about human cognition? Curr. Dir. Psychol. Sci. 33, 181–189 (2024).
Martínez, L., Ruan, D. & Herrera, F. Computing with words in decision support systems: an overview on models and applications. Int. J. Comput. Intell. Syst. 3, 382–395 (2010).
Perlis, R. H., Goldberg, J. F., Ostacher, M. J. & Schneck, C. D. Clinical decision support for bipolar depression using large language models. Neuropsychopharmacology 49, 1412–1416 (2024).
Chiriatti, M., Ganapini, M., Panai, E., Ubiali, M. & Riva, G. The case for human–AI interaction as system 0 thinking. Nat. Hum. Behav. 8, 1829–1830 (2024).
Essel, H. B., Vlachopoulos, D., Essuman, A. B. & Amankwa, J. O. ChatGPT effects on cognitive skills of undergraduate students: receiving instant responses from AI-based conversational large language models (LLMs). Comput. Educ. Artif. Intell. 6, 100198 (2024).
Rasmequan, S. & Russ, S. Cognitive artefacts for decision support. In SMC 2000 Proc. Int. Conf. Syst. Man Cybernet. (eds von Keutz, S. et al.) 651–656 (IEEE, 2000).
Vicente, L. & Matute, H. Humans inherit artificial intelligence biases. Sci. Rep. 13, 15737 (2023).
Treiman, L. S., Ho, C.-J. & Kool, W. The consequences of AI training on human decision-making. Proc. Natl Acad. Sci. USA 121, e2408731121 (2024).
Frentz, T. S. Memory, myth, and rhetoric in Plato’s Phaedrus. Rhetor. Soc. Q. 36, 243–262 (2006).
Heersmink, R. Extended mind and cognitive enhancement: moral aspects of cognitive artifacts. Phenomenol. Cogn. Sci. 16, 17–32 (2017).
Carr, N. The Shallows: What the Internet Is Doing to Our Brains (W. W. Norton, 2020).
Tanil, C. T. & Yong, M. H. Mobile phones: the effect of its presence on learning and memory. PLoS ONE 15, e0219233 (2020).
Clemenson, G. D., Maselli, A., Fiannaca, A. J., Miller, A. & Gonzalez-Franco, M. Rethinking GPS navigation: creating cognitive maps through auditory clues. Sci. Rep. 11, 7764 (2021).
Dahmani, L. & Bohbot, V. D. Habitual use of GPS negatively impacts spatial memory during self-guided navigation. Sci. Rep. 10, 6310 (2020).
Bai, L., Liu, X. & Su, J. ChatGPT: the cognitive effects on learning and memory. Brain-X 1, e30 (2023).
Costello, T. H., Pennycook, G. & Rand, D. G. Durably reducing conspiracy beliefs through dialogues with AI. Science 385, eadq1814 (2024).
Heersmink, R. Use of large language models might affect our cognitive skills. Nat. Hum. Behav. 8, 805–806 (2024).