January 19, 2025:
AI Struggles with Historical Accuracy in New Study - A new study shows that leading large language models (LLMs) such as GPT-4, Llama, and Google's Gemini have trouble with historical accuracy, especially in nuanced or less-documented contexts. On the Hist-LLM benchmark, the best-performing model reached only 46% accuracy, a result the researchers attribute to biases in training data and the difficulty of retrieving obscure information.
The researchers presented these findings at NeurIPS, emphasizing both the room for improvement and the usefulness of LLMs in historical research. They suggest refining data sampling and question complexity, and note that LLMs still fall short of human expertise in advanced historical inquiry.