Comparison of Problem-solving Performance Across Mathematical Domains with Large Language Models
UDC: 004.89
Publication Language: English
Stuc. intelekt. 2024; 29(4):96-104
Abstract: This study investigates problem-solving performance across four mathematical domains, using statistical techniques to analyse domain-specific differences. By leveraging the NuminaMath-TIR dataset, we categorized problems into algebra, geometry, number theory, and combinatorics, selecting 8,000 problems for the analysis. Models including GPT-4o-mini, Mathstral-7B, Qwen2.5-Math-7B, and Llama-3.1-8B-Instruct were applied to assess answer correctness. Significant differences in solution accuracy were identified, with algebra showing the highest correctness rates and combinatorics the lowest. The results highlight the impact of domain on model performance and suggest the potential for tool-integrated reasoning (TIR) techniques to enhance consistency across domains. Future work can explore targeted model training improvements, aiming to optimize educational technologies and adaptive learning systems.
Keywords: Artificial Intelligence; Mathematical Problems; Natural Language Processing; Large Language Models; Automated Reasoning.
References:
- Anil, Rohan, et al. 2023. “PaLM 2 Technical Report”, available at https://arxiv.org/abs/2305.10403.
- Gao, Luyu, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. “PAL: Program-Aided Language Models”, available at https://arxiv.org/abs/2211.10435.
- Glushkov, V. M., K. P. Vershinin, Yu. V. Kapitonova, et al. 1974. “About a Formal Language for Recording Mathematical Texts: Automation of the Search for Proofs of Theorems in Mathematics.”
- Gou, Zhibin, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. 2024. “ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving”, available at https://arxiv.org/abs/2309.17452.
- Jiang, Albert Q., Alexandre Sablayrolles, Arthur Mensch, et al. 2023. “Mistral 7B”, available at https://arxiv.org/abs/2310.06825.
- Li, Jia, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, et al. 2024. “NuminaMath”, Project Numina GitHub repository, available at https://github.com/project-numina/aimo-progress-prize.
- Dubey, A., A. Jauhri, A. Pandey, A. Kadian, et al. 2024. “The Llama 3 Herd of Models”, available at https://arxiv.org/abs/2407.21783.
- OpenAI. 2024. “GPT-4 Technical Report”, available at https://arxiv.org/abs/2303.08774.
- Penedo, Guilherme, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. “The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only”, available at https://arxiv.org/abs/2306.01116.
- Touvron, Hugo, et al. 2023. “Llama 2: Open Foundation and Fine-Tuned Chat Models”, available at https://arxiv.org/abs/2307.09288.
- Trinh, Trieu, Yuhuai Wu, Quoc Le, and Thang Luong. 2024. “Solving Olympiad Geometry Without Human Demonstrations”, available at https://doi.org/10.1038/s41586-023-06747-5.
- Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, available at https://arxiv.org/abs/2201.11903.
- Yang, An, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, et al. 2024. “Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement”, available at https://arxiv.org/abs/2409.12122.