SCML Proceeding Series
Accepted papers are published in the "RISC Proceedings on Symbolic Computation and Machine Learning (SCML)" under the Creative Commons Attribution 4.0 International License. They receive a cover page in the style shown below, are archived with a DOI, and are freely available for download from this web page.
Please consider subscribing to the SCML Mailing List in order to be informed about papers published in the SCML proceedings.
RISC Proceedings on Symbolic Computation and
Machine Learning (SCML)
Multilingual Autoformalization via Fine-tuning Large Language Models with Symbolically Generated Data
2025, 18 pages
2025-12-12
Abstract: Autoformalization, translating informal mathematical text into formal machine-checkable code, has recently attracted new interest thanks to advances in large language models. However, progress has been hindered by the lack of high-quality parallel corpora pairing natural language with formal code, especially for less widely used systems such as Agda. In this work, we address this gap by demonstrating a systematic data generation pipeline via the Informath project based on Grammatical Framework on the informal side and Dedukti on the formal side. Using this method, we construct a 400K “4-to-3” parallel corpus spanning four formal languages (Dedukti, Agda, Rocq, Lean) and three natural languages (English, French, Swedish). To validate the utility of the approach, we fine-tune the open-source Qwen2.5-7B-Instruct model and achieve substantial gains: BLEU-4 improves from 32.90 to 76.16, and Agda syntax error rate falls from 98.43% to under 8%. We further explore joint training across multiple formal and natural languages, demonstrating that multilingual and multi-formal regimes yield notable improvements in low-resource scenarios. Our work establishes an easily extensible multilingual autoformalization system and a test dataset with hand-written theorem statements paired with Agda, Dedukti, Lean, and Rocq. We will also propose some systematic insights into dataset construction and multilingual training for future research.
more ...
autoformalization, informalization, large language models, Agda, Dedukti, Grammatical Framework
Contents
News
December 15, 2025: SCML-2026 Call for Presentations
December 12, 2025: SCML Publication 1: Multilingual Autoformalization via Fine-tuning Large Language Models with Symbolically Generated Data
July 8, 2025: SCML goes public.
