Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion
TLDR
This paper benchmarks cloud-hosted versus locally hosted LLMs as System Dynamics AI assistants, evaluating performance on causal loop diagram (CLD) extraction and interactive discussion tasks.
Key contributions
- Benchmarks cloud and local LLMs on two purpose-built System Dynamics tasks: structured CLD extraction and interactive model discussion.
- Cloud models achieve 77-89% pass rates on CLD extraction; the best local model reaches 77%, matching mid-tier cloud performance but lagging far behind on error-fixing tasks.
- Systematically analyzes model type effects: reasoning vs. instruction-tuned, backends (GGUF vs. MLX), and quantization.
- Backend choice (llama.cpp vs. mlx_lm) impacts JSON reliability and long-context performance more than quantization.
Why it matters
This research provides a benchmark for deploying LLMs in System Dynamics, quantifying the performance gap between cloud and local models. It offers practical guidance on backend and quantization choices, helping practitioners select and configure LLMs for specific tasks.
Original Abstract
We present a systematic evaluation of large language model families, spanning both proprietary cloud APIs and locally hosted open-source models, on two purpose-built benchmarks for System Dynamics AI assistance: the **CLD Leaderboard** (53 tests of structured causal loop diagram extraction) and the **Discussion Leaderboard** (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77-89% overall pass rates; the best local model reaches 77% (Kimi K2.5 GGUF Q3, zero-shot engine), matching mid-tier cloud performance. On Discussion, the best local models achieve 50-100% on model building steps and 47-75% on feedback explanation, but only 0-50% on error fixing, a category dominated by long-context prompts that expose memory limits in local deployments. A central contribution of this paper is a systematic analysis of *model type effects* on performance: we compare reasoning vs. instruction-tuned architectures, GGUF (llama.cpp) vs. MLX (mlx_lm) backends, and quantization levels (Q3 / Q4_K_M / MLX-3bit / MLX-4bit / MLX-6bit) across the same underlying model families. We find that backend choice has larger practical impact than quantization level: mlx_lm does not enforce JSON schema constraints, requiring explicit prompt-level JSON instructions, while llama.cpp grammar-constrained sampling handles JSON reliably but causes indefinite generation on long-context prompts for dense models. We document the full parameter sweep (t, p, k) for all local models, cleaned timing data (stuck requests excluded), and a practitioner guide for running 671B-123B parameter models on Apple Silicon.
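To make the backend contrast in the abstract concrete, below is a minimal sketch of the two JSON-handling strategies it describes: prompt-level JSON instructions for mlx_lm (which does not constrain decoding) versus schema-constrained sampling through a llama.cpp server. The CLD schema, prompt wording, model path, server URL, and exact request fields are illustrative assumptions, not details taken from the paper, and the mlx_lm / llama.cpp APIs shown should be checked against your installed versions.

```python
"""Sketch: prompt-level JSON (mlx_lm) vs. schema-constrained JSON (llama.cpp server)."""
import json
import requests                      # used to call a locally running llama.cpp server
from mlx_lm import load, generate    # mlx_lm backend for Apple Silicon

# Toy CLD schema (illustrative): a list of causal links, each with a polarity.
CLD_SCHEMA = {
    "type": "object",
    "properties": {
        "links": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "from": {"type": "string"},
                    "to": {"type": "string"},
                    "polarity": {"type": "string", "enum": ["+", "-"]},
                },
                "required": ["from", "to", "polarity"],
            },
        }
    },
    "required": ["links"],
}

TEXT = "Higher prices reduce demand, and lower demand eventually lowers prices."

def extract_cld_mlx(model_path: str) -> dict:
    """mlx_lm enforces no schema, so the JSON contract lives entirely in the prompt."""
    model, tokenizer = load(model_path)  # e.g. a local MLX 4-bit checkpoint (hypothetical path)
    prompt = (
        "Extract the causal loop diagram from the text below.\n"
        f"Respond with ONLY valid JSON matching this schema:\n{json.dumps(CLD_SCHEMA)}\n\n"
        f"Text: {TEXT}"
    )
    out = generate(model, tokenizer, prompt=prompt, max_tokens=512)
    return json.loads(out)  # may raise if the model ignores the instruction

def extract_cld_llamacpp(server_url: str = "http://localhost:8080") -> dict:
    """llama.cpp server route: decoding is constrained to the schema via grammar sampling."""
    resp = requests.post(
        f"{server_url}/completion",
        json={
            "prompt": f"Extract the causal loop diagram from: {TEXT}\nJSON:",
            "json_schema": CLD_SCHEMA,   # server converts the schema to a sampling grammar
            "n_predict": 512,
            "temperature": 0.0,
        },
        timeout=600,
    )
    return json.loads(resp.json()["content"])
```

The trade-off mirrors the paper's finding: prompt-level instructions can be ignored (so the parse step may fail), while grammar-constrained sampling guarantees well-formed JSON but, as the abstract notes, can run indefinitely on long-context prompts with dense models.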