Long-horizon sequential decision-making remains a fundamental challenge for language model-based agents operating in partially observable environments with delayed reward signals. While LLMs demonstrate strong reasoning capabilities, they typically fail to develop persistent abstractions—reusable skill primitives—that could be composed across episodes. This limitation becomes particularly acute in game environments requiring multi-step planning, skill chaining, and robust decision-making under uncertainty.

COSPLAY addresses this through a bidirectional co-evolutionary mechanism. The framework maintains two coupled components: (1) an LLM decision agent that learns to retrieve and compose skills from a learnable skill bank, and (2) an automated skill pipeline that discovers behavioral abstractions from unlabeled agent rollouts. The decision agent generates actions conditioned on retrieved skill specifications, while the skill extraction module continuously mines trajectories to identify reusable skill contracts—formal specifications encompassing preconditions, effects, and execution semantics. This mutual refinement creates a virtuous cycle where improved skill representations enable better decision-making, which in turn generates higher-quality trajectories for skill discovery.
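The coupling described above can be made concrete with a minimal sketch. The class and field names below (`SkillContract`, `SkillBank`, `discover_skills`) are illustrative assumptions, not the paper's actual schema: a contract pairs preconditions with effects and a natural-language recipe, retrieval ranks contracts by precondition overlap with the current state, and discovery promotes repeated (precondition, action, effect) triples from rollouts into new contracts.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class SkillContract:
    """Hypothetical skill contract: preconditions, effects, and an
    execution recipe. Field names are assumptions for illustration."""
    name: str
    preconditions: frozenset  # predicates that must hold before execution
    effects: frozenset        # predicates expected to hold afterwards
    recipe: str               # natural-language execution semantics for the agent

class SkillBank:
    """Toy skill bank with predicate-overlap retrieval."""
    def __init__(self):
        self.skills: list[SkillContract] = []

    def add(self, skill: SkillContract) -> None:
        # Deduplicate by name so repeated discovery does not bloat the bank.
        if all(s.name != skill.name for s in self.skills):
            self.skills.append(skill)

    def retrieve(self, state_predicates: set, k: int = 2) -> list[SkillContract]:
        # Rank skills by how many preconditions the current state satisfies.
        scored = sorted(
            self.skills,
            key=lambda s: len(s.preconditions & state_predicates),
            reverse=True,
        )
        return scored[:k]

def discover_skills(trajectories) -> list[SkillContract]:
    """Stand-in for the skill-extraction module: promote any
    (precondition, action, effect) triple seen at least twice."""
    counts = Counter(
        (step["pre"], step["action"], step["post"])
        for traj in trajectories for step in traj
    )
    return [
        SkillContract(
            name=action,
            preconditions=frozenset([pre]),
            effects=frozenset([post]),
            recipe=f"when '{pre}' holds, do '{action}' to reach '{post}'",
        )
        for (pre, action, post), n in counts.items() if n >= 2
    ]
```

In the co-evolutionary loop, the agent would condition its next action on `bank.retrieve(current_state_predicates)`, and each batch of new rollouts would pass through `discover_skills` to refresh the bank, closing the cycle the paragraph above describes.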

The experimental validation spans six game environments with varying complexity profiles. An 8B-parameter base model equipped with COSPLAY achieves a 25.1% average reward improvement over frontier LLM baselines on single-player benchmarks, while maintaining competitive performance on multi-agent social reasoning tasks. This suggests that the framework successfully balances skill specialization with generalization across diverse task structures.

The approach represents a meaningful departure from monolithic LLM agents by introducing structured skill abstraction as an explicit architectural component. Rather than relying on in-context learning or fine-tuning alone, COSPLAY enables agents to build compositional knowledge structures that persist and evolve across episodes—a critical capability for agents operating in complex interactive domains.