Algorithm selection—the problem of choosing the most suitable solver for a given problem instance—traditionally relies on hand-crafted, domain-specific features that require substantial expertise to design and maintain. This work challenges that paradigm by demonstrating that pretrained language model embeddings can effectively capture instance characteristics without explicit feature engineering. The authors propose ZeroFolio, a three-stage pipeline that serializes raw problem instances as text, embeds them using pretrained models, and performs weighted k-nearest neighbor classification to select algorithms.
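The first two stages of the pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: `serialize_instance` and `toy_embed` are hypothetical names, and the hashed character-trigram embedding is only a runnable stand-in for the pretrained language-model embedder the paper actually uses; it fixes the interface (text in, fixed-size vector out) and nothing more.

```python
import hashlib
import numpy as np

def serialize_instance(raw_bytes):
    # Stage 1: treat the raw instance file verbatim as text -- no parsing,
    # no hand-crafted feature extraction (e.g., a DIMACS CNF or MIP file).
    return raw_bytes.decode("utf-8", errors="replace")

def toy_embed(text, dim=64):
    # Stage 2 stand-in: hash character trigrams into a fixed-size vector.
    # The paper uses pretrained language-model embeddings here; this toy
    # function merely mimics the text -> vector interface.
    vec = np.zeros(dim)
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```

Stage 3 (weighted k-nearest neighbor selection) then operates on these vectors; swapping the toy embedder for a real pretrained model changes nothing downstream.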
The methodology capitalizes on an underexplored property of modern embeddings: their ability to implicitly encode structural and semantic information about problem instances across heterogeneous domains. By treating the raw instance file as unstructured text—whether a SAT formula, a constraint satisfaction problem, or a mixed-integer program—the framework achieves domain transfer without retraining. The selection mechanism employs inverse-distance weighting with the Manhattan distance metric in embedding space, and ablation studies confirm both design choices as critical.
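The selection mechanism described above can be sketched in a few lines of NumPy. This is an assumed interface rather than the authors' code, but it instantiates the two ablated design choices directly: Manhattan (L1) distance between embeddings, and inverse-distance weighting of the k nearest neighbors' algorithm labels.

```python
import numpy as np

def select_algorithm(query_emb, train_embs, train_labels, k=5, eps=1e-8):
    # Manhattan (L1) distance from the query to every training embedding.
    dists = np.abs(np.asarray(train_embs) - np.asarray(query_emb)).sum(axis=1)
    # Indices of the k nearest training instances.
    nn = np.argsort(dists)[:k]
    # Inverse-distance weights: closer neighbors get larger votes.
    weights = 1.0 / (dists[nn] + eps)
    # Accumulate weight per candidate algorithm; return the highest-scoring one.
    scores = {}
    for idx, w in zip(nn, weights):
        scores[train_labels[idx]] = scores.get(train_labels[idx], 0.0) + w
    return max(scores, key=scores.get)
```

In an actual ASlib setting, `train_labels` would hold the per-instance best solver (e.g., by penalized runtime), and the returned label is the solver run on the query instance.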
Empirical evaluation across 11 ASlib scenarios spanning seven distinct problem domains reveals consistent improvements over random forest classifiers trained on hand-crafted features. ZeroFolio outperforms this baseline in 10 of 11 scenarios with a single configuration, and in all 11 with two-seed voting ensembles. Notably, the margin of improvement is often substantial, suggesting that embeddings capture information complementary to traditional features. Soft voting combinations of the two approaches yield additional gains where the individual selectors remain competitive, indicating potential for hybrid architectures.
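The ensembling ideas above reduce to simple probability averaging. The sketch below is a hypothetical combiner, not the paper's implementation: given per-algorithm probability (or normalized score) vectors from two selectors—two seeds of the same selector, or the embedding-based and feature-based selectors in the hybrid case—it averages them and picks the argmax.

```python
import numpy as np

def soft_vote(prob_a, prob_b, weight=0.5):
    # Weighted average of two selectors' per-algorithm probability vectors;
    # weight=0.5 gives the plain two-member voting ensemble.
    combined = weight * np.asarray(prob_a) + (1.0 - weight) * np.asarray(prob_b)
    # Index of the algorithm with the highest combined score.
    return int(np.argmax(combined))
```

With `weight=0.5` this covers the two-seed ensemble; a tuned `weight` would let the hybrid lean toward whichever selector is stronger on a given scenario.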
This work has significant implications for algorithm selection research: it reduces the barrier to entry for new problem domains while maintaining competitive performance. The approach's generality suggests broader applicability to configuration and hyperparameter optimization tasks where domain knowledge remains scarce or expensive to obtain.