Zi-Jian Cheng1,2, Zi-Yi Jia1,2, Zhi Zhou2,3, Yu-Feng Li2,3, Lan-Zhe Guo1,2*
1School of Intelligence Science and Technology, Nanjing University, China
2National Key Laboratory for Novel Software Technology, Nanjing University, China
3School of Artificial Intelligence, Nanjing University, China
{chengzj,zhouz,liyf,guolz}@lamda.nju.edu.cn,jiazy@smail.nju.edu.cn
*Corresponding author.
Abstract
Tabular data, owing to its ubiquitous presence in real-world domains, has garnered significant attention in machine learning research. While tree-based models have long dominated tabular machine learning tasks, the recently proposed deep learning model TabPFN v2 has emerged, demonstrating unparalleled performance and scalability potential. Although extensive research has been conducted on TabPFN v2 to further improve performance, the majority of this research remains confined to closed environments, neglecting the challenges that frequently arise in open environments. This raises the question: Can TabPFN v2 maintain good performance in open environments? To this end, we conduct the first comprehensive evaluation of TabPFN v2's adaptability in open environments. We construct a unified evaluation framework covering various real-world challenges and use it to assess the robustness of TabPFN v2 under open environments scenarios. Empirical results demonstrate that TabPFN v2 shows significant limitations in open environments but is suitable for small-scale, covariate-shifted, and class-balanced tasks. Tree-based models remain the optimal choice for general tabular tasks in open environments. To facilitate future research on open environments challenges, we advocate for open environments tabular benchmarks, multi-metric evaluation, and universal modules to strengthen model robustness. We publicly release our evaluation framework.
1 Introduction
Tabular data[2] constitutes a highly structured data paradigm characterized by its organization of information through orthogonal dimensions of rows and columns[46]. In tabular data, each row represents an instance, while each column encodes a specific feature or attribute. The pervasive applicability of tabular data has been demonstrated across diverse domains. Within financial services, it facilitates critical operations such as credit scoring[54] and quantitative portfolio management[63] through predictive analytics. In biomedical research, tabular datasets underpin clinical decision support systems[58] and pharmacological discovery pipelines[40]. To fully exploit the potential of tabular data for addressing real-world tasks, various tabular machine learning models have been developed. This evolutionary progression spans from tree-based methods (e.g., CatBoost[44] and XGBoost[9]) to deep learning models (e.g., ModernNCA[56] and TabPFN[25, 26]). These models have demonstrated exceptional performance across diverse tabular tasks.
Tree-based models have long outperformed deep learning models in tabular tasks[20, 39]. The emergence of a new deep learning model, TabPFN v2, has effectively disrupted this dominance[26]. Grounded in the Transformer[52], TabPFN v2 achieves state-of-the-art results through large-scale pre-training on synthetic datasets, allowing direct deployment on downstream tasks without fine-tuning. Notably, TabPFN v2 introduces a novel contextual learning framework that processes both labeled training data and unlabeled test samples in a unified input pipeline. This hybrid training approach facilitates joint optimization of feature representation and class prediction through self-supervised alignment mechanisms. Empirical validation across diverse datasets demonstrates an unprecedented performance level of TabPFN v2 on tabular tasks.
Given the significant potential demonstrated by TabPFN v2 in tabular machine learning tasks, current research has focused on further enhancing its performance or adapting it to more real-world applications. These efforts fall into two categories: performance evaluation and the handling of specific tasks. For performance evaluation, Liu and Ye [35] expanded the scope of evaluation experiments on TabPFN, assessing its performance on nearly 300 datasets and further validating its efficacy. To address the limitations of TabPFN v2 in handling high-dimensional, large-scale, and multi-class tabular machine learning tasks, a divide-and-conquer mechanism has been proposed[57]. Furthermore, researchers have proposed a series of optimization strategies to enhance TabPFN v2's adaptability in complex tasks, such as context compression[33] and data generation[50]. Koshil et al. [33] suggest leveraging retrieved samples to construct a local context, thereby enhancing TabPFN v2's ability to perceive local information, while Thomas et al. [50] and Xu et al. [55] optimize TabPFN v2's performance through data generation.
However, current research on TabPFN v2 is mostly carried out in closed environments, where various learning factors, such as data distribution and feature space, remain consistent[42]. In the real world, tabular tasks usually occur in open environments[61] and face significant challenges when these learning factors change. For example, in traffic management systems, as the categories of traffic participants, event types, and facilities continue to increase, the complexity of management rises significantly (Emerging New Classes). Meanwhile, equipment updates, failures, and changes in travel behaviour lead to feature drifts in data, affecting the system's accurate perception of traffic states (Decremental/Incremental Features). Moreover, the distribution of traffic flow frequently changes due to factors such as urban planning, large-scale events, and holidays, further increasing the dynamism of management (Changing Data Distributions). In addition, management goals have shifted from single-efficiency optimization to multi-objective optimization, including reducing carbon emissions and enhancing system resilience, while paying more attention to long-term sustainability and overall system optimization (Varied Learning Objectives). Figure 1 depicts the four open environments challenges identified in Zhou [61]. Although existing research has gradually focused on improving TabPFN v2's adaptability in open environments, these studies mainly concentrate on distribution shift scenarios[30, 22] and have not yet comprehensively evaluated the various challenges that TabPFN v2 may face in open environments. This limitation raises the natural question of whether TabPFN v2 can maintain good performance in open environments, and highlights the need for a more holistic assessment of TabPFN v2 in diverse and dynamic real-world scenarios.
To this end, we conduct a comprehensive evaluation of the performance of TabPFN v2 in open environments for the first time. Existing benchmarks for tabular data in open environments primarily evaluate models in isolated scenarios, limiting their methodological applicability to broader real-world tasks. To address this, we introduce a unified evaluation framework that systematically benchmarks diverse tabular models across various challenges in open environments, enabling standardized assessment of robustness and adaptability.
From the experiments, we observe that TabPFN v2 exhibits overall limitations across the various challenges in open environments. Although TabPFN v2 shows potential for detecting emerging new classes, when handling decremental/incremental features it not only exhibits heightened vulnerability to feature decrement but also cannot exploit newly added features during testing. Under changing data distributions, the performance of TabPFN v2 degrades substantially due to limited robustness against concept drift. For varied learning objectives, TabPFN v2 displays a statistically significant bias toward majority classes and fails to maintain competitive performance across different task formulations. Moreover, the robustness of TabPFN v2 is fundamentally data-dependent, rendering its generalization capability highly sensitive to dataset scale.
Although results demonstrate that tree-based models remain the optimal approach for general tabular tasks in open environments, the above observations suggest settings in which TabPFN v2 is likely the right choice for practitioners: 1) when the available dataset is small; 2) when the distribution shift is characterized as covariate shift; 3) when the label distribution is approximately balanced across classes.
Separately, state-of-the-art methods, despite their strong performance in closed environments, may fail to generalize effectively in open environments. This performance gap underscores a crucial research imperative: identifying the enhancements required to advance open environments research. To address this challenge, we propose the following recommendations:
- Develop benchmarks targeting unexplored open environments tabular challenges.
- Evaluate models on various open environments metrics.
- Take model robustness as a critical metric when comparing model quality.
- Design universal modules to enhance the robustness of diverse existing models.
2 Related Work
2.1 Open Environments Challenges
Most tabular machine learning models are typically trained and tested in closed environments where critical learning factors remain stable. However, many real-world tasks operate in open environments where dynamic changes occur in key factors, posing challenges to model generalization[42]. Zhou [61] categorizes four core challenges in open environments: Emerging New Classes, Decremental/Incremental Features, Changing Data Distributions, and Varied Learning Objectives.
These challenges are pivotal in open environments machine learning. Emerging new classes, involving unseen classes during testing, have been addressed in natural language processing[13] and computer vision[14, 15]. Decremental/incremental features, caused by changes in feature sets, lead to mismatched training-testing spaces. TabFSBench[10] evaluates model performance under such variations, and Hou et al. [31] enhance performance by restoring ephemeral features. Changing data distributions, where test data violate the i.i.d. assumption, have led to benchmark datasets such as TableShift[18] and to methods such as domain adaptation[60] and domain generalization[59]. Varied learning objectives, which prioritize adaptive optimization beyond accuracy, include multi-objective learning[62, 64] and self-evolving training[37]. However, research on these challenges remains fragmented, lacking a unified framework to evaluate models on all four challenges.
2.2 Tabular Data in Machine Learning
Tabular data, with structured and heterogeneous features, is used in healthcare, finance, and recommendation systems[5, 32, 48]. Unlike images and texts, it exhibits high dimensionality, heterogeneity, and complex dependencies, posing challenges for machine learning models[16]. Current approaches are mainly tree-based models (e.g., XGBoost[11], LightGBM[4], CatBoost[44]) and deep learning models. Tree-based models handle irregular patterns and uninformative features well[20], while deep learning models like DCN V2[53], FT-Transformer[19], and NODE[43] aim to capture complex feature interactions for better performance[19, 43].
In the realm of tabular machine learning tasks, tree-based models have traditionally held a dominant position over deep learning models[20, 39]. The novel deep learning model TabPFN v2[26], however, has surpassed tree-based models, demonstrating superior performance across multiple benchmarks. The majority of these benchmarks are nevertheless confined to closed environments. Consequently, the comparative performance of TabPFN v2 and tree-based models in open environments remains underexplored.
2.3 Research on TabPFN
TabPFN[25], short for Tabular Prior-Data Fitted Network, is a model pre-trained on large-scale synthetic datasets, enabling efficient zero-shot learning. It performs classification and regression tasks without the need for hyperparameter tuning. Compared to existing models, TabPFN shows significant advantages on small- to medium-scale datasets at low computational cost, making it an efficient solution for tabular tasks. Recent research, however, reveals limitations in TabPFN's performance on high-dimensional, large-scale, or multi-class tasks[57, 35].
Various optimization strategies have been proposed to address TabPFN's current limitations and adapt it to more complex scenarios, including local context construction via retrieval-based methods[33], model fine-tuning[50, 55], and pretraining dataset expansion[6]. Moreover, TabPFN's strong performance has prompted its application to challenges such as distribution shift adaptation[22], time series forecasting[29], and various domains including healthcare[41, 51], ecology[21], and cybersecurity[45]. However, these studies primarily focus on closed environments or target only a single challenge in open settings, lacking a comprehensive evaluation of TabPFN under diverse open environments scenarios.
3 TabPFN and TabPFN v2
This section explains how TabPFN and its newer version, TabPFN v2, work. Since extensive research on these models already exists, this section gives only a short summary. More details are provided in Appendix A, which brings together the key points from Hollmann et al. [26] and Ye et al. [57].
3.1 TabPFN
Developed by Hollmann et al. [25], TabPFN reimagines classification through an innovative adaptation of a Transformer-based architecture. At its core, the method reformulates the classification task as a sequence processing problem with the following key components.
TabPFN standardizes each data point $x_i$ into a token $z_i$ in the $d$-dimensional space via linear projections, with zero-padding ensuring uniform dimensionality. A context matrix $C = \big[(z_1, y_1); \ldots; (z_n, y_n); z_{\text{test}}\big]$ is constructed by concatenating the $n$ training samples $\{(x_i, y_i)\}_{i=1}^{n}$ and a test sample $x_{\text{test}}$, where $[\cdot\,;\,\cdot]$ denotes vector concatenation. This formulation treats each data point as a token in a sequence, enabling flexible handling of variable dataset sizes. The context matrix is then processed through Transformer layers and an MLP head, which converts the test sample's output token into class probabilities.
3.2 TabPFN v2
Building upon TabPFN, TabPFN v2[26] introduces architectural innovations that redefine feature processing in tabular data analysis. The method encompasses a feature space transformation in which each raw feature is projected into a $d$-dimensional latent space and subjected to controlled perturbation, creating unique positional identifiers[57, 19]. The computational framework processes a three-dimensional tensor structure using dual attention mechanisms: cross-sample attention for dataset-level patterns and intra-feature attention for feature relationships. Pre-trained weights, derived from synthetic data generated by structural causal models, facilitate zero-shot transfer, thereby addressing the challenge of tabular data diversity.
Current research[26, 57] has extensively evaluated TabPFN v2’s performance in closed environments, but largely overlooked its adaptability to open environments, leaving a critical gap. To fully realize its potential and practical value, we conduct comprehensive evaluations of TabPFN v2 under various open environments challenges.
4 Open Environments Challenges
In this section, we draw upon the previous work presented in Zhou [61] as a foundational framework to formalize the open environments challenges encountered in tabular machine learning tasks. Detailed real-world descriptions are given in Appendix B.
4.1 Emerging New Classes
In closed environments machine learning tasks, it is commonly assumed that the class of any test sample must belong to the class set seen during training. However, this assumption does not always hold in open environments. We formally define this challenge by partitioning the class set $\mathcal{Y}$ into $\mathcal{Y}_{\text{train}}$ and $\mathcal{Y}_{\text{test}}$, corresponding to the training and testing phases, respectively. In closed environments, the class set remains consistent between the training and testing phases, i.e., $\mathcal{Y}_{\text{train}} = \mathcal{Y}_{\text{test}}$. In contrast, in open environments, test samples may belong to novel classes that are not present during training, i.e., $\exists\, y \in \mathcal{Y}_{\text{test}}$ such that $y \notin \mathcal{Y}_{\text{train}}$. In such cases, the model must be capable of identifying and handling these new classes.
4.2 Decremental/Incremental Features
Decremental and incremental features represent open environments challenges characterized by partial removal or augmentation of the input feature set, known as feature shift. Let $\mathcal{F}$ denote the full feature set, partitioned into $\mathcal{F}_{\text{train}}$ and $\mathcal{F}_{\text{test}}$ for training and testing, respectively. In closed environments, $\mathcal{F}_{\text{train}} = \mathcal{F}_{\text{test}}$, whereas in open environments, $\mathcal{F}_{\text{train}}$ remains fixed but $\mathcal{F}_{\text{test}}$ may differ. Specifically, when $\mathcal{F}_{\text{test}} \subset \mathcal{F}_{\text{train}}$, imputation of the missing features in $\mathcal{F}_{\text{test}}$ is necessary to maintain input dimension consistency and enable accurate model prediction (Decremental Features). Conversely, when $\mathcal{F}_{\text{train}} \subset \mathcal{F}_{\text{test}}$, the model typically truncates the newly added features in $\mathcal{F}_{\text{test}}$, retaining only those corresponding to $\mathcal{F}_{\text{train}}$, thus ensuring input dimension consistency between the training and testing phases (Incremental Features).
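To make the alignment concrete, here is a minimal sketch, assuming pandas DataFrames and using mean imputation as an illustrative choice, of reconciling a test set with the training feature space (imputing decremental features and truncating incremental ones):

```python
import pandas as pd

def align_features(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Align the test feature space to the training feature space."""
    # reindex drops newly added (incremental) columns and inserts
    # missing (decremental) columns as NaN.
    aligned = test.reindex(columns=train.columns)
    # Impute decremental features; mean imputation is one simple choice.
    return aligned.fillna(train.mean(numeric_only=True))
```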
4.3 Changing Data Distributions
Closed environments machine learning research generally assumes that all data in both the training and testing phases are independent samples from the identical distribution. Unfortunately, this assertion does not always hold true in open environments. Changing data distributions comprise two scenarios. Covariate Shift[49] occurs when the input distribution $P(X)$ changes between the training and testing phases while the conditional probability $P(Y \mid X)$ remains constant. Concept Shift[17] involves changes in the conditional probability $P(Y \mid X)$ with a stable input distribution $P(X)$.
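The two scenarios can be illustrated with a small synthetic example; the one-dimensional data-generating process below is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def label(x, w):
    # P(Y | X) is parameterized by the weight w.
    return (w * x + rng.normal(0, 0.1, x.shape) > 0).astype(int)

x_train = rng.normal(0, 1, 1000)   # training inputs drawn from P(X)
y_train = label(x_train, w=1.0)

# Covariate shift: P(X) changes, P(Y | X) stays fixed.
x_cov = rng.normal(2, 1, 1000)
y_cov = label(x_cov, w=1.0)

# Concept shift: P(X) stays fixed, P(Y | X) changes.
x_con = rng.normal(0, 1, 1000)
y_con = label(x_con, w=-1.0)
```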
4.4 Varied Learning Objectives
The performance of a machine learning model can be measured by a learning objective $\mathcal{O}$, such as accuracy, F1-score, or ROC-AUC. Learning towards different objectives may lead to models with different strengths. A model that is optimal on one measure is not necessarily optimal on others. Machine learning research in closed environments generally assumes that the $\mathcal{O}$ used to evaluate model performance is fixed and known in advance. However, this assumption may not always hold in open environments. When facing this challenge, the model should perform well across various learning objectives without requiring data to be recollected and a completely new model to be trained.
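As a concrete illustration, a single trained model can be scored under several objectives without retraining; the model and data below are illustrative stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# An imbalanced binary task (90% / 10%) to make the metrics diverge.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
proba = clf.predict_proba(X_te)[:, 1]

objectives = {
    "Accuracy": accuracy_score(y_te, pred),
    "Balanced Accuracy": balanced_accuracy_score(y_te, pred),
    "F1-score": f1_score(y_te, pred),
    "ROC-AUC": roc_auc_score(y_te, proba),
}
print(objectives)  # one model, four learning objectives
```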
4.5 Evaluation Framework for Open Environments Challenges
Existing benchmarks for evaluating model performance in open environments typically focus on a single task, such as distribution shift[18] or feature shift[10], and lack a unified, comprehensive assessment across multiple open environments challenges. Hence, we propose a modular and extensible evaluation framework that assesses both model performance and robustness across diverse real-world scenarios. The framework formalizes four representative open environments challenges: Emerging New Classes, Decremental/Incremental Features, Changing Data Distributions, and Varied Learning Objectives. It builds testing protocols by leveraging existing benchmarks, including WhyShift[34] and TableShift[18] for distribution shifts and TabFSBench[10] for feature shifts. It supports comprehensive evaluation of tabular models and enables exporting datasets under different open environments scenarios with just a few lines of Python code. Details are in Appendix C.
Table 1: ROC-AUC and AUPR for new class detection on four datasets.

| Model | EyeMovement ROC-AUC | EyeMovement AUPR | CMC ROC-AUC | CMC AUPR | Wine-Red ROC-AUC | Wine-Red AUPR | Wine-White ROC-AUC | Wine-White AUPR |
|---|---|---|---|---|---|---|---|---|
| RandomForest | 0.509 | 0.505 | 0.503 | 0.502 | 0.500 | 0.500 | 0.500 | 0.500 |
| XGBoost | 0.503 | 0.502 | 0.502 | 0.501 | 0.567 | 0.558 | 0.533 | 0.533 |
| CatBoost | 0.507 | 0.504 | 0.512 | 0.507 | 0.467 | 0.492 | 0.367 | 0.462 |
| MLP | 0.503 | 0.502 | 0.527 | 0.517 | 0.700 | 0.658 | 0.400 | 0.472 |
| RealMLP | 0.510 | 0.505 | 0.514 | 0.508 | 0.500 | 0.500 | 0.500 | 0.500 |
| ModernNCA | 0.504 | 0.502 | 0.510 | 0.506 | 0.400 | 0.481 | 0.500 | 0.500 |
| TabPFN v2 | 0.511 | 0.507 | 0.522 | 0.513 | 0.533 | 0.521 | 0.667 | 0.644 |
5 Comprehensive Evaluation of TabPFN v2
Expanding on the impressive performance of TabPFN v2 in closed environments, we undertake a comprehensive evaluation in open environments to rigorously assess its robustness and adaptability through our proposed evaluation framework. Specifically, we subject TabPFN v2 to evaluation across four distinct challenges in open environments, as detailed in Appendix D. We choose RandomForest[7], XGBoost[9], and CatBoost[44] as tree-based baselines, and MLP, RealMLP[28], and ModernNCA[56] as deep learning baselines. Given the differences in datasets across open environments challenges, we provide detailed descriptions of the datasets in each subsection.
5.1 Emerging New Classes
Current tabular models (e.g., TabPFN v2) are fundamentally constrained by fixed input-output dimensions, limiting the incorporation of new classes. To evaluate their adaptability, we design a novel-class detection task adapted from SMOOD[23]. Based on multi-class datasets (Appendix E.1), we implement a leave-one-class-out protocol: for a $K$-class problem, we perform $K$ runs, each excluding one class during training and treating it as novel at test time. We evaluate models using the Area Under the Precision-Recall curve (AUPR) and ROC-AUC. Results represent averages across all runs. The detailed computation procedure is described in Appendix E.2.
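A minimal sketch of this protocol, using the maximum predicted probability as the novelty score (a standard baseline in the spirit of [23]) and an illustrative classifier and dataset, looks as follows:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

aucs, auprs = [], []
for held_out in np.unique(y):                     # one run per class
    mask = y_tr != held_out                       # exclude the novel class
    clf = RandomForestClassifier(random_state=0).fit(X_tr[mask], y_tr[mask])
    score = -clf.predict_proba(X_te).max(axis=1)  # low confidence => likely novel
    is_novel = (y_te == held_out).astype(int)
    aucs.append(roc_auc_score(is_novel, score))
    auprs.append(average_precision_score(is_novel, score))

print(f"ROC-AUC {np.mean(aucs):.3f}, AUPR {np.mean(auprs):.3f}")
```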
TabPFN v2 has the potential to detect new classes. As illustrated in Table 1, TabPFN v2 consistently achieves better AUPR and ROC-AUC in the new class detection task across all four datasets when compared to other models. This empirical evidence indicates that TabPFN v2 possesses a robust capability for identifying new classes. Results on other metrics derived from the predicted probabilities are in Appendix E.
5.2 Decremental/Incremental Features
To conduct a comprehensive evaluation of decremental/incremental features, we adopt TabFSBench[10], a benchmark specifically designed for this challenge. It includes twelve datasets covering eight classification tasks and four regression tasks across various domains, dataset sizes, and feature types. Descriptions and results are provided in Appendix F.
Table 2: Model performance under increasing proportions of shifted features (gap from the 0% setting in parentheses).

| Task | Shift | RandomForest | XGBoost | CatBoost | MLP | RealMLP | ModernNCA | TabPFN v2 |
|---|---|---|---|---|---|---|---|---|
| Binary Classification | 0% | 0.838 | 0.842 | 0.869 | 0.805 | 0.813 | 0.869 | 0.852 |
| | 20% | 0.764 (-0.074) | 0.766 (-0.076) | 0.834 (-0.035) | 0.781 (-0.024) | 0.744 (-0.069) | 0.708 (-0.161) | 0.809 (-0.043) |
| | 40% | 0.622 (-0.216) | 0.624 (-0.218) | 0.764 (-0.105) | 0.743 (-0.062) | 0.666 (-0.147) | 0.598 (-0.271) | 0.725 (-0.127) |
| | 60% | 0.583 (-0.255) | 0.581 (-0.261) | 0.714 (-0.155) | 0.698 (-0.107) | 0.672 (-0.141) | 0.568 (-0.301) | 0.635 (-0.217) |
| | 80% | 0.464 (-0.374) | 0.514 (-0.328) | 0.631 (-0.238) | 0.620 (-0.185) | 0.563 (-0.250) | 0.540 (-0.329) | 0.556 (-0.296) |
| | 100% | 0.446 (-0.392) | 0.467 (-0.375) | 0.537 (-0.332) | 0.534 (-0.271) | 0.460 (-0.353) | 0.512 (-0.357) | 0.483 (-0.369) |
| Multi Classification | 0% | 0.800 | 0.802 | 0.837 | 0.723 | 0.745 | 0.906 | 0.709 |
| | 20% | 0.735 (-0.065) | 0.759 (-0.043) | 0.794 (-0.043) | 0.700 (-0.023) | 0.640 (-0.105) | 0.819 (-0.087) | 0.651 (-0.058) |
| | 40% | 0.637 (-0.163) | 0.677 (-0.125) | 0.714 (-0.123) | 0.658 (-0.065) | 0.665 (-0.080) | 0.700 (-0.206) | 0.556 (-0.153) |
| | 60% | 0.462 (-0.338) | 0.574 (-0.228) | 0.605 (-0.232) | 0.600 (-0.123) | 0.559 (-0.186) | 0.562 (-0.344) | 0.432 (-0.277) |
| | 80% | 0.354 (-0.446) | 0.460 (-0.342) | 0.463 (-0.374) | 0.520 (-0.203) | 0.379 (-0.366) | 0.444 (-0.462) | 0.288 (-0.421) |
| | 100% | 0.226 (-0.574) | 0.306 (-0.496) | 0.321 (-0.516) | 0.363 (-0.360) | 0.195 (-0.550) | 0.286 (-0.620) | 0.117 (-0.592) |
| Regression | 0% | 0.925 | 0.922 | 0.902 | 0.997 | 0.926 | 0.940 | 0.928 |
| | 20% | 1.218 (+0.293) | 1.155 (+0.233) | 1.152 (+0.250) | 1.025 (+0.028) | 1.263 (+0.337) | 1.103 (+0.163) | 0.974 (+0.046) |
| | 40% | 1.537 (+0.612) | 1.514 (+0.592) | 1.544 (+0.642) | 1.073 (+0.076) | 1.567 (+0.641) | 1.309 (+0.369) | 1.034 (+0.104) |
| | 60% | 1.738 (+0.813) | 1.762 (+0.840) | 1.818 (+0.916) | 1.125 (+0.128) | 1.802 (+0.876) | 1.499 (+0.559) | 1.184 (+0.256) |
| | 80% | 2.086 (+1.161) | 2.119 (+1.197) | 2.247 (+1.345) | 1.181 (+0.184) | 2.138 (+1.212) | 1.735 (+0.795) | 1.232 (+0.304) |
| | 100% | 2.346 (+1.421) | 2.412 (+1.490) | 2.571 (+1.669) | 1.247 (+0.250) | 2.433 (+1.507) | 1.940 (+1.000) | 1.317 (+0.389) |
TabPFN v2 exhibits heightened vulnerability to decremental features. To assess TabPFN v2's adaptability to decremental features, we conduct random-shift experiments in TabFSBench and use the performance gap as a metric. The performance gap, explained in Appendix F, measures the impact of feature shifts by comparing model performance between original and shifted feature sets. As shown in Table 2, TabPFN v2's performance gap widens significantly with increasing feature shifts, indicating weaker adaptability and higher sensitivity to feature space changes. In contrast, MLP and CatBoost show greater robustness against decremental features, possibly due to inherent anti-shift properties that TabPFN v2 may lack.
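The random-shift setup can be sketched as follows; mean imputation of the removed features and the illustrative classifier are assumptions for exposition, not TabFSBench's exact procedure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
base = clf.score(X_te, y_te)                       # unshifted performance

rng = np.random.default_rng(0)
for ratio in (0.2, 0.4, 0.6, 0.8, 1.0):
    shifted = X_te.copy()
    cols = rng.choice(20, size=int(ratio * 20), replace=False)
    shifted[:, cols] = X_tr[:, cols].mean(axis=0)  # features missing at test time
    gap = clf.score(shifted, y_te) - base          # the performance gap
    print(f"shift {ratio:.0%}: gap {gap:+.3f}")
```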
TabPFN v2 cannot use newly added features in the testing phase. When the dimensionality of input features increases dynamically, TabPFN v2 cannot process the additional features and can only truncate them, retaining those present during training. This is because its internal parameters and feature representations are tied to a fixed feature dimensionality at training time. Consequently, TabPFN v2 cannot leverage the information carried by new features during testing. This limitation does not degrade TabPFN v2's performance; it merely prevents the model from benefiting from the additional information.
5.3 Changing Data Distributions
We evaluate TabPFN v2 under scenarios of changing data distributions, using Accuracy, Balanced Accuracy, F1-score, and ROC-AUC as metrics. Detailed results are given in Appendix G. The evaluation is conducted on nine fully numerical datasets drawn from the WhyShift[34] and TableShift[18] benchmarks, which together cover three types of data distribution scenarios. To accommodate memory constraints and the current limitations of TabPFN v2 in handling very large datasets, we apply stratified subsampling (up to 50,000 instances) while preserving the original train/test splits. Detailed dataset statistics are provided in Appendix G.1.
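A minimal sketch of the stratified subsampling step, assuming array inputs and an illustrative helper name, is:

```python
from sklearn.model_selection import train_test_split

def stratified_cap(X, y, cap=50_000, seed=0):
    """Subsample (X, y) to at most `cap` rows while preserving class ratios."""
    if len(y) <= cap:
        return X, y
    X_sub, _, y_sub, _ = train_test_split(
        X, y, train_size=cap, stratify=y, random_state=seed)
    return X_sub, y_sub

# Applied separately to the train and test portions so that the original
# train/test split is preserved.
```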
TabPFN v2 reveals limited robustness when concepts shift. We present a comparative analysis of accuracy between XGBoost and TabPFN v2 under two distinct data distribution shifts: concept shift and covariate shift. XGBoost is the best-performing model on the changing data distributions task. Figure 3 shows that both models achieve higher accuracy under covariate shift than under concept shift, and XGBoost maintains consistent superiority over TabPFN v2 across both shift types. However, in covariate shift scenarios, TabPFN v2 improves more than XGBoost, narrowing the performance gap. Results on the other three metrics are given in Appendix G.3. These results suggest that TabPFN v2 demonstrates promising discriminative capacity on covariate-shift datasets rather than concept-shift datasets.
5.4 Varied Learning Objectives
We conduct an exhaustive comparative analysis across four primary classification learning objectives: Accuracy, ROC-AUC, F1-score, and Balanced Accuracy. The analysis is performed on the i.i.d. datasets used in the changing data distributions task.
TabPFN v2 has a statistically significant bias toward majority classes. Figure 3 reveals that the performance of TabPFN v2 degrades significantly on class-imbalance-sensitive metrics (F1-score and Balanced Accuracy), suggesting inherent limitations in handling minority classes. Specifically, Balanced Accuracy, a metric designed to address class imbalance by computing the arithmetic mean of per-class accuracies, shows that TabPFN v2 struggles to adapt to varying sample sizes across classes. Similarly, F1-score, as the harmonic mean of precision and recall, further confirms the model's suboptimal predictive capability for minority classes. Hence, TabPFN v2 is best suited to datasets with balanced classes.
TabPFN v2 fails to maintain competitive performance across various learning objectives. As illustrated in Figure 3, TabPFN v2 demonstrates competitive performance on Accuracy and ROC-AUC, achieving results comparable to other models on these criteria. However, a comparative analysis reveals statistically significant performance deficiencies in both F1-score and Balanced Accuracy when contrasted with tree-based models and RealMLP. These observations highlight an important limitation of TabPFN v2: while exhibiting superior performance for particular learning objectives, the model fails to maintain consistent efficacy across all evaluated metrics.
5.5 Holistic Assessment
We conduct a comprehensive assessment of the robustness of TabPFN v2 relative to the compared models in open environments, employing a performance ranking analysis across the four challenges above.
TabPFN v2's robustness is inherently data-dependent. Across the four open environments challenges, TabPFN v2 demonstrates superior efficacy primarily on small-scale datasets. This observation aligns precisely with the model's fundamental design objective, which is specifically optimized for small-scale datasets, and confirms its particular suitability for applications where the volume of training data is inherently limited.
Tree-based models remain the optimal approach for general tabular tasks in open environments. As shown in Table 3, tree-based models, particularly CatBoost and RandomForest, consistently outperform TabPFN v2. CatBoost achieves the best overall ranking, excelling in both changing data distributions and varied learning objectives, demonstrating stronger adaptability in open environments. In contrast, while TabPFN v2 remains competitive in closed environments, its performance declines relative to tree-based methods in open environments. These results suggest that tree-based models are better suited for open environments tasks requiring robustness.
5.6 Recommendations
During the experimental investigation, we observe that the majority of existing high-performance models predominantly demonstrate their superior performance in closed environments. However, these models tend to fall short in adapting to the open environments challenges that are more frequently encountered in real-world scenarios. To further enhance the performance of models in open environments and to provide guidance for the development of subsequent research, the following recommendations are proposed:
Develop benchmarks targeting unexplored open environments tabular challenges. Existing benchmarks are primarily designed around distribution shifts and feature shifts, lacking coverage of open environments tabular challenges such as emerging new classes and changes in learning objectives. Since well-constructed benchmarks drive both rigorous performance evaluation and methodological progress on the corresponding tasks, it is urgent to develop benchmarks spanning the full range of open environments tabular challenges.
Evaluate models on various open environments metrics. Current research typically relies on OOD Accuracy, Performance Gap, or Balanced Accuracy to assess the robustness of a model. However, these metrics are mostly applicable to tasks involving distribution shifts or feature shifts and do not cover diverse open environments challenges. Therefore, additional general open environments metrics should be introduced in model evaluation, such as Open-World Tracking Accuracy[38] and Mean Average Precision[47].
Take model robustness as a critical metric when comparing model quality. Current research often judges the quality of a model solely by its performance in closed environments, without treating robustness in open environments as an important evaluation criterion. Yet robustness is a crucial indicator of whether a model has practical value. Therefore, robustness should be regarded as a critical metric when comparing model quality, and a model should be assessed comprehensively on both closed environments performance and open environments robustness.
Design universal modules to enhance the robustness of diverse existing models. From the aforementioned experiments, we learn that although some models perform well on certain open environments challenges, they rely on task-specific modules and lack universality: they cannot be transferred to other models to further improve robustness, and none achieves excellent performance across all open environments challenges. Therefore, future research should focus on designing highly universal and transferable modules that enhance the overall performance of models on open environments tasks.
Table 3: Average performance ranks across the four open environments challenges (lower is better).

| Task | RandomForest | XGBoost | CatBoost | MLP | RealMLP | ModernNCA | TabPFN v2 |
|---|---|---|---|---|---|---|---|
| Emerging New Classes | 4.2 | 4.1 | 5.6 | 3.3 | 3.2 | 5.2 | 1.8 |
| Decremental/Incremental Features | 1.78 | 4.89 | 3.89 | 4.00 | 3.83 | 5.50 | 4.06 |
| Changing Data Distributions | 5.00 | 2.25 | 2.00 | 6.00 | 4.25 | 4.25 | 4.25 |
| Varied Learning Objectives | 4.75 | 2.75 | 2.00 | 6.75 | 4.00 | 3.75 | 4.00 |
| Average Rank | 3.93 | 3.49 | 3.37 | 5.01 | 3.82 | 4.67 | 3.53 |
6 Conclusion
We present the first comprehensive evaluation of TabPFN v2 in open environments and construct an evaluation framework that simulates diverse open environments challenges, revealing the model's limitations under feature decrements and distribution shifts while highlighting its strengths in detecting new classes and in small-scale and covariate shift scenarios. Although tree-based models remain superior for general tabular tasks, our analysis identifies specific conditions under which TabPFN v2 is pragmatically viable. These observations underscore a critical performance gap between closed and open environments, emphasizing the need for enhanced evaluation frameworks and robust model designs. To advance open environments research, we advocate for the development of specialized benchmarks, multi-faceted model assessments prioritizing robustness, and universal modules to improve existing methods' adaptability. These directions aim to bridge the current methodological divide and foster more reliable tabular learning systems in real-world applications.
Limitations. Our experiments may not fully represent the diversity of open environments tasks due to constraints in dataset variety and task types, potentially limiting how well they simulate complex real-world scenarios. The limited depth of our theoretical analysis may likewise constrain insights into TabPFN v2's closed environments performance and open environments robustness.
References
- Akiba et al. [2019] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2623–2631, 2019.
- Altman and Krzywinski [2017] Naomi Altman and Martin Krzywinski. Tabular data. Nature Methods, 14(4):329–331, 2017.
- Alvarez-Melis and Fusi [2020] David Alvarez-Melis and Nicolo Fusi. Geometric dataset distances via optimal transport. Advances in Neural Information Processing Systems, pages 21428–21439, 2020.
- Badirli et al. [2020] Sarkhan Badirli, Xuanqing Liu, Zhengming Xing, Avradeep Bhowmik, Khoa Doan, and Sathiya Keerthi. Gradient boosting neural networks: GrowNet. arXiv preprint arXiv:2002.07971, 2020.
- Borisov et al. [2022] Vadim Borisov, Tobias Leemann, Kathrin Seßler, Johannes Haug, Martin Pawelczyk, and Gjergji Kasneci. Deep neural networks and tabular data: A survey. IEEE Transactions on Neural Networks and Learning Systems, 35(6):7499–7519, 2022.
- Breejen et al. [2024] Felix den Breejen, Sangmin Bae, Stephen Cha, and Se-Young Yun. Fine-tuned in-context learning transformers are excellent tabular data classifiers. arXiv preprint arXiv:2405.13396, 2024.
- Breiman [2001] Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001.
- Cai et al. [2023] Tiffany Tianhui Cai, Hongseok Namkoong, and Steve Yadlowsky. Diagnosing model performance under distribution shift. arXiv preprint arXiv:2303.02011, 2023.
- Chen and Guestrin [2016] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.
- Cheng et al. [2025] Zi-Jian Cheng, Zi-Yi Jia, Zhi Zhou, Lan-Zhe Guo, and Yu-Feng Li. TabFSBench: Tabular benchmark for feature shifts in open environment. arXiv preprint arXiv:2501.18935, 2025.
- Chizat et al. [2020] Lenaic Chizat, Pierre Roussillon, Flavien Léger, François-Xavier Vialard, and Gabriel Peyré. Faster Wasserstein distance estimation with the Sinkhorn divergence. Advances in Neural Information Processing Systems, pages 2257–2269, 2020.
- Cortez et al. [2009] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553, 2009.
- Deng et al. [2024] Jieren Deng, Haojian Zhang, Kun Ding, Jianhua Hu, Xingxuan Zhang, and Yunkuan Wang. Zero-shot generalizable incremental learning for vision-language object detection. Advances in Neural Information Processing Systems, pages 136679–136700, 2024.
- Dhamija et al. [2020] Akshay Dhamija, Manuel Gunther, Jonathan Ventura, and Terrance Boult. The overlooked elephant of object detection: Open set. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1021–1030, 2020.
- Du et al. [2022] Xuefeng Du, Zhaoning Wang, Mu Cai, and Yixuan Li. VOS: Learning what you don't know by virtual outlier synthesis. arXiv preprint arXiv:2202.01197, 2022.
- Fang et al. [2024] Xi Fang, Weijie Xu, Fiona Anting Tan, Jiani Zhang, Ziqing Hu, Yanjun Jane Qi, Scott Nickleach, Diego Socolinsky, Srinivasan Sengamedu, and Christos Faloutsos. Large language models on tabular data: Prediction, generation, and understanding - a survey. arXiv preprint arXiv:2402.17944, 2024.
- Gama et al. [2014] J. Gama, I. Zliobaite, and A. Bifet. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4):44, 2014.
- Gardner et al. [2023] Josh Gardner, Zoran Popovic, and Ludwig Schmidt. Benchmarking distribution shift in tabular data with TableShift. Advances in Neural Information Processing Systems, pages 53385–53432, 2023.
- Gorishniy et al. [2021] Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Revisiting deep learning models for tabular data. Advances in Neural Information Processing Systems, pages 18932–18943, 2021.
- Grinsztajn et al. [2022] Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? Advances in Neural Information Processing Systems, pages 507–520, 2022.
- Heinzel et al. [2025] Carola Sophia Heinzel, Lennart Purucker, Frank Hutter, and Peter Pfaffelhuber. Advancing biogeographical ancestry predictions through machine learning. bioRxiv, pages 1–3, 2025.
- Helli et al. [2024] Kai Helli, David Schnurr, Noah Hollmann, Samuel Müller, and Frank Hutter. Drift-resilient TabPFN: In-context learning temporal distribution shifts on tabular data. Advances in Neural Information Processing Systems, pages 98742–98781, 2024.
- Hendrycks and Gimpel [2017] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In Proceedings of the 5th International Conference on Learning Representations, 2017.
- Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, pages 6629–6640, 2017.
- Hollmann et al. [2023] Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. TabPFN: A transformer that solves small tabular classification problems in a second. In Proceedings of the 11th International Conference on Learning Representations, 2023.
- Hollmann et al. [2025] Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model. Nature, 637:319–326, 2025.
- Holzmüller et al. [2024] David Holzmüller, Léo Grinsztajn, and Ingo Steinwart. Better by default: Strong pre-tuned MLPs and boosted trees on tabular data. Advances in Neural Information Processing Systems, pages 26577–26658, 2024.
- Holzmüller et al. [2025] David Holzmüller, Leo Grinsztajn, and Ingo Steinwart. RealMLP: Advancing MLPs and default parameters for tabular data. In ELLIS Workshop on Representation Learning and Generative Models for Structured Data, 2025.
- Hoo et al. [2025a] Shi Bin Hoo, Samuel Müller, David Salinas, and Frank Hutter. The tabular foundation model TabPFN outperforms specialized time series forecasting models based on simple features. arXiv preprint arXiv:2501.02945, 2025a.
- Hoo et al. [2025b] Shi Bin Hoo, Samuel Müller, David Salinas, and Frank Hutter. The tabular foundation model TabPFN outperforms specialized time series forecasting models based on simple features. arXiv preprint arXiv:2501.02945, 2025b.
- Hou et al. [2017] Bo-Jian Hou, Lijun Zhang, and Zhi-Hua Zhou. Learning with feature evolvable streams. Advances in Neural Information Processing Systems, pages 1416–1426, 2017.
- Kadra et al. [2021] Arlind Kadra, Marius Lindauer, Frank Hutter, and Josif Grabocka. Well-tuned simple nets excel on tabular datasets. Advances in Neural Information Processing Systems, pages 23928–23941, 2021.
- Koshil et al. [2024] Mykhailo Koshil, Thomas Nagler, Matthias Feurer, and Katharina Eggensperger. Towards localization via data embedding for TabPFN. In Advances in Neural Information Processing Systems Table Representation Learning Workshop, 2024.
- Liu et al. [2023] Jiashuo Liu, Tianyu Wang, Peng Cui, and Hongseok Namkoong. On the need for a language describing distribution shifts: Illustrations on tabular datasets. Advances in Neural Information Processing Systems, pages 51371–51408, 2023.
- Liu and Ye [2025] Si-Yang Liu and Han-Jia Ye. TabPFN unleashed: A scalable and effective solution to tabular classification problems. arXiv preprint arXiv:2502.02527, 2025.
- Liu et al. [2024a] Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, and Han-Jia Ye. TALENT: A tabular analytics and learning toolbox. arXiv preprint arXiv:2407.04057, 2024a.
- Liu et al. [2024b] Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, and Junxian He. Diving into self-evolving training for multimodal reasoning. arXiv preprint arXiv:2412.17451, 2024b.
- Liu et al. [2022] Yang Liu, Idil Esen Zulfikar, Jonathon Luiten, Achal Dave, Deva Ramanan, Bastian Leibe, Aljoša Ošep, and Laura Leal-Taixé. Opening up open world tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19045–19055, 2022.
- McElfresh et al. [2023] Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C., Ganesh Ramakrishnan, Micah Goldblum, and Colin White. When do neural nets outperform boosted trees on tabular data? Advances in Neural Information Processing Systems, pages 34–47, 2023.
- Meijerink et al. [2020] Lotta Meijerink, Giovanni Cinà, and Michele Tonutti. Uncertainty estimation for classification and risk prediction on medical tabular data. arXiv preprint arXiv:2004.05824, 2020.
- Noda et al. [2024] Ryunosuke Noda, Daisuke Ichikawa, and Yugo Shibagaki. Machine learning-based diagnostic prediction of minimal change disease: Model development study. Scientific Reports, 14(1):23460, 2024.
- Parmar et al. [2023] Jitendra Parmar, Satyendra Chouhan, Vaskar Raychoudhury, and Santosh Rathore. Open-world machine learning: Applications, challenges, and opportunities. ACM Computing Surveys, 55(10):1–37, 2023.
- Popov et al. [2020] Sergei Popov, Stanislav Morozov, and Artem Babenko. Neural oblivious decision ensembles for deep learning on tabular data. In Proceedings of the 8th International Conference on Learning Representations, 2020.
- Prokhorenkova et al. [2018] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: Unbiased boosting with categorical features. Advances in Neural Information Processing Systems, pages 6639–6649, 2018.
- Ruiz-Villafranca et al. [2024] Sergio Ruiz-Villafranca, José Roldán-Gómez, Juan Manuel Castelo Gómez, Javier Carrillo-Mondéjar, and José Luis Martinez. A TabPFN-based intrusion detection system for the industrial internet of things. The Journal of Supercomputing, 80(14):20080–20117, 2024.
- Sahakyan et al. [2021] Maria Sahakyan, Zeyar Aung, and Talal Rahwan. Explainable artificial intelligence for tabular data: A survey. IEEE Access, 9:135392–135422, 2021.
- Sancaktar et al. [2022] Cansu Sancaktar, Sebastian Blaes, and Georg Martius. Curious exploration via structured world models yields zero-shot object manipulation. Advances in Neural Information Processing Systems, pages 24170–24183, 2022.
- Shwartz-Ziv and Armon [2022] Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you need. Information Fusion, 81:84–90, 2022.
- Sugiyama et al. [2007] Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(5), 2007.
- Thomas et al. [2024] Valentin Thomas, Junwei Ma, Rasa Hosseinzadeh, Keyvan Golestan, Guangwei Yu, Maksims Volkovs, and Anthony L. Caterini. Retrieval & fine-tuning for in-context tabular models. Advances in Neural Information Processing Systems, pages 108439–108467, 2024.
- Tran and Byeon [2024] Vinh Quang Tran and Haewon Byeon. Predicting dementia in Parkinson's disease on a small tabular dataset using hybrid LightGBM–TabPFN and SHAP. Digital Health, 10:20–55, 2024.
- Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
- Wang et al. [2021] Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi. DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of the Web Conference 2021, pages 1785–1797, 2021.
- West [2000] David West. Neural network credit scoring models. Computers & Operations Research, 27(11):1131–1152, 2000.
- Xu et al. [2025] Derek Qiang Xu, F. Olcay Cirit, Reza Asadi, Yizhou Sun, and Wei Wang. Mixture of in-context prompters for tabular PFNs. In Proceedings of the 13th International Conference on Learning Representations, 2025.
- Ye et al. [2024] Han-Jia Ye, Huai-Hong Yin, and De-Chuan Zhan. Modern neighborhood components analysis: A deep tabular baseline two decades later. arXiv preprint arXiv:2407.03257, 2024.
- Ye et al. [2025] Han-Jia Ye, Si-Yang Liu, and Wei-Lun Chao. A closer look at TabPFN v2: Strength, limitation, and extension. arXiv preprint arXiv:2502.17361, 2025.
- Yıldız and Kalayci [2024] A. Yarkın Yıldız and Asli Kalayci. Gradient boosting decision trees on medical diagnosis over tabular data. arXiv preprint arXiv:2410.03705, 2024.
- Zhou et al. [2021] Kaiyang Zhou, Yongxin Yang, Yu Qiao, and Tao Xiang. Domain generalization with MixStyle. In Proceedings of the 9th International Conference on Learning Representations, 2021.
- Zhou et al. [2025] Zhi Zhou, Kun-Yang Yu, Lan-Zhe Guo, and Yu-Feng Li. Fully test-time adaptation for tabular data. In Proceedings of the 39th AAAI Conference on Artificial Intelligence, 2025.
- Zhou [2022] Zhi-Hua Zhou. Open-environment machine learning. National Science Review, 9(8):nwac123, 2022.
- Zhou et al. [2019] Zhi-Hua Zhou, Yang Yu, and Chao Qian. Evolutionary Learning: Advances in Theories and Algorithms. Springer, 2019.
- Zhu et al. [2021] Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. arXiv preprint arXiv:2105.07624, 2021.
Appendix A TabPFN and TabPFN v2
A.1 TabPFN
Developed by Hollmann et al. [25], TabPFN reimagines classification through an innovative adaptation of a Transformer-based architecture. At its core, the method reformulates the classification task as a sequence processing problem with the following key components:
Data Representation.
Each data point $x_i$ is standardized into a $d$-dimensional token through a linear projection:
$$z_i = W \operatorname{pad}(x_i), \qquad z_i \in \mathbb{R}^{d},$$
where $\operatorname{pad}(\cdot)$ zero-pads the raw feature vector so that all inputs conform to the predefined dimensionality $d$ before projection.
Contextual Learning Framework.
The model operates by constructing a dynamic context matrix $C$ that jointly encodes the $n$ training samples $\{(x_i, y_i)\}_{i=1}^{n}$ and one test sample $x_{\text{test}}$:
$$C = \big[(z_1, y_1); \ldots; (z_n, y_n); z_{\text{test}}\big],$$
where $[\cdot\,;\,\cdot]$ denotes vector concatenation. This formulation treats each transformed data point as a token in a sequence, enabling flexible handling of varying dataset sizes.
Architecture.
The context matrix $C$ is processed through a stack of Transformer layers capable of handling variable-length token sequences, followed by a specialized MLP head that converts the test instance's output token into class probabilities.
The model's distinctive approach lies in its in-context learning paradigm, where the prediction for the test sample $x_{\text{test}}$ emerges from the Transformer's processing of the entire augmented sequence containing both training and testing representations. This design eliminates the need for traditional iterative training while maintaining competitive accuracy on tabular tasks.
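In practice, this paradigm surfaces through a scikit-learn-style interface; the short sketch below assumes the released `tabpfn` package, where `fit` merely stores the context and prediction runs a single forward pass:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()
clf.fit(X_tr, y_tr)               # no gradient-based training occurs here
proba = clf.predict_proba(X_te)   # in-context prediction over the full sequence
```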
A.2 TabPFN v2
Building upon TabPFN, TabPFN v2[26] introduces three key architectural innovations that redefine feature processing in tabular data analysis:
Feature Space Transformation.
Each raw feature undergoes linear projection into a $d$-dimensional latent space, followed by controlled perturbation. This mechanism, characterized by Ye et al. [57] as a tokenization variant of the approach of Gorishniy et al. [19], creates unique positional identifiers for features.
Computational Framework.
The computational framework operates on a three-dimensional tensor structure and processes it through dual attention mechanisms: cross-sample attention for dataset-level patterns and intra-feature attention for feature relationships.
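The axis-wise pattern can be sketched with standard attention modules; this is an illustration of attending along two tensor axes, not TabPFN v2's actual implementation:

```python
import torch
import torch.nn as nn

n, f, d = 8, 5, 16                        # samples, features, embedding dim
x = torch.randn(n, f, d)                  # the three-dimensional tensor

feat_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
samp_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

# Intra-feature attention: each sample attends across its own features.
h, _ = feat_attn(x, x, x)                 # batch = samples, sequence = features

# Cross-sample attention: each feature attends across all samples.
ht = h.transpose(0, 1)                    # (f, n, d): batch = features
h2, _ = samp_attn(ht, ht, ht)
h2 = h2.transpose(0, 1)                   # back to (n, f, d)
```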
Knowledge Transfer.
Pre-trained weights derived from synthetic data generated by structural causal models facilitate zero-shot transfer, thereby addressing the challenges of tabular data diversity.
TabPFN v2 has three fundamental constraints: (1) quadratic complexity scaling with dataset size, (2) a dataset size limit (around 10,000 samples), and (3) a maximum class count (10 classes for classification tasks). Hence, Ye et al. [57] introduce a divide-and-conquer mechanism to address these limitations. To address the performance degradation on high-dimensional datasets, a method combining feature subset sampling and ensemble learning is employed. For the inadequate performance on large-scale datasets, two improved schemes, data-to-embedding and decision tree, are proposed. To tackle the inapplicability to multi-class tasks, the Decimal Encoding and ECOC methods are utilized.
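For intuition, the feature-subset-plus-ensemble idea can be sketched as follows; the subset size, member count, and probability averaging are illustrative choices, not the exact procedure of [57], and the `tabpfn` package is assumed:

```python
import numpy as np
from tabpfn import TabPFNClassifier  # assuming the released package

def subset_ensemble_proba(X_tr, y_tr, X_te, members=8, subset=100, seed=0):
    """Average predictions of TabPFN members fit on random feature subsets."""
    rng = np.random.default_rng(seed)
    probas = []
    for _ in range(members):
        cols = rng.choice(X_tr.shape[1], size=min(subset, X_tr.shape[1]),
                          replace=False)
        clf = TabPFNClassifier()
        clf.fit(X_tr[:, cols], y_tr)
        probas.append(clf.predict_proba(X_te[:, cols]))
    return np.mean(probas, axis=0)
```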
While current research[26, 57] has thoroughly assessed TabPFN v2's performance, these evaluations primarily focus on closed environments, leaving a critical gap in understanding how the model adapts to open environments. To fully realize TabPFN v2's potential and explore its practical value, we conduct comprehensive evaluations focusing on its performance under various open environments challenges.
Appendix B Tabular Challenges in Open Environments
B.1 Emerging New Classes
In closed environments, it is commonly assumed that the label of any testing sample must come from the label set used during training. However, this assumption is not always valid in open environments. For instance, in a forest disease monitoring system that relies on a machine learning model trained with signals from sensors deployed in the forest, it is impractical to enumerate all possible classes in advance, as some forest diseases may be entirely novel, such as those caused by invasive insect pests that have never been encountered in the region before.
B.2 Decremental/Incremental Features
Decremental/Incremental Features are another open environments challenge, wherein the feature set previously utilized as inputs is either partially removed or expanded by new features, also known as feature shift. Given a forest disease monitoring system that relies on a machine learning model trained with signals from sensors deployed in the forest, certain existing sensors may cease to function, leading to a reduction of the feature set (Decremental Features). Meanwhile, additional sensors may be deployed to monitor, resulting in an expansion of the feature set (Incremental Features).
B.3 Changing Data Distributions
Machine learning research in closed environments generally assumes that all data in both the training and testing phases are independent samples from the identical distribution. Unfortunately, this assertion does not always hold true in open environments. In the forest disease monitoring system, the model may be built in summer based on sensor signals specific to that season, but it is expected to perform well across all seasons.
B.4 Varied Learning Objectives
The performance of a machine learning model can be measured by a learning objective $\mathcal{O}$, such as accuracy, F1-score, or ROC-AUC. Learning towards different objectives may lead to models with different strengths. Being optimal on one measure does not mean that the model will also be optimal on other measures. Machine learning research in closed environments generally assumes that the $\mathcal{O}$ used to measure model performance is invariant and known in advance. However, this assumption may not always hold in open environments. In the forest disease monitoring system, the sensor dispatch task may prioritize different objectives over time. Initially, various sensors are dispatched to pursue high monitoring accuracy; later, after a relatively high accuracy has been achieved, different sensors may be used to ensure that the system operates with minimal energy consumption. When facing this challenge, the model should be able to perform well on various learning objectives without requiring the data to be recollected and a completely new model to be trained.
Appendix C Evaluation Framework
To facilitate the use of our proposed evaluation framework, we provide a set of APIs. More details are available at https://anonymous.4open.science/r/tabpfn-ood-4E65. The API accepts four parameters: dataset, model, task, and export_dataset. We will give further specifications in the supplementary material and the repository readme.md.
The dataset parameter specifies the full name of the dataset to be used. Our evaluation framework supports datasets from OpenML, Kaggle, and local directories.
The model parameter defines the model to be evaluated and can be selected from tree-based models and deep-learning models, which we evaluated in this paper. New models can be added by following the instructions in the "How to Add New Models" section.
The task parameter determines the type of open environments experiment to be conducted. The available options are emerging new classes (enc), decremental features (df), changing data distributions (cdd), and varied learning objectives (vlo).
The export_dataset parameter controls whether the modified dataset—corresponding to a specific open environments challenge—is exported as a CSV file for further use.
An example invocation of the evaluation framework is given below.
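As an illustration of the four parameters, consider the following sketch; the `run_experiment` entry point and module name are hypothetical, and the actual interface is documented in the repository readme.md:

```python
from tabpfn_ood import run_experiment  # hypothetical module/function names

results = run_experiment(
    dataset="credit",      # OpenML/Kaggle dataset name or a local path
    model="TabPFNv2",      # any evaluated tree-based or deep learning model
    task="cdd",            # enc | df | cdd | vlo
    export_dataset=True,   # export the modified dataset as a CSV file
)
```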
Appendix D General Experimental Settings
D.1 Training Settings
Deep learning models are trained on an NVIDIA RTX 4090 GPU. Tree-based models are trained on an AMD Ryzen 5 7500F 6-core processor. All experimental results are reported as the average over three random seeds to ensure statistical reliability.
D.2 Models
In this subsection, we provide detailed descriptions of all the models used in our paper.
XGBoost
XGBoost[11] is an efficient and flexible machine learning model that incrementally builds multiple decision trees by optimizing the loss function, with each tree correcting the errors of the previous one to continuously improve the model’s predictive performance. XGBoost also incorporates the gradient boosting algorithm, iteratively training decision tree-based models with the goal of minimizing residuals and enhancing predictive accuracy.
CatBoost
CatBoost[44] is a powerful boosting-based model designed for efficient handling of categorical features. It uses the "Ordered Boosting" technique, which calculates gradients sequentially to prevent target leakage and maintain the independence of each training instance. At the same time, CatBoost employs "Target-based Categorical Encoding," converting categorical variables into numerical representations based on target statistics, thereby reducing the need for extensive preprocessing and improving model performance.
RandomForest
RandomForest[7] is a classical ensemble learning method based on bagging and decision trees. It constructs a multitude of decision trees during training and outputs the mode or mean prediction of individual trees. Its robustness to overfitting, strong performance with minimal tuning, and ability to handle both classification and regression tasks make it a widely used baseline in tabular data benchmarks.
MLP
An MLP consists of multiple layers of neurons, with each layer fully connected to the next. An MLP contains at least three layers: an input layer, one or more hidden layers, and an output layer. It continuously adjusts the connection weights between neurons through training methods such as the backpropagation algorithm and gradient descent to minimize prediction errors.
ModernNCA
ModernNCA[56] is an enhanced Neighborhood Component Analysis (NCA) model that improves tabular data processing by adjusting learning objectives, integrating deep learning architectures, and using stochastic neighbor sampling for better efficiency and accuracy.
RealMLP
RealMLP[27] is an enhanced multilayer perceptron designed for tabular data tasks, combining architectural improvements with meta-learned default hyperparameters. It achieves a strong balance between accuracy and training efficiency.
D.3 Hyperparameter Tuning
In this subsection, we provide the hyperparameter grids of tree-based and deep learning models in Tables 4 and 5.
For tree-based models, we employ GridSearchCV from the scikit-learn library to conduct an exhaustive hyperparameter search. This approach systematically explores a predefined parameter grid through 5-fold cross-validation to ensure the reproducibility of results. The search process is optimized for computational efficiency by enabling parallel processing.
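As a concrete illustration, a tree-based search can be set up as follows; this is a minimal sketch in which the grid values are placeholders rather than the exact search spaces of Tables 4 and 5, and the synthetic data stands in for a real benchmark dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Stand-in data; in the actual experiments this is a benchmark dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

param_grid = {                        # illustrative values only
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [4, 6, 8],
    "n_estimators": [100, 300, 500],
}

search = GridSearchCV(
    XGBClassifier(),
    param_grid,
    cv=5,       # 5-fold cross-validation, as in the protocol above
    n_jobs=-1,  # parallel processing for computational efficiency
)
search.fit(X, y)
print(search.best_params_)
```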
Regarding deep learning models, we implement an adaptive hyperparameter optimization strategy based on the Optuna framework[1], following methodologies established in prior studies[36]. The optimization protocol maintains a constant batch size of 1024 and performs 100 independent trials using training-validation splits to prevent potential data leakage from the test set.
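The sketch below illustrates this protocol with Optuna; scikit-learn's MLPClassifier stands in for the actual deep models, and the search ranges are illustrative rather than those of Table 5.

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)  # stand-in data
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

def objective(trial):
    clf = MLPClassifier(
        hidden_layer_sizes=(trial.suggest_int("width", 64, 512),),
        learning_rate_init=trial.suggest_float("lr", 1e-5, 1e-2, log=True),  # loguniform
        alpha=trial.suggest_float("weight_decay", 1e-6, 1e-3, log=True),     # loguniform
        batch_size=1024,   # held constant across trials, as in the protocol above
        max_iter=50,
    )
    clf.fit(X_tr, y_tr)
    return clf.score(X_val, y_val)  # validation score only; the test set is never touched

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)  # 100 independent trials
print(study.best_params)
```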
Model | Hyperparameter | Values |
---|---|---|
XGBoost | Learning Rate | |
XGBoost | Max. Depth | |
XGBoost | N Estimators | |
XGBoost | Subsample | |
XGBoost | Colsample Bytree | |
XGBoost | Min Child Weight | |
CatBoost | Learning Rate | |
CatBoost | Depth | |
CatBoost | Iterations | |
RandomForest | Min Samples Split | |
RandomForest | Min Samples Leaf | |
Model | Hyperparameter | Values |
---|---|---|
MLP | D_layers | |
MLP | Dropout | Uniform |
MLP | Learning Rate | Loguniform |
MLP | Weight Decay | Loguniform |
ModernNCA | Dropout | Uniform |
ModernNCA | D_block | Int |
ModernNCA | N_blocks | Int |
ModernNCA | N_frequencies | Int |
ModernNCA | Frequency Scale | Loguniform |
ModernNCA | D_embedding | Int |
ModernNCA | Sample Rate | Uniform |
ModernNCA | Learning Rate | Loguniform |
ModernNCA | Weight Decay | Loguniform |
RealMLP | Num Emb Type | |
RealMLP | Add Front Scale | |
RealMLP | Learning Rate (lr) | |
RealMLP | Dropout (p_drop) | |
RealMLP | Activation (act) | |
RealMLP | Hidden Sizes | |
RealMLP | Weight Decay (wd) | |
RealMLP | PLR Sigma | |
RealMLP | Label Smoothing Epsilon (ls_eps) | |
Appendix E Emerging New Classes
E.1 Dataset
Eye Movements
This dataset is designed to predict the relevance of sentences in relation to a given question based on eye movement data. The target is to classify sentences as irrelevant, relevant, or correct, using 27 features, including landing position, first fixation duration, next fixation duration, time spent on the predicted region, and other relevant eye movement metrics. This dataset is available at https://www.kaggle.com/datasets/vinnyr12/eye-movements.
Contraceptive Method Choice (CMC)
This dataset contains 1,473 instances with 10 demographic and socio-economic attributes, originally derived from the 1987 National Indonesia Contraceptive Prevalence Survey. Each instance represents a married woman who was not pregnant (or unsure) at the time of the interview. The target is to predict the contraceptive method currently used by the individual, categorized into three classes: no-use, long-term methods, and short-term methods. This dataset was prepared by Tjen-Sien Lim and is available at https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice.
Wine Quality (Red and White)
This dataset includes two subsets related to red and white Vinho Verde wine samples from the north of Portugal. Each sample is described by 11 physicochemical attributes (e.g., acidity, sugar, pH) and a quality score ranging from 1 to 4. The task is to predict the sensory quality of the wine based on its physicochemical properties. The dataset was first introduced by Cortez et al.[12] and is available at https://archive.ics.uci.edu/dataset/186/wine+quality.
E.2 Results
To evaluate a model’s ability to detect novel classes, we adopt a leave-one-class-out protocol. For each class label, we exclude its samples from training and treat them as novel (label 1) during testing. An equal number of samples from the remaining classes are randomly sampled as known (label 0). After training the model on the reduced dataset, we compute confidence scores for test samples using the maximum predicted probability. Samples with low confidence (i.e., whose maximum predicted probability falls within a fixed low-confidence interval) are predicted as novel. We assess performance using ROC-AUC and AUPR, measuring the model’s ability to separate known and novel instances based on confidence.
To further assess whether models exhibit appropriate uncertainty when encountering novel classes, we hold out one or more classes during training and evaluate the predicted probabilities on test samples from these unseen classes. A prediction is deemed uncertain if its maximum confidence falls within a predefined low-confidence interval. We report the proportion of novel samples falling into this interval as an indicator of the model’s ability to recognize unfamiliar inputs. This metric complements ROC-AUC and AUPR by directly measuring how often the model expresses uncertainty when presented with out-of-distribution classes.
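To make the protocol concrete, the sketch below scores test samples by maximum predicted probability and computes all three reported quantities; `model`, `X_test`, and `y_novel` are stand-ins for a fitted classifier, the mixed test set, and its known/novel labels (a binary numpy array), and the interval [0.4, 0.6] is an illustrative placeholder rather than one of the evaluated intervals.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

proba = model.predict_proba(X_test)  # model: classifier fitted on the reduced dataset
confidence = proba.max(axis=1)       # maximum predicted probability per sample
novelty_score = -confidence          # lower confidence => more likely novel

# y_novel: 1 for held-out (novel) class samples, 0 for sampled known-class samples
print("ROC-AUC:", roc_auc_score(y_novel, novelty_score))
print("AUPR:   ", average_precision_score(y_novel, novelty_score))

# Proportion of novel samples whose confidence falls in the low-confidence interval
low, high = 0.4, 0.6  # illustrative bounds only
in_interval = ((confidence >= low) & (confidence <= high))[y_novel == 1].mean()
print("Uncertainty proportion:", in_interval)
```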
Table 6, Table 7, and Table 8 show the results for the three evaluated uncertainty intervals, respectively.
Model | EyeMovement | CMC | Wine-Red | Wine-White |
---|---|---|---|---|
RandomForest | 0.716 | 0.520 | 0.764 | 0.562 |
XGBoost | 0.107 | 0.159 | 0.061 | 0.052 |
CatBoost | 0.150 | 0.183 | 0.127 | 0.070 |
MLP | 0.029 | 0.356 | 0.083 | 0.068 |
RealMLP | 0.198 | 0.523 | 0.000 | 0.000 |
ModernNCA | 0.101 | 0.323 | 0.182 | 0.433 |
TabPFN v2 | 0.139 | 0.300 | 0.246 | 0.400 |
Model | EyeMovement | CMC | Wine-Red | Wine-White |
---|---|---|---|---|
RandomForest | 0.340 | 0.351 | 0.244 | 0.294 |
XGBoost | 0.054 | 0.075 | 0.027 | 0.023 |
CatBoost | 0.077 | 0.095 | 0.040 | 0.031 |
MLP | 0.014 | 0.203 | 0.037 | 0.052 |
RealMLP | 0.091 | 0.272 | 0.000 | 0.000 |
ModernNCA | 0.051 | 0.178 | 0.085 | 0.205 |
TabPFN v2 | 0.072 | 0.156 | 0.142 | 0.241 |
Model | EyeMovement | CMC | Wine-Red | Wine-White |
---|---|---|---|---|
RandomForest | 0.127 | 0.054 | 0.038 | 0.005 |
XGBoost | 0.012 | 0.016 | 0.001 | 0.001 |
CatBoost | 0.016 | 0.019 | 0.001 | 0.010 |
MLP | 0.003 | 0.038 | 0.003 | 0.004 |
RealMLP | 0.019 | 0.053 | 0.000 | 0.000 |
ModernNCA | 0.009 | 0.039 | 0.018 | 0.046 |
TabPFN v2 | 0.015 | 0.036 | 0.020 | 0.030 |
Appendix F Decremental/Incremental Features
F.1 Dataset
We refer to TabFSBench[10] for details of the evaluated datasets.
Credit
The original dataset contains 1,000 entries with 20 categorical/symbolic attributes prepared by Prof. Hofmann. Each entry represents a person who takes out credit from a bank and is classified as a good or bad credit risk according to their attributes. The target is to determine whether the customer’s credit risk is good or bad. This dataset is available at https://www.openml.org/search?type=data&sort=runs&id=31&status=active.
Electricity
The Electricity dataset, collected from the Australian New South Wales Electricity Market, contains 45,312 instances from May 1996 to December 1998. Each instance represents a 30-minute period and includes fields for the day, timestamp, electricity demand in New South Wales and Victoria, scheduled electricity transfer, and a class label. The target is to predict whether the price in New South Wales is up or down relative to a 24-hour moving average, based on market demand and supply fluctuations. This dataset is available at https://www.kaggle.com/datasets/vstacknocopyright/electricity.
Heart
Cardiovascular diseases (CVDs) are the leading cause of death globally, responsible for 17.9 million deaths annually. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict a possible heart disease. The target is to determine whether heart disease is present or absent. This dataset is available at https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction.
Miniboone
The MiniBooNE Particle Identification dataset is a binary classification task from particle physics. The target is to determine whether a neutrino event corresponds to an electron neutrino (signal) or a muon neutrino (background). This dataset is available at https://www.kaggle.com/datasets/alexanderliapatis/miniboone.
Iris
The Iris flower dataset, introduced by Ronald Fisher in 1936, contains 150 samples from three Iris species: Iris setosa, Iris virginica, and Iris versicolor. Each sample has four features: sepal length, sepal width, petal length, and petal width, measured in centimeters. The target is to classify the Iris species as setosa, versicolor, or virginica. This dataset is available at https://www.kaggle.com/datasets/uciml/iris.
Jannis
This dataset is used in the tabular benchmark from [20] and belongs to the 'classification on numerical features' benchmark. It is designed to test classification performance using numerical features and presents challenges such as varying data distributions, class imbalance, and potential missing values. It serves as a critical evaluation tool for machine learning models in real-world scenarios, including medical diagnosis, credit rating, and object recognition tasks. This dataset is available at https://www.openml.org/search?type=data&status=active&id=45021.
Penguins
Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. The goal of the Palmer Penguins dataset is to offer a comprehensive resource for data exploration and visualization, serving as an alternative to the Iris dataset. The target is to classify the penguin species as Adelie, Chinstrap, or Gentoo. This dataset is available at https://www.kaggle.com/datasets/youssefaboelwafa/clustering-penguins-species.
Dataset | Train Size | ID Test | OOD Test | Total | Num. Columns | Cat. Columns | # Classes | OTDD | FDD | Label Shift |
---|---|---|---|---|---|---|---|---|---|---|
college_scorecard | 43,908 | 5,488 | 602 | 49,998 | 118 | 0 | 2 | 43,566.39 | 2116.63 | 0.0337 |
brfss_diabetes | 37,284 | 4,660 | 8,054 | 49,998 | 142 | 0 | 2 | 12.28 | 0.10 | 0.0332 |
diabetes_readmission | 19,146 | 2,393 | 28,460 | 49,999 | 183 | 0 | 2 | 42.37 | 1.30 | 0.0060 |
Dataset | Train | ID Test | OOD Test | Total | #Features | Shift Pattern (Source → Target) |
---|---|---|---|---|---|---|
ACS Income (CA-PR) | 38,227 | 9,557 | 2,215 | 49,999 | 9 | California → Puerto Rico |
ACS Mobility (MS-HI) | 4,254 | 1,064 | 2,733 | 8,051 | 21 | Mississippi → Hawaii |
ACS Pub.Cov (NE-LA) | 23,211 | 5,065 | 1,267 | 16,879 | 18 | Nebraska → Louisiana |
ACS Pub.Cov (2010-2017) | 20,501 | 5,126 | 24,372 | 49,999 | 18 | 2010 (NY) → 2017 (NY) |
ACS Income (Young 80%) | 20,000 | 5,000 | 25,000 | 50,000 | 9 | Younger People (80%) |
ACS Income (Young 90%) | 20,000 | 5,000 | 25,000 | 50,000 | 9 | Younger People (90%) |
Eye Movements
This dataset is designed to predict the relevance of sentences in relation to a given question based on eye movement data. The target is to classify sentences as irrelevant, relevant, or correct, using 27 features, including landing position, first fixation duration, next fixation duration, time spent on the predicted region, and other relevant eye movement metrics. This dataset is available at https://www.kaggle.com/datasets/vinnyr12/eye-movements.
Abalone
The age of abalone is traditionally determined by cutting the shell, staining it, and counting the rings under a microscope, a process that is both tedious and time-consuming. This dataset uses easier-to-obtain physical measurements, such as length, diameter, and weight, instead. The target is to predict the abalone’s age from these measurements, providing a more efficient alternative. This dataset is available at https://www.kaggle.com/datasets/rodolfomendes/abalone-dataset.
Bike
The dataset records shared-bike rentals in the Washington, D.C. area from 2011-01-01 to 2012-12-31, containing 11 features such as season, holiday, working day, and weather conditions. The target is to predict the hourly count of rented bikes from historical rental patterns and external factors such as temperature, humidity, and seasonal trends. This dataset is available at https://www.kaggle.com/datasets/abdullapathan/bikesharingdemand.
Concrete
Concrete is the most important material in civil engineering, and its compressive strength is influenced by a highly nonlinear relationship with its ingredients and age. The dataset contains 9 attributes, including variables such as cement, water, and age. The target is to predict the concrete compressive strength (measured in MPa) from these input variables. This dataset is available at https://www.kaggle.com/datasets/maajdl/yeh-concret-data.
Laptop
The original dataset was relatively compact, with many details embedded in each column as long, human-readable strings. To let machine learning algorithms work more efficiently, these details were separated into individual columns, which exposed 28 duplicate rows that were then removed; the cleaned dataset serves as the final result. The target is to predict the price of a laptop based on its specifications. This dataset is available at https://www.kaggle.com/datasets/owm4096/laptop-prices.
F.2 Performance Gap
Following TabFSBench[10], we quantify model robustness in feature-shift scenarios as the percentage performance gap,

\[ R = \frac{P_{\text{shift}} - P_{\text{full}}}{P_{\text{full}}} \times 100\%, \tag{1} \]

where \(P_{\text{full}}\) denotes the model performance on the full feature set and \(P_{\text{shift}}\) denotes the model performance when features shift. In subsequent sections, we use \(P\) to refer to performance and \(R\) to refer to robustness.
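A direct realization of Eqn. 1, as a minimal sketch assuming a higher-is-better metric such as accuracy or ROC-AUC (for RMSE the sign of the gap is interpreted in reverse):

```python
def robustness_gap(p_full: float, p_shift: float) -> float:
    """Percentage performance gap between full-feature and feature-shifted evaluation."""
    return (p_shift - p_full) / p_full * 100.0

print(robustness_gap(0.90, 0.84))  # approx. -6.67: performance drops ~6.7% under shift
```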
F.3 Results
To systematically assess TabPFN v2 in the presence of decremental features, we design random-shift experiments in TabFSBench. We use accuracy and ROC-AUC for classification tasks and RMSE for regression tasks. Table 2 provides detailed model performance; a simplified sketch of a single run is given below.
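The sketch below illustrates the shape of one random-shift run. Mean imputation of the removed columns is one simple way to simulate their absence and stands in for TabFSBench's exact handling (see [10]); `model`, `X_test`, `y_test`, and `train_means` are assumed given.

```python
import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def random_feature_shift(model, X_test, y_test, train_means, n_drop):
    """Drop a random subset of features at test time and recompute accuracy."""
    X_shift = np.array(X_test, dtype=float, copy=True)
    dropped = rng.choice(X_shift.shape[1], size=n_drop, replace=False)
    X_shift[:, dropped] = train_means[dropped]  # impute with training-set column means
    return accuracy_score(y_test, model.predict(X_shift))

# Sweep over increasingly severe shifts, from one missing feature upward:
# for n_drop in range(1, X_test.shape[1]):
#     print(n_drop, random_feature_shift(model, X_test, y_test, train_means, n_drop))
```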
Appendix G Changing Data Distributions
G.1 Dataset
We evaluate our models on two established benchmarks for distribution shift in tabular data: TableShift[18] and WhyShift[34].
From the TableShift benchmark, we select three datasets: College Scorecard, Hospital Readmission, and Diabetes. To ensure scalability and consistency across experiments, we apply stratified subsampling to limit each dataset to 50,000 total instances while preserving the original train/test split ratio. Detailed statistics are provided in Table 9.
From the WhyShift benchmark, we adopt six pre-defined settings provided by the original paper, covering a variety of real-world covariate and concept shift scenarios. For datasets containing more than 50,000 instances, we similarly apply stratified subsampling to retain a total of 50,000 samples. The configuration and statistics for all selected WhyShift settings are summarized in Table 10.
G.2 Domain Shift Metrics
Domain shift can be categorized into covariate shift, concept shift, and label shift. We adopt the metrics proposed in TableShift[18] to quantify the degree of these three types of domain shift.
Measuring Covariate Shift with OTDD:

\[ d_{\mathrm{cov}}(D_s, D_t) = \mathrm{OTDD}(D_s, D_t) \tag{2} \]

Here, \(D_s\) and \(D_t\) denote the source and target domain datasets, respectively, and OTDD represents the Optimal Transport Dataset Distance, computed under a Gaussian approximation[3].
Measuring Concept Shift with Fréchet Dataset Distance (FDD):

Inspired by the widely used Fréchet Inception Distance (FID) in machine learning[24], FDD utilizes intermediate representations of a classifier to quantify distributional discrepancies. It calculates the Fréchet distance (also known as the Wasserstein-2 distance) between two distributions to assess the extent of concept shift.

The computation of this metric proceeds as follows. First, a classifier (we use MLPs) is trained on the source domain using the best hyperparameters obtained through hyperparameter search. Then, for each input \(x\), we compute the activation values at each layer of the model, obtaining the activation vector \(\phi_l(x)\), where \(l\) indexes the layers of the model. Finally, the Fréchet Dataset Distance is calculated to measure the divergence between the two distributions.
\[ \mathrm{FDD}(D_s, D_t) = \lVert \mu_s - \mu_t \rVert_2^2 + \mathrm{Tr}\!\left( \Sigma_s + \Sigma_t - 2\,(\Sigma_s \Sigma_t)^{1/2} \right) \tag{3} \]

Here, \(\Phi_d\) denotes the set of activation vectors extracted from domain \(d\), from which we compute the mean \(\mu_d\) and covariance matrix \(\Sigma_d\); we refer to the resulting measure as FDD. A lower FDD score indicates a smaller distance between the training domain \(D_s\) and the test domain \(D_t\).
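Eqn. 3 can be computed directly from the two sets of activations; the sketch below assumes the activation matrices have already been extracted from the trained MLP.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_dataset_distance(feats_src, feats_tgt):
    """Fréchet distance between Gaussian fits to source/target activations (Eqn. 3)."""
    mu_s, mu_t = feats_src.mean(axis=0), feats_tgt.mean(axis=0)
    cov_s = np.cov(feats_src, rowvar=False)
    cov_t = np.cov(feats_tgt, rowvar=False)
    covmean = sqrtm(cov_s @ cov_t)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from numerical noise
        covmean = covmean.real
    return np.sum((mu_s - mu_t) ** 2) + np.trace(cov_s + cov_t - 2.0 * covmean)
```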
Measuring Label Shift:

TableShift[18] proposes a simple formula to quantify the label shift between the source and target distributions:

\[ d_{\mathrm{label}}(D_s, D_t) = \left( \bar{y}_s - \bar{y}_t \right)^2 \tag{4} \]

In this equation, \(\bar{y}_d\) represents the average label value computed from samples within domain \(d\). Given that all tasks in our study are binary classification, this formulation captures the squared distance between the class prior probabilities of the source and target domains.
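For binary labels, Eqn. 4 reduces to a one-liner:

```python
import numpy as np

def label_shift(y_src, y_tgt):
    """Eqn. 4: squared gap between the class priors of two binary-label domains."""
    return (np.mean(y_src) - np.mean(y_tgt)) ** 2

print(label_shift([0, 1, 1, 0], [1, 1, 1, 0]))  # (0.5 - 0.75)^2 = 0.0625
```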
Quantifying the Contribution of X-Shifts and Y|X-Shifts
To gain a fine-grained understanding of the sources of model performance degradation under distribution shifts, we adopt the DIStribution Shift DEcomposition (DISDE) framework proposed by [8]. This framework enables the decomposition of the overall generalization gap into components attributed to covariate shifts (X-shifts) and conditional label shifts (Y|X-shifts).
Formally, for a model \(f\) trained on a source distribution \(P\) and evaluated on a target distribution \(Q\), DISDE decomposes the generalization gap as:

\[ R_Q(f) - R_P(f) = \underbrace{\mathbb{E}_{S_X}\big[\mu_P(X)\big] - \mathbb{E}_{P_X}\big[\mu_P(X)\big]}_{\text{I}} + \underbrace{\mathbb{E}_{S_X}\big[\mu_Q(X)\big] - \mathbb{E}_{S_X}\big[\mu_P(X)\big]}_{\text{II}} + \underbrace{\mathbb{E}_{Q_X}\big[\mu_Q(X)\big] - \mathbb{E}_{S_X}\big[\mu_Q(X)\big]}_{\text{III}} \tag{5} \]

where \(\mu_D(X) = \mathbb{E}_D[\ell(f(X), Y) \mid X]\) denotes the conditional expected loss under distribution \(D\), and \(S_X\) is an auxiliary distribution over covariates whose support is contained within both \(P_X\) and \(Q_X\).
Each term in Eqn. 5 corresponds to a specific type of shift:
- Terms I and III reflect changes due to differences in the marginal distribution of covariates, i.e., the contribution of X-shifts.
- Term II captures the shift in the conditional distribution of labels given features, corresponding to Y|X-shifts.
Building upon this decomposition, we utilize the open-source WhyShift package (https://github.com/namkoong-lab/whyshift), which implements DISDE in a scalable and extensible manner. This allows us to rigorously quantify the relative impact of X-shifts and Y|X-shifts on performance degradation across datasets and domains, providing deeper insight into model robustness under open environments evaluation settings.
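For intuition, the following is a simplified sketch of the decomposition in Eqn. 5, not the WhyShift implementation: the conditional losses \(\mu_P\) and \(\mu_Q\) are estimated by regressing per-sample losses on covariates, and pooling covariates from both domains is a crude stand-in for the shared distribution \(S_X\).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def per_sample_logloss(model, X, y, eps=1e-12):
    p = np.clip(model.predict_proba(X)[:, 1], eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def disde_decomposition(model, X_p, y_p, X_q, y_q):
    # Estimate mu_P(x) and mu_Q(x) by regressing the fixed model's losses on X
    mu_p = GradientBoostingRegressor().fit(X_p, per_sample_logloss(model, X_p, y_p))
    mu_q = GradientBoostingRegressor().fit(X_q, per_sample_logloss(model, X_q, y_q))
    X_s = np.vstack([X_p, X_q])  # crude stand-in for the shared distribution S_X

    term_i = mu_p.predict(X_s).mean() - mu_p.predict(X_p).mean()    # X-shift (I)
    term_ii = mu_q.predict(X_s).mean() - mu_p.predict(X_s).mean()   # Y|X-shift (II)
    term_iii = mu_q.predict(X_q).mean() - mu_q.predict(X_s).mean()  # X-shift (III)
    # The three terms telescope to R_Q(f) - R_P(f), up to estimation error
    return term_i, term_ii, term_iii
```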
G.3 Results
We evaluate TabPFN v2 alongside several mainstream tabular models under changing data distributions scenarios, using metrics including Accuracy, Balanced Accuracy, F1-score, and ROC-AUC. The evaluation is conducted on nine fully numerical datasets drawn from the WhyShift[34] and TableShift[18] benchmarks, which together cover three types of data distribution shift. To accommodate memory constraints and the current limitations of TabPFN in handling very large datasets, we apply stratified subsampling (up to 50,000 instances) while preserving the original train/test splits. Figures 4, 5, 6, and 7 show the results on changing data distributions. Although there has been recent work attempting to apply TabPFN to scenarios involving temporal distribution shift[22], its implementation is not publicly available, so we do not include this method in our evaluation.
Appendix H Varied Learning Objectives
We conduct an analysis across four primary classification learning objectives: accuracy, ROC-AUC, F1-score, and Balanced Accuracy. The analysis is performed on the i.i.d. splits of the datasets employed in the changing data distributions experiments (Appendix G). Table 11 reports the results on varied learning objectives.
Objective | Model | ACS Income (CA-PR) | ACS Mobility (MS-HI) | ACS Pub.Cov (NE-LA) | ACS Pub.Cov (2010-2017) | ACS Income (Setting 21) | ACS Income (Setting 22) | college_scorecard | brfss_diabetes | diabetes_readmission |
---|---|---|---|---|---|---|---|---|---|---|
Accuracy | XGBoost | 0.813 | 0.788 | 0.819 | 0.831 | 0.845 | 0.864 | 0.949 | 0.872 | 0.642 |
Accuracy | CatBoost | 0.815 | 0.801 | 0.827 | 0.836 | 0.852 | 0.870 | 0.949 | 0.874 | 0.651 |
Accuracy | MLP | 0.781 | 0.760 | 0.793 | 0.794 | 0.821 | 0.843 | 0.934 | 0.869 | 0.569 |
Accuracy | ModernNCA | 0.810 | 0.800 | 0.819 | 0.822 | 0.842 | 0.859 | 0.946 | 0.874 | 0.651 |
Accuracy | RandomForest | 0.805 | 0.800 | 0.821 | 0.823 | 0.843 | 0.859 | 0.943 | 0.876 | 0.649 |
Accuracy | RealMLP | 0.812 | 0.796 | 0.814 | 0.819 | 0.840 | 0.856 | 0.946 | 0.875 | 0.653 |
Accuracy | TabPFN v2 | 0.806 | 0.804 | 0.824 | 0.834 | 0.848 | 0.867 | 0.938 | 0.875 | 0.651 |
ROC-AUC | XGBoost | 0.893 | 0.815 | 0.817 | 0.790 | 0.848 | 0.865 | 0.978 | 0.807 | 0.679 |
ROC-AUC | CatBoost | 0.897 | 0.832 | 0.833 | 0.802 | 0.861 | 0.877 | 0.978 | 0.817 | 0.697 |
ROC-AUC | MLP | 0.848 | 0.758 | 0.757 | 0.721 | 0.791 | 0.807 | 0.959 | 0.809 | 0.585 |
ROC-AUC | ModernNCA | 0.893 | 0.825 | 0.821 | 0.812 | 0.839 | 0.855 | 0.976 | 0.812 | 0.689 |
ROC-AUC | RandomForest | 0.886 | 0.833 | 0.833 | 0.823 | 0.847 | 0.861 | 0.971 | 0.817 | 0.689 |
ROC-AUC | RealMLP | 0.893 | 0.806 | 0.804 | 0.801 | 0.828 | 0.845 | 0.950 | 0.808 | 0.685 |
ROC-AUC | TabPFN v2 | 0.888 | 0.832 | 0.831 | 0.779 | 0.858 | 0.873 | 0.968 | 0.816 | 0.688 |
F1-Score | XGBoost | 0.770 | 0.815 | 0.732 | 0.471 | 0.717 | 0.693 | 0.792 | 0.239 | 0.522 |
F1-Score | CatBoost | 0.772 | 0.824 | 0.735 | 0.468 | 0.719 | 0.694 | 0.789 | 0.210 | 0.517 |
F1-Score | MLP | 0.732 | 0.790 | 0.699 | 0.434 | 0.676 | 0.645 | 0.739 | 0.157 | 0.487 |
F1-Score | ModernNCA | 0.762 | 0.817 | 0.722 | 0.646 | 0.651 | 0.639 | 0.778 | 0.129 | 0.507 |
F1-Score | RandomForest | 0.749 | 0.814 | 0.709 | 0.628 | 0.626 | 0.609 | 0.748 | 0.058 | 0.446 |
F1-Score | RealMLP | 0.764 | 0.813 | 0.719 | 0.646 | 0.650 | 0.636 | 0.786 | 0.182 | 0.534 |
F1-Score | TabPFN v2 | 0.755 | 0.817 | 0.721 | 0.416 | 0.702 | 0.675 | 0.728 | 0.044 | 0.516 |
Balanced-Accuracy | XGBoost | 0.801 | 0.723 | 0.721 | 0.658 | 0.739 | 0.741 | 0.860 | 0.567 | 0.617 |
Balanced-Accuracy | CatBoost | 0.807 | 0.720 | 0.715 | 0.656 | 0.733 | 0.735 | 0.855 | 0.557 | 0.622 |
Balanced-Accuracy | MLP | 0.775 | 0.708 | 0.703 | 0.642 | 0.718 | 0.720 | 0.839 | 0.551 | 0.558 |
Balanced-Accuracy | ModernNCA | 0.801 | 0.708 | 0.702 | 0.685 | 0.705 | 0.710 | 0.849 | 0.531 | 0.619 |
Balanced-Accuracy | RandomForest | 0.791 | 0.695 | 0.687 | 0.670 | 0.685 | 0.686 | 0.814 | 0.514 | 0.604 |
Balanced-Accuracy | RealMLP | 0.803 | 0.723 | 0.714 | 0.695 | 0.714 | 0.716 | 0.867 | 0.548 | 0.628 |
Balanced-Accuracy | TabPFN v2 | 0.794 | 0.711 | 0.704 | 0.631 | 0.718 | 0.718 | 0.810 | 0.510 | 0.621 |
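For reference, the four objectives in Table 11 can be computed with scikit-learn as follows; `clf`, `X_test`, and `y_test` are stand-ins for a fitted model and its test split.

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, roc_auc_score)

y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]  # positive-class probability (binary tasks)

objectives = {
    "Accuracy": accuracy_score(y_test, y_pred),
    "ROC-AUC": roc_auc_score(y_test, y_score),
    "F1-Score": f1_score(y_test, y_pred),
    "Balanced-Accuracy": balanced_accuracy_score(y_test, y_pred),
}
print(objectives)
```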