Realistic Evaluation of TabPFN v2 in Open Environments (2025)

Zi-Jian Cheng1,2, Zi-Yi Jia1,2, Zhi Zhou2,3, Yu-Feng Li2,3, Lan-Zhe Guo1,2*
1School of Intelligence Science and Technology, Nanjing University, China
2National Key Laboratory for Novel Software Technology, Nanjing University, China
3School of Artificial Intelligence, Nanjing University, China
{chengzj,zhouz,liyf,guolz}@lamda.nju.edu.cn,jiazy@smail.nju.edu.cn
*Corresponding author.

Abstract

Tabular data, owing to its ubiquitous presence in real-world domains, has garnered significant attention in machine learning research. While tree-based models have long dominated tabular machine learning tasks, the recently proposed deep learning model TabPFN v2 has emerged, demonstrating unparalleled performance and scalability potential. Although extensive research has been conducted on TabPFN v2 to further improve performance, the majority of this research remains confined to closed environments, neglecting the challenges that frequently arise in open environments. This raises the question: Can TabPFN v2 maintain good performance in open environments? To this end, we conduct the first comprehensive evaluation of TabPFN v2’s adaptability in open environments. We construct a unified evaluation framework covering various real-world challenges and assess the robustness of TabPFN v2 under open environments scenarios using this framework. Empirical results demonstrate that TabPFN v2 shows significant limitations in open environments but is suitable for small-scale, covariate-shifted, and class-balanced tasks. Tree-based models remain the optimal choice for general tabular tasks in open environments. To facilitate future research on open environments challenges, we advocate for open environments tabular benchmarks, multi-metric evaluation, and universal modules to strengthen model robustness. We publicly release our evaluation framework at the URL.

1 Introduction

Tabular data [2] constitutes a highly structured data paradigm characterized by its organization of information through orthogonal dimensions of rows and columns [46]. In tabular data, each row represents an instance, while each column encodes a specific feature or attribute. The pervasive applicability of tabular data has been demonstrated across diverse domains. Within financial services, it facilitates critical operations such as credit scoring [54] and quantitative portfolio management [63] through predictive analytics. In biomedical research, tabular datasets underpin clinical decision support systems [58] and pharmacological discovery pipelines [40]. To fully exploit the potential of tabular data for addressing real-world tasks, various tabular machine learning models have been developed. This evolutionary progression spans from tree-based methods (e.g., CatBoost [44] and XGBoost [9]) to deep learning models (e.g., ModernNCA [56] and TabPFN [25, 26]). These models have demonstrated exceptional performance across diverse tabular tasks.

Tree-based models have long outperformed deep learning models in tabular tasks [20, 39]. The emergence of a new deep learning model, TabPFN v2, has disrupted this dominance [26]. Grounded in the Transformer architecture [52], TabPFN v2 achieves state-of-the-art results through large-scale pre-training on synthetic datasets, allowing direct deployment on downstream tasks without fine-tuning. Notably, TabPFN v2 introduces a novel in-context learning framework that processes labelled training data and unlabeled test samples in a unified input pipeline. This hybrid approach facilitates joint optimization of feature representation and class prediction through self-supervised alignment mechanisms. Empirical validation across diverse datasets demonstrates an unprecedented performance level of TabPFN v2 on tabular tasks.

[Figure 1: The four open environments challenges: Emerging New Classes, Decremental/Incremental Features, Changing Data Distributions, and Varied Learning Objectives.]

Given the significant potential demonstrated by TabPFN v2 in tabular machine learning tasks, current research has focused on further enhancing its performance or adapting it to more real-world applications. This work falls into two categories: performance evaluation and the handling of specific tasks. For performance evaluation, Liu and Ye [35] expanded the scope of evaluation experiments on TabPFN, assessing its performance on nearly 300 datasets and further validating its efficacy. To address the limitations of TabPFN v2 in handling high-dimensional, large-scale, and multi-class tabular machine learning tasks, a divide-and-conquer mechanism has been proposed [57]. Furthermore, researchers have proposed a series of optimization strategies to enhance TabPFN v2's adaptability in complex tasks, such as context compression [33] and data generation [50]. Koshil et al. [33] suggest leveraging retrieved samples to construct a local context, thereby enhancing TabPFN v2's ability to perceive local information, while Thomas et al. [50] and Xu et al. [55] optimize TabPFN v2's performance through data generation.

However, current research on TabPFN v2 is mostly carried out in closed environments, where various learning factors, such as data distribution and feature space, remain consistent [42]. In the real world, tabular tasks usually occur in open environments [61] and face significant challenges when these learning factors change. For example, in traffic management systems, as the categories of traffic participants, event types, and facilities continue to increase, the complexity of management rises significantly (Emerging New Classes). Meanwhile, equipment updates, failures, and changes in travel behaviour lead to feature drifts in data, affecting the system's accurate perception of traffic states (Decremental/Incremental Features). Moreover, the distribution of traffic flow frequently changes due to factors such as urban planning, large-scale events, and holidays, further increasing the dynamism of management (Changing Data Distributions). In addition, management goals have shifted from single-objective efficiency optimization to multi-objective optimization, including reducing carbon emissions and enhancing system resilience, while paying more attention to long-term sustainability and overall system optimization (Varied Learning Objectives). Figure 1 depicts the four open environments challenges identified in Zhou [61]. Although existing research has gradually focused on improving TabPFN v2's adaptability in open environments, these studies mainly concentrate on distribution shift scenarios [30, 22] and have not yet comprehensively evaluated the various challenges TabPFN v2 may face in open environments. This limitation raises the natural question of whether TabPFN v2 can maintain good performance in open environments, and highlights the need for a more holistic assessment of TabPFN v2 in diverse and dynamic real-world scenarios.

To this end, we conduct a comprehensive evaluation of the performance of TabPFN v2 in open environments for the first time. Existing benchmarks for tabular data in open environments primarily evaluate models in isolated scenarios, limiting their methodological applicability to broader real-world tasks. To address this, we introduce a unified evaluation framework that systematically benchmarks diverse tabular models across various challenges in open environments, enabling standardized assessment of robustness and adaptability.

From the experiments, we observe that TabPFN v2 exhibits broad limitations across the open environments challenges. Although TabPFN v2 shows potential to detect new classes under emerging new classes, when handling decremental/incremental features it not only shows heightened vulnerability to feature decrement but also cannot exploit newly added features during testing. Under changing data distributions, its performance degrades substantially due to limited robustness against concept drift. For varied learning objectives, TabPFN v2 displays a statistically significant bias toward majority classes and fails to maintain competitive performance across different task formulations. Moreover, the robustness of TabPFN v2 is fundamentally data-dependent, rendering its generalization capability highly sensitive to dataset scale.

Although results demonstrate that tree-based models remain the optimal approach for general tabular tasks in open environments, the above observations suggest settings in which practitioners can expect TabPFN v2 to be the right choice in open environments: 1) when the available dataset is small; 2) when the distribution shift is characterized as covariate shift; 3) when the label distribution is approximately balanced across classes.

Separately, state-of-the-art methods, despite their strong performance in closed environments, may fail to generalize effectively to open environments. This gap underscores the need for targeted enhancements to advance open environments research. To address this challenge, we propose the following recommendations:

  • Develop benchmarks targeting unexplored open environments tabular challenges.

  • Evaluate models on various open environments metrics.

  • Take model robustness as a critical metric when comparing model quality.

  • Design universal modules to enhance the robustness of diverse existing models.

2 Related Work

2.1 Open Environments Challenges

Most tabular machine learning models are typically trained and tested in closed environments where critical learning factors remain stable. However, various real-world tasks operate in open environments where dynamic changes occur in key factors, posing challenges to model generalization [42]. Zhou [61] categorizes four core challenges in open environments: Emerging New Classes, Decremental/Incremental Features, Changing Data Distributions, and Varied Learning Objectives.

These challenges are pivotal in open environments machine learning. Emerging new classes, involving unseen classes during testing, have been addressed in natural language processing [13] and computer vision [14, 15]. Decremental/incremental features, caused by changes in feature sets, lead to mismatched training-testing spaces. TabFSBench [10] evaluates model performance under such variations, and Hou et al. [31] enhance performance by restoring ephemeral features. Changing data distributions, where test data violate the i.i.d. assumption, have led to benchmark datasets such as TableShift [18], and methods such as domain adaptation [60] and domain generalization [59]. Varied learning objectives, which prioritize adaptive optimization beyond accuracy, include multi-objective learning [62, 64] and self-evolving training [37]. However, research on these challenges remains fragmented, lacking a unified framework to evaluate models on all four challenges.

2.2 Tabular Data in Machine Learning

Tabular data, with structured and heterogeneous features, is used in healthcare, finance, and recommendation systems [5, 32, 48]. Unlike images and texts, it has high dimensionality, heterogeneity, and complex dependencies, posing challenges for machine learning models [16]. Current approaches are mainly tree-based models (e.g., XGBoost [11], LightGBM [4], CatBoost [44]) and deep learning models. Tree-based models handle irregular patterns and uninformative features well [20], while deep learning models like DCN V2 [53], FT-Transformer [19], and NODE [43] aim to capture complex feature interactions for better performance [19, 43].

In tabular machine learning tasks, tree-based models have traditionally held a dominant position over deep learning models [20, 39]. The novel deep learning model TabPFN v2 [26] has since surpassed tree-based models, demonstrating superior performance across multiple benchmarks. However, the majority of these benchmarks are confined to closed environments. Consequently, the comparative performance of TabPFN v2 and tree-based models in open environments remains underexplored.

2.3 Research on TabPFN

TabPFN [25], short for Tabular Prior-data Fitted Network, is a model pre-trained on large-scale synthetic datasets, enabling efficient zero-shot learning. It can efficiently perform classification and regression tasks without the need for hyperparameter tuning. Compared to existing models, TabPFN shows significant advantages on small- to medium-scale datasets at low computational cost, making it an efficient solution for tabular tasks. Recent research, however, reveals limitations in TabPFN's performance on high-dimensional, large-scale, or multi-class tasks [57, 35].

Various optimization strategies have been proposed to enhance TabPFN's adaptability to current limitations and more complex scenarios, including local context construction via retrieval-based methods [33], model fine-tuning [50, 55], and pretraining dataset expansion [6]. Moreover, TabPFN's strong performance has prompted its application to challenges such as distribution shift adaptation [22], time series forecasting [29], and various domains including healthcare [41, 51], ecology [21], and cybersecurity [45]. However, these studies primarily focus on closed environments or target only a single challenge in open settings, lacking a comprehensive evaluation of TabPFN under diverse open environments scenarios.

3 TabPFN and TabPFN v2

This section explains how TabPFN and its newer version, TabPFN v2, work. Since these models have already been studied extensively, this section gives only a brief summary. More details are provided in Appendix A, which brings together the key points from Hollmann et al. [26] and Ye et al. [57].

3.1 TabPFN

Developed by Hollmann et al. [25], TabPFN reimagines classification through an innovative adaptation of a Transformer-based architecture. At its core, the method reformulates classification as a sequence processing problem with the following key components.

TabPFN standardizes each data point $(x_i, y_i)$ to $(\tilde{x}_i, \tilde{y}_i)$ in a $k$-dimensional space via linear projections, with zero-padding ensuring uniform dimensionality. A context matrix $\mathcal{A}$ is constructed by concatenating the $N$ training samples and a test sample $x^{*}$, where $\mathcal{A} = [\tilde{x}_i \oplus \tilde{y}_i]_{i=1}^{N} \parallel [\tilde{x}^{*}]$ and $\oplus$ denotes vector concatenation. This formulation treats each data point as a token in a sequence, enabling flexible handling of variable dataset sizes. The context matrix is then processed through Transformer layers and an MLP head, which converts the test sample's output token into class probabilities.
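The context construction above can be sketched in a few lines of NumPy. This is an illustrative toy only: plain zero-padding stands in for the learned linear projections, and `build_context` is our own name, not part of any TabPFN release.

```python
import numpy as np

def build_context(X_train, y_train, x_test, k=8):
    """Toy sketch of TabPFN's context matrix A: zero-pad each sample to k
    dimensions, append its label as an extra coordinate (x_i ⊕ y_i), and
    stack the unlabeled test sample as the final token."""
    n, d = X_train.shape
    assert d <= k, "samples must fit in the padded dimensionality"
    X_pad = np.zeros((n, k))
    X_pad[:, :d] = X_train                    # zero-padding in place of projection
    tokens = np.concatenate([X_pad, y_train[:, None]], axis=1)
    x_pad = np.zeros((1, k + 1))
    x_pad[0, :d] = x_test                     # test token's label slot stays empty
    return np.vstack([tokens, x_pad])         # A = [x_i ⊕ y_i]_{i=1}^{N} ∥ [x*]

A = build_context(np.random.rand(5, 3), np.array([0., 1., 0., 1., 1.]),
                  np.random.rand(3), k=8)
print(A.shape)  # (6, 9): N+1 tokens, each of dimension k+1
```

Each row of `A` then plays the role of one token fed to the Transformer layers.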

3.2 TabPFN v2

Building upon TabPFN, TabPFN v2 [26] introduces architectural innovations that redefine feature processing in tabular data analysis. Each raw feature is projected into a $k$-dimensional latent space and subjected to controlled perturbation, creating unique positional identifiers [57, 19]. The computational framework processes a three-dimensional tensor using dual attention mechanisms: cross-sample attention for dataset-level patterns and intra-feature attention for feature relationships. Pre-trained weights, derived from synthetic data generated by structural causal models, facilitate zero-shot transfer, thereby addressing the diversity of tabular data.
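The dual attention idea can be illustrated with a toy, single-head NumPy pass over a (samples, features, k) tensor. This is purely a sketch: there are no learned projections or pre-trained weights, and `dual_attention` is a hypothetical name for illustration only.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dual_attention(T):
    """Toy version of the two attention passes: first attend across samples
    within each feature column, then across features within each sample row.
    Single head, no value/key/query projections."""
    n, f, k = T.shape
    # cross-sample attention: for each feature j, tokens attend over samples
    scores = np.einsum('ifk,jfk->fij', T, T) / np.sqrt(k)   # (f, n, n)
    T = np.einsum('fij,jfk->ifk', softmax(scores), T)
    # intra-feature attention: for each sample i, tokens attend over features
    scores = np.einsum('ifk,igk->ifg', T, T) / np.sqrt(k)   # (n, f, f)
    T = np.einsum('ifg,igk->ifk', softmax(scores), T)
    return T

out = dual_attention(np.random.rand(4, 3, 8))
print(out.shape)  # shape is preserved: (4, 3, 8)
```

Alternating the two axes lets information flow both between rows (samples) and between columns (features) of the table.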

Current research [26, 57] has extensively evaluated TabPFN v2's performance in closed environments, but largely overlooked its adaptability to open environments, leaving a critical gap. To fully realize its potential and practical value, we conduct comprehensive evaluations of TabPFN v2 under various open environments challenges.

4 Open Environments Challenges

In this section, we draw upon the previous work presented in Zhou [61] as a foundational framework to formalize the open environments challenges encountered in tabular machine learning tasks. Detailed real-world descriptions are given in Appendix B.

4.1 Emerging New Classes

In closed environments machine learning tasks, it is commonly assumed that the class of any test sample must belong to the class set seen during training. However, this assumption does not always hold in open environments. We formally define this challenge by partitioning the class set $L$ into $L^{\text{train}}$ and $L^{\text{test}}$, corresponding to the training and testing phases, respectively. In closed environments, the class set remains consistent between the two phases, i.e., $L^{\text{train}} = L^{\text{test}}$. In contrast, in open environments, test samples may belong to novel classes not present during training, i.e., $\exists\, l \in L^{\text{test}}$ such that $l \notin L^{\text{train}}$. In such cases, the model must be capable of identifying and handling these new classes.

4.2 Decremental/Incremental Features

Decremental and incremental features represent open environments challenges characterized by partial removal or augmentation of the input feature set, known as feature shift. Let $C$ denote the full feature set, partitioned into $C^{\text{train}}$ and $C^{\text{test}}$ for training and testing, respectively. In closed environments, $C^{\text{train}} = C^{\text{test}}$, whereas in open environments, $C^{\text{train}}$ remains fixed but $C^{\text{test}}$ may differ. Specifically, when $C^{\text{test}} \subsetneqq C^{\text{train}}$, imputation of the missing features in $C^{\text{test}}$ is necessary to maintain input dimension consistency and enable accurate model prediction (Decremental Features). Conversely, when $C^{\text{train}} \subsetneqq C^{\text{test}}$, the model typically truncates the newly added features in $C^{\text{test}}$, retaining only those corresponding to $C^{\text{train}}$, thus ensuring input dimension consistency between the training and testing phases (Incremental Features).
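The two alignment rules above (imputing decremental features, truncating incremental ones) can be sketched as follows. `align_test_features` is a hypothetical helper, not part of the evaluation framework's API, and `fill_values` stands in for any imputation statistic, such as the per-column training mean.

```python
import numpy as np

def align_test_features(test_cols, X_test, train_cols, fill_values):
    """Align a test matrix to the training feature space: impute columns
    missing at test time (decremental) and drop columns unseen during
    training (incremental)."""
    n = X_test.shape[0]
    out = np.empty((n, len(train_cols)))
    for j, col in enumerate(train_cols):
        if col in test_cols:
            out[:, j] = X_test[:, test_cols.index(col)]
        else:                       # decremental: absent at test time -> impute
            out[:, j] = fill_values[col]
    # incremental features (in test_cols but not train_cols) are simply dropped
    return out

X = np.array([[1.0, 5.0], [2.0, 6.0]])        # test columns: 'a', 'c'
aligned = align_test_features(['a', 'c'], X, ['a', 'b'], {'a': 0.0, 'b': 9.0})
print(aligned)  # column 'b' imputed with 9.0, unseen column 'c' dropped
```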

4.3 Changing Data Distributions

Closed environments machine learning research generally assumes that all training and testing data are independent samples from an identical distribution. Unfortunately, this assumption does not always hold in open environments. Changing data distributions covers two scenarios. Covariate Shift [49] occurs when the input distribution $p(x)$ changes between the training and testing phases while the conditional probability $p(y|x)$ remains constant. Concept Shift [17] involves changes in the conditional probability $p(y|x)$ with a stable input distribution $p(x)$.
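The two scenarios can be made concrete with a small synthetic example. The labelling rule below is our own toy construction: shifting the Gaussian mean changes $p(x)$ only (covariate shift), while flipping the rule changes $p(y|x)$ only (concept shift).

```python
import numpy as np

rng = np.random.default_rng(0)

def label(x, flip=False):
    """Toy labelling rule p(y|x); `flip` inverts it to simulate concept shift."""
    y = (x.sum(axis=1) > 0).astype(int)
    return 1 - y if flip else y

# training data
X_tr = rng.normal(0.0, 1.0, size=(1000, 2)); y_tr = label(X_tr)
# covariate shift: p(x) moves, p(y|x) unchanged
X_cov = rng.normal(1.5, 1.0, size=(1000, 2)); y_cov = label(X_cov)
# concept shift: p(x) unchanged, p(y|x) changes
X_con = rng.normal(0.0, 1.0, size=(1000, 2)); y_con = label(X_con, flip=True)

# under covariate shift the positive rate moves with p(x); under concept
# shift the inputs look familiar but the decision rule has changed
print(y_tr.mean(), y_cov.mean(), y_con.mean())
```

A model fit on `(X_tr, y_tr)` stays correct on `(X_cov, y_cov)` wherever it generalizes over $p(x)$, but is systematically wrong on `(X_con, y_con)`, which is why concept shift is the harder scenario.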

4.4 Varied Learning Objectives

The performance of a machine learning model $f$ can be measured by a learning objective $M_f$, such as accuracy, F1-score, or ROC-AUC. Training towards different objectives may yield models with different strengths: a model that is optimal on one measure is not necessarily optimal on others. Machine learning research in closed environments generally assumes that the $M_f$ used to evaluate model performance is fixed and known in advance. However, this assumption may not hold in open environments. When facing this challenge, the model should perform well across various learning objectives without requiring data to be recollected or a completely new model to be trained.
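A minimal example of why the choice of $M_f$ matters: on an imbalanced dataset, two predictors can tie on accuracy yet differ sharply on balanced accuracy. The data is a toy construction, and the metric implementations follow the standard definitions, not the paper's code.

```python
import numpy as np

def accuracy(y, p):
    return float(np.mean(y == p))

def balanced_accuracy(y, p):
    # arithmetic mean of per-class recalls
    return float(np.mean([np.mean(p[y == c] == c) for c in np.unique(y)]))

# 90/10 imbalanced labels; a majority-class predictor vs. a fairer one
y = np.array([0] * 90 + [1] * 10)
p_major = np.zeros(100, dtype=int)             # always predicts the majority class
p_fair = np.array([0] * 80 + [1] * 10 + [1] * 10)  # 80/90 on class 0, 10/10 on class 1

print(accuracy(y, p_major), balanced_accuracy(y, p_major))  # 0.9, 0.5
print(accuracy(y, p_fair), balanced_accuracy(y, p_fair))    # 0.9, ~0.944
```

Both predictors score 0.9 accuracy, yet the majority-class predictor collapses to chance-level balanced accuracy: a model "optimal" under one objective need not be optimal under another.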

4.5 Evaluation Framework for Open Environments Challenges

Existing benchmarks for evaluating model performance in open environments typically focus on a single task, such as distribution shift [18] or feature shift [10]. However, they lack a unified and comprehensive assessment across multiple open environments challenges. Hence, we propose a modular and extensible evaluation framework that assesses both model performance and robustness across diverse real-world scenarios. The framework formalizes four representative open environments challenges: Emerging New Classes, Decremental/Incremental Features, Changing Data Distributions, and Varied Learning Objectives. It builds testing protocols by leveraging existing benchmarks, including WhyShift [34] and TableShift [18] for distribution shifts, and TabFSBench [10] for feature shifts. It supports comprehensive evaluation of tabular models and enables exporting datasets under different open environments scenarios with just a few lines of Python code. Details are in Appendix C.

Table 1: ROC-AUC and AUPR for new class detection on four datasets (averaged over leave-one-class-out runs).

Model         | EyeMovement     | CMC             | Wine-Red        | Wine-White
              | ROC-AUC  AUPR   | ROC-AUC  AUPR   | ROC-AUC  AUPR   | ROC-AUC  AUPR
RandomForest  | 0.509    0.505  | 0.503    0.502  | 0.500    0.500  | 0.500    0.500
XGBoost       | 0.503    0.502  | 0.502    0.501  | 0.567    0.558  | 0.533    0.533
CatBoost      | 0.507    0.504  | 0.512    0.507  | 0.467    0.492  | 0.367    0.462
MLP           | 0.503    0.502  | 0.527    0.517  | 0.700    0.658  | 0.400    0.472
RealMLP       | 0.510    0.505  | 0.514    0.508  | 0.500    0.500  | 0.500    0.500
ModernNCA     | 0.504    0.502  | 0.510    0.506  | 0.400    0.481  | 0.500    0.500
TabPFN v2     | 0.511    0.507  | 0.522    0.513  | 0.533    0.521  | 0.667    0.644

5 Comprehensive Evaluation of TabPFN v2

Expanding on the impressive performance of TabPFN v2 in closed environments, we undertake a comprehensive evaluation in open environments to rigorously assess the robustness and adaptability of TabPFN v2 through our proposed evaluation framework. Specifically, we subject TabPFN v2 to evaluation across four distinct challenges in open environments, as detailed in Section D. We choose RandomForest [7], XGBoost [9], and CatBoost [44] as tree-based baseline models, and MLP, RealMLP [28], and ModernNCA [56] as deep learning baseline models. Given the differences in datasets across open environments challenges, we provide detailed dataset descriptions in each subsection.

5.1 Emerging New Classes

Current tabular models (e.g., TabPFN v2) are fundamentally constrained by fixed input-output dimensions, limiting the incorporation of new classes. To evaluate their adaptability, we design a new class detection task following SMOOD [23]. Using multi-class datasets (Appendix E.1), we implement a leave-one-class-out protocol: for a $k$-class problem, we perform $k$ runs, each excluding one class during training and treating it as novel at test time. We evaluate models with the Area Under the Precision-Recall curve (AUPR) and ROC-AUC; results are averaged across all runs. The detailed computation procedure is described in Section E.2.
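The leave-one-class-out splits can be sketched as below. This covers only the data splitting; the SMOOD-style novelty scoring and the ROC-AUC/AUPR computation are omitted, and `leave_one_class_out_splits` is our own illustrative name.

```python
import numpy as np

def leave_one_class_out_splits(X, y):
    """For a k-class dataset, yield k (train, test) splits in which one class
    is withheld from training and marked as novel (1) in the test labels.
    A novelty detector is then scored on these binary labels."""
    for held_out in np.unique(y):
        train_mask = y != held_out
        X_train, y_train = X[train_mask], y[train_mask]
        is_novel = (y == held_out).astype(int)   # test-time novelty labels
        yield X_train, y_train, X, is_novel

X = np.random.rand(9, 2)
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
splits = list(leave_one_class_out_splits(X, y))
print(len(splits))  # 3 runs for a 3-class problem
```

Averaging the per-run ROC-AUC and AUPR over the `k` splits gives the numbers reported in Table 1.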

TabPFN v2 has the potential to detect new classes. As illustrated in Table 1, TabPFN v2 consistently achieves better AUPR and ROC-AUC in new class detection across all four datasets when compared to other models. This empirical evidence indicates that TabPFN v2 possesses a robust capability for identifying new classes. Results for other predicted-probability metrics are in Appendix E.

5.2 Decremental/Incremental Features

To conduct a comprehensive evaluation of decremental/incremental features, we adopt TabFSBench [10], a benchmark specifically designed for this challenge. It includes twelve datasets covering eight classification tasks and four regression tasks across various domains, dataset sizes, and feature types. Descriptions and results are provided in Appendix F.

Table 2: Model performance at increasing feature-shift ratios (performance gap relative to the 0% baseline in parentheses).

Binary Classification
Shift | RandomForest   | XGBoost        | CatBoost       | MLP            | RealMLP        | ModernNCA      | TabPFN v2
0%    | 0.838          | 0.842          | 0.869          | 0.805          | 0.813          | 0.869          | 0.852
20%   | 0.764 (-0.074) | 0.766 (-0.076) | 0.834 (-0.035) | 0.781 (-0.024) | 0.744 (-0.069) | 0.708 (-0.161) | 0.809 (-0.043)
40%   | 0.622 (-0.216) | 0.624 (-0.218) | 0.764 (-0.105) | 0.743 (-0.062) | 0.666 (-0.147) | 0.598 (-0.271) | 0.725 (-0.127)
60%   | 0.583 (-0.255) | 0.581 (-0.261) | 0.714 (-0.155) | 0.698 (-0.107) | 0.672 (-0.141) | 0.568 (-0.301) | 0.635 (-0.217)
80%   | 0.464 (-0.374) | 0.514 (-0.328) | 0.631 (-0.238) | 0.620 (-0.185) | 0.563 (-0.250) | 0.540 (-0.329) | 0.556 (-0.296)
100%  | 0.446 (-0.392) | 0.467 (-0.375) | 0.537 (-0.332) | 0.534 (-0.271) | 0.460 (-0.353) | 0.512 (-0.357) | 0.483 (-0.369)

Multi Classification
Shift | RandomForest   | XGBoost        | CatBoost       | MLP            | RealMLP        | ModernNCA      | TabPFN v2
0%    | 0.800          | 0.802          | 0.837          | 0.723          | 0.745          | 0.906          | 0.709
20%   | 0.735 (-0.065) | 0.759 (-0.043) | 0.794 (-0.043) | 0.700 (-0.023) | 0.640 (-0.105) | 0.819 (-0.087) | 0.651 (-0.058)
40%   | 0.637 (-0.163) | 0.677 (-0.125) | 0.714 (-0.123) | 0.658 (-0.065) | 0.665 (-0.080) | 0.700 (-0.206) | 0.556 (-0.153)
60%   | 0.462 (-0.338) | 0.574 (-0.228) | 0.605 (-0.232) | 0.600 (-0.123) | 0.559 (-0.186) | 0.562 (-0.344) | 0.432 (-0.277)
80%   | 0.354 (-0.446) | 0.460 (-0.342) | 0.463 (-0.374) | 0.520 (-0.203) | 0.379 (-0.366) | 0.444 (-0.462) | 0.288 (-0.421)
100%  | 0.226 (-0.574) | 0.306 (-0.496) | 0.321 (-0.516) | 0.363 (-0.360) | 0.195 (-0.550) | 0.286 (-0.620) | 0.117 (-0.592)

Regression
Shift | RandomForest   | XGBoost        | CatBoost       | MLP            | RealMLP        | ModernNCA      | TabPFN v2
0%    | 0.925          | 0.922          | 0.902          | 0.997          | 0.926          | 0.940          | 0.928
20%   | 1.218 (+0.293) | 1.155 (+0.233) | 1.152 (+0.250) | 1.025 (+0.028) | 1.263 (+0.337) | 1.103 (+0.163) | 0.974 (+0.046)
40%   | 1.537 (+0.612) | 1.514 (+0.592) | 1.544 (+0.642) | 1.073 (+0.076) | 1.567 (+0.641) | 1.309 (+0.369) | 1.034 (+0.104)
60%   | 1.738 (+0.813) | 1.762 (+0.840) | 1.818 (+0.916) | 1.125 (+0.128) | 1.802 (+0.876) | 1.499 (+0.559) | 1.184 (+0.256)
80%   | 2.086 (+1.161) | 2.119 (+1.197) | 2.247 (+1.345) | 1.181 (+0.184) | 2.138 (+1.212) | 1.735 (+0.795) | 1.232 (+0.304)
100%  | 2.346 (+1.421) | 2.412 (+1.490) | 2.571 (+1.669) | 1.247 (+0.250) | 2.433 (+1.507) | 1.940 (+1.000) | 1.317 (+0.389)

TabPFN v2 exhibits heightened vulnerability to decremental features. To assess TabPFN v2's adaptability to decremental features, we conduct random-shift experiments in TabFSBench and use the performance gap as a metric. The performance gap, explained in Appendix F, measures the impact of feature shifts by comparing model performance between the original and shifted feature sets. As shown in Table 2, TabPFN v2's performance gap widens significantly with increasing feature shifts, indicating weaker adaptability and higher sensitivity to feature space changes. In contrast, MLP and CatBoost show greater robustness against decremental features, possibly due to inherent anti-shift properties that TabPFN v2 may lack.
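For reference, the signed gap reported in parentheses in Table 2 is simply the shifted-minus-original difference, computed per model and per shift level (a trivial restatement of the metric; `performance_gap` is our own name):

```python
def performance_gap(original, shifted):
    """Signed performance gap: shifted-feature score minus original score.
    Negative values indicate degradation for accuracy-like metrics;
    positive values indicate degradation for error metrics (regression)."""
    return round(shifted - original, 3)

print(performance_gap(0.852, 0.809))  # -0.043, TabPFN v2 at 20% shift (binary)
print(performance_gap(0.925, 1.218))  # +0.293, RandomForest at 20% shift (regression)
```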

TabPFN v2 cannot exploit newly added features in the testing phase. When the dimensionality of input features increases dynamically, TabPFN v2 cannot process the additional features and can only truncate them, retaining those present during training. This is because its internal parameters and feature representations are tied to the fixed feature dimensionality established at training time. Consequently, TabPFN v2 cannot leverage the information carried by new features during testing, although this limitation does not degrade its performance on the original features.

5.3 Changing Data Distributions

We evaluate TabPFN v2 under scenarios of changing data distributions, using Accuracy, Balanced Accuracy, F1-score, and ROC-AUC as metrics. Detailed results are given in Appendix G. The evaluation is conducted on nine fully numerical datasets drawn from the WhyShift [34] and TableShift [18] benchmarks, which cover three types of distribution shift scenarios. To accommodate memory constraints and the current limitations of TabPFN v2 on very large datasets, we apply stratified subsampling (up to 50,000 instances) while preserving the original train/test splits. Detailed dataset statistics are provided in Appendix G.1.
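The subsampling step can be sketched as follows: a per-class proportional sampler that caps a split at 50,000 rows while preserving the label distribution. This is our own sketch of the procedure under stated assumptions (classification labels, proportional allocation); the paper's exact implementation may differ.

```python
import numpy as np

def stratified_subsample(X, y, cap=50_000, seed=0):
    """If a split exceeds `cap` rows, draw per-class subsets proportional to
    the class frequencies so the label distribution is preserved."""
    n = len(y)
    if n <= cap:
        return X, y
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        n_c = int(round(len(idx) / n * cap))      # proportional allocation
        keep.append(rng.choice(idx, size=n_c, replace=False))
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]

y = np.array([0] * 80_000 + [1] * 20_000)
X = np.zeros((100_000, 1))
Xs, ys = stratified_subsample(X, y)
print(len(ys), (ys == 1).mean())  # 50000 rows, class-1 share stays 0.2
```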

TabPFN v2 reveals limited robustness when concepts shift. We present a comparative analysis of accuracy between XGBoost and TabPFN v2 under two distinct distribution shifts: Concept Shift and Covariate Shift. XGBoost is the best model on the changing data distributions task. Figure 3 shows that both models achieve higher accuracy under Covariate Shift than under Concept Shift, and XGBoost maintains consistent superiority over TabPFN v2 across both shift types. However, under Covariate Shift, TabPFN v2 improves more than XGBoost, narrowing the performance gap. Results on the other three metrics are given in Appendix G.3. These results suggest that TabPFN v2 demonstrates promising discriminative capacity on covariate-shift datasets rather than concept-shift datasets.

5.4 Varied Learning Objectives

We conduct an exhaustive comparative analysis across four primary classification learning objectives: Accuracy, ROC-AUC, F1-score, and Balanced Accuracy. The analysis is performed on the i.i.d. datasets used in the changing data distributions evaluation.

[Figure 2]
[Figure 3]

TabPFN v2 has a statistically significant bias toward majority classes. Figure 3 reveals that the performance of TabPFN v2 degrades significantly on class-imbalance-sensitive metrics (F1-score and Balanced Accuracy), suggesting inherent limitations in handling minority classes. Specifically, Balanced Accuracy, a metric designed to address class imbalance by computing the arithmetic mean of per-class accuracies, shows that TabPFN v2 struggles to adapt to varying sample sizes across classes. Similarly, the F1-score, as the harmonic mean of precision and recall, further confirms the model's suboptimal predictive capability on minority classes. Hence, TabPFN v2 is best suited to datasets with balanced classes.

TabPFN v2 fails to maintain competitive performance across various learning objectives. As illustrated in Figure 3, TabPFN v2 demonstrates competitive Accuracy and ROC-AUC, achieving results comparable to other models on these criteria. However, a comparative analysis reveals statistically significant deficiencies in both F1-score and Balanced Accuracy when contrasted with tree-based models and RealMLP. These observations highlight an important limitation: while TabPFN v2 excels on particular learning objectives, it fails to maintain consistent efficacy across all evaluated metrics.

5.5 Holistic Assessment

We conduct a comprehensive assessment to evaluate the robustness of TabPFN v2 relative to the compared models in open environments, employing a performance ranking analysis across the four open environments challenges described above.

TabPFN v2’s robustness is inherently data-dependent. Across the four open environments challenges, TabPFN v2 consistently demonstrates superior efficacy on small-scale datasets. This observation aligns precisely with its fundamental design objective: TabPFN v2 is explicitly optimized for small-scale data. The empirical results thus substantiate the theoretical premise underlying its development and confirm its particular suitability for applications where the volume of training data is inherently limited.

Tree-based models remain the optimal approach for general tabular tasks in open environments. As shown in Table 3, tree-based models, particularly CatBoost and RandomForest, consistently outperform TabPFN v2. CatBoost achieves the best overall ranking, excelling in both changing data distributions and varied learning objectives, demonstrating stronger adaptability in open environments. In contrast, while TabPFN v2 remains competitive in closed environments, its performance declines relative to tree-based methods in open environments. These results suggest that tree-based models are better suited for open environments tasks requiring robustness.

5.6 Recommendations

During the experimental investigation, we observe that the majority of existing high-performance models predominantly demonstrate their superior performance in closed environments. However, these models tend to fall short in adapting to the open environments challenges that are more frequently encountered in real-world scenarios. To further enhance the performance of models in open environments and to provide guidance for the development of subsequent research, the following recommendations are proposed:

Develop benchmarks targeting unexplored open environments tabular challenges. Existing benchmarks are primarily designed around distribution shifts and feature shifts, and lack coverage of other open environments tabular challenges such as new classes and changes in learning objectives. Since well-constructed benchmarks directly improve performance evaluation and drive methodological progress on the corresponding tasks, it is urgent to develop benchmarks covering the full range of open environments tabular challenges.

Evaluate models on various open environments metrics. Current research typically relies on OOD Accuracy, Performance Gap, or Balanced Accuracy to assess the robustness of a model. However, these metrics are mostly applicable to tasks involving distribution shifts or feature shifts and do not cover diverse open environments challenges. Therefore, additional general open environments metrics should be introduced in model evaluation, such as Open-World Tracking Accuracy[38] and Mean Average Precision[47].

Take model robustness as a critical metric when comparing model quality. Current research often judges model quality solely by performance in closed environments, without treating robustness in open environments as an important evaluation criterion, even though robustness is crucial for determining whether a model has practical value. Therefore, robustness should be regarded as a critical metric when comparing models, and model quality should be assessed comprehensively based on both closed environments performance and open environments robustness.

Design universal modules to enhance the robustness of diverse existing models. From the aforementioned experiments, we observe that although some models perform well on certain open environments challenges, they rely on model-specific modules and lack universality: these modules cannot be transferred to other models to improve robustness, and no model achieves excellent performance across all open environments challenges. Therefore, future research should focus on designing highly universal and transferable modules to enhance the overall performance of models on open environments tasks.

Table 3: Average rank of each model across the four open environments challenges (lower is better).

Task                             | RandomForest | XGBoost | CatBoost | MLP  | RealMLP | ModernNCA | TabPFN v2
Emerging New Classes             | 4.2          | 4.1     | 5.6      | 3.3  | 3.2     | 5.2       | 1.8
Decremental/Incremental Features | 1.78         | 4.89    | 3.89     | 4.00 | 3.83    | 5.50      | 4.06
Changing Data Distributions      | 5.00         | 2.25    | 2.00     | 6.00 | 4.25    | 4.25      | 4.25
Varied Learning Objectives       | 4.75         | 2.75    | 2.00     | 6.75 | 4.00    | 3.75      | 4.00
Average Rank                     | 3.93         | 3.49    | 3.37     | 5.01 | 3.82    | 4.67      | 3.53
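The Average Rank row of Table 3 can be reproduced directly from the four per-task rows; the short sketch below recomputes it (any last-digit difference, e.g. XGBoost's recomputed 3.50 versus the reported 3.49, stems only from rounding conventions):

```python
# Per-task average ranks copied from Table 3 (lower is better); order:
# emerging new classes, feature shift, changing distributions, varied objectives.
ranks = {
    "RandomForest": [4.2, 1.78, 5.00, 4.75],
    "XGBoost":      [4.1, 4.89, 2.25, 2.75],
    "CatBoost":     [5.6, 3.89, 2.00, 2.00],
    "MLP":          [3.3, 4.00, 6.00, 6.75],
    "RealMLP":      [3.2, 3.83, 4.25, 4.00],
    "ModernNCA":    [5.2, 5.50, 4.25, 3.75],
    "TabPFN v2":    [1.8, 4.06, 4.25, 4.00],
}

# Average each model's four task ranks, then pick the lowest (best) average.
average_rank = {m: round(sum(r) / len(r), 2) for m, r in ranks.items()}
best = min(average_rank, key=average_rank.get)
print(average_rank)
print(best)  # CatBoost
```

Averaging the per-task ranks confirms the ordering reported in the text: CatBoost first overall, with TabPFN v2 close behind despite its win on emerging new classes.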

6 Conclusion

We present the first comprehensive evaluation of TabPFN v2 in open environments and construct an evaluation framework that simulates diverse open environments challenges, revealing its limitations under feature decrements and distribution shifts while highlighting its strengths in detecting new classes, handling small-scale datasets, and covariate shift scenarios. Although tree-based models remain superior for general tabular tasks, our analysis identifies specific conditions under which TabPFN v2 is pragmatically viable. These observations underscore a critical performance gap between closed and open environments, emphasizing the need for enhanced evaluation frameworks and robust model designs. To advance open environments research, we advocate for specialized benchmarks, multi-faceted model assessments that prioritize robustness, and universal modules that improve existing methods' adaptability. These directions aim to bridge the current methodological divide and foster more reliable tabular learning systems in real-world applications.

Limitations. Our experiments may not fully represent the diversity of open environments tasks due to constraints on dataset variety and task types, potentially limiting how faithfully complex real-world scenarios are simulated. The limited depth of our theoretical analysis may also constrain insights into TabPFN v2's closed environments performance and open environments robustness.

References

  • Akiba et al. [2019] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2623–2631, 2019.
  • Altman and Krzywinski [2017] Naomi Altman and Martin Krzywinski. Tabular data. Nature Methods, 14(4):329–331, 2017.
  • Alvarez-Melis and Fusi [2020] David Alvarez-Melis and Nicolo Fusi. Geometric dataset distances via optimal transport. Advances in Neural Information Processing Systems, pages 21428–21439, 2020.
  • Badirli et al. [2020] Sarkhan Badirli, Xuanqing Liu, Zhengming Xing, Avradeep Bhowmik, Khoa Doan, and Sathiya Keerthi. Gradient boosting neural networks: GrowNet. arXiv preprint arXiv:2002.07971, 2020.
  • Borisov et al. [2022] Vadim Borisov, Tobias Leemann, Kathrin Seßler, Johannes Haug, Martin Pawelczyk, and Gjergji Kasneci. Deep neural networks and tabular data: A survey. IEEE Transactions on Neural Networks and Learning Systems, 35(6):7499–7519, 2022.
  • Breejen et al. [2024] Felix den Breejen, Sangmin Bae, Stephen Cha, and Se-Young Yun. Fine-tuned in-context learning transformers are excellent tabular data classifiers. arXiv preprint arXiv:2405.13396, 2024.
  • Breiman [2001] Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001.
  • Cai et al. [2023] Tiffany Tianhui Cai, Hongseok Namkoong, and Steve Yadlowsky. Diagnosing model performance under distribution shift. arXiv preprint arXiv:2303.02011, 2023.
  • Chen and Guestrin [2016] Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.
  • Cheng et al. [2025] Zi-Jian Cheng, Zi-Yi Jia, Zhi Zhou, Lan-Zhe Guo, and Yu-Feng Li. TabFSBench: Tabular Benchmark for Feature Shifts in Open Environment. arXiv preprint arXiv:2501.18935, 2025.
  • Chizat et al. [2020] Lenaic Chizat, Pierre Roussillon, Flavien Léger, François-Xavier Vialard, and Gabriel Peyré. Faster Wasserstein distance estimation with the Sinkhorn divergence. Advances in Neural Information Processing Systems, pages 2257–2269, 2020.
  • Cortez et al. [1998] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553, 1998.
  • Deng et al. [2024] Jieren Deng, Haojian Zhang, Kun Ding, Jianhua Hu, Xingxuan Zhang, and Yunkuan Wang. Zero-shot generalizable incremental learning for vision-language object detection. Advances in Neural Information Processing Systems, pages 136679–136700, 2024.
  • Dhamija et al. [2020] Akshay Dhamija, Manuel Gunther, Jonathan Ventura, and Terrance Boult. The overlooked elephant of object detection: Open set. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1021–1030, 2020.
  • Du et al. [2022] Xuefeng Du, Zhaoning Wang, Mu Cai, and Yixuan Li. VOS: Learning what you don't know by virtual outlier synthesis. arXiv preprint arXiv:2202.01197, 2022.
  • Fang et al. [2024] Xi Fang, Weijie Xu, Fiona Anting Tan, Jiani Zhang, Ziqing Hu, Yanjun Jane Qi, Scott Nickleach, Diego Socolinsky, Srinivasan Sengamedu, and Christos Faloutsos. Large language models on tabular data: Prediction, generation, and understanding - a survey. arXiv preprint arXiv:2402.17944, 2024.
  • Gama et al. [2014] J. Gama, I. Zliobaite, and A. Bifet. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4):44, 2014.
  • Gardner et al. [2023] Josh Gardner, Zoran Popovic, and Ludwig Schmidt. Benchmarking distribution shift in tabular data with TableShift. Advances in Neural Information Processing Systems, pages 53385–53432, 2023.
  • Gorishniy et al. [2021] Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Revisiting deep learning models for tabular data. Advances in Neural Information Processing Systems, pages 18932–18943, 2021.
  • Grinsztajn et al. [2022] Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? Advances in Neural Information Processing Systems, pages 507–520, 2022.
  • Heinzel et al. [2025] Carola Sophia Heinzel, Lennart Purucker, Frank Hutter, and Peter Pfaffelhuber. Advancing biogeographical ancestry predictions through machine learning. bioRxiv, pages 1–3, 2025.
  • Helli et al. [2024] Kai Helli, David Schnurr, Noah Hollmann, Samuel Müller, and Frank Hutter. Drift-resilient TabPFN: In-context learning temporal distribution shifts on tabular data. Advances in Neural Information Processing Systems, pages 98742–98781, 2024.
  • Hendrycks and Gimpel [2017] Dan Hendrycks and Kevin Gimpel. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. In Proceedings of the 5th International Conference on Learning Representations, 2017.
  • Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, pages 6629–6640, 2017.
  • Hollmann et al. [2023] Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second. In Proceedings of the 11th International Conference on Learning Representations, 2023.
  • Hollmann et al. [2025] Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model. Nature, 637:319–326, 2025.
  • Holzmüller et al. [2024] David Holzmüller, Léo Grinsztajn, and Ingo Steinwart. Better by default: Strong pre-tuned MLPs and boosted trees on tabular data. Advances in Neural Information Processing Systems, pages 26577–26658, 2024.
  • Holzmüller et al. [2025] David Holzmüller, Leo Grinsztajn, and Ingo Steinwart. RealMLP: Advancing MLPs and default parameters for tabular data. In ELLIS Workshop on Representation Learning and Generative Models for Structured Data, 2025.
  • Hoo et al. [2025a] Shi Bin Hoo, Samuel Müller, David Salinas, and Frank Hutter. The tabular foundation model TabPFN outperforms specialized time series forecasting models based on simple features. arXiv preprint arXiv:2501.02945, 2025a.
  • Hoo et al. [2025b] Shi Bin Hoo, Samuel Müller, David Salinas, and Frank Hutter. The Tabular Foundation Model TabPFN Outperforms Specialized Time Series Forecasting Models Based on Simple Features. arXiv preprint arXiv:2501.02945, 2025b.
  • Hou et al. [2017] Bo-Jian Hou, Lijun Zhang, and Zhi-Hua Zhou. Learning with feature evolvable streams. Advances in Neural Information Processing Systems, pages 1416–1426, 2017.
  • Kadra et al. [2021] Arlind Kadra, Marius Lindauer, Frank Hutter, and Josif Grabocka. Well-tuned simple nets excel on tabular datasets. Advances in Neural Information Processing Systems, pages 23928–23941, 2021.
  • Koshil et al. [2024] Mykhailo Koshil, Thomas Nagler, Matthias Feurer, and Katharina Eggensperger. Towards Localization via Data Embedding for TabPFN. Advances in Neural Information Processing Systems Table Representation Learning Workshop, 2024.
  • Liu et al. [2023] Jiashuo Liu, Tianyu Wang, Peng Cui, and Hongseok Namkoong. On the need for a language describing distribution shifts: Illustrations on tabular datasets. Advances in Neural Information Processing Systems, pages 51371–51408, 2023.
  • Liu and Ye [2025] Si-Yang Liu and Han-Jia Ye. TabPFN Unleashed: A Scalable and Effective Solution to Tabular Classification Problems. arXiv preprint arXiv:2502.02527, 2025.
  • Liu et al. [2024a] Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, and Han-Jia Ye. TALENT: A tabular analytics and learning toolbox. arXiv preprint arXiv:2407.04057, 2024a.
  • Liu et al. [2024b] Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, and Junxian He. Diving into self-evolving training for multimodal reasoning. arXiv preprint arXiv:2412.17451, 2024b.
  • Liu et al. [2022] Yang Liu, Idil Esen Zulfikar, Jonathon Luiten, Achal Dave, Deva Ramanan, Bastian Leibe, Aljoša Ošep, and Laura Leal-Taixé. Opening up open world tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19045–19055, 2022.
  • McElfresh et al. [2023] Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C., Ganesh Ramakrishnan, Micah Goldblum, and Colin White. When do neural nets outperform boosted trees on tabular data? Advances in Neural Information Processing Systems, pages 34–47, 2023.
  • Meijerink et al. [2020] Lotta Meijerink, Giovanni Cinà, and Michele Tonutti. Uncertainty estimation for classification and risk prediction on medical tabular data. arXiv preprint arXiv:2004.05824, 2020.
  • Noda et al. [2024] Ryunosuke Noda, Daisuke Ichikawa, and Yugo Shibagaki. Machine learning-based diagnostic prediction of minimal change disease: model development study. Scientific Reports, 14(1):23460, 2024.
  • Parmar et al. [2023] Jitendra Parmar, Satyendra Chouhan, Vaskar Raychoudhury, and Santosh Rathore. Open-world machine learning: applications, challenges, and opportunities. ACM Computing Surveys, 55(10):1–37, 2023.
  • Popov et al. [2020] Sergei Popov, Stanislav Morozov, and Artem Babenko. Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data. In Proceedings of the 8th International Conference on Learning Representations, 2020.
  • Prokhorenkova et al. [2018] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: unbiased boosting with categorical features. Advances in Neural Information Processing Systems, pages 6639–6649, 2018.
  • Ruiz-Villafranca et al. [2024] Sergio Ruiz-Villafranca, José Roldán-Gómez, Juan Manuel Castelo Gómez, Javier Carrillo-Mondéjar, and José Luis Martinez. A TabPFN-based intrusion detection system for the industrial internet of things. The Journal of Supercomputing, 80(14):20080–20117, 2024.
  • Sahakyan et al. [2021] Maria Sahakyan, Zeyar Aung, and Talal Rahwan. Explainable artificial intelligence for tabular data: A survey. IEEE Access, 9:135392–135422, 2021.
  • Sancaktar et al. [2022] Cansu Sancaktar, Sebastian Blaes, and Georg Martius. Curious exploration via structured world models yields zero-shot object manipulation. Advances in Neural Information Processing Systems, pages 24170–24183, 2022.
  • Shwartz-Ziv and Armon [2022] Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you need. Information Fusion, 81:84–90, 2022.
  • Sugiyama et al. [2007] Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(5), 2007.
  • Thomas et al. [2024] Valentin Thomas, Junwei Ma, Rasa Hosseinzadeh, Keyvan Golestan, Guangwei Yu, Maksims Volkovs, and Anthony L. Caterini. Retrieval & Fine-Tuning for In-Context Tabular Models. Advances in Neural Information Processing Systems, pages 108439–108467, 2024.
  • Tran and Byeon [2024] Vinh Quang Tran and Haewon Byeon. Predicting dementia in Parkinson's disease on a small tabular dataset using hybrid LightGBM–TabPFN and SHAP. Digital Health, 10:20–55, 2024.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
  • Wang et al. [2021] Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi. DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of the Web Conference 2021, pages 1785–1797, 2021.
  • West [2000] David West. Neural network credit scoring models. Computers & Operations Research, 27(11):1131–1152, 2000.
  • Xu et al. [2025] Derek Qiang Xu, F. Olcay Cirit, Reza Asadi, Yizhou Sun, and Wei Wang. Mixture of In-Context Prompters for Tabular PFNs. In Proceedings of the 13th International Conference on Learning Representations, 2025.
  • Ye et al. [2024] Han-Jia Ye, Huai-Hong Yin, and De-Chuan Zhan. Modern Neighborhood Components Analysis: A Deep Tabular Baseline Two Decades Later. arXiv preprint arXiv:2407.03257, 2024.
  • Ye et al. [2025] Han-Jia Ye, Si-Yang Liu, and Wei-Lun Chao. A Closer Look at TabPFN v2: Strength, Limitation, and Extension. arXiv preprint arXiv:2502.17361, 2025.
  • Yıldız and Kalayci [2024] A. Yarkın Yıldız and Asli Kalayci. Gradient boosting decision trees on medical diagnosis over tabular data. arXiv preprint arXiv:2410.03705, 2024.
  • Zhou et al. [2021] Kaiyang Zhou, Yongxin Yang, Yu Qiao, and Tao Xiang. Domain Generalization with MixStyle. In Proceedings of the 9th International Conference on Learning Representations, 2021.
  • Zhou et al. [2025] Zhi Zhou, Kun-Yang Yu, Lan-Zhe Guo, and Yu-Feng Li. Fully Test-time Adaptation for Tabular Data. In Proceedings of the 39th AAAI Conference on Artificial Intelligence, 2025.
  • Zhou [2022] Zhi-Hua Zhou. Open-environment machine learning. National Science Review, 9(8):nwac123, 2022.
  • Zhou et al. [2019] Zhi-Hua Zhou, Yang Yu, and Chao Qian. Evolutionary learning: Advances in theories and algorithms. Springer, 2019.
  • Zhu et al. [2021] Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. arXiv preprint arXiv:2105.07624, 2021.
  • Zuluaga et al. [2013] Marcela Zuluaga, Guillaume Sergent, Andreas Krause, and Markus Püschel. Active learning for multi-objective optimization. In Proceedings of the 29th International Conference on Machine Learning, pages 462–470, 2013.

Appendix A TabPFN and TabPFN v2

A.1 TabPFN

Developed by[25], TabPFN reimagines classification through an innovative adaptation of a Transformer-based architecture. At its core, the method reformulates the classification task as a sequence processing problem with the following key components:

Data Representation.

Each data point $(x_i, y_i)$ undergoes $k$-dimensional standardization through linear projections:

$$x_i \mapsto \tilde{x}_i \in \mathbb{R}^{k}, \quad y_i \mapsto \tilde{y}_i \in \mathbb{R}^{k}$$

where zero-padding ensures all vectors conform to the predefined dimensionality $k$.

Contextual Learning Framework.

The model operates by constructing a dynamic context matrix $\mathcal{A}$ that jointly encodes the $N$ training samples and one test sample $x^{*}$:

$$\mathcal{A} = \begin{bmatrix} \tilde{x}_{1} \oplus \tilde{y}_{1} \\ \vdots \\ \tilde{x}_{N} \oplus \tilde{y}_{N} \\ \tilde{x}^{*} \end{bmatrix} \in \mathbb{R}^{(N+1) \times k}$$

where $\oplus$ denotes vector concatenation. This formulation treats each transformed data point as a token in a sequence, enabling flexible handling of varying dataset sizes.
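A minimal sketch of this tokenization, using raw features with zero-padding in place of TabPFN's learned linear projections (the projection step is deliberately omitted, so the numbers here are only placeholders for the $\tilde{x}_i$, $\tilde{y}_i$ embeddings):

```python
def pad(vec, k):
    """Zero-pad a feature/label vector to the fixed token width k."""
    return list(vec) + [0.0] * (k - len(vec))

def build_context(train_X, train_y, x_test, k):
    """Stack N (x_i concatenated with y_i) tokens plus the padded test
    instance, yielding a context of shape (N + 1) x k."""
    rows = [pad(list(x) + [y], k) for x, y in zip(train_X, train_y)]
    rows.append(pad(x_test, k))
    return rows

train_X = [[0.2, 1.5], [0.9, -0.3], [1.1, 0.7]]  # N = 3 samples, 2 features
train_y = [0, 1, 1]
x_test = [0.5, 0.5]

A = build_context(train_X, train_y, x_test, k=4)
print(len(A), len(A[0]))  # 4 4  -> (N + 1) x k
```

Every row is one "token" handed to the Transformer; the last row carries the unlabeled test instance whose output token is later decoded into class probabilities.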

Architecture.

The context matrix is processed through a stack of Transformer layers capable of handling variable-length token sequences, followed by a specialized MLP head that converts the test instance's output token into class probabilities.

The model’s distinctive approach lies in its in-context learning paradigm, where the prediction for test sample xsuperscript𝑥x^{*}italic_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT emerges from the Transformer’s processing of the entire augmented sequence containing both training and testing representations. This design eliminates the need for traditional iterative training while maintaining competitive accuracy on tabular tasks.

A.2 TabPFN v2

Building upon TabPFN, TabPFN v2[26] introduces three key architectural innovations that redefine feature processing in tabular data analysis:

Feature Space Transformation.

Each raw feature undergoes linear projection into a k𝑘kitalic_k-dimensional latent space, followed by controlled perturbation. This mechanism, characterized by[57] as a tokenization variant of[19]’s approach, creates unique positional identifiers for features.

Computational Framework.

The computational framework operates on a three-dimensional tensor structure and applies dual attention mechanisms: cross-sample attention for dataset-level patterns and intra-feature attention for feature relationships.

Knowledge Transfer.

Pre-trained weights are derived from synthetic data generated by structural causal models, facilitating zero-shot transfer and thereby addressing the diversity of tabular data.

TabPFN v2 has three fundamental constraints: (1) quadratic complexity scaling, (2) a dataset-size limit of $<10^{4}$ samples, and (3) a maximum class count ($\leq 10$ for classification tasks). Hence,[57] introduces a divide-and-conquer mechanism to address these limitations. To address the performance degradation on high-dimensional datasets, a method combining feature subset sampling and ensemble learning is employed. For the inadequate performance on large-scale datasets, two improved schemes, data-to-embedding and decision tree, are proposed. To tackle the inapplicability to multi-class tasks, the Decimal Encoding and ECOC methods are utilized.
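The feature-subset-sampling-plus-ensembling idea for high-dimensional inputs can be sketched schematically as follows; the 1-nearest-neighbour base learner is a stand-in stub, not the actual mechanism of [57]:

```python
import random
from collections import Counter

def subset_ensemble_predict(X_train, y_train, X_test, base_fit_predict,
                            n_members=5, subset_size=3, seed=0):
    """Fit one base model per random feature subset, then majority-vote."""
    rng = random.Random(seed)
    n_features = len(X_train[0])
    votes = [[] for _ in X_test]
    for _ in range(n_members):
        cols = rng.sample(range(n_features), subset_size)  # random feature subset
        sub_train = [[row[c] for c in cols] for row in X_train]
        sub_test = [[row[c] for c in cols] for row in X_test]
        preds = base_fit_predict(sub_train, y_train, sub_test)
        for vote_list, p in zip(votes, preds):
            vote_list.append(p)
    return [Counter(v).most_common(1)[0][0] for v in votes]

# Stub base learner: 1-nearest-neighbour on the sampled feature subset.
def one_nn(train_X, train_y, test_X):
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [train_y[min(range(len(train_X)), key=lambda i: dist(train_X[i], x))]
            for x in test_X]

X_train = [[0, 0, 0, 0, 0, 1], [1, 1, 1, 1, 1, 0]]
y_train = [0, 1]
X_test = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]]
print(subset_ensemble_predict(X_train, y_train, X_test, one_nn))  # [0, 1]
```

Each ensemble member only ever sees `subset_size` columns, which is what keeps the per-member input dimensionality within the base model's comfort zone.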

While current research[26, 57] has thoroughly assessed TabPFN v2’s performance, these evaluations primarily focus on its performance in closed environments. This leaves a critical gap in understanding how the model adapts to open environments. To fully realize TabPFN v2’s potential and explore its practical value, we conduct comprehensive evaluations. These evaluations focus on TabPFN v2’s performance under various open environments challenges.

Appendix B Tabular Challenges in Open Environments

B.1 Emerging New Classes

In closed environments, it is commonly assumed that the label of any testing sample must come from the label set used during training. However, this assumption is not always valid in open environments. For instance, in a forest disease monitoring system that relies on a machine learning model trained with signals from sensors deployed in the forest, it is impractical to enumerate all possible classes in advance, as some forest diseases may be entirely novel, such as those caused by invasive insect pests that have never been encountered in the region before.

B.2 Decremental/Incremental Features

Decremental/Incremental Features are another open environments challenge, wherein the feature set previously utilized as inputs is either partially removed or expanded by new features, also known as feature shift. Given a forest disease monitoring system that relies on a machine learning model trained with signals from sensors deployed in the forest, certain existing sensors may cease to function, leading to a reduction of the feature set (Decremental Features). Meanwhile, additional sensors may be deployed to monitor, resulting in an expansion of the feature set (Incremental Features).

B.3 Changing Data Distributions

Machine learning research in closed environments generally assumes that all data in both the training and testing phases are independent samples from the identical distribution. Unfortunately, this assertion does not always hold true in open environments. In the forest disease monitoring system, the model may be built in summer based on sensor signals specific to that season, but it is expected to perform well across all seasons.

B.4 Varied Learning Objectives

The performance of the machine learning model f𝑓fitalic_f can be measured by a learning objective Mfsubscript𝑀𝑓M_{f}italic_M start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT, such as accuracy, F1 score, or ROC-AUC. Learning towards different objectives may lead to a model with different strengths. Being optimal on one measure does not mean that the model will also be optimal on other measures. Machine learning research in closed environments generally assumes that the Mfsubscript𝑀𝑓M_{f}italic_M start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT used to measure model performance is invariant and known in advance. However, this assertion may not invariably be valid in open environments. In the forest disease monitoring system, the sensor dispatch task may prioritize different objectives over time. Initially, various sensors are dispatched to pursue high monitoring accuracy; later, after a relatively high accuracy has been achieved, different sensors may be used to ensure that the system operates with minimal energy consumption. When facing this challenge, the model should be able to perform well on various learning objectives without requiring the data to be recollected and a completely new model to be trained.

Appendix C Evaluation Framework

To facilitate the use of our proposed evaluation framework, we provide a set of APIs. More details are available at https://anonymous.4open.science/r/tabpfn-ood-4E65. The API accepts four parameters: dataset, model, task, and export_dataset. We will give further specifications in the supplementary material and the repository readme.md.

The dataset parameter specifies the full name of the dataset to be used. Our evaluation framework supports datasets from OpenML, Kaggle, and local directories.

The model parameter defines the model to be evaluated and can be selected from tree-based models and deep-learning models, which we evaluated in this paper. New models can be added by following the instructions in the "How to Add New Models" section.

The task parameter determines the type of feature-shift experiment to be conducted. The available options include emerging new classes (enc), decremental features (df), changing data distributions (cdd), and varied learning objectives (vlo).

The export_dataset parameter controls whether the modified dataset—corresponding to a specific open environments challenge—is exported as a CSV file for further use.

An example command for running the evaluation framework is provided in the repository's readme.md.

Appendix D General Experimental Settings

D.1 Training Settings

Deep learning models are trained on an NVIDIA 4090 GPU. Tree-based models are trained on an AMD Ryzen 5 7500F 6-Core Processor. All experimental results are reported as the average of three different random seeds to ensure statistical reliability.

D.2 Models

In this subsection, we provide detailed descriptions of all the models used in our paper.

XGBoost

XGBoost[9] is an efficient and flexible machine learning model that incrementally builds multiple decision trees by optimizing the loss function, with each tree correcting the errors of the previous one to continuously improve the model's predictive performance. XGBoost also incorporates the gradient boosting algorithm, iteratively training decision tree-based models with the goal of minimizing residuals and enhancing predictive accuracy.

CatBoost

CatBoost[44] is a powerful boosting-based model designed for efficient handling of categorical features. It uses the "Ordered Boosting" technique, which calculates gradients sequentially to prevent target leakage and maintain the independence of each training instance. At the same time, CatBoost employs "Target-based Categorical Encoding," converting categorical variables into numerical representations based on target statistics, thereby reducing the need for extensive preprocessing and improving model performance.

RandomForest

RandomForest[7] is a classical ensemble learning method based on bagging and decision trees. It constructs a multitude of decision trees during training and outputs the mode or mean prediction of individual trees. Its robustness to overfitting, strong performance with minimal tuning, and ability to handle both classification and regression tasks make it a widely used baseline in tabular data benchmarks.

MLP

An MLP consists of multiple layers of neurons, with each layer fully connected to the next. An MLP contains at least three layers: an input layer, one or more hidden layers, and an output layer. It continuously adjusts the connection weights between neurons through training methods such as the backpropagation algorithm and gradient descent to minimize prediction errors.

ModernNCA

ModernNCA[56] is an enhanced Neighborhood Component Analysis (NCA) model that improves tabular data processing by adjusting learning objectives, integrating deep learning architectures, and using stochastic neighbor sampling for better efficiency and accuracy.

RealMLP

RealMLP[27] is an enhanced multilayer perceptron designed for tabular data tasks, combining architectural improvements with meta-learned default hyperparameters. It achieves a strong balance between accuracy and training efficiency.

D.3 Hyperparameter Tuning

In this subsection, we provide the hyperparameter grids of tree-based and deep learning models in Tables 4 and 5.

For tree-based models, we employ GridSearchCV from the scikit-learn library to conduct an exhaustive hyperparameter search. This approach systematically explores a predefined parameter grid through 5-fold cross-validation to ensure the reproducibility of results. The search process is optimized for computational efficiency by enabling parallel processing.
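A minimal sketch of this setup is shown below; the grid here is a small illustrative one (the grids actually used are those in Table 4), and the synthetic data stands in for our datasets:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Illustrative grid; see Table 4 for the grids actually used.
param_grid = {"min_samples_split": [2, 5, 10],
              "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=5,        # 5-fold cross-validation, as in our protocol
    n_jobs=-1,   # parallel search for computational efficiency
)
search.fit(X, y)
print(search.best_params_)
```

Fixing `random_state` in both the data and the estimator keeps the exhaustive search reproducible, which is the point of the protocol described above.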

Regarding deep learning models, we implement an adaptive hyperparameter optimization strategy based on the Optuna framework[1], following methodologies established in prior studies[36]. The optimization protocol maintains a constant batch size of 1024 and performs 100 independent trials using training-validation splits to prevent potential data leakage from the test set.

Table 4: Hyperparameter search grids for tree-based models.

| Model | Hyperparameter | Values |
|---|---|---|
| XGBoost | Learning Rate | {0.01, 0.1} |
| XGBoost | Max. Depth | {1, 5, 9} |
| XGBoost | N Estimators | {10000, 20000, 30000} |
| XGBoost | Subsample | {0.5, 0.8, 1.0} |
| XGBoost | Colsample Bytree | {0.5, 0.8, 1.0} |
| XGBoost | Min Child Weight | {1, 3, 5} |
| CatBoost | Learning Rate | {0.01, 0.05, 0.1} |
| CatBoost | Depth | {4, 6, 8} |
| CatBoost | Iterations | {500, 1000, 2000} |
| RandomForest | Min Samples Split | [2, 10] |
| RandomForest | Min Samples Leaf | [1, 10] |

Table 5: Hyperparameter search spaces for deep learning models.

| Model | Hyperparameter | Values |
|---|---|---|
| MLP | D_layers | {1, 8, 64, 512} |
| MLP | Dropout | Uniform[0.0, 0.5] |
| MLP | Learning Rate | LogUniform[e^-5, 0.01] |
| MLP | Weight Decay | LogUniform[e^-6, 0.001] |
| ModernNCA | Dropout | Uniform[0.0, 0.5] |
| ModernNCA | D_block | Int[64, 1024] |
| ModernNCA | N_blocks | Int[0, 2] |
| ModernNCA | N_frequencies | Int[16, 96] |
| ModernNCA | Frequency Scale | LogUniform[0.005, 10] |
| ModernNCA | D_embedding | Int[16, 64] |
| ModernNCA | Sample Rate | Uniform[0.05, 0.6] |
| ModernNCA | Learning Rate | LogUniform[e^-5, 0.1] |
| ModernNCA | Weight Decay | LogUniform[e^-6, 0.001] |
| RealMLP | Num Emb Type | {none, pbld, pl, plr} |
| RealMLP | Add Front Scale | {True, False} |
| RealMLP | Learning Rate (lr) | LogUniform[0.02, 0.3] |
| RealMLP | Dropout (p_drop) | {0.00, 0.15, 0.30} |
| RealMLP | Activation (act) | {selu, relu, mish} |
| RealMLP | Hidden Sizes | {[256, 256, 256], [64, 64, 64, 64, 64], [512]} |
| RealMLP | Weight Decay (wd) | {0.0, 0.02} |
| RealMLP | PLR Sigma | LogUniform[0.05, 0.5] |
| RealMLP | Label Smoothing Epsilon (ls_eps) | {0.0, 0.1} |

Appendix E Emerging New Classes

E.1 Dataset

Eye Movements

This dataset is designed to predict the relevance of sentences in relation to a given question based on eye movement data. The target is to classify sentences as irrelevant, relevant, or correct, using 27 features, including landing position, first fixation duration, next fixation duration, time spent on the predicted region, and other relevant eye movement metrics. This dataset is available at https://www.kaggle.com/datasets/vinnyr12/eye-movements.

Contraceptive Method Choice (CMC)

This dataset contains 1,473 instances with 10 demographic and socio-economic attributes, originally derived from the 1987 National Indonesia Contraceptive Prevalence Survey. Each instance represents a married woman who was not pregnant (or unsure) at the time of the interview. The target is to predict the contraceptive method currently used by the individual, categorized into three classes: no-use, long-term methods, and short-term methods. This dataset was prepared by Tjen-Sien Lim and is available at https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice.

Wine Quality (Red and White)

This dataset includes two subsets related to red and white Vinho Verde wine samples from the north of Portugal. Each sample is described by 11 physicochemical attributes (e.g., acidity, sugar, pH) and a quality score ranging from 1 to 4. The task is to predict the sensory quality of the wine based on its physicochemical properties. The dataset was first introduced by Cortez et al.[12] and is available at https://archive.ics.uci.edu/dataset/186/wine+quality.

E.2 Results

To evaluate a model’s ability to detect novel classes, we adopt a leave-one-class-out protocol. For each class label, we exclude its samples from training and treat them as novel (label 1) during testing. An equal number of samples from the remaining classes are randomly sampled as known (label 0). After training the model on the reduced dataset, we compute confidence scores for test samples using the maximum predicted probability. Samples with low confidence (within a fixed interval [θ_min, θ_max]) are predicted as novel. We assess performance using ROC-AUC and AUPR, measuring the model’s ability to separate known and novel instances based on confidence.

To further assess whether models exhibit appropriate uncertainty when encountering novel classes, we hold out one or more classes during training and evaluate the predicted probabilities on test samples from these unseen classes. A prediction is deemed uncertain if its maximum confidence falls within a predefined low-confidence interval [a, b]. We report the proportion of novel samples falling into this interval as an indicator of the model’s ability to recognize unfamiliar inputs. This metric complements ROC-AUC and AUPR by directly measuring how often the model expresses uncertainty when presented with out-of-distribution classes.
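Given a matrix of predicted class probabilities for the held-out novel samples, this interval-based uncertainty metric reduces to a few lines. The NumPy sketch below follows the definition above; the probability values are fabricated for illustration:

```python
import numpy as np

def uncertain_fraction(probs, low=0.4, high=0.6):
    """Fraction of samples whose maximum predicted probability
    falls inside the low-confidence interval [low, high]."""
    conf = probs.max(axis=1)
    return np.mean((conf >= low) & (conf <= high))

# Toy predictions for four novel samples of a binary task.
probs = np.array([[0.50, 0.50],
                  [0.55, 0.45],
                  [0.90, 0.10],
                  [0.62, 0.38]])
print(uncertain_fraction(probs))  # 0.5: two of the four fall in [0.4, 0.6]
```

Narrowing the interval (e.g., to [0.49, 0.51]) makes the criterion stricter, which is exactly why the reported fractions shrink from Table 6 to Table 8.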

Table 6, Table 7, and Table 8 show the results for uncertainty intervals [0.4, 0.6], [0.45, 0.55], and [0.49, 0.51], respectively.

Table 6: Proportion of novel-class samples assigned low confidence, interval [0.4, 0.6].

| Model | EyeMovement | CMC | Wine-Red | Wine-White |
|---|---|---|---|---|
| RandomForest | 0.716 | 0.520 | 0.764 | 0.562 |
| XGBoost | 0.107 | 0.159 | 0.061 | 0.052 |
| CatBoost | 0.150 | 0.183 | 0.127 | 0.070 |
| MLP | 0.029 | 0.356 | 0.083 | 0.068 |
| RealMLP | 0.198 | 0.523 | 0.000 | 0.000 |
| ModernNCA | 0.101 | 0.323 | 0.182 | 0.433 |
| TabPFN v2 | 0.139 | 0.300 | 0.246 | 0.400 |

Table 7: Proportion of novel-class samples assigned low confidence, interval [0.45, 0.55].

| Model | EyeMovement | CMC | Wine-Red | Wine-White |
|---|---|---|---|---|
| RandomForest | 0.340 | 0.351 | 0.244 | 0.294 |
| XGBoost | 0.054 | 0.075 | 0.027 | 0.023 |
| CatBoost | 0.077 | 0.095 | 0.040 | 0.031 |
| MLP | 0.014 | 0.203 | 0.037 | 0.052 |
| RealMLP | 0.091 | 0.272 | 0.000 | 0.000 |
| ModernNCA | 0.051 | 0.178 | 0.085 | 0.205 |
| TabPFN v2 | 0.072 | 0.156 | 0.142 | 0.241 |

Table 8: Proportion of novel-class samples assigned low confidence, interval [0.49, 0.51].

| Model | EyeMovement | CMC | Wine-Red | Wine-White |
|---|---|---|---|---|
| RandomForest | 0.127 | 0.054 | 0.038 | 0.005 |
| XGBoost | 0.012 | 0.016 | 0.001 | 0.001 |
| CatBoost | 0.016 | 0.019 | 0.001 | 0.010 |
| MLP | 0.003 | 0.038 | 0.003 | 0.004 |
| RealMLP | 0.019 | 0.053 | 0.000 | 0.000 |
| ModernNCA | 0.009 | 0.039 | 0.018 | 0.046 |
| TabPFN v2 | 0.015 | 0.036 | 0.020 | 0.030 |

Appendix F Decremental/Incremental Features

F.1 Dataset

We refer to TabFSBench[10] for details of the evaluated datasets.

Credit

The original dataset contains 1,000 entries with 20 categorical/symbolic attributes prepared by Prof. Hofmann. In this dataset, each entry represents a person who takes a credit from a bank. Each person is classified as having good or bad credit risk according to the set of attributes. The target is to determine whether the customer’s credit is good or bad. This dataset is available at https://www.openml.org/search?type=data&sort=runs&id=31&status=active.

Electricity

The Electricity dataset, collected from the Australian New South Wales Electricity Market, contains 45,312 instances from May 1996 to December 1998. Each instance represents a 30-minute period and includes fields for the day, timestamp, electricity demand in New South Wales and Victoria, scheduled electricity transfer, and a class label. The target is to predict whether the price in New South Wales is up or down relative to a 24-hour moving average, based on market demand and supply fluctuations. This dataset is available on https://www.kaggle.com/datasets/vstacknocopyright/electricity.

Heart

Cardiovascular diseases (CVDs) are the leading cause of death globally, responsible for 17.9 million deaths annually. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict a possible heart disease. The target is to determine whether the patient’s heart disease is present or absent. This dataset is available on https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction.

Miniboone

The MiniBooNE Particle Identification dataset is a binary classification task. The target is to determine whether a neutrino event corresponds to an electron neutrino or a muon neutrino. This dataset is available at https://www.kaggle.com/datasets/alexanderliapatis/miniboone.

Iris

The Iris flower dataset, introduced by Ronald Fisher in 1936, contains 150 samples from three Iris species: Iris setosa, Iris virginica, and Iris versicolor. Each sample has four features: sepal length, sepal width, petal length, and petal width, measured in centimeters. The target is to classify the Iris species as setosa, versicolor, or virginica. This dataset is available on https://www.kaggle.com/datasets/uciml/iris.

Jannis

This dataset is used in the tabular benchmark from[20]. It belongs to the ’classification on numerical features’ benchmark. The dataset is designed to test classification performance using numerical features, and it presents challenges such as varying data distributions, class imbalances, and potential missing values. It serves as a critical evaluation tool for machine learning models in real-world scenarios, including medical diagnosis, credit rating, and object recognition tasks. This dataset is available on https://www.openml.org/search?type=data&status=active&id=45021.

Penguins

Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. The goal of the Palmer Penguins dataset is to offer a comprehensive resource for data exploration and visualization, serving as an alternative to the Iris dataset. The target is to classify the penguin species as Adelie, Chinstrap, or Gentoo. This dataset is available at https://www.kaggle.com/datasets/youssefaboelwafa/clustering-penguins-species.

Table 9: Statistics of the selected TableShift datasets.

| Dataset | Train Size | ID Test | OOD Test | Total | Num. Columns | Cat. Columns | # Classes | Δ_x (Eqn. 2) | Δ_y\|x (Eqn. 3) | Δ_y (Eqn. 4) |
|---|---|---|---|---|---|---|---|---|---|---|
| college_scorecard | 43,908 | 5,488 | 602 | 49,998 | 118 | 0 | 2 | 43,566.39 | 2116.63 | 0.0337 |
| brfss_diabetes | 37,284 | 4,660 | 8,054 | 49,998 | 142 | 0 | 2 | 12.28 | 0.10 | 0.0332 |
| diabetes_readmission | 19,146 | 2,393 | 28,460 | 49,999 | 183 | 0 | 2 | 42.37 | 1.30 | 0.0060 |

Table 10: Statistics of the selected WhyShift settings.

| Dataset | Train | ID Test | OOD Test | Total | # Features | Setting | Shift Pattern |
|---|---|---|---|---|---|---|---|
| ACS Income (CA→PR) | 38,227 | 9,557 | 2,215 | 49,999 | 9 | California → Puerto Rico | Y\|X ≫ X |
| ACS Mobility (MS→HI) | 4,254 | 1,064 | 2,733 | 8,051 | 21 | Mississippi → Hawaii | Y\|X ≫ X |
| ACS Pub.Cov (NE→LA) | 23,211 | 5,065 | 1,267 | 16,879 | 18 | Nebraska → Louisiana | Y\|X ≫ X |
| ACS Pub.Cov (2010→2017) | 20,501 | 5,126 | 24,372 | 49,999 | 18 | 2010 (NY) → 2017 (NY) | Y\|X ≪ X |
| ACS Income (Young 80%) | 20,000 | 5,000 | 25,000 | 50,000 | 9 | Younger People (80%) | Y\|X ≪ X |
| ACS Income (Young 90%) | 20,000 | 5,000 | 25,000 | 50,000 | 9 | Younger People (90%) | Y\|X ≪ X |

Eye Movements

This dataset is designed to predict the relevance of sentences in relation to a given question based on eye movement data. The target is to classify sentences as irrelevant, relevant, or correct, using 27 features, including landing position, first fixation duration, next fixation duration, time spent on the predicted region, and other relevant eye movement metrics. This dataset is available at https://www.kaggle.com/datasets/vinnyr12/eye-movements.

Abalone

The age of abalone is traditionally determined by cutting the shell, staining it, and counting the rings under a microscope, a process that is both tedious and time-consuming. This dataset uses easier-to-obtain physical measurements, such as length, diameter, and weight, to predict the abalone’s age. The target is to predict the age, providing a more efficient approach. This dataset is available on https://www.kaggle.com/datasets/rodolfomendes/abalone-dataset.

Bike

The dataset records the rental of shared bikes in the Washington area from 2011-01-01 to 2012-12-31, containing 11 features such as season, holiday, working day, and weather conditions. The target is to predict the total count of bikes rented each hour, based on historical rental patterns and external factors like temperature, humidity, and seasonal trends. This dataset is available on https://www.kaggle.com/datasets/abdullapathan/bikesharingdemand.

Concrete

Concrete is the most important material in civil engineering, and its compressive strength is influenced by a highly nonlinear relationship with its ingredients and age. The dataset contains 9 attributes, including variables such as cement, water, and age. The target is to predict the concrete compressive strength (measured in MPa) using these input variables. This dataset is available on https://www.kaggle.com/datasets/maajdl/yeh-concret-data.

Laptop

The original dataset was relatively compact, with many details embedded in each column. The columns mostly consisted of long strings of data, which were relatively human-readable and concise. However, for Machine Learning algorithms to work more efficiently, it is better to separate different details into individual columns. After doing so, 28 duplicate rows were exposed and removed. The cleaned dataset serves as the final result. The target is to predict the price of a laptop based on its specifications. This dataset is available on https://www.kaggle.com/datasets/owm4096/laptop-prices.

F.2 Performance Gap

Following TabFSBench[10], we measure model robustness in feature-shift scenarios by the percentage performance gap Δ,

\Delta = \frac{metric_i - metric_0}{metric_0}    (1)

where metric_i denotes the model performance when i features shift, and metric_0 the performance without any shift. In subsequent sections, we use metric to refer to performance and Δ to refer to robustness.
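As a concrete sketch, Eqn. 1 is simply a relative performance change:

```python
def performance_gap(metric_i, metric_0):
    """Relative performance change when i features shift (Eqn. 1).
    For higher-is-better metrics, negative values indicate degradation."""
    return (metric_i - metric_0) / metric_0

# Accuracy drops from 0.80 (no shift) to 0.72 (i features shifted).
print(performance_gap(0.72, 0.80))  # -0.1, i.e., a 10% relative drop
```

For lower-is-better metrics such as RMSE, the sign flips: a positive Δ indicates degradation.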

F.3 Results

To systematically assess TabPFN v2 in the presence of decremental features, we design random-shift experiments in TabFSBench. We use accuracy and ROC-AUC for classification tasks and RMSE for regression tasks. Table 2 provides detailed model performance.

Appendix G Changing Data Distributions

G.1 Dataset

We evaluate our models on two established benchmarks for distribution shift in tabular data: TableShift[18] and WhyShift[34].

From the TableShift benchmark, we select three datasets: College Scorecard, Hospital Readmission, and Diabetes. To ensure scalability and consistency across experiments, we apply stratified subsampling to limit each dataset to 50,000 total instances while preserving the original train/test split ratio. Detailed statistics are provided in Table 9.

From the WhyShift benchmark, we adopt six pre-defined settings provided by the original paper, covering a variety of real-world covariate and concept shift scenarios. For datasets containing more than 50,000 instances, we similarly apply stratified subsampling to retain a total of 50,000 samples. The configuration and statistics for all selected WhyShift settings are summarized in Table 10.

G.2 Domain Shift Metrics

Domain shift can be categorized into covariate shift, concept shift, and label shift. We adopt the metrics proposed in TableShift[18] to quantify the degree of these three types of domain shift.

Measuring Covariate Shift with OTDD:

\Delta_x = \mathrm{OTDD}(\mathcal{D}^{\mathrm{train}}, \mathcal{D}^{\mathrm{test}})    (2)

Here, D^train and D^test denote the source- and target-domain datasets, respectively, and OTDD is the Optimal Transport Dataset Distance, computed under a Gaussian approximation[3].

Measuring Concept Shift with Frechet Dataset Distance (FDD):

Inspired by the widely used Frechet Inception Distance (FID) in machine learning[24], FDD utilizes intermediate representations of a classifier to quantify distributional discrepancies. It calculates the Frechet distance (also known as the Wasserstein-2 distance) between two distributions to assess the extent of concept shift.

The computation of this metric proceeds as follows. First, a classifier (we use MLPs) is trained on the source domain using the best hyperparameters obtained through hyperparameter search. Then, for each input x ∈ D, we compute the activation values at each layer of the model, obtaining the activation vector x̂ := f_θ[i](x), where i denotes the i-th layer of the model. Finally, the Fréchet Dataset Distance is calculated to measure the divergence between the two distributions.

\mathrm{FDD}(\mathcal{D}^{\mathrm{train}}, \mathcal{D}^{\mathrm{test}}) = \lVert \mu_{\mathcal{D}^{\mathrm{train}}} - \mu_{\mathcal{D}^{\mathrm{test}}} \rVert^2 + \mathrm{Tr}\left( \Sigma_{\mathcal{D}^{\mathrm{train}}} + \Sigma_{\mathcal{D}^{\mathrm{test}}} - 2\sqrt{\Sigma_{\mathcal{D}^{\mathrm{train}}} \Sigma_{\mathcal{D}^{\mathrm{test}}}} \right)    (3)

Here, μ_D is the mean of the activation vectors extracted from domain D, and Σ_D is the corresponding covariance matrix. In the discussion below, we refer to the resulting measure as Δ_{y|x}. A lower FDD score indicates a smaller distance between the representation distributions of the training domain D_train and the test domain D_test.
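Under these definitions, the FDD of Eqn. 3 between two sets of layer activations can be sketched as follows, mirroring the standard FID computation (the toy activation matrices here are random stand-ins, not real model activations):

```python
import numpy as np
from scipy.linalg import sqrtm

def fdd(acts_train, acts_test):
    """Fréchet Dataset Distance between two sets of layer activations
    (rows = samples, columns = activation dimensions), per Eqn. 3."""
    mu1, mu2 = acts_train.mean(0), acts_test.mean(0)
    s1 = np.cov(acts_train, rowvar=False)
    s2 = np.cov(acts_test, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):  # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2 * covmean))

rng = np.random.default_rng(0)
a = rng.normal(0, 1, (500, 4))
b = rng.normal(0, 1, (500, 4))
print(fdd(a, a))  # ~0 for identical activation sets
print(fdd(a, b))  # small residual value from sampling noise
```

Shifting one set's mean while keeping its covariance (e.g., `a + 5.0`) makes only the squared-mean term of Eqn. 3 grow, which is a quick sanity check on the implementation.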


Measuring Label Shift:

TableShift[18] proposes a simple formula to quantify the label shift between the source and target distributions:

\Delta_y = \lVert \bar{y}_{\mathcal{D}^{\mathrm{train}}} - \bar{y}_{\mathcal{D}^{\mathrm{test}}} \rVert^2    (4)

In this equation, \bar{y}_D = \frac{1}{|D|} \sum_{i \in D} y_i represents the average label value computed from samples within domain D. Given that all tasks in our study are binary classification, this formulation captures the squared L2 distance between the class prior probabilities of the source and target domains.
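For the binary tasks considered here, Eqn. 4 reduces to the squared difference of positive-class rates; a minimal sketch with toy labels:

```python
import numpy as np

def label_shift(y_train, y_test):
    """Squared L2 distance between mean labels (class priors for binary y), Eqn. 4."""
    return float(np.sum((np.mean(y_train) - np.mean(y_test)) ** 2))

y_tr = np.array([0, 0, 1, 1])   # 50% positives in the source domain
y_te = np.array([0, 1, 1, 1])   # 75% positives in the target domain
print(label_shift(y_tr, y_te))  # 0.0625 = (0.75 - 0.5)^2
```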

Quantifying the Contribution of X-Shifts and Y|X-Shifts

To gain a fine-grained understanding of the sources of model performance degradation under distribution shifts, we adopt the DIStribution Shift DEcomposition (DISDE) framework proposed by [8]. This framework decomposes the overall generalization gap into components attributed to covariate shifts (X-shifts) and conditional label shifts (Y|X-shifts).

Formally, for a model f_P trained on a source distribution P and evaluated on a target distribution Q, DISDE decomposes the generalization gap as:

\mathbb{E}_Q[\ell(f_P(X), Y)] - \mathbb{E}_P[\ell(f_P(X), Y)]
  = \mathbb{E}_{S_X}[R_P(X)] - \mathbb{E}_P[R_P(X)]      (I)
  + \mathbb{E}_{S_X}[R_Q(X) - R_P(X)]                    (II)
  + \mathbb{E}_Q[R_Q(X)] - \mathbb{E}_{S_X}[R_Q(X)]      (III)    (5)

where R_μ(x) = \mathbb{E}_μ[\ell(f_P(X), Y) \mid X = x] denotes the conditional expected loss under distribution μ ∈ {P, Q}, and S_X is an auxiliary distribution over X whose support is contained within both P_X and Q_X.

Each term in Eqn. 5 corresponds to a specific type of shift:

  • Terms (I) and (III) reflect changes due to differences in the marginal distribution of covariates, i.e., the contribution of X-shifts.

  • Term (II) captures the shift in the conditional distribution of labels given features, corresponding to Y|X-shifts.

Building upon this decomposition, we utilize the open-source WhyShift package (https://github.com/namkoong-lab/whyshift), which implements DISDE in a scalable and extensible manner. This allows us to rigorously quantify the relative impact of X-shifts and Y|X-shifts on performance degradation across datasets and domains, providing deeper insight into model robustness under open-environment evaluation settings.

G.3 Results

We evaluate TabPFN v2 alongside several mainstream tabular models under changing data distribution scenarios, using metrics including Accuracy, Balanced Accuracy, F1-score, and ROC-AUC. The evaluation is conducted on nine fully numerical datasets drawn from the WhyShift[34] and TableShift[18] benchmarks, which cover three types of data distribution scenarios. To accommodate memory constraints and the current limitations of TabPFN in handling very large datasets, we apply stratified subsampling (up to 50,000 instances) while preserving the original train/test splits. Figures 4, 5, 6, and 7 show results on changing data distributions. Although recent work has attempted to apply TabPFN to scenarios involving temporal distribution shift[22], its implementation is not publicly available, so we do not include this method in our evaluation.


Appendix H Varied Learning Objectives

We conduct an analysis across four primary classification learning objectives: accuracy, ROC-AUC, F1-score, and Balanced Accuracy. The analysis is performed on the i.i.d. datasets employed in the changing data distributions experiments. Table 11 reports results under varied learning objectives.

Table 11: Results across four classification learning objectives.

| Objective | Model | ACS Income (CA-PR) | ACS Mobility (MS-HI) | ACS Pub.Cov (NE-LA) | ACS Pub.Cov (2010-2017) | ACS Income (Setting 21) | ACS Income (Setting 22) | college_scorecard | brfss_diabetes | diabetes_readmission |
|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | XGBoost | 0.813 | 0.788 | 0.819 | 0.831 | 0.845 | 0.864 | 0.949 | 0.872 | 0.642 |
| | CatBoost | 0.815 | 0.801 | 0.827 | 0.836 | 0.852 | 0.870 | 0.949 | 0.874 | 0.651 |
| | MLP | 0.781 | 0.760 | 0.793 | 0.794 | 0.821 | 0.843 | 0.934 | 0.869 | 0.569 |
| | ModernNCA | 0.810 | 0.800 | 0.819 | 0.822 | 0.842 | 0.859 | 0.946 | 0.874 | 0.651 |
| | RandomForest | 0.805 | 0.800 | 0.821 | 0.823 | 0.843 | 0.859 | 0.943 | 0.876 | 0.649 |
| | RealMLP | 0.812 | 0.796 | 0.814 | 0.819 | 0.840 | 0.856 | 0.946 | 0.875 | 0.653 |
| | TabPFN v2 | 0.806 | 0.804 | 0.824 | 0.834 | 0.848 | 0.867 | 0.938 | 0.875 | 0.651 |
| ROC-AUC | XGBoost | 0.893 | 0.815 | 0.817 | 0.790 | 0.848 | 0.865 | 0.978 | 0.807 | 0.679 |
| | CatBoost | 0.897 | 0.832 | 0.833 | 0.802 | 0.861 | 0.877 | 0.978 | 0.817 | 0.697 |
| | MLP | 0.848 | 0.758 | 0.757 | 0.721 | 0.791 | 0.807 | 0.959 | 0.809 | 0.585 |
| | ModernNCA | 0.893 | 0.825 | 0.821 | 0.812 | 0.839 | 0.855 | 0.976 | 0.812 | 0.689 |
| | RandomForest | 0.886 | 0.833 | 0.833 | 0.823 | 0.847 | 0.861 | 0.971 | 0.817 | 0.689 |
| | RealMLP | 0.893 | 0.806 | 0.804 | 0.801 | 0.828 | 0.845 | 0.950 | 0.808 | 0.685 |
| | TabPFN v2 | 0.888 | 0.832 | 0.831 | 0.779 | 0.858 | 0.873 | 0.968 | 0.816 | 0.688 |
| F1-Score | XGBoost | 0.770 | 0.815 | 0.732 | 0.471 | 0.717 | 0.693 | 0.792 | 0.239 | 0.522 |
| | CatBoost | 0.772 | 0.824 | 0.735 | 0.468 | 0.719 | 0.694 | 0.789 | 0.210 | 0.517 |
| | MLP | 0.732 | 0.790 | 0.699 | 0.434 | 0.676 | 0.645 | 0.739 | 0.157 | 0.487 |
| | ModernNCA | 0.762 | 0.817 | 0.722 | 0.646 | 0.651 | 0.639 | 0.778 | 0.129 | 0.507 |
| | RandomForest | 0.749 | 0.814 | 0.709 | 0.628 | 0.626 | 0.609 | 0.748 | 0.058 | 0.446 |
| | RealMLP | 0.764 | 0.813 | 0.719 | 0.646 | 0.650 | 0.636 | 0.786 | 0.182 | 0.534 |
| | TabPFN v2 | 0.755 | 0.817 | 0.721 | 0.416 | 0.702 | 0.675 | 0.728 | 0.044 | 0.516 |
| Balanced Accuracy | XGBoost | 0.801 | 0.723 | 0.721 | 0.658 | 0.739 | 0.741 | 0.860 | 0.567 | 0.617 |
| | CatBoost | 0.807 | 0.720 | 0.715 | 0.656 | 0.733 | 0.735 | 0.855 | 0.557 | 0.622 |
| | MLP | 0.775 | 0.708 | 0.703 | 0.642 | 0.718 | 0.720 | 0.839 | 0.551 | 0.558 |
| | ModernNCA | 0.801 | 0.708 | 0.702 | 0.685 | 0.705 | 0.710 | 0.849 | 0.531 | 0.619 |
| | RandomForest | 0.791 | 0.695 | 0.687 | 0.670 | 0.685 | 0.686 | 0.814 | 0.514 | 0.604 |
| | RealMLP | 0.803 | 0.723 | 0.714 | 0.695 | 0.714 | 0.716 | 0.867 | 0.548 | 0.628 |
| | TabPFN v2 | 0.794 | 0.711 | 0.704 | 0.631 | 0.718 | 0.718 | 0.810 | 0.510 | 0.621 |


References
