Brain age, most commonly inferred from T1-weighted magnetic resonance images (T1w MRI), is a robust biomarker of brain health and related diseases. The highest accuracy in brain age prediction, with prediction errors often in the 2–3 year range, is achieved predominantly by deep neural networks. However, comparing results across studies is difficult because of differences in datasets, evaluation methodologies and metrics. To address this, we introduce Brain Age Standardized Evaluation (BASE), which comprises (i) a standardized T1w MRI dataset with multi-site, new unseen site, test-retest and longitudinal data; (ii) an associated evaluation protocol, including repeated model training and, based on it, a comprehensive set of performance metrics measuring the accuracy, robustness, reproducibility and consistency of brain age predictions; and (iii) a statistical evaluation framework based on linear mixed-effects models for rigorous performance assessment and cross-comparison. To showcase BASE, we comprehensively evaluate four deep learning based brain age models, appraising their performance in scenarios that use multi-site, test-retest, unseen site, and longitudinal T1w brain MRI datasets. To ensure full reproducibility and facilitate application in future studies, we have made all associated data information and code publicly accessible at
https://github.com/AralRalud/BASE.git.