Face morphing attacks pose a growing threat to biometric systems, exacerbated by the rapid emergence of powerful generative techniques that enable realistic and seamless facial image manipulations. To address this challenge, we introduce SelfMAD++, a robust and generalized single-image morphing attack detection (S-MAD) framework. Unlike our previous work SelfMAD, which introduced a data augmentation technique to train off-the-shelf classifiers for attack detection, SelfMAD++ advances this paradigm by integrating the artifact-driven augmentation with foundation models and fine-grained spatial reasoning. At its core, SelfMAD++ builds on CLIP, a vision-language foundation model, adapted via Low-Rank Adaptation (LoRA) to align image representations with task-specific text prompts. To enhance sensitivity to spatially subtle and fine-grained artifacts, we integrate a parallel convolutional branch specialized in dense, multi-scale feature extraction. This branch is guided by an auxiliary segmentation module, which acts as a regularizer by disentangling bona fide facial regions from potentially manipulated ones. The dual-branch features are adaptively fused through a gated attention mechanism, capturing both semantic context and fine-grained spatial cues indicative of morphing. SelfMAD++ is trained end-to-end using a multi-objective loss that balances semantic alignment, segmentation consistency, and classification accuracy. Extensive experiments across nine standard benchmark datasets demonstrate that SelfMAD++ achieves state-of-the-art performance, with an average Equal Error Rate (EER) of 3.91%, outperforming both supervised and unsupervised MAD methods by large margins. Notably, SelfMAD++ excels on modern, high-quality morphs generated by GAN- and diffusion-based morphing methods, underscoring its robustness and strong generalization capability. SelfMAD++ code and supplementary resources are publicly available at:
https://github.com/LeonTodorov/SelfMADpp.
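
To give a rough sense of the gated attention fusion described above, the following minimal PyTorch sketch combines pooled features from the two branches; the module name (GatedFusion), feature dimensions, and two-class head are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of gated dual-branch fusion (assumed dims and names, not the official code).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Adaptively fuses semantic (CLIP) and fine-grained (convolutional) features."""
    def __init__(self, clip_dim: int = 512, conv_dim: int = 512, fused_dim: int = 512):
        super().__init__()
        self.proj_clip = nn.Linear(clip_dim, fused_dim)
        self.proj_conv = nn.Linear(conv_dim, fused_dim)
        # Gate predicts per-channel mixing weights in [0, 1].
        self.gate = nn.Sequential(
            nn.Linear(2 * fused_dim, fused_dim),
            nn.Sigmoid(),
        )
        self.classifier = nn.Linear(fused_dim, 2)  # bona fide vs. morph

    def forward(self, clip_feat: torch.Tensor, conv_feat: torch.Tensor) -> torch.Tensor:
        z_clip = self.proj_clip(clip_feat)   # semantic context branch
        z_conv = self.proj_conv(conv_feat)   # fine-grained spatial branch
        g = self.gate(torch.cat([z_clip, z_conv], dim=-1))
        fused = g * z_clip + (1.0 - g) * z_conv  # gated attention fusion
        return self.classifier(fused)

# Usage with dummy pooled features from the two branches.
fusion = GatedFusion()
clip_feat = torch.randn(4, 512)  # e.g., pooled CLIP image embeddings
conv_feat = torch.randn(4, 512)  # e.g., pooled multi-scale conv features
logits = fusion(clip_feat, conv_feat)
print(logits.shape)  # torch.Size([4, 2])
```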