This master thesis addresses the generation of synthetic images of one modality from images of another. In modern medical diagnostics and treatment planning, imaging methods such as computed tomography (CT) and magnetic resonance (MR) play a crucial role. Each modality offers a unique insight into the human body, and in many cases images from multiple modalities are needed. However, acquiring all required scans can be time-consuming, expensive, and uncomfortable for patients. Longer acquisition times also increase the likelihood of artifacts, often resulting in incomplete imaging protocols with missing modalities. These challenges create a clinical need for methods that can synthesize missing imaging modalities from existing ones, without additional scanning.
To address this, we implemented and evaluated an adversarial diffusion model called SynDiff \cite{ozbey_unsupervised_2023}, which combines the strengths of generative adversarial networks (GANs) and denoising diffusion probabilistic models (DDPMs). To evaluate the model, we conducted ten experiments on datasets differing in modality pairs, anatomies, and clinical contexts. The experiments included translation between brain MR sequences (FLAIR -- DIR, T1 -- T1ce, T1 -- T2), translation between CT and MR brain images, translation from CBCT to CT images of the head and pelvis, translation between CTA and CT images of the head, and translation of anisotropic 2D T1 MR scans into isotropic 3D T1 MR images of the brain. Training and evaluation were performed on datasets from multiple centers to verify out-of-domain robustness. The quality of the generated images was assessed using an extensive set of metrics, chosen to approximate human assessment as closely as possible: full-reference metrics (PSNR, SSIM, MS-SSIM, IW-SSIM, FSIM, VSI, GMSD, DISTS, LPIPS, HaarPSI) and two feature-distribution-based metrics (FID and KID). Furthermore, we critically analyzed the reliability of these metrics by testing their behavior under controlled degradations such as noise, blur, and contrast changes, and investigated the influence of background regions on their outcomes.
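For illustration, the following is a minimal sketch of how such full-reference metrics can be computed, assuming the piq (PyTorch Image Quality) package; the actual evaluation pipeline used in this work may differ in preprocessing and data handling.

\begin{verbatim}
import torch
import piq

# Synthetic and reference slices as float tensors in [0, 1] with shape
# (N, C, H, W); grayscale slices are replicated to three channels so that
# the VGG-based perceptual metrics (DISTS, LPIPS) can be applied.
fake = torch.rand(4, 3, 256, 256)   # placeholder for generated images
real = torch.rand(4, 3, 256, 256)   # placeholder for reference images

scores = {
    "PSNR":    piq.psnr(fake, real, data_range=1.0).item(),
    "SSIM":    piq.ssim(fake, real, data_range=1.0).item(),
    "MS-SSIM": piq.multi_scale_ssim(fake, real, data_range=1.0).item(),
    "HaarPSI": piq.haarpsi(fake, real, data_range=1.0).item(),
    "FSIM":    piq.fsim(fake, real, data_range=1.0).item(),
    "VSI":     piq.vsi(fake, real, data_range=1.0).item(),
    "GMSD":    piq.gmsd(fake, real, data_range=1.0).item(),  # lower is better
    "DISTS":   piq.DISTS()(fake, real).item(),               # lower is better
    "LPIPS":   piq.LPIPS()(fake, real).item(),               # lower is better
}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
\end{verbatim}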
The results showed that the SynDiff model is capable of generating realistic synthetic images, though performance varied depending on task complexity. The model performed best on intra-modality translations, particularly in super-resolution tasks (experiments 3, 4, and 5). Among these, the best results were obtained for the translation from isotropic 3D T1 MR images to anisotropic 2D T1 MR images, a direction that entails information reduction, as confirmed by the quantitative metrics: PSNR 31 -- 35~dB, SSIM 0.96 -- 0.98, LPIPS 0.04 -- 0.06, DISTS 0.07 -- 0.10, FID 42 -- 43, and KID 0.01 -- 0.02. Across all experiments, the task of reducing information was easier than the task of adding information, as demonstrated by the experiments involving translation between contrast and non-contrast images (CTA -- CT and T1ce -- T1 MR).
The model also achieved better results when translating CBCT scans into CT, effectively enhancing image quality, than in the reverse direction, which requires simulating CBCT artifacts. Tasks involving pelvic images were more challenging than those involving brain images, as reflected in lower metric values: PSNR 23.24~dB vs. 23.81~dB, HaarPSI 0.31 vs. 0.54, LPIPS 0.27 vs. 0.14, and FID 152.72 vs. 110.46.
The greatest challenge was the synthesis between structurally very different modalities, such as generating MR images from CT images, where the model had to infer soft-tissue details not visible in CT. On this task, the model achieved poor results, as expected: PSNR 18.00~dB, SSIM 0.56, HaarPSI 0.36, LPIPS 0.23, DISTS 0.20, and FID 107.29.
The comparison between quantitative metrics and visual assessment revealed a significant limitation of the metrics used, most apparent in the translation from CTA to CT images. On this task, the model achieved quantitative results among the best in the entire study (PSNR 33.16~dB, SSIM 0.90, HaarPSI 0.78, FID 45.07), yet these high values did not align with the visual assessment. The scores were inflated by structural similarities (e.g., skull regions) between source and target scans, which make the task deceptively easy for similarity-based evaluation metrics. This illustrates the limitations of relying solely on quantitative metrics.
Overall, our analysis confirmed that no single metric is sufficient for reliable evaluation. PSNR proved to be the least reliable metric, as it is highly sensitive to simple contrast changes while being less sensitive to image blur. SSIM was sensitive to noise, whereas its variants MS-SSIM and IW-SSIM were more robust and provided a better assessment of structural preservation. HaarPSI proved highly sensitive to blur and loss of sharpness, making it especially suitable for assessing the preservation of fine details. LPIPS and VSI proved most suitable for perceptual and structural quality assessment; their key advantage is robustness to changes in global contrast, which allows them to focus on actual structural and textural errors.
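As an illustration of the degradation analysis described above, the following sketch (again assuming piq; the exact degradation types and parameters used in the thesis may differ) applies controlled noise, blur, and contrast changes to a reference image and records how PSNR, SSIM, and LPIPS respond.

\begin{verbatim}
import torch
import torch.nn.functional as F
import piq

def degrade(img: torch.Tensor, kind: str) -> torch.Tensor:
    """Apply one controlled degradation to an image batch in [0, 1]."""
    if kind == "noise":      # additive Gaussian noise
        return (img + 0.05 * torch.randn_like(img)).clamp(0, 1)
    if kind == "blur":       # box blur as a simple stand-in for Gaussian blur
        return F.avg_pool2d(img, kernel_size=5, stride=1, padding=2)
    if kind == "contrast":   # global contrast reduction around the mean
        return (0.7 * (img - img.mean()) + img.mean()).clamp(0, 1)
    raise ValueError(f"unknown degradation: {kind}")

ref = torch.rand(1, 3, 256, 256)     # placeholder reference image
lpips = piq.LPIPS()

for kind in ("noise", "blur", "contrast"):
    deg = degrade(ref, kind)
    print(f"{kind:9s} "
          f"PSNR={piq.psnr(deg, ref, data_range=1.0).item():6.2f}  "
          f"SSIM={piq.ssim(deg, ref, data_range=1.0).item():.3f}  "
          f"LPIPS={lpips(deg, ref).item():.3f}")
\end{verbatim}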