- Front Matter: Volume 13409
- Monday Morning Keynotes
- Observer Performance
- Breast
- Model Observers
- CAD and Perception: Joint Session with Conferences 13407 and 13409
- Technology Assessment
- Task-informed Computed Imaging
- Data Issues for AI Assessment
- Poster Session
Front Matter: Volume 13409
Front Matter: Volume 13409
This PDF file contains the front matter associated with SPIE Proceedings Volume 13409, including the Title Page, Copyright information, Table of Contents, and Conference Committee information.
Monday Morning Keynotes
Designing AI for clinical imaging: the important role of model observers
Deep learning algorithms for image reconstruction and processing are showing strong promise for multiple medical-imaging applications. However, medical images are acquired for clinical tasks, such as defect detection and feature quantification, and these algorithms are often developed and evaluated agnostic to this clinical task. This talk will demonstrate how model observers can facilitate the development and evaluation of deep learning algorithms for clinical tasks by presenting two case studies. The first case study will underscore the misleading interpretations that clinical-task-agnostic evaluation of AI algorithms can yield, emphasizing the crucial need for clinical-task-based evaluation. Next, we will see how model observers can not only facilitate such evaluation but also enable the designing of deep learning algorithms that explicitly account for the clinical task, thus poising the algorithm for success in clinical applications. The second case study will demonstrate the use of model observers to select deep learning algorithms for subsequent human-observer evaluation. We will then see how this led to the successful evaluation of a candidate algorithm in a multi-reader multi-case human observer study. These case studies will illustrate how model observers provide a practical, reliable, interpretable, and efficient mechanism for development and translation of AI-based medical imaging solutions.
Observer Performance
Observer performance and eye tracking variations as a function of AI output format
Artificial intelligence (AI) tools are designed to improve the efficacy and efficiency of data analysis and interpretation by the human decision maker. However, we know little about the optimal ways to present AI output to providers. This study used radiology image interpretation with AI-based decision support to explore the impact of different forms of AI output on reader performance. Readers included 5 experienced radiologists and 3 radiology residents reporting on a series of COVID chest x-ray images. Four forms of AI output (a one-word diagnostic summary (normal, mild, moderate, severe), a probability graph, a heatmap, and a heatmap plus probability graph), plus a no-AI-feedback condition, were evaluated. Results reveal that most decisions regarding the presence/absence of COVID without AI were correct and overall remained unchanged across all types of AI output. Of the decisions that changed as a function of seeing the AI output, fewer than 1% were changes for the worse (true positive to false negative or true negative to false positive) regarding presence/absence of COVID, and about 1% were changes for the better (false negative to true positive, false positive to true negative). More complex output formats (e.g., heatmap plus probability graph) tend to increase reading time and the number of scans between the clinical image and the AI output, as revealed through eye tracking. The key to the success of AI tools in medical imaging will be to incorporate the human into the overall process to optimize and synergize the human-computer dyad, since at least for the foreseeable future the human is and will be the ultimate decision maker. Our results demonstrate that the form of the AI output is important, as it can impact clinical decision making and efficiency.
Mitigating visual hindsight bias in radiology: can education alter perceptual decision making?
This study investigated whether educating radiologists on hindsight bias could mitigate its effects when interpreting chest radiographs containing pulmonary nodules. Sixteen radiologists analysed 15 PA chest X-rays (CXRs) with three levels of lung nodule conspicuity. Initially, they identified nodules by reducing image blurring (foresight phase) and later increased blurring until the nodules became unidentifiable (hindsight phase). After an educational intervention, the experiment was repeated to assess changes in perception. Eye-tracking metrics, including Time to First Fixation (TFF) and Total Fixation Duration (TFD), were measured pre- and post-intervention. Wilcoxon signed-rank tests assessed statistical differences, while a linear mixed-effects model accounted for participant demographics such as speciality, experience, and case volume. Results showed a significant change in TFF during the foresight phase (p<0.001) but no change in TFD, indicating that hindsight bias influenced fixation patterns. In the hindsight phase, TFD significantly changed post-intervention, suggesting that education altered visual search behaviours. Higher lesion conspicuity increased the impact of education on TFD (p=0.03), while years of experience (p=0.03) and thoracic radiology specialisation (p=0.02) reduced this effect. No significant changes were found for TFF. These findings suggest that education can mitigate hindsight bias in radiology by altering search behaviours in the hindsight phase, though its impact varies by experience and speciality. This may have implications for radiology training and expert witness evaluations in medicolegal cases.
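A minimal sketch of the paired pre-/post-intervention comparison described above, using a Wilcoxon signed-rank test on hypothetical Time to First Fixation values (the variable names and simulated effect are assumptions, not study data):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Hypothetical TFF (seconds) for 16 radiologists, before and after education
tff_pre = rng.gamma(shape=2.0, scale=1.5, size=16)
tff_post = tff_pre * rng.normal(0.8, 0.1, size=16)   # simulated post-education shift

# Wilcoxon signed-rank test on the paired differences
stat, p_value = wilcoxon(tff_pre, tff_post)
print(f"W = {stat:.1f}, p = {p_value:.4f}")
```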
Capturing eye movements during ultrasound-guided embryo transfer: first insights
Embryo transfer is a critical step of in vitro fertilization, the most effective treatment for infertility, which is experienced by one in six people in their lifetime. To date, despite advances in optimizing embryo quality, an important variability in pregnancy rate remains between practitioners. To evaluate the key technical skills that might underlie such behavioural differences, we conducted a preliminary multi-centric study on assisted reproductive technologies (ART) specialists using a Gynos Virtamed simulator for ultrasound-guided embryo transfer (UGET) combined with a portable eye tracker (Neon, Pupil Labs). Our first analyses demonstrate the capability of a recent portable eye tracker to track fine eye movements in an ecological embryo transfer setting (head unrestrained, dim lighting). A dedicated processing pipeline was developed, and gaze was analyzed on Areas of Interest (AoI) consisting of the ultrasound image, the uterine model (A, C, or E), and the catheter. A separate analysis of the fixated anatomical subregions of the ultrasound image was also conducted. Preliminary analyses show two distinctive patterns of eye movements during UGET: a target-based behaviour and a switching, tool-following behaviour, suggesting more proactive gaze behaviour in experts, in agreement with the literature on other image-guided interventions.
Breast
The relationship between eye tracking features and transfer learning in modeling decision prediction of radiologists reading mammograms
Predicting radiologists’ decisions when reading mammograms is a novel way to reduce the number of false positives and false negatives made at breast cancer screening. In this study, we aimed to enhance the accuracy of predicting radiologists’ decisions in mammography by leveraging transfer learning. Our dataset comprised 120 digital mammogram cases, each annotated with radiologists’ decisions categorized as true positive (TP), false positive (FP), or false negative (FN). We adopted the ResNet50 convolutional neural network (CNN) for our modeling approach, developing two different models. In the first model, ResNet50 was pretrained on the ImageNet dataset, with the initial layers frozen and the remaining layers fine-tuned to adapt to our mammography data. The second model was initialized with the ImageNet-derived weights from the first model and further pretrained using the VinDr-Mammo dataset, an open-access large-scale Vietnamese dataset of full-field digital mammograms (FFDM) consisting of 5,000 four-view exams with breast-level assessments and extensive lesion-level annotations. Our transfer learning method improved decision prediction accuracy by leveraging features from the VinDr-Mammo models.
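A minimal sketch of the first model's setup as described (pretrained ResNet50 with frozen initial layers and a three-way decision head); which layers were frozen in the actual study is not specified, so the split below is an assumption:

```python
import torch.nn as nn
from torchvision import models

# ResNet50 pretrained on ImageNet
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the initial layers (assumed split: stem plus first two residual stages)
for module in (model.conv1, model.bn1, model.layer1, model.layer2):
    for param in module.parameters():
        param.requires_grad = False

# Replace the classification head for the TP / FP / FN decision categories
model.fc = nn.Linear(model.fc.in_features, 3)

# Only unfrozen parameters are handed to the optimizer during fine-tuning
trainable = [p for p in model.parameters() if p.requires_grad]
```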
Performance of tomosynthesis vs. mammography in women with a family history of breast cancer
Background: Digital breast tomosynthesis (DBT) improves screening performance compared to digital mammography (DM) in population screening, but data on DBT's performance in women with a family history of breast cancer (FHBC) are limited. Methods: In collaboration with the Breast Cancer Surveillance Consortium, a cohort of women with a FHBC who received DBT or DM screening from 2011–2018 was assembled. This study reports the crude measures of DBT and DM screening. Results: The dataset consisted of 502,357 examinations (121,790 DBT; 380,567 DM) with complete 1-year cancer registry follow-up from 208,945 women with a FHBC. Crude cancer detection rates were 5.7 and 5.2 per 1,000 examinations for DBT and DM, respectively, and the corresponding recall rates were 8.5% and 10.1%. Detection rates of invasive cancers and ductal carcinoma in situ were 4.5 (vs 3.9) and 1.2 (vs 1.3) per 1,000 examinations for DBT (vs DM), respectively. The biopsy and false-positive biopsy recommendation rates were 1.4% (vs 1.4%) and 0.9% (vs 1.0%) for DBT (vs DM), respectively. Interval cancer rates were 1.0 and 0.9 per 1,000 examinations for DBT and DM, respectively. Advanced cancer rates were the same for DBT and DM at 0.6 per 1,000 examinations. The sensitivity and specificity were 85.4% (vs 85.0%) and 92.1% (vs 90.4%) for DBT (vs DM). Conclusions: This large-scale cohort study of women with a FHBC who received DBT or DM screening presents the crude rates for screening metrics. The primary benefit from DBT is a reduction in recall rate with preservation of cancer detection for women with a FHBC.
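A worked example of the crude screening metrics quoted above (the event counts below are back-calculated from the reported rates and are illustrative only):

```python
def per_1000(events, n_exams):
    """Crude rate per 1,000 examinations."""
    return 1000 * events / n_exams

def percent(events, n_exams):
    """Crude rate as a percentage."""
    return 100 * events / n_exams

n_dbt = 121_790                      # DBT examinations in the cohort
print(per_1000(694, n_dbt))          # ~5.7 cancers detected per 1,000 exams
print(percent(10_352, n_dbt))        # ~8.5% recall rate
```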
Interrogating expert observer performance in a BreastScreen Australia radiology cohort
Breast cancer has the highest incidence among Australian women, with approximately 20,000 new cases diagnosed in 2022. The national screening program, BreastScreen Australia (BSA), plays a crucial role in early detection, which significantly improves survival rates. Using data from 592 readers on the BreastScreen Reader Assessment Strategy (BREAST) platform collected between 2014 and 2024, this study evaluated differences in clinical workload and demographic characteristics between BSA readers performing at or above the 95th percentile and the general cohort. Furthermore, the impact of cases per week on sensitivity, specificity, lesion localization accuracy, ROC AUC, and JAFROC was considered. Top-performing readers had significantly more years in their clinical roles, read more cases per week, and had more experience in mammogram reading. An increased number of cases per week (CPW) was significantly associated with better performance, with a performance plateau observed at approximately 101-150 cases per week. These insights highlight the importance of maintaining reader caseload to achieve optimal screening performance and may inform future guidelines for reader benchmarks and training in the BSA program.
Adaptation effects on breast density judgements with blended stimuli
Sequential effects in batch reading of breast-cancer screening images have now been reported in multiple studies across different countries, and across both digital mammography and digital breast tomosynthesis modalities. Common to all of these studies is a change in assessment characteristics as a reader progresses through a batch (e.g., decreasing assessment time and changing recall rate). Understanding the mechanism(s) of these phenomena remains an open question. We have been investigating visual adaptation as a contributing factor. A general perceptual mechanism like adaptation should affect the reading process broadly, and in this work we evaluate radiologist perception of breast density to see if there is evidence of adaptation in this ancillary judgement. We report the results of a series of density rating experiments run at the European Congress of Radiology in 2023. Thirteen radiologists with breast-reading experience participated in this study and rated breast density for patches of fatty and dense tissue in three adaptation states (No-Adapt, Fatty-Adapt, and Dense-Adapt). In these studies, blending of fatty and dense images served as a quantitative surrogate for breast density (e.g., an 80%:20% blend of dense and fatty tissue is considered more dense than a 20%:80% blend). Adaptation was to a rapid, random sequence of unblended dense or fatty images. We used a nonparametric multi-sample bootstrap for statistical inference. We find that radiologists consistently rate denser blends higher (i.e., as more dense), by more than 0.25 points per 20% increase in the blend, depending on adaptation state (p<0.001). This rate is significantly lower for No-Adapt ratings (0.26) than for Fatty-Adapt (0.37, p<0.01) and Dense-Adapt (0.38, p<0.01). Average ratings are consistently 0.33 points higher for the Fatty-Adapt condition than for the Dense-Adapt condition across the blends (p<0.01). Thus, we find systematic differences in ratings following the different adaptation states. In particular, relative to the No-Adapt condition, adapting to fatty images makes denser blends appear more dense, and adapting to dense images appears to make fatty blends look more fatty. These findings are consistent with laboratory studies of visual adaptation, which have demonstrated perceptual aftereffects for mammogram and DBT images in non-radiologists, and suggest these effects could also impact BI-RADS classification by trained readers.
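A minimal sketch of the nonparametric multi-sample bootstrap used for inference, on hypothetical rating vectors (the real analysis resamples the actual reader ratings per adaptation condition):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical density ratings under two adaptation states
fatty_adapt = rng.normal(3.2, 0.6, size=200)
dense_adapt = rng.normal(2.9, 0.6, size=200)

n_boot = 10_000
diffs = np.empty(n_boot)
for b in range(n_boot):
    # Resample each condition independently, with replacement
    f = rng.choice(fatty_adapt, size=fatty_adapt.size, replace=True)
    d = rng.choice(dense_adapt, size=dense_adapt.size, replace=True)
    diffs[b] = f.mean() - d.mean()

ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
print(f"mean rating difference, 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```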
AI performance in screening mammograms may improve through multiresolution data augmentation
This paper integrated a multi-resolution strategy into two state-of-the-art AI models for cancer detection within a double-reader breast screening program and determined whether tumour size affected the performance of the better AI model. Transfer learning and a multi-resolution strategy were applied to the Globally-aware Multiple Instance Classifier (GMIC) and Global–Local Activation Maps (GLAM) models using two Australian mammographic databases. The specificity and sensitivity of these AI models, both with and without transfer learning and the multi-resolution strategy, were evaluated on our database of 450 normal cases and 450 cancer cases. When transfer learning and the multi-resolution strategy were incorporated, the GMIC model outperformed the GLAM model in terms of specificity and sensitivity. With transfer learning and the multi-resolution strategy, the GMIC and GLAM models achieved their best performance, with sensitivities of 91.6% and 86.9% respectively, outperforming their transfer-learning-only and pre-trained modes. The sensitivity of the two transfer learning AI models was significantly improved by the multi-resolution strategy. The GMIC model with transfer learning and the multi-resolution strategy demonstrated similar performance on screening mammograms with smaller tumour sizes compared with larger tumour sizes. The study also supports the potential of the AI models to assist radiologists interpreting mammograms within a double-reader breast screening program.
Model Observers
Anatomical texture impacts model observer detection performance: an inkjet-printed phantom study
Deep learning reconstruction (DLR) algorithms are trained on patient images whose features define the DLR model. These features are currently absent from phantoms used in image quality evaluations. As DLR assumes similar features between training and scan data, image quality may differ between patients and image quality phantoms. DLR thus motivates the use of realistic phantoms, but these have never been used in image quality evaluations involving model observers. This study has two aims: first, to investigate whether a channelized Hotelling observer (CHO) can run on two inkjet-printed phantoms, one uniform semi-anthropomorphic and one realistic, both with low-contrast lesions of 12 and 8 mm in diameter; second, to assess whether CHO performance differs between phantom types. Repeated computed tomography (CT) acquisitions of both phantoms allowed the extraction of 40-pixel regions of interest (ROIs), which were then given to a CHO previously shown to predict human performance in small anatomical liver ROIs. CHO performance depended significantly on dose, lesion contrast, and lesion size. CHO performance for 12 mm lesions was significantly greater with the uniform phantom than with the realistic phantom. Thus, estimating a dose reduction with the uniform phantom could lead to an undesired loss of detection performance in patient images, ranging from 15 to 20% depending on the reconstruction algorithm. Such a loss could be reduced by using realistic phantoms, helping to bridge medical physicists’ image quality evaluations with radiologists’ clinical realities.
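A minimal CHO sketch on simulated ROIs, using Laguerre-Gauss channels (a common choice; the channel set and ROI statistics of the actual study are not specified here):

```python
import numpy as np
from scipy.special import eval_laguerre

def lg_channels(size, a=15.0, n_channels=5):
    """Rotationally symmetric Laguerre-Gauss channels on a size x size grid."""
    y, x = np.indices((size, size)) - (size - 1) / 2.0
    r2 = x**2 + y**2
    chans = []
    for n in range(n_channels):
        u = np.exp(-np.pi * r2 / a**2) * eval_laguerre(n, 2 * np.pi * r2 / a**2)
        chans.append((u / np.linalg.norm(u)).ravel())
    return np.stack(chans, axis=1)                   # (size*size, n_channels)

def cho_detectability(present, absent, channels):
    vp = present.reshape(len(present), -1) @ channels      # channel outputs
    va = absent.reshape(len(absent), -1) @ channels
    s = (np.cov(vp.T) + np.cov(va.T)) / 2                  # pooled covariance
    w = np.linalg.solve(s, vp.mean(0) - va.mean(0))        # Hotelling template
    tp, ta = vp @ w, va @ w                                # test statistics
    return (tp.mean() - ta.mean()) / np.sqrt((tp.var(ddof=1) + ta.var(ddof=1)) / 2)

rng = np.random.default_rng(2)
signal = 4 * lg_channels(40, a=8.0, n_channels=1)[:, 0].reshape(40, 40)
absent = rng.normal(0, 1, size=(200, 40, 40))              # noise-only ROIs
present = rng.normal(0, 1, size=(200, 40, 40)) + signal    # signal-known-exactly
print(f"d' = {cho_detectability(present, absent, lg_channels(40)):.2f}")
```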
Using gradient of Lagrangian function to compute efficient channels for the ideal observer
It is widely accepted that the Bayesian ideal observer (IO) should be used to guide the objective assessment and optimization of medical imaging systems. The IO employs complete task-specific information to compute test statistics for making inference decisions and performs optimally in signal detection tasks. However, the IO test statistic typically depends non-linearly on the image data and cannot be analytically determined. The ideal linear observer, known as the Hotelling observer (HO), can sometimes be used as a surrogate for the IO. However, when image data are high dimensional, HO computation can be difficult. Efficient channels that can extract task-relevant features have been investigated to reduce the dimensionality of image data to approximate IO and HO performance. This work proposes a novel method for generating efficient channels by use of the gradient of a Lagrangian-based loss function that was designed to learn the HO. The generated channels are referred to as the Lagrangian-gradient (L-grad) channels. Numerical studies are conducted that consider binary signal detection tasks involving various backgrounds and signals. It is demonstrated that the channelized HO (CHO) using L-grad channels can produce significantly better signal detection performance than the CHO using partial least squares (PLS) channels. Moreover, it is shown that the proposed L-grad method can achieve significantly lower computation time than the PLS method.
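For reference, the standard definitions of the HO and CHO that the L-grad channels are designed to approximate (g is the image data; subscripts 0/1 denote the signal-absent/signal-present classes):

```latex
% Hotelling observer: linear test statistic with covariance-weighted template
t_{\mathrm{HO}}(\mathbf{g}) = \mathbf{w}^{\top}\mathbf{g}, \qquad
\mathbf{w} = \mathbf{K}_{\mathbf{g}}^{-1}\left(\bar{\mathbf{g}}_{1}-\bar{\mathbf{g}}_{0}\right)

% Channelized HO: apply channels T to reduce dimensionality, v = T^{\top} g
t_{\mathrm{CHO}}(\mathbf{v}) =
\left[\mathbf{K}_{\mathbf{v}}^{-1}\left(\bar{\mathbf{v}}_{1}-\bar{\mathbf{v}}_{0}\right)\right]^{\top}\mathbf{v}
```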
Effects of feature selection and internal noise levels for a search-capable model observer
Diagnostic imaging trials are important for evaluating and regulating medical imaging technology. A typical trial might have a group of radiologists separately reading sets of patient cases. Such trials are costly and time-consuming, factors which limit the practical use of trial methodologies. The idea of applying computer models as substitutes for the expert readers in clinically relevant trials has been around for many decades, with the potential for broadening the applicability of imaging trials at relatively lower cost. However, achieving and affirming this relevance has been an issue. In this work, we examine feature-driven computer models derived from statistical decision theory that could serve in appropriate imaging trials involving patient variability, target variability, and target search. The model properties and parameters being examined include the effects of feature selection (including the use of thresholds on acceptable features) and internal-noise modeling. Performance comparisons with existing ideal and anthropomorphic computer models are of interest for future study.
Combining image texture and morphological features in low-resource perception models for signal detection tasks
Texture analysis holds significant importance in various imaging fields due to its ability to provide statistical, structural, and intrinsic spatial information from images. In this work, we examine several first- and second-order texture features on simulated and clinical digital breast tomosynthesis (DBT) images. We identified essential characteristics of texture features that show higher discriminatory potential for mass detection in DBT. We further examined the use of these texture features, along with morphological features, in a two-stage visual search (VS) model observer for mass detection in DBT. Our preliminary results show that incorporating texture features reduced the number of suspicious locations in the first stage of the VS model. Our preliminary results with an eye-tracking system show that observer gaze points align well with the “search” regions predicted by either the texture-aided or thresholded VS observer. In summary, we show how adding perceptually relevant texture features or a thresholding mechanism enhances our visual search observer models. Future work will examine feature selection for changing tasks.
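A sketch of representative first- and second-order texture features of the kind examined, computed with scikit-image on a stand-in patch (the study's actual feature set is broader):

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

rng = np.random.default_rng(3)
roi = rng.integers(0, 64, size=(64, 64), dtype=np.uint8)   # stand-in DBT patch

# First-order statistics from the gray-level histogram
first_order = {"mean": roi.mean(), "std": roi.std()}

# Second-order (GLCM) features at distance 1 over four orientations
glcm = graycomatrix(roi, distances=[1],
                    angles=[0, np.pi/4, np.pi/2, 3*np.pi/4],
                    levels=64, symmetric=True, normed=True)
second_order = {prop: graycoprops(glcm, prop).mean()
                for prop in ("contrast", "homogeneity", "energy")}
print(first_order, second_order)
```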
Perceived color contrast metrics for clinical images
In medical imaging, contrast plays a crucial role in determining visual quality and facilitating accurate interpretation, particularly in fields like digital pathology and dermatology where color variations are diagnostically significant. Traditional contrast metrics often focus on intensity variations, potentially overlooking critical color information. This study introduces two novel color contrast metrics designed to quantify color variations at the image level, addressing this gap. The proposed metrics are a generalization of histogram-based methods in which perceptual color difference metrics are used to calculate maximum and spread contrast measures. Experimental evaluation, including a study with 28 volunteers assessing image pairs, demonstrated high agreement (Cohen's kappa of 0.92) between these metrics and human-perceived color contrast. Furthermore, statistical analyses confirmed that these metrics reliably distinguish color differences beyond luminance variations. The metrics proved robust across different parameter settings, as demonstrated by stability tests. Additionally, practical applications were explored in the field of digital pathology. The study concludes that these color contrast metrics align well with human perception and offer a valuable tool for enhancing diagnostic accuracy in medical imaging.
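A sketch in the spirit of the proposed metrics, using CIEDE2000 perceptual color differences between sampled image colors to form maximum and spread contrast measures (the binning and summary details of the actual metrics are assumptions here):

```python
import numpy as np
from skimage import color

rng = np.random.default_rng(4)
img = rng.random((128, 128, 3))                 # stand-in RGB image in [0, 1]

lab = color.rgb2lab(img).reshape(-1, 3)
# Subsample pixels as a crude surrogate for histogram binning
sample = lab[rng.choice(len(lab), size=256, replace=False)]

# Pairwise CIEDE2000 differences between the sampled colors
d = color.deltaE_ciede2000(sample[:, None, :], sample[None, :, :])
upper = d[np.triu_indices_from(d, k=1)]
max_contrast = upper.max()                      # largest perceived difference
spread_contrast = upper.std()                   # spread of perceived differences
print(max_contrast, spread_contrast)
```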
Assessment of cell nuclei AI foundation models in kidney pathology
Cell nuclei instance segmentation is a crucial task in digital kidney pathology. Traditional automatic segmentation methods often lack generalizability when applied to unseen datasets. Recently, the success of foundation models (FMs) has provided a more generalizable solution, potentially enabling the segmentation of any cell type. In this study, we perform a large-scale evaluation of three widely used state-of-the-art (SOTA) cell nuclei foundation models—Cellpose, StarDist, and CellViT. Specifically, we created a highly diverse evaluation dataset consisting of 2,542 kidney whole slide images (WSIs) collected from both human and rodent sources, encompassing various tissue types, sizes, and staining methods. To our knowledge, this is the largest-scale evaluation of its kind to date. Our quantitative analysis of the prediction distribution reveals a persistent performance gap in kidney pathology. Among the evaluated models, CellViT demonstrated superior performance in segmenting nuclei in kidney pathology. However, none of the foundation models are perfect; a performance gap remains in general nuclei segmentation for kidney pathology.
CAD and Perception: Joint Session with Conferences 13407 and 13409
Does concurrent reading with AI lead to more false negative errors for cancers that are not marked by AI?
Purpose: To determine if reading digital breast tomosynthesis (DBT) concurrently with an artificial intelligence (AI) system increases the probability of missing a cancer not marked by AI for cancers that the radiologist detected when reading without AI. Methods: We retrospectively analyzed an observer study of radiologists reading with and without an AI system. In that study, there were 260 DBT screening exams (65 containing at least one malignant lesion). Twenty-four radiologists read the cases in two separate sessions (with a 4-week washout period), once without the AI tool and once with AI concurrently (i.e., the AI marks and scores were available immediately upon examining the images). We separated the cases into AI-detected and AI-notDetected groups and then examined only cases that the radiologist recalled when reading without AI. We determined the fraction of cases from each group that the radiologist recalled when reading with AI; this was done separately for cancer and non-cancer cases. Results: When reading without AI, the readers detected an average of 5.0 of 7 (71%) cancers that were not marked by AI (range 1-7) and 49.8 of 58 (86%) cancers that were marked by AI (range 30-57). When reading with AI concurrently, readers found 3.3 (46%) of the 7 AI-notDetected cancers and agreed with 54.2 (93%) of the 58 AI-detected cancers. Using a two-tailed, paired t-test, this difference (46% vs 93%) was statistically significant (p<<0.00001). Nevertheless, the overall sensitivity increased with concurrent reading compared to reading without AI (77% to 85%). Similarly, for non-cancer cases that were recalled (FP) without AI (47% not marked / 26% marked by AI), a smaller fraction was recalled for the not-marked cases (8.1% vs 48%, p<<0.00001). This contributed to an increase in specificity with concurrent reading (63% to 70%). Conclusion: When reading with AI concurrently, radiologists are more likely to miss a cancer when AI fails to mark that cancer. Likewise, radiologists are more likely not to recall a non-cancer case when AI fails to mark a lesion in the case, even though the radiologist recalled the case when reading without AI.
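A minimal sketch of the paired per-reader comparison reported above, with hypothetical per-reader detection fractions standing in for the study data:

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(5)
n_readers = 24
# Hypothetical fraction of AI-notDetected cancers found by each reader
frac_without_ai = np.clip(rng.normal(0.71, 0.10, n_readers), 0, 1)
frac_with_ai = np.clip(frac_without_ai - rng.normal(0.25, 0.05, n_readers), 0, 1)

t_stat, p_value = ttest_rel(frac_without_ai, frac_with_ai)  # two-tailed by default
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```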
Dual roles of calcification features in the Mirai mammographic breast cancer risk prediction model: early microcalcification detection and identification of high-risk calcifications
In breast cancer screening, evaluating risk prediction independently from cancer detection is challenging due to factors such as early cancer signs and missed cancers in negative mammograms. Recent developments in risk prediction include Mirai, a deep learning-based mammographic breast cancer risk prediction model. As a means of establishing which image features contribute to the Mirai risk estimate, we developed CalcMirai, a Mirai variant that is limited to calcification features. The aim of this study was to use CalcMirai to test the causal contribution of calcification features in Mirai to the resultant risk prediction. Screening mammograms from the EMory BrEast imaging Dataset (EMBED) were used in selective mirroring experiments, where mammograms from one breast were mirrored to replace the contralateral breast. Results showed that both Mirai and CalcMirai performed better in the positive mirroring setting, i.e., considering only the future cancerous side, than when using both left and right breast views in the original models (p-values <0.01). While negative mirroring, i.e., considering the future healthy breast side of the cancerous patient, resulted in poorer performance (p-values <0.01), both models remained discriminative (AUCs 0.61-0.62). There was no significant difference between the negative mirroring performance of Mirai and CalcMirai. Additionally, visual assessments of receptive fields confirmed that the calcification features identified in CalcMirai accurately captured the positions of calcifications in the contralateral healthy breasts of future cancer patients. This underlines the role of calcification features as strong risk factors. Our findings suggest that the predictive power of Mirai derives mainly from its ability to detect early microcalcifications and/or identify high-risk calcifications. These results provide new insight into mammographic risk factors, implying that calcification may be underestimated in predicting breast cancer risk compared to the well-known risk factor of parenchymal patterns.
Automated multi-lesion annotation in chest x-rays: annotating over 450,000 images from public datasets using the AI-based smart imagery framing and truthing (SIFT) system
This work utilized an artificial intelligence (AI)-based image annotation tool, Smart Imagery Framing and Truthing (SIFT), to annotate pulmonary lesions and abnormalities and their corresponding boundaries on 452,602 chest X-ray (CXR) images (22 different types of desired lesions) from four publicly available datasets (CheXpert, ChestX-ray14, MIDRC, and NIAID TB Portals). SIFT is based on Multi-task, Optimal-recommendation, and Max-predictive Classification and Segmentation (MOM ClaSeg) technologies to identify and delineate 65 different abnormal regions of interest (ROIs) on CXR images, provide a confidence score for each labeled ROI, and, when the confidence score is not high enough, offer alternative abnormality recommendations for each ROI. The MOM ClaSeg system, integrating Mask R-CNN and a Decision Fusion Network, was developed on a training dataset of over 300,000 CXRs, containing over 240,000 confirmed abnormal CXRs with over 300,000 confirmed ROIs corresponding to 65 different abnormalities, and over 67,000 normal (i.e., “no finding”) CXRs. After quality control, the CXRs are entered into the SIFT system to automatically predict the abnormality type (“Predicted Abnormality”) and corresponding boundary locations for the ROIs displayed on each original image. The results indicated that the SIFT system can determine the abnormality types of labeled ROIs and their boundary coordinates with high efficiency (a 7.92-fold improvement) when radiologists used SIFT as an aid, compared with a traditional semi-automatic method. The SIFT system achieves an average sensitivity of 89.38%±11.46% across the four datasets. This can significantly improve the quality and quantity of training and testing sets used to develop AI technologies.
Technology Assessment
A kernel analysis of network denoisers for CT imaging
Neural network denoising algorithms have become an established component of CT reconstruction in medical imaging. These networks are typically trained to predict pixel values of a high-dose image from a low-dose acquisition. Like many applications of network models, the large number of parameters and the complex learning process obscure any fundamental understanding of how these denoisers accomplish their task. In this work, we model the output of denoising networks as a way to interpret their workings. We evaluate four neural network architectures that have been trained for a specific denoising task, restoring a full-dose image from one acquired at one-quarter dose, using clinical patient CT scans. The architectures consist of a 3-layer convolutional neural network (CNN3), a deep (17-layer) CNN (DNCNN), a residual encoder-decoder CNN (REDCNN), and a dilated U-shaped CNN (U-Net). Each network architecture went through an extensive model selection and training process. In this initial exploratory assessment, we report the results of modeling each network as a linear filter kernel, using global and local estimation strategies. Global kernels are fit over many spatial locations, while local kernels are fit at a single location but with many white-noise perturbations. Global kernel weights are fit to the denoised pixels using least-squares regression over the interior of 434 patient slice images. Local kernels are estimated for a total of 12 locations (4 each in soft tissue, lung, and bone) using 10,000 white-noise perturbations (σ = 5 HU). We find that the linear kernel approximates the output of the denoising algorithms quite well globally. Average R² across slices ranged from 96% to more than 99% depending on the class of pixels being fit (soft tissue, lung, etc.). Kernel weights appear to be consistent across networks, with small differences between pixel classes. Results of local fitting are more diverse, with visible differences across networks and pixel classes.
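A minimal sketch of the local kernel estimation strategy: perturb a pixel neighborhood with white noise, record the denoiser's output at the center pixel, and fit kernel weights by least squares (the `denoise` stand-in below replaces a trained network):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def denoise(img):
    """Stand-in for a trained denoising network."""
    return uniform_filter(img, size=3)

rng = np.random.default_rng(6)
base = rng.normal(100, 20, size=(64, 64))    # stand-in CT patch (HU)
r, c, k = 32, 32, 7                          # probe location, kernel width
n_pert, sigma = 10_000, 5.0                  # white-noise perturbations, sigma in HU

X = np.empty((n_pert, k * k))
y = np.empty(n_pert)
for i in range(n_pert):
    noise = rng.normal(0, sigma, size=(k, k))
    img = base.copy()
    img[r - k//2 : r + k//2 + 1, c - k//2 : c + k//2 + 1] += noise
    X[i] = noise.ravel()                     # perturbation is the regressor
    y[i] = denoise(img)[r, c]                # center-pixel response

w, *_ = np.linalg.lstsq(X, y - y.mean(), rcond=None)
kernel = w.reshape(k, k)                     # estimated local linear kernel
```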
On the clinical usefulness of cone beam CT short scans and super-short scans for detection of fractures in the extremities
Detecting fractures in the upper and lower extremities can be challenging on conventional X-ray imaging due to the lack of depth information. To clarify negative findings from X-ray images, the physician can order a CT scan or an MRI exam, but these may require long waiting times that are not desirable in the Emergency Department. Given recent advances in robotics, cone-beam imaging using circular short scans or super-short scans is now becoming a possibility in the X-ray room. From a theoretical point of view, however, such scans are known to have important limitations in terms of data completeness, leading to artifacts and invisible boundaries or portions thereof. Nevertheless, we contend that (super) short scans can be clinically useful for detecting fractures of the extremities that are occult on conventional X-ray imaging. In this work, we present initial results, based on 58 patient exams acquired on a clinically available system, that favorably support our hypothesis. The results are reported using both subjective and objective metrics obtained from radiologists serving as interpreters.
Learning stochastic object models using ambient adversarial diffusion models
Computational simulation plays an important role in the design and optimization of medical imaging systems. It is important to employ objective measures of image quality (IQ) for such purposes, but computing them requires that all sources of randomness in the measured data be accounted for, including variations within the objects to be imaged. A stochastic object model (SOM) should be established that describes clinically realistic textures and anatomical variations. Ambient generative adversarial networks (GANs) have been explored to establish SOMs from experimental data but face limitations in representing variations in object properties due to mode collapse and premature convergence. This study proposes the Ambient Adversarial Diffusion Model (AADM), a novel ambient generative model inspired by the Adversarial Diffusion Model (ADM) and AmbientGAN frameworks. The AADM is designed to establish more advanced and comprehensive SOMs from noisy, indirect measurement data than previous AmbientGAN-based SOMs. Numerical experiments demonstrate that the performance of AADM is comparable to that of a non-ambient adversarial diffusion model trained directly on the distribution of objects. The presented study demonstrates the significant potential to learn the distribution of realistic objects from noisy imaging measurements.
A tool for visual quality assessment of display devices used in digital pathology
Accurate diagnosis in digital pathology hinges on the ability to discern subtle details within digitized tissues. However, the quality of display screens and ambient lighting conditions significantly impact this perceptual process. In this context, we aim to develop a novel tool to assess reading conditions and the visual fidelity of displays used specifically in digital pathology. By employing abstract color patterns with meticulously controlled contrast levels, this tool simulates the challenges encountered during real-world analysis of H&E-stained tissue samples. These patterns allow the observers to evaluate their ability to differentiate subtle color variations on their display devices in their current lighting conditions. An observer study with 47 participants investigated the effectiveness of this tool. The results demonstrate its ability to differentiate between consumer-grade and medical-grade displays. These statistically significant findings highlight the tool’s potential for reliable display evaluation within digital pathology workflows. Overall, this innovative tool holds significant promise for ensuring optimal viewing conditions in digital pathology, potentially leading to more accurate diagnoses.
Task-informed Computed Imaging
Estimating task-based performance bounds for accelerated MRI image reconstruction methods by use of learned ideal observers
Medical imaging systems are commonly assessed and optimized by the use of objective measures of image quality (IQ). The performance of the ideal observer (IO) acting on imaging measurements has long been advocated as a figure-of-merit to guide the optimization of imaging systems. For computed imaging systems, the performance of the IO acting on imaging measurements also sets an upper bound on task performance that no image reconstruction method can transcend. As such, estimation of IO performance can provide valuable guidance when designing under-sampled data-acquisition techniques by enabling the identification of designs that will not permit the reconstruction of diagnostically appropriate images for a specified task, no matter how advanced the reconstruction method is or how plausible the reconstructed images appear. The need for such analysis is urgent because of the substantial increase in medical-device submissions involving deep learning-based image reconstruction methods and the fact that they may produce clean images disguising the potential loss of diagnostic information when data are aggressively under-sampled. Recently, convolutional neural network (CNN)-approximated IOs (CNN-IOs) were investigated for estimating the performance of data-space IOs to establish task-based performance bounds for image reconstruction in an X-ray computed tomography (CT) context. In this work, the application of such data-space CNN-IO analysis to multi-coil magnetic resonance imaging (MRI) systems has been explored. This study utilized stylized multi-coil sensitivity encoding (SENSE) MRI systems and deep-generated stochastic brain models to demonstrate the approach. Signal-known-statistically and background-known-statistically (SKS/BKS) binary signal detection tasks were selected to study the impact of different acceleration factors on the data-space IO performance.
Task-based regularization in penalized least squares for binary signal detection tasks in medical image denoising
Image denoising algorithms have been extensively investigated for medical imaging. To perform image denoising, penalized least-squares (PLS) problems can be designed and solved, in which the penalty term encodes prior knowledge of the object being imaged. Sparsity-promoting penalties, such as total variation (TV), have been a popular choice for regularizing image denoising problems. However, such hand-crafted penalties may not be able to preserve task-relevant information in measured image data and can lead to oversmoothed image appearances and patchy artifacts that degrade signal detectability. Supervised learning methods that employ convolutional neural networks (CNNs) have emerged as a popular approach to denoising medical images. However, studies have shown that CNNs trained with loss functions based on traditional image quality measures can lead to a loss of task-relevant information in images. Some previous works have investigated task-based loss functions that employ model observers for training the CNN denoising models. However, such training processes typically require a large number of noisy and ground-truth (noise-free or low-noise) image data pairs. In this work, we propose a task-based regularization strategy for use with PLS in medical image denoising. The proposed task-based regularization is associated with the likelihood of linear test statistics of noisy images for Gaussian noise models. The proposed method does not require ground-truth image data and solves an individual optimization problem for denoising each image. Computer-simulation studies are conducted that consider a multivariate-normally distributed (MVN) lumpy background and a binary texture background. It is demonstrated that the proposed regularization strategy can effectively improve signal detectability in denoised images.
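A schematic sketch of a PLS denoising problem with an added task-consistency term on a linear test statistic (this is an illustration of the general idea, not the authors' formulation; the smoothness penalty and weights are assumptions):

```python
import numpy as np

def denoise_task_pls(y, w, beta=0.5, lam=1.0, n_iter=500, lr=0.2):
    """min_x 0.5||x-y||^2 + beta*||diff(x)||^2 + lam*(w@x - w@y)^2/(2 w@w)."""
    x = y.copy()
    t_obs = w @ y                                 # test statistic of the noisy image
    for _ in range(n_iter):
        dx = np.diff(x)
        grad_smooth = np.zeros_like(x)
        grad_smooth[:-1] -= 2 * dx                # gradient of the quadratic
        grad_smooth[1:] += 2 * dx                 # smoothness penalty
        grad_task = lam * (w @ x - t_obs) / (w @ w) * w
        x -= lr * ((x - y) + beta * grad_smooth + grad_task)
    return x

rng = np.random.default_rng(7)
w = rng.normal(size=256)                          # known linear observer template
y = rng.normal(size=256)                          # noisy image (flattened)
x_hat = denoise_task_pls(y, w)
```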
Investigating the impact of data consistency in task-informed learned image reconstruction method
Various supervised learning-based medical image reconstruction methods have been developed with the goal of improving image quality (IQ). These methods typically use loss functions that minimize pixel-level differences between the reconstructed and high-quality target images. While they may seemingly perform well based on traditional image quality metrics such as mean squared error, they do not consistently improve objective IQ measures based on diagnostic task performance. This work introduces a task-informed learned image reconstruction method. To establish the method, a measure of signal detection performance is incorporated in a hybrid loss function that is used for training. The proposed method is inspired by null space learning, and a task-informed data-consistent (DC) U-Net is utilized to estimate a null space component of the object that enhances task performance, while ensuring that the measurable component is stably reconstructed using a regularized pseudo-inverse operator. The impact of changing the specified task or observer at inference time to be different from that employed for model training, a phenomenon we refer to as "task-shift" or "observer-shift", respectively, was also investigated.
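For reference, a numerical illustration of the range/null-space decomposition that null space learning builds on: the measurable component A⁺Ax is fixed by the data, and the network only estimates the null-space component (A below is a toy operator):

```python
import numpy as np

rng = np.random.default_rng(10)
A = rng.normal(size=(32, 64))        # toy under-sampled forward operator
A_pinv = np.linalg.pinv(A)

x = rng.normal(size=64)              # object
x_meas = A_pinv @ (A @ x)            # measurable (range) component
x_null = x - x_meas                  # null-space component, invisible to A
assert np.allclose(A @ x_null, 0, atol=1e-8)
```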
Direct optimization of signal detection metrics in learning-based CT image restoration
Convolutional neural networks (CNNs) used for medical image restoration tasks are typically trained by minimizing pixel-wise error metrics, such as mean-squared error (MSE). However, CNNs trained with these losses are prone to wiping out small/low-contrast features that can be critical for screening and diagnosis. To address this issue, we introduce a novel training loss designed to preserve weak signals in CNN-processed images. The key idea is to measure model observer performance on a user-specified signal detection task implanted in the training data. The proposed loss improves on the recently introduced Observer Regularizer (ObsReg) loss [1, 2], which is not directly interpretable in terms of signal detection theory and requires specialized training. In contrast, the proposed loss function is defined directly in terms of a classical signal detection metric and does not require specialized training. Finally, our experiments on synthetic sparse-view breast CT data show that training a CNN with the proposed loss yields improvement in model observer performance on a signal-known-exactly/background-known-exactly detection task compared to training with the ObsReg loss.
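A sketch of a loss of this kind: a differentiable detectability (SNR²) term computed from a linear template's responses to CNN outputs, combined with MSE (the template, weighting, and names are assumptions):

```python
import torch

def detectability_snr2(pred_present, pred_absent, template):
    """Differentiable d'^2-style statistic of template responses."""
    tp = (pred_present.flatten(1) * template).sum(dim=1)
    ta = (pred_absent.flatten(1) * template).sum(dim=1)
    num = (tp.mean() - ta.mean()) ** 2
    den = 0.5 * (tp.var() + ta.var())
    return num / den

def training_loss(pred_p, pred_a, target_p, target_a, template, alpha=0.01):
    # Pixel-wise fidelity minus a reward for preserving signal detectability
    mse = ((pred_p - target_p) ** 2).mean() + ((pred_a - target_a) ** 2).mean()
    return mse - alpha * detectability_snr2(pred_p, pred_a, template)
```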
Investigating usable information for assessing the impact of medical image processing
The data processing inequality (DPI) in information theory posits that no data processing can increase the mutual information between data and their task labels. For any post-processing method, the mutual information between post-processed images and task labels should always be less than or, at best, equal to the mutual information between raw images and diagnostic task labels. This is consistent with the fact that the performance of an ideal Bayesian observer cannot be improved through image processing. As such, mutual information is generally not suitable for evaluating the effects of image processing. Recently, a novel variant of mutual information, termed V-information (V-info), has been introduced to account for the computational constraints associated with a sub-ideal observer. In contrast to conventional mutual information, V-info can increase as a result of processing of data, making it a promising task-oriented metric for assessing the impact of image processing. In this study, for the first time, we investigate the application of V-info, which we refer to by the more readily meaningful term "observable-usable information" (O-U-Info), for evaluating the impact of medical image processing. Specifically, we examine deep learning-based super-resolution as the image processing operation. A deep learning-based numerical observer (NO) is employed to perform a Rayleigh binary signal discrimination task using low-resolution, high-resolution, and super-resolved images. We quantify O-U-Info under conditions of varying NO capacity and dataset size. The results demonstrate the potential usefulness of O-U-Info as an objective metric for assessing the impact of medical image processing.
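For context, the standard definition of V-information from the usable-information literature, where 𝒱 is the predictive family the computationally constrained observer can realize:

```latex
% V-information: usable information under computational constraints
I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y \mid \varnothing) - H_{\mathcal{V}}(Y \mid X),
\qquad
H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} \mathbb{E}\left[-\log f[X](Y)\right]
```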
Data Issues for AI Assessment
Ambient denoising diffusion generative adversarial networks for establishing stochastic object models from noisy image data
It is widely accepted that medical imaging systems should be objectively assessed via task-based image quality (IQ) measures that ideally account for all sources of randomness in the measured image data, including the variation in the ensemble of objects to be imaged. Stochastic object models (SOMs) that can randomly draw samples from the object distribution can be employed to characterize object variability. To establish realistic SOMs for task-based IQ analysis, it is desirable to employ experimental image data. However, experimental image data acquired from medical imaging systems are subject to measurement noise. Previous work investigated the ability of deep generative models (DGMs) that employ an augmented generative adversarial network (GAN), AmbientGAN, to establish SOMs from noisy measured image data. Recently, denoising diffusion models (DDMs) have emerged as a leading DGM for image synthesis and can produce superior image quality compared to GANs. However, original DDMs have a slow image-generation process because of the Gaussian assumption in the denoising steps. More recently, the denoising diffusion GAN (DDGAN) was proposed to permit fast image generation while maintaining generated image quality comparable to that of the original DDMs. In this work, we propose an augmented DDGAN architecture, Ambient DDGAN (ADDGAN), for learning SOMs from noisy image data. Numerical studies that consider clinical computed tomography (CT) images and digital breast tomosynthesis (DBT) images are conducted. The ability of the proposed ADDGAN to learn realistic SOMs from noisy image data is demonstrated. The ADDGAN significantly outperforms the advanced AmbientGAN models for synthesizing high-resolution medical images with complex textures.
Comparative analysis of data representativeness across medical image datasets using multidimensional similarity measures
The purpose of our study was to quantify the representativeness of various characteristics across different medical imaging datasets. We extended our prior work with the Jensen-Shannon distance (JSD), a measure of similarity between two distributions based on a single attribute, to include multiple attributes. Previous research had measured similarity across datasets in terms of demographic attributes and disease states. However, those methods calculated a JSD score for each category separately (e.g., a JSD score for race only). In this study, we aimed to extend that approach by developing a multidimensional JSD score that incorporates multiple demographic attributes and disease state into a single score. We examined two methods. The first, an aggregate method, lists all possible combinations of attributes (demographic and disease state), counts instances from each dataset, and compares their similarity using the JSD. The second method involves Factor Analysis of Mixed Data (FAMD), a dimensionality reduction technique designed for datasets containing both categorical and numerical data [2]. For our analysis, Principal Component Analysis (PCA) was applied to the age attribute, while Multiple Correspondence Analysis (MCA) was used for all other demographic and disease attributes. The data points were projected onto the one-dimensional axis with the highest variance (eigenvalue) and then binned to create probability distributions. These distributions were then compared using the JSD. In this study, we examined the demographic distributions of imaging data available in the MIDRC data commons, using regional metadata (3-digit ZIP code prefix) to assess regional variation in demographics. We found that the FAMD method provided a way to measure population representativeness and, in particular, could be utilized for datasets that are small compared to the number of aggregate combinations.
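A minimal sketch of the aggregate method: joint counts over attribute combinations from two datasets are normalized and compared with the Jensen-Shannon distance (the counts below are illustrative):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Joint cells, e.g. (sex x race x disease state) flattened to one axis
counts_a = np.array([120, 80, 45, 30, 200, 95, 60, 25], dtype=float)
counts_b = np.array([100, 90, 50, 40, 180, 110, 40, 35], dtype=float)

p = counts_a / counts_a.sum()
q = counts_b / counts_b.sum()
jsd = jensenshannon(p, q, base=2)    # 0 = identical, 1 = maximally different
print(f"JSD = {jsd:.4f}")
```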
Dataset distillation in medical imaging: a feasibility study
Data sharing in the medical image analysis field has potential yet remains underappreciated. The aim is often to share datasets efficiently with other sites to train models effectively. One possible solution is to avoid transferring the entire dataset while still achieving similar model performance. Recent progress in data distillation within computer science offers promising prospects for sharing medical data efficiently without significantly compromising model effectiveness. However, it remains uncertain whether these methods would be applicable to medical imaging, since medical and natural images are distinct fields. Moreover, it is intriguing to consider what level of performance could be achieved with these methods. To answer these questions, we conduct investigations on a variety of leading data distillation methods, in different contexts of medical imaging. We evaluate the feasibility of these methods with extensive experiments in two aspects: 1) Assess the impact of data distillation across multiple datasets characterized by minor or great variations. 2) Explore the indicator to predict the distillation performance. Our extensive experiments across multiple medical datasets reveal that data distillation can significantly reduce dataset size while maintaining comparable model performance to that achieved with the full dataset, suggesting that a small, representative sample of images can serve as a reliable indicator of distillation success. This study demonstrates that data distillation is a viable method for efficient and secure medical data sharing, with the potential to facilitate enhanced collaborative research and clinical applications.
Enhancing radiological assessment of dust diseases: evaluating the impact of online self-assessment educational modules and feedback interventions
Dust diseases, a group of non-malignant interstitial lung disorders caused by prolonged inhalation of dust particles, significantly contribute to the global burden of lung disease. This study evaluated the effectiveness of online self-assessment educational modules and feedback interventions in improving radiological assessment of dust diseases using chest X-ray and lung CT cases. Through a longitudinal design, radiologists and trainees participated in multiple intervention points, with progress measured from baseline to post-intervention datasets. Test sets curated by senior radiologists ensured comparability in difficulty and relevance to dust diseases. Performance improvements, measured in sensitivity, specificity, and weighted Cohen's Kappa, were evaluated using paired Wilcoxon signed-rank tests. The Kruskal-Wallis test further explored associations between participant characteristics and performance gains. Cohen's Kappa was used to assess agreement with expert ratings on radiological features. The findings demonstrated enhanced agreement with expert ratings for CT assessments following the educational interventions, particularly in identifying and grading diffuse well-rounded opacities and predominant parenchymal abnormalities. However, improvements in sensitivity and specificity were not statistically significant. For X-ray assessments, specificity improvement was notable, especially among participants with a specialty interest in lung disease. These results suggest that while educational interventions can enhance certain aspects of radiological assessment, particularly for CT evaluations, further research is needed to optimize their effectiveness across all performance metrics.
Evaluating machine learning models: insights from the Medical Imaging and Data Resource Center mastermind challenge on pneumonia severity
Purpose: The MIDRC Mastermind Grand Challenge of modified radiographic assessment of lung edema (mRALE) tasked participants with developing AI/ML techniques for automated COVID severity assessment via mRALE scores on portable chest radiographs (CXRs). This follow-up study examines potential biases in submitted AI algorithms across demographic subgroups.
Approach: Models submitted during the test phase were evaluated against a non-public test set of CXRs (814 patients) annotated by radiologists for disease severity (mRALE score 0-24). Participants used diverse data and methods for training. Performance was measured using quadratic-weighted kappa (QWK). Bias analyses considered demographics (sex, age, race, ethnicity, and their intersections) using QWK. Bias was defined as statistically significant QWK subgroup differences (ΔQWK).
Results: Nine algorithms demonstrated good agreement with the reference standard (QWK 0.74-0.88). Of 19 subgroups, Native Hawaiian/Pacific Islander and American Indian/Alaska Native had insufficient samples. The Challenge winner (QWK=0.884 [0.819; 0.949]) was the only model for which no statistically significant subgroup ΔQWK could be identified. The median number of disadvantaged groups in terms of ΔQWK per model was 2, with the most frequently disadvantaged subgroups being older patients (75<age≤84 and age>84 years).
Conclusions: The Challenge demonstrated strong model performances but identified subgroup disparities. Bias analysis is essential, as models with similar accuracy may exhibit varying fairness.
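A minimal sketch of the agreement and subgroup analysis described above, with hypothetical scores standing in for the Challenge data:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(8)
reference = rng.integers(0, 25, size=814)                    # mRALE scores 0-24
predicted = np.clip(reference + rng.integers(-3, 4, size=814), 0, 24)

qwk = cohen_kappa_score(reference, predicted, weights="quadratic")
print(f"overall QWK = {qwk:.3f}")

# Per-subgroup QWK for a hypothetical binary demographic attribute
subgroup = rng.integers(0, 2, size=814)
for g in (0, 1):
    m = subgroup == g
    print(g, cohen_kappa_score(reference[m], predicted[m], weights="quadratic"))
```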
Sequestration of imaging studies in MIDRC: controlling for ingenuous and disingenuous use of sequestered data
Evaluation of AI/ML algorithm performance on a sequestered test set may lead to ingenuous and disingenuous use of the dataset, even though the data are not accessible to the developer. In the ‘ingenuous’ case, the resulting algorithm’s performance metric, for example the area under the receiver operating characteristic curve (AUC) for a classification algorithm, may unintentionally overestimate or underestimate the true algorithm performance. A developer may also attempt to learn from the sequestered test set by repeatedly evaluating the algorithm on subsets of the test set, i.e., a ‘disingenuous’ use that may lead to algorithm overfitting of the test set. Creating a metric that can be used to ‘dial in’ ideal dataset sampling to avoid each of these issues is an important area of investigation by the Medical Imaging and Data Resource Center (MIDRC, midrc.org). Building upon our prior work addressing the ingenuous case, we now also address disingenuous use of the test set through a hash-table implementation that incorporates the ThresholdoutAUC algorithm, and subsequently use the load factor metric to indicate overfitting to the test data. Furthermore, we devise analytical relationships between load factor and ThresholdoutAUC budget. Notably, the relationship between load factor and budget depends on a noise rate parameter. We unify these methods with our previous findings for ingenuous use of sequestered data, specifically the relationship between AUC variability and load factor, via the use case of a classifier trained to predict COVID-19 severity. The results show that while AUC standard error is inversely related to the load factor, the budget parameter from ThresholdoutAUC is directly related to the load factor and noise rate. Thus, we anticipate using the load factor as a ‘dial’ that controls the number of test subsets eligible for evaluation. Specifically, if the developer requests to operate at a particular ThresholdoutAUC budget, a specific load factor and noise rate combination can be determined that limits AUC variation while meeting the budget demand.
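A schematic sketch of a Thresholdout-style reusable holdout, after Dwork et al.; the ThresholdoutAUC variant and MIDRC's hash-table bookkeeping differ in detail:

```python
import numpy as np

class ReusableHoldout:
    """Answer repeated AUC queries on a holdout while spending a noise budget."""

    def __init__(self, threshold=0.02, sigma=0.01, budget=20, seed=0):
        self.threshold, self.sigma, self.budget = threshold, sigma, budget
        self.rng = np.random.default_rng(seed)

    def query(self, train_auc, holdout_auc):
        # If train and holdout agree to within a noisy threshold, release the
        # training estimate: nothing about the holdout is revealed.
        if abs(train_auc - holdout_auc) < self.threshold + self.rng.normal(0, self.sigma):
            return train_auc
        if self.budget <= 0:
            raise RuntimeError("holdout budget exhausted")
        self.budget -= 1                   # each revealed answer costs budget
        return holdout_auc + self.rng.normal(0, self.sigma)
```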
Poster Session
Assessment of an alpha version of Ommo tracking system in a surgical environment: a preliminary study
Show abstract
Tracking systems play a crucial role in providing feedback to healthcare professionals during medical interventions, especially in minimally invasive procedures that require navigation, such as laparoscopy. In this study, we aim to assess the accuracy and precision of a recent technology, a prototype version of the Ommo system, comprising a permanent magnet-based signal generator and active tracking sensors. An assessment platform was constructed using a KUKA robot (LBR iiwa 7 R800) with a position repeatability of ±0.1 mm. This platform guided an active tracking sensor along a grid of points within the working area of the permanent magnet-based signal generator. The positions and orientations reported by the Ommo system were compared with those of the KUKA robot at each point, and the average accuracy and precision in position and orientation were evaluated. Additionally, the potential interference with the Ommo tracking system caused by laparoscopic instruments, specifically a straight camera, laparoscopic forceps, and scissors, was evaluated. The results indicate promising performance, with a mean accuracy and precision of 1.3632 mm and 0.9640 mm in position and 0.0188 rad and 0.0101 rad in orientation, respectively. Regarding instrument interference, different materials of construction resulted in varying levels of disruption. Further studies are warranted to assess performance in diverse scenarios and larger working areas, particularly to evaluate the degradation of position and orientation accuracy at greater distances.
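The accuracy and precision figures above can be computed from paired poses, the robot's ground-truth poses versus those reported by the tracker. The sketch below shows the standard calculation (an assumption about methodology, not the study's code): accuracy as the mean error and precision as its standard deviation, with orientation error taken as the angle of the relative rotation.

```python
# Pose error metrics from paired measurements (a minimal sketch).
import numpy as np

def position_errors(p_true, p_meas):
    """Euclidean position error per grid point; inputs are (N, 3) arrays in mm."""
    return np.linalg.norm(p_meas - p_true, axis=1)

def orientation_errors(R_true, R_meas):
    """Angle (rad) of the relative rotation between paired (N, 3, 3) matrices."""
    R_rel = np.einsum("nij,nkj->nik", R_meas, R_true)  # R_meas @ R_true^T
    cos = (np.trace(R_rel, axis1=1, axis2=2) - 1) / 2
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Accuracy = mean error; precision = standard deviation of the error:
# e = position_errors(p_kuka, p_ommo); print(e.mean(), e.std())
```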
Real-time dual camera localization and orientation tracking for calibrating magnetically navigated capsule endoscopy
Show abstract
Magnetically navigated capsule endoscopy has transformed gastrointestinal diagnostics by enabling navigable, non-invasive observation of the gastrointestinal tract. A crucial aspect of research and development has been the pre-operation synchronization of the magnetic navigation system with the capsule’s camera. This study explores an advanced method for controlling the capsule by employing a dual-camera system to accurately track its precise 3D spatial location. Utilizing an overhead camera to determine planar coordinates and a side camera for vertical displacement, we have developed a methodology that allows real-time visualization of the capsule’s position and orientation. Preliminary findings demonstrate the effectiveness of this approach in accurately detecting and manipulating the capsule’s location. These results have significant implications for medical procedures, providing a higher degree of control over capsule endoscopes and advancing the potential for more precise interventions. Our research highlights the value of external camera systems in improving endoscopic technology and paves the way for future advancements in minimally invasive diagnostics. The source code has been made publicly available at https://github.com/hrlblab/capsule_endoscopy_vision.
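The fusion of the two views reduces to a simple coordinate composition: the overhead camera supplies (x, y) and the side camera supplies z. The sketch below is a hypothetical illustration of that idea only; the calibration scales, image origins, and detection pipeline are assumptions, and the released repository above should be consulted for the actual implementation.

```python
# Hypothetical dual-camera fusion: two 2D pixel detections -> one 3D position.
import numpy as np

def locate_capsule(overhead_px, side_px, mm_per_px_top, mm_per_px_side,
                   origin_top, origin_side):
    """Fuse pixel centroids from both cameras into a 3D position in mm."""
    x = (overhead_px[0] - origin_top[0]) * mm_per_px_top
    y = (overhead_px[1] - origin_top[1]) * mm_per_px_top
    z = (side_px[1] - origin_side[1]) * mm_per_px_side  # vertical image axis
    return np.array([x, y, z])
```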
Weighted circle fusion: ensembling circle representation from different object detection results
Show abstract
Recently, circle representation has emerged as a method to improve the identification of spherical objects (such as glomeruli, cells, and nuclei) in medical imaging studies. In traditional bounding-box-based object detection, combining results from multiple models improves accuracy, especially when real-time processing is not critical. Unfortunately, this widely adopted strategy is not readily available for combining circle representations. In this paper, we propose Weighted Circle Fusion (WCF), a simple approach for merging predictions from multiple circle detection models. Our method leverages the confidence scores associated with each proposed bounding circle to generate averaged circles. We evaluate the method on a proprietary dataset for glomerular detection in whole slide imaging (WSI) and find a performance gain of 5% compared to existing ensemble methods. Additionally, we assess the efficiency of two annotation methods, fully manual annotation and a human-in-the-loop (HITL) approach, in labeling 200,000 glomeruli. The HITL approach, which integrates machine learning detection with human verification, demonstrated remarkable improvements in annotation efficiency. The Weighted Circle Fusion technique not only enhances object detection precision but also notably reduces false detections, presenting a promising direction for future research and application in pathological image analysis. The source code has been made publicly available at https://github.com/hrlblab/WeightedCircleFusion.
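A simplified sketch of confidence-weighted circle fusion is given below, written in the spirit of weighted box fusion but over circles (x, y, r, score). The clustering rule and fused score are assumptions for illustration; the released repository above contains the actual method.

```python
# Simplified Weighted Circle Fusion sketch: cluster circles across models by
# circle IoU, then average each cluster weighted by confidence.
import numpy as np

def circle_iou(c1, c2):
    """IoU of two circles via the exact lens-shaped intersection area."""
    (x1, y1, r1), (x2, y2, r2) = c1[:3], c2[:3]
    d = np.hypot(x2 - x1, y2 - y1)
    if d >= r1 + r2:
        return 0.0
    if d <= abs(r1 - r2):  # one circle entirely inside the other
        inter = np.pi * min(r1, r2) ** 2
    else:
        a1 = r1**2 * np.arccos((d**2 + r1**2 - r2**2) / (2 * d * r1))
        a2 = r2**2 * np.arccos((d**2 + r2**2 - r1**2) / (2 * d * r2))
        a3 = 0.5 * np.sqrt((-d + r1 + r2) * (d + r1 - r2)
                           * (d - r1 + r2) * (d + r1 + r2))
        inter = a1 + a2 - a3
    union = np.pi * (r1**2 + r2**2) - inter
    return inter / union

def fuse_circles(detections, iou_thr=0.5):
    """Greedily cluster (x, y, r, score) circles, then confidence-average."""
    detections = sorted(detections, key=lambda c: -c[3])  # high score first
    clusters = []
    for c in detections:
        for cluster in clusters:
            if circle_iou(cluster[0], c) > iou_thr:
                cluster.append(c)
                break
        else:
            clusters.append([c])
    fused = []
    for cluster in clusters:
        w = np.array([c[3] for c in cluster])
        xyr = np.array([c[:3] for c in cluster])
        fused.append((*np.average(xyr, axis=0, weights=w), w.mean()))
    return fused
```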
Task-focused knowledge transfer from natural images for CT image quality assessment
Kazi Ramisa Rifa, Md. Atik Ahamed, Jie Zhang, et al.
Show abstract
Radiation dose and image quality in computed tomography (CT) are closely correlated: good-quality CT images better help radiologists diagnose diseases, and although increasing radiation dose improves image quality, it carries health risks for patients. Good-quality CT images at lower doses are therefore required to balance this trade-off. However, assessing the quality of low-dose CT images requires feedback from multiple radiologists, which is time-consuming and laborious. Although several studies demonstrate automated CT image quality assessment (IQA), completely reference-free tools are rare. Moreover, most existing deep learning methods rely on the availability of large CT datasets with IQA scores that serve as a proxy for radiologists’ assessments. Large labeled datasets can be challenging to obtain, and proxy IQA scores might not correlate well with the diagnostic quality criteria clinicians follow. To achieve an assessment closely aligned with radiologists’ feedback, we propose a novel, automated, and reference-free CT image quality assessment method, Task-Focused Knowledge Transfer (TFKT), which leverages natural images from similar tasks and an effective hybrid CNN-Transformer model. Extensive evaluations demonstrate the proposed TFKT’s effectiveness in accurately predicting radiologists’ IQA scores on in-domain data and in evaluating out-of-domain clinical images from pediatric CT exams.
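The knowledge-transfer recipe suggests a two-stage workflow: pretrain a quality-regression model on natural-image quality labels, then fine-tune it on radiologist-scored CT images. The sketch below shows a generic hybrid CNN-Transformer regressor under that assumption; the architecture is illustrative and is not the paper's actual TFKT design.

```python
# A hedged sketch of a hybrid CNN-Transformer IQA regressor for the
# pretrain-on-natural-images, fine-tune-on-CT workflow described above.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class HybridIQANet(nn.Module):
    """CNN feature extractor followed by a small Transformer encoder."""
    def __init__(self):
        super().__init__()
        cnn = resnet18(weights=None)
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])  # (B,512,H',W')
        enc_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(512, 1)  # scalar quality score

    def forward(self, x):
        f = self.backbone(x)                   # (B, 512, H', W')
        tokens = f.flatten(2).transpose(1, 2)  # (B, H'*W', 512) patch tokens
        return self.head(self.encoder(tokens).mean(dim=1)).squeeze(-1)

model = HybridIQANet()
# Stage 1: regress natural-image quality scores (e.g., with nn.MSELoss);
# Stage 2: fine-tune the same model on radiologist-provided CT IQA labels.
```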
Enhancing breast arterial calcification segmentation: a comparative study of AI and human reader variability
Show abstract
Breast arterial calcification (BAC) is linked to a higher risk of cardiovascular disease and can be detected on mammography. Traditionally, segmentation of BAC by human readers is time-consuming and prone to inter- and intra-observer variability. This study evaluates these variabilities and proposes a novel ensemble deep learning model combining nnU-Net and ResNet152 architectures to reduce them. Results demonstrate strong correlations (R² = 0.96 among human readers and R² = 0.97 between the AI model and readers) and consistent performance from the AI model, minimizing variability and providing a reliable, standardized approach to BAC segmentation.
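The abstract does not specify how the two networks are combined; a common choice, assumed here purely for illustration, is a weighted average of the per-pixel probability maps followed by a threshold.

```python
# A minimal sketch of one plausible ensembling scheme (an assumption,
# not the study's documented method).
import numpy as np

def ensemble_bac_mask(prob_nnunet, prob_resnet, w=0.5, thr=0.5):
    """Weighted average of two probability maps, thresholded to a BAC mask."""
    prob = w * prob_nnunet + (1 - w) * prob_resnet
    return (prob >= thr).astype(np.uint8)
```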
Quantifying uncertainty in lung cancer segmentation with foundation models applied to mixed domain datasets
Show abstract
Medical image foundation models have shown the ability to segment organs and tumors with minimal fine-tuning. These models are typically evaluated on task-specific in-distribution (ID) datasets, but reliable performance on ID datasets does not guarantee robust generalization to out-of-distribution (OOD) datasets. Importantly, once deployed for clinical use, it is impractical to obtain ‘ground truth’ delineations to assess ongoing performance drift, especially when images fall into the OOD category due to different imaging protocols. Hence, we introduced a comprehensive set of computationally fast metrics to evaluate the performance of multiple foundation models (Swin UNETR, SimMIM, iBOT, SMIT) trained with self-supervised learning (SSL). All models were fine-tuned on identical datasets for lung tumor segmentation from computed tomography (CT) scans. SimMIM, iBOT, and SMIT used identical architecture, pretraining, and fine-tuning datasets, allowing performance variations to be attributed to the choice of pretext task used in SSL. The evaluation was performed on two public lung cancer datasets (LRAD: n=140, 5Rater: n=21) with different image acquisitions and tumor stages than the training data (n=317, a public resource with stage III-IV lung cancers) and on a public non-cancer dataset containing volumetric CT scans of patients with pulmonary embolism (n=120). All models produced similarly accurate tumor segmentations on the lung cancer testing datasets. SMIT produced the highest F1-score (LRAD: 0.60, 5Rater: 0.64) and lowest entropy (LRAD: 0.06, 5Rater: 0.12), indicating a higher tumor detection rate and more confident segmentations. On the OOD non-cancer dataset, SMIT misdetected the fewest tumors, with a median volume occupancy of 5.67 cc compared with 9.97 cc for the next-best method, SimMIM. Our analysis shows that additional metrics such as entropy and volume occupancy may help better understand model performance on mixed-domain datasets.
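The two auxiliary metrics are straightforward to compute without ground truth, which is what makes them attractive for deployment monitoring. The sketch below shows their common definitions (the paper's exact formulations may differ in detail): mean voxelwise entropy of the predicted probabilities, and the predicted volume in cc derived from voxel spacing.

```python
# Ground-truth-free monitoring metrics for a segmentation model (a sketch).
import numpy as np

def mean_binary_entropy(p, eps=1e-8):
    """Mean voxelwise entropy of foreground probabilities p in [0, 1];
    lower values indicate more confident segmentations."""
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-p * np.log(p) - (1 - p) * np.log(1 - p)))

def volume_occupancy_cc(mask, spacing_mm):
    """Volume of the predicted binary mask in cc, given voxel spacing in mm.
    On a known non-cancer scan, any occupied volume is a misdetection."""
    voxel_mm3 = float(np.prod(spacing_mm))
    return float(mask.sum()) * voxel_mm3 / 1000.0  # 1 cc = 1000 mm^3
```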
Improvement in breast lesion classification utilizing deep learning and treatment response assessment maps (TRAMs)
Show abstract
Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) has been widely used for breast lesion diagnosis. However, standard DCE-MRI-based diagnosis has low specificity, leading to unnecessary biopsies and other invasive procedures. A treatment response assessment map (TRAM) is computed by subtracting the T1-weighted DCE-MRI acquired approximately five minutes after contrast agent injection from a delayed-phase T1-weighted acquisition. TRAM characterizes the spatial distribution of contrast accumulation and clearance, potentially aiding in differentiating between benign and malignant lesions. Meanwhile, deep learning-based modeling has shown promising results in many medical imaging diagnostic tasks. In this project, we developed a deep learning model dedicated to breast lesion classification based on TRAM. We used a 3D convolutional residual network (ResNet18) to learn image representations from TRAM; the ResNet18-extracted features were then fed to a fully connected classifier for lesion classification. The TRAM-based model was compared with a model trained on standard multi-phase DCE-MRI. The model trained on TRAM achieved a higher area under the receiver operating characteristic curve (AUROC) (0.870 vs. 0.835), higher sensitivity (0.848 vs. 0.818), and higher specificity (0.823 vs. 0.759) than the model trained on standard DCE-MRI. The presented TRAM-based analysis may aid clinical decision-making during diagnosis and treatment.
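The pipeline reduces to a volume subtraction followed by a 3D residual classifier. The sketch below assumes co-registered early (~5 min post-contrast) and delayed-phase volumes, and uses torchvision's video r3d_18 as a stand-in for the paper's 3D ResNet18; the single-channel stem and two-class head are illustrative assumptions.

```python
# A minimal sketch of TRAM computation plus a 3D ResNet-18 classifier.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

def compute_tram(early_t1, delayed_t1):
    """TRAM = delayed-phase signal minus ~5-minute post-contrast signal,
    assuming the two volumes are co-registered."""
    return delayed_t1 - early_t1

model = r3d_18(weights=None)
# Adapt the stem to single-channel MRI input (the stock model expects RGB video).
model.stem[0] = nn.Conv3d(1, 64, kernel_size=(3, 7, 7),
                          stride=(1, 2, 2), padding=(1, 3, 3), bias=False)
model.fc = nn.Linear(model.fc.in_features, 2)  # benign vs. malignant

tram = torch.randn(1, 1, 32, 128, 128)  # (batch, channel, depth, H, W) toy input
logits = model(tram)
```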