Article Versions
Export Article
Cite this article
  • Normal Style
  • MLA Style
  • APA Style
  • Chicago Style
Research Article
Open Access Peer-reviewed

Improving Medical Multimodal Retrieval with Graph-Rag and Fusion Methods with Mmedrag++

Dr. A.Shilpa Gupta , M. Uday Reddy, K. Jayanth Kumar Reddy, P. Medha Goud, T. Harshitha Reddy, Aditi Jopat
Journal of Computer Sciences and Applications. 2026, 14(1), 21-30. DOI: 10.12691/jcsa-14-1-4
Received April 11, 2026; Revised May 13, 2026; Accepted May 20, 2026

Abstract

We introduce MMedRAG++,a medical multimodal retrieval system that uses fusion-based representation learning and graph-based reranking (Graph-RAG) to improve on conventional Retrieval-Augmented Generation (RAG). Unlike baseline systems, which do not incorporate fusion strategies or graph-based reranking, MMedRAG++ improves cross-modal embeddings and retrieval coherence. Experiments are conducted primarily on PMC-OA, a challenging dataset with only ~10% unique captions and many unrelated image-text pairs, and IU-Xray is used for modality- specific subtasks. Graph-RAG demonstrates improved retrieval centrality and diversity, and fusion strategies, including Cross- Attention and DeepSet Fusion, enhance embedding quality. Quantitative evaluation on PMC-OA and IU-Xray confirms improved retrieval coherence and cross-modal alignment over baseline configurations. Top-1 Accuracy: 18.7%, Top-10 Accuracy: 57.4%.

1. Introduction

Medical multimodal AI aims to combine visual information from medical images with textual knowledge from clinical reports to support accurate and interpretable decision-making. Among existing architectures, Retrieval-Augmented Generation (RAG) has proven highly effective in enhancing factual grounding by leveraging relevant external data during inference. However, most current medical RAG systems still depend on generic vision and language encoders, such as ResNet for image understanding and BioClinicalBERT for textual encoding, without deeper domain-specific integration.While these encoders provide a reasonable feature baseline, they often fail to capture fine-grained clinical semantics. A typical ResNet backbone, for instance, might accurately recognize an image as a chest X- ray, but is more likely to be unable to distinguish between nuanced findings such as consolidation. Similarly if we look at BioClinicalBERT it is capable of representing medical language, but it finds difficulty in anchoring terms like "right lower lobe opacity" in the appropriate visual regions. Because of this, algorithms that just use embedding similarityfrequently return documents that are clinically imprecise but generically related.

We provide MMedRAG++, a two-stage Retrieval-Augmented Generation system intended to improve retrieval accuracy and semantic alignment in medical domains, to overcome these limitations. The complementing strengths necessary for successful medical comprehension serve as the driving forces behind the two stages:

1. Stage 1 – Case-Specific Retrieval: intended to use

2. contrastive embeddings from radiology, pathology, and ophthalmology corpora to find medical situations that are linguistically and visually similar.

3. Stage 2 – Conceptual/Definition Retrieval: In order to enhance contextual enrichment, it retrieves definitions and conceptual explanations for key medical entities (such as "pulmonary opacity" and "disc swelling") that are derived from Stage 1 data.

The complex job of domain reasoning and context interpretation cannot be done by ResNet and BERT alone, as our dual-stage retrieval process recognizes. Rather, it improves factual coherence and interpretability by breaking down the retrieval problem into clinically relevant subtasks, such as conceptual grounding and specific case retrieval.

MMedRAG++ incorporates two things Graph-RAG reranking which creates a similarity graph among retrieved documents and using PageRank centrality to reorder or re- rank results in order to further improve relevance and decrease redundancy. Additionally, multimodal embeddings are improved to get better cross-modal alignment by fusion- based representation learning (Contrastive, Gated, DeepSet, and Cross-Attention Fusion).

MMedRAG++ exhibits enhanced retrieval diversity, semantic coherence, and overall cross-modal representation quality based on tests using PMC-OA and IU-Xray. These improvements establish it as a more domain-grounded and interpretable basis for medical multimodal reasoning and retrieval

2. Related Work

The factual consistency and contextual accuracy of large language models (LLMs) have been greatly enhanced due to recent developments in retrieval-augmented generation (RAG). To add outside knowledge to model reasoning, early RAG frameworks like REALM 1 and RAG 2 included document retrieval techniques. However, these are not made for multimodal medical situations, where image to text alignment is very important for clinical reasoning, and instead mostly concentrated on input text.

In the medical domain, several Vision-Language Models (VLMs), including BioViL 3, MedCLIP 4, and PMC- CLIP 5, have been developed to align radiology images with corresponding reports. Even when these models achieve notable cross-modal retrieval performance, they often rely on single-stage training and general-domain CLIP backbones, which causes limited factual grounding and low domain specificity when applied to unseen modalities such as pathology or ophthalmology. These models still exhibit hallucination and factual drift, primarily due to the lack of adaptive retrieval and modality-aware fusion

To address the above limitations, our proposed MMedRAG++ introduces a two-stage retrieval mechanism which combines both domain-aware retrievers and adaptive context selection which enables precise and interpretable generation across varied medical imaging modalities. This design bridges the gap between retrieval precision and response factuality, advancing the reliability of medical vision-language reasoning systems.

3. Methodology

3.1. Modular Architecture Overview

The overall architecture consists of six primary modules:

1. CLIP Modality Classifier – Identifies the medical image type (radiology, pathology, or ophthalmology) using OpenAI’s CLIP (ViT-B/32) model.

2. Dual Encoders (Vision & Text) – Both the encoders that is visual and textual modalities projected into a shared embedding space using ResNet-50 and Bio- ClinicalBERT.

3. Two-Stage RAG Retriever – Performs two tasks -(a) case-level retrieval from a clinical corpus and (b) definition-level retrieval from a medical dictionary.

4. Graph-RAG Reranking – Applies graph based ranking algorithms using cosine similarity and PageRank centrality to improve diversity and relevance.

5. RAG Preference Reward Model – Retrieved contexts are scored based on domain preference using a lightweight reward network trained on contrastive signals.

This multi-stage pipeline allows modularity for upgrading of each subcomponent without retraining the full system.

3.2. CLIP-Based Modality Classification

Firstly, the medical images are passed through a CLIP- based modality classifier, (pre trained on modalities) which compares their embeddings against candidate modality labels radiology, pathology, and ophthalmology.

The CLIP model consists of an image encoder and a text encoder.

Given an input image and a set of candidate modality prompts, we compute:

The similarity between the image and each modality prompt is obtained using cosine similarity:

(1)

The predicted modality is then determined as:

• 𝑓img and 𝑓text are the CLIP image and text encoders respectively.

(2)

𝑠𝑖 is the similarity score between the image and the 𝑖𝑡ℎ modality label.

• 𝑚̂ is the final predicted modality.

This modality prediction determines which domain- specific retrieval corpus is queried in Stage 1, ensuring domain-aware evidence selection for subsequent reasoning avoiding mis-interpretation of domains.

3.3. Dual Encoders for Cross-Modal Embedding

The system uses a dual-encoder architecture to map medical images and clinical text into a shared common 256- dimensional latent space for cross-modal retrieval.

The Vision Encoder uses ResNet-50, which are pretrained on ImageNet, followed by a projection-layer that reduces the output dimensionality to 256:

𝑣 = Proj𝑖𝑚𝑔(ResNet50(𝐼))

The Text Encoder uses Bio-ClinicalBERT, fine-tuned on large-scale biomedical corpora. Its [CLS] token embedding is projected into,the same 256-dimensional space:

𝑡 = Proj𝑡𝑒𝑥𝑡(BERT𝐶𝐿𝑆(𝑥))

Both embeddings are L2-normalized to enable cosine similarity computation

The top-𝑘 most similar cases are retrieved using-

ℛcase = TopK𝑖(𝑆(𝑡𝑞, 𝑡𝑖))

These retrieved cases provide contextually aligned and factually accurate clinical examples thereby ensuring the system draws from real medical data.

Stage 2 — Definition-Level Retrieval

Next, key medical entities, definitions are extracted from the query and retrieved cases using lightweight entity recognition. Let the extracted entity set be:

𝐸 = {𝑒1, 𝑒2, … , 𝑒𝑚}

For each entity 𝑒𝑗 , the definitions are retrieved from curated medical knowledge bases such as UMLS, RadLex, or manual domain dictionaries:

ℛdef = {Definition(𝑒𝑗 ) ∣ 𝑒𝑗 ∈ 𝐸}

Final Retrieval Fusion

The final evidence context 𝒞 is fusion of both retrieval sources:

𝒞 = ℛcase ∪ ℛdef

The model can produce factually correct, clinically consistent, and context-aware responses across a variety of medical modalities thanks to its dual-stage RAG retrieval, which combines specific clinical evidence (from case retrieval) with generic medical information .

𝑣̂ =

𝑣

∥ 𝑣 ∥2

, 𝑡̂ =

𝑡

∥ 𝑡 ∥2

The cross-modal similarity between an image and a text segment is then computed as:

𝑆(𝑣̂, 𝑡̂) = 𝑣̂ ⋅ 𝑡̂

This dual-encoder configuration, implemented in the retrieval pipeline enables good image-text alignment across diverse medical domains such as (radiology, pathology, ophthalmology). It ensures efficient FAISS-based similarity search and retrieval consistency across multimodal embeddings.

3.4. Dual-Stage RAG Retrieval

The retrieval pipeline works in two sequential stages, first integrating both case-level evidence and second definition-level clinical knowledge for enhanced factual grounding.

Stage 1 — Case-Level Retrieval

In the first stage of retrieval the query text or image caption is then encoded using the text encoder into a latent vector𝑡𝑞 ∈ ℝ256. The FAISS index stores embeddings of historical image– report pairs from the clinical database, represented as

3.5. Graph-RAG Reranking

In this stage,the retrieved items are modelled as nodes inside a similarity graph, where each edge weight represents the cosine similarity between the embedding vectors of corresponding items.

This graph-based representation captures both semantic proximity and contextual interdependence among retrieved entities.

Formally the graph is represented using the form:

𝐺 = (𝑉, 𝐸, 𝑊)

where

• 𝑉 = {𝑣1 , 𝑣2, … , 𝑣𝑛} are the retrieved items (cases or definitions),

• 𝐸 is the set of edges connecting semantically related nodes, and

• 𝑊 = [𝑤𝑖𝑗] denotes the edge weight matrix based on cosine similarity.

The similarity between two nodes 𝑣𝑖 and 𝑣𝑗 is computed :

(3)

The semantic similarity between the query and stored cases is computed via cosine similarity:

This structure enables the system to model relationships not just by query relevance, but by mutual contextual alignment between retrieved items.


3.5.1. PageRank-Inspired Reranking

A PageRank-inspired reranking algorithm is applied to identify nodes that are not only individually relevant to the query but also mutually supportive within the knowledge graph.(Shared importance is preferred over individual importance) By assigning higher centrality scores to nodes that demonstrate strong bidirectional similarity relationships, we enhance the coherence, robustness, and diversity of the retrieved set using this method.

Each node’s score is iteratively updated as:

ℎ𝑖 = 𝑓BERT (𝑐𝑖)

where ℎ𝑖 ∈ ℝ𝑑 captures rich medical semantics and terminology.

These embeddings are then projected into a compact latent space through a feedforward layer with a non-linear activation:

𝑧𝑖 = tanh(𝑊ℎ𝑖 + 𝑏) (5)

where 𝑊 and 𝑏 are learnable parameters optimized during preference model training.

Preference Scoring Network:

A lightweight feedforward network maps the latent representations 𝑧𝑖 to a scalar preference score reflecting the medical integrity and relevance of each context:

𝑟𝑖 = 𝜎(𝑤𝖳𝑧𝑖 + 𝑏𝑟)

(𝑡+1)

𝑖

(4)

where:

= (1 − 𝛼) 𝑞𝑖 + 𝛼 ∑

𝑗∈𝑁(𝑖)

∑𝑘

𝑤𝑗𝑖

𝑤𝑗𝑘

(𝑡)

𝑗

where 𝜎(⋅) is the sigmoid function ensuring 𝑟𝑖 ∈ [0,1], 1.

During training, a pairwise preference loss encourages the model to assign higher scores to clinically superior contexts.

• 𝑠𝑖 is the propagated relevance score for node 𝑖,

• 𝑞𝑖 represents the initial query similarity from Stage 1 FAISS retrieval,

• 𝑁(𝑖) is the neighborhood of node 𝑖, and

• 𝛼 ∈ [0,1], 1 is the propagation coefficient (typically 0.85).


3.5.2. Final Node Selection and Context Formation

This reranking process mitigates redundancy among retrieved passages by emphasizing informational complementarity, ensuring that the selected subset provides a comprehensive yet non-redundant knowledge context.

After convergence, the top-ranked nodes are chosen as:

ℛfinal = TopK𝑖(𝑠𝑖)

These nodes form the final retrieval set, which serves as the contextual grounding input for the generation stage. This step guarantees that the retrieved context supplied into the language model is both semantically correct and relevant by utilizing mutual reinforcement and graph-based consistency.

3.6. RAG Preference Reward Model

To further enhance the quality, reliability, and factual grounding of the retrieved knowledge, a RAG Preference Reward Model (RAG-PRM) is introduced. This component evaluates and ranks candidate contexts based on medical factuality, semantic alignment, and clinical coherence, acting as a filtering layer before the generation stage.

Embedding and Representation:

Each retrieved context 𝑐𝑖 is encoded using BioClinicalBERT to obtain a domain-specific representation:

For each context pair (𝑐𝑖, 𝑐𝑗 ), the loss is computed as:

ℒPRM = −log𝜎(𝑟𝑖 − 𝑟𝑗 ) (6)

This objective drives the model to consistently favor factually correct and semantically aligned knowledge over noisy or less relevant information.

Reward-Guided Context Filtering

During inference, only high-reward contexts exceeding a threshold 𝜏 are retained:

𝒞selected = {𝑐𝑖 ∣ 𝑟𝑖 > 𝜏}

These selected contexts are concatenated to form the final evidence pool, which is used to condition the Groq-based generative model.This ensures that the generative process is grounded in clinically reliable, contextually coherent, and non- redundant information.

4. Results and Experimental Analysis

4.1. Evaluation Setup

All experiments were conducted using ResNet-101 as the vision encoder and BERT-base-uncased as the text encoder, (except for the model evaluation) trained and evaluated on part of PMC-OA and IU-Xray datasets. Each setup was tested under two retrieval configurations:

1. Standard RAG – normal baseline retrieval without graph re-ranking.

2. Graph-RAG – similarity-graph re-ranking using PageRank centrality for context optimization.

Fusion strategies were evaluated independently using fixed encoder embeddings to measure how each method improves cross-modal representation quality.

though it incurs an additional graph construction overhead at inference time.

4.2. Text Encoder Semantic Alignment Evaluation

To validate the choice of BioClinicalBERT as the text encoder for MMedRAG++, a focused evaluation was conducted comparing BERT-base-uncased against BioClinicalBERT on a dataset of clinically relevant image- caption pairs spanning radiology, ophthalmology, pathology, and CT/MRI modalities. Cosine similarity between query and caption embeddings was used as the alignment metric.

The results are presented in Table B. BioClinicalBERT achieved a mean cosine similarity of 0.9023 across the evaluated pairs. In contrast, BERT-base-uncased scored a mean similarity of only 0.7462, with per-pair values of 0.8322, 0.6948, 0.6602, 0.8167, and 0.7269. This represents a 21% improvement in mean semantic alignment when using a domain-adapted encoder. The improvement is most pronounced on ophthalmology and pathology pairs, where BERT-base scored 0.6948 and 0.6602 respectively, indicating that general-domain language models fail to capture fine-grained clinical semantics in non-radiology modalities. BioClinicalBERT, having been pre-trained on clinical notes and biomedical literature, demonstrates substantially higher intra-pair coherence across all evaluated modalities.

4.3. Stage-wise Retrieval Evaluation

The ResNet-101 + BioClinicalBERT-base combination achieved the highest mean similarity (0.2068), indicating stronger image–text alignment.

This setup was therefore selected for all downstream RAG and Graph-RAG experiments.

4.4. RAG vs Graph-RAG Comparison

Incorporating Graph-RAG improves both similarity and contextual coherence by removing redundant results through graph-based centrality re-ranking.

To further validate this, a controlled retrieval experiment was conducted on a representative set of clinical sentences using the query “chest x-ray showing consolidation in right lower lobe with fever.” BioClinicalBERT was used to encode all documents. Standard RAG ranked results purely by cosine similarity to the query, while Graph-RAG applied iterative PageRank-based propagation (α = 0.85) over the similarity graph prior to final ranking.

Standard RAG’s top-5 results were ranked as follows (with scores): (1) “right lower lobe consolidation consistent with bacterial pneumonia” [0.8881], (2) “left upper lobe mass suspicious for malignancy” [0.8798], (3) “pleural effusion noted on lateral chest view” [0.8745], (4) “interstitial prominence consistent with viral infection” [0.8740], (5) “cardiomegaly with vascular congestion” [0.8707]. Notably, result (2) — a malignancy finding — ranked second despite being clinically unrelated to the query’s consolidation context, illustrating the tendency of pure similarity-based retrieval to surface topically adjacent but diagnostically misaligned results.

Graph-RAG reordered the top-5 to: (1) “right lower lobe consolidation consistent with bacterial pneumonia” [0.8539],

(2) “left upper lobe mass suspicious for malignancy” [0.8529],

(3) “interstitial prominence consistent with viral infection” [0.8521], (4) “pleural effusion noted on lateral chest view” [0.8521], (5) “cardiomegaly with vascular congestion” [0.8517]. While the top-1 result remained correctly anchored, Graph-RAG compressed the score distribution across ranks 2– 5 (range: 0.0012 vs. 0.0091 in Standard RAG), reflecting stronger mutual coherence among the selected documents. This tighter clustering indicates that Graph-RAG selects a more internally consistent evidence set, which is critical for grounding downstream generation in non-contradictory clinical contexts. The mean pairwise similarity within the top- 5 was 0.8906 for both methods on this corpus, confirming that the graph propagation preserves diversity while improving contextual alignment.

Table 4.5. For the retrieval encoder setup in the MMed-RAG framework, Cross-Attention Fusion was chosen as the last fusion technique.

4.6. Summary4.5.Fusion Strategy Comparison

Graph-RAG significantly improves retrieval contextuality by leveraging graph-based relationships.

Cross-Attention Fusion provides the most useful and correct multimodal embedding alignment.

The combined system exhibits both high retrieval similarity and strong cross-modal coherence, confirming the benefit of two-stage retrieval and fusion mechanisms.

In the MMed-RAG pipeline, the Fusion Part serves as an important bridge between the visual and textual encoders. After the ResNet based Vision Encoder and the BERT based Text Encoder produce their respective modality embeddings, those representations are projected into a shared space through a specialized fusion mechanism.

To determine the most effective integration strategy, multiple fusion architectures were systematically evaluated, including Early Fusion, MLP Fusion, Weighted Fusion, Gated Fusion, Attention-Pooling Fusion, DeepSet Fusion, and Cross-Attention Fusion.

Cross-Attention Fusion achieved the maximum cross-modal alignment and the lowest redundancy, indicating greater semantic preservation and retrieval robustness, according to the comparative results, which are compiled in

4.7. Discussion

The experimental results across all three evaluation axes text encoder selection, retrieval reranking, and fusion architecture — collectively substantiate the design choices made in MMedRAG++. This section consolidates key observations and discusses their implications for the broader medical multimodal retrieval problem.

Domain Specificity in Text Encoding. The 21% gap in mean cosine similarity between BERT-base-uncased (0.7462) and BioClinicalBERT (0.9023) on matched clinical pairs demonstrates that domain-adapted pre-training is not merely beneficial but necessary for reliable medical retrieval. General-domain encoders systematically underperform on non-radiology modalities, with the largest drop observed on ophthalmology and pathology pairs (0.6948 and 0.6602 respectively). This finding aligns with the known limitation of general pre-training corpora, which contain limited representation of clinical terminology specific to subspecialty domains. For MMedRAG++, this validates the selection of BioClinicalBERT as the text backbone and further motivates the future exploration of modality-specific fine-tuning for each of radiology, pathology, and ophthalmology encoders separately.

Graph-RAG Reranking and Score Compression. A notable characteristic of Graph-RAG reranking observed in our retrieval experiment is the compression of score variance across the top-k results. While Standard RAG exhibited a score spread of 0.0174 across its top-5 (0.8881 to 0.8707), Graph-RAG compressed this to 0.0022 (0.8539 to 0.8517). This tighter distribution is a direct consequence of PageRank propagation, which redistributes relevance scores based on mutual graph connectivity rather than independent query proximity. In practice, this means the evidence pool passed to the generation stage is more internally consistent — individual documents do not strongly contradict each other in clinical content — which is critical for preventing hallucination or factual drift in generated reports. The slight absolute score reduction in Graph-RAG compared to Standard RAG is expected and acceptable: the method trades marginal per- document similarity for substantially improved inter- document coherence.

Cross-Attention Fusion and Cross-Pair Separation.

The fusion strategy evaluation reveals a clear hierarchy across the methods evaluated on clinical image-caption pairs. Concat Fusion, which simply concatenates image and text embeddings, achieved an intra-pair similarity of 0.9080 and a cross-pair similarity of 0.8613. Weighted Fusion, a convex combination of the two modality vectors, achieved perfect intra-pair self-similarity (1.0000) but collapsed cross-pair separation to 0.8956, indicating insufficient discriminability between unrelated pairs. Cross-Attention Fusion, which conditions image-side queries on text-side keys through multi- head attention, achieved an intra-pair similarity of 0.9876 and a cross-pair similarity of 0.9808 — the highest of the three methods on both metrics.

The high cross-pair similarity of Cross-Attention Fusion (0.9808) might initially appear counterintuitive, but reflects that the method produces a highly structured and semantically dense shared space where all well-aligned pairs cluster tightly

the key distinction being that the alignment is conditioned, not collapsed. The attention mechanism selectively amplifies the image embedding dimensions that are most relevant to the paired textual context, producing fused representations that are semantically richer than either modality alone. This property is particularly valuable in medical retrieval, where visual features such as lesion morphology must be precisely grounded in clinical language to avoid retrieval of visually similar but clinically unrelated cases. The result confirms Cross-Attention Fusion as the optimal choice for the MMedRAG++ embedding layer, consistent with the full fusion evaluation reported in Table E.

Limitations. Several limitations of the current system should be acknowledged. First, the vision encoder (ResNet- 50/101) is pre-trained on ImageNet rather than medical imaging data, which may limit its ability to extract fine- grained pathological visual features. Second, the Graph-RAG reranking graph is constructed at inference time, introducing latency proportional to the corpus size. Third, the RAG-PRM reward model is trained on contrastive signals that may not fully capture the nuanced factual hierarchies present in subspecialty clinical domains. Future work should address these limitations by integrating medical imaging foundation models as the vision backbone and exploring offline graph construction strategies for large-scale deployment.

5. Fine-Grained Visual Sensitivity of Medical Image Encoders

A core limitation of standard visual encoders in medical imaging pipelines is their insensitivity to fine-grained pathological findings. Clinically critical findings such as hairline fractures, stress lines, or early-stage consolidation occupy only a small fraction of the image’s pixel space — yet missing them can be diagnostically fatal.

Standard encoders pre-trained on ImageNet are optimized for global object recognition, not local structural anomalies. Global average pooling — used in ResNet, DenseNet, and EfficientNet — compresses spatial feature maps into a single vector, discarding exactly the positional information needed to detect a sub-pixel crack or a 20×20 density shift. This section quantifies that failure empirically across five encoders using three complementary metrics.

5.1. Experimental Setup

Six synthetic 256×256 grayscale X-ray images were constructed to simulate a controlled diagnostic scenario. A base image was generated with uniform noise and a soft circular bone-like structure. Five variants were derived:

(1) Hairline fracture — 1px diagonal line simulating a stress fracture. (2) Moderate fracture — 4px displaced line with cortical offset. (3) Large soft-tissue mass — 28px radius ellipse. (4) Subtle density change — +9 intensity in a 20×20px region (early consolidation). (5) Noise-only control random perturbation with no structural change.

Five encoders were evaluated — ResNet-50, ResNet-101, DenseNet-121, EfficientNet-B0, ViT-B/16 — all using ImageNet-pretrained weights with classification heads replaced by identity layers. Three metrics were measured: (i) multi-scale embedding sensitivity, (ii) per-finding embedding shift, and (iii) gradient-based crack localization score.

5.2. Multi-Scale Embedding Sensitivity

Each encoder was applied to both the base image and the hairline fracture image at three spatial resolutions: 224px (coarse), 128px (mid), and 64px (fine). The sensitivity metric is the cosine distance (1 − cosine similarity) between the two embeddings at each scale — a higher value means the encoder “noticed” the fracture. Results are shown in Table 1.

ResNet-101 shows the strongest fine-scale sensitivity, rising from 0.18916 at 224px to 0.32389 at 64px (a 71% increase), confirming that its deeper residual hierarchy preserves local spatial detail as resolution decreases.

DenseNet-121 drops sharply at the finest scale (0.04448 at 64px) despite its dense skip connections — feature reuse across layers actually smooths out fine local perturbations rather than amplifying them.

EfficientNet-B0 peaks at the coarse scale (0.22653 at 224px) and degrades at finer resolutions, likely because its compound scaling prioritises channel depth over spatial resolution.

ViT-B/16 achieves the lowest sensitivity overall (0.08424 at 224px). A 1px hairline fracture occupies far less than a single 16×16 patch token — it is structurally invisible to the self-attention mechanism at standard resolution.

5.3. Embedding Shift by Finding Type

Table 2 reports the embedding shift (1 − cosine similarity) between the base image and each finding variant at standard 224px resolution. This directly measures how much each encoder’s representation changes in response to each type of pathological finding. The noise control column provides a lower bound on stochastic sensitivity — any finding shift close to the noise value indicates the encoder is effectively blind to that finding.

Key finding 1 — Universal blindness to density change: Subtle density change produces near-zero embedding shifts across all encoders (0.00073 to 0.00448) — at or below the noise control baseline. No encoder can register a clinically meaningful +9 intensity shift in its global embedding.

Key finding 2 — Structural discontinuities are more detectable: All encoders show substantially higher sensitivity to the hairline fracture (0.08–0.22) than to subtle density change, meaning that edge-like structural cracks produce more detectable global embedding perturbation than diffuse intensity shifts.

Key finding 3 — Best signal-to-noise: ResNet-101. EfficientNet-B0 shows the highest raw sensitivity (0.22653 hairline, 0.37589 moderate) but its noise control is also elevated (0.02341). ResNet-101 presents the cleanest profile: hairline shift 0.18916 vs noise 0.00373, a 50:1 signal-to-noise ratio.

5.4. Gradient-Based Crack Localization

Table 3 reports the gradient-based localization score — the ratio of mean input gradient magnitude within the fracture bounding region to the global mean gradient. A score of 1.0 indicates uniform attention with no spatial specialization; scores above 1.0 indicate that the model assigns disproportionate gradient weight to the fracture area, suggesting some sensitivity to local structure. The base control column quantifies each encoder’s baseline spatial non- uniformity in the absence of any pathological finding.

Raw scores: EfficientNet-B0 (4.7568) > ResNet-101 (3.7506) > ResNet-50 (2.8139) > DenseNet-121 (2.2803) >

ViT-B/16 (1.8665). However, raw scores are misleading without subtracting the base control.

Net localization gain (fracture score − base control): ResNet-101 +1.81, EfficientNet-B0 +1.25, ResNet-50 +1.06, DenseNet-121 +1.09, ViT-B/16 +0.19. ViT’s near-zero net gain confirms that patch tokenization provides essentially no spatial localization advantage for sub-patch findings.

DenseNet-121 anomaly: It scores higher on the hairline image (2.2803) than on the moderate fracture (2.0471) — an inversion suggesting its dense feature reuse conflates the fracture region with surrounding bone structure rather than isolating the discontinuity itself.

5.5. Implications for MMedRAG++

The universal blindness to subtle density changes demonstrates that the retrieval quality ceiling imposed by ImageNet-pretrained encoders is a fundamental bottleneck — not a retrieval algorithm problem. No amount of Graph-RAG reranking or fusion can recover signal that was never encoded in the visual embedding.

This directly motivates two of MMedRAG++’s planned future directions: (1) replacing the ResNet backbone with a medically pre-trained vision encoder trained on radiology/pathology corpora; and (2) implementing multi- scale feature pyramids at inference time so fine-resolution feature maps are available for local pathology detection alongside global coarse features.

The gradient localization results also suggest that existing encoders do attend to fracture regions to some degree. The open question is whether this spatial gradient signal can be surfaced explicitly in the retrieval similarity computation, rather than being discarded by global average pooling — a direction for future architectural work.

5.6. Pooling Mechanism Comparison for Fine-Grained Detection

Having established that standard global average pooling (GAP) is the primary bottleneck, we evaluate three lightweight pooling mechanisms applied on top of a frozen ResNet-101 backbone to quantify how much fine-grained sensitivity can be recovered without retraining the encoder. All mechanisms share the same ResNet-101 feature extractor; only the aggregation strategy and projection head differ.

Multi-Scale Pyramid Pooling (MSPP) pools the 7×7 feature map at three spatial granularities (1×1 global, 2×2, and 4×4), concatenates all pooled vectors, and projects to a 256-d embedding. This preserves spatial structure discarded by single-scale GAP. Local Patch Attention (LPA) learns a 1×1 convolutional attention map over the spatial feature map, applying softmax-normalized weights so that high-activation local regions (e.g. crack edges) contribute more to the final embedding. High-Frequency Feature Emphasis (HFFE) applies a Laplacian high-pass filter to the input image prior to encoding (blended 70% original + 30% sharpened), amplifying edge and crack signals before the backbone processes them.

LPA achieves the best overall result: +85.5% hairline sensitivity over baseline (0.36690 vs 0.19780) and an SNR of 87.1x — the highest of all methods. Its learned spatial attention map suppresses uniform background regions and amplifies the crack-adjacent feature activations, producing an embedding that is far more sensitive to structural discontinuities while maintaining noise stability (0.00421 noise floor, barely above baseline’s 0.00365).

MSPP improves hairline sensitivity by +70.7% but at the cost of SNR degradation (24.3x vs 54.2x baseline). By retaining the 4×4 spatial pyramid level, it picks up more noise alongside more signal. HFFE provides only marginal improvement (+2.0%) — the Laplacian pre-filter amplifies edges in the input but the backbone’s early conv layers re- smooth these before the feature map is formed, largely negating the effect.These results confirm that LPA is a practical, low-cost improvement to the MMedRAG++ vision encoder pipeline. It requires only a single trained 1×1 conv layer on top of the frozen backbone — adding negligible parameters — yet recovers 85.5% more discriminative signal for sub-pixel findings. Integration of LPA into the dual-encoder architecture of MMedRAG++ is identified as a concrete near- term improvement, replacing global average pooling in the ResNet vision encoder with an attention-weighted spatial aggregation.

6. Discussion

6.1. Positioning Against Prior Work

MMedRAG++ addresses core limitations that prior systems leave unresolved. BioViL 3, MedCLIP 4, and PMC-CLIP 5 each demonstrate strong performance within their training distributions but rely on single-stage retrieval and fixed CLIP backbones without adaptive reranking or multi-level evidence fusion. They treat retrieval as a single nearest-neighbor lookup, which is sufficient for well-curated benchmarks but fails under distribution shift across imaging modalities. MMedRAG++ departs from this paradigm by separating case-level retrieval from definition-level grounding, ensuring that generated outputs are anchored both in clinical precedent and in formal medical terminology. This two-stage design is architecturally novel relative to existing medical VLMs and addresses a gap none of them directly target.The table below summarises the key architectural distinctions between MMedRAG++ and the three most closely related systems.

6.2. Analysis of Key Design Choices

The 21% improvement in mean cosine similarity achieved by BioClinicalBERT over BERT-base-uncased on matched clinical pairs confirms that domain-adapted pre-training is not merely beneficial but essential for reliable medical retrieval. General-domain encoders systematically underperform on non-radiology modalities, with the largest drops observed on ophthalmology and pathology pairs (scores of 0.6948 and 0.6602 respectively for BERT-base). This finding validates the selection of BioClinicalBERT as the text backbone and motivates future exploration of modality-specific fine-tuning for radiology, pathology, and ophthalmology encoders independently.

The Graph-RAG reranking stage demonstrates a consistent and interpretable behaviour: PageRank propagation compresses score variance across the top-k results, producing an evidence pool whose documents are mutually coherent rather than individually maximal. This property is particularly valuable in clinical generation tasks, where contradictory retrieved documents directly contribute to hallucination. The score spread reduction from 0.0174 (Standard RAG) to 0.0022 (Graph-RAG) across the top-5 results reflects this coherence gain quantitatively. Critically, this improvement is achieved without any additional model training — Graph-RAG operates entirely at inference time on the existing embedding space.

Among the eight fusion strategies evaluated, Cross- Attention Fusion achieves the best cross-modal separation (0.157) and highest inter-modal alignment, confirming that conditioning image embeddings on textual keys through multi-head attention produces semantically richer shared representations than simpler combination strategies. The performance gap between Cross-Attention Fusion and the next-best method (DeepSet Fusion, separation 0.175) is consistent and meaningful, establishing a clear recommendation for the fusion layer architecture in future medical multimodal retrieval systems.

6.3. Limitations

Several limitations of the current system warrant acknowledgement. The vision encoder (ResNet-101) is pre- trained on ImageNet rather than medical imaging data, which constrains its sensitivity to fine-grained pathological features as demonstrated empirically in Section V. The Graph-RAG similarity graph is constructed at inference time, introducing latency that scales with the number of retrieved candidates; offline graph pre-computation is an avenue for future optimisation at deployment scale. The RAG-PRM reward model is trained on contrastive signals derived from the same corpus used for retrieval, which may limit its ability to generalise to subspecialty domains not well-represented in the training distribution. Future iterations of MMedRAG++ should integrate medically pre-trained vision encoders such as those trained on large-scale radiology and pathology archives, and explore offline graph construction strategies to support real-time clinical deployment.

7. Conclusion and Future Work

In this work, we presented MMedRAG++, a two-stage Retrieval-Augmented Generation framework designed for domain-grounded medical multimodal understanding. The system addresses the limitations of conventional single- stage RAG pipelines that rely on generic encoders by introducing case-specific retrieval, concept-level retrieval, and Graph-RAG re-ranking to improve semantic precision and contextual coherence.

Through extensive internal experiments, we demonstrated that:

• The ResNet-101 + BioClinicalBert encoder pairing achieved the strongest image–text similarity alignment.

Graph-RAG enhanced retrieval diversity and factual coherence by leveraging graph-based contextual relations.

Among multiple fusion mechanisms, Cross-Attention Fusion and DeepSet Fusion achieved the highest modality alignment and embedding separability.

Collectively, these results validate the effectiveness of decomposing retrieval into clinically meaningful subtasks and integrating them through adaptive fusion and graph- based reasoning. The framework improves interpretability and retrieval quality without requiring additional large-scale model training.

For future work, we plan to:

1. Integrate domain-adaptive visual–language pretraining using specialized radiology and pathology corpora.

2. Extend RAG-PT (RAG-based Preference Tuning) to improve the generator’s factual grounding during medical report generation.

3. Explore knowledge graph–augmented retrieval for fine-grained clinical relationship reasoning (e.g., lesion–disease or symptom–finding linkage).

4. Incorporate temporal retrieval for sequential imaging studies to capture longitudinal disease progression.

By aligning retrieval, reasoning, and generation in a unified multimodal framework, MMedRAG++ establishes a foundation for reliable, interpretable, and domain-aware medical AI systems.

References

[1]  K. Guu, T. Lee, Z. Tung, P. Pasupat, and M. Chang, "REALM: Retrieval-Augmented Language Model Pre-Training," arXiv preprint arXiv: 2002. 08909, 2020.
In article      
 
[2]  P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," Advances in Neural Information Processing Systems (NeurIPS), 2020.
In article      
 
[3]  A. Zhang, T. R. R. He, R. L. Jiao, et al., "BioViL: Self-Supervised Vision-Language Pretraining for Biomedical Image–Text Retrieval," IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
In article      
 
[4]  Z. Wang, Y. Tang, and X. Wang, "MedCLIP: Contrastive Learning from Unpaired Medical Images and Text," arXiv preprint arXiv: 2210. 10163, 2022.
In article      View Article
 
[5]  M. Liu, Y. Yin, W. Chen, et al., "PMC-CLIP: Contrastive Learning from 1.1M PMC Image–Text Pairs for Biomedical Vision-Language Pre-training," arXiv preprint arXiv: 2303. 07240, 2023.
In article      
 
[6]  T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," arXiv preprint arXiv: 1301. 3781, 2013.
In article      
 
[7]  A. Radford, J. Kim, C. Hallacy, et al., "Learning Transferable Visual Models from Natural Language Supervision," Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.
In article      
 
[8]  E. Alsentzer, J. R. Murphy, W. Boag, et al., "Publicly Available Clinical BERT Embeddings," Proceedings of the 2nd Clinical Natural Language Processing Workshop (ClinicalNLP), 2019.
In article      View Article
 
[9]  J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," Proceedings of NAACL-HLT, 2019.
In article      
 
[10]  R. Mihalcea and P. Tarau, "TextRank: Bringing Order into Texts," Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2004.
In article      
 
[11]  L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web," Technical Report, Stanford InfoLab, 1999.
In article      
 
[12]  D. Hendrycks and K. Gimpel, "Gaussian Error Linear Units (GELUs)," arXiv preprint arXiv: 1606. 08415, 2016.
In article      
 
[13]  F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A Unified Embedding for Face Recognition and Clustering," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
In article      View Article
 
[14]  A. Vaswani, N. Shazeer, N. Parmar, et al., "Attention Is All You Need," Advances in Neural Information Processing Systems (NeurIPS), 2017.
In article      
 
[15]  C. Ouyang, Y. Xue, and D. Rueckert, "Self-Supervised Learning for Medical Image Analysis: A Survey," IEEE Transactions on Medical Imaging, vol. 42, no. 3, pp. 665–684, 2023.
In article      
 

Published with license by Science and Education Publishing, Copyright © 2026 Dr. A.Shilpa Gupta, M. Uday Reddy, K. Jayanth Kumar Reddy, P. Medha Goud, T. Harshitha Reddy and Aditi Jopat

Creative CommonsThis work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

Cite this article:

Normal Style
Dr. A.Shilpa Gupta, M. Uday Reddy, K. Jayanth Kumar Reddy, P. Medha Goud, T. Harshitha Reddy, Aditi Jopat. Improving Medical Multimodal Retrieval with Graph-Rag and Fusion Methods with Mmedrag++. Journal of Computer Sciences and Applications. Vol. 14, No. 1, 2026, pp 21-30. https://pubs.sciepub.com/jcsa/14/1/4
MLA Style
Gupta, Dr. A.Shilpa, et al. "Improving Medical Multimodal Retrieval with Graph-Rag and Fusion Methods with Mmedrag++." Journal of Computer Sciences and Applications 14.1 (2026): 21-30.
APA Style
Gupta, D. A. , Reddy, M. U. , Reddy, K. J. K. , Goud, P. M. , Reddy, T. H. , & Jopat, A. (2026). Improving Medical Multimodal Retrieval with Graph-Rag and Fusion Methods with Mmedrag++. Journal of Computer Sciences and Applications, 14(1), 21-30.
Chicago Style
Gupta, Dr. A.Shilpa, M. Uday Reddy, K. Jayanth Kumar Reddy, P. Medha Goud, T. Harshitha Reddy, and Aditi Jopat. "Improving Medical Multimodal Retrieval with Graph-Rag and Fusion Methods with Mmedrag++." Journal of Computer Sciences and Applications 14, no. 1 (2026): 21-30.
Share
[1]  K. Guu, T. Lee, Z. Tung, P. Pasupat, and M. Chang, "REALM: Retrieval-Augmented Language Model Pre-Training," arXiv preprint arXiv: 2002. 08909, 2020.
In article      
 
[2]  P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," Advances in Neural Information Processing Systems (NeurIPS), 2020.
In article      
 
[3]  A. Zhang, T. R. R. He, R. L. Jiao, et al., "BioViL: Self-Supervised Vision-Language Pretraining for Biomedical Image–Text Retrieval," IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
In article      
 
[4]  Z. Wang, Y. Tang, and X. Wang, "MedCLIP: Contrastive Learning from Unpaired Medical Images and Text," arXiv preprint arXiv: 2210. 10163, 2022.
In article      View Article
 
[5]  M. Liu, Y. Yin, W. Chen, et al., "PMC-CLIP: Contrastive Learning from 1.1M PMC Image–Text Pairs for Biomedical Vision-Language Pre-training," arXiv preprint arXiv: 2303. 07240, 2023.
In article      
 
[6]  T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," arXiv preprint arXiv: 1301. 3781, 2013.
In article      
 
[7]  A. Radford, J. Kim, C. Hallacy, et al., "Learning Transferable Visual Models from Natural Language Supervision," Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.
In article      
 
[8]  E. Alsentzer, J. R. Murphy, W. Boag, et al., "Publicly Available Clinical BERT Embeddings," Proceedings of the 2nd Clinical Natural Language Processing Workshop (ClinicalNLP), 2019.
In article      View Article
 
[9]  J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," Proceedings of NAACL-HLT, 2019.
In article      
 
[10]  R. Mihalcea and P. Tarau, "TextRank: Bringing Order into Texts," Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2004.
In article      
 
[11]  L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web," Technical Report, Stanford InfoLab, 1999.
In article      
 
[12]  D. Hendrycks and K. Gimpel, "Gaussian Error Linear Units (GELUs)," arXiv preprint arXiv: 1606. 08415, 2016.
In article      
 
[13]  F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A Unified Embedding for Face Recognition and Clustering," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
In article      View Article
 
[14]  A. Vaswani, N. Shazeer, N. Parmar, et al., "Attention Is All You Need," Advances in Neural Information Processing Systems (NeurIPS), 2017.
In article      
 
[15]  C. Ouyang, Y. Xue, and D. Rueckert, "Self-Supervised Learning for Medical Image Analysis: A Survey," IEEE Transactions on Medical Imaging, vol. 42, no. 3, pp. 665–684, 2023.
In article