Image-aware generative medical visual question answering based on image caption prompts

Journal of Biomedical Engineering

Authors:

WANG Rui, MENG Jiana, YU Yuhai, HAN Siwei, LI Xinghao

Computer Science and Engineering College, Dalian Minzu University, Dalian, Liaoning 116650, P. R. China

Corresponding author:

YU Yuhai, Email: yuyh@dlnu.edu.cn

Keywords:

Computer-aided diagnosis; Medical visual question answering; Image captions prompt; Image-aware generative model

DOI:

10.7507/1001-5515.202412040


Medical visual question answering (MVQA) plays a crucial role in computer-aided diagnosis and telemedicine. Because MVQA datasets are small and their annotation quality is uneven, most existing methods rely on additional datasets for pre-training and use a discriminative formulation that predicts answers from a predefined set of labels. This approach makes the model prone to overfitting in low-resource domains. To address these problems, we propose an image-aware generative MVQA method based on image caption prompts. First, we combine a dual visual feature extractor with a progressive bilinear attention interaction module to extract multi-level image features. Second, we propose an image caption prompt method that guides the model to better understand the image information. Finally, an image-aware generative model generates the answers. Experimental results show that the proposed method outperforms existing models on the MVQA task, achieving efficient visual feature extraction as well as flexible and accurate answer output at low computational cost in low-resource domains. The method is significant for achieving personalized precision medicine, reducing the medical burden, and improving the efficiency of medical diagnosis.
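As a rough illustration of the image caption prompt idea, the sketch below joins a caption and a question into a single text input for a text-to-text generative model. The function name `build_caption_prompt`, the `caption:` / `question:` / `answer:` template, and the sample strings are all assumptions made for this example; the abstract does not specify the prompt format the authors use, nor how the caption is produced.

```python
def build_caption_prompt(caption: str, question: str) -> str:
    """Join an image caption and a question into one text prompt.

    The "caption: ... question: ... answer:" layout is a hypothetical
    template chosen for illustration, not the format from the paper.
    """
    # Collapse runs of whitespace so the prompt is a single clean line.
    caption = " ".join(caption.split())
    question = " ".join(question.split())
    if not question:
        raise ValueError("question must not be empty")
    if not caption:
        # Fall back to a question-only prompt when no caption is available.
        return f"question: {question} answer:"
    return f"caption: {caption} question: {question} answer:"


if __name__ == "__main__":
    # Hypothetical caption and question, used only to show the call.
    print(build_caption_prompt(
        "an axial CT image of the chest",
        "which organ is shown?",
    ))
```

In a setup like this, the caption would come from a captioning model run on the image, and the resulting string would be tokenized and passed to the generative model together with the visual features.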

Citation: WANG Rui, MENG Jiana, YU Yuhai, HAN Siwei, LI Xinghao. Image-aware generative medical visual question answering based on image caption prompts. Journal of Biomedical Engineering, 2025, 42(3): 560-566, 574. doi: 10.7507/1001-5515.202412040


