1. Kovaleva O, Shivade C, Kashyap S, et al. Towards visual dialog for radiology//Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, Online: ACL, 2020: 60-69.
2. Dai W, Hou L, Shang L, et al. Enabling multimodal generation on CLIP via vision-language knowledge distillation. arXiv preprint, 2022, arXiv: 2203.06386.
3. Yang Z, He X, Gao J, et al. Stacked attention networks for image question answering//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Online: CVPR, 2016: 21-29.
4. Joshi V, Mitra P, Bose S. Multi-modal multi-head self-attention for medical VQA. Multimedia Tools and Applications, 2024, 83(14): 42585-42608.
5. Liu B, Zhan L M, Wu X M. Contrastive pre-training and representation distillation for medical visual question answering based on radiology images//24th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2021), Cham: Springer International Publishing, 2021: 210-220.
6. Chen Z, Du Y, Hu J, et al. Multi-modal masked autoencoders for medical vision-and-language pre-training//International Conference on Medical Image Computing and Computer-Assisted Intervention, Cham: Springer Nature Switzerland, 2022: 679-689.
7. Ossowski T, Hu J. Multimodal prompt retrieval for generative visual question answering. arXiv preprint, 2023, arXiv: 2306.17675.
8. Chen J, Yang D, Jiang Y, et al. MISS: a generative pre-training and fine-tuning approach for Med-VQA//International Conference on Artificial Neural Networks, Cham: Springer Nature Switzerland, 2024: 299-313.
9. Marino K, Chen X, Parikh D, et al. KRISP: integrating implicit and symbolic knowledge for open-domain knowledge-based VQA//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online: CVPR, 2021: 14111-14121.
10. Lin B, Chen Z, Li M, et al. Towards medical artificial general intelligence via knowledge-enhanced multimodal pretraining. arXiv preprint, 2023, arXiv: 2304.14204.
11. Zhan J, Dai J, Ye J, et al. AnyGPT: unified multimodal LLM with discrete sequence modeling. arXiv preprint, 2024, arXiv: 2402.12226.
12. Du Z, Qian Y, Liu X, et al. GLM: general language model pretraining with autoregressive blank infilling. arXiv preprint, 2021, arXiv: 2103.10360.
13. Gu T, Yang K, Liu D, et al. LaPA: latent prompt assist model for medical visual question answering//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online: CVPR, 2024: 4971-4980.
14. Liu J, Hu T, Zhang Y, et al. Parameter-efficient transfer learning for medical visual question answering. IEEE Transactions on Emerging Topics in Computational Intelligence, 2023, 8(4): 2816-2826.
15. Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 2020, 21(1): 5485-5551.
16. Kim J H, Jun J, Zhang B T. Bilinear attention networks. Advances in Neural Information Processing Systems, 2018, 31: 1-11.
17. Li J, Li D, Xiong C, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation//International Conference on Machine Learning, Stockholm: PMLR, 2022: 12888-12900.
18. Guo J, Li J, Li D, et al. From images to textual prompts: zero-shot visual question answering with frozen large language models//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online: CVPR, 2023: 10867-10877.
19. Lau J J, Gayen S, Ben Abacha A, et al. A dataset of clinically generated visual questions and answers about radiology images. Scientific Data, 2018, 5: 180251.
20. Liu B, Zhan L M, Xu L, et al. SLAKE: a semantically-labeled knowledge-enhanced dataset for medical visual question answering//2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), Nice: IEEE, 2021: 1650-1654.
21. Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint, 2020, arXiv: 2010.11929.
22. Eslami S, de Melo G, Meinel C. Does CLIP benefit visual question answering in the medical domain as much as it does in the general domain? arXiv preprint, 2021, arXiv: 2112.13906.
23. Nguyen B D, Do T T, Nguyen B X, et al. Overcoming data limitation in medical visual question answering//22nd International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2019), Cham: Springer International Publishing, 2019: 522-530.
24. Do T, Nguyen B X, Tjiputra E, et al. Multiple meta-model quantifying for medical visual question answering//24th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2021), Cham: Springer International Publishing, 2021: 64-74.
25. Gong H, Chen G, Mao M, et al. VQAMix: conditional triplet mixup for medical visual question answering. IEEE Transactions on Medical Imaging, 2022, 41(11): 3332-3343.
26. Pan H, He S, Zhang K, et al. AMAM: an attention-based multimodal alignment model for medical visual question answering. Knowledge-Based Systems, 2022, 255: 109763.
27. Lin W, Zhao Z, Zhang X, et al. PMC-CLIP: contrastive language-image pre-training using biomedical documents//International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2023), Cham: Springer Nature Switzerland, 2023: 525-536.
28. Selvaraju R R, Cogswell M, Das A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization//Proceedings of the IEEE International Conference on Computer Vision, Online: ICCV, 2017: 618-626.