Paper 14037-40
Parameter-efficient visible-to-thermal adaptation of YOLO-world via modality prompts and LoRA
29 April 2026 • 2:20 PM - 2:40 PM EDT | National Harbor 10
Abstract
Thermal (infrared) perception is essential for robust detection under low illumination and is widely used in automotive driving scenes to improve recognition at night and in other low-visibility conditions. However, modern open-vocabulary detectors are predominantly trained on visible imagery and often degrade under cross-modality transfer. This paper investigates whether two parameter-efficient adaptation strategies, input-space modality prompting (ModPrompt) and weight-space low-rank adaptation (LoRA), are complementary for adapting a pretrained open-vocabulary detector, YOLO-World, to thermal imagery. Experiments are conducted on the FLIR-IR aligned benchmark (IR only) with three categories: person, bicycle, and car. On this benchmark, ModPrompt and LoRA each improve over zero-shot transfer, while their combination yields the best accuracy (AP50 = 75.5, AP75 = 36.9, AP = 41.1). On our evaluation machine (Ubuntu 18.04.6 LTS, Intel Xeon Gold 6248R, NVIDIA A100-PCIE-40GB), LoRA provides the best efficiency among the tested PEFT settings (0.0850 s/train step, 0.0288 s/inference step), whereas LoRA + ModPrompt achieves the best detection accuracy at higher computational cost. We therefore view LoRA as the efficiency-oriented option and LoRA + ModPrompt as the accuracy-oriented option, and leave evaluation with faster encoder–decoder translators (e.g., lightweight U-Net variants) as future work.
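To make the weight-space side of the approach concrete, the following is a minimal sketch of a LoRA-augmented linear layer: the pretrained weight W stays frozen and only a low-rank update (alpha / r) · B A is trained. All shapes, names, and hyperparameter values here are illustrative assumptions, not details taken from the paper or from the YOLO-World codebase.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """y = x W^T + (alpha/r) * (x A^T) B^T, with W frozen and A, B trainable.

    Only r * (d_in + d_out) parameters are trained, versus d_in * d_out
    for full fine-tuning of W. Names and scaling follow common LoRA usage
    and are assumptions, not the paper's exact configuration.
    """
    r = A.shape[0]  # low-rank dimension
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init
x = rng.standard_normal((1, d_in))

# With B initialized to zero, the adapted layer exactly reproduces
# the pretrained layer, so training starts from the pretrained model.
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
```

The zero initialization of B is the standard LoRA choice: it guarantees the adapter is a no-op at the start of training, which is what makes it safe to bolt onto a pretrained detector such as YOLO-World.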
Presenter
Shotaro Miwa
Mitsubishi Electric Corp. (Japan)
I am currently a Chief Researcher at the National Institute of Advanced Industrial Science and Technology; previously, I worked at Mitsubishi Electric Corporation and was a Visiting Researcher at the University of Alberta. My research focuses on computer vision, machine learning, deep learning, and deep reinforcement learning.
I received my Ph.D. in computer vision from Osaka University after completing my undergraduate studies at the University of Tokyo. I have extensive experience across a range of fields, including surveillance systems, security, defense technologies, and robotics.
Currently, I am working on the development of foundation models and their applications in computer vision and robotics.