Enhancing Visual Spatial Reasoning through Multi-Task Learning

Jun 16, 2024 · 1 min read

Developed a vision-language model that incorporates spatial features such as depth maps, 3D coordinates, and edge maps through multi-task learning. The added spatial signals improve the model's spatial understanding, and it achieves state-of-the-art results on the Visual Spatial Reasoning (VSR) dataset.
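To make the idea concrete, here is a minimal sketch of two ingredients named in the summary: deriving an edge map from an image, and combining several task losses into one training objective. This page does not say which edge detector or loss weighting the project uses, so everything here is an assumption for illustration: the Sobel operator stands in for whatever OpenCV routine was actually used, and the function names `sobel_edge_map` and `multi_task_loss`, the task names, and the weights are all hypothetical.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_edge_map(gray):
    """Return the Sobel gradient magnitude of a 2D list of grayscale values.

    Border pixels are left at 0.0 because the 3x3 window does not fit there.
    (Assumed stand-in for the project's actual edge-mapping step.)
    """
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[j][i] * gray[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * gray[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = math.hypot(gx, gy)
    return out


def multi_task_loss(task_losses, weights):
    """Combine per-task losses into one scalar via a weighted sum.

    A weighted sum is one common multi-task objective; the weighting
    actually used in the project is not stated on this page.
    """
    return sum(weights[name] * loss for name, loss in task_losses.items())


# A tiny image with a vertical step edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
edges = sobel_edge_map(image)

# Hypothetical per-task losses and weights for one training step.
losses = {"vsr": 0.5, "depth": 0.2, "edge": 0.1}
weights = {"vsr": 1.0, "depth": 0.5, "edge": 0.5}
total = multi_task_loss(losses, weights)
```

In a multi-task setup like this, the auxiliary depth and edge objectives share the model's representation with the main VSR objective, so the weights control how strongly the spatial signals shape training.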

Last updated on Jul 8, 2024

LXMERT · OpenCV · Edge Mapping · Depth Mapping · Transformers

Authors

Chashi Mahiul Islam

Doctoral Student, Dept. of Computer Science
