Reliable Multimodal Models

Machine learning has advanced dramatically, narrowing the accuracy gap to humans on multimodal tasks such as visual question answering (VQA) and image captioning. These tasks are especially important for assisting people with visual impairments, for example by helping with daily routines or with interacting with visual content on the web. To provide such utility, the output of these tools must be trustworthy, since users may base decisions or actions on it. Improving accuracy is one part of building that trust, but models are imperfect and will inevitably produce some incorrect outputs. A reliable model needs to prevent these incorrect outputs from misleading users. However, improving model reliability beyond benchmark accuracy has been largely neglected in multimodal research, despite its importance for use in real-world settings.

As part of the BAIR Commons, we've so far explored reliability from two angles: (1) learning to say "I don't know" instead of giving a wrong answer, and (2) reducing hallucinations in image captioning.

  1. Our first work, Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly (ECCV 2022), explores this problem in the context of VQA. We promote a problem formulation in which abstention (i.e., predicting "I don't know") is preferable to providing an incorrect answer. We find that existing VQA models have poor out-of-the-box reliability, and that learned confidence estimates can improve it significantly; a minimal code sketch of this abstention policy appears after the list below. Please see our paper for more details.

    Paper
    Code 


  2.  Our next work, Simple Token-Level Confidence Improves Caption Correctness (ICCVW 2022), addresses the task of image-caption correspondence, that is, determining how well a given caption matches an image. We show that token-level confidences from a captioning model are better (e.g., finer-grained) estimates of caption correctness than the standard image-text matching scores from a pretrained model. We study two types of confidences: those taken directly from the captioning model (e.g., softmax scores), and those from a learned token confidence estimator that we propose. We use these estimates to reduce object hallucinations during beam search by simply rejecting captions with low-confidence objects, reaching state-of-the-art hallucination rates; a sketch of this confidence-based filtering also appears after the list below.

    Paper 
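
Below is a minimal sketch of the abstention idea from (1): answer only when the model's confidence clears a threshold, otherwise say "I don't know." The interface (`vqa_model`, `answer_vocab`) and the use of the raw softmax score as the confidence are illustrative assumptions, not the paper's actual models or its learned confidence estimator.

```python
import torch

ABSTAIN = "I don't know"

def answer_or_abstain(vqa_model, image, question, threshold=0.5):
    """Answer only when the model is confident enough; otherwise abstain.

    Assumes a hypothetical `vqa_model` mapping (image, question) to logits
    over a fixed answer vocabulary exposed as `vqa_model.answer_vocab`.
    """
    with torch.no_grad():
        logits = vqa_model(image, question)      # shape: (num_answers,)
        probs = torch.softmax(logits, dim=-1)
        confidence, idx = probs.max(dim=-1)

    if confidence.item() < threshold:
        return ABSTAIN, confidence.item()
    return vqa_model.answer_vocab[idx.item()], confidence.item()
```

Sweeping the threshold trades off coverage (how often the model answers) against risk (the error rate on the questions it does answer).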

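Below is a minimal sketch of the confidence-based filtering from (2): reject candidate captions whose object words were predicted with low token-level confidence. The data layout (`token_probs`, `object_indices`) and the example threshold are hypothetical, not the paper's implementation or its learned token confidence estimator.

```python
def min_object_confidence(token_probs, object_indices):
    """Lowest per-token softmax score among the caption's object words."""
    return min(token_probs[i] for i in object_indices)

def filter_low_confidence_captions(candidates, threshold=0.3):
    """Reject candidate captions that mention objects with low confidence.

    `candidates` is assumed to be a list of dicts with:
      - "caption": the decoded caption string
      - "token_probs": per-token softmax scores from the captioning model
      - "object_indices": positions of object words within the caption
    """
    kept = []
    for cand in candidates:
        if not cand["object_indices"]:
            kept.append(cand)  # no objects mentioned, nothing to reject
        elif min_object_confidence(cand["token_probs"],
                                   cand["object_indices"]) >= threshold:
            kept.append(cand)
    # Fall back to the original candidates if everything was rejected.
    return kept or candidates
```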


Contributors:

  • Suzie Petryk (UC Berkeley) - spetryk@berkeley.edu
  • Kate Saenko (Meta)
  • Prof. Joseph Gonzalez (UC Berkeley)
  • Prof. Trevor Darrell (UC Berkeley)

Previous contributors:

  • Marcus Rohrbach (Meta)
  • Anna Rohrbach (UC Berkeley)
  • Spencer Whitehead (Meta)
  • Vedaad Shakib (UC Berkeley)