The Unconventional Role of AI Hallucination in Research
The AIxiv Column and Research Background
The AIxiv column is a platform where Synced publishes academic and technical content. Over the past several years, it has received and reported on more than 2,000 pieces of work, covering top-tier laboratories at universities and enterprises worldwide and effectively facilitating academic exchange and dissemination. If you have outstanding work to share, you are welcome to submit it or contact us for coverage. Submission email addresses: [email protected] and [email protected]. The author of this article, Jian Hu, is a Ph.D. student at Queen Mary University of London, supervised by Professor Shaogang Gong. This article was completed under the guidance of Professor Gong and Professor Junchi Yan.
The Challenge and New Perspective in AI
In the field of artificial intelligence, the "hallucination" phenomenon of large pre-trained models (such as GPT and LLaVA) is usually regarded as a challenge that is difficult to overcome, especially in precise tasks like image segmentation. However, recent research, "Leveraging Hallucinations to Reduce Manual Prompt Dependency in Promptable Segmentation," published at NeurIPS 2024, presents an interesting view: these hallucinations can actually be turned into useful sources of information, reducing dependence on manual prompts.
Research Motivation and the Problem of Existing Methods
The Complexity of Task-Generic Promptable Segmentation
This research focuses on a challenging setting: task-generic promptable segmentation. In this setting, only a single generic prompt is provided to describe the entire task, without specifying the objects to be segmented in each image. For example, in camouflaged animal segmentation, only the task description "camouflaged animal" is given, without naming the specific animals in different images. The model must accomplish two things: first, infer from the image content which target objects should be segmented; second, accurately determine their positions and segmentation shapes.
Although large segmentation models like SAM can segment objects effectively when relatively accurate position descriptions are provided, obtaining such descriptions is not easy in complex tasks such as camouflaged object segmentation or medical image segmentation. Previous work such as GenSAM [1] proposed using multimodal large language models (MLLMs) like LLaVA/BLIP2 to infer sample-specific segmentation prompts to guide the segmentation process. However, this approach often runs into trouble in scenarios like camouflaged object segmentation because of object co-occurrence bias. For example, given an image containing only grassland, if lions usually co-occur with grasslands in the training data, LLaVA may be biased toward predicting a camouflaged lion in the grassland even when no lion is present. This prior-driven bias is especially problematic in camouflaged animal segmentation, as it may cause the model to report camouflaged animals that do not exist.
The Potential Value of Hallucination
But is this phenomenon necessarily bad? Not really. Lions do often appear in such grasslands, even if one is absent from a particular image. This so-called "hallucination" is actually empirical common sense that the model has acquired through large-scale training. Although the inference does not match the current example, it reflects regularities in the real world. Moreover, the common sense carried by hallucinations can support deeper analysis of the image and surface information that is related to the image but not obvious. If this information is verified, it can make downstream tasks more effective.
The Implementation of the ProMaC Framework
The Overall Structure of ProMaC
As shown in Figure 2, this research proposes a cyclically optimized framework, ProMaC, consisting of two parts: a multi-scale chain-of-thought prompting module that exploits hallucinations to infer sample-specific prompts from the task-generic prompt, and a mask semantic alignment module that aligns the generated masks with the task semantics. The first module infers relatively accurate sample-specific prompts to guide SAM's segmentation; the second aligns the resulting masks with the task semantics. The aligned masks are then fed back to the first module as prompts to verify the information obtained from hallucinations. Through this cyclic optimization, accurate masks are obtained gradually.
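The two-module cycle described above can be sketched as a simple loop. This is a toy illustration only: the function names and data structures here are hypothetical stand-ins, whereas the actual ProMaC implementation drives an MLLM (e.g., LLaVA) for prompting and SAM plus CLIP for segmentation and alignment.

```python
# Toy sketch of ProMaC's cyclic optimization. All functions are
# illustrative stand-ins, not the paper's actual implementation.

def propose_prompts(image, task_prompt, prev_mask):
    # Stand-in for multi-scale chain-of-thought prompting: pretend the
    # MLLM narrows its hypothesis once a verified mask is available.
    if prev_mask is None:
        return ["candidate object (hallucinated guess)"]
    return ["verified object"]

def segment_and_align(image, prompts, task_prompt):
    # Stand-in for SAM segmentation + CLIP-based semantic alignment:
    # return a "mask" labeled by the chosen prompt.
    return {"prompt": prompts[0], "pixels": {(1, 1), (1, 2)}}

def promac_loop(image, task_prompt, n_iters=3):
    mask = None  # no mask exists before the first iteration
    for _ in range(n_iters):
        # Module 1: infer sample-specific prompts, using the previous
        # mask to build contrast images for verification.
        prompts = propose_prompts(image, task_prompt, prev_mask=mask)
        # Module 2: segment with the prompts and align with the task.
        mask = segment_and_align(image, prompts, task_prompt)
    return mask

result = promac_loop(image="img.png", task_prompt="camouflaged animal")
print(result["prompt"])  # prints "verified object" after iteration 1
```

The key design point the loop captures is that each module's output becomes the other's input, so prompt quality and mask quality improve together across iterations.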
Multi-scale Chain-of-Thought Prompting
This module accomplishes two tasks: collecting as much task-related candidate knowledge as possible and generating accurate sample-specific prompts. To this end, the input image is cut into patches at different scales. The varying visibility of task-related objects across patches stimulates the MLLM's hallucinations, prompting the model to use prior knowledge to explore the connection between each patch and the task, and then to predict candidate bounding boxes and names for the target objects and the background. Only correct information is worth retaining, however, so a visual contrastive reasoning module is introduced. It first uses image-editing techniques to create contrast images: the mask regions identified in the previous iteration are removed, yielding pictures containing only task-irrelevant background. Then, by subtracting the prediction scores on the background-only image from those on the original image, the negative influence of object co-occurrence bias is cancelled out, confirming the truly valid sample-specific prompts.
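The subtraction step can be illustrated numerically. In this minimal sketch the class names and scores are fabricated for illustration; a real implementation would obtain them from the MLLM's outputs on the original and contrast images.

```python
# Minimal numeric sketch of visual contrastive reasoning: a score that
# stems from co-occurrence bias appears for both the original image and
# the background-only contrast image, so subtraction cancels it.
import numpy as np

classes = ["lion", "grasshopper", "background"]
# Hypothetical confidence that each object is present:
orig_scores = np.array([0.60, 0.70, 0.20])  # original image
bg_scores   = np.array([0.55, 0.10, 0.25])  # contrast image (object removed)

# The lion score stays almost as high on pure background: it comes from
# the co-occurrence prior, not from visual evidence in this image.
contrastive = orig_scores - bg_scores       # [0.05, 0.60, -0.05]
best = classes[int(np.argmax(contrastive))]
print(best)  # "grasshopper": supported by the image, not just the prior
```

Note how the lion's raw score (0.60) would have been competitive before subtraction; afterwards only the object with genuine image support survives as the sample-specific prompt.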
Mask Semantic Alignment
The resulting sample-specific prompts are sent to the mask generator to produce accurate masks. First, they are fed into the segmentation module (SAM) to generate a mask. However, SAM lacks semantic understanding: it mainly identifies the objects to segment from the given prompts and the surrounding textures. CLIP is therefore used to evaluate the semantic similarity between the target object and the masks generated on different image patches for the same prompt, helping to ensure the accuracy and relevance of the segmentation results. The normalized similarities serve as weights for synthesizing the final mask. This mask in turn helps generate better background images in the next iteration, guiding more effective prompt generation. In this way, hallucinations are fully exploited to extract and verify task-related information in the image, and better prompts yield better masks, forming a mutually reinforcing improvement process.
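The weighted synthesis can be sketched as follows. The 4×4 masks and similarity values below are fabricated for illustration; in the actual method the similarities would come from CLIP comparing each masked region against the target text.

```python
# Toy sketch of mask semantic alignment: per-patch masks are combined
# with weights given by normalized (hypothetical) CLIP similarities
# between each masked region and the target prompt.
import numpy as np

masks = np.stack([
    np.array([[1, 1, 0, 0]] * 4, dtype=float),  # mask from patch 1
    np.array([[0, 1, 1, 0]] * 4, dtype=float),  # mask from patch 2
])
clip_sims = np.array([0.8, 0.2])  # hypothetical similarity scores

weights = clip_sims / clip_sims.sum()           # normalize to sum to 1
final_mask = (weights[:, None, None] * masks).sum(axis=0)
print(final_mask[0])  # first row: [0.8, 1.0, 0.2, 0.0]
```

Pixels supported by semantically well-aligned patches receive high weight, so the final mask is dominated by regions CLIP agrees belong to the target object.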
Experimental Results and the New Perspective
This research conducts experiments on challenging tasks such as camouflaged animal segmentation and medical image segmentation. The ProMaC framework offers a new perspective: hallucinations are not necessarily harmful, and when properly exploited they can assist downstream tasks.