We adapt this method from interactive segmentation [109, 70], although unlike interactive segmentation, whose aim is to eventually predict a valid mask after enough user input, our aim is to always predict a valid mask for any prompt even when the prompt is ambiguous.

Zero-shot transfer. Intuitively, our pre-training task endows the model with the ability to respond appropriately to any prompt at inference time, and thus downstream tasks can be solved by engineering appropriate prompts. For example, if one has a bounding box detector for cats, cat instance segmentation can be solved by providing the detector's box output as a prompt to our model. In general, a wide array of practical segmentation tasks can be cast as prompting.
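The cat example can be illustrated with a minimal sketch of the composition: a detector proposes boxes, and each box is passed as a prompt to a promptable segmenter. Everything here is an assumption made for illustration. `toy_cat_detector`, `toy_promptable_segmenter`, and `instance_segmentation` are hypothetical names, the image is a toy nested list, and both stand-ins use trivial thresholding rather than anything resembling the actual model or a real detector.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), upper bounds exclusive
Image = List[List[int]]          # toy grayscale image as nested lists
Mask = List[List[bool]]          # same height and width as the image


def toy_cat_detector(image: Image) -> List[Box]:
    """Hypothetical detector: one tight box around all nonzero pixels."""
    coords = [(x, y) for y, row in enumerate(image)
              for x, value in enumerate(row) if value > 0]
    if not coords:
        return []
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return [(min(xs), min(ys), max(xs) + 1, max(ys) + 1)]


def toy_promptable_segmenter(image: Image, box: Box) -> Mask:
    """Hypothetical promptable model: foreground = nonzero pixels inside the box prompt."""
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 and value > 0
             for x, value in enumerate(row)]
            for y, row in enumerate(image)]


def instance_segmentation(
    image: Image,
    detector: Callable[[Image], List[Box]],
    segmenter: Callable[[Image, Box], Mask],
) -> List[Mask]:
    """Cast instance segmentation as prompting: one mask per detected box."""
    return [segmenter(image, box) for box in detector(image)]


# Toy input: a small L-shaped "object" on an empty background.
image = [
    [0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0],
    [0, 9, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
masks = instance_segmentation(image, toy_cat_detector, toy_promptable_segmenter)
```

The point of the sketch is the interface rather than the stand-ins: the segmenter is never specialized to cats, so swapping in a different detector changes the downstream task without changing the segmenter.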