Feed-forward VQGAN-CLIP model, where the goal is to eliminate the need to optimize VQGAN's latent space separately for each input prompt. This is done by training a model that takes as input a text ...
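
The snippet above describes predicting VQGAN latents in a single forward pass rather than optimizing them per prompt. Below is a minimal sketch of that idea, assuming hypothetical shapes (a 512-d text embedding, a 16×16×256 latent grid) and stand-in modules for the frozen VQGAN decoder and CLIP image encoder; in practice these would be the pretrained networks, not the toy layers used here.

```python
import torch
import torch.nn as nn

class TextToLatent(nn.Module):
    """Predicts a VQGAN latent grid from a text embedding in one
    forward pass, replacing per-prompt latent optimization."""
    def __init__(self, text_dim=512, latent_channels=256, grid=16):
        super().__init__()
        self.latent_channels, self.grid = latent_channels, grid
        self.net = nn.Sequential(
            nn.Linear(text_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, latent_channels * grid * grid),
        )

    def forward(self, text_emb):
        z = self.net(text_emb)
        return z.view(-1, self.latent_channels, self.grid, self.grid)

def training_step(model, decoder, clip_image_enc, text_emb, optimizer):
    # Decode the predicted latent, then pull the decoded image's CLIP
    # embedding toward the prompt's CLIP embedding.
    z = model(text_emb)
    image = decoder(z)
    img_emb = clip_image_enc(image)
    loss = -torch.cosine_similarity(img_emb, text_emb, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Stand-ins for the frozen pretrained components (hypothetical shapes).
decoder = nn.Conv2d(256, 3, 3, padding=1)          # "VQGAN decoder"
clip_img = nn.Sequential(nn.AdaptiveAvgPool2d(1),  # "CLIP image encoder"
                         nn.Flatten(), nn.Linear(3, 512))

model = TextToLatent()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
text_emb = torch.randn(4, 512)  # would come from CLIP's text encoder
print(training_step(model, decoder, clip_img, text_emb, opt))
```

The key property is that, once trained, generation costs one forward pass per prompt instead of an optimization loop.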
Abstract: Open-vocabulary object detection (OVD), which aims to detect novel categories using detectors trained only on base categories, has made remarkable progress attributable to large-scale ...
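
The snippet cuts off before the method, but a common OVD pattern (an assumption here, not necessarily this paper's approach) is to replace the detector's fixed classification head with similarity scores against CLIP text embeddings of category names, so unseen category names can be scored at test time. A minimal sketch with random stand-in features:

```python
import torch
import torch.nn.functional as F

def open_vocab_classify(region_feats, class_text_embs, temperature=0.01):
    """Score region features against text embeddings of arbitrary
    category names: the open-vocabulary classifier pattern."""
    region = F.normalize(region_feats, dim=-1)
    text = F.normalize(class_text_embs, dim=-1)
    logits = region @ text.t() / temperature
    return logits.softmax(dim=-1)

# Base categories at training time, base + novel at test time.
feats = torch.randn(10, 512)   # 10 region proposals (stand-in features)
base = torch.randn(3, 512)     # e.g. "person", "car", "dog" embeddings
novel = torch.randn(2, 512)    # novel names added only at test time
probs = open_vocab_classify(feats, torch.cat([base, novel]))
print(probs.shape)             # torch.Size([10, 5])
```

Because the classifier is just a dot product with text embeddings, extending the vocabulary means appending rows, not retraining the head.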
Abstract: Despite the significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made to explore its potential for zero-shot video ...
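
A simple way to carry CLIP's zero-shot recipe from images to video (a naive baseline sketch, not necessarily this paper's method) is to encode frames with CLIP's image encoder, mean-pool over time, and match the pooled clip embedding against text embeddings of class names:

```python
import torch
import torch.nn.functional as F

def zero_shot_video_logits(frame_embs, class_text_embs):
    """Naive CLIP-to-video baseline: mean-pool per-frame CLIP image
    embeddings over time, then match against class-name embeddings."""
    video_emb = F.normalize(frame_embs.mean(dim=1), dim=-1)  # (B, D)
    text = F.normalize(class_text_embs, dim=-1)              # (C, D)
    return 100.0 * video_emb @ text.t()                      # (B, C)

frames = torch.randn(2, 8, 512)   # 2 clips x 8 frames (stand-in CLIP features)
classes = torch.randn(400, 512)   # e.g. Kinetics-400 class-prompt embeddings
logits = zero_shot_video_logits(frames, classes)
print(logits.argmax(dim=-1))      # predicted class per clip
```

Mean pooling ignores temporal order entirely, which is precisely the gap that video-specific adaptations of CLIP aim to close.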