Instructing Prompt-to-Prompt Generation for Zero-Shot Learning

Man Liu^1,2, Huihui Bai^1,2^🖂, Feng Li³^🖂, Chunjie Zhang^1,2, Yunchao Wei^1,2,
Meng Wang³, Tat-Seng Chua⁴, Yao Zhao^1,2
¹ Institute of Information Science, Beijing Jiaotong University
² Beijing Key Laboratory of Advanced Information Science and Network Technology
³ Hefei University of Technology ⁴ National University of Singapore
{manliu, hhbai, cjzhang, yunchao.wei, yzhao}@bjtu.edu.cn,
fengli@hfut.edu.cn, eric.mengwang@gmail.com, chuats@comp.nus.edu.sg
🖂 Corresponding author.

Abstract

Zero-shot learning (ZSL) aims to explore semantic-visual interactions to discover comprehensive knowledge that transfers from seen categories to unseen categories. Recently, prompt engineering has emerged in ZSL, demonstrating impressive potential as it enables the zero-shot transfer of diverse visual concepts to downstream tasks. However, these methods still do not generalize well to broad unseen domains. A key reason is that the fixed adaptation of learnable prompts on seen domains makes them tend to over-emphasize the primary visual features observed during training. In this work, we propose a Prompt-to-Prompt generation methodology (P2P), which addresses this issue by further embracing the instruction-following technique to distill instructive visual prompts for comprehensive transferable knowledge discovery. The core of P2P is to mine semantic-related instruction from prompt-conditioned visual features and text instruction on modal-sharing semantic concepts and then inversely rectify the visual representations with the guidance of the learned instruction prompts. This compensates for visual details missing from the primary contexts and further eliminates the cross-modal disparity, improving generalization to unseen domains. Through extensive experimental results, we demonstrate the efficacy of P2P in achieving superior performance over state-of-the-art methods.

1 Introduction

Humans naturally interact with the world through various channels such as vision and language, enabling them to identify unseen objects based on prior knowledge. Leveraging this cognitive capability, zero-shot learning (ZSL) [40, 28] endeavors to classify objects from the unseen domain carrying knowledge from seen categories. One of the mainstream methods in ZSL is the embedding-based approach [29, 34, 57, 37, 58, 47, 22, 35, 8, 7, 32], which learns the semantic-visual alignment in a joint embedding space. Prior to such a cross-modal alignment, as depicted in Figure 1(a), embedding-based models typically commence with a visual encoder, such as Vision Transformer (ViT) [18] or ResNet [20] pre-trained on ImageNet [17], to initialize visual features. However, this can introduce distribution discrepancy when applied to downstream ZSL benchmarks as wider categories are unseen during pre-training, thereby resulting in cross-dataset bias that produces vague or even misleading visual representations [45].

To address this limitation, some methods [56, 64] propose to incorporate attention mechanisms into ZSL frameworks to improve visual representations derived from pre-trained encoders. For example, AREN [56] and SGMA [64] discover essential regions within the image under the guidance of attention. Inspired by the global capability of transformers [46], TransZero [7] and TransZero++ [6] devise an attribute-guided transformer to disentangle complex geometric relationships for feature augmentation. ZSLViT [10] introduces semantic-guided attention to establish semantic-visual correspondences through a progressive learning strategy. With the rise of the powerful zero-shot capability of prompt engineering [26, 16], current works [24, 63, 62, 51] investigate introducing prompts to efficiently adapt the pre-trained vision encoder to downstream datasets. As illustrated in Figure 1(b), these methods refine visual features by injecting learnable visual prompts into the pre-trained encoder, aiming to alleviate the cross-dataset bias. However, the learnable prompts often tend to overfit to the base seen classes, concentrating solely on the primary visual features essential for recognizing seen categories [62]. Consequently, they may lack the capacity to capture crucial visual details needed for a wider range of unseen classes. These missing details, complementary to the primary content, play a pivotal role in facilitating cross-domain semantic transfer.

Figure 1: The motivation of P2P. (a) Embedding-based methods suffer from vague visual representations due to cross-dataset bias in pre-trained visual encoders, resulting in suboptimal visual-semantic alignment for ZSL. (b) While prompt-based methods mitigate cross-dataset bias, they overfit to the base seen classes and focus exclusively on primary visual features, omitting crucial details necessary for unseen categories. (c) Our P2P rectifies the visual representation by compensating for missing details through instructive prompts, thus ensuring sufficient visual clues for knowledge transfer across domains.

To this end, we introduce a Prompt-to-Prompt (P2P) generation methodology (see Figure 1(c)) to facilitate comprehensive knowledge discovery for both seen and unseen domains by distilling instructive visual prompts. By leveraging the idea of instruction-following, P2P distills instructive prompts under the guidance of semantic-related instruction, which encompasses the complementary details overlooked by the learnable visual prompts. In this way, P2P attends to the missing visual information beyond the primary content. Acknowledging the inherent disparity between visual images and text instruction [44, 13], we devise a grounding transformer (G-Former) wherein prompt-conditioned visual features and textual features are both represented as combinations of modal-sharing tokens. This harmonizes the granularity of cross-modal information with shared semantic concepts, leading to enhanced multi-modal understanding and a narrower cross-modal gap. Through residual extraction for instructive prompts in the unified granularity space, P2P rectifies the visual representation by discerning guidance-conditioned details, thus enriching semantic features and improving generalization in ZSL.

In sum, our work makes the following contributions: 1) We propose a Prompt-to-Prompt generation methodology for zero-shot learning, dubbed P2P, which employs instruction-following and prompt learning to distill instructive prompts, improving comprehensive semantic knowledge transfer across seen and unseen domains. 2) P2P rectifies the visual features with the guidance of the learned instruction prompts in a cross-modal space, generating instructive prompts through residual extraction to compensate for the complementary details omitted by the learnable visual prompts. 3) Extensive experimental results show that our proposed method achieves better performance on ZSL benchmarks.

2 Related Work

Zero-Shot Learning. There are two principal methodologies for ZSL, i.e., generative-based ZSL and embedding-based ZSL. The former approaches utilize category attribute prototypes to synthesize visual features of additional unseen categories through generative adversarial nets [23, 48, 19], variational auto-encoders [25, 12, 14], or a combination of both [39, 11]. Although these methods compensate for the absence of the unseen domain during training, introducing extra data converts the ZSL problem into a fully supervised task. The embedding-based method represents the other mainstream branch for GZSL, achieved through the projection and alignment of information from visual and semantic domains. Early embedding-based works [1, 2, 53, 60] directly map the global visual and semantic features into a common space for category predictions. Global visual information, however, falls short of capturing the subtle but substantial differences between the categories, thus weakening discriminative representations. Recent efforts have focused on highlighting crucial visual regions through attention techniques. Some works [29, 64] crop and zoom in on significant local areas using coordinate positions obtained by attention mechanisms. Distinctive visual features are also emphasized by graph networks [57, 21] or attention guidance [56, 34, 37, 32]. Subsequent investigations [58, 49, 47, 22, 35, 8, 31, 38, 10] have adopted semantic-guided techniques that localize attribute-related regions via attribute descriptors. Despite these advancements, these methods still exhibit limitations in visual feature extraction within ZSL benchmarks, even with sophisticated manual designs for visual-semantic adaptation. Recent studies [63, 62, 51] leverage emerging prompt learning and showcase zero-shot capability. They use learnable prompts to efficiently adapt the pre-trained vision encoder to downstream ZSL datasets for refined visual representation. In this paper, we focus on leveraging the instruction-following technique to discern essential details omitted by the learnable prompts.

Prompt Learning. Prompt learning has emerged as a pivotal methodology within natural language processing (NLP), facilitating the adaptation of Large Language Models (LLMs) to various downstream tasks and scenarios [33]. By merely prepending language instructions to the input text as in-context information, LLMs can understand a target task. Following this idea, subsequent works [26, 16] delve into the capabilities of prompt-based LLMs in zero-shot and few-shot scenarios. The great successes of prompt learning in LLMs have recently sparked interest in computer vision [24, 3, 30, 50]. VPT [24] pioneered the integration of prompts with a vision encoder to enable better parameter-efficient adaptation of models by injecting learnable prompts rather than using specific manual prompts. Building upon these advancements, recent studies such as CoOp [63], CoCoOp [62], and SHIP [51] have achieved notable success in general vision-language comprehension, showcasing their ability to handle visual few-shot scenarios. While these prompt-based models have made significant strides in recognition tasks, their applicability to numerous real-world vision tasks remains challenging. Additionally, the introduction of learnable prompts might lead to overfitting, causing the model to focus solely on primary visual contents sufficient for seen classes while omitting crucial visual details necessary for recognizing unseen classes. In this study, we embrace the concept of prompt learning, employing instruction-specific guidance to distill instructive prompts. This approach allows us to flexibly uncover missing details that complement the primary visual features, thereby facilitating knowledge transfer for zero-shot learning.

3 Prompt-to-Prompt Generation

In this paper, we aim to enhance discriminative performance for both seen and unseen domains through the comprehensive discovery of transferable knowledge facilitated by instructive visual prompts. To achieve this goal, a Prompt-to-Prompt (P2P) generation methodology is proposed, which involves three steps. 1) P2P abstracts the primary visual features by introducing task-specific learnable prompts into the input space to alleviate the cross-dataset bias. 2) By injecting the instruction, P2P distills instructive prompts based on the semantic-related instruction from prompt-conditioned visual features and text instructions through the G-Former. 3) P2P rectifies the visual representations by the instructive prompts, addressing missing visual details that complement the primary content. Through the integration of learnable prompts and instructive prompts, P2P is expected to facilitate sufficient knowledge transfer for ZSL.

3.1 Problem Formulation

ZSL aims to recognize novel image categories within the unseen domain $\mathcal{D}^{u}$, leveraging the knowledge derived from the seen domain data $\mathcal{D}^{s}$. Here, $\mathcal{D}^{s}=\{(x,y,a_{y})\mid x\in\mathcal{X}^{s},y\in\mathcal{Y}^{s},a_{y}\in\mathcal{A}^{s}\}$ consists of the images $x$ in $\mathcal{X}^{s}$, their corresponding labels $y$, and the associated category prototypes $a_{y}$ from $\mathcal{A}^{s}$. Similarly, the unseen domain data is defined as $\mathcal{D}^{u}=\{(x^{u},u,a_{u})\}$, where $x^{u}\in\mathcal{X}^{u}$, $u\in\mathcal{Y}^{u}$, $a_{u}\in\mathcal{A}^{u}$, with $\mathcal{A}=\mathcal{A}^{s}\cup\mathcal{A}^{u}$. Note that the category space is disjoint between the seen and unseen domains, i.e., $\mathcal{Y}^{s}\cap\mathcal{Y}^{u}=\varnothing$, $\mathcal{Y}^{s}\cup\mathcal{Y}^{u}=\mathcal{Y}$. Utilizing the seen data $\mathcal{D}^{s}$ during the training stage, a fundamental embedding-based ZSL framework aims to acquire a mapping function that bridges the image space $\mathcal{X}^{s}$ with the attribute space $\mathcal{A}^{s}$. This mapping enables the model to generalize its knowledge to the unseen domain, effectively establishing connections between $\mathcal{X}^{u}$ and $\mathcal{A}^{u}$ for category inference. In scenarios where the testing phase includes both seen and unseen classes, the task of conventional ZSL extends to Generalized Zero-Shot Learning (GZSL), rendering it more applicable to real-world scenarios. Given an input image representation $f(x)$ during training, the optimization of a basic embedding-based framework is achieved through the visual-semantic alignment by the following equation:

$$\mathcal{L}_{cls}=-\sum_{x\in\mathcal{X}^{s}}\log\frac{\exp\left\langle\mathcal{M}(f(x)),a_{y}\right\rangle}{\sum_{\hat{y}\in\mathcal{Y}^{s}}\exp\left\langle\mathcal{M}(f(x)),a_{\hat{y}}\right\rangle} \tag{1}$$

where $\mathcal{M}(\cdot)$ is a mapping function, generally implemented via Global Average Pooling (GAP) followed by a linear projection, and $\langle\cdot,\cdot\rangle$ denotes the cosine similarity used for category decision.
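As a minimal sketch of Eq. (1) for a single image, the following NumPy snippet applies GAP, a linear projection, and a softmax cross-entropy over cosine scores. The names `cosine_scores`, `cls_loss`, and `W`, as well as the toy dimensions, are illustrative assumptions and not part of the original implementation; no temperature or scaling factor is applied because Eq. (1) does not specify one.

```python
import numpy as np

def cosine_scores(z, prototypes):
    """Cosine similarity between an embedded image z (A,) and every prototype (C, A)."""
    z = z / np.linalg.norm(z)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return p @ z  # (C,)

def cls_loss(tokens, W, prototypes, label):
    """Eq. (1) for one image: GAP, linear projection, cross-entropy on cosine scores."""
    pooled = tokens.mean(axis=0)            # GAP over the token axis: (N, D) -> (D,)
    z = pooled @ W                          # hypothetical linear projection: (D,) -> (A,)
    scores = cosine_scores(z, prototypes)   # one score per seen class
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[label]

# Toy example: 2 tokens, 2-dim features, 2 seen classes with orthogonal prototypes.
tokens = np.array([[1.0, 0.0], [3.0, 0.0]])
W = np.eye(2)
prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])
loss_correct = cls_loss(tokens, W, prototypes, label=0)
loss_wrong = cls_loss(tokens, W, prototypes, label=1)
```

In this toy case the pooled feature points exactly along the first prototype, so the loss for label 0 is lower than the loss for label 1.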

3.2 Visual Prompt Embedding

When a pre-trained Transformer model such as ViT [18] is used, disparities between ImageNet and ZSL benchmarks may engender a cross-dataset bias. This bias can result in suboptimal visual representations, ultimately leading to undesirable visual-semantic interactions within the ZSL domain. Drawing inspiration from prior prompt learning methodologies [24], we opt to generate visual features under prompts for subsequent ZSL image recognition tasks, rather than straightforwardly deriving the visual features. However, manual prompt engineering poses a non-trivial challenge, necessitating domain expertise and being extremely time-consuming due to the iterative nature of trial and error. In line with the approach outlined in [24], we therefore introduce a set $P$ of $T$ continuous learnable embeddings, referred to as visual prompts, into the input space alongside the image $x$. As shown in Figure 2, these visual prompts $P$ are prepended to the input sequence of the initial Transformer layer of ViT [18]. This process is formally expressed as:

$$\left[\bar{P},\bar{E}\right]=L\left(\left[P,E\right]\right) \tag{2}$$

Here, $E$ denotes the visual representation of patches extracted from the input image $x$, which are embedded into the latent space alongside positional encoding, represented as $E=\text{Embed}(x)$. The operation $\left[\cdot\right]$ signifies concatenation along the sequence length dimension. $L$ denotes the cascaded layers comprising ViT. For clarity, the class token has been omitted. Compared to hand-crafted prompts, the learnable prompts $P$ introduce only a small number of task-specific learnable parameters. $P$ adapts the original ViT for improved applicability to downstream ZSL datasets, resulting in an enhanced visual representation $\bar{E}$.
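The shape bookkeeping behind Eq. (2) can be sketched as follows. This is only an illustration: `encoder` is an identity stand-in for the ViT layers $L$, the prompts are randomly initialized, and the helper names are hypothetical. The token width of 768 is the ViT-Base hidden size, and the 196 patches follow from a $224\times 224$ input with $16\times 16$ patches.

```python
import numpy as np

def prepend_prompts(prompts, patch_tokens):
    """[P, E]: concatenate along the sequence-length axis."""
    return np.concatenate([prompts, patch_tokens], axis=0)

def split_output(tokens, num_prompts):
    """Split the encoder output back into prompt tokens and patch tokens."""
    return tokens[:num_prompts], tokens[num_prompts:]

T, D = 5, 768                  # prompt length and ViT-Base hidden size
N = (224 // 16) ** 2           # 14 x 14 = 196 patch tokens

rng = np.random.default_rng(0)
P = rng.normal(size=(T, D))    # learnable visual prompts (random here)
E = rng.normal(size=(N, D))    # patch embeddings, i.e. Embed(x)

encoder = lambda tokens: tokens        # hypothetical stand-in for the ViT layers L
x_in = prepend_prompts(P, E)           # (T + N, D)
P_bar, E_bar = split_output(encoder(x_in), T)
```

Only `P` would be trained in this setup, which is why the number of added task-specific parameters stays small ($T \times D$ values).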

A fundamental approach for ZSL involves projecting the acquired features $\left[\bar{P},\bar{E}\right]$ into a mapping space and aligning them with the category prototype through cosine similarity. However, these learnable prompts lead ZSL to focus solely on the primary visual contents sufficient for seen classes, often overlooking other crucial visual details essential for unseen classes. To address this limitation, we propose deriving instruction-specific guidance to discern the missing details according to the text instruction. As illustrated in Figure 2, the shared attributes serve as the instruction, allowing us to reconstruct the missing details within the instruction through subsequent cross-modal embedding and instructive prompt generation.

3.3 Cross-modal Embedding

Given the inherent differences in semantic levels and granularity between visual images and textual instructions [44, 13], the reconstructed features are likely to be sub-optimal without explicit alignment of visual and semantic concepts. Thus, we introduce a grounding transformer (G-Former) into the cross-modal embedding process to generate instruction-specific guidance instead of directly using the instruction. First, we employ GloVe [42] to obtain the text tokens of the input instructions, denoted as $S=\mathrm{GloVe}(\text{text})$. Subsequently, these text tokens $S$ are encoded through the G-Former, which grounds the input tokens $S$ to the modal-sharing tokens $R$. Specifically, the instruction-specific guidance is generated as follows:

$$\tilde{S}=\mathrm{softmax}\left(\mathrm{GMP}\left(Q_{S}\cdot K_{R}^{\mathsf{T}}\right)\right)\cdot V_{R} \tag{3}$$

where $Q_{S}$ is the query learned from $S$, and $K_{R}$ and $V_{R}$ are the key and value learned from $R$. $\mathrm{GMP}$ denotes the max-pooling operation applied to the patch-level relevance between $S$ and $R$. This operation helps eliminate the influence of irrelevant noisy patches that exhibit low relevance to the modal-sharing tokens, thereby refining a compact instruction-specific guidance $\tilde{S}$.

Similarly, the G-Former encodes visual tokens into the modal-sharing space by utilizing $\bar{E}$ as the query and producing the corresponding visual tokens through the following process:

$$\tilde{E}=\mathrm{softmax}\left(\mathrm{GMP}\left(Q_{\bar{E}}\cdot K_{R}^{\mathsf{T}}\right)\right)\cdot V_{R} \tag{4}$$

As a result, both the image and text tokens are represented as combinations of the common modal-sharing token $R$ . This explicit alignment ensures that the granularities of cross-modal information $\tilde{S}$ and $\tilde{E}$ are harmonized, thereby aiding the subsequent residual extraction process by narrowing the semantic-visual gap.
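A rough NumPy sketch of the grounding step in Eqs. (3) and (4) is given below. Several choices are assumptions made only for illustration: the max-pooling is taken over the input-token axis so that one relevance value remains per modal-sharing token, the queries, keys, and values come from plain linear projections (`Wq`, `Wk`, `Wv`), the key and value projections are shared across modalities, and all dimensions are toy values. The text does not specify these details.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())   # subtract the max for numerical stability
    return e / e.sum()

def ground(tokens, R, Wq, Wk, Wv):
    """Represent `tokens` as a softmax-weighted combination of modal-sharing tokens R."""
    Q = tokens @ Wq                    # (n, d) queries from text or visual tokens
    K = R @ Wk                         # (m, d) keys from modal-sharing tokens
    V = R @ Wv                         # (m, d) values from modal-sharing tokens
    relevance = Q @ K.T                # (n, m) token-level relevance
    pooled = relevance.max(axis=0)     # assumed GMP: max over input tokens -> (m,)
    return softmax(pooled) @ V         # (d,) combination of the rows of V

rng = np.random.default_rng(0)
d, m = 5, 3                            # hypothetical shared width and number of tokens in R
S = rng.normal(size=(4, 6))            # 4 text tokens (toy width 6)
E_bar = rng.normal(size=(9, 8))        # 9 visual tokens (toy width 8)
R = rng.normal(size=(m, d))
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))
S_tilde = ground(S, R, rng.normal(size=(6, d)), Wk, Wv)
E_tilde = ground(E_bar, R, rng.normal(size=(8, d)), Wk, Wv)
```

Because both outputs are convex combinations of the same value vectors, text and image end up in one space with matching granularity, which is the property the residual extraction step relies on.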

3.4 Instructive Prompt Generation

Upon obtaining the instruction-specific guidance $\tilde{S}$, we proceed to rectify the visual representations by capturing residual features that compensate for missing visual details in the cross-modal space. Instead of relying on the commonly used attention mechanism for feature inference and recovery, we adopt a simpler approach inspired by LLaVA [30], which utilizes a straightforward linear layer to establish a connection between image features and the word embedding space. Here, we concatenate the guidance $\tilde{S}$ with the visual features $\tilde{E}$ and feed them into a zero-linear layer, which acts as a filter that selects details under the guidance of $\tilde{S}$. This process can be formulated as:

$$H=\mathrm{ZLinear}\left(\left[\tilde{S},\tilde{E}\right]\right) \tag{5}$$

where $\mathrm{ZLinear}$ represents the zero linear layer, whose weights are initialized to zeros [61], effectively eliminating harmful noise at the beginning of the training process. To ensure that the residual feature $H$ is a meaningful representation with a semantic concept consistent with the category prototypes, we align $H$ with its category prototype $a_{y}$ using a consistency loss $\mathcal{L}_{cons}$:

$$\mathcal{L}_{cons}=\left\|\mathrm{MLP}(a_{y})-H\right\| \tag{6}$$

where MLP refers to a multi-layer perceptron that projects $a_{y}$ into the cross-modal embedding space for better alignment. Since the residual visual extraction provides informative representations related to the instruction, we enhance the visual representation by supplying the missing features to achieve complete semantic enrichment, thus forming instructive prompts. Specifically, we compose the instructive prompts by adding the residual visual details $H$ to each original prompt token:

$$\tilde{P}=\left[\bar{p}_{1}+H,\bar{p}_{2}+H,\ldots,\bar{p}_{T}+H\right] \tag{7}$$

The instructive prompts $\tilde{P}$ attend to the missing details for the input image $x$ under the guidance $\tilde{S}$, which encompasses the complementary details overlooked by the learnable visual prompts. Subsequently, we merge the instructive prompts $\tilde{P}$ with the original primary visual tokens through concatenation to obtain the final rectified visual representation $f(x)$ for $x$:

$$f(x)=\left[\tilde{P},\bar{E}\right] \tag{8}$$

Together with the original primary output, P2P is expected to provide an improved comprehensive transferable knowledge discovery.
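Eqs. (5) to (8) can be sketched in a few lines of NumPy. The snippet is a simplified illustration under stated assumptions: $\tilde{S}$ and $\tilde{E}$ are single vectors concatenated along the feature axis, the zero-linear layer starts with both weight and bias at zero, the norm in Eq. (6) is taken to be the L2 norm, and `ZeroLinear`, `rectify`, `consistency_loss`, and all dimensions are hypothetical.

```python
import numpy as np

class ZeroLinear:
    """Linear layer initialized to zero, so its output is exactly 0 before training."""
    def __init__(self, d_in, d_out):
        self.W = np.zeros((d_in, d_out))
        self.b = np.zeros(d_out)

    def __call__(self, x):
        return x @ self.W + self.b

def rectify(P_bar, E_bar, S_tilde, E_tilde, zlinear):
    H = zlinear(np.concatenate([S_tilde, E_tilde]))   # Eq. (5): residual details, (D,)
    P_tilde = P_bar + H                               # Eq. (7): H is added to every prompt token
    f_x = np.concatenate([P_tilde, E_bar], axis=0)    # Eq. (8): [P_tilde, E_bar]
    return f_x, H

def consistency_loss(projected_prototype, H):
    """Eq. (6) with an assumed L2 norm; `projected_prototype` plays the role of MLP(a_y)."""
    return np.linalg.norm(projected_prototype - H)

rng = np.random.default_rng(0)
T, N, D, d = 5, 196, 8, 4                             # toy widths D and d
P_bar, E_bar = rng.normal(size=(T, D)), rng.normal(size=(N, D))
S_tilde, E_tilde = rng.normal(size=d), rng.normal(size=d)
f_x, H = rectify(P_bar, E_bar, S_tilde, E_tilde, ZeroLinear(2 * d, D))
```

With the zero initialization, `f_x` equals the prompt-conditioned output $[\bar{P},\bar{E}]$ at the first training step, so the residual branch only starts to contribute once its weights move away from zero.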

3.5 Model Optimization and Inference

Optimization. The overall objective loss function of P2P is formulated as follows:

$$\mathcal{L}=\mathcal{L}_{cls}+\lambda_{cons}\mathcal{L}_{cons}+\lambda_{deb}\mathcal{L}_{deb} \tag{9}$$

where $\lambda_{cons}$ and $\lambda_{deb}$ serve as the hyper-parameters controlling the weights of semantic consistency loss $\mathcal{L}_{cons}$ and the debiasing loss $\mathcal{L}_{deb}$ , respectively. As in Eq. 9, we also apply a debiasing loss $\mathcal{L}_{deb}$ to mitigate the seen-unseen bias following [31, 32]. It aims to balance the score dependency in the seen-unseen domain, pursuing the distribution consistency concerning both mean and variance:

$$\mathcal{L}_{deb}=\|\alpha_{s}-\alpha_{u}\|_{2}^{2}+\|\beta_{s}-\beta_{u}\|_{2}^{2} \tag{10}$$

where $\alpha_{s}$ and $\beta_{s}$ represent the mean and variance, respectively, of the seen prediction score $\left\langle{\mathcal{M}(f(x)),{a_{\hat{y}(\hat{y}\in\mathcal{Y}^{s})}}}\right\rangle$ . Similarly, $\alpha_{u}$ and $\beta_{u}$ denote the mean and variance, respectively, of the unseen prediction score $\left\langle{\mathcal{M}(f(x)),{a_{\hat{y}(\hat{y}\in\mathcal{Y}^{u})}}}\right\rangle$ .
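The debiasing term in Eq. (10) compares the first two moments of the scores assigned to seen and unseen prototypes. The sketch below assumes that the statistics are pooled over the whole batch and over all classes of each group and that the population variance is used; the function name and the toy scores are illustrative only.

```python
import numpy as np

def debias_loss(scores, seen_mask):
    """Eq. (10): squared gaps between mean and variance of seen and unseen scores.

    scores    : (B, C) cosine scores of B training images against all C prototypes
    seen_mask : (C,) boolean array, True for seen classes
    """
    seen, unseen = scores[:, seen_mask], scores[:, ~seen_mask]
    mean_gap = (seen.mean() - unseen.mean()) ** 2
    var_gap = (seen.var() - unseen.var()) ** 2
    return mean_gap + var_gap

seen_mask = np.array([True, True, False, False])
balanced = debias_loss(np.array([[1.0, 3.0, 1.0, 3.0]]), seen_mask)   # identical statistics
skewed = debias_loss(np.array([[1.0, 3.0, 2.0, 2.0]]), seen_mask)     # same mean, different variance
```

The loss vanishes when both groups of scores share the same mean and variance, and grows as the seen scores become more spread out or more confident than the unseen ones.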

Inference. During training, the model merely learns about the knowledge of seen categories, while unseen categories are inferred at testing time:

$$\tilde{y}=\arg\max_{\hat{y}\in\mathcal{Y}^{u}}\left(\left\langle\mathcal{M}(f(x)),a_{\hat{y}}\right\rangle\right) \tag{11}$$

In the GZSL setting, both seen and unseen categories are encompassed. To jointly define the category, calibrated stacking (CS) [5] is applied:

$$\tilde{y}=\arg\max_{\hat{y}\in\mathcal{Y}}\left(\left\langle\mathcal{M}(f(x)),a_{\hat{y}}\right\rangle-\gamma\mathbb{I}_{\left[\hat{y}\in\mathcal{Y}^{s}\right]}\right) \tag{12}$$

Here, $\mathbb{I}_{\left[\hat{y}\in\mathcal{Y}^{s}\right]}$ is an indicator function that equals 1 when $\hat{y}\in\mathcal{Y}^{s}$ and 0 otherwise. The calibration factor $\gamma$ trades off the calibration degree on seen categories and thus determines the category $\tilde{y}$ of an input visual sample $x$.
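The two decision rules in Eqs. (11) and (12) differ only in the candidate label set and in the penalty on seen classes. A small sketch follows, with hypothetical function names and toy scores:

```python
import numpy as np

def predict_zsl(scores, unseen_mask):
    """Eq. (11): arg max restricted to unseen classes."""
    return int(np.argmax(np.where(unseen_mask, scores, -np.inf)))

def predict_gzsl(scores, seen_mask, gamma):
    """Eq. (12): calibrated stacking subtracts gamma from every seen-class score."""
    return int(np.argmax(scores - gamma * seen_mask.astype(float)))

scores = np.array([0.90, 0.85, 0.20])       # cosine scores against three prototypes
seen_mask = np.array([True, False, False])  # class 0 is seen, classes 1 and 2 are unseen

pred_uncalibrated = predict_gzsl(scores, seen_mask, gamma=0.0)
pred_calibrated = predict_gzsl(scores, seen_mask, gamma=0.1)
pred_zsl = predict_zsl(scores, ~seen_mask)
```

In this toy case the uncalibrated rule picks the seen class 0, while a calibration factor of 0.1 lowers its score to 0.80 and the prediction switches to the unseen class 1, which illustrates how $\gamma$ counteracts the bias toward seen categories.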

4 Experiments

4.1 Experimental Settings

Datasets. We assess the performance of our P2P across three standard benchmark datasets: Caltech-UCSD Birds-200-2011 (CUB) [52], SUN Attribute (SUN) [41], and Animals with Attributes2 (AwA2) [54]. The categorization into seen and unseen categories follows the Proposed Split (PS) [54]. The CUB dataset consists of 11,788 images of 200 bird classes, with a split of 150/50 for seen/unseen classes, and is characterized by 312 attributes. SUN, a large scene dataset, contains 14,340 images spanning 717 classes, divided into seen/unseen classes at 645/72, and annotated with 102 attributes. AwA2, although coarser with only 50 animal classes (seen/unseen classes = 40/10), contains a total of 37,322 images and is described by 85 attributes.

Evaluation Metrics. We assess top-1 accuracy in both the ZSL and GZSL settings. In the ZSL scenario, we exclusively evaluate the accuracy on unseen classes, denoted as $acc$. In the GZSL setting, following the approach outlined in [54], we employ the harmonic mean ($H=2\times S\times U/(S+U)$) to evaluate the performance of our framework, where $S$ and $U$ represent the top-1 accuracy of the seen and unseen classes, respectively.
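As a quick worked example of the harmonic mean, the snippet below recomputes $H$ from the $U$ and $S$ values reported for P2P in Table 1. The function name is illustrative.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """H = 2 * S * U / (S + U); defined as 0 when both accuracies are 0."""
    total = seen_acc + unseen_acc
    return 2.0 * seen_acc * unseen_acc / total if total > 0 else 0.0

# (S, U) pairs reported for P2P in Table 1
h_cub = round(harmonic_mean(76.4, 73.1), 1)   # 74.7
h_sun = round(harmonic_mean(45.2, 58.6), 1)   # 51.0
h_awa2 = round(harmonic_mean(80.1, 70.3), 1)  # 74.9
```

Because the harmonic mean is dominated by the smaller of the two accuracies, a model cannot reach a high $H$ by favoring seen classes alone.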

Implementation Details. Unlike previous approaches in ZSL that employ ResNet [20] models as visual backbones, we opt for the ViT-Base model [18] as our visual feature extractor. We maintain an input image resolution of $224\times 224$ , with a patch size of $16\times 16$ . Our framework is implemented using PyTorch and executed on an Nvidia GeForce RTX 3090 GPU.

Table 1: Results (%) of state-of-the-art ZSL and GZSL models on CUB, SUN and AwA2, including generative and embedding-based methods. The best and second-best results are marked in red and blue, respectively. The symbol “*” denotes ViT-based methods. Results indicated with “**” are taken from [10].

Methods	Venue	CUB				SUN				AwA2
		ZSL	GZSL			ZSL	GZSL			ZSL	GZSL
		$acc$	$U$	$S$	$H$	$acc$	$U$	$S$	$H$	$acc$	$U$	$S$	$H$
Generative-based Methods
f-VAEGAN [55]	CVPR’19	61.0	48.4	60.1	53.6	64.7	45.1	38.0	41.3	71.1	57.6	70.6	63.5
OCD-CVAE [25]	CVPR’20	–	44.8	59.9	51.3	–	44.8	42.9	43.8	–	59.5	73.4	65.7
Composer [23]	NeurIPS’20	69.4	56.4	63.8	59.9	62.6	55.1	22.0	31.4	71.5	62.1	77.3	68.8
TF-VAEGAN [39]	ECCV’20	64.9	52.8	64.7	58.1	66.0	45.6	40.7	43.0	72.2	59.8	75.1	66.6
GCM-CF [59]	CVPR’21	–	61.0	59.7	60.3	–	47.9	37.8	42.2	–	60.4	75.1	67.0
SDGZSL [14]	ICCV’21	75.5	59.9	66.4	63.0	–	–	–	–	72.1	64.6	73.6	68.8
CE-GZSL [19]	CVPR’21	77.5	63.9	66.8	65.3	63.3	48.8	38.6	43.1	70.4	63.1	78.6	70.0
ICCE [27]	CVPR’22	78.4	67.3	65.5	66.4	–	–	–	–	72.7	65.3	82.3	72.8
FREE [11]	ICCV’21	–	55.7	59.9	57.7	–	47.4	37.2	41.7	–	60.4	75.4	67.1
HSVA [12]	NeurIPS’21	62.8	52.7	58.3	55.3	63.8	48.6	39.0	43.3	–	59.3	76.6	66.8
LBP [36]	TPAMI’21	61.9	42.7	71.6	53.5	63.2	39.2	36.9	38.1	–	–	–	–
FREE+ESZSL [4]	ICLR’22	–	51.6	60.4	55.7	–	48.2	36.5	41.5	–	51.3	78.0	61.8
f-VAEGAN+DSP [9]	ICML’23	62.8	62.5	73.1	67.4	68.6	57.7	41.3	48.1	71.6	63.7	88.8	74.2
SHIP** [51]	ICCV’23	–	55.3	58.9	57.1	–	–	–	–	–	–	–	–
Embedding-based Methods
SGMA [64]	NeurIPS’19	71.0	36.7	71.3	48.5	–	–	–	–	68.8	37.6	87.1	52.5
AREN [56]	CVPR’19	71.8	38.9	78.7	52.1	60.6	19.0	38.8	25.5	67.9	15.6	92.9	26.7
LFGAA [34]	ICCV’19	67.6	36.2	80.9	50.0	61.5	18.5	40.0	25.3	68.1	27.0	93.4	41.9
APN [58]	NeurIPS’20	72.0	65.3	69.3	67.2	61.6	41.9	34.0	37.6	68.4	57.1	72.4	63.9
DAZLE [22]	CVPR’20	66.0	56.7	59.6	58.1	59.4	52.3	24.3	33.2	67.9	60.3	75.7	67.1
DVBE [37]	CVPR’20	–	53.2	60.2	56.5	–	45.0	37.2	40.7	–	63.6	70.8	67.0
GEM-ZSL [35]	CVPR’21	77.8	64.8	77.1	70.4	62.8	38.1	35.7	36.9	67.3	64.8	77.5	70.6
DPPN [49]	NeurIPS’21	77.8	70.2	77.1	73.5	61.5	47.9	35.8	41.0	73.3	63.1	86.8	73.1
CLIP [43]	ICML’21	–	55.2	54.8	55.0	–	–	–	–	–	–	–	–
CoOP** [63]	IJCV’22	–	49.2	63.8	55.6	–	–	–	–	–	–	–	–
MSDN [8]	CVPR’22	76.1	68.7	67.5	68.1	65.8	52.2	34.2	41.3	70.1	62.0	74.5	67.7
TransZero [7]	AAAI’22	76.8	69.3	68.3	68.8	65.6	52.6	33.4	40.8	70.1	61.3	82.3	70.2
TransZero++ [6]	TPAMI’22	78.3	67.5	73.6	70.4	67.6	48.6	37.8	42.5	72.6	64.6	82.7	72.5
DUET* [15]	AAAI’23	72.3	62.9	72.8	67.5	64.4	45.7	45.8	45.8	69.9	63.7	84.7	72.7
I2MVFormer* [38]	CVPR’23	42.1	32.4	63.1	42.8	–	–	–	–	73.6	66.6	82.9	73.8
ZSLViT* [10]	CVPR’24	78.9	69.4	78.2	73.6	68.3	45.9	48.3	47.3	70.2	66.1	84.6	74.2
P2P* (Ours)	–	80.3	73.1	76.4	74.7	70.4	58.6	45.2	51.0	75.2	70.3	80.1	74.9

4.2 Comparison with State-of-the-Art Methods

We evaluate the proposed approach and recently proposed state-of-the-art methods, of which the results are given in Table 1.

Results of Conventional Zero-Shot Learning. For conventional ZSL, our method surpasses the best one by 1.4%, 1.8%, and 1.9% for $acc$ on the CUB, SUN and AwA2 datasets, respectively. The results demonstrate that learning instructive prompts based on pre-trained visual encoders can effectively enhance visual representations and improve knowledge transfer to the unseen domain, achieving state-of-the-art $acc$ performance of 80.3%, 70.4%, and 75.2% on CUB, SUN, and AwA2, respectively. Compared to some recent methods [35, 56, 7, 6, 10], which refine visual features obtained from pre-trained visual encoders using attention mechanisms, P2P achieves substantial improvements in $acc$, with gains of at least 1.4%, 2.1%, and 2.6% on CUB, SUN, and AwA2, respectively. This reveals that, with the assistance of prompt learning and instruction-following, P2P can improve category discriminative power for the unseen domain.

Results of Generalized Zero-Shot Learning. Table 1 also reports the results of various methods in the GZSL setting. The results show that P2P achieves state-of-the-art $H$ performance across all datasets, i.e., 74.7%, 51.0%, and 74.9% on CUB, SUN, and AwA2, respectively. Notably, our P2P also significantly outperforms other prompt-based methods (CLIP [43], CoOp [63], and SHIP [51]) by substantial margins of 19.7%, 18.9%, and 17.4% on the CUB dataset. These results demonstrate that exploring missing details complementary to prompt-conditioned primary content can effectively enhance visual representations and cross-domain transferability.

4.3 Ablation Study

Table 2: Ablation study of P2P under the ZSL and GZSL setting on CUB, SUN and AwA2 datasets, respectively.

Methods	CUB				SUN				AwA2
	ZSL	GZSL			ZSL	GZSL			ZSL	GZSL
	$acc$	$U$	$S$	$H$	$acc$	$U$	$S$	$H$	$acc$	$U$	$S$	$H$
P2P w/o prompts $P$	70.2	67.0	69.1	68.0	64.5	53.8	25.8	34.8	70.7	61.0	88.2	72.1
P2P w/o instruction $S$	75.8	72.3	72.7	72.5	68.0	58.1	38.5	46.2	73.2	65.5	85.1	74.0
P2P w/o G-Former	78.8	75.6	72.1	73.8	68.8	43.6	54.9	48.6	73.7	66.8	81.8	73.5
P2P(full)	80.3	73.1	76.4	74.7	70.4	58.6	45.2	51.0	75.2	70.3	80.1	74.9

Effect of components in P2P. P2P aims to learn comprehensive representations by integrating the primary contents learned by learnable prompts with the missing details discovered under the guidance of the instruction. We therefore evaluate the key components, i.e., the learnable prompts $P$, the instruction $S$, and the G-Former, as shown in Table 2. When prompts are removed, the model can be regarded as the baseline in which we directly use the original visual features extracted from vanilla ViT and project them into the semantic space for category inference. P2P outperforms this baseline by large $acc$ / $H$ margins of 10.1%/6.7%, 5.9%/16.2%, and 4.5%/2.8% on the CUB, SUN, and AwA2 datasets, respectively. P2P performs significantly worse than the full model when no instruction is employed, i.e., $acc$ / $H$ drop by 4.5%/2.2% on CUB, 2.4%/4.8% on SUN, and 2.0%/0.9% on AwA2. Additionally, G-Former enhances the residual extraction by providing a unified visual-semantic space, resulting in $acc$ / $H$ improvements of 1.5%/0.9%, 1.6%/2.4%, and 1.5%/1.4% on the CUB, SUN, and AwA2 datasets, respectively.

Effect of $\lambda_{cons}$ and $\lambda_{deb}$. $\lambda_{cons}$ and $\lambda_{deb}$ are the hyper-parameters that weight $\mathcal{L}_{cons}$ and $\mathcal{L}_{deb}$, respectively. Here, we evaluate the effect of $\lambda_{cons}$ and $\lambda_{deb}$ as shown in Figure 3. We sweep $\lambda_{cons}$ from 0.0 to 2.0. Once the semantic consistency loss $\mathcal{L}_{cons}$ is introduced into P2P, $H$ increases across all datasets, and the best $H$ is obtained when $\lambda_{cons}$ = 1.0. This indicates the effectiveness of semantic consistency for the residual visual representation, which aligns closely with the category prototype for recognition. When $\lambda_{cons}$ > 1.0, $H$ starts to drop. Thus, we set $\lambda_{cons}$ = 1.0 for optimal results. Additionally, when we gradually increase the value of $\lambda_{deb}$, more attention is paid to pursuing a consistent distribution between seen and unseen predictions, resulting in improved unseen accuracy $U$ and ultimately better $H$ performance.

Effect of $T$. $T$ is the length of the learnable visual prompts $P$. Here, we sweep the prompt length $T\in\{1,3,5,7,9\}$ to evaluate the effect of the prompts $P$ on recognition performance. Figure 4 shows the values of $acc$ and $H$ as $T$ varies. We observe that the best performance is achieved when $T$ is around 5. Notably, even with only one prompt, P2P still significantly outperforms the baseline (first row of Table 2). Thus, we set $T$ = 5 for CUB, SUN and AwA2.

Qualitative Results. Figure 5 presents the t-SNE visualization of visual features for seen and unseen classes on CUB and AwA2, learned by P2P w/o prompts $P$, P2P w/o instruction $S$, and our full P2P. The visual features extracted by P2P w/o $P$ lack distinctiveness within certain classes, while those obtained with the assistance of the learnable prompts form a more compact and discriminative distribution. This observation suggests that prompts play a crucial role in adapting the pre-trained ViT to downstream ZSL tasks by generating high-quality features for seen classes. Furthermore, compared to P2P w/o $S$, the visual features learned by our full P2P are more distinguishable, with higher inter-class discrepancy and clearer decision boundaries. This improvement can be attributed to the fact that P2P identifies missing detail tokens according to the instruction, thereby complementing the primary content and yielding a more comprehensive visual representation. As a result, P2P achieves significant performance improvements on both seen and unseen classes.

5 Conclusion

In this paper, we propose P2P, a Prompt-to-Prompt generation methodology designed to achieve comprehensive semantic knowledge transfer from the seen to the unseen domain. Beyond focusing on primary visual features for the seen classes, P2P distills instructive prompts by discovering missing residual details that follow the task-specific instruction. Additionally, we introduce a grounding transformer (G-Former) to unify the visual-semantic space by embedding image-instruction tokens through cross-modal shared semantics. G-Former significantly improves cross-modal understanding, equipping P2P with an enhanced capability to discern and recover guidance-conditioned visual details. By integrating these details with the primary visual content, P2P acquires sufficient knowledge for both seen and unseen domains, showcasing notable zero-shot performance.

References

[1] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for attribute-based classification. In CVPR, pages 819–826, 2013.
[2] Zeynep Akata, Scott E. Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, pages 2927–2936, 2015.
[3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 35:23716–23736, 2022.
[4] Samet Cetin, Orhun Bugra Baran, and Ramazan Gokberk Cinbis. Closed-form sample probing for learning generative models in zero-shot learning. In ICLR, 2022.
[5] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, 2016.
[6] Shiming Chen, Ziming Hong, Wenjin Hou, Guo-Sen Xie, Yibing Song, Jian Zhao, Xinge You, Shuicheng Yan, and Ling Shao. Transzero++: Cross attribute-guided transformer for zero-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–17, 2022.
[7] Shiming Chen, Ziming Hong, Yang Liu, Guo-Sen Xie, Baigui Sun, Hao Li, Qinmu Peng, Ke Lu, and Xinge You. Transzero: Attribute-guided transformer for zero-shot learning. In AAAI, pages 330–338, 2022.
[8] Shiming Chen, Ziming Hong, Guosen Xie, Wenhan Wang, Qinmu Peng, Kai Wang, Jian Zhao, and Xinge You. Msdn: Mutually semantic distillation network for zero-shot learning. In CVPR, pages 7612–7621, 2022.
[9] Shiming Chen, Wenjin Hou, Ziming Hong, Xiaohan Ding, Yibing Song, Xinge You, Tongliang Liu, and Kun Zhang. Evolving semantic prototype improves generative zero-shot learning. In ICML, 2023.
[10] Shiming Chen, Wenjin Hou, Salman Khan, and Fahad Shahbaz Khan. Progressive semantic-guided vision transformer for zero-shot learning. In CVPR, 2024.
[11] Shiming Chen, Wenjie Wang, Beihao Xia, Qinmu Peng, Xinge You, Feng Zheng, and Ling Shao. Free: Feature refinement for generalized zero-shot learning. In ICCV, pages 122–131, 2021.
[12] Shiming Chen, Guo-Sen Xie, Yang Yang Liu, Qinmu Peng, Baigui Sun, Hao Li, Xinge You, and Ling Shao. Hsva: Hierarchical semantic-visual adaptation for zero-shot learning. In NeurIPS, pages 16622–16634, 2021.
[13] Yuxiao Chen, Jianbo Yuan, Yu Tian, Shijie Geng, Xinyu Li, Ding Zhou, Dimitris N Metaxas, and Hongxia Yang. Revisiting multimodal representation in contrastive learning: from patch and token embeddings to finite discrete tokens. In CVPR, pages 15095–15104, 2023.
[14] Zhi Chen, Yadan Luo, Ruihong Qiu, Sen Wang, Zi-Yu Huang, Jingjing Li, and Zheng Zhang. Semantics disentangling for generalized zero-shot learning. In ICCV, pages 8712–8720, 2021.
[15] Zhuo Chen, Yufeng Huang, Jiaoyan Chen, Yuxia Geng, Wen Zhang, Yin Fang, Jeff Z Pan, Wenting Song, and Huajun Chen. Duet: Cross-modal semantic grounding for contrastive zero-shot learning. In AAAI, pages 405–413, 2023.
[16] Daixuan Cheng, Shaohan Huang, Junyu Bi, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Furu Wei, Denvy Deng, and Qi Zhang. Uprise: Universal prompt retrieval for improving zero-shot evaluation. arXiv preprint arXiv:2303.08518, 2023.
[17] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, K. Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[19] Zongyan Han, Zhenyong Fu, Shuo Chen, and Jian Yang. Contrastive embedding for generalized zero-shot learning. In CVPR, pages 2371–2381, 2021.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[21] Yang Hu, Guihua Wen, Adriane Chapman, Pei Yang, Mingnan Luo, Yingxue Xu, Dan Dai, and Wendy Hall. Graph-based visual-semantic entanglement network for zero-shot image recognition. IEEE Transactions on Multimedia, 24:2473–2487, 2021.
[22] D. Huynh and E. Elhamifar. Fine-grained generalized zero-shot learning via dense attribute-based attention. In CVPR, pages 4482–4492, 2020.
[23] Dat Huynh and Ehsan Elhamifar. Compositional zero-shot learning via fine-grained dense feature composition. In NeurIPS, pages 19849–19860, 2020.
[24] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In ECCV, pages 709–727. Springer, 2022.
[25] Rohit Keshari, R. Singh, and Mayank Vatsa. Generalized zero-shot learning via over-complete distribution. In CVPR, pages 13297–13305, 2020.
[26] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. NeurIPS, 35:22199–22213, 2022.
[27] Xia Kong, Zuodong Gao, Xiaofan Li, Ming Hong, Jun Liu, Chengjie Wang, Yuan Xie, and Yanyun Qu. En-compactness: Self-distillation embedding & contrastive generation for generalized zero-shot learning. In CVPR, pages 9306–9315, 2022.
[28] Christoph H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, pages 951–958, 2009.
[29] Y. Li, Junge Zhang, Jianguo Zhang, and Kaiqi Huang. Discriminative learning of latent features for zero-shot recognition. In CVPR, pages 7463–7471, 2018.
[30] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36, 2024.
[31] Man Liu, Feng Li, Chunjie Zhang, Yunchao Wei, Huihui Bai, and Yao Zhao. Progressive semantic-visual mutual adaption for generalized zero-shot learning. In CVPR, pages 15337–15346, 2023.
[32] Man Liu, Chunjie Zhang, Huihui Bai, and Yao Zhao. Part-object progressive refinement network for zero-shot learning. IEEE Transactions on Image Processing, 2024.
[33] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023.
[34] Yang Liu, Jishun Guo, Deng Cai, and X. He. Attribute attention for semantic disambiguation in zero-shot learning. In ICCV, pages 6697–6706, 2019.
[35] Yang Liu, Lei Zhou, Xiao Bai, Yifei Huang, Lin Gu, Jun Zhou, and T. Harada. Goal-oriented gaze estimation for zero-shot learning. In CVPR, pages 3794–3803, 2021.
[36] Zhiwu Lu, Jiechao Guan, Aoxue Li, Tao Xiang, An Zhao, and Ji-Rong Wen. Zero and few shot learning with semantic feature synthesis and competitive learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43:2510–2523, 2021.
[37] Shaobo Min, Hantao Yao, Hongtao Xie, Chaoqun Wang, Z. Zha, and Yongdong Zhang. Domain-aware visual bias eliminating for generalized zero-shot learning. In CVPR, pages 12661–12670, 2020.
[38] Muhammad Ferjad Naeem, Muhammad Gul Zain Ali Khan, Yongqin Xian, Muhammad Zeshan Afzal, Didier Stricker, Luc Van Gool, and Federico Tombari. I2mvformer: Large language model generated multi-view document supervision for zero-shot image classification. In CVPR, pages 15169–15179, 2023.
[39] Sanath Narayan, A. Gupta, F. Khan, Cees G. M. Snoek, and L. Shao. Latent embedding feedback and discriminative features for zero-shot classification. In ECCV, pages 479–495, 2020.
[40] Mark Palatucci, Dean Pomerleau, Geoffrey E Hinton, and Tom M Mitchell. Zero-shot learning with semantic output codes. In NeurIPS, 2009.
[41] G. Patterson and J. Hays. Sun attribute database: Discovering, annotating, and recognizing scene attributes. In CVPR, pages 2751–2758, 2012.
[42] Jeffrey Pennington, R. Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
[43] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
[44] Zhiyin Shao, Xinyu Zhang, Meng Fang, Zhifeng Lin, Jian Wang, and Changxing Ding. Learning granularity-unified representations for text-to-image person re-identification. In ACM MM, pages 5566–5574, 2022.
[45] Antonio Torralba and Alexei A. Efros. Unbiased look at dataset bias. In CVPR, 2011.
[46] Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[47] Wenjia Xu, Yongqin Xian, Jiuniu Wang, Bernt Schiele, and Zeynep Akata. Attribute prototype network for any-shot learning. International Journal of Computer Vision, 130:1735–1753, 2022.
[48] M. R. Vyas, Hemanth Venkateswara, and S. Panchanathan. Leveraging seen and unseen semantic relationships for generative zero-shot learning. In ECCV, pages 70–86, 2020.
[49] Chaoqun Wang, Shaobo Min, Xuejin Chen, Xiaoyan Sun, and Houqiang Li. Dual progressive prototype network for generalized zero-shot learning. In NeurIPS, pages 2936–2948, 2021.
[50] Yuzhu Wang, Lechao Cheng, Chaowei Fang, Dingwen Zhang, Manni Duan, and Meng Wang. Revisiting the power of prompt for visual tuning. In ICML, 2024.
[51] Zhengbo Wang, Jian Liang, Ran He, Nan Xu, Zilei Wang, and Tieniu Tan. Improving zero-shot generalization for clip with synthesized prompts. In ICCV, pages 3032–3042, 2023.
[52] P. Welinder, S. Branson, T. Mita, C. Wah, Florian Schroff, Serge J. Belongie, and P. Perona. Caltech-ucsd birds 200. Technical Report CNS-TR-2010-001, Caltech, 2010.
[53] Yongqin Xian, Zeynep Akata, Gaurav Sharma, Q. Nguyen, M. Hein, and B. Schiele. Latent embeddings for zero-shot classification. In CVPR, pages 69–77, 2016.
[54] Yongqin Xian, Christoph H. Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:2251–2265, 2019.
[55] Yongqin Xian, Saurabh Sharma, B. Schiele, and Zeynep Akata. F-vaegan-d2: A feature generating framework for any-shot learning. In CVPR, pages 10267–10276, 2019.
[56] Guo-Sen Xie, L. Liu, Xiaobo Jin, F. Zhu, Zheng Zhang, J. Qin, Yazhou Yao, and L. Shao. Attentive region embedding network for zero-shot learning. In CVPR, pages 9376–9385, 2019.
[57] Guo-Sen Xie, L. Liu, Xiaobo Jin, F. Zhu, Zheng Zhang, Yazhou Yao, J. Qin, and L. Shao. Region graph embedding network for zero-shot learning. In ECCV, pages 562–580, 2020.
[58] Wenjia Xu, Yongqin Xian, Jiuniu Wang, B. Schiele, and Zeynep Akata. Attribute prototype network for zero-shot learning. In NeurIPS, pages 21969–21980, 2020.
[59] Zhongqi Yue, Tan Wang, Hanwang Zhang, Qianru Sun, and Xiansheng Hua. Counterfactual zero-shot and open-set visual recognition. In CVPR, pages 15404–15414, 2021.
[60] L. Zhang, Tao Xiang, and S. Gong. Learning a deep embedding model for zero-shot learning. In CVPR, pages 3010–3019, 2017.
[61] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, pages 3836–3847, 2023.
[62] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, pages 16816–16825, 2022.
[63] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.
[64] Yizhe Zhu, Jianwen Xie, Z. Tang, Xi Peng, and A. Elgammal. Semantic-guided multi-attention localization for zero-shot learning. In NeurIPS, 2019.