Instructing Prompt-to-Prompt Generation for Zero-Shot Learning

Man Liu1,2, Huihui Bai1,2🖂, Feng Li3🖂, Chunjie Zhang1,2, Yunchao Wei1,2,
Meng Wang3, Tat-Seng Chua4, Yao Zhao1,2
1 Institute of Information Science, Beijing Jiaotong University
2 Beijing Key Laboratory of Advanced Information Science and Network Technology
3 Hefei University of Technology  4 National University of Singapore
{manliu, hhbai, cjzhang, yunchao.wei, yzhao}@bjtu.edu.cn,
fengli@hfut.edu.cn, eric.mengwang@gmail.com,  chuats@comp.nus.edu.sg
🖂 Corresponding author.
Abstract

Zero-shot learning (ZSL) aims to explore semantic-visual interactions to discover comprehensive knowledge that transfers from seen categories to the classification of unseen categories. Recently, prompt engineering has emerged in ZSL, demonstrating impressive potential as it enables the zero-shot transfer of diverse visual concepts to downstream tasks. However, these methods still do not generalize well to broad unseen domains. A key reason is that the fixed adaptation of learnable prompts on seen domains makes them over-emphasize the primary visual features observed during training. In this work, we propose a Prompt-to-Prompt generation methodology (P2P), which addresses this issue by further embracing the instruction-following technique to distill instructive visual prompts for comprehensive transferable knowledge discovery. The core of P2P is to mine semantic-related instruction from prompt-conditioned visual features and text instruction on modal-sharing semantic concepts, and then inversely rectify the visual representations with the guidance of the learned instruction prompts. This compensates for the visual details missing from the primary contexts and further reduces the cross-modal disparity, enabling generalization to unseen domains. Through extensive experiments, we demonstrate that P2P achieves superior performance over state-of-the-art methods.

1 Introduction

Humans naturally interact with the world through various channels such as vision and language, enabling them to identify unseen objects based on prior knowledge. Leveraging this cognitive capability, zero-shot learning (ZSL) [40, 28] endeavors to classify objects from the unseen domain carrying knowledge from seen categories. One of the mainstream methods in ZSL is the embedding-based approach [29, 34, 57, 37, 58, 47, 22, 35, 8, 7, 32], which learns the semantic-visual alignment in a joint embedding space. Prior to such a cross-modal alignment, as depicted in Figure 1(a), embedding-based models typically commence with a visual encoder, such as Vision Transformer (ViT) [18] or ResNet [20] pre-trained on ImageNet [17], to initialize visual features. However, this can introduce distribution discrepancy when applied to downstream ZSL benchmarks as wider categories are unseen during pre-training, thereby resulting in cross-dataset bias that produces vague or even misleading visual representations [45].

To address this limitation, some methods [56, 64] incorporate attention mechanisms into ZSL frameworks to improve the visual representations derived from pre-trained encoders. For example, AREN [56] and SGMA [64] discover essential regions within the image under the guidance of attention. Inspired by the global capability of transformers [46], TransZero [7] and TransZero++ [6] devise an attribute-guided transformer to disentangle complex geometric relationships for feature augmentation. ZSLViT [10] introduces semantic-guided attention to establish semantic-visual correspondences through a progressive learning strategy. With the rise of the powerful zero-shot capability of prompt engineering [26, 16], recent works [24, 63, 62, 51] introduce prompts to efficiently adapt the pre-trained vision encoder to downstream datasets. As illustrated in Figure 1(b), these methods refine visual features by injecting learnable visual prompts into the pre-trained encoder, aiming to alleviate the cross-dataset bias. However, the learnable prompts tend to overfit to the base seen classes, concentrating solely on the primary visual features essential for recognizing seen categories [62]. Consequently, they may fail to capture crucial visual details needed for the wider set of unseen classes. These missing details, complementary to the primary content, play a pivotal role in facilitating cross-domain semantic transfer.

Figure 1: The motivation of P2P. (a) Embedding-based methods suffer from vague visual representations due to cross-dataset bias in pre-trained visual encoders, resulting in suboptimal visual-semantic alignment for ZSL. (b) While prompt-based methods mitigate cross-dataset bias, they overfit to the base seen classes and focus exclusively on primary visual features, omitting crucial details necessary for unseen categories. (c) Our P2P rectifies the visual representation by compensating for missing details through instructive prompts, thus ensuring sufficient visual clues for cross-domain knowledge transfer.

To this end, we introduce a Prompt-to-Prompt (P2P) generation methodology (see Figure 1(c)) to facilitate comprehensive knowledge discovery for both seen and unseen domains by distilling instructive visual prompts. By leveraging the idea of instruction-following, P2P distills instructive prompts under the guidance of semantic-related instruction, which encompasses the complementary details overlooked by the learnable visual prompts. In this way, P2P attends to the missing visual information beyond the primary content. Acknowledging the inherent disparity between visual images and text instruction [44, 13], we devise a grounding transformer (G-Former) wherein prompt-conditioned visual features and textual features are both represented as combinations of modal-sharing tokens. This harmonizes the granularity of cross-modal information with shared semantic concepts, leading to enhanced multi-modal understanding and a narrower cross-modal gap. Through residual extraction for instructive prompts in the unified granularity space, P2P rectifies the visual representation by discerning guidance-conditioned details, thus enriching semantic features and improving generalization in ZSL.

In sum, our work makes the following contributions: 1) We propose a Prompt-to-Prompt generation methodology for zero-shot learning, dubbed P2P, which employs instruction and prompt learning techniques to distill instructive prompts, improving the transfer of comprehensive semantic knowledge across seen and unseen domains. 2) P2P rectifies the visual features with the guidance of the learned instruction prompts in a cross-modal space, generating instructive prompts through residual extraction to compensate for the complementary details omitted by the learnable visual prompts. 3) Extensive experimental results show that our proposed method leads to better performance on ZSL benchmarks.

2 Related Work

Zero-Shot Learning. There are two principal methodologies for ZSL, i.e., generative-based ZSL and embedding-based ZSL. The former utilizes category attribute prototypes to synthesize visual features of additional unseen categories through generative adversarial nets [23, 48, 19], variational auto-encoders [25, 12, 14], or a combination of both [39, 11]. Although these methods compensate for the absence of the unseen domain during training, introducing extra data converts the ZSL problem into a fully supervised task. The embedding-based method represents the other mainstream branch for ZSL, achieved through the projection and alignment of information from the visual and semantic domains. Early embedding-based works [1, 2, 53, 60] directly map the global visual and semantic features into a common space for category predictions. Global visual information, however, falls short of capturing the subtle but substantial differences between categories, thus weakening discriminative representations. Recent efforts have focused on highlighting crucial visual regions through attention techniques. Some works [29, 64] crop and zoom in on significant local areas using coordinate positions obtained by attention mechanisms. Distinctive visual features are also emphasized by graph networks [57, 21] or attention guidance [56, 34, 37, 32]. Subsequent investigations [58, 49, 47, 22, 35, 8, 31, 38, 10] have adopted semantic-guided techniques that localize attribute-related regions via attribute descriptors. Despite these advancements, these methods still exhibit limitations in visual feature extraction on ZSL benchmarks, even with sophisticated manual designs for visual-semantic adaptation. Recent studies [63, 62, 51] leverage emerging prompt techniques and showcase zero-shot capability. They use learnable prompts to efficiently adapt the pre-trained vision encoder to downstream ZSL datasets for refined visual representation.
In this paper, we focus on leveraging the instruction-following technique to discern essential details omitted by the learnable prompts.

Prompt Learning. Prompt learning has emerged as a pivotal methodology in natural language processing (NLP), facilitating the adaptation of Large Language Models (LLMs) to various downstream tasks and scenarios [33]. By merely prepending a language instruction to the input text as in-context information, LLMs can understand a target task. Following this idea, subsequent works [26, 16] delve into the capabilities of prompt-based LLMs in zero-shot and few-shot scenarios. The great success of prompt learning in LLMs has recently sparked interest in computer vision [24, 3, 30, 50]. VPT [24] pioneered the integration of prompts with a vision encoder, enabling more parameter-efficient adaptation by injecting learnable prompts rather than using specific manual prompts. Building upon these advancements, recent studies such as CoOp [63], CoCoOp [62], and SHIP [51] have achieved notable success in general vision-language comprehension, showcasing their ability to handle visual few-shot scenarios. While these prompt-based models have made significant strides in recognition tasks, applying them to numerous real-world vision tasks remains challenging. Additionally, the introduction of learnable prompts might lead to overfitting, causing the model to focus solely on primary visual contents sufficient for seen classes while omitting crucial visual details necessary for recognizing unseen classes. In this study, we embrace the concept of prompt learning, employing instruction-specific guidance to distill instructive prompts. This approach allows us to flexibly uncover missing details that complement the primary visual features, thereby facilitating knowledge transfer for zero-shot learning.

3 Prompt-to-Prompt Generation

In this paper, we aim to enhance discriminative performance for both seen and unseen domains through the comprehensive discovery of transferable knowledge facilitated by instructive visual prompts. To achieve this goal, a Prompt-to-Prompt (P2P) generation methodology is proposed, which involves three steps. 1) P2P abstracts the primary visual features by introducing task-specific learnable prompts into the input space to alleviate the cross-dataset bias. 2) By injecting the instruction, P2P distills instructive prompts based on the semantic-related instruction from prompt-conditioned visual features and text instructions through the G-Former. 3) P2P rectifies the visual representations by the instructive prompts, addressing missing visual details that complement the primary content. Through the integration of learnable prompts and instructive prompts, P2P is expected to facilitate sufficient knowledge transfer for ZSL.

3.1 Problem Formulation

ZSL aims to discern the novel image categories within the unseen domain $\mathcal{D}^{u}$, leveraging the knowledge derived from the seen domain data $\mathcal{D}^{s}$. Here, $\mathcal{D}^{s}=\{(x,y,a_{y})\mid x\in\mathcal{X}^{s},y\in\mathcal{Y}^{s},a_{y}\in\mathcal{A}^{s}\}$ consists of the images $x$ in $\mathcal{X}^{s}$, their corresponding labels $y$, and the associated category prototypes $a_{y}$ from $\mathcal{A}^{s}$.
Similarly, the unseen domain data is defined as $\mathcal{D}^{u}=\{(x^{u},u,a_{u})\}$, where $x^{u}\in\mathcal{X}^{u}$, $u\in\mathcal{Y}^{u}$, $a_{u}\in\mathcal{A}^{u}$, with $\mathcal{A}=\mathcal{A}^{s}\cup\mathcal{A}^{u}$. Note that the category space is disjoint between the seen and unseen domains, i.e., $\mathcal{Y}^{s}\cap\mathcal{Y}^{u}=\varnothing$, $\mathcal{Y}^{s}\cup\mathcal{Y}^{u}=\mathcal{Y}$.
Utilizing the seen data $\mathcal{D}^{s}$ during the training stage, a fundamental embedding-based ZSL framework aims to acquire a mapping function that bridges the image space $\mathcal{X}^{s}$ with the attribute space $\mathcal{A}^{s}$. This mapping enables the model to generalize its knowledge to the unseen domain, effectively establishing connections between $\mathcal{X}^{u}$ and $\mathcal{A}^{u}$ for category inference. In scenarios where the testing phase includes both seen and unseen classes, the task of conventional ZSL extends to Generalized Zero-Shot Learning (GZSL), rendering it more applicable to real-world scenarios. Given an input image representation $f(x)$ during training, a basic embedding-based framework is optimized through visual-semantic alignment with the following objective:

$$\mathcal{L}_{cls}=-\sum_{x\in\mathcal{X}^{s}}\log\frac{\exp\left\langle\mathcal{M}(f(x)),a_{y}\right\rangle}{\sum_{\hat{y}\in\mathcal{Y}^{s}}\exp\left\langle\mathcal{M}(f(x)),a_{\hat{y}}\right\rangle} \tag{1}$$

where $\mathcal{M}(\cdot)$ is a mapping function, generally implemented via Global Average Pooling (GAP) and a linear projection, and $\left\langle\cdot,\cdot\right\rangle$ denotes the cosine similarity used for the category decision.
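As a rough illustration of Eq. 1, the following dependency-free Python sketch computes the loss for a single sample. It assumes the mapped feature $\mathcal{M}(f(x))$ is already available as a plain vector; the function names `cosine` and `cls_loss` and the toy prototypes are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def cls_loss(mapped_feature, prototypes, label):
    """Per-sample term of Eq. 1: negative log-softmax of cosine scores
    over the seen-class prototypes."""
    scores = [cosine(mapped_feature, a) for a in prototypes]
    top = max(scores)  # subtract the max for numerical stability
    log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_denominator - scores[label]


# Hypothetical toy data: 3 seen classes with 4-dimensional attribute prototypes.
prototypes = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
feature = [0.9, 0.1, 0.0, 0.0]  # stands in for M(f(x)); GAP and projection are omitted

loss_correct = cls_loss(feature, prototypes, label=0)
loss_wrong = cls_loss(feature, prototypes, label=2)
```

In this toy case the feature lies closest to the first prototype, so the loss is smaller when the ground-truth label is class 0 than when it is class 2.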

Figure 2: The framework of our P2P. Following semantic-related instruction, P2P distills the instructive prompts to rectify the visual representations, thus recovering the visual details that complement the primary content but are missed by the learnable prompts.

3.2 Visual Prompt Embedding

When a pre-trained Transformer model, e.g., ViT [18], is used, disparities between ImageNet and ZSL benchmarks may engender a cross-dataset bias. This bias can result in suboptimal visual representations, ultimately leading to undesirable visual-semantic interactions within the ZSL domain. Drawing inspiration from prior prompt learning methodologies [24], we opt to generate visual features under prompts for subsequent ZSL image recognition tasks, rather than straightforwardly deriving the visual features. However, prompt engineering poses a non-trivial challenge, necessitating domain expertise and being extremely time-consuming due to the iterative nature of trial and error. In line with the approach outlined in [24], we introduce a set of continuous embeddings $P$ with a length of $T$, referred to as visual prompts, into the input space alongside the image $x$. As shown in Figure 2, these visual prompts $P$ are prepended to the input sequence of the initial Transformer layer of ViT [18]. This process is formally expressed as:

$$\left[\bar{P},\bar{E}\right]=L\left(\left[P,E\right]\right) \tag{2}$$

Here, $E$ denotes the visual representation of patches extracted from the input image $x$, which are embedded into the latent space alongside positional encoding, represented as $E=\text{Embed}(x)$. The operation $\left[\cdot\right]$ signifies concatenation along the sequence length dimension. $L$ denotes the cascaded layers comprising ViT. For clarity, the class token has been omitted. Compared to hand-crafted prompts, the learnable prompts $P$ introduce only a small number of task-specific learnable parameters. $P$ optimizes the original ViT for improved applicability to downstream ZSL datasets, resulting in an enhanced visual representation $\bar{E}$.
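The sequence handling in Eq. 2 can be sketched in a few lines of plain Python. This is only a shape-level illustration: `encoder_layers` is a hypothetical identity placeholder standing in for the ViT layers $L$, and the toy sizes and values are assumptions chosen for readability.

```python
def prepend_prompts(prompts, patch_tokens):
    """Form [P, E] by concatenating along the sequence dimension."""
    return prompts + patch_tokens


def encoder_layers(sequence):
    """Placeholder for the ViT layers L (identity here); a real encoder
    would mix prompt and patch tokens through self-attention."""
    return [list(token) for token in sequence]


def split_outputs(sequence, num_prompts):
    """Split the encoder output back into prompt tokens and patch tokens."""
    return sequence[:num_prompts], sequence[num_prompts:]


# Hypothetical toy sizes: T = 2 prompt tokens, N = 3 patch tokens, width 4.
T, N, WIDTH = 2, 3, 4
P = [[0.01 * (t + 1)] * WIDTH for t in range(T)]  # learnable prompts (trained in practice)
E = [[float(n)] * WIDTH for n in range(N)]        # stands in for Embed(x)

P_bar, E_bar = split_outputs(encoder_layers(prepend_prompts(P, E)), T)
```

Only the $T$ prompt tokens add learnable parameters in this scheme; the patch tokens come from the image embedding.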

A fundamental approach for ZSL involves projecting the acquired features $\left[\bar{P},\bar{E}\right]$ into a mapping space and aligning them with the category prototype through cosine similarity. These learnable prompts lead ZSL to focus solely on the primary visual contents sufficient for seen classes, often overlooking other crucial visual details essential for unseen classes. To address this limitation, we propose deriving instruction-specific guidance to discern the missing details according to the text instruction. As illustrated in Figure 2, the shared attributes serve as the instruction, allowing us to reconstruct the missing details within the instruction through subsequent cross-modal embedding and instructive prompt generation.

3.3 Cross-modal Embedding

Given the inherent differences in semantic levels and granularity between visual images and textual instructions [44, 13], the reconstructed features, without explicit alignment of visual and semantic concepts, are likely to be sub-optimal. Thus, we introduce a grounding transformer (G-Former) into the cross-modal embedding process to generate instruction-specific guidance instead of directly using the instruction. Firstly, we employ GloVe [42] to obtain the text tokens of the input instructions, denoted as $S=\mathrm{GloVe}(\text{text})$. Subsequently, these text tokens $S$ are encoded through the G-Former, which grounds the input tokens $S$ to a modal-sharing token $R$. Specifically, the instruction-specific guidance is generated as follows:

$$\tilde{S}=\mathrm{softmax}\left(\mathrm{GMP}\left(Q_{S}\cdot K_{R}^{\mathsf{T}}\right)\right)\cdot V_{R} \tag{3}$$

where $Q_{S}$, $K_{R}$, and $V_{R}$ are the query, key, and value learned from $S$ and $R$, respectively. $\mathrm{GMP}$ denotes the max-pooling operation applied to the patch-level relevance between $S$ and $R$. This operation aids in eliminating the influence of irrelevant noisy patches that exhibit low relevance to the modal-sharing tokens, thereby refining a compact instruction-specific guidance $\tilde{S}$.

Similarly, the G-Former encodes visual tokens into the modal-sharing space by utilizing $\bar{E}$ as the query and producing the corresponding visual tokens through the following process:

$$\tilde{E}=\mathrm{softmax}\left(\mathrm{GMP}\left(Q_{\bar{E}}\cdot K_{R}^{\mathsf{T}}\right)\right)\cdot V_{R} \tag{4}$$

As a result, both the image and text tokens are represented as combinations of the common modal-sharing token $R$. This explicit alignment ensures that the granularities of the cross-modal information $\tilde{S}$ and $\tilde{E}$ are harmonized, thereby aiding the subsequent residual extraction process by narrowing the semantic-visual gap.
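One possible reading of Eqs. 3 and 4 is sketched below in plain Python. It assumes that GMP takes the maximum relevance over the input tokens for each modal-sharing token, so that a single weight vector over $R$ is obtained; the text does not spell out the pooling axis, so this is an assumption. The learned query, key, and value projections are also omitted (the inputs are treated as already projected), and the name `ground` and all toy values are hypothetical.

```python
import math


def ground(queries, keys, values):
    """Sketch of softmax(GMP(Q K^T)) V under the assumptions stated above."""
    # relevance[i][j] = <q_i, k_j>: input token i versus modal-sharing token j
    relevance = [[sum(a * b for a, b in zip(q, k)) for k in keys] for q in queries]
    # GMP (assumed axis): strongest response over all input tokens, per shared token
    pooled = [max(row[j] for row in relevance) for j in range(len(keys))]
    top = max(pooled)
    exps = [math.exp(p - top) for p in pooled]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    # weighted combination of the modal-sharing values
    grounded = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return grounded, weights


# Hypothetical modal-sharing tokens R, used as both keys and values for simplicity.
R = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_queries = [[2.0, 0.0], [1.5, 0.5]]                   # stands in for Q_S
visual_queries = [[0.0, 3.0], [0.2, 2.0], [0.1, 1.0]]     # stands in for Q_E-bar

S_tilde, w_text = ground(text_queries, R, R)
E_tilde, w_vis = ground(visual_queries, R, R)
```

Although the two modalities have different numbers of input tokens, both outputs are combinations of the same tokens $R$ and therefore live in the same space, which is the property the alignment relies on.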

3.4 Instructive Prompt Generation

Upon obtaining the instruction-specific guidance $\tilde{S}$, we proceed to rectify the visual representations by capturing residual features that compensate for missing visual details in the cross-modal space. Instead of relying on the commonly used attention mechanism for feature inference and recovery, we adopt a simpler approach inspired by LLaVA [30], which utilizes a straightforward linear layer to establish a connection between image features and the word embedding space. Here, we concatenate the guidance $\tilde{S}$ with the visual features $\tilde{E}$ and feed them into a zero-linear layer, which acts as a filter for detail selection under the learned instruction guidance $\tilde{S}$. This process can be formulated as:

$$H=\mathrm{ZLinear}\left(\left[\tilde{S},\tilde{E}\right]\right) \tag{5}$$

where $\mathrm{ZLinear}$ denotes a linear layer whose weights are initialized to zeros [61], which effectively suppresses harmful noise at the beginning of the training process. To ensure that the residual feature $H$ is a meaningful representation whose semantic concept is consistent with the category prototypes, we align $H$ with its category prototype $a_{y}$ using a consistency loss $\mathcal{L}_{cons}$:

$$\mathcal{L}_{cons}=\left\|\mathrm{MLP}(a_{y})-H\right\| \tag{6}$$

where MLP refers to a multi-layer perceptron that projects $a_{y}$ into the cross-modal embedding space for better alignment. Since the extracted residual provides informative representations related to the instruction, we enhance the visual representation by supplying these missing features, thus forming instructive prompts. Specifically, we compose the instructive prompts by adding the residual visual details $H$ to each original prompt token:

$$\tilde{P}=\left[\bar{p}_{1}+H,\bar{p}_{2}+H,\ldots,\bar{p}_{T}+H\right] \tag{7}$$

The instructive prompts $\tilde{P}$ attend to the missing details of the input image $x$ under the instruction guidance $\tilde{S}$, encompassing the complementary details overlooked by the learnable visual prompts. Subsequently, we merge the instructive prompts $\tilde{P}$ with the original primary visual tokens through concatenation to obtain the final rectified visual representation $f(x)$ for $x$:

$$f(x)=\left[\tilde{P},\bar{E}\right] \tag{8}$$

Together with the original primary output, P2P is expected to provide an improved comprehensive transferable knowledge discovery.
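The residual path of Eqs. 5 to 8 can be sketched as follows. This plain-Python illustration assumes that $\tilde{S}$ and $\tilde{E}$ are single vectors, that the norm in Eq. 6 is the Euclidean norm, and that the MLP output is given directly; the helper names and toy values are hypothetical. It mainly shows a consequence of zero initialization: at the start of training $H$ is zero, so the instructive prompts equal the original prompt tokens.

```python
import math


def make_zero_linear(in_dim, out_dim):
    """Weights and bias of a linear layer initialized to zeros."""
    return [[0.0] * in_dim for _ in range(out_dim)], [0.0] * out_dim


def linear(weight, bias, x):
    """Apply y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def add_residual(prompt_tokens, residual):
    """Eq. 7: add the residual H to every prompt token."""
    return [[p + h for p, h in zip(token, residual)] for token in prompt_tokens]


def consistency_loss(projected_prototype, residual):
    """Eq. 6 with an assumed Euclidean norm; the MLP projection is taken as given."""
    return math.sqrt(sum((a - h) ** 2 for a, h in zip(projected_prototype, residual)))


# Hypothetical toy values.
S_tilde, E_tilde = [0.5, -0.2], [0.1, 0.4]
P_bar = [[1.0, 2.0], [3.0, 4.0]]               # T = 2 prompt tokens
E_bar = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]   # N = 3 patch tokens

weight, bias = make_zero_linear(in_dim=4, out_dim=2)
H_init = linear(weight, bias, S_tilde + E_tilde)   # Eq. 5 at initialization
P_tilde_init = add_residual(P_bar, H_init)

weight[0][0] = 1.0                                  # pretend training updated one weight
H = linear(weight, bias, S_tilde + E_tilde)
P_tilde = add_residual(P_bar, H)
f_x = P_tilde + E_bar                               # Eq. 8: concatenate along the sequence
```

Starting from a zero residual means the rectification is introduced gradually as the layer's weights move away from zero.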

3.5 Model Optimization and Inference

Optimization. The overall objective loss function of P2P is formulated as follows:

$$\mathcal{L}=\mathcal{L}_{cls}+\lambda_{cons}\mathcal{L}_{cons}+\lambda_{deb}\mathcal{L}_{deb} \tag{9}$$

where $\lambda_{cons}$ and $\lambda_{deb}$ are the hyper-parameters controlling the weights of the semantic consistency loss $\mathcal{L}_{cons}$ and the debiasing loss $\mathcal{L}_{deb}$, respectively. As shown in Eq. 9, we also apply a debiasing loss $\mathcal{L}_{deb}$ to mitigate the seen-unseen bias following [31, 32]. It aims to balance the score dependency between the seen and unseen domains, pursuing distribution consistency in terms of both mean and variance:

$$\mathcal{L}_{deb}=\|\alpha_{s}-\alpha_{u}\|_{2}^{2}+\|\beta_{s}-\beta_{u}\|_{2}^{2} \tag{10}$$

where $\alpha_{s}$ and $\beta_{s}$ represent the mean and variance, respectively, of the seen prediction scores $\left\langle\mathcal{M}(f(x)),a_{\hat{y}}\right\rangle$ with $\hat{y}\in\mathcal{Y}^{s}$. Similarly, $\alpha_{u}$ and $\beta_{u}$ denote the mean and variance, respectively, of the unseen prediction scores $\left\langle\mathcal{M}(f(x)),a_{\hat{y}}\right\rangle$ with $\hat{y}\in\mathcal{Y}^{u}$.
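A minimal sketch of Eq. 10 for one batch of scores is given below. It assumes scalar statistics and the population variance; the text does not specify which variance estimator is used, so `pvariance` is an assumption, and the function name and toy scores are hypothetical.

```python
from statistics import fmean, pvariance


def debias_loss(seen_scores, unseen_scores):
    """Eq. 10: squared gaps between the mean and the variance of the
    seen-class and unseen-class prediction scores."""
    alpha_s, alpha_u = fmean(seen_scores), fmean(unseen_scores)
    beta_s, beta_u = pvariance(seen_scores), pvariance(unseen_scores)
    return (alpha_s - alpha_u) ** 2 + (beta_s - beta_u) ** 2


# Hypothetical cosine scores of one image against seen and unseen prototypes.
biased = debias_loss(seen_scores=[0.8, 0.6], unseen_scores=[0.2, 0.0])
balanced = debias_loss(seen_scores=[0.5, 0.3], unseen_scores=[0.5, 0.3])
```

In the biased example the two score sets have the same spread but means of 0.7 and 0.1, so the loss is driven entirely by the mean gap; it vanishes when both statistics match.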

Inference. During training, the model merely learns about the knowledge of seen categories, while unseen categories are inferred at testing time:

$$\tilde{y}=\arg\max_{\hat{y}\in\mathcal{Y}^{u}}\left(\left\langle\mathcal{M}(f(x)),a_{\hat{y}}\right\rangle\right) \tag{11}$$

In the GZSL setting, both seen and unseen categories are encompassed. To jointly define the category, calibrated stacking (CS) [5] is applied:

$$\tilde{y}=\arg\max_{\hat{y}\in\mathcal{Y}}\left(\left\langle\mathcal{M}(f(x)),a_{\hat{y}}\right\rangle-\gamma\,\mathbb{I}_{\left[\hat{y}\in\mathcal{Y}^{s}\right]}\right) \tag{12}$$

Here, $\mathbb{I}_{[\hat{y}\in\mathcal{Y}^{s}]}$ is an indicator function that equals 1 when $\hat{y}\in\mathcal{Y}^{s}$ and 0 otherwise. The calibration factor $\gamma$ trades off the degree of calibration on seen categories, and $\tilde{y}$ is the predicted category of an input visual sample $x$.
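The two inference rules can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' code: the function `predict`, the toy prototypes, and the two-dimensional embedding standing in for $\mathcal{M}(f(x))$ are all hypothetical.

```python
def dot(u, v):
    """Inner product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def predict(embedding, prototypes, seen_classes, gamma=0.0, unseen_only=False):
    """Return the label with the highest (calibrated) compatibility score.

    unseen_only=True restricts the search to unseen classes, as in Eq. (11).
    Otherwise gamma is subtracted from seen-class scores, as in Eq. (12).
    """
    best_label, best_score = None, float("-inf")
    for label, proto in prototypes.items():
        if unseen_only and label in seen_classes:
            continue  # ZSL: seen classes are not candidates
        score = dot(embedding, proto)
        if label in seen_classes:
            score -= gamma  # GZSL: calibrated stacking penalty
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical class prototypes a_y and a mapped visual feature M(f(x)).
prototypes = {"sparrow": [1.0, 0.0], "finch": [0.8, 0.6], "owl": [0.0, 1.0]}
seen = {"sparrow", "finch"}
x = [0.7, 0.6]

zsl_label = predict(x, prototypes, seen, unseen_only=True)
gzsl_uncalibrated = predict(x, prototypes, seen, gamma=0.0)
gzsl_calibrated = predict(x, prototypes, seen, gamma=0.5)
print(zsl_label, gzsl_uncalibrated, gzsl_calibrated)
```

With $\gamma=0$ the seen class "finch" wins in this toy example, whereas a sufficiently large $\gamma$ shifts the prediction to the unseen class, which illustrates how the calibration factor counteracts the bias toward seen categories.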

4 Experiments

4.1 Experimental Settings

Datasets. We assess P2P on three standard benchmark datasets: Caltech-UCSD Birds-200-2011 (CUB) [52], SUN Attribute (SUN) [41], and Animals with Attributes 2 (AwA2) [54]. The split into seen and unseen categories follows the Proposed Split (PS) [54]. CUB consists of 11,788 images of 200 bird classes, with a 150/50 seen/unseen split, and is annotated with 312 attributes. SUN, a large scene dataset, contains 14,340 images spanning 717 classes, with a 645/72 seen/unseen split, and is annotated with 102 attributes. AwA2 is coarser, with only 50 animal classes (40/10 seen/unseen), but contains 37,322 images and is described by 85 attributes.

Evaluation Metrics. We assess top-1 accuracy in both the ZSL and GZSL settings. In the ZSL setting, we evaluate accuracy only on unseen classes, denoted as $acc$. In the GZSL setting, following [54], we use the harmonic mean $H = 2 \times S \times U / (S + U)$ to evaluate our framework, where $S$ and $U$ are the top-1 accuracies on the seen and unseen classes, respectively.
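The harmonic mean can be computed directly from the two accuracies. The helper below is a small sketch (the function name `harmonic_mean` is ours); the inputs are the seen/unseen accuracies reported in Table 1 for P2P on CUB and for AREN on AwA2.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean H = 2 * S * U / (S + U) of seen and unseen accuracy."""
    if seen_acc + unseen_acc == 0:
        return 0.0  # avoid division by zero when both accuracies are 0
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)

# Balanced case: P2P on CUB (S = 76.4, U = 73.1 in Table 1).
h_balanced = harmonic_mean(76.4, 73.1)
# Imbalanced case: AREN on AwA2 (S = 92.9, U = 15.6 in Table 1).
h_imbalanced = harmonic_mean(92.9, 15.6)
print(round(h_balanced, 1), round(h_imbalanced, 1))
```

Because the harmonic mean is dominated by the smaller of the two values, a model with very high seen accuracy but low unseen accuracy still receives a low $H$, which is why $H$ is preferred over a simple average in GZSL.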

Implementation Details. Unlike previous approaches in ZSL that employ ResNet [20] models as visual backbones, we opt for the ViT-Base model [18] as our visual feature extractor. We maintain an input image resolution of $224\times 224$, with a patch size of $16\times 16$. Our framework is implemented using PyTorch and executed on an NVIDIA GeForce RTX 3090 GPU.

Table 1: Results (%) of state-of-the-art ZSL and GZSL models on CUB, SUN, and AwA2, including generative and embedding-based methods. The best and second-best results are marked in red and blue, respectively. The symbol “*” denotes ViT-based methods. Results indicated with “**” are taken from [10].
| Methods | Venue | CUB $acc$ (ZSL) | CUB $U$ | CUB $S$ | CUB $H$ | SUN $acc$ (ZSL) | SUN $U$ | SUN $S$ | SUN $H$ | AwA2 $acc$ (ZSL) | AwA2 $U$ | AwA2 $S$ | AwA2 $H$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Generative-based Methods | | | | | | | | | | | | | |
| f-VAEGAN [55] | CVPR’19 | 61.0 | 48.4 | 60.1 | 53.6 | 64.7 | 45.1 | 38.0 | 41.3 | 71.1 | 57.6 | 70.6 | 63.5 |
| OCD-CVAE [25] | CVPR’20 | – | 44.8 | 59.9 | 51.3 | – | 44.8 | 42.9 | 43.8 | – | 59.5 | 73.4 | 65.7 |
| Composer [23] | NeurIPS’20 | 69.4 | 56.4 | 63.8 | 59.9 | 62.6 | 55.1 | 22.0 | 31.4 | 71.5 | 62.1 | 77.3 | 68.8 |
| TF-VAEGAN [39] | ECCV’20 | 64.9 | 52.8 | 64.7 | 58.1 | 66.0 | 45.6 | 40.7 | 43.0 | 72.2 | 59.8 | 75.1 | 66.6 |
| GCM-CF [59] | CVPR’21 | – | 61.0 | 59.7 | 60.3 | – | 47.9 | 37.8 | 42.2 | – | 60.4 | 75.1 | 67.0 |
| SDGZSL [14] | ICCV’21 | 75.5 | 59.9 | 66.4 | 63.0 | – | – | – | – | 72.1 | 64.6 | 73.6 | 68.8 |
| CE-GZSL [19] | CVPR’21 | 77.5 | 63.9 | 66.8 | 65.3 | 63.3 | 48.8 | 38.6 | 43.1 | 70.4 | 63.1 | 78.6 | 70.0 |
| ICCE [27] | CVPR’22 | 78.4 | 67.3 | 65.5 | 66.4 | – | – | – | – | 72.7 | 65.3 | 82.3 | 72.8 |
| FREE [11] | ICCV’21 | – | 55.7 | 59.9 | 57.7 | – | 47.4 | 37.2 | 41.7 | – | 60.4 | 75.4 | 67.1 |
| HSVA [12] | NeurIPS’21 | 62.8 | 52.7 | 58.3 | 55.3 | 63.8 | 48.6 | 39.0 | 43.3 | – | 59.3 | 76.6 | 66.8 |
| LBP [36] | TPAMI’21 | 61.9 | 42.7 | 71.6 | 53.5 | 63.2 | 39.2 | 36.9 | 38.1 | – | – | – | – |
| FREE+ESZSL [4] | ICLR’22 | – | 51.6 | 60.4 | 55.7 | – | 48.2 | 36.5 | 41.5 | – | 51.3 | 78.0 | 61.8 |
| f-VAEGAN+DSP [9] | ICML’23 | 62.8 | 62.5 | 73.1 | 67.4 | 68.6 | 57.7 | 41.3 | 48.1 | 71.6 | 63.7 | 88.8 | 74.2 |
| SHIP** [51] | ICCV’23 | – | 55.3 | 58.9 | 57.1 | – | – | – | – | – | – | – | – |
| Embedding-based Methods | | | | | | | | | | | | | |
| SGMA [64] | NeurIPS’19 | 71.0 | 36.7 | 71.3 | 48.5 | – | – | – | – | 68.8 | 37.6 | 87.1 | 52.5 |
| AREN [56] | CVPR’19 | 71.8 | 38.9 | 78.7 | 52.1 | 60.6 | 19.0 | 38.8 | 25.5 | 67.9 | 15.6 | 92.9 | 26.7 |
| LFGAA [34] | ICCV’19 | 67.6 | 36.2 | 80.9 | 50.0 | 61.5 | 18.5 | 40.0 | 25.3 | 68.1 | 27.0 | 93.4 | 41.9 |
| APN [58] | NeurIPS’20 | 72.0 | 65.3 | 69.3 | 67.2 | 61.6 | 41.9 | 34.0 | 37.6 | 68.4 | 57.1 | 72.4 | 63.9 |
| DAZLE [22] | CVPR’20 | 66.0 | 56.7 | 59.6 | 58.1 | 59.4 | 52.3 | 24.3 | 33.2 | 67.9 | 60.3 | 75.7 | 67.1 |
| DVBE [37] | CVPR’20 | – | 53.2 | 60.2 | 56.5 | – | 45.0 | 37.2 | 40.7 | – | 63.6 | 70.8 | 67.0 |
| GEM-ZSL [35] | CVPR’21 | 77.8 | 64.8 | 77.1 | 70.4 | 62.8 | 38.1 | 35.7 | 36.9 | 67.3 | 64.8 | 77.5 | 70.6 |
| DPPN [49] | NeurIPS’21 | 77.8 | 70.2 | 77.1 | 73.5 | 61.5 | 47.9 | 35.8 | 41.0 | 73.3 | 63.1 | 86.8 | 73.1 |
| CLIP [43] | ICML’21 | – | 55.2 | 54.8 | 55.0 | – | – | – | – | – | – | – | – |
| CoOP** [63] | IJCV’22 | – | 49.2 | 63.8 | 55.6 | – | – | – | – | – | – | – | – |
| MSDN [8] | CVPR’22 | 76.1 | 68.7 | 67.5 | 68.1 | 65.8 | 52.2 | 34.2 | 41.3 | 70.1 | 62.0 | 74.5 | 67.7 |
| TransZero [7] | AAAI’22 | 76.8 | 69.3 | 68.3 | 68.8 | 65.6 | 52.6 | 33.4 | 40.8 | 70.1 | 61.3 | 82.3 | 70.2 |
| TransZero++ [6] | TPAMI’22 | 78.3 | 67.5 | 73.6 | 70.4 | 67.6 | 48.6 | 37.8 | 42.5 | 72.6 | 64.6 | 82.7 | 72.5 |
| DUET* [15] | AAAI’23 | 72.3 | 62.9 | 72.8 | 67.5 | 64.4 | 45.7 | 45.8 | 45.8 | 69.9 | 63.7 | 84.7 | 72.7 |
| I2MVFormer* [38] | CVPR’23 | 42.1 | 32.4 | 63.1 | 42.8 | – | – | – | – | 73.6 | 66.6 | 82.9 | 73.8 |
| ZSLViT* [10] | CVPR’24 | 78.9 | 69.4 | 78.2 | 73.6 | 68.3 | 45.9 | 48.3 | 47.3 | 70.2 | 66.1 | 84.6 | 74.2 |
| P2P* (Ours) | | 80.3 | 73.1 | 76.4 | 74.7 | 70.4 | 58.6 | 45.2 | 51.0 | 75.2 | 70.3 | 80.1 | 74.9 |

4.2 Comparison with State-of-the-Art Methods

We compare the proposed approach with recent state-of-the-art methods; the results are given in Table 1.

Results of Conventional Zero-Shot Learning. For conventional ZSL, our method surpasses the best competing method by 1.4%, 1.8%, and 1.9% in $acc$ on the CUB, SUN, and AwA2 datasets, respectively, reaching state-of-the-art $acc$ of 80.3%, 70.4%, and 75.2%. These results demonstrate that learning instructive prompts on top of pre-trained visual encoders can effectively enhance visual representations and improve knowledge transfer to the unseen domain. Compared to recent methods [35, 56, 7, 6, 10] that refine visual features obtained from pre-trained visual encoders using attention mechanisms, P2P improves $acc$ by at least 1.4%, 2.1%, and 2.6% on CUB, SUN, and AwA2, respectively. This indicates that, with the assistance of prompt learning and instruction following, P2P improves category discriminative power in the unseen domain.

Results of Generalized Zero-Shot Learning. Table 1 also reports the results of the compared methods in the GZSL setting. P2P achieves state-of-the-art $H$ across all datasets: 74.7%, 51.0%, and 74.9% on CUB, SUN, and AwA2, respectively. Notably, P2P also significantly outperforms other prompt-based methods (CLIP [43], CoOp [63], and SHIP [51]), with substantial margins of 19.7%, 18.9%, and 17.4% on the CUB dataset. These results demonstrate that exploring missing details complementary to the prompt-conditioned primary content can effectively enhance visual representations and cross-domain transferability.

4.3 Ablation Study

Table 2: Ablation study of P2P under the ZSL and GZSL settings on the CUB, SUN, and AwA2 datasets.
| Methods | CUB $acc$ (ZSL) | CUB $U$ | CUB $S$ | CUB $H$ | SUN $acc$ (ZSL) | SUN $U$ | SUN $S$ | SUN $H$ | AwA2 $acc$ (ZSL) | AwA2 $U$ | AwA2 $S$ | AwA2 $H$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P2P w/o prompts $P$ | 70.2 | 67.0 | 69.1 | 68.0 | 64.5 | 53.8 | 25.8 | 34.8 | 70.7 | 61.0 | 88.2 | 72.1 |
| P2P w/o instruction $S$ | 75.8 | 72.3 | 72.7 | 72.5 | 68.0 | 58.1 | 38.5 | 46.2 | 73.2 | 65.5 | 85.1 | 74.0 |
| P2P w/o G-Former | 78.8 | 75.6 | 72.1 | 73.8 | 68.8 | 43.6 | 54.9 | 48.6 | 73.7 | 66.8 | 81.8 | 73.5 |
| P2P (full) | 80.3 | 73.1 | 76.4 | 74.7 | 70.4 | 58.6 | 45.2 | 51.0 | 75.2 | 70.3 | 80.1 | 74.9 |

Effect of components in P2P. P2P aims to learn comprehensive representations by integrating the primary content captured by the learnable prompts with the missing details discovered under the guidance of the instruction. We therefore evaluate the key components, i.e., the learnable prompts $P$, the instruction $S$, and G-Former, as shown in Table 2. When the prompts are removed, the model reduces to the baseline, in which the original visual features extracted from the vanilla ViT are directly projected into the semantic space for category inference. P2P outperforms this baseline by large $acc$/$H$ margins of 10.1%/6.7%, 5.9%/16.2%, and 4.5%/2.8% on the CUB, SUN, and AwA2 datasets, respectively. Without the instruction, P2P performs significantly worse than the full model: $acc$/$H$ drop by 4.5%/2.2% on CUB, 2.4%/4.8% on SUN, and 2.0%/0.9% on AwA2. Additionally, G-Former enhances the residual extraction by providing a unified visual-semantic space, resulting in $acc$/$H$ improvements of 1.5%/0.9%, 1.6%/2.4%, and 1.5%/1.4% on CUB, SUN, and AwA2, respectively.
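The per-component margins follow from the $acc$ and $H$ columns of Table 2 by simple subtraction. The short script below recomputes them; the dictionary layout and the names `table2` and `margins` are our own, and only the numbers are copied from Table 2.

```python
# (acc, H) per dataset, copied from Table 2.
table2 = {
    "w/o prompts P":     {"CUB": (70.2, 68.0), "SUN": (64.5, 34.8), "AwA2": (70.7, 72.1)},
    "w/o instruction S": {"CUB": (75.8, 72.5), "SUN": (68.0, 46.2), "AwA2": (73.2, 74.0)},
    "w/o G-Former":      {"CUB": (78.8, 73.8), "SUN": (68.8, 48.6), "AwA2": (73.7, 73.5)},
}
full = {"CUB": (80.3, 74.7), "SUN": (70.4, 51.0), "AwA2": (75.2, 74.9)}

def margins(variant):
    """Return {dataset: (acc gain, H gain)} of the full model over a variant."""
    return {
        dataset: (round(full[dataset][0] - acc, 1), round(full[dataset][1] - h, 1))
        for dataset, (acc, h) in table2[variant].items()
    }

for variant in table2:
    print(variant, margins(variant))
```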

Figure 3: Effect of loss hyper-parameters on (a) CUB (b) SUN and (c) AwA2 datasets.

Effect of $\lambda_{cons}$ and $\lambda_{deb}$. $\lambda_{cons}$ and $\lambda_{deb}$ are the hyper-parameters that weight $\mathcal{L}_{cons}$ and $\mathcal{L}_{deb}$, respectively. We evaluate their effect in Figure 3, varying $\lambda_{cons}$ from 0.0 to 2.0. Once the semantic consistency loss $\mathcal{L}_{cons}$ is introduced into P2P (i.e., $\lambda_{cons}>0$), $H$ increases across all datasets, and the best $H$ is obtained at $\lambda_{cons}=1.0$. This indicates that semantic consistency is effective for the residual visual representation, aligning it closely with the category prototype for recognition. When $\lambda_{cons}>1.0$, $H$ starts to drop. Thus, we set $\lambda_{cons}=1.0$. Additionally, as we gradually increase $\lambda_{deb}$, more attention is paid to pursuing a consistent distribution between seen and unseen predictions, which improves the unseen accuracy $U$ and ultimately yields better $H$.

Figure 4: Effect of the length ($T$) of prompts $P$ on (a) CUB (b) SUN and (c) AwA2 datasets.

Effect of $T$. $T$ is the length of the learnable visual prompts $P$. We sweep the prompt length $T\in\{1,3,5,7,9\}$ to evaluate its effect on recognition performance. Figure 4 shows $acc$ and $H$ as $T$ varies. The best performance is achieved when $T$ is around 5. Notably, even with only one prompt, P2P still significantly outperforms the baseline (first row of Table 2). Thus, we set $T=5$ for CUB, SUN, and AwA2.
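To give a sense of how small the prompt budget is relative to the image tokens, the sketch below counts the tokens seen by the ViT for each prompt length in the sweep. It assumes that the $T$ prompts are concatenated with the patch tokens and an optional class token, in the style of visual prompt tuning [24]; whether P2P inserts its prompts in exactly this way is not restated here, so `sequence_length` and its `use_class_token` flag are illustrative assumptions.

```python
IMAGE_SIZE = 224  # input resolution from the implementation details
PATCH_SIZE = 16   # ViT patch size from the implementation details

def sequence_length(num_prompts, use_class_token=True):
    """Number of tokens when num_prompts prompts join the patch tokens.

    Assumes simple concatenation of prompts, patches, and (optionally)
    a class token; this layout is an assumption of the sketch.
    """
    patches_per_side = IMAGE_SIZE // PATCH_SIZE
    num_patches = patches_per_side ** 2
    return num_patches + num_prompts + (1 if use_class_token else 0)

for T in (1, 3, 5, 7, 9):
    print(f"T={T}: {sequence_length(T)} tokens")
```

Under this assumption, even the largest setting adds only a handful of tokens to the 196 patch tokens, so the sweep mainly probes how much task-specific capacity the prompts need rather than any noticeable change in sequence length.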

Figure 5: t-SNE visualizations of visual features for seen classes and unseen classes. The 10 colors denote 10 different seen/unseen classes randomly selected from CUB and AwA2.

Qualitative Results. Figure 5 presents t-SNE visualizations of visual features for seen and unseen classes on CUB and AwA2, learned by P2P w/o prompts $P$, P2P w/o instruction $S$, and our full P2P. The visual features extracted by P2P w/o $P$ lack distinctiveness within certain classes, while those acquired with the assistance of learnable prompts are of higher quality, with a more compact and discriminative distribution. This observation suggests that prompts play a crucial role in adapting the pre-trained ViT to downstream ZSL tasks by generating high-quality features for seen classes. Furthermore, compared to P2P w/o $S$, the visual features learned by our full P2P are more distinguishable, with higher inter-class discrepancy and clearer decision boundaries. This improvement can be attributed to the fact that P2P identifies missing detail tokens according to the instruction, thereby complementing the primary content and yielding a more comprehensive visual representation. As a result, P2P achieves significant performance improvements on both seen and unseen classes.

5 Conclusion

In this paper, we propose P2P, a Prompt-to-Prompt generation methodology designed to achieve comprehensive semantic knowledge transfer from the seen to the unseen domain. Beyond focusing on primary visual features for base seen classes, P2P distills instructive prompts by discovering missing residual details following the task-specific instruction. Additionally, we introduce a grounding transformer (G-Former) to unify the visual-semantic space by embedding image-instruction tokens through cross-modal sharing semantics. G-Former significantly improves cross-modal understanding, equipping P2P with an enhanced capability to discern and recover guidance-conditioned visual details. By integrating these details with the primary visual content, P2P acquires sufficient knowledge for both seen and unseen domains, showcasing notable zero-shot performance.

References

  • [1] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for attribute-based classification. In CVPR, pages 819–826, 2013.
  • [2] Zeynep Akata, Scott E. Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, pages 2927–2936, 2015.
  • [3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 35:23716–23736, 2022.
  • [4] Samet Cetin, Orhun Bugra Baran, and Ramazan Gokberk Cinbis. Closed-form sample probing for learning generative models in zero-shot learning. In ICLR, 2022.
  • [5] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, 2016.
  • [6] Shiming Chen, Ziming Hong, Wenjin Hou, Guo-Sen Xie, Yibing Song, Jian Zhao, Xinge You, Shuicheng Yan, and Ling Shao. Transzero++: Cross attribute-guided transformer for zero-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–17, 2022.
  • [7] Shiming Chen, Ziming Hong, Yang Liu, Guo-Sen Xie, Baigui Sun, Hao Li, Qinmu Peng, Ke Lu, and Xinge You. Transzero: Attribute-guided transformer for zero-shot learning. In AAAI, pages 330–338, 2022.
  • [8] Shiming Chen, Ziming Hong, Guosen Xie, Wenhan Wang, Qinmu Peng, Kai Wang, Jian Zhao, and Xinge You. Msdn: Mutually semantic distillation network for zero-shot learning. In CVPR, pages 7612–7621, 2022.
  • [9] Shiming Chen, Wenjin Hou, Ziming Hong, Xiaohan Ding, Yibing Song, Xinge You, Tongliang Liu, and Kun Zhang. Evolving semantic prototype improves generative zero-shot learning. In ICML, 2023.
  • [10] Shiming Chen, Wenjin Hou, Salman Khan, and Fahad Shahbaz Khan. Progressive semantic-guided vision transformer for zero-shot learning. In CVPR, 2024.
  • [11] Shiming Chen, Wenjie Wang, Beihao Xia, Qinmu Peng, Xinge You, Feng Zheng, and Ling Shao. Free: Feature refinement for generalized zero-shot learning. In ICCV, pages 122–131, 2021.
  • [12] Shiming Chen, Guo-Sen Xie, Yang Yang Liu, Qinmu Peng, Baigui Sun, Hao Li, Xinge You, and Ling Shao. Hsva: Hierarchical semantic-visual adaptation for zero-shot learning. In NeurIPS, pages 16622–16634, 2021.
  • [13] Yuxiao Chen, Jianbo Yuan, Yu Tian, Shijie Geng, Xinyu Li, Ding Zhou, Dimitris N Metaxas, and Hongxia Yang. Revisiting multimodal representation in contrastive learning: from patch and token embeddings to finite discrete tokens. In CVPR, pages 15095–15104, 2023.
  • [14] Zhi Chen, Yadan Luo, Ruihong Qiu, Sen Wang, Zi-Yu Huang, Jingjing Li, and Zheng Zhang. Semantics disentangling for generalized zero-shot learning. In ICCV, pages 8712–8720, 2021.
  • [15] Zhuo Chen, Yufeng Huang, Jiaoyan Chen, Yuxia Geng, Wen Zhang, Yin Fang, Jeff Z Pan, Wenting Song, and Huajun Chen. Duet: Cross-modal semantic grounding for contrastive zero-shot learning. In AAAI, pages 405–413, 2023.
  • [16] Daixuan Cheng, Shaohan Huang, Junyu Bi, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Furu Wei, Denvy Deng, and Qi Zhang. Uprise: Universal prompt retrieval for improving zero-shot evaluation. arXiv preprint arXiv:2303.08518, 2023.
  • [17] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, K. Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
  • [18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  • [19] Zongyan Han, Zhenyong Fu, Shuo Chen, and Jian Yang. Contrastive embedding for generalized zero-shot learning. In CVPR, pages 2371–2381, 2021.
  • [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [21] Yang Hu, Guihua Wen, Adriane Chapman, Pei Yang, Mingnan Luo, Yingxue Xu, Dan Dai, and Wendy Hall. Graph-based visual-semantic entanglement network for zero-shot image recognition. IEEE Transactions on Multimedia, 24:2473–2487, 2021.
  • [22] D. Huynh and E. Elhamifar. Fine-grained generalized zero-shot learning via dense attribute-based attention. In CVPR, pages 4482–4492, 2020.
  • [23] Dat Huynh and Ehsan Elhamifar. Compositional zero-shot learning via fine-grained dense feature composition. In NeurIPS, pages 19849–19860, 2020.
  • [24] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In ECCV, pages 709–727. Springer, 2022.
  • [25] Rohit Keshari, R. Singh, and Mayank Vatsa. Generalized zero-shot learning via over-complete distribution. In CVPR, pages 13297–13305, 2020.
  • [26] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. NeurIPS, 35:22199–22213, 2022.
  • [27] Xia Kong, Zuodong Gao, Xiaofan Li, Ming Hong, Jun Liu, Chengjie Wang, Yuan Xie, and Yanyun Qu. En-compactness: Self-distillation embedding & contrastive generation for generalized zero-shot learning. In CVPR, pages 9306–9315, 2022.
  • [28] Christoph H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, pages 951–958, 2009.
  • [29] Y. Li, Junge Zhang, Jianguo Zhang, and Kaiqi Huang. Discriminative learning of latent features for zero-shot recognition. In CVPR, pages 7463–7471, 2018.
  • [30] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36, 2024.
  • [31] Man Liu, Feng Li, Chunjie Zhang, Yunchao Wei, Huihui Bai, and Yao Zhao. Progressive semantic-visual mutual adaption for generalized zero-shot learning. In CVPR, pages 15337–15346, 2023.
  • [32] Man Liu, Chunjie Zhang, Huihui Bai, and Yao Zhao. Part-object progressive refinement network for zero-shot learning. IEEE Transactions on Image Processing, 2024.
  • [33] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023.
  • [34] Yang Liu, Jishun Guo, Deng Cai, and X. He. Attribute attention for semantic disambiguation in zero-shot learning. In ICCV, pages 6697–6706, 2019.
  • [35] Yang Liu, Lei Zhou, Xiao Bai, Yifei Huang, Lin Gu, Jun Zhou, and T. Harada. Goal-oriented gaze estimation for zero-shot learning. In CVPR, pages 3794–3803, 2021.
  • [36] Zhiwu Lu, Jiechao Guan, Aoxue Li, Tao Xiang, An Zhao, and Ji-Rong Wen. Zero and few shot learning with semantic feature synthesis and competitive learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43:2510–2523, 2021.
  • [37] Shaobo Min, Hantao Yao, Hongtao Xie, Chaoqun Wang, Z. Zha, and Yongdong Zhang. Domain-aware visual bias eliminating for generalized zero-shot learning. In CVPR, pages 12661–12670, 2020.
  • [38] Muhammad Ferjad Naeem, Muhammad Gul Zain Ali Khan, Yongqin Xian, Muhammad Zeshan Afzal, Didier Stricker, Luc Van Gool, and Federico Tombari. I2mvformer: Large language model generated multi-view document supervision for zero-shot image classification. In CVPR, pages 15169–15179, 2023.
  • [39] Sanath Narayan, A. Gupta, F. Khan, Cees G. M. Snoek, and L. Shao. Latent embedding feedback and discriminative features for zero-shot classification. In ECCV, pages 479–495, 2020.
  • [40] Mark Palatucci, Dean Pomerleau, Geoffrey E Hinton, and Tom M Mitchell. Zero-shot learning with semantic output codes. In NeurIPS, 2009.
  • [41] G. Patterson and J. Hays. Sun attribute database: Discovering, annotating, and recognizing scene attributes. In CVPR, pages 2751–2758, 2012.
  • [42] Jeffrey Pennington, R. Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
  • [43] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
  • [44] Zhiyin Shao, Xinyu Zhang, Meng Fang, Zhifeng Lin, Jian Wang, and Changxing Ding. Learning granularity-unified representations for text-to-image person re-identification. In ACM MM, pages 5566–5574, 2022.
  • [45] Antonio Torralba and Alexei A. Efros. Unbiased look at dataset bias. In CVPR, 2011.
  • [46] Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • [47] Wenjia Xu, Yongqin Xian, Jiuniu Wang, Bernt Schiele, and Zeynep Akata. Attribute prototype network for any-shot learning. International Journal of Computer Vision, 130:1735–1753, 2022.
  • [48] M. R. Vyas, Hemanth Venkateswara, and S. Panchanathan. Leveraging seen and unseen semantic relationships for generative zero-shot learning. In ECCV, pages 70–86, 2020.
  • [49] Chaoqun Wang, Shaobo Min, Xuejin Chen, Xiaoyan Sun, and Houqiang Li. Dual progressive prototype network for generalized zero-shot learning. In NeurIPS, pages 2936–2948, 2021.
  • [50] Yuzhu Wang, Lechao Cheng, Chaowei Fang, Dingwen Zhang, Manni Duan, and Meng Wang. Revisiting the power of prompt for visual tuning. In ICML, 2024.
  • [51] Zhengbo Wang, Jian Liang, Ran He, Nan Xu, Zilei Wang, and Tieniu Tan. Improving zero-shot generalization for clip with synthesized prompts. In ICCV, pages 3032–3042, 2023.
  • [52] P. Welinder, S. Branson, T. Mita, C. Wah, Florian Schroff, Serge J. Belongie, and P. Perona. Caltech-ucsd birds 200. Technical Report CNS-TR-2010-001, Caltech, 2010.
  • [53] Yongqin Xian, Zeynep Akata, Gaurav Sharma, Q. Nguyen, M. Hein, and B. Schiele. Latent embeddings for zero-shot classification. In CVPR, pages 69–77, 2016.
  • [54] Yongqin Xian, Christoph H. Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:2251–2265, 2019.
  • [55] Yongqin Xian, Saurabh Sharma, B. Schiele, and Zeynep Akata. F-vaegan-d2: A feature generating framework for any-shot learning. In CVPR, pages 10267–10276, 2019.
  • [56] Guo-Sen Xie, L. Liu, Xiaobo Jin, F. Zhu, Zheng Zhang, J. Qin, Yazhou Yao, and L. Shao. Attentive region embedding network for zero-shot learning. In CVPR, pages 9376–9385, 2019.
  • [57] Guo-Sen Xie, L. Liu, Xiaobo Jin, F. Zhu, Zheng Zhang, Yazhou Yao, J. Qin, and L. Shao. Region graph embedding network for zero-shot learning. In ECCV, pages 562–580, 2020.
  • [58] Wenjia Xu, Yongqin Xian, Jiuniu Wang, B. Schiele, and Zeynep Akata. Attribute prototype network for zero-shot learning. In NeurIPS, pages 21969–21980, 2020.
  • [59] Zhongqi Yue, Tan Wang, Hanwang Zhang, Qianru Sun, and Xiansheng Hua. Counterfactual zero-shot and open-set visual recognition. In CVPR, pages 15404–15414, 2021.
  • [60] L. Zhang, Tao Xiang, and S. Gong. Learning a deep embedding model for zero-shot learning. In CVPR, pages 3010–3019, 2017.
  • [61] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, pages 3836–3847, 2023.
  • [62] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, pages 16816–16825, 2022.
  • [63] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.
  • [64] Yizhe Zhu, Jianwen Xie, Z. Tang, Xi Peng, and A. Elgammal. Semantic-guided multi-attention localization for zero-shot learning. In NeurIPS, 2019.