Few-shot Personalization of LLMs
with Mis-aligned Responses

Jaehyung Kim Yiming Yang
Carnegie Mellon University
jaehyun4@andrew.cmu.edu

Abstract

As the diversity of users increases, the ability of large language models (LLMs) to provide personalized responses has become increasingly important. Existing approaches have had only limited success in LLM personalization, due to the absence of personalized learning or the reliance on shared personal data. This paper proposes a new approach for few-shot personalization of LLMs with their mis-aligned responses (Fermi). Our key idea is to learn a set of personalized prompts for each user by progressively improving the prompts using LLMs, based on the user profile (e.g., demographic information) and a few examples of previous opinions. During the iterative process of prompt improvement, we incorporate the contexts of mis-aligned responses by LLMs, which are especially crucial for the effective personalization of LLMs. In addition, we develop an effective inference method to further leverage the context of the test query and the personalized prompts. Our experimental results demonstrate that Fermi significantly improves performance across various benchmarks, compared to the best-performing baselines.¹

¹ The code will be available at https://github.com/bbuing9/Fermi.

1 Introduction

The recent development of large language models (LLMs) has significantly accelerated progress in various NLP tasks, and yielded real-world applications used by millions of users, such as coding assistants and chatbots [14, 30, 31]. As the use of LLMs by diverse users in real-world applications increases, personalization of LLMs, i.e., steering LLMs’ responses towards the unique needs or preferences of individual users, becomes increasingly important [6, 26]. However, recent studies show that LLMs’ responses are often biased toward certain groups and poorly suited to other groups of users, and that such biases cannot be fixed by providing simple instructions [24].

To tackle this problem, methods to steer the responses of LLMs have recently been explored, and they can be roughly divided into two categories. One category is prompt engineering, which heuristically incorporates the user’s information into the input prompts of LLMs [7, 23]. The other category focuses on learning from other users’ data [12, 29, 36]. However, both categories have limitations: prompt engineering for every user would be too costly and non-trivial, while the learning-based category relies on the unrealistic assumption that personal data can be shared without violating privacy considerations.

Figure 1: An overview of Fermi. Fermi iterates three steps to optimize the prompt from the given user information: (1) scoring new prompts, (2) updating the memory with high-scored prompts, and (3) generating new improved prompts (left). After the optimization, Fermi selectively uses the personalized prompts for the inference, via Retrieval-of-Prompt (right).

This paper addresses those limitations by introducing a new approach, namely Few-shot Personalization of LLMs with mis-aligned responses (Fermi). Our high-level idea is to use an LLM to progressively improve its input prompts based on a few examples of previous user opinions and profiles (e.g., demographics) in an iterative process. In addition to the current prompts’ scores measured on the given few-shot user opinions [35], Fermi incorporates the mis-aligned responses (i.e., the LLM’s responses with those prompts that are inconsistent with the given user opinions) as additional context. The contexts of mis-aligned responses include useful learning signals for updating prompts, such as the types of wrong predictions made with the current prompts (see the empirical evidence in Section 4). Specifically, the iterative process of Fermi consists of three steps: (1) scoring the initial or current prompts with the LLM, (2) updating the memory with high-scored prompts in the form of $<$prompt, score, context$>$ triplets, and (3) generating new improved prompts with the LLM based on the updated memory. In addition, we propose Retrieval-of-Prompt, a method to improve the inference on a given test query. Retrieval-of-Prompt selectively uses one of the personalized prompts obtained from the optimization, based on the context of the test query. An overview of Fermi is presented in Figure 1.

We demonstrate the effectiveness of Fermi for few-shot personalization of LLMs through extensive evaluations on various tasks, including question-answering (QA), classification, and regression. For example, we observe that Fermi exhibits 6.8% and 4.1% average accuracy improvements on two multiple-choice QA datasets constructed to evaluate the personalization of LLMs, compared to the previous state-of-the-art heuristic and optimization approaches, respectively. We also find that the personalized prompts produced with one LLM are effective on other LLMs, including both API-based and open-sourced ones, which is crucial for efficient deployment in practice. In addition, our in-depth analyses reveal why Fermi is more effective than other prompting methods and which features of prompts are important for effective personalization of LLMs. We hope our work provides useful insights for research on LLM personalization, which is becoming increasingly important for the future success of LLMs in real-world applications.

2 Related Works

Few-shot personalization of LLMs.

Few-shot personalization of LLMs aims to align an LLM’s responses to a specific user with a limited amount of user information, such as a user profile (e.g., demographic information) or opinions (e.g., the user’s previous responses to questions). To this end, one line of prior work has explored how to input the given user information into the LLM in a heuristic manner, i.e., prompt engineering; for example, Santurkar et al. [24] design three different templates for the input prompt. Salemi et al. [23] leverage a retrieval system [8] to use the given user opinions selectively. Hwang et al. [7] show that using both the user profile and opinions is more effective. On the other hand, another line of prior work has proposed learning from other users’ data; Li et al. [12] select the relevant users using collaborative filtering, then learn a soft prompt [13] from training data augmented with these users’ data. Zhao et al. [36] propose to train an independent transformer module via meta-learning on several users’ data. However, both approaches have their limitations: prompt engineering incurs the cost of designing the prompt, and may fail to fully utilize the user information due to the absence of learning. The learning-based approach requires other users’ data, which is hard to obtain in the real world due to privacy issues. Therefore, we propose to learn only from the target user’s information and find the optimized (i.e., personalized) prompt for that user.

Prompt optimization with LLM.

As prior works on prompt tuning that rely on gradient-based updates [3, 11, 25] become inapplicable to recent API-based LLMs due to their black-box nature, other approaches have recently been explored for gradient-free prompt optimization, such as progressive improvement using heuristic rules or LLMs [18, 35, 38]. For example, Pryzant et al. [19] obtain text feedback on how to update the prompts by instructing an LLM. Also, after generating initial prompts with LLMs, Zhou et al. [38] generate semantically similar variants of the prompts with the highest accuracies. Yang et al. [35] iterate evaluation and generation of prompts with two LLMs to solve black-box optimization problems such as prompt optimization; they incorporate the previously generated prompts with their scores so that the optimization LLM can construct new improved prompts. However, only providing the scores on training examples is insufficient to optimize the prompt for few-shot personalization of LLMs, as the context of mis-aligned responses, such as the types or patterns of recurring wrong predictions, cannot be captured in scores. Therefore, we propose an efficient way to incorporate such context during the optimization, along with an additional method to improve the inference by considering the context of the given test query.

3 Fermi: Few-shot Personalization of LLMs with Mis-aligned Responses

In this section, we present our framework for Few-shot Personalization of LLMs with mis-aligned responses (Fermi). We first present our problem setup in Section 3.1. Then, in Section 3.2, we present our core component, which optimizes the input prompt with the given user information by using an LLM as a black-box optimizer along with the additional contexts from mis-aligned responses. Lastly, we introduce an effective inference scheme applied after optimizing prompts with Fermi, which utilizes the context of a test query (Section 3.3).

3.1 Problem description

We first describe the problem setup of our interest under a question-answering (QA) scenario. Our goal is to steer an LLM for a specific user using that user’s information, and hence make the LLM adaptively answer a given question depending on the user. Formally, let ${q}$ denote the given test question and $\mathcal{M}$ denote the LLM. Next, for user $u$, we assume two types of user information: $U_{\tt pro}$ and $U_{\tt opi}$. $U_{\tt pro}$ indicates the explicit profile of $u$, such as demographic information (e.g., region, sex, and age) or ideology (e.g., political affiliation). $U_{\tt opi}$ indicates $N$ few-shot previous opinions by $u$, which have the form of QA pairs, i.e., $U_{\tt opi}=\{({q}_{i},{a}_{i})\}_{i=1}^{N}$, where ${q}_{i}$ is a previously asked question and ${a}_{i}$ is an opinion (answer) by the user. Then, for a given test question $q$, our goal is to predict the answer $a$ that would be generated by user $u$, through LLM $\mathcal{M}$, by using both $U_{\tt pro}$ and $U_{\tt opi}$. The heuristic design of an input prompt $\text{p}$ to incorporate such user information has been previously explored [7, 24], i.e., the prediction $\widehat{a}$ is obtained by conditioning $\mathcal{M}$ on $\text{p}$, which is constructed using $U_{\tt pro}$ and $U_{\tt opi}$:

$$\widehat{a}(\text{p})=\mathcal{M}(q;\text{p}). \tag{1}$$

However, heuristically designed prompts may fail to fully exploit the given user information. For example, compared to using all opinions in $U_{\tt opi}$, appending fewer user opinions can yield better personalization accuracy for the LLM [7]. Therefore, we tackle this limitation by finding personalized prompts that steer the LLM to the user, through direct learning from the given user information.
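As a rough illustration of the heuristic prompting in Eq. 1, the sketch below assembles a prompt $\text{p}$ from a profile and a few previous opinions. The template wording, the function name `build_heuristic_prompt`, and the `max_opinions` parameter are assumptions made for this example; they are not the templates used in [7, 24].

```python
from typing import Dict, List, Optional, Tuple

def build_heuristic_prompt(
    profile: Dict[str, str],
    opinions: List[Tuple[str, str]],
    max_opinions: Optional[int] = None,
) -> str:
    """Assemble a prompt p from U_pro and (a subset of) U_opi (hypothetical template)."""
    lines = ["Answer the question as the following person would."]
    if profile:
        lines.append("Profile: " + "; ".join(f"{k}: {v}" for k, v in profile.items()))
    # Appending fewer opinions can work better than appending all of them [7].
    shown = opinions if max_opinions is None else opinions[:max_opinions]
    for question, answer in shown:
        lines.append(f"Q: {question}\nA: {answer}")
    return "\n".join(lines)

prompt = build_heuristic_prompt(
    {"region": "South", "age": "30-49"},
    [("Is the economy improving?", "No"), ("Do you trust the news?", "Somewhat")],
    max_opinions=1,
)
print(prompt)
```

The resulting string would then be passed, together with the test question $q$, to the prediction LLM $\mathcal{M}$.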

3.2 Prompt optimization using mis-aligned responses by LLM

To mitigate the difficulties arising from the large scale and black-box nature of recent LLMs, we instead optimize input prompts to learn from user information. This is motivated by recent work [35] that uses two LLMs, $\mathcal{M}$ and $\mathcal{M}_{\tt opt}$, to solve black-box optimization, where $\mathcal{M}_{\tt opt}$ denotes another LLM used for the optimization. Specifically, our key idea is to incorporate the contexts of mis-aligned responses (i.e., QAs in $U_{\tt opi}$ that $\mathcal{M}$ incorrectly predicts with the current prompts) during the optimization, instead of only using the scores of the prompts (e.g., the average accuracy of the predictions by $\mathcal{M}$ on $U_{\tt opi}$). As the contexts of mis-aligned responses include useful learning signals, such as types or patterns of common wrong predictions, they could be effective in learning how to improve the prompts.

We first assume that there is an initial prompt set $\text{P}^{0}=\{\text{p}^{0}\}$ , e.g., heuristically designed prompt [7, 24]. Then, at each iteration $t$ , we conduct the following three steps:

$\circ$ 1. Score Prompts: Evaluate prompt based on its accuracy in predicting user’s previous answers.
$\circ$ 2. Update Memory: Maintain a memory of the best-performing prompts along with their scores and the contexts of their mis-aligned responses.
$\circ$ 3. Generate New Prompts: Generate new improved prompts with $\mathcal{M}_{\tt opt}$ and the updated memory.

$\circ$ Step 1: Score Prompts. We first calculate the score ${s}_{k}$ of each prompt $\text{p}_{k}\in\text{P}^{t}$ , by obtaining the predictions from $\mathcal{M}$ under $\text{p}_{k}$ and evaluating them using the user’s previous answers:

$${s}_{k}=\sum\nolimits_{({q}_{i},{a}_{i})\sim U_{\tt opi}}\text{s}\big({a}_{i},\widehat{a}_{i}(\text{p}_{k})\big)/N,~\text{ where }~\widehat{a}_{i}(\text{p}_{k})=\mathcal{M}({q}_{i};\text{p}_{k}). \tag{2}$$

Here, $\text{s}(\cdot,\cdot)$ is a specific metric to evaluate the prediction (e.g., accuracy). While calculating the score $s_{k}$ of the prompt $\text{p}_{k}$, we also collect the mis-aligned QA pairs $U^{k}_{\tt opi}$, i.e., those for which the prediction of $\mathcal{M}$ under $\text{p}_{k}$ is not aligned with the user’s answer:

$$U^{k}_{\tt opi}=\{({q}_{i},{a}_{i})\,|\,\text{s}\big({a}_{i},\widehat{a}_{i}(\text{p}_{k})\big)<\tau,~({q}_{i},{a}_{i})\in U_{\tt opi}\}, \tag{3}$$

where $\tau$ is a threshold to judge the mis-alignment; for example, we set $\tau=0.5$ when we use the correctness of prediction as the score $\text{s}(\cdot,\cdot)$ .
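Step 1 can be sketched as follows, assuming exact-match correctness as the metric $\text{s}(\cdot,\cdot)$. The `predict` callable is a hypothetical stand-in for $\mathcal{M}(q;\text{p})$ and `always_yes` is a toy model used only to make the example runnable; neither is part of the original method description.

```python
from typing import Callable, List, Tuple

QA = Tuple[str, str]  # (question, user's answer)

def exact_match(gold: str, pred: str) -> float:
    """Correctness metric s(a, a_hat): 1.0 on a match, 0.0 otherwise."""
    return 1.0 if gold.strip().lower() == pred.strip().lower() else 0.0

def score_prompt(
    predict: Callable[[str, str], str],  # hypothetical stand-in for M(q; p)
    prompt: str,
    opinions: List[QA],
    metric: Callable[[str, str], float] = exact_match,
    tau: float = 0.5,
) -> Tuple[float, List[Tuple[int, str, str, str]]]:
    """Return the mean score (Eq. 2) and the mis-aligned QAs (Eq. 3)."""
    total = 0.0
    misaligned = []
    for i, (question, answer) in enumerate(opinions):
        prediction = predict(question, prompt)
        s = metric(answer, prediction)
        total += s
        if s < tau:  # the prediction disagrees with the user's answer
            misaligned.append((i, question, answer, prediction))
    return total / len(opinions), misaligned

def always_yes(question: str, prompt: str) -> str:
    """Toy model that ignores its inputs."""
    return "Yes"

toy_opinions = [("Q1?", "Yes"), ("Q2?", "No"), ("Q3?", "Yes"), ("Q4?", "No")]
score, wrong = score_prompt(always_yes, "Answer as the user.", toy_opinions)
print(score)                   # 0.5
print([w[0] for w in wrong])   # [1, 3]
```

Keeping the model's own wrong answer alongside each mis-aligned pair is what later allows the context $c_{l}$ of Eq. 4 to be built without re-querying $\mathcal{M}$.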

$\circ$ Step 2: Update Memory. Next, we construct an optimization memory $M^{t}$, which is used as the input to $\mathcal{M}_{\tt opt}$ to generate new improved prompts, by providing the information of well-performing prompts through the contexts of their mis-aligned responses. To be specific, the optimization memory $M^{t}=\{(\text{p}_{l},{s}_{l},c_{l})\}_{l=1}^{L}$ is constructed by selecting the top-$L$ prompts among $\text{P}^{t}$ and $M^{t-1}$ (where $M^{0}=\emptyset$), according to their scores (Eq. 2). Here, we present the triplets in $M^{t}$ in ascending order, i.e., ${s}_{l}<{s}_{l'}$ when $l<l'$, and provide a different context $c_{l}$ depending on $l$. Specifically, for $l=1$, we construct $c_{l}$ by concatenating the QAs and the mis-aligned responses by $\mathcal{M}$ under $\text{p}_{l}$ on $U^{l}_{\tt opi}$:

$$c_{l}=\texttt{Concat}\{\big(i,{q}_{i},{a}_{i},\widehat{a}_{i}(\text{p}_{l})\big)\,|\,({q}_{i},{a}_{i})\in U^{l}_{\tt opi}\}. \tag{4}$$

In Figure 2, the texts corresponding to $c_{1}$ are highlighted in blue. For the other cases (i.e., $l\neq 1$), instead of an enumeration like $c_{1}$, we construct the context $c_{l}$ with (i) the indices of the mis-aligned QA pairs common to $\text{p}_{l}$ and $\text{p}_{1}$, and (ii) the number of QAs newly mis-aligned by $\text{p}_{l}$ compared to $\text{p}_{1}$ (see the green texts in Figure 2 for an example). Through the indices presented in $c_{l}$, $\mathcal{M}_{\tt opt}$ can directly access the mis-aligned QA pairs by referring to $c_{1}$, which avoids unnecessary complexity in $c_{l}$ and the cost of a long input to $\mathcal{M}_{\tt opt}$. Additionally, the number of newly mis-aligned QAs offers further insight into whether $\text{p}_{l}$ has improved, which cannot be captured by the common mis-aligned ones.
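The memory construction of Step 2 can be sketched as below. The dictionary layout of each candidate and the exact wording of the context strings are assumptions for illustration; the actual text format given to $\mathcal{M}_{\tt opt}$ is the one shown in Figure 2.

```python
from typing import Dict, List, Tuple

# index in U_opi -> (question, user's answer, model's mis-aligned answer)
Mis = Dict[int, Tuple[str, str, str]]

def build_memory(candidates: List[dict], top_l: int) -> List[dict]:
    """Keep the top-L prompts by score and attach a context c_l to each."""
    kept = sorted(candidates, key=lambda c: c["score"], reverse=True)[:top_l]
    kept.sort(key=lambda c: c["score"])  # present in ascending order of score
    base: Mis = kept[0]["misaligned"]    # mis-aligned QAs of p_1
    memory = []
    for rank, cand in enumerate(kept):
        if rank == 0:
            # c_1: full enumeration of mis-aligned QAs (Eq. 4)
            context = "\n".join(
                f"[{i}] Q: {q} | User: {a} | Model: {a_hat}"
                for i, (q, a, a_hat) in sorted(base.items())
            )
        else:
            # c_l, l != 1: shared indices plus the count of newly mis-aligned QAs
            idx = set(cand["misaligned"])
            common = sorted(idx & set(base))
            n_new = len(idx - set(base))
            context = f"Common mis-aligned indices: {common}; newly mis-aligned: {n_new}"
        memory.append({"prompt": cand["prompt"], "score": cand["score"], "context": context})
    return memory

wrong = ("Q?", "No", "Yes")
candidates = [
    {"prompt": "A", "score": 0.50, "misaligned": {1: wrong, 3: wrong}},
    {"prompt": "B", "score": 0.75, "misaligned": {3: wrong}},
    {"prompt": "C", "score": 0.25, "misaligned": {0: wrong, 1: wrong, 3: wrong}},
]
memory = build_memory(candidates, top_l=2)
for entry in memory:
    print(entry["prompt"], entry["score"], entry["context"], sep=" | ")
```

In this toy run, prompt C is dropped, A receives the full enumeration, and B only refers back to index 3, which keeps the input to the optimizer short.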

$\circ$ Step 3: Generate New Prompts. With the updated memory $M^{t}$ , we generate $K$ new improved prompts $\text{P}^{t+1}=\{\text{p}^{\tt new}_{k}\}_{k=1}^{K}$ by prompting $\mathcal{M}_{\tt opt}$ to generate the new and high-scored prompts:

$$\text{p}^{\tt new}_{k}=\mathcal{M}_{\tt opt}(M^{t};\text{p}_{\tt opt}), \tag{5}$$

where $\text{p}_{\tt opt}$ is a fixed input prompt for $\mathcal{M}_{\tt opt}$ to generate new prompts, and we use a random sampling with temperature to generate diverse new prompts from $\mathcal{M}_{\tt opt}$ . Figure 2 presents the example of the overall input of $\mathcal{M}_{\tt opt}$ to generate new prompts, which is constructed with $M^{t}$ and $\text{p}_{\tt opt}$ .

Then, we go back to Step 1 with $\text{P}^{t+1}$ and repeat these three steps $T$ times. After that, we obtain the optimized (i.e., personalized) prompts $\text{P}^{T}=\{\text{p}^{T}_{k}\}_{k=1}^{K}$ for the user $u$. We remark that we also use the user’s explicit profile $U_{\tt pro}$ to construct the initial prompt set $\text{P}^{0}$ when it is available; thereby, we fully utilize the given user information (see more details in Appendix C.3).
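The overall iteration can be sketched as a simple loop. This is a simplified illustration, not the released implementation: `score_fn` and `generate_fn` are hypothetical stand-ins for Eq. 2 and Eq. 5, and the mis-alignment contexts are omitted from the memory to keep the example short.

```python
from typing import Callable, List, Tuple

Memory = List[Tuple[str, float]]  # (prompt, score); contexts omitted in this sketch

def optimize_prompts(
    init_prompts: List[str],
    score_fn: Callable[[str], float],       # stand-in for Eq. 2
    generate_fn: Callable[[Memory], str],   # stand-in for M_opt in Eq. 5
    K: int = 4,
    L: int = 5,
    T: int = 10,
) -> Tuple[List[str], Memory]:
    memory: Memory = []
    prompts = list(init_prompts)
    for _ in range(T):
        scored = [(p, score_fn(p)) for p in prompts]                # Step 1
        pool = dict(memory + scored)                                # merge and de-duplicate
        memory = sorted(pool.items(), key=lambda ps: ps[1], reverse=True)[:L]  # Step 2
        prompts = [generate_fn(memory) for _ in range(K)]           # Step 3
    return prompts, memory

# Toy stand-ins: longer prompts score higher, and the "optimizer"
# appends one character to the best prompt in memory.
toy_score = lambda p: min(len(p), 5) / 5
toy_generate = lambda mem: max(mem, key=lambda ps: ps[1])[0] + "!"

final_prompts, final_memory = optimize_prompts(["a"], toy_score, toy_generate, K=2, L=2, T=3)
print(final_prompts)
print(final_memory)
```

With a real optimizer LLM, `generate_fn` would be sampled with a non-zero temperature so that the $K$ prompts generated from the same memory differ from each other.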

3.3 Effective inference by Retrieval-of-Prompt

After $T$ iterations of the optimization procedure, Fermi outputs $K$ unique personalized prompts $\text{P}^{T}=\{\text{p}^{T}_{k}\}_{k=1}^{K}$ . Therefore, for a given test question $q$ , one needs to determine which prompt to apply. Selecting the prompt with the highest score, i.e., $k^{*}=\arg\max_{k}s_{k}$ (Eq. 2), would be a straight-forward way. However, our intuition is that better selection is possible if we utilize the context of the test question $q$ as additional information. To this end, we propose to select the input prompt with the highest score on the subset of $U_{\tt opi}$ , which only consists of the previous questions highly relevant to $q$ . Formally, we first measure the relevance $r$ between $q$ and previous question ${q}_{i}$ :

$$R(q,U_{\tt opi})=\{r(q,{q}_{i})\,|\,{q}_{i}\in U_{\tt opi}\}. \tag{6}$$

For the relevance $r$, we use the cosine similarity between the embeddings of questions, extracted by the sentence encoder [21]. Then, we select the top-$\tilde{N}$ questions according to the calculated relevance and construct the subset $U_{\tt opi}^{q}$ with those questions. Lastly, we choose the input prompt $\text{p}^{*}=\text{p}^{T}_{k^{*}}$ based on the score on $U_{\tt opi}^{q}$, which was already calculated, and use the prediction $\widehat{a}(\text{p}^{*})$ by $\mathcal{M}$:

$$k^{*}=\arg\max_{k}{s}^{T}_{k}(U_{\tt opi}^{q}), \tag{7}$$

where ${s}^{T}_{k}(U_{\tt opi}^{q})=\sum\nolimits_{({q}_{i},{a}_{i})\sim U_{\tt opi}^{q}}\text{s}\big({a}_{i},\widehat{a}_{i}(\text{p}^{T}_{k})\big)/\tilde{N}$. Figure 1 illustrates the overview of Fermi and Algorithm 1 summarizes the overall procedure of Fermi. We note that the full versions of the prompts and examples of personalized prompts are presented in Appendices C and E, respectively.
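Retrieval-of-Prompt (Eqs. 6 and 7) can be sketched as follows, assuming the question embeddings have already been computed by some sentence encoder. The two-dimensional vectors, the function names, and the `per_prompt_scores` layout are toy assumptions used only to make the example self-contained.

```python
import math
from typing import List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity r(q, q_i) between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieval_of_prompt(
    query_emb: Sequence[float],
    question_embs: List[Sequence[float]],
    per_prompt_scores: List[List[float]],  # [k][i] = s(a_i, a_hat_i(p_k))
    n_tilde: int = 3,
) -> int:
    """Return k*, the prompt with the best score on the questions most relevant to q."""
    sims = [cosine(query_emb, e) for e in question_embs]                       # Eq. 6
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:n_tilde]
    subset_scores = [sum(row[i] for i in top) / len(top) for row in per_prompt_scores]
    return max(range(len(subset_scores)), key=lambda k: subset_scores[k])      # Eq. 7

# Toy data: questions 0 and 1 are close to the query, 2 and 3 are not.
question_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
query_emb = [1.0, 0.05]
per_prompt_scores = [
    [0.0, 1.0, 1.0, 1.0],  # prompt 0: best overall (0.75) but weaker near the query
    [1.0, 1.0, 0.0, 0.0],  # prompt 1: worse overall (0.50) but perfect near the query
]
k_star = retrieval_of_prompt(query_emb, question_embs, per_prompt_scores, n_tilde=2)
print(k_star)  # 1
```

The toy case shows the intended behavior: choosing by overall training score would pick prompt 0, whereas restricting the score to the questions most relevant to the query picks prompt 1.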

Algorithm 1 Fermi algorithm

Input: LLM for prediction $\mathcal{M}$, LLM for optimization $\mathcal{M}_{\tt opt}$, target test question $q$, explicit user profile $U_{\tt pro}$, few-shot previous user opinions $U_{\tt opi}=\{({q}_{i},{a}_{i})\}_{i=1}^{N}$, number of iterations $T$

$\text{P}^{0}=\{\text{p}^{0}\}\leftarrow\texttt{InitPrompt}(U_{\tt pro})$ /*Get initial prompt*/
for $t=0$ to $T-1$ do
    $S^{t}=\{{s}_{k}\}_{k=1}^{K}\leftarrow$ Eq. 2 with $\mathcal{M}$, $\text{P}^{t}$, $U_{\tt opi}$ /*Score prompts*/
    $M^{t}=\{(\text{p}_{l},{s}_{l},c_{l})\}_{l=1}^{L}\leftarrow\text{Top-}L(M^{t-1}\cup\text{P}^{t})$ with $S^{t}$, $U_{\tt opi}$ (Eq. 4) /*Update memory*/
    $\text{P}^{t+1}=\{\text{p}^{\tt new}_{k}\}_{k=1}^{K}\leftarrow$ Eq. 5 with $\mathcal{M}_{\tt opt}$, $M^{t}$ /*Generate new prompts*/
end for
$k^{*}\leftarrow\arg\max_{k}{s}^{T}_{k}(U_{\tt opi}^{q})$, Eq. 6 with $\text{P}^{T},q,U_{\tt opi}$ /*Retrieval-of-Prompt*/
return $\widehat{a}(\text{p}^{*})=\mathcal{M}(q;\text{p}^{T}_{k^{*}})$

4 Experiments

In this section, we design our experiments to investigate the following questions:

$\circ$ How does Fermi perform compared to other personalization methods? (Tables 1 and 2)
$\circ$ Is the prompt optimized with Fermi on one LLM transferable to different LLMs? (Table 3)
$\circ$ What is the effect of each component in Fermi? (Table 4)
$\circ$ Why is the prompt optimized by Fermi more effective than other prompts? (Table 5)

4.1 Setups

First, we describe our experimental setups. More details are presented in Appendix C.

Datasets. For the experiments, we first use two multiple-choice QA datasets proposed to measure the steerability of LLMs for specific users (or social groups): OpinionQA [24] and GlobalOpinionQA [4]. For OpinionQA, we use a subsampled split released by Hwang et al. [7], which consists of 10.5k training and 15.8k test QA pairs across 525 users and 15 topics. For GlobalOpinionQA, since the dataset originally includes the answer distribution over multiple respondents in the same country, we convert it to have a single answer by selecting the choice with the highest probability. This results in 920 training and 1,317 test QA pairs across 46 countries. We consider each country as a specific user. Next, we use two additional datasets, LaMP_tag and LaMP_rate, from a recent benchmark proposed for personalization of LLMs [23]. LaMP_tag is a 15-way classification dataset where an input is a movie description and a label is a movie tag, and LaMP_rate is a regression dataset where an input is a user review and a label is an integer rating (1-5). We construct both datasets by subsampling from their original validation split, which results in 1,000 training and 1,500 test QA pairs across 50 users for each dataset. On average across the four datasets, each user is given 20 training QAs as previous opinions together with a specific profile, and 30 test QAs are used for evaluation. For LaMP_rate, we report mean absolute error (MAE), a commonly used metric for regression. For the others, we report average test accuracy (Acc).
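The single-answer conversion for GlobalOpinionQA and the two reported metrics can be sketched as below. The function names and the toy inputs are assumptions for illustration, and ties in the distribution are resolved here by taking the first maximal choice, which the text above does not specify.

```python
from typing import Sequence

def majority_answer(options: Sequence[str], probs: Sequence[float]) -> str:
    """Pick the choice with the highest probability in the answer distribution."""
    best = max(range(len(probs)), key=lambda i: probs[i])  # first index on ties
    return options[best]

def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Average accuracy in percent."""
    return 100.0 * sum(g == p for g, p in zip(gold, pred)) / len(gold)

def mean_absolute_error(gold: Sequence[float], pred: Sequence[float]) -> float:
    """MAE between integer ratings (lower is better)."""
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)

label = majority_answer(["Very good", "Somewhat good", "Bad"], [0.2, 0.5, 0.3])
acc = accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"])
mae = mean_absolute_error([5, 3, 1], [4, 3, 3])
print(label)  # Somewhat good
print(acc)    # 75.0
print(mae)    # 1.0
```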

Baselines. We compare Fermi against the following baselines: (1) Uniform: expected performance when the prediction is made uniformly at random. (2) Vanilla: answering the question with LLMs without any user information. (3) Profile: constructing the prompt using all available user profiles [24, 7], such as demographics or nationality. (4) Few-shot: retrieving relevant previous questions and opinions, then appending them to the prompt [7, 23]. Following [23], we consider BM25 [22] and Contriever [8] for the retriever models. The number of retrieved profiles is chosen among {3, 8, all} based on validation performance. (5) All Info: using both explicit profiles and retrieved previous QAs to construct the prompt [7]. We use the retriever with the best performance in Few-shot.² (6) Optimization by PROmpting (OPRO; Yang et al. [35]): optimizing the input prompt from both user profiles and previous opinions using LLMs. Here, all of the previous opinions are utilized during the optimization. In the experiments, the prompt with the best training score is selected for the test.

² In the case of OpinionQA, we additionally consider the retrieved indices originally included by [7].

Table 1: Main result on multiple-choice QA datasets. Test accuracy (%) of ChatGPT over the different methods on OpinionQA (OpQA) and GlobalOpinionQA (GOQA). The best and second best scores are highlighted in bold and underline, respectively.

Datasets	Uniform	Vanilla	Profile	Few-shot_bm25	Few-shot_cont	All Info	OPRO	Fermi
OpQA	34.2	45.5	48.1	49.8	49.3	48.6	50.2	54.6
GOQA	31.4	62.8	66.1	59.1	61.2	62.3	71.1	74.8

Table 2: Main result on LaMP Benchmark. Test performance of ChatGPT over the different methods on LaMP_tag and LaMP_rate. Test accuracy (Acc ($\uparrow$)) and mean absolute error (MAE ($\downarrow$)) are used, respectively. The best and second best scores are highlighted in bold and underline, respectively.

Datasets (Metric)	Uniform	Vanilla	Few-shot_bm25	Few-shot_cont	OPRO	Fermi
LaMP_tag (Acc)	6.7	36.1	35.9	36.2	34.3	37.8
LaMP_rate (MAE)	1.65	0.62	0.40	0.36	0.57	0.34

Implementation details. We use three recent state-of-the-art LLMs as the prediction LLM $\mathcal{M}$ in the experiments: ChatGPT (gpt-3.5-turbo-0613) [14], GPT-4 (gpt-4-turbo-1106) [15], and LLaMA2-chat-70B [31]. For $\mathcal{M}$, we use a temperature of $0.0$ when calling the API, or greedy decoding for LLaMA, to remove the effect of random sampling. For the optimization LLM $\mathcal{M}_{\tt opt}$, we always use GPT-4 with a temperature of 1.0, as the prompt optimization based on the memory (Eq. 5) requires complex reasoning capability (see Appendix B). For OPRO and Fermi, we use fixed values of $K=4$, $L=5$, and $T=10$. Also, of the previous user opinions in $U_{\tt opi}$, 80% are used for optimization and 20% are used as few-shot demonstrations in $\text{p}_{\tt opt}$. To obtain sentence embeddings for Retrieval-of-Prompt, we use the sentence encoder with MPNet [27], which shows the best performance.³ Also, we use a fixed $\tilde{N}=3$ for Retrieval-of-Prompt.

³ Following the results in https://www.sbert.net.
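The hyperparameters above and the 80%/20% split of $U_{\tt opi}$ can be written down as a small configuration sketch. The dictionary keys, the function `split_opinions`, and the use of a seeded random shuffle are assumptions for illustration; how the split is actually drawn is not stated here.

```python
import random
from typing import List, Tuple

# Values taken from the implementation details; key names are hypothetical.
config = {
    "K": 4,                  # new prompts generated per iteration
    "L": 5,                  # size of the optimization memory
    "T": 10,                 # number of iterations
    "n_tilde": 3,            # questions used by Retrieval-of-Prompt
    "temperature_pred": 0.0, # prediction LLM M
    "temperature_opt": 1.0,  # optimization LLM M_opt
}

def split_opinions(
    opinions: List[Tuple[str, str]], opt_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split U_opi into an optimization set and a demonstration set (assumed random)."""
    items = list(opinions)
    random.Random(seed).shuffle(items)
    cut = int(round(opt_fraction * len(items)))
    return items[:cut], items[cut:]

opinions = [(f"q{i}", f"a{i}") for i in range(20)]
opt_set, demo_set = split_opinions(opinions)
print(len(opt_set), len(demo_set))  # 16 4
```

With 20 previous opinions per user, this leaves 16 QAs for scoring prompts and 4 QAs as demonstrations inside $\text{p}_{\tt opt}$.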

Table 3: Transferability of Fermi. Test accuracy (%) of two LLMs (LLaMA2-chat-70B and GPT-4) on GlobalOpinionQA. For Few-shot, we use Contriever which shows higher accuracy in Table 1. For OPRO^∗ and Fermi^∗, prompts optimized on ChatGPT are directly used. The best and second best scores are highlighted in bold and underline, respectively.

Models	Vanilla	Profile	Few-shot	All Info	OPRO^∗	Fermi^∗
LLaMA-2	62.4	65.5	60.5	65.1	64.5	68.9
GPT-4	56.7	77.7	68.9	78.2	76.7	84.8

4.2 Main results

Table 1 summarizes the experimental results on the two multiple-choice QA datasets, under ChatGPT. First, we observe that augmenting the input prompt with user information is effective in improving the accuracy of LLMs, but the effectiveness varies. For example, retrieving relevant user opinions is more effective than using the user profile for OpinionQA (49.8% vs. 48.1%), but the opposite holds for GlobalOpinionQA (61.2% vs. 66.1%). This is due to the difference between the datasets, as each user is asked multiple questions on the same topic in OpinionQA while GlobalOpinionQA covers broader topics; this result also reveals the necessity of a learning-based prompt optimization approach. From the results of OPRO and Fermi, one can observe that the optimization-based approach is indeed effective, and that the proposed method significantly improves it. To be specific, Fermi exhibits a 6.75% average accuracy improvement compared to the previous prompting methods. Furthermore, compared to the existing optimization method, Fermi exhibits a 4.05% accuracy improvement on average. In Figure 3, we additionally present detailed results on OpinionQA, namely the topic-wise accuracy of four representative baselines selected based on average accuracy. Here, Fermi consistently shows better performance than the other baselines across all topics, which further demonstrates the effectiveness of Fermi for the personalization of LLMs.
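The two averages quoted above can be reproduced from Table 1 with a few lines of arithmetic. The sketch assumes that "previous prompting method" refers to the best of the Vanilla, Profile, Few-shot, and All Info columns on each dataset, which is consistent with the reported numbers.

```python
# Test accuracies (%) copied from Table 1.
table1 = {
    "OpQA": {"Vanilla": 45.5, "Profile": 48.1, "Few-shot_bm25": 49.8,
             "Few-shot_cont": 49.3, "All Info": 48.6, "OPRO": 50.2, "Fermi": 54.6},
    "GOQA": {"Vanilla": 62.8, "Profile": 66.1, "Few-shot_bm25": 59.1,
             "Few-shot_cont": 61.2, "All Info": 62.3, "OPRO": 71.1, "Fermi": 74.8},
}
PROMPTING = ["Vanilla", "Profile", "Few-shot_bm25", "Few-shot_cont", "All Info"]

def average_gain(baselines):
    """Mean over datasets of Fermi minus the best listed baseline (absolute points)."""
    gains = []
    for row in table1.values():
        best = max(row[m] for m in baselines)
        gains.append(row["Fermi"] - best)
    return sum(gains) / len(gains)

gain_prompting = round(average_gain(PROMPTING), 2)  # (4.8 + 8.7) / 2
gain_opro = round(average_gain(["OPRO"]), 2)        # (4.4 + 3.7) / 2
print(gain_prompting)  # 6.75
print(gain_opro)       # 4.05
```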

Next, Table 2 summarizes the experimental results on LaMP_tag (classification) and LaMP_rate (regression), under ChatGPT. We note that these datasets do not include explicit user profiles; hence, we exclude both Profile and All Info from the baselines. Here, it is noteworthy that the effectiveness of OPRO is significantly degraded as the given task becomes more challenging to solve (e.g., the average number of answer choices: 3.96 for GlobalOpinionQA vs. 15 for LaMP_tag). Nevertheless, Fermi is consistently effective and outperforms the other baselines; for example, Fermi exhibits 4.42% and 5.56% relative improvements on LaMP_tag and LaMP_rate, respectively.
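The relative improvements above follow from Table 2 when Fermi is compared with the best non-Fermi method on each dataset, taking into account that accuracy is higher-is-better while MAE is lower-is-better. The helper names below are hypothetical.

```python
# Values copied from Table 2 (Uniform omitted).
lamp_tag_acc = {"Vanilla": 36.1, "Few-shot_bm25": 35.9, "Few-shot_cont": 36.2, "OPRO": 34.3}
lamp_rate_mae = {"Vanilla": 0.62, "Few-shot_bm25": 0.40, "Few-shot_cont": 0.36, "OPRO": 0.57}
fermi_acc, fermi_mae = 37.8, 0.34

def relative_gain_higher_is_better(ours: float, baselines: dict) -> float:
    best = max(baselines.values())
    return 100.0 * (ours - best) / best

def relative_gain_lower_is_better(ours: float, baselines: dict) -> float:
    best = min(baselines.values())
    return 100.0 * (best - ours) / best

tag_gain = round(relative_gain_higher_is_better(fermi_acc, lamp_tag_acc), 2)
rate_gain = round(relative_gain_lower_is_better(fermi_mae, lamp_rate_mae), 2)
print(tag_gain)   # 4.42
print(rate_gain)  # 5.56
```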

4.3 Analyses with Fermi

In this section, we provide additional analyses of Fermi with experiments on GlobalOpinionQA. We note that more analyses are presented in Appendix B.

Transferability of the optimized prompt.

Here, we provide additional experiments to verify the transferability of the prompts learned with our method. To be specific, we first save the prompts optimized with ChatGPT as the LLM for evaluation (Eq. 1), which are used in Table 1. Then, we directly apply these prompts to two different LLMs (LLaMA2-chat-70B and GPT-4) without additional optimization, in the same way as heuristically designed prompts are applied. From Table 3, one can observe that the transferred prompts from Fermi significantly outperform the baseline prompting methods on both LLMs; for example, they exhibit 3.4% and 7.1% accuracy improvements compared to the best-performing baselines for each LLM, respectively. We remark that the prompts from OPRO are even less effective than the existing baselines, which further shows the advantage of Fermi in learning well-generalized personalized prompts. Also, the effectiveness on LLaMA-2 demonstrates that our method is applicable to open-sourced LLMs as well, not only to black-box API LLMs.

Table 4: Ablation study of Fermi. Test accuracy of ChatGPT on GlobalOpinionQA with different configurations of the proposed components in Fermi.

Methods	Add_Mis	Add_Num	RoP	Acc
OPRO	✗	✗	✗	71.1
	✓	✗	✗	73.7
	✓	✓	✗	74.2
Fermi	✓	✓	✓	74.8

Table 5: In-depth analyses about prompts for personalization. Training and test accuracies of ChatGPT on GlobalOpinionQA. Training accuracy is measured by given user opinions $U_{\tt opi}$.

Methods	Vanilla	Profile	Few-shot_top3	Few-shot_all	Few-shot_bott3	Few-shot_format	Fermi_irrel	Fermi
Acc_train	62.5	67.9	-	95.2	-	70.2	80.2	81.4
Acc_test	62.8	66.1	61.2	56.3	45.8	66.4	73.8	74.8

Ablation study.

To validate the effectiveness of the proposed components of Fermi in Section 3, we perform ablation experiments by decomposing our framework into three components: (1) including the QAs that have mis-aligned responses, with the initial enumeration and subsequent reference via common indices (Add_Mis), (2) noting the number of QAs with new mis-aligned responses (Add_Num), and (3) Retrieval-of-Prompt for a test query (RoP). As shown in Table 4, all components progressively improve the few-shot personalization of LLMs. In particular, efficiently providing the context of mis-aligned QAs during the optimization is the most crucial for the improvement. Next, providing the number of new mis-aligned QAs yields an additional improvement, as it provides information about the effectiveness of the given prompt that is not captured by the commonly mis-aligned QAs. Lastly, for a test query, retrieving the most relevant prompt is more effective than selecting the prompt with the highest training score, as it successfully utilizes the context of the test query.

Features of good input prompts for personalization.

In Table 5, we further conduct experiments to answer the following question: what features make good personalized prompts for LLMs? First, we claim that the relevance of the prompt to the test query is crucial; for example, Few-shot_top3, Few-shot_all, and Few-shot_bott3 are different prompting methods that retrieve the 3 most relevant, all 20, and the 3 most irrelevant previous opinions, respectively. Here, test accuracy largely degrades as the portion of irrelevant opinions increases. Similarly, when we retrieve the most irrelevant prompt (Fermi_irrel), i.e., take $\arg\min$ in Eq. 7, the accuracy of Fermi also decreases.

Second, providing the user information with the proper format for LLMs is important. As shown in Figure 5, the optimized prompt by Fermi is a detailed instruction consisting of multiple sentences that condense the lessons from the user opinions and LLM’s mis-aligned responses. In contrast, the previous prompt used to incorporate previous opinions is based on the specific form, which is harder to follow by LLMs. To verify the importance of the format, we convert the enumeration of all QAs (by Few-shot_all) into the instruction of multiple sentences (denoted by Few-shot_format), by prompting GPT-4 using the optimized prompts by Fermi as reference. Interestingly, this format conversion shows significant improvement (56.3% $\rightarrow$ 66.4%) while it is still underperforming Fermi.

Lastly, effectively distilling the given user information is important. As shown in Table 5, a prompting method with higher accuracy on the previous user opinions $U_{\tt opi}$ (i.e., training accuracy) also has higher test accuracy for that user, except for Few-shot_all, which can directly access $U_{\tt opi}$. In this respect, Fermi shows a clear advantage over the previous prompt optimization method; as shown in Figure 4, Fermi optimizes the prompt more effectively and achieves higher training accuracy than OPRO. These results indicate that finding a proper way to condense and incorporate the user information when designing input prompts is crucial, and Fermi achieves this by using the context of mis-aligned responses.

Overall, designing personalized prompts satisfying these three properties (relevancy to test query, proper format, and effective distillation of user information) is challenging, but Fermi effectively accomplishes this goal.

5 Conclusion

In this paper, we propose Fermi, a simple yet effective framework for improving the few-shot personalization of LLMs. Our key idea is to optimize the input prompt by learning from the user information; we propose an efficient way to incorporate the contexts of mis-aligned responses by LLMs during the optimization, and a retrieval approach to select the optimized prompt relevant to the test query. The effectiveness of Fermi is demonstrated by results on various personalization tasks and LLMs. We believe that our framework could improve the experience of personal usage of LLMs, which will become increasingly common and important in the future. More discussion of the limitations and the broader impact of this work is presented in Appendix A.

References

Banerjee et al. [2023] D. Banerjee, P. Singh, A. Avadhanam, and S. Srivastava. Benchmarking llm powered chatbots: methods and metrics. arXiv preprint arXiv:2308.04624, 2023.
Chao et al. [2023] P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong. Jailbreaking black box large language models in twenty queries. In R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models, 2023.
Deng et al. [2022] M. Deng, J. Wang, C.-P. Hsieh, Y. Wang, H. Guo, T. Shu, M. Song, E. Xing, and Z. Hu. Rlprompt: Optimizing discrete text prompts with reinforcement learning. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
Durmus et al. [2023] E. Durmus, K. Nguyen, T. I. Liao, N. Schiefer, A. Askell, A. Bakhtin, C. Chen, Z. Hatfield-Dodds, D. Hernandez, N. Joseph, et al. Towards measuring the representation of subjective global opinions in language models. arXiv preprint arXiv:2306.16388, 2023.
Gao et al. [2023] T. Gao, H. Yen, J. Yu, and D. Chen. Enabling large language models to generate text with citations. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
Glaese et al. [2022] A. Glaese, N. McAleese, M. Trębacz, J. Aslanides, V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chadwick, P. Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022.
Hwang et al. [2023] E. Hwang, B. P. Majumder, and N. Tandon. Aligning language models to user opinions. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
Izacard et al. [2022] G. Izacard, M. Caron, L. Hosseini, S. Riedel, P. Bojanowski, A. Joulin, and E. Grave. Unsupervised dense information retrieval with contrastive learning. In Transactions on Machine Learning Research (TMLR), 2022.
Kamalloo et al. [2023] E. Kamalloo, N. Dziri, C. Clarke, and D. Rafiei. Evaluating open-domain question answering in the era of large language models. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
Kandpal et al. [2023] N. Kandpal, H. Deng, A. Roberts, E. Wallace, and C. Raffel. Large language models struggle to learn long-tail knowledge. In Proceedings of the International Conference on Machine Learning (ICML), 2023.
Lester et al. [2021] B. Lester, R. Al-Rfou, and N. Constant. The power of scale for parameter-efficient prompt tuning. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
Li et al. [2023] J. Li, N. Mehrabi, C. Peris, P. Goyal, K.-W. Chang, A. Galstyan, R. Zemel, and R. Gupta. On the steerability of large language models toward data-driven personas. arXiv preprint arXiv:2311.04978, 2023.
Li and Liang [2021] X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Annual Meeting of the Association for Computational Linguistics (ACL), 2021.
OpenAI [2022] OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt, 2022.
OpenAI [2023] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[16] PewResearch. Writing survey questions. https://www.pewresearch.org/our-methods/u-s-surveys/writing-survey-questions.
Pillutla et al. [2021] K. Pillutla, S. Swayamdipta, R. Zellers, J. Thickstun, S. Welleck, Y. Choi, and Z. Harchaoui. Mauve: Measuring the gap between neural text and human text using divergence frontiers. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
Prasad et al. [2023] A. Prasad, P. Hase, X. Zhou, and M. Bansal. Grips: Gradient-free, edit-based instruction search for prompting large language models. In Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023.
Pryzant et al. [2023] R. Pryzant, D. Iter, J. Li, Y. T. Lee, C. Zhu, and M. Zeng. Automatic prompt optimization with "gradient descent" and beam search. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
Razdaibiedina et al. [2023] A. Razdaibiedina, Y. Mao, R. Hou, M. Khabsa, M. Lewis, and A. Almahairi. Progressive prompts: Continual learning for language models. In International Conference on Learning Representations (ICLR), 2023.
Reimers and Gurevych [2019] N. Reimers and I. Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019.
Robertson et al. [2009] S. Robertson, H. Zaragoza, et al. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009.
Salemi et al. [2023] A. Salemi, S. Mysore, M. Bendersky, and H. Zamani. Lamp: When large language models meet personalization. arXiv preprint arXiv:2304.11406, 2023.
Santurkar et al. [2023] S. Santurkar, E. Durmus, F. Ladhak, C. Lee, P. Liang, and T. Hashimoto. Whose opinions do language models reflect? In Proceedings of the International Conference on Machine Learning (ICML), 2023.
Shin et al. [2020] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
Solaiman and Dennison [2021] I. Solaiman and C. Dennison. Process for adapting language models to society (palms) with values-targeted datasets. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
Song et al. [2020] K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu. Mpnet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems (NeurIPS), 2020.
Stelmakh et al. [2022] I. Stelmakh, Y. Luan, B. Dhingra, and M.-W. Chang. Asqa: Factoid questions meet long-form answers. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
Sun et al. [2024] C. Sun, K. Yang, R. G. Reddy, Y. R. Fung, H. P. Chan, C. Zhai, and H. Ji. Persona-db: Efficient large language model personalization for response prediction with collaborative data refinement. arXiv preprint arXiv:2402.11060, 2024.
Team et al. [2023] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
Touvron et al. [2023] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Wang et al. [2023] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
Wang et al. [2022] Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister. Learning to prompt for continual learning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
Xie et al. [2023] Y. Xie, J. Yi, J. Shao, J. Curl, L. Lyu, Q. Chen, X. Xie, and F. Wu. Defending chatgpt against jailbreak attack via self-reminders. Nature Machine Intelligence, pages 1–11, 2023.
Yang et al. [2024] C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen. Large language models as optimizers. In International Conference on Learning Representations (ICLR), 2024.
Zhao et al. [2024] S. Zhao, J. Dang, and A. Grover. Group preference optimization: Few-shot alignment of large language models. In International Conference on Learning Representations (ICLR), 2024.
Zheng et al. [2023] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
Zhou et al. [2023] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba. Large language models are human-level prompt engineers. In International Conference on Learning Representations (ICLR), 2023.
Zhu et al. [2022] Q. Zhu, B. Li, F. Mi, X. Zhu, and M. Huang. Continual prompt tuning for dialog state tracking. In Annual Meeting of the Association for Computational Linguistics (ACL), 2022.

Appendix A Limitations and Broader Impact

A.1 Limitations and future work

Although we have conducted comprehensive experiments on various NLP tasks with multiple LLMs, results and analyses on more datasets, tasks, and LLMs would likely support a more decisive conclusion. For example, the benchmarks tested in Section 4 are discriminative tasks, i.e., the correctness of the LLM’s responses can be directly evaluated using the ground-truth response from the user, and hence it is easy to find the mis-aligned responses. In contrast, evaluating the correctness of an LLM’s response (i.e., finding a proper metric) is challenging for generation tasks, and is still under active discussion [1, 9]. Nevertheless, we believe that our framework is still applicable to generation tasks if a proper metric is given. For instance, ROUGE-L [28, 32] and MAUVE [5, 17] are popular metrics for measuring the quality of machine-generated responses against ground-truth human-generated responses. As these metrics range between 0 and 1, one can set a specific threshold (i.e., $\tau\in[0,1]$ ) to determine the mis-aligned responses under these metrics (see Step 1 in Section 3.2). In addition, LLM-as-judge [9, 37] is another emerging way to evaluate the correctness of generation; in this case, it is more straightforward to apply our framework, as it provides binary outputs, the same as discriminative tasks. However, finding a proper metric for each generation task is itself still a difficult problem, and hence we expect that this direction could be explored in the future.
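The thresholding idea for generation tasks can be sketched as follows. This is a simplified, assumption-laden illustration: it uses whitespace tokenization and an LCS-based F1 as a ROUGE-L style score, and the threshold value `tau=0.5` and the function names are hypothetical. It is not the evaluation code used in the paper, and official ROUGE implementations differ in tokenization and weighting.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L style F1 in [0, 1] (assumes whitespace tokenization)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def is_misaligned(candidate, reference, tau=0.5):
    """Flag a generated response as mis-aligned when its score falls below tau."""
    return rouge_l_f1(candidate, reference) < tau

score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
flag = is_misaligned("completely different words", "the cat sat on the mat")
```

With such a predicate, responses below the threshold can be treated like wrong answers in the discriminative setting; the choice of $\tau$ would have to be tuned per task.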

In addition, while we show that the proposed framework can find personalized prompts by learning from the given user information, we also observe that its success depends heavily on the capability of the LLM used for the optimization (i.e., generating a new prompt from the memory in Eq. 5), as shown in Figure 6. Since our approach requires a small number of optimization iterations to provide high-quality personalized prompts, a certain amount of cost is unavoidable. However, as we demonstrated in the experiments, the personalized prompts from our method transfer well to other LLMs that are not used during optimization (Table 3), can be continuously updated with enlarged data through user interactions (Table 8), and can also be reused to convert previous prompts into a proper format for LLMs (Table 5). Therefore, we believe that, once the cost of the initial optimization has been paid, our approach could be an even more efficient way to personalize than the heuristic design of prompts.

A.2 Broader impact and ethical implications

We strongly believe that Fermi can have a strong positive impact on real-world applications that require personalized responses for a given user, e.g., search engines or chatbots. We expect that our framework would be especially beneficial for users belonging to under-represented social groups, since LLMs are known to follow the knowledge or opinions of the majority population within the pre-training data [10, 24]. On the other hand, there also exist some potential negative impacts. Since our framework needs to provide personal information to LLMs (mostly through an API), it carries a potential privacy risk when the LLM provider does not follow the safeguards and collects the given information. In addition, as our framework does not separately filter the resulting prompts, they can include prompts that have socially negative impacts, e.g., jailbreaks of LLMs [2]. We believe that incorporating an additional filtering step could be a solution to this problem [34].

Appendix B More Analyses with Fermi

In this section, we provide more analyses of Fermi in addition to the analyses in Section 4.3.

Importance of using strong LLM for optimization $\mathcal{M}_{\tt opt}$ .

As described in Section 4.1, we use GPT-4 as the LLM $\mathcal{M}_{\tt opt}$ that generates new prompts from the optimization memory (Eq. 5) in all the experiments in Section 4. To validate this design choice, we conduct experiments that replace GPT-4 with ChatGPT as $\mathcal{M}_{\tt opt}$ in both OPRO and Fermi. Figure 6 shows the optimization trajectory in terms of training accuracy (i.e., average accuracy of the predictions by $\mathcal{M}$ on previous user opinions). Here, one can observe that both OPRO and Fermi struggle to optimize the prompt when ChatGPT is used as $\mathcal{M}_{\tt opt}$ , consistent with a previous observation [35]; this suggests that generating improved prompts from the optimization memory of previous prompts, scores, and contexts requires complex reasoning capability. Therefore, using a strong LLM such as GPT-4 is necessary.

Table 6: Fermi only using GPT-4. Test accuracy of GPT-4 over the different prompting methods on GlobalOpinionQA. For Few-shot, we use Contriever which shows higher accuracy in Table 1. For OPRO^∗ and Fermi^∗, prompts optimized on ChatGPT are directly used. The best and second best scores are highlighted in bold and underline, respectively.

| Models \ Methods | Vanilla | Profile | Few-shot | All Info | OPRO^∗ | Fermi^∗ | Fermi |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 56.7 | 77.7 | 68.9 | 78.2 | 76.7 | 84.8 | 86.7 |

Table 7: Different initial prompts. Test accuracy of ChatGPT over the different prompting methods on GlobalOpinionQA.

| Models \ Methods | Vanilla | Profile | Fermi_vanilla | Fermi |
| --- | --- | --- | --- | --- |
| ChatGPT | 62.8 | 66.1 | 69.9 | 74.8 |

Optimization with stronger LLM for evaluation $\mathcal{M}$ .

Next, to explore the compatibility of Fermi with different configurations of the two LLMs used during the optimization, we conduct additional experiments that switch the evaluating LLM $\mathcal{M}$ from ChatGPT to GPT-4; namely, both $\mathcal{M}$ (for evaluating) and $\mathcal{M}_{\tt opt}$ (for generating) are GPT-4. The results on GlobalOpinionQA are presented in Table 6. One can observe that a stronger evaluating LLM $\mathcal{M}$ (Eq. 2) yields personalized prompts with further improved test accuracy. For example, compared to the personalized prompts optimized with ChatGPT as $\mathcal{M}$ (Fermi^∗), the optimization using only GPT-4 exhibits an additional 1.9% improvement in test accuracy (84.8 $\rightarrow$ 86.7). This result shows that the proposed Fermi is compatible with evaluating LLMs of different types and capacities.

Importance of initial prompts in Fermi.

In our experiments, we used a fixed initial prompt template across all datasets that maximally incorporates the given user profiles, as described in Section 3.2 and Appendix C.3, since this design has proven effective in prior studies [7, 24, 36].

Nevertheless, to provide further insight into the impact of initial prompt templates on Fermi, we conduct additional experiments by varying the initial prompt set $P^{0}=\{\text{p}_{0}\}$ . To be specific, on the GlobalOpinionQA dataset, we exclude the user profiles from the construction of the initial prompt, unlike the original Fermi (in Table 1), and use the prompt of Vanilla for the initialization. We denote this version as Fermi_vanilla. The results, in comparison with other methods, are shown in Table 7: Fermi consistently outperforms the baselines under both choices of prompt initialization, and the gain is larger with the better initialization that incorporates the user profile.

Table 8: Continual prompt optimization. Test accuracy of ChatGPT over the different prompting methods on GlobalOpinionQA. ^∗ denotes that the results are obtained with a two times larger pool for Retrieval-of-Prompt.

| Models \ Methods | Profile | OPRO | Fermi^half | Fermi ${}^{\tt cont}_{\text{iter 1}}$ | Fermi ${}^{\tt cont}_{\text{iter 5}}$ | Fermi |
| --- | --- | --- | --- | --- | --- | --- |
| ChatGPT | 66.1 | 71.1 | 73.0 | 74.0^∗ | 74.9^∗ | 74.8 |

Continual optimization of prompts.

In the previous experiments, we assumed that a fixed dataset $U_{\tt opi}$ of questions and the user’s opinions is given. However, in the real world, users often interact frequently with LLMs, which means that the dataset could be continuously updated. Therefore, the iterative process of refining prompts might incur significant computational costs if it must be rerun from scratch at certain intervals (e.g., when the amount of new data reaches a threshold).

To mitigate this issue, we conduct additional experiments to show that the idea of continual prompt optimization [20, 33, 39] can be applied to Fermi, and hence that such cost can be drastically reduced. Specifically, we first run Fermi using half of the previous questions and the user’s responses $U_{\tt opi}$ (denoted by Fermi^half). We remark that other parameters, such as the 10 iterations of the optimization, are kept the same. Then, with the entire $U_{\tt opi}$ , we continue running Fermi under a limited number of iterations, initializing the prompt pool with the prompts previously optimized in Fermi^half (i.e., substituting the initialization in line 116). We denote the results of this continual optimization with 1 and 5 iterations as Fermi ${}^{\tt cont}_{\text{iter 1}}$ and Fermi ${}^{\tt cont}_{\text{iter 5}}$ , respectively.
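The warm-start procedure can be sketched with a deliberately simplified toy loop. Everything in it is an assumption made for illustration: prompts are plain strings, `score` counts how many data keywords a prompt covers, and `propose` appends one missing keyword in place of an LLM-generated prompt. It only shows the control flow of initializing the pool from an earlier run; it is not Fermi's actual update rule.

```python
def score(prompt, data):
    """Toy score: fraction of data keywords covered by the prompt."""
    words = set(prompt.split())
    return sum(k in words for k in data) / len(data)

def propose(prompt, data):
    """Toy proposal: add the first keyword the prompt is missing (stand-in for an LLM)."""
    words = set(prompt.split())
    for k in data:
        if k not in words:
            return prompt + " " + k
    return prompt

def optimize(prompt_pool, data, n_iters):
    """Iteratively extend the pool with a proposal built from the current best prompt."""
    pool = list(prompt_pool)
    for _ in range(n_iters):
        best = max(pool, key=lambda p: score(p, data))
        pool.append(propose(best, data))
    return pool

data_full = ["age", "region", "religion", "income"]
data_half = data_full[:2]
initial = ["answer as the user"]

# Phase 1: optimize on half of the data for 10 iterations (analogue of Fermi^half).
pool_half = optimize(initial, data_half, n_iters=10)

# Phase 2: continue on the full data for 1 iteration, warm-started from phase 1.
pool_cont = optimize(pool_half, data_full, n_iters=1)

# Reference: 1 iteration on the full data starting from scratch.
pool_scratch = optimize(initial, data_full, n_iters=1)

best_cont = max(score(p, data_full) for p in pool_cont)
best_scratch = max(score(p, data_full) for p in pool_scratch)
```

In this toy setting the warm-started run reaches a higher score after a single extra iteration than a run from scratch, which mirrors the motivation for reusing previously optimized prompts.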

The results are presented in Table 8. First, it is notable that even with a reduced amount of data for the optimization, Fermi still outperforms the strong baselines that are based on heuristic prompt engineering (Profile) or on optimization by LLMs under full data (OPRO). However, one can also observe that the accuracy under full data is noticeably better (74.8 vs. 73.0), which reveals that data quantity is still important in Fermi. Next, we observe that the prompts can be successfully optimized in a continual manner when new data is added. Here, we note that the prompts previously optimized in Fermi^half are also reused in the pool of Retrieval-of-Prompt, to keep the knowledge of previous iterations. (Footnote 4: Integrating new prompts into each user’s retrieval pool adds minimal computational overhead for calculating their embeddings.) Remarkably, even with only 1 additional iteration of optimization, the accuracy increases significantly (73.0 $\rightarrow$ 74.0). Also, when increasing the number of iterations to 5 (i.e., the same amount of computation as the original Fermi), the accuracy increases further and slightly outperforms the original optimization under the full data. This improvement might come from the enlarged pool of Retrieval-of-Prompt, which enables better exploitation of the previous knowledge.

These results show that the proposed framework remains effective in a more realistic scenario with continuously updated user data.

Appendix C Experimental Details

Table 9: Dataset statistics. More descriptions and statistics of datasets used in experiments.

| Dataset | Task | Users | Types of User Profiles | # of Previous Opinions | # of Test Questions |
| --- | --- | --- | --- | --- | --- |
| OpinionQA | Multiple Choice QA | 525 | Demographic and Ideology | 10.5k | 15.8k |
| GlobalOpinionQA | Multiple Choice QA | 46 | Nationality | 920 | 1,317 |
| LaMP_tag | 15-way Movie Tagging | 50 | Not Available | 1,000 | 1,500 |
| LaMP_rate | 5-scale Review Rating | 50 | Not Available | 1,000 | 1,500 |

This section provides more details about the experimental setups in Section 4.

Table 10: Information of GlobalOpinionQA. List of 46 countries in the constructed dataset from GlobalOpinionQA.

Countries Greece, Sweden, China (Non-national sample), Colombia, Tunisia, Malaysia, Vietnam, Argentina, Bulgaria, Russia, Egypt, Indonesia, Jordan, Mexico, Pakistan, Palest. ter., Tanzania, Turkey, Ukraine, Kenya, Ghana, Canada, France, Germany, Lebanon, Peru, Poland, S. Korea, Italy, Spain, United States, Brazil, Chile, Japan, Venezuela, Senegal, Britain, Australia, Netherlands, Uganda, Nigeria, Philippines, Ethiopia, Myanmar, Maldives, Libya

C.1 Datasets

First, we present more detailed descriptions of the used datasets: OpinionQA [24], GlobalOpinionQA [4], LaMP_tag, and LaMP_rate [23]. Dataset statistics are presented in Table 9. Also, an example from each dataset is presented in Figure 7.

$\circ$ OpinionQA is a multiple-choice QA dataset originally constructed from a public opinion survey [16], to evaluate the alignment of LMs with 60 US demographic groups over various topics. As OpinionQA includes the information of each respondent, this dataset has also been used to evaluate the personalization of LLMs [7], and we adopt it as well. Specifically, we use a subsampled split released by Hwang et al. [7], which consists of 10.5k training and 15.8k test QA pairs across 525 users and 15 topics; namely, each user has 20 training QA pairs and 30 test QA pairs for each topic, on average. Also, the average number of answer choices is 3.2. We use the training QA pairs as the given previous opinions by the user, and use the test QA pairs for evaluation. In addition, for the experiments, we use all 12 types of user profiles included in the dataset: {Age, Citizenship, Region, Education, Income, Marital status, Political ideology, Political party, Race, Religion, Frequency of religious attendance, Gender}.

$\circ$ GlobalOpinionQA is a multiple-choice QA dataset constructed from cross-national surveys to capture diverse opinions on global issues across different countries. Since the dataset originally includes the answer distribution over multiple respondents in the same country, we convert it to have a single answer by selecting the choice with the highest probability, and treat each country as a specific user. To be specific, we set a threshold (0.8) and only use a data point when its highest probability is higher than the threshold, to guarantee the quality of the converted data. This results in 920 training and 1,317 test QA pairs across 46 countries; namely, each user (country) has 20 training QA pairs and 28.6 test QA pairs for each topic, on average. Also, the average number of answer choices is 4.1. We use the training QA pairs as the given previous opinions by the user, and use the test QA pairs for evaluation. Also, nationality is the only available profile. The full list of countries included in the dataset is presented in Table 10. The dataset can be downloaded from https://huggingface.co/datasets/Anthropic/llm_global_opinions.

$\circ$ LaMP_tag is a 15-way classification dataset in which the input is a movie description and the label is the corresponding movie tag among 15 categories: {Sci-fi, Based on a book, Comedy, Action, Twist ending, Dystopia, Dark comedy, Classic, Psychology, Fantasy, Romance, Thought-provoking, Social commentary, Violence, True story}. Since the original dataset was proposed for the scenario of fine-tuning LMs and hence consists of a large number of examples, we construct our dataset by subsampling from its validation set to make it suitable for evaluating LLMs with inference. This results in 1,000 training and 1,500 test QA pairs across 50 users.

$\circ$ LaMP_rate is a regression dataset in which the input is a user review and the label is an integer rating (1-5), where 1 is the most negative and 5 is the most positive. With the same motivation as for LaMP_tag, we construct our dataset by subsampling from its validation set, which results in 1,000 training and 1,500 test QA pairs across 50 users. The LaMP benchmarks can be downloaded from https://github.com/LaMP-Benchmark/LaMP.
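The conversion of GlobalOpinionQA answer distributions into single answers, as described in the list above, can be sketched as follows. The function name `to_single_answer` and the example distributions are hypothetical; only the rule itself (take the most probable choice and keep the example only if its probability is higher than the 0.8 threshold) comes from the description.

```python
def to_single_answer(choices, probs, threshold=0.8):
    """Return the most probable choice, or None if it is not confident enough.

    An example is kept only when its highest probability is strictly
    higher than the threshold.
    """
    best = max(range(len(probs)), key=probs.__getitem__)
    return choices[best] if probs[best] > threshold else None

# Hypothetical answer distributions for one country.
kept = to_single_answer(["Agree", "Disagree", "Don't know"], [0.85, 0.10, 0.05])
dropped = to_single_answer(["Agree", "Disagree", "Don't know"], [0.55, 0.40, 0.05])
```

Examples for which the function returns `None` would simply be excluded from the constructed dataset.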

C.2 Baselines

In this section, we present the specific prompts used for the experiments in Section 4. Listings C.3-C.3 show the prompts actually used for Vanilla, Profile, Few-shot, and All Info during the experiments on GlobalOpinionQA. Also, the prompt of OPRO used for the optimization is presented in Figure 9, which is the one originally used in Yang et al. [35]. When we tried to adapt this prompt to be similar to ours in Figure 8, we observed that it degraded the performance of OPRO; for example, the average test accuracy dropped from 71.1% to 70.7%. Therefore, we use the original prompt for all the experiments. We remark that each prompt is minimally adjusted to account for the differences between datasets. For example, as OpinionQA includes many available user profiles, we fully incorporate these into the prompt in Listing C.3, following Hwang et al. [7]. Also, we present the prompt of the Vanilla method on the LaMP_rate dataset in Listing C.3. In addition, we present the prompt used to convert the format of the input prompt by Few-shot (Table 5) in Listing C.3.

C.3 Fermi

As described in Section 3.2, we need to provide an initial input prompt set $\text{P}^{0}=\{\text{p}^{0}\}$ . To this end, we use the heuristically designed input prompts, which are presented in Appendix C.2. Specifically, we adopt the prompts used for Profile, tuned for each dataset, when the user profile $U_{\tt pro}$ is available (both OpinionQA and GlobalOpinionQA). Since our framework only utilizes the given few-shot previous opinions during the optimization, this way of initial prompting naturally enables us to fully utilize all the user information. When the user profile is not available, we adopt the prompts used for Vanilla. In addition, we present a more detailed version of the prompt $\text{p}_{\tt opt}$ used to generate new input prompts with $\mathcal{M}_{\tt opt}$ in Figure 8. We remark that $\text{p}_{\tt opt}$ is minimally adjusted across datasets, to match the different task and user information of each dataset.

f'''
Choose the proper answer to the given question among the given answer choices. Your answer should be a single alphabet among given answer choices:

Question: question

Answer choices: answer choice

Answer:
'''

Listing: Input prompt used for Vanilla method on GlobalOpinionQA.

f'''
Choose the proper answer to the given question among the given answer choices, as if you currently reside in user profile. Your answer should be a single alphabet among given answer choices:

Question: question

Answer choices: answer choice

Answer:
'''

Listing: Input prompt used for Profile method on GlobalOpinionQA.

f'''
[1]. Question: question of 1st retrieval among previous opinions

Answer choices: answer choice of 1st retrieval among previous opinions

Answer: answer of 1st retrieval among previous opinions

…

[N]. Question: question of Nth retrieval among previous opinions

Answer choices: answer choice of Nth retrieval among previous opinions

Answer: answer of Nth retrieval among previous opinions

Based on the above previous questions and answers, choose the proper answer to the given question among the given answer choices. Your answer should be a single alphabet among given answer choices:

Question: question

Answer choices: answer choice

Answer:
'''

Listing: Input prompt used for Few-shot method.

f'''
[1]. Question: question of 1st retrieval among previous opinions

Answer choices: answer choice of 1st retrieval among previous opinions

Answer: answer of 1st retrieval among previous opinions

…

[N]. Question: question of Nth retrieval among previous opinions

Answer choices: answer choice of Nth retrieval among previous opinions

Answer: answer of Nth retrieval among previous opinions

Based on the above previous questions and answers, choose the proper answer to the given question among the given answer choices, as if you currently reside in explicit_profile. Your answer should be a single alphabet among given answer choices:

Question: question

Answer choices: answer choice

Answer:
'''

Listing: Input prompt used for All Info method.

f'''
A person can be described as follows:
Age: age in user profile
Citizenship in America: citizenship in America in user profile
Region: region in user profile
Education: education in user profile
Income: income in user profile
Marital status: marital status in user profile
Political ideology: political ideology in user profile
Political party: political party in user profile
Race: race in user profile
Religion: religion in user profile
Frequency of religious attendance: frequency of religious attendance in user profile
Gender: gender in user profile

Based on the demographic information, choose the proper answer to the given question among the given answer choices. Your answer should be a single alphabet among given answer choices:

Question: question

Answer choices: answer choice

Answer:
'''

Listing: Input prompt used for Profile method on OpinionQA.

f'''
Answer to the given question. Just answer with 1, 2, 3, 4, or 5 without further explanation:

Question: question

Answer choices: answer choice

Answer:
'''

Listing: Input prompt used for Vanilla method on LaMP_rate.

f'''
The followings are two different prompts used to answer the question.

[Input prompt]: prompt by Few-shot

[Target prompt]: prompt optimized by Fermi

You need to convert the input prompt to the format of the target prompt while preserving the original contexts in the input prompt.

Converted prompt:
'''

Listing: Prompt used to convert the format of input prompt by Few-shot to be instruction with multiple sentences.
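As an illustration of how a template such as the Few-shot prompt listed above could be filled in, the following minimal sketch assembles the enumerated previous opinions and the test query into one string. The function name `build_few_shot_prompt`, the dictionary keys, and the example data are hypothetical; the instruction text is copied from the listing, but the exact whitespace and placeholder handling of the original code are assumptions.

```python
INSTRUCTION = (
    "Based on the above previous questions and answers, choose the proper answer "
    "to the given question among the given answer choices. Your answer should be "
    "a single alphabet among given answer choices:"
)

def build_few_shot_prompt(retrieved, question, answer_choices):
    """Assemble a Few-shot style prompt from retrieved opinions and a test query."""
    blocks = []
    for i, ex in enumerate(retrieved, start=1):
        blocks.append(
            f"[{i}]. Question: {ex['question']}\n\n"
            f"Answer choices: {ex['choices']}\n\n"
            f"Answer: {ex['answer']}"
        )
    blocks.append(INSTRUCTION)
    blocks.append(f"Question: {question}\n\nAnswer choices: {answer_choices}\n\nAnswer:")
    return "\n\n".join(blocks)

# Hypothetical retrieved opinions for one user.
retrieved = [
    {"question": "Is free trade good for your country?", "choices": "A. Yes B. No", "answer": "A"},
    {"question": "Do you trust the news media?", "choices": "A. Yes B. No", "answer": "B"},
]
prompt = build_few_shot_prompt(retrieved, "Is immigration good for the economy?", "A. Yes B. No")
```

The Profile and All Info variants would differ only in the instruction sentence, which additionally conditions on the user profile.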

Appendix D Additional Quantitative Results

In this section, we provide additional quantitative results that could not be presented in the main text due to limited space. First, in Table 11, we present the average and standard deviation of topic-wise accuracy, i.e., the average and standard deviation are calculated across 35 users, where each user receives 30 test questions on the same topic. Next, in Table 12, we present the test performance of the Few-shot method in Section 4 under different numbers of retrieved opinions. Lastly, we present the test performance under different numbers of considered training questions $\tilde{N}$ (Eq. 7). As one can see in Table 13, $\tilde{N}=3$ , which is commonly used in our experiments, shows consistent improvements in general, although the optimal values differ across the datasets.
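The topic-wise statistics can be computed as in the following minimal sketch. The per-user accuracies are hypothetical values, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not state whether the population or the sample estimator was used.

```python
import statistics

def topic_summary(per_user_accuracy):
    """Mean and standard deviation of per-user accuracies within one topic.

    Assumes the population standard deviation; swap in statistics.stdev
    for the sample estimator.
    """
    mean = statistics.mean(per_user_accuracy)
    std = statistics.pstdev(per_user_accuracy)
    return mean, std

# Hypothetical accuracies (%) of four users on the same topic.
mean_acc, std_acc = topic_summary([50.0, 60.0, 40.0, 50.0])
```

In the reported setting, each list would contain 35 values, one per user of the topic.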

Table 11: Detailed topic-wise accuracy. Average topic-wise accuracy and standard deviation with different methods on OpinionQA.

| Topics \ Methods | Vanilla | Few-shot_cont | OPRO | Fermi |
| --- | --- | --- | --- | --- |
| Guns | 45.3 $\pm$ 9.6 | 54.2 $\pm$ 13.7 | 54.7 $\pm$ 9.0 | 57.4 $\pm$ 14.5 |
| Auto. vehicles | 46.0 $\pm$ 10.9 | 48.7 $\pm$ 10.0 | 50.2 $\pm$ 9.5 | 53.2 $\pm$ 10.6 |
| Views on gender | 39.7 $\pm$ 10.4 | 49.0 $\pm$ 7.8 | 52.9 $\pm$ 11.5 | 58.9 $\pm$ 8.8 |
| Sex. harassment | 38.0 $\pm$ 10.9 | 40.4 $\pm$ 10.4 | 46.1 $\pm$ 9.4 | 47.7 $\pm$ 10.4 |
| Biomedical & food | 54.8 $\pm$ 10.6 | 59.9 $\pm$ 11.9 | 61.0 $\pm$ 11.1 | 63.7 $\pm$ 10.4 |
| Gender & Leadership | 49.9 $\pm$ 12.5 | 53.0 $\pm$ 10.6 | 54.9 $\pm$ 11.7 | 59.5 $\pm$ 9.0 |
| America in 2050 | 48.6 $\pm$ 12.2 | 46.4 $\pm$ 10.8 | 44.6 $\pm$ 10.5 | 49.8 $\pm$ 10.8 |
| Trust in science | 49.0 $\pm$ 9.9 | 56.1 $\pm$ 10.8 | 54.8 $\pm$ 10.4 | 60.7 $\pm$ 7.8 |
| Race | 38.8 $\pm$ 7.8 | 46.8 $\pm$ 6.9 | 43.4 $\pm$ 11.0 | 49.3 $\pm$ 13.7 |
| Misinformation | 49.7 $\pm$ 11.7 | 50.5 $\pm$ 7.4 | 46.6 $\pm$ 9.2 | 52.3 $\pm$ 9.0 |
| Privacy & Surveillance | 41.5 $\pm$ 10.4 | 49.5 $\pm$ 9.2 | 46.6 $\pm$ 9.9 | 50.6 $\pm$ 10.6 |
| Family & Relationships | 51.4 $\pm$ 10.2 | 53.2 $\pm$ 12.1 | 50.9 $\pm$ 13.3 | 56.3 $\pm$ 11.9 |
| Economic inequality | 40.9 $\pm$ 9.2 | 47.0 $\pm$ 9.4 | 49.3 $\pm$ 12.7 | 53.5 $\pm$ 9.0 |
| Global attitudes | 46.3 $\pm$ 13.6 | 49.7 $\pm$ 12.3 | 47.9 $\pm$ 12.0 | 50.8 $\pm$ 13.9 |
| Political views | 43.2 $\pm$ 12.6 | 42.4 $\pm$ 9.2 | 48.9 $\pm$ 9.8 | 53.9 $\pm$ 11.8 |

Table 12: Different number of retrieval. Test performance of ChatGPT under different configurations for Few-shot method. Here, $k$ denotes the number of retrieved opinions. The best scores are highlighted in bold.

| Methods | OpinionQA (Acc.) | GlobalOpinionQA (Acc.) | LaMP_tag (Acc.) | LaMP_rate (MAE) |
| --- | --- | --- | --- | --- |
| Few-shot_bm25 (k=3) | 49.8 | 59.1 | 34.9 | 0.40 |
| Few-shot_bm25 (k=8) | 48.3 | 59.1 | 35.9 | 0.41 |
| Few-shot_cont (k=3) | 49.3 | 61.2 | 35.6 | 0.36 |
| Few-shot_cont (k=8) | 48.7 | 58.2 | 36.2 | 0.38 |
| Few-shot_all (k=20) | 47.9 | 56.3 | 35.8 | 0.46 |

Table 13: Different $\tilde{N}$ for RoP. Test performance of ChatGPT under different $\tilde{N}$ for RoP (Eq. 7).

| $\tilde{N}$ | OpinionQA (Acc.) | GlobalOpinionQA (Acc.) | LaMP_tag (Acc.) | LaMP_rate (MAE) |
| --- | --- | --- | --- | --- |
| $\tilde{N}=1$ | 54.6 | 74.8 | 37.8 | 0.341 |
| $\tilde{N}=3$ | 54.5 | 74.8 | 37.8 | 0.343 |
| $\tilde{N}=5$ | 54.5 | 74.4 | 37.5 | 0.341 |
| $\tilde{N}=10$ | 54.1 | 74.1 | 37.7 | 0.347 |
| $\tilde{N}=20$ | 54.3 | 74.2 | 36.7 | 0.338 |
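The two metrics reported in these tables, accuracy (Acc.) and mean absolute error (MAE), can be computed as in the following minimal sketch. The function names and the example predictions are hypothetical and only illustrate the standard definitions; note that higher is better for accuracy while lower is better for MAE.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def mean_absolute_error(predictions, labels):
    """Average absolute difference between predicted and true ratings."""
    return sum(abs(p - y) for p, y in zip(predictions, labels)) / len(labels)

# Hypothetical multiple-choice answers and 1-5 review ratings.
acc = accuracy(["A", "B", "C", "A"], ["A", "B", "A", "A"])
mae = mean_absolute_error([5, 3, 1], [4, 3, 3])
```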

Appendix E More Comparison Examples between Personalized Prompts

In this section, we present more qualitative comparisons between the prompts from different methods for the personalization of LLMs. To be specific, we present a specific test query from each dataset, and three corresponding prompts from the heuristic design, OPRO, and Fermi. Figures 10-17 are the comparison results on the four datasets used in Section 4. Interestingly, one can observe that the personalized prompts by Fermi incorporate the user information in a non-trivial way. In addition, we present examples of format-converted versions of few-shot prompting with previous user opinions (i.e., Few-shot_format in Table 5) in Figures 18 and 19. Here, one can observe that the converted prompts have a form similar to the personalized prompts by Fermi, which is more natural for LLMs to understand and follow, and hence the conversion improves the performance by up to 10.1%, as shown in Table 5.