Few-shot Personalization of LLMs
with Mis-aligned Responses

Jaehyung Kim  Yiming Yang
Carnegie Mellon University
jaehyun4@andrew.cmu.edu
Abstract

As the diversity of users increases, the capability of providing personalized responses by large language models (LLMs) has become increasingly important. Existing approaches have had only limited success in LLM personalization, due to the absence of personalized learning or the reliance on shared personal data. This paper proposes a new approach for few-shot personalization of LLMs with their mis-aligned responses (Fermi). Our key idea is to learn a set of personalized prompts for each user by progressively improving the prompts using LLMs, based on the user profile (e.g., demographic information) and a few examples of previous opinions. During an iterative process of prompt improvement, we incorporate the contexts of mis-aligned responses by LLMs, which are especially crucial for the effective personalization of LLMs. In addition, we develop an effective inference method to further leverage the context of the test query and the personalized prompts. Our experimental results demonstrate that Fermi significantly improves performance across various benchmarks, compared to the best-performing baselines. The code will be available at https://github.com/bbuing9/Fermi.

1 Introduction

The recent development of large language models (LLMs) has significantly accelerated progress in various NLP tasks, and yielded real-world applications used by millions of users, such as coding assistants and chatbots [14, 30, 31]. As the use of LLMs by diverse users in real-world applications increases, personalization of LLMs, i.e., steering LLMs’ responses towards the unique needs or preferences of individual users, becomes increasingly important [6, 26]. However, recent studies show that LLMs’ responses are often biased toward certain groups and ill-suited to other groups of users, and that such biases cannot be fixed by providing simple instructions [24].

To tackle this problem, methods to steer the responses of LLMs have been recently explored, and they can be roughly divided into two categories. One category is prompt engineering, which heuristically incorporates the user’s information into the input prompts of LLMs [7, 23]. The other category focuses on learning from other users’ data [12, 29, 36]. However, both categories have limitations: prompt engineering for every user would be too costly and non-trivial, while the learning-based category relies on the unrealistic assumption that personal data can be shared without violating privacy considerations.

Figure 1: An overview of Fermi. Fermi iterates three steps to optimize the prompt from the given user information: (1) scoring new prompts, (2) updating the memory with high-scored prompts, and (3) generating new improved prompts (left). After the optimization, Fermi selectively uses the personalized prompts for the inference, via Retrieval-of-Prompt (right).

This paper addresses those limitations by introducing a new approach, namely Few-shot Personalization of LLMs with mis-aligned responses (Fermi). Our high-level idea is to use an LLM to progressively improve its input prompts based on a few examples of previous user opinions and profiles (e.g., demographics) in an iterative process. In addition to the current prompts’ scores measured on given few-shot user opinions [35], Fermi incorporates the mis-aligned responses (i.e., the LLM’s responses with those prompts, which are inconsistent with given user opinions) as additional context. The contexts of mis-aligned responses include useful learning signals to update prompts, such as the types of wrong predictions with the current prompts (see the empirical evidence in Section 4). Specifically, the iterative process of Fermi consists of three steps: (1) scoring the initial or current prompts with the LLM, (2) updating the memory with high-scored prompts in the form of <prompt, score, context> triplets, and (3) generating new improved prompts with the LLM based on the updated memory. In addition, we propose Retrieval-of-Prompt, a method to improve the inference on a given test query. Retrieval-of-Prompt selectively uses one of the personalized prompts obtained from the optimization, based on the context of the test query. An overview of Fermi is presented in Figure 1.

We demonstrate the effectiveness of Fermi for few-shot personalization of LLMs, through extensive evaluations on various tasks including question-answering (QA), classification, and regression. For example, we observe that Fermi exhibited 6.8% and 4.1% average accuracy improvements on two multiple-choice QA datasets, constructed to evaluate the personalization of LLMs, compared to the previous state-of-the-art heuristic and optimization approaches, respectively. We also found that the personalized prompts produced with one LLM are also effective on other LLMs, including both API-based and open-sourced ones, which is crucial for efficient deployment in practice. In addition, our in-depth analyses reveal why Fermi is more effective than other prompting methods and which features of prompts are important for effective personalization of LLMs. We hope our work provides useful insights for the research on LLM personalization, which is becoming increasingly important for the future success of LLMs in real-world applications.

2 Related Works

Few-shot personalization of LLMs.

Few-shot personalization of LLMs aims to align an LLM’s responses to a specific user with a limited amount of user information, such as the user profile (e.g., demographic information) or opinions (e.g., the user’s previous responses to questions). To this end, one line of prior work has explored how to input given user information into the LLM in a heuristic manner, i.e., prompt engineering; for example, Santurkar et al. [24] design three different templates of input prompt, Salemi et al. [23] leverage a retrieval system [8] to use the given user opinions selectively, and Hwang et al. [7] show that using both user profile and opinions is more effective. On the other hand, another line of prior work has proposed learning from other users’ data; Li et al. [12] select the relevant users using collaborative filtering, then learn a soft prompt [13] from training data augmented with these users’ data. Zhao et al. [36] propose to train an independent transformer module via meta-learning on several users’ data. However, both approaches have their limitations: prompt engineering incurs the cost of designing the prompt, and may fail to fully utilize the user information due to the absence of learning. The learning-based approach requires other users’ data, which is hard to obtain in the real world due to privacy issues. Therefore, we propose to learn only from the target user’s information and find the optimized (i.e., personalized) prompt for that user.

Prompt optimization with LLM.

As prior works on prompt tuning, which rely on gradient-based updates [3, 11, 25], become inapplicable to recent API-based LLMs due to their black-box nature, other approaches have been recently explored for gradient-free prompt optimization, such as progressive improvement using heuristic rules or LLMs [18, 35, 38]. For example, Pryzant et al. [19] receive text feedback on how to update the prompts by instructing an LLM. Also, after generating initial prompts with LLMs, Zhou et al. [38] generate semantically similar variants of the prompts with the highest accuracies. Yang et al. [35] iterate evaluation and generation of prompts with two LLMs to solve black-box optimization problems such as prompt optimization; they incorporate the previously generated prompts with their scores to enable the optimizing LLM to construct new improved prompts. However, only providing the scores on training examples is insufficient to optimize the prompt for few-shot personalization of LLMs, as the context of mis-aligned responses, such as the types or patterns within recurring wrong predictions, cannot be captured in scores. Therefore, we propose an efficient way to incorporate such context during the optimization, along with an additional method to improve the inference by considering the context of the given test query.

3 Fermi: Few-shot Personalization of LLMs with Mis-aligned Responses

In this section, we present our framework proposed for Few-shot Personalization of LLMs from mis-aligned responses (Fermi). We first present our problem setup in Section 3.1. Then, in Section 3.2, we present our core component that optimizes the input prompt with a given user information, by using LLM as a black-box optimizer along with the additional contexts from mis-aligned responses. Lastly, we introduce an efficient inference scheme after optimizing prompts with Fermi, by utilizing the context of a test query (Section 3.3).

3.1 Problem description

We first describe the problem setup of our interest under a question-answering (QA) scenario. Our goal is to steer LLM for a specific user using that user’s information, and hence make LLM adaptively answer a given question depending on the user. Formally, let $q$ denote the given test question and $\mathcal{M}$ denote the LLM, respectively. Next, for user $u$, we assume two types of user information: $U_{\tt pro}$ and $U_{\tt opi}$. $U_{\tt pro}$ indicates explicit profile of $u$ such as demographics information (e.g., region, sex, and age) or ideology (e.g., political affiliation). $U_{\tt opi}$ indicates $N$ few-shot previous opinions by $u$, which has the form of QA pairs, i.e., $U_{\tt opi}=\{(q_i,a_i)\}_{i=1}^{N}$ where $q_i$ is a previously asked question and $a_i$ is an opinion (answer) by the user. Then, for given test question $q$, our goal is to predict the answer $a$, which would be generated by user $u$, through LLM $\mathcal{M}$ by using both $U_{\tt pro}$ and $U_{\tt opi}$. The heuristic design of input prompt $\text{p}$ to incorporate such user information has been previously explored [7, 24], i.e., prediction $\widehat{a}$ is obtained by conditioning $\mathcal{M}$ with $\text{p}$, which is constructed using $U_{\tt pro}$ and $U_{\tt opi}$:

$$\widehat{a}(\text{p}) = \mathcal{M}(q; \text{p}). \tag{1}$$

However, heuristically designed prompts could be limited to fully exploit the given user information. For example, compared to using all opinions in $U_{\tt opi}$, appending fewer user opinions can yield better personalization accuracy for LLM [7]. Therefore, we tackle this limitation by finding personalized prompts that steer LLM to the user, through direct learning from given user information.

Figure 2: Prompt example. Example of input prompt for $\mathcal{M}_{\tt opt}$ to generate new prompts, composed of fixed input prompt $\text{p}_{\tt opt}$ (including fixed few-shot demonstrations) and optimization memory $M^{t}$ (Eq. 5) on OpinionQA dataset [24]. A more detailed version is presented in Appendix C.3.

3.2 Prompt optimization using mis-aligned responses by LLM

To mitigate the difficulties from the large scale and black-box nature of recent LLMs, we instead optimize input prompts to learn from user information. It is motivated by the recent work [35] that uses two LLMs, $\mathcal{M}$ and $\mathcal{M}_{\tt opt}$, to solve black-box optimization, where $\mathcal{M}_{\tt opt}$ denotes another LLM used for the optimization. Specifically, our key idea is incorporating the contexts of mis-aligned responses (i.e., QAs in $U_{\tt opi}$ that $\mathcal{M}$ incorrectly predicts with current prompts) during the optimization, instead of only using scores of the prompts (e.g., average accuracy of the prediction by $\mathcal{M}$ on $U_{\tt opi}$). As the contexts of mis-aligned responses include useful learning signals such as types or patterns of common wrong predictions, they could be effective in learning how to improve the prompts.

We first assume that there is an initial prompt set $\text{P}^{0}=\{\text{p}^{0}\}$, e.g., heuristically designed prompt [7, 24]. Then, at each iteration $t$, we conduct the following three steps:

  1. Score Prompts: Evaluate each prompt based on its accuracy in predicting the user’s previous answers.

  2. Update Memory: Maintain a memory of the best-performing prompts along with their scores and the contexts of their mis-aligned responses.

  3. Generate New Prompts: Generate new improved prompts with $\mathcal{M}_{\tt opt}$ and the updated memory.

∘ Step 1: Score Prompts. We first calculate the score $s_k$ of each prompt $\text{p}_k \in \text{P}^{t}$, by obtaining the predictions from $\mathcal{M}$ under $\text{p}_k$ and evaluating them using the user’s previous answers:

$$s_k = \sum\nolimits_{(q_i, a_i) \sim U_{\tt opi}} \text{s}\big(a_i, \widehat{a}_i(\text{p}_k)\big) / N, \quad \text{where } \widehat{a}_i(\text{p}_k) = \mathcal{M}(q_i; \text{p}_k). \tag{2}$$

Here, $\text{s}(\cdot,\cdot)$ is a specific metric to evaluate the prediction (e.g., accuracy). During this calculation of the score $s_k$ of the prompt $\text{p}_k$, we also collect the mis-aligned QA pairs $U^{k}_{\tt opi}$ for which the prediction of $\mathcal{M}$ under $\text{p}_k$ is not aligned with the user’s answer:

$$U^{k}_{\tt opi} = \{(q_i, a_i) \mid \text{s}\big(a_i, \widehat{a}_i(\text{p}_k)\big) < \tau,~ (q_i, a_i) \in U_{\tt opi}\}, \tag{3}$$

where $\tau$ is a threshold to judge the mis-alignment; for example, we set $\tau = 0.5$ when we use the correctness of prediction as the score $\text{s}(\cdot,\cdot)$.
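As a rough illustration of Step 1, the sketch below computes the score of Eq. 2 and collects the mis-aligned pairs of Eq. 3 for a single prompt. It is a minimal sketch under stated assumptions: the callable `llm(question, prompt)` is a hypothetical stand-in for $\mathcal{M}$, `exact_match` is one possible choice for the metric $\text{s}(\cdot,\cdot)$, and the model's prediction is stored alongside each mis-aligned pair for later use as context.

```python
from typing import Callable, List, Tuple

QA = Tuple[str, str]  # (question q_i, user's answer a_i)


def exact_match(answer: str, prediction: str) -> float:
    """Assumed per-example metric s(a, a_hat): 1.0 on a match, else 0.0."""
    return 1.0 if answer.strip().lower() == prediction.strip().lower() else 0.0


def score_prompt(
    llm: Callable[[str, str], str],  # hypothetical: llm(question, prompt) -> answer
    prompt: str,
    opinions: List[QA],
    metric: Callable[[str, str], float] = exact_match,
    tau: float = 0.5,
):
    """Return (score, misaligned) for one prompt, following Eq. 2 and Eq. 3."""
    total = 0.0
    misaligned = []  # entries: (index, question, user answer, model prediction)
    for i, (question, answer) in enumerate(opinions):
        prediction = llm(question, prompt)
        s = metric(answer, prediction)
        total += s
        if s < tau:  # Eq. 3: the prediction is judged mis-aligned
            misaligned.append((i, question, answer, prediction))
    return total / len(opinions), misaligned


# Toy stand-in for the prediction LLM (illustration only).
def toy_llm(question: str, prompt: str) -> str:
    return "Yes" if "agree" in prompt else "No"


opinions = [("Q1?", "Yes"), ("Q2?", "No"), ("Q3?", "Yes")]
score, missed = score_prompt(toy_llm, "The user tends to agree.", opinions)
print(score, missed)
```

With the binary metric above, any threshold strictly between 0 and 1 (such as $\tau = 0.5$) separates correct from incorrect predictions; a graded metric for regression-style tasks would need its own threshold.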

∘ Step 2: Update Memory. Next, we construct an optimization memory $M^{t}$, which is used for the input of $\mathcal{M}_{\tt opt}$ to generate new improved prompts, by providing the information of well-performing prompts through the contexts of their mis-aligned responses. To be specific, the optimization memory $M^{t}=\{(\text{p}_l, s_l, c_l)\}_{l=1}^{L}$ is constructed by selecting top-$L$ prompts among $\text{P}^{t}$ and $M^{t-1}$ (where $M^{0}=\emptyset$), according to their scores (Eq. 2). Here, we present the triplets in $M^{t}$ in ascending order, i.e., $s_l < s_{l'}$ when $l < l'$, and provide the varied context $c_l$ depending on $l$. Specifically, for $l=1$, we construct $c_l$ by concatenating QAs and mis-aligned responses by $\mathcal{M}$ under $\text{p}_l$ on $U^{l}_{\tt opi}$:

$$c_l = \texttt{Concat}\{\big(i, q_i, a_i, \widehat{a}_i(\text{p}_l)\big) \mid (q_i, a_i) \in U^{l}_{\tt opi}\}. \tag{4}$$

In Figure 2, the texts corresponding to $c_1$ are highlighted in blue. For other cases (i.e., $l \neq 1$), instead of the enumeration like $c_1$, we construct the context $c_l$ with (i) the indices of common mis-aligned QA pairs between $\text{p}_l$ and $\text{p}_1$, and (ii) the number of newly mis-aligned QAs by $\text{p}_l$ compared to $\text{p}_1$ (see the green texts in Figure 2 for an example). Through the presented indices in $c_l$, $\mathcal{M}_{\tt opt}$ can directly access the mis-aligned QA pairs by referring to $c_1$, and one can avoid unnecessary complexity of $c_l$ and cost from the long input to $\mathcal{M}_{\tt opt}$. Additionally, the number of newly mis-aligned ones offers further insight into whether $\text{p}_l$ has improved, which cannot be captured by the common mis-aligned ones.
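The memory construction of Step 2 can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function name `build_memory`, the tuple layout, and the exact wording of the context strings are assumptions, and the caller is assumed to pass the already-evaluated union of $\text{P}^{t}$ and $M^{t-1}$.

```python
from typing import Dict, List, Tuple

# One evaluated prompt: (prompt text, score, {index: (question, answer, prediction)}).
Evaluated = Tuple[str, float, Dict[int, Tuple[str, str, str]]]


def build_memory(candidates: List[Evaluated], top_l: int) -> List[Tuple[str, float, str]]:
    """Keep the top-L prompts by score and attach a context string c_l to each."""
    # Select the top-L by score, then present them in ascending order of score.
    kept = sorted(candidates, key=lambda e: e[1], reverse=True)[:top_l]
    kept.sort(key=lambda e: e[1])
    if not kept:
        return []
    first_missed = kept[0][2]  # mis-aligned QAs of p_1, the lowest-scored kept prompt
    memory = []
    for rank, (prompt, score, missed) in enumerate(kept):
        if rank == 0:
            # l = 1: enumerate every mis-aligned QA with the model's response (Eq. 4).
            context = "\n".join(
                f"[{i}] Q: {q} | User: {a} | Model: {pred}"
                for i, (q, a, pred) in sorted(first_missed.items())
            )
        else:
            # l != 1: indices shared with p_1, plus a count of newly mis-aligned QAs.
            common = sorted(set(missed) & set(first_missed))
            new = len(set(missed) - set(first_missed))
            context = f"Common mis-aligned indices: {common}; newly mis-aligned: {new}"
        memory.append((prompt, score, context))
    return memory


def miss(i: int) -> Tuple[str, str, str]:
    """Toy mis-aligned entry for question i (illustration only)."""
    return (f"Q{i}?", "Yes", "No")


candidates = [
    ("prompt A", 0.4, {0: miss(0), 2: miss(2), 4: miss(4)}),
    ("prompt B", 0.6, {2: miss(2), 3: miss(3)}),
    ("prompt C", 0.2, {0: miss(0), 1: miss(1), 2: miss(2), 4: miss(4)}),
    ("prompt D", 0.8, {2: miss(2)}),
]
memory = build_memory(candidates, top_l=3)
for prompt, score, context in memory:
    print(prompt, score, context, sep=" | ")
```

Only the first entry carries the full question text; the remaining entries reference it by index, which keeps the input to the optimizing LLM short.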

∘ Step 3: Generate New Prompts. With the updated memory $M^{t}$, we generate $K$ new improved prompts $\text{P}^{t+1}=\{\text{p}^{\tt new}_k\}_{k=1}^{K}$ by prompting $\mathcal{M}_{\tt opt}$ to generate the new and high-scored prompts:

$$\text{p}^{\tt new}_k = \mathcal{M}_{\tt opt}(M^{t}; \text{p}_{\tt opt}), \tag{5}$$

where $\text{p}_{\tt opt}$ is a fixed input prompt for $\mathcal{M}_{\tt opt}$ to generate new prompts, and we use a random sampling with temperature to generate diverse new prompts from $\mathcal{M}_{\tt opt}$. Figure 2 presents the example of the overall input of $\mathcal{M}_{\tt opt}$ to generate new prompts, which is constructed with $M^{t}$ and $\text{p}_{\tt opt}$.

Then, we go back to Step 1 with $\text{P}^{t+1}$ and iterate these three steps $T$ times. After that, we obtain the optimized (i.e., personalized) prompts $\text{P}^{T}=\{\text{p}^{T}_k\}_{k=1}^{K}$ for the user $u$. We remark that we also use the user’s explicit profile $U_{\tt pro}$ to construct the initial prompt set $\text{P}^{0}$ when it is available; thereby we fully utilize the given user information (see more details in Appendix C.3).
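The three steps can be tied together in a short loop. The sketch below is a simplification under stated assumptions: `evaluate` is a hypothetical callable returning a score and the set of mis-aligned indices for a prompt (standing in for Eq. 2 and Eq. 3), `propose` is a hypothetical stand-in for sampling from $\mathcal{M}_{\tt opt}$ (Eq. 5), and the textual context $c_l$ is abstracted to the set of mis-aligned indices. The toy functions exist only to make the loop runnable without an LLM.

```python
import random
from typing import Callable, FrozenSet, List, Tuple

Entry = Tuple[str, float, FrozenSet[int]]  # (prompt, score, mis-aligned indices)


def optimize_prompts(
    evaluate: Callable[[str], Tuple[float, FrozenSet[int]]],  # hypothetical scorer
    propose: Callable[[List[Entry]], str],  # hypothetical stand-in for M_opt
    initial_prompts: List[str],
    num_iters: int,    # T
    num_new: int,      # K
    memory_size: int,  # L
) -> Tuple[List[str], List[Entry]]:
    prompts = list(initial_prompts)
    memory: List[Entry] = []
    for _ in range(num_iters):
        # Step 1: score the current prompts.
        scored = [(p, *evaluate(p)) for p in prompts]
        # Step 2: keep the top-L of (previous memory + current prompts), ascending.
        pool = {p: (s, m) for p, s, m in memory + scored}
        top = sorted(pool.items(), key=lambda kv: kv[1][0], reverse=True)[:memory_size]
        memory = sorted(((p, s, m) for p, (s, m) in top), key=lambda e: e[1])
        # Step 3: sample K new prompts conditioned on the memory.
        prompts = [propose(memory) for _ in range(num_new)]
    return prompts, memory


# Toy setup: a prompt "answers" question i correctly iff it mentions KEYWORDS[i].
KEYWORDS = ["rural", "religious", "retired"]
rng = random.Random(0)  # mimics the randomness of temperature sampling


def toy_evaluate(prompt: str) -> Tuple[float, FrozenSet[int]]:
    missed = frozenset(i for i, kw in enumerate(KEYWORDS) if kw not in prompt)
    return 1.0 - len(missed) / len(KEYWORDS), missed


def toy_propose(memory: List[Entry]) -> str:
    best_prompt, _, missed = memory[-1]  # highest-scored entry is last
    if not missed:
        return best_prompt
    # Use the mis-aligned context to decide what to fix next.
    return best_prompt + " " + KEYWORDS[rng.choice(sorted(missed))]


final_prompts, memory = optimize_prompts(
    toy_evaluate, toy_propose, ["The user is"], num_iters=4, num_new=2, memory_size=3
)
print(final_prompts)
```

In the toy run, each round repairs one mis-aligned index of the best prompt in memory, which is the kind of signal a score alone would not expose.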

3.3 Effective inference by Retrieval-of-Prompt

After $T$ iterations of the optimization procedure, Fermi outputs $K$ unique personalized prompts $\text{P}^{T}=\{\text{p}^{T}_k\}_{k=1}^{K}$. Therefore, for a given test question $q$, one needs to determine which prompt to apply. Selecting the prompt with the highest score, i.e., $k^{*}=\arg\max_{k} s_k$ (Eq. 2), would be a straight-forward way. However, our intuition is that better selection is possible if we utilize the context of the test question $q$ as additional information. To this end, we propose to select the input prompt with the highest score on the subset of $U_{\tt opi}$, which only consists of the previous questions highly relevant to $q$. Formally, we first measure the relevance $r$ between $q$ and previous question $q_i$:

$$R(q, U_{\tt opi}) = \{r(q, q_i) \mid q_i \in U_{\tt opi}\}. \tag{6}$$

For the relevance $r$, we use the cosine similarity between the embeddings of questions, extracted by the sentence encoder [21]. Then, we select top-$\tilde{N}$ questions according to the calculated relevance and construct the subset $U_{\tt opi}^{q}$ with those questions. Lastly, we choose the input prompt $\text{p}^{*}=\text{p}^{T}_{k^{*}}$ based on the scores on $U_{\tt opi}^{q}$, which were already calculated, and use the prediction $\widehat{a}(\text{p}^{*})$ by $\mathcal{M}$:

$$k^{*} = \arg\max_{k} s^{T}_{k}(U_{\tt opi}^{q}), \tag{7}$$

where $s^{T}_{k}(U_{\tt opi}^{q}) = \sum\nolimits_{(q_i, a_i) \sim U_{\tt opi}^{q}} \text{s}\big(a_i, \widehat{a}_i(\text{p}^{T}_k)\big) / \tilde{N}$. Figure 1 illustrates the overview of Fermi and Algorithm 1 summarizes the overall procedure of Fermi. We note that a full version of the prompts and examples of personalized prompts are presented in Appendixes C and E, respectively.
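The selection rule of Eq. 6 and Eq. 7 can be illustrated with a small sketch. It assumes that question embeddings have already been computed by some sentence encoder and are passed in as plain lists of floats, and that the per-example scores of every personalized prompt on $U_{\tt opi}$ were cached during optimization; the function and argument names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieval_of_prompt(
    query_emb: Sequence[float],
    question_embs: List[Sequence[float]],   # embeddings of q_1..q_N (assumed precomputed)
    per_example_scores: List[List[float]],  # [k][i] = s(a_i, a_hat_i(p_k)), assumed cached
    top_n: int,                             # N-tilde
) -> int:
    """Return k*, the prompt index with the best mean score on the most relevant questions."""
    relevance = [cosine(query_emb, e) for e in question_embs]  # Eq. 6
    nearest = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)[:top_n]
    subset_scores = [sum(row[i] for i in nearest) / len(nearest) for row in per_example_scores]
    return max(range(len(subset_scores)), key=lambda k: subset_scores[k])  # Eq. 7


# Toy data: questions 0-1 and questions 2-3 form two topical clusters.
question_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
per_example_scores = [
    [1.0, 1.0, 0.0, 0.0],  # prompt 0: accurate only on the first cluster
    [0.0, 0.0, 1.0, 1.0],  # prompt 1: accurate only on the second cluster
    [1.0, 0.0, 1.0, 1.0],  # prompt 2: best average score overall
]
k_star = retrieval_of_prompt([1.0, 0.05], question_embs, per_example_scores, top_n=2)
print(k_star)
```

In this toy case the query is close to the first cluster, so the prompt chosen is prompt 0 even though prompt 2 has the highest score averaged over all of $U_{\tt opi}$; this is the situation in which query-aware selection differs from simply taking the globally best prompt.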

Algorithm 1 Fermi algorithm
Input: LLM for prediction $\mathcal{M}$, LLM for optimization $\mathcal{M}_{\tt opt}$, target test question $q$, explicit user profile $U_{\tt pro}$, few-shot previous user opinions $U_{\tt opi}=\{(q_i,a_i)\}_{i=1}^{N}$, number of iterations $T$
$\text{P}^{0}=\{\text{p}^{0}\}\leftarrow\texttt{InitPrompt}(U_{\tt pro})$ /*Get initial prompt*/
for $t=0$ to $T-1$ do
     $S^{t}=\{s_k\}_{k=1}^{K}\leftarrow$ Eq. 2 with $\mathcal{M}$, $\text{P}^{t}$, $U_{\tt opi}$ /*Score prompts*/
     $M^{t}=\{(\text{p}_l,s_l,c_l)\}_{l=1}^{L}\leftarrow\text{Top-}L(M^{t-1}\cup\text{P}^{t})$ with $S^{t}$, $U_{\tt opi}$ (Eq. 4) /*Update memory*/
     $\text{P}^{t+1}=\{\text{p}^{\tt new}_k\}_{k=1}^{K}\leftarrow$ Eq. 5 with $\mathcal{M}_{\tt opt}$, $M^{t}$ /*Generate new prompts*/
end for
$k^{*}\leftarrow\arg\max_{k} s^{T}_{k}(U_{\tt opi}^{q})$, Eq. 6 with $\text{P}^{T}$, $q$, $U_{\tt opi}$ /*Retrieval-of-Prompt*/
return $\widehat{a}(\text{p}^{*})=\mathcal{M}(q;\text{p}^{T}_{k^{*}})$
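As a rough illustration of the loop in Algorithm 1, the following Python sketch replaces both LLMs with toy deterministic functions. All names here (`score_prompt`, `fermi_optimize`, `toy_predict`, `toy_propose`) are hypothetical and are not taken from the released code. The sketch keeps only the score, update-memory, and propose structure; it omits Retrieval-of-Prompt and the actual prompt templates.

```python
from typing import Callable, List, Tuple

QA = Tuple[str, str]  # (question, user's answer)


def score_prompt(predict: Callable[[str, str], str], prompt: str, opinions: List[QA]):
    """Return (accuracy, indices of mis-aligned QAs) for one prompt."""
    wrong = [i for i, (q, a) in enumerate(opinions) if predict(q, prompt) != a]
    return 1.0 - len(wrong) / len(opinions), wrong


def fermi_optimize(predict, propose, init_prompt, opinions, K=4, L=5, T=10):
    """Toy version of the loop: score prompts, keep the top-L memory, propose K new prompts."""
    prompts = [init_prompt]
    memory = []  # entries are (prompt, score, mis-aligned indices)
    for _ in range(T):
        scored = [(p, *score_prompt(predict, p, opinions)) for p in prompts]
        # Keep the L best entries seen so far (stable sort, highest score first).
        memory = sorted(memory + scored, key=lambda e: e[1], reverse=True)[:L]
        prompts = propose(memory, K)
    final = [(p, *score_prompt(predict, p, opinions)) for p in prompts]
    return final, memory


# Toy stand-ins (assumptions for illustration only).
toy_opinions = [("q0", "A"), ("q1", "B"), ("q2", "A"), ("q3", "C")]


def toy_predict(question, prompt):
    """Stand-in for the prediction LLM: follow a 'qX=Y' hint if present, else answer 'A'."""
    marker = f"{question}="
    pos = prompt.find(marker)
    return prompt[pos + len(marker)] if pos >= 0 else "A"


def toy_propose(memory, k):
    """Stand-in for the optimization LLM: extend the best prompt with hints for its mis-aligned QAs."""
    best_prompt, _, wrong = memory[0]
    if not wrong:
        return [best_prompt] * k
    hints = [f" {toy_opinions[i][0]}={toy_opinions[i][1]}" for i in wrong]
    return [best_prompt + hints[j % len(hints)] for j in range(k)]


final_prompts, memory = fermi_optimize(
    toy_predict, toy_propose, "Answer as the user would.", toy_opinions, T=3
)
print(max(score for _, score, _ in final_prompts))
```

In this toy setting the mis-aligned indices stored in the memory are what lets the proposer repair the prompt, which mirrors the role that the context of mis-aligned responses plays in Fermi.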

4 Experiments

In this section, we design our experiments to investigate the following questions:

  • How does Fermi perform compared to other personalization methods? (Tables 1 and 2)
  • Is the prompt optimized with Fermi on one LLM transferable to different LLMs? (Table 3)
  • What is the effect of each component in Fermi? (Table 4)
  • Why is the prompt optimized by Fermi more effective than other prompts? (Table 5)

4.1 Setups

First, we describe our experimental setups. More details are presented in Appendix C.

Datasets. For the experiments, we first use two multiple-choice QA datasets proposed to measure the steerability of LLMs for specific users (or social groups): OpinionQA [24] and GlobalOpinionQA [4]. For OpinionQA, we use a subsampled split released by Hwang et al. [7], which consists of 10.5k training and 15.8k test QA pairs across 525 users and 15 topics. For GlobalOpinionQA, since the dataset originally includes the answer distribution over multiple respondents in the same country, we convert it to a single answer by selecting the choice with the highest probability. This results in 920 training and 1,317 test QA pairs across 46 countries. We consider each country as a specific user. Next, we use two additional datasets, LaMPtag and LaMPrate, from a recent benchmark proposed for the personalization of LLMs [23]. LaMPtag is a 15-way classification dataset where the input is a movie description and the label is a movie tag, and LaMPrate is a regression dataset where the input is a user review and the label is an integer rating (1-5). We construct both datasets by subsampling from their original validation split, which results in 1,000 training and 1,500 test QA pairs across 50 users for each dataset. On average across the four datasets, each user is given 20 training QAs as previous opinions together with a specific profile, and 30 test QAs are used for evaluation. For LaMPrate, we report mean absolute error (MAE), a commonly used metric for regression. For the others, we report average test accuracy (Acc).
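The conversion of GlobalOpinionQA answer distributions into single answers, and the two reported metrics, can be sketched as follows. This is a minimal illustration with hypothetical function names and toy inputs, not the preprocessing code used for the paper; in particular, breaking ties by taking the first maximum is an assumption.

```python
def collapse_to_top_choice(choices, probs):
    """Pick the answer option with the highest respondent probability.

    Ties are broken by taking the first maximum (an assumption).
    """
    best = max(range(len(choices)), key=lambda i: probs[i])
    return choices[best]


def accuracy(preds, labels):
    """Fraction of predictions that exactly match the user's answers."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def mean_absolute_error(preds, labels):
    """Average absolute gap between predicted and true integer ratings."""
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)


print(collapse_to_top_choice(["Yes", "No", "Unsure"], [0.2, 0.7, 0.1]))
print(accuracy(["A", "B", "C", "A"], ["A", "B", "A", "A"]))
print(mean_absolute_error([5, 3, 1], [4, 3, 3]))
```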

Baselines. We compare Fermi against extensive baselines as follows: (1) Uniform: the expected performance when predictions are made uniformly at random. (2) Vanilla: answering the question with the LLM without any user information. (3) Profile: constructing the prompt from all available user profile information [24, 7], such as demographics or nationality. (4) Few-shot: retrieving relevant previous questions and opinions and appending them to the prompt [7, 23]. Following [23], we consider BM25 [22] and Contriever [8] as the retriever models. The number of retrieved profiles is chosen from {3, 8, all} based on validation performance. (5) All Info: using both explicit profiles and retrieved previous QAs to construct the prompt [7]. We use the retriever with the best performance in Few-shot. (Footnote 2: In the case of OpinionQA, we additionally consider the retrieved indices originally included by [7].) (6) Optimization by PROmpting (OPRO; Yang et al. [35]): optimizing the input prompt with LLMs using both user profiles and previous opinions. Here, all of the previous opinions are utilized during the optimization. In the experiments, the prompt with the best training score is selected for the test.
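To make the Few-shot baseline concrete, the sketch below ranks previous questions with a small pure-Python Okapi BM25 scorer and prepends the top matches to the test question. It is an illustration under stated assumptions: the function names, the tokenizer, the parameters `k1=1.5` and `b=0.75`, and the `Q:`/`A:` prompt layout are all hypothetical, and the baselines in the paper may rely on a different BM25 implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric tokens only (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_k(query, documents, k=3, k1=1.5, b=0.75):
    """Return indices of the k documents with the highest Okapi BM25 score."""
    docs = [tokenize(d) for d in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(t for d in docs for t in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in docs:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant commonly used in BM25 implementations.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def build_few_shot_prompt(query, history, k=3):
    """Prepend the k most relevant previous (question, answer) pairs to the query."""
    top = bm25_top_k(query, [q for q, _ in history], k=k)
    shots = "\n".join(f"Q: {history[i][0]}\nA: {history[i][1]}" for i in top)
    return f"{shots}\nQ: {query}\nA:"


history = [
    ("Do you favor stricter gun laws", "Yes"),
    ("Is climate change a major threat", "Yes"),
    ("Should taxes on the wealthy increase", "No"),
]
print(build_few_shot_prompt("Is climate change caused by humans", history, k=1))
```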

Table 1: Main result on multiple-choice QA datasets. Test accuracy (%) of ChatGPT over the different methods on OpinionQA (OpQA) and GlobalOpinionQA (GOQA). The best and second best scores are highlighted in bold and underline, respectively.

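| Dataset | Uniform | Vanilla | Profile | Few-shot_bm25 | Few-shot_cont | All Info | OPRO | Fermi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpQA | 34.2 | 45.5 | 48.1 | 49.8 | 49.3 | 48.6 | 50.2 | 54.6 |
| GOQA | 31.4 | 62.8 | 66.1 | 59.1 | 61.2 | 62.3 | 71.1 | 74.8 |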

Table 2: Main result on LaMP Benchmark. Test performance of ChatGPT over the different methods on LaMPtag and LaMPrate. Test accuracy (Acc(\uparrow)) and mean absolute error (MAE (\downarrow)) are used, respectively. The best and second best scores are highlighted in bold and underline, respectively.

| Dataset (Metric) | Uniform | Vanilla | Few-shot_bm25 | Few-shot_cont | OPRO | Fermi |
| --- | --- | --- | --- | --- | --- | --- |
| LaMPtag (Acc) | 6.7 | 36.1 | 35.9 | 36.2 | 34.3 | 37.8 |
| LaMPrate (MAE) | 1.65 | 0.62 | 0.40 | 0.36 | 0.57 | 0.34 |

Implementation details. We use three recent state-of-the-art LLMs as the prediction LLM $\mathcal{M}$ in the experiments: ChatGPT (gpt-3.5-turbo-0613) [14], GPT-4 (gpt-4-turbo-1106) [15], and LLaMA2-chat-70B [31]. For $\mathcal{M}$, we use a temperature of 0.0 when calling the API, or greedy decoding for LLaMA, to remove the effect of random sampling. For the optimization LLM $\mathcal{M}_{\tt opt}$, we always use GPT-4 with a temperature of 1.0, as the prompt optimization based on the memory (Eq. 5) requires complex reasoning capability (see Appendix B). For OPRO and Fermi, we use fixed values of $K=4$, $L=5$, and $T=10$. Also, among the previous user opinions in $U_{\tt opi}$, 80% are used for optimization and 20% are used as few-shot demonstrations in $\text{p}_{\tt opt}$. To obtain sentence embeddings for Retrieval-of-Prompt, we use the sentence encoder with MPNet [27], which shows the best performance. (Footnote 3: Following the results in https://www.sbert.net.) Also, we use a fixed $\tilde{N}=3$ for Retrieval-of-Prompt.
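The hyperparameters and the 80%/20% split of previous opinions described above can be collected in a short sketch. The dictionary keys and the helper `split_opinions` are hypothetical names, and the use of a seeded random shuffle is an assumption, since the text does not state how the two subsets are chosen.

```python
import random

# Values taken from the implementation details; key names are illustrative.
FERMI_CONFIG = {
    "K": 4,                       # new prompts generated per iteration
    "L": 5,                       # size of the optimization memory
    "T": 10,                      # number of optimization iterations
    "N_tilde": 3,                 # neighbors used by Retrieval-of-Prompt
    "temperature_predict": 0.0,   # prediction LLM
    "temperature_optimize": 1.0,  # optimization LLM
}


def split_opinions(opinions, demo_fraction=0.2, seed=0):
    """Split previous opinions into an optimization set and few-shot demonstrations.

    The random shuffle is an assumption; the paper only specifies the 80/20 ratio.
    """
    rng = random.Random(seed)
    shuffled = list(opinions)
    rng.shuffle(shuffled)
    n_demo = round(len(shuffled) * demo_fraction)
    return shuffled[n_demo:], shuffled[:n_demo]


opinions = [(f"question {i}", "A") for i in range(20)]
optimization_set, demonstrations = split_opinions(opinions)
print(len(optimization_set), len(demonstrations))
```

With the 20 previous opinions available per user on average, this gives 16 QAs for scoring prompts and 4 QAs for demonstrations.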

Figure 3: Overall topic-wise improvement. Test accuracy of ChatGPT over four different personalization methods on OpinionQA. Detailed results are presented in Appendix D.
Table 3: Transferability of Fermi. Test accuracy (%) of two LLMs (LLaMA2-chat-70B and GPT-4) on GlobalOpinionQA. For Few-shot, we use Contriever which shows higher accuracy in Table 1. For OPRO and Fermi, prompts optimized on ChatGPT are directly used. The best and second best scores are highlighted in bold and underline, respectively.

| Model | Vanilla | Profile | Few-shot | All Info | OPRO | Fermi |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA-2 | 62.4 | 65.5 | 60.5 | 65.1 | 64.5 | 68.9 |
| GPT-4 | 56.7 | 77.7 | 68.9 | 78.2 | 76.7 | 84.8 |

4.2 Main results

Table 1 summarizes the experimental results on two multiple-choice QA datasets under ChatGPT. First, augmenting the input prompt with user information is effective in improving the accuracy of LLMs, but the effectiveness varies. For example, retrieving relevant user opinions is more effective than using the user profile on OpinionQA (49.8% vs. 48.1%), but the opposite holds on GlobalOpinionQA (61.2% vs. 66.1%). This is due to the difference between the datasets: each user is asked multiple questions on the same topic in OpinionQA, while GlobalOpinionQA covers broader topics. This result also reveals the necessity of a learning-based prompt optimization approach. From the results of OPRO and Fermi, one can observe that the optimization-based approach is indeed effective, and that the proposed method significantly improves on it. Specifically, Fermi exhibits a 6.75% average accuracy improvement over the best-performing prompting baseline on each dataset. Furthermore, compared to the existing optimization method (OPRO), Fermi exhibits a 4.05% accuracy improvement on average. In Figure 3, we additionally present detailed results on OpinionQA, namely the topic-wise accuracy of four representative methods selected based on average accuracy. Here, Fermi consistently shows better performance than the other baselines across all topics, which further demonstrates the effectiveness of Fermi for the personalization of LLMs.
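The two average improvements quoted above can be reproduced from Table 1 under the assumption that each gain is an absolute difference in accuracy against the best reference method on each dataset, averaged over the two datasets. The dictionary layout and the helper `average_gain` are illustrative only.

```python
# Test accuracies (%) copied from Table 1.
table1 = {
    "OpQA": {"Vanilla": 45.5, "Profile": 48.1, "Few-shot_bm25": 49.8,
             "Few-shot_cont": 49.3, "All Info": 48.6, "OPRO": 50.2, "Fermi": 54.6},
    "GOQA": {"Vanilla": 62.8, "Profile": 66.1, "Few-shot_bm25": 59.1,
             "Few-shot_cont": 61.2, "All Info": 62.3, "OPRO": 71.1, "Fermi": 74.8},
}
prompting_baselines = ["Vanilla", "Profile", "Few-shot_bm25", "Few-shot_cont", "All Info"]


def average_gain(reference_methods):
    """Average over datasets of (Fermi accuracy - best reference accuracy)."""
    gains = [
        row["Fermi"] - max(row[m] for m in reference_methods)
        for row in table1.values()
    ]
    return sum(gains) / len(gains)


print(round(average_gain(prompting_baselines), 2))  # gain over prompting baselines
print(round(average_gain(["OPRO"]), 2))             # gain over OPRO
```

Under this reading, the per-dataset gains over the prompting baselines are 4.8 (OpinionQA, against Few-shot with BM25) and 8.7 (GlobalOpinionQA, against Profile), and the gains over OPRO are 4.4 and 3.7, which average to the 6.75 and 4.05 points reported in the text.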

Next, Table 2 summarizes the experimental results on LaMPtag (classification) and LaMPrate (regression) under ChatGPT. We note that these datasets do not include explicit user profiles; hence, we exclude both Profile and All Info from the baselines. Here, it is noteworthy that the effectiveness of OPRO is significantly degraded, as the given task becomes more challenging to solve (e.g., the average number of answer choices is 3.96 for GlobalOpinionQA vs. 15 for LaMPtag). Nevertheless, Fermi is consistently effective and outperforms the other baselines; for example, Fermi exhibits 4.42% and 5.56% relative improvement on the two datasets, respectively.
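The relative improvements quoted above are consistent with a comparison against the best baseline in Table 2 (Few-shot with Contriever on both datasets); this reading is an inference from the table rather than a statement in the text. For accuracy the gain is measured upward, and for MAE it is measured as a reduction:

```latex
\[
\text{LaMPtag:}\quad \frac{37.8 - 36.2}{36.2} \approx 4.42\%,
\qquad
\text{LaMPrate:}\quad \frac{0.36 - 0.34}{0.36} \approx 5.56\%.
\]
```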

4.3 Analyses with Fermi

In this section, we provide additional analyses of Fermi with experiments on GlobalOpinionQA. We note that more analyses are presented in Appendix B.

Transferability of the optimized prompt.

Here, we provide additional experiments to verify the transferability of the prompts learned with our method. Specifically, we first save the prompts optimized with ChatGPT as the LLM for evaluation (Eq. 1), which are the prompts used in Table 1. Then, we directly apply these prompts to two different LLMs (LLaMA2-chat-70B and GPT-4) without additional optimization, in the same way that heuristically designed prompts are applied. From Table 3, one can observe that the transferred prompts from Fermi significantly outperform the baseline prompting methods on both LLMs; for example, they exhibit 3.4% and 7.1% accuracy improvement compared to the best-performing baselines for each LLM, respectively. We remark that the prompts from OPRO are even less effective than the existing baselines, which further shows the advantage of Fermi in learning well-generalized personalized prompts. Also, the effectiveness on LLaMA-2 demonstrates that our method is applicable to open-source LLMs, not only to black-box API LLMs.

Table 4: Ablation study of Fermi. Test accuracy of ChatGPT on GlobalOpinionQA with different configurations of the proposed components in Fermi.
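| Methods | AddMis | AddNum | RoP | Acc |
| --- | --- | --- | --- | --- |
| OPRO | - | - | - | 71.1 |
| | ✓ | - | - | 73.7 |
| | ✓ | ✓ | - | 74.2 |
| Fermi | ✓ | ✓ | ✓ | 74.8 |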
Table 5: In-depth analyses about prompts for personalization. Training and test accuracies of ChatGPT on GlobalOpinionQA. Training accuracy is measured by given user opinions $U_{\tt opi}$.

| Methods | Vanilla | Profile | Few-shot_top3 | Few-shot_all | Few-shot_bott3 | Few-shot_format | Fermi_irrel | Fermi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Acc_train | 62.5 | 67.9 | - | 95.2 | - | 70.2 | 80.2 | 81.4 |
| Acc_test | 62.8 | 66.1 | 61.2 | 56.3 | 45.8 | 66.4 | 73.8 | 74.8 |

Ablation study.

To validate the effectiveness of the proposed components of Fermi in Section 3, we perform ablation experiments by decomposing our framework into three components: (1) including QAs that have mis-aligned responses with the initial presentation and referring via common indices (AddMis), (2) noting the number of QAs with new mis-aligned responses (AddNum), and (3) Retrieval-of-Prompt for a test query (RoP). As shown in Table 4, all components progressively improve the few-shot personalization of LLMs. In particular, efficiently providing the context of mis-aligned QAs during the optimization is the most crucial for the improvement. Next, providing the number of new mis-aligned QAs yields an additional improvement, as it conveys information about the effectiveness of the given prompt that is not captured by the commonly mis-aligned QAs. Lastly, for a test query, retrieving the most relevant prompt is more effective than selecting the prompt with the highest training score, as it successfully utilizes the context of the test query.
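The difference between Retrieval-of-Prompt and selection by highest training score can be illustrated with a small sketch. It assumes that the query-specific subset of opinions consists of the $\tilde{N}$ previous questions whose embeddings are closest to the test query under cosine similarity, and it uses hand-written 2-D vectors in place of real sentence embeddings. The function names and the `correct` matrix are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def retrieval_of_prompt(query_emb, opinion_embs, correct, n_tilde=3):
    """Pick the prompt that scores best on the opinions nearest to the test query.

    correct[k][i] is 1 if prompt k reproduced the user's answer on opinion i, else 0.
    """
    nearest = sorted(
        range(len(opinion_embs)),
        key=lambda i: cosine(query_emb, opinion_embs[i]),
        reverse=True,
    )[:n_tilde]
    local_scores = [sum(row[i] for i in nearest) / n_tilde for row in correct]
    best = max(range(len(correct)), key=lambda k: local_scores[k])
    return best, local_scores


# Toy data: opinions 0-2 are close to the query, opinions 3-4 are not.
opinion_embs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
query_emb = [1.0, 0.05]
correct = [
    [0, 1, 1, 1, 1],  # prompt 0: training accuracy 0.8
    [1, 1, 1, 0, 0],  # prompt 1: training accuracy 0.6
]
best, local_scores = retrieval_of_prompt(query_emb, opinion_embs, correct)
print(best, local_scores)
```

In this toy case, prompt 0 has the higher overall training accuracy, but prompt 1 is fully aligned on the three opinions closest to the query, so the retrieval-based rule selects prompt 1.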

Features of good input prompts for personalization.

In Table 5, we further conduct experiments to answer the following question: what features make good personalized prompts for LLMs? First, we claim that the relevance of the prompt to the test query is crucial; for example, Few-shot_top3, Few-shot_all, and Few-shot_bott3 are prompting methods that retrieve the 3 most relevant, all 20, and the 3 most irrelevant previous opinions, respectively. Here, test accuracy degrades substantially as the portion of irrelevant opinions increases. Similarly, when we retrieve the most irrelevant prompt (Fermi_irrel), i.e., take $\arg\min$ in Eq. 7, the accuracy of Fermi also decreases.

Second, providing the user information in a proper format for LLMs is important. As shown in Figure 5, the prompt optimized by Fermi is a detailed instruction consisting of multiple sentences that condense the lessons from the user opinions and the LLM's mis-aligned responses. In contrast, the previous prompt used to incorporate previous opinions is based on a specific form, which is harder for LLMs to follow. To verify the importance of the format, we convert the enumeration of all QAs (used by Few-shot_all) into an instruction of multiple sentences (denoted by Few-shot_format), by prompting GPT-4 with the prompts optimized by Fermi as references. Interestingly, this format conversion yields a significant improvement (56.3% → 66.4%), while it still underperforms Fermi.

Figure 4: Optimization trajectory. Average training accuracies of OPRO and Fermi on GlobalOpinionQA, across optimization iterations ($T=10$).

Lastly, effectively distilling the given user information is important. As shown in Table 5, a prompting method with higher accuracy on previous user opinions $U_{\tt opi}$ (i.e., training accuracy) also has higher test accuracy for that user, except for Few-shot_all, which can directly access $U_{\tt opi}$. In this respect, Fermi shows a clear advantage over the previous prompt optimization method; as shown in Figure 4, Fermi optimizes the prompt more effectively and achieves higher training accuracy than OPRO. These results indicate that finding a proper way to condense and incorporate the user information when designing input prompts is crucial, and Fermi achieves this by using the context of mis-aligned responses.

Overall, designing personalized prompts satisfying these three properties (relevancy to test query, proper format, and effective distillation of user information) is challenging, but Fermi effectively accomplishes this goal.

Figure 5: Qualitative comparison. Example prompts from All Info (middle) and Fermi (bottom) for the specific question (top) from GlobalOpinionQA. Prompt is inserted in the location of <INS>. More qualitative examples are presented in Appendix E.

5 Conclusion

In this paper, we propose Fermi, a simple yet effective framework for improving the few-shot personalization of LLMs. Our key idea is to optimize the input prompt by learning from the user information; we propose an efficient way to incorporate the contexts of mis-aligned responses by LLMs during the optimization, and a retrieval approach to select the optimized prompt relevant to the test query. The effectiveness of Fermi is demonstrated by results on various personalization tasks and LLMs. We believe that our framework could be beneficial for improving the experience of personal usage of LLMs, which will become increasingly important in the future. More discussions on the limitations and the broader impact of this work are presented in Appendix A.

References

  • Banerjee et al. [2023] D. Banerjee, P. Singh, A. Avadhanam, and S. Srivastava. Benchmarking llm powered chatbots: methods and metrics. arXiv preprint arXiv:2308.04624, 2023.
  • Chao et al. [2023] P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong. Jailbreaking black box large language models in twenty queries. In R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models, 2023.
  • Deng et al. [2022] M. Deng, J. Wang, C.-P. Hsieh, Y. Wang, H. Guo, T. Shu, M. Song, E. Xing, and Z. Hu. Rlprompt: Optimizing discrete text prompts with reinforcement learning. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
  • Durmus et al. [2023] E. Durmus, K. Nguyen, T. I. Liao, N. Schiefer, A. Askell, A. Bakhtin, C. Chen, Z. Hatfield-Dodds, D. Hernandez, N. Joseph, et al. Towards measuring the representation of subjective global opinions in language models. arXiv preprint arXiv:2306.16388, 2023.
  • Gao et al. [2023] T. Gao, H. Yen, J. Yu, and D. Chen. Enabling large language models to generate text with citations. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
  • Glaese et al. [2022] A. Glaese, N. McAleese, M. Trębacz, J. Aslanides, V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chadwick, P. Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022.
  • Hwang et al. [2023] E. Hwang, B. P. Majumder, and N. Tandon. Aligning language models to user opinions. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
  • Izacard et al. [2022] G. Izacard, M. Caron, L. Hosseini, S. Riedel, P. Bojanowski, A. Joulin, and E. Grave. Unsupervised dense information retrieval with contrastive learning. In Transactions on Machine Learning Research (TMLR), 2022.
  • Kamalloo et al. [2023] E. Kamalloo, N. Dziri, C. Clarke, and D. Rafiei. Evaluating open-domain question answering in the era of large language models. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
  • Kandpal et al. [2023] N. Kandpal, H. Deng, A. Roberts, E. Wallace, and C. Raffel. Large language models struggle to learn long-tail knowledge. In Proceedings of the International Conference on Machine Learning (ICML), 2023.
  • Lester et al. [2021] B. Lester, R. Al-Rfou, and N. Constant. The power of scale for parameter-efficient prompt tuning. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
  • Li et al. [2023] J. Li, N. Mehrabi, C. Peris, P. Goyal, K.-W. Chang, A. Galstyan, R. Zemel, and R. Gupta. On the steerability of large language models toward data-driven personas. arXiv preprint arXiv:2311.04978, 2023.
  • Li and Liang [2021] X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Annual Meeting of the Association for Computational Linguistics (ACL), 2021.
  • OpenAI [2022] OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt, 2022.
  • OpenAI [2023] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [16] PewResearch. Writing survey questions. https://www.pewresearch.org/our-methods/u-s-surveys/writing-survey-questions.
  • Pillutla et al. [2021] K. Pillutla, S. Swayamdipta, R. Zellers, J. Thickstun, S. Welleck, Y. Choi, and Z. Harchaoui. Mauve: Measuring the gap between neural text and human text using divergence frontiers. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  • Prasad et al. [2023] A. Prasad, P. Hase, X. Zhou, and M. Bansal. Grips: Gradient-free, edit-based instruction search for prompting large language models. In Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023.
  • Pryzant et al. [2023] R. Pryzant, D. Iter, J. Li, Y. T. Lee, C. Zhu, and M. Zeng. Automatic prompt optimization with "gradient descent" and beam search. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
  • Razdaibiedina et al. [2023] A. Razdaibiedina, Y. Mao, R. Hou, M. Khabsa, M. Lewis, and A. Almahairi. Progressive prompts: Continual learning for language models. In International Conference on Learning Representations (ICLR), 2023.
  • Reimers and Gurevych [2019] N. Reimers and I. Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019.
  • Robertson et al. [2009] S. Robertson, H. Zaragoza, et al. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009.
  • Salemi et al. [2023] A. Salemi, S. Mysore, M. Bendersky, and H. Zamani. Lamp: When large language models meet personalization. arXiv preprint arXiv:2304.11406, 2023.
  • Santurkar et al. [2023] S. Santurkar, E. Durmus, F. Ladhak, C. Lee, P. Liang, and T. Hashimoto. Whose opinions do language models reflect? In Proceedings of the International Conference on Machine Learning (ICML), 2023.
  • Shin et al. [2020] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
  • Solaiman and Dennison [2021] I. Solaiman and C. Dennison. Process for adapting language models to society (palms) with values-targeted datasets. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  • Song et al. [2020] K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu. Mpnet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Stelmakh et al. [2022] I. Stelmakh, Y. Luan, B. Dhingra, and M.-W. Chang. Asqa: Factoid questions meet long-form answers. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
  • Sun et al. [2024] C. Sun, K. Yang, R. G. Reddy, Y. R. Fung, H. P. Chan, C. Zhai, and H. Ji. Persona-db: Efficient large language model personalization for response prediction with collaborative data refinement. arXiv preprint arXiv:2402.11060, 2024.
  • Team et al. [2023] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • Touvron et al. [2023] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Wang et al. [2023] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
  • Wang et al. [2022] Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister. Learning to prompt for continual learning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Xie et al. [2023] Y. Xie, J. Yi, J. Shao, J. Curl, L. Lyu, Q. Chen, X. Xie, and F. Wu. Defending chatgpt against jailbreak attack via self-reminders. Nature Machine Intelligence, pages 1–11, 2023.
  • Yang et al. [2024] C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen. Large language models as optimizers. In International Conference on Learning Representations (ICLR), 2024.
  • Zhao et al. [2024] S. Zhao, J. Dang, and A. Grover. Group preference optimization: Few-shot alignment of large language models. In International Conference on Learning Representations (ICLR), 2024.
  • Zheng et al. [2023] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  • Zhou et al. [2023] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba. Large language models are human-level prompt engineers. In International Conference on Learning Representations (ICLR), 2023.
  • Zhu et al. [2022] Q. Zhu, B. Li, F. Mi, X. Zhu, and M. Huang. Continual prompt tuning for dialog state tracking. In Annual Meeting of the Association for Computational Linguistics (ACL), 2022.

Appendix A Limitations and Broader Impact

A.1 Limitations and future work

Although we have conducted comprehensive experiments on various NLP tasks with multiple LLMs, results and analyses on more datasets, tasks, and LLMs would likely lead to more decisive conclusions. For example, the benchmarks tested in Section 4 are discriminative tasks, i.e., the correctness of the LLM's responses can be directly evaluated using the ground-truth response from the user, and hence it is easy to find the mis-aligned responses. In contrast, evaluating the correctness of an LLM's response (i.e., finding a proper metric) is challenging for generation tasks and remains under active discussion [1, 9]. Nevertheless, we believe that our framework is still applicable to generation tasks if a proper metric is given. For instance, ROUGE-L [28, 32] and MAUVE [5, 17] are popular metrics for measuring the quality of machine-generated responses against ground-truth human-generated responses. As these metrics range between 0 and 1, one can set a specific threshold (i.e., $\tau\in[0,1]$) to determine the mis-aligned responses under these metrics (see Step 1 in Section 3.2). In addition, LLM-as-judge [9, 37] is another emerging way to evaluate the correctness of generation; in this case, it is more straightforward to apply our framework, as it provides binary outputs just as in discriminative tasks. However, finding a proper metric for each generation task is itself still a difficult problem, and hence we expect that this direction could be explored in the future.
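The thresholding idea for generation tasks can be sketched as follows. This is an illustration under stated assumptions rather than part of Fermi as evaluated: the score is a simplified token-level ROUGE-L F1 computed from the longest common subsequence with whitespace tokenization, the threshold value 0.5 is arbitrary, and treating a score strictly below $\tau$ as mis-aligned is a choice made here. All function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 in [0, 1] over whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def misaligned_indices(responses, references, tau=0.5):
    """Indices whose score against the user's reference falls below the threshold tau."""
    return [
        i for i, (c, r) in enumerate(zip(responses, references))
        if rouge_l_f1(c, r) < tau
    ]


responses = ["the cat sat on the mat", "completely unrelated text"]
references = ["the cat sat on a mat", "the cat sat on a mat"]
print(misaligned_indices(responses, references, tau=0.5))
```

The returned indices would play the same role as the mis-aligned QAs in the discriminative setting, where a response is simply mis-aligned when it differs from the user's answer.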

In addition, while we show that the proposed framework can find personalized prompts by learning from the given user information, we also observe that its success highly depends on the capability of the LLM used for the optimization (i.e., generating new prompts from the memory in Eq. 5), as shown in Figure 6. Since our approach requires a number of optimization iterations to provide high-quality personalized prompts, a certain amount of cost is inevitably required. However, as we demonstrated in the experiments, the personalized prompts from our method transfer well to other LLMs that are not used during optimization (Table 3), can be continuously updated with enlarged data through user interactions (Table 8), and can also be reused to convert previous prompts into a proper format for LLMs (Table 5). Therefore, we believe that, once the cost of the initial optimization has been paid, our approach could be an even more efficient way to personalize than heuristic prompt design.

A.2 Broader impact and ethical implications

We strongly believe that Fermi can provide a strong positive impact in real-world applications that require personalized responses for a given user, e.g., search engines or chatbots. We expect that our framework would be especially beneficial for users belonging to under-populated social groups, since LLMs are known to follow the knowledge or opinions of the major population within the pre-training data [10, 24]. On the other hand, there also exist some potential negative impacts. Since our framework needs to provide personal information to LLMs (mostly through an API), it poses a potential privacy risk when the provider of the LLM does not follow safeguards and collects the given information. In addition, as our framework does not separately filter the resulting prompts, they can include prompts that have socially negative impacts, e.g., jailbreaks of LLMs [2]. We believe that the incorporation of an additional filtering step could be a solution to this problem [34].

Appendix B More Analyses with Fermi

In this section, we provide more analyses of Fermi in addition to the analyses in Section 4.3.

Importance of using strong LLM for optimization $\mathcal{M}_{\tt opt}$.

As noted in Section 4.1, we use GPT-4 as the LLM $\mathcal{M}_{\tt opt}$ to generate new prompts from the optimization memory (Eq. 5) in all the experiments in Section 4. To validate this design choice, we conduct experiments that replace GPT-4 with ChatGPT as $\mathcal{M}_{\tt opt}$ in both OPRO and Fermi. Figure 6 shows the optimization trajectory in terms of training accuracy (i.e., the average accuracy of the predictions by $\mathcal{M}$ on previous user opinions). Here, one can observe that both OPRO and Fermi struggle to optimize the prompt when ChatGPT is used as $\mathcal{M}_{\tt opt}$, similar to the previous observation [35]; this reveals that generating improved prompts from the optimization memory with previous prompts, scores, and contexts requires complex reasoning capability. Therefore, using a strong LLM such as GPT-4 is necessary.

Figure 6: Optimization trajectory under different LLMs for $\mathcal{M}_{\tt opt}$. Average training accuracies on GlobalOpinionQA across optimization iterations ($T=10$) under OPRO and Fermi.
Table 6: Fermi only using GPT-4. Test accuracy of GPT-4 over the different prompting methods on GlobalOpinionQA. For Few-shot, we use Contriever which shows higher accuracy in Table 1. For OPRO and Fermi, prompts optimized on ChatGPT are directly used. The best and second best scores are highlighted in bold and underline, respectively.

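| Models \ Methods | Vanilla | Profile | Few-shot | All Info | OPRO | Fermi | Fermi (optimized with GPT-4 only) |
|---|---|---|---|---|---|---|---|
| GPT-4 | 56.7 | 77.7 | 68.9 | 78.2 | 76.7 | 84.8 | 86.7 |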

Table 7: Different initial prompts. Test accuracy of ChatGPT over the different prompting methods on GlobalOpinionQA.

| Models \ Methods | Vanilla | Profile | Fermi$_{\text{vanilla}}$ | Fermi |
|---|---|---|---|---|
| ChatGPT | 62.8 | 66.1 | 69.9 | 74.8 |

Optimization with a stronger LLM for evaluation $\mathcal{M}$.

Next, to explore the compatibility of Fermi with different configurations of the two LLMs used during the optimization, we conduct additional experiments that replace ChatGPT with GPT-4 as the evaluating LLM $\mathcal{M}$; namely, both the evaluating LLM $\mathcal{M}$ and the generating LLM $\mathcal{M}_{\tt opt}$ are GPT-4. The results on GlobalOpinionQA are presented in Table 6. Using a stronger LLM $\mathcal{M}$ for evaluation (Eq. 2) yields personalized prompts with further improved test accuracy. For example, compared to the personalized prompts optimized with ChatGPT as $\mathcal{M}$ (Fermi), the optimization using only GPT-4 improves test accuracy by an additional 1.9% (84.8 → 86.7). This result shows that the proposed Fermi is compatible with evaluating LLMs of different types and capacities.

Importance of initial prompts in Fermi.

We used a fixed initial prompt template across all datasets in our experiments, one that maximally incorporates the given user profiles, as this has proven effective in prior studies [7, 24, 36]; see Section 3.2 and Appendix C.3.

Nevertheless, to provide further insight into the impact of initial prompt templates on Fermi, we conduct additional experiments by varying the initial prompt set $P^{0}=\{\text{p}_{0}\}$. Specifically, on the GlobalOpinionQA dataset, we exclude the user profiles from the construction of the initial prompt, unlike the original Fermi (in Table 1), and use the prompt of Vanilla for the initialization. We denote this version as Fermi$_{\text{vanilla}}$. The results, in comparison with other methods, are shown in Table 7: Fermi consistently outperforms the baselines with both choices of prompt initialization, while the gain is larger with the better initialization that incorporates the user profile.

Table 8: Continual prompt optimization. Test accuracy of ChatGPT over the different prompting methods on GlobalOpinionQA. denotes that the results are obtained with a two times larger pool for Retrieval-of-Prompt.

| Models \ Methods | Profile | OPRO | Fermi$_{\text{half}}$ | Fermi$^{\tt cont}_{\text{iter 1}}$ | Fermi$^{\tt cont}_{\text{iter 5}}$ | Fermi |
|---|---|---|---|---|---|---|
| ChatGPT | 66.1 | 71.1 | 73.0 | 74.0 | 74.9 | 74.8 |

Continual optimization of prompts.

In the previous experiments, we assumed that a fixed dataset $U_{\tt opi}$ of questions and the user's opinions is given. However, in the real world, users often interact frequently with LLMs, which means that the dataset could be continuously updated. Therefore, the iterative process of refining prompts might incur significant computational costs if it had to be conducted from scratch at certain intervals (e.g., when the number of new data points reaches a threshold).

To mitigate this issue, we conduct additional experiments to show that the idea of continual prompt optimization [20, 33, 39] can be applied to Fermi, and hence that this cost can be drastically reduced. Specifically, we first run Fermi using half of the previous questions and the user's responses in $U_{\tt opi}$ (denoted by Fermi$_{\text{half}}$). Other parameters, such as the 10 optimization iterations, are kept the same. Then, with the entire $U_{\tt opi}$, we continue running Fermi for a limited number of iterations, initializing the prompt pool with the prompts previously optimized in Fermi$_{\text{half}}$ (i.e., substituting the initialization in line 116). We denote the results of this continual optimization with 1 and 5 iterations as Fermi$^{\tt cont}_{\text{iter 1}}$ and Fermi$^{\tt cont}_{\text{iter 5}}$, respectively.
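The warm-start procedure described above can be sketched as follows. This is only an illustrative outline under stated assumptions, not the released implementation: `step` is a hypothetical callable standing in for one full Fermi iteration (scoring, memory update, and prompt generation), "half" is assumed to mean the first half of the data, and the function names are placeholders.

```python
from typing import Callable, List, Sequence, Tuple

Prompt = str
Example = Tuple[str, str]  # hypothetical format: (question, user's answer)
# Hypothetical signature of one optimization iteration: (pool, data) -> new pool.
StepFn = Callable[[List[Prompt], Sequence[Example]], List[Prompt]]


def run_iterations(init_pool: List[Prompt], data: Sequence[Example],
                   step: StepFn, n_iters: int) -> Tuple[List[Prompt], List[Prompt]]:
    """Run n_iters optimization steps; return (final pool, all prompts seen)."""
    pool = list(init_pool)
    seen = list(init_pool)
    for _ in range(n_iters):
        pool = step(pool, data)
        for p in pool:
            if p not in seen:
                seen.append(p)
    return pool, seen


def continual_optimization(init_prompt: Prompt, data: Sequence[Example], step: StepFn,
                           full_iters: int = 10, cont_iters: int = 1
                           ) -> Tuple[List[Prompt], List[Prompt]]:
    """Optimize on half of the data, then continue briefly on all of it."""
    half = data[: len(data) // 2]  # assumption: "half" is the first half
    pool_half, seen_half = run_iterations([init_prompt], half, step, full_iters)
    # Warm start: the pool optimized on half of the data replaces the initial prompt.
    pool_cont, seen_cont = run_iterations(pool_half, data, step, cont_iters)
    # Earlier prompts are kept so they remain available for later retrieval.
    retrieval_pool = list(seen_half)
    for p in seen_cont:
        if p not in retrieval_pool:
            retrieval_pool.append(p)
    return pool_cont, retrieval_pool
```

With `cont_iters=1` or `cont_iters=5`, this mirrors the two continual settings compared in the text; the only change relative to a from-scratch run is the initial pool passed to the second stage.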

The results are presented in Table 8. First, even with the reduced amount of data for the optimization, Fermi$_{\text{half}}$ still outperforms the strong baselines that are based on heuristic prompt engineering (Profile) or on optimization by LLMs under full data (OPRO). However, the accuracy under full data is higher by 1.8 points (74.8 vs. 73.0), which reveals that data quantity is still important for Fermi. Next, we observe that the prompts can be successfully optimized in a continual manner when new data is added. Here, we note that the previously optimized prompts in Fermi$_{\text{half}}$ are also re-used for the pool of Retrieval-of-Prompt, to keep the knowledge of previous iterations (integrating new prompts into each user's retrieval pool adds minimal computational overhead for calculating their embeddings). Remarkably, even with only 1 additional iteration of optimization, the accuracy increases by 1.0 point (73.0 → 74.0). Also, when increasing the number of iterations to 5 (i.e., the same amount of computation as the original Fermi), the accuracy increases further and slightly outperforms the original optimization under the full data (74.9 vs. 74.8). This improvement might come from the enlarged pool of Retrieval-of-Prompt, which enables better exploitation of the previous knowledge.

These results clearly show that the proposed framework is still effective for a more realistic scenario under the continuously updated user data.

Appendix C Experimental Details

Table 9: Dataset statistics. More descriptions and statistics of datasets used in experiments.

| Dataset | Task | Users | Types of User Profiles | # of Previous Opinions | # of Test Questions |
|---|---|---|---|---|---|
| OpinionQA | Multiple Choice QA | 525 | Demographic and Ideology | 10.5k | 15.8k |
| GlobalOpinionQA | Multiple Choice QA | 46 | Nationality | 920 | 1,317 |
| LaMP$_{\text{tag}}$ | 15-way Movie Tagging | 50 | Not Available | 1,000 | 1,500 |
| LaMP$_{\text{rate}}$ | 5-scale Review Rating | 50 | Not Available | 1,000 | 1,500 |

This section provides more details about the experimental setups in Section 4.

Figure 7: An overview of datasets. OpinionQA [24] (1st row), GlobalOpinionQA [4] (2nd row), LaMP$_{\text{tag}}$ (3rd row), and LaMP$_{\text{rate}}$ [23] (4th row).
Table 10: Information of GlobalOpinionQA. List of 46 countries in the constructed dataset from GlobalOpinionQA.

Countries: Greece, Sweden, China (Non-national sample), Colombia, Tunisia, Malaysia, Vietnam, Argentina, Bulgaria, Russia, Egypt, Indonesia, Jordan, Mexico, Pakistan, Palest. ter., Tanzania, Turkey, Ukraine, Kenya, Ghana, Canada, France, Germany, Lebanon, Peru, Poland, S. Korea, Italy, Spain, United States, Brazil, Chile, Japan, Venezuela, Senegal, Britain, Australia, Netherlands, Uganda, Nigeria, Philippines, Ethiopia, Myanmar, Maldives, Libya

C.1 Datasets

First, we present more detailed descriptions of the datasets used: OpinionQA [24], GlobalOpinionQA [4], LaMP$_{\text{tag}}$, and LaMP$_{\text{rate}}$ [23]. Dataset statistics are presented in Table 9, and an example from each dataset is presented in Figure 7.

  ∘ OpinionQA is a multiple-choice QA dataset originally constructed from a public opinion survey [16] to evaluate the alignment of LMs with 60 US demographic groups over various topics. As OpinionQA includes the information of each respondent, this dataset has also been used to evaluate the personalization of LLMs [7], and we adopt it for the same purpose. Specifically, we use a subsampled split released by Hwang et al. [7], which consists of 10.5k training and 15.8k test QA pairs across 525 users and 15 topics; namely, each user has 20 training QA pairs and 30 test QA pairs for each topic, on average. The average number of answer choices is 3.2. We use the training QA pairs as the given previous opinions of the user, and the test QA pairs for evaluation. In addition, for the experiments, we use all 12 types of user profiles included in the dataset: {Age, Citizenship, Region, Education, Income, Marital status, Political ideology, Political party, Race, Religion, Frequency of religious attendance, Gender}.

  ∘ GlobalOpinionQA is a multiple-choice QA dataset constructed from cross-national surveys to capture diverse opinions on global issues across different countries. Since the dataset originally included the answer distribution over multiple respondents in the same country, we converted it to have a single answer by selecting the choice with the highest probability, and treated each country as a specific user. Specifically, we set a threshold (0.8) and only use an example when its highest probability is higher than the threshold, to guarantee the quality of the converted data. This results in 920 training and 1,317 test QA pairs across 46 countries; namely, each user (country) has 20 training QA pairs and 28.6 test QA pairs for each topic, on average. The average number of answer choices is 4.1. We use the training QA pairs as the given previous opinions of the user, and the test QA pairs for evaluation. Nationality is the only available profile. The full list of countries included in the dataset is presented in Table 10. The dataset can be downloaded from https://huggingface.co/datasets/Anthropic/llm_global_opinions.

  ∘ LaMP$_{\text{tag}}$ is a 15-way classification dataset where the input is a movie description and the label is the corresponding movie tag among 15 categories: {Sci-fi, Based on a book, Comedy, Action, Twist ending, Dystopia, Dark comedy, Classic, Psychology, Fantasy, Romance, Thought-provoking, Social commentary, Violence, True story}. Since the original dataset was proposed for the scenario of fine-tuning LMs and hence consists of a large number of examples, we construct our dataset by subsampling from its validation set to make it suitable for evaluating LLMs through inference. This results in 1,000 training and 1,500 test QA pairs across 50 users.

  ∘ LaMP$_{\text{rate}}$ is a regression dataset where the input is a user review and the label is an integer rating (1-5), i.e., 1 is mostly negative and 5 is mostly positive. With the same motivation as for LaMP$_{\text{tag}}$, we construct our dataset by subsampling from its validation set, which results in 1,000 training and 1,500 test QA pairs across 50 users. The LaMP benchmarks can be downloaded from https://github.com/LaMP-Benchmark/LaMP.
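The GlobalOpinionQA conversion described above (take the most probable choice, and keep the example only if that probability exceeds 0.8) can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing script: the record keys `country`, `question`, and `distribution` are hypothetical names chosen for the example, and only the threshold value and the strict "higher than" comparison come from the text.

```python
from typing import Dict, List, Optional

THRESHOLD = 0.8  # value stated in the dataset description


def to_single_answer(probs: List[float], threshold: float = THRESHOLD) -> Optional[int]:
    """Return the index of the most probable choice, or None if it is not
    strictly above the threshold (the example is then discarded)."""
    if not probs:
        return None
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] > threshold else None


def convert_dataset(records: List[Dict], threshold: float = THRESHOLD) -> List[Dict]:
    """Convert per-country answer distributions into single-answer QA pairs.

    Each record is assumed (hypothetically) to have the keys
    'country', 'question', and 'distribution'.
    """
    converted = []
    for rec in records:
        answer = to_single_answer(rec["distribution"], threshold)
        if answer is not None:
            converted.append({
                "user": rec["country"],  # each country is treated as one user
                "question": rec["question"],
                "answer": answer,
            })
    return converted
```

Raising the threshold trades dataset size for label confidence: fewer examples survive, but each retained answer reflects a stronger majority within the country.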

C.2 Baselines

In this section, we present the specific prompts used for the experiments in Section 4. The listings in Appendix C.3 show the prompts actually used for Vanilla, Profile, Few-shot, and All Info during the experiments on GlobalOpinionQA. The prompt of OPRO used for the optimization is presented in Figure 9; it is the one originally used by Yang et al. [35]. When we tried to adapt this prompt to be similar to ours in Figure 8, we observed that this degraded the performance of OPRO; for example, the average test accuracy dropped from 71.1% to 70.7%. Therefore, we use the original prompt for all the experiments. We remark that each prompt is minimally adjusted to account for the differences between datasets. For example, as OpinionQA includes many available user profiles, we fully incorporate these into the Profile prompt for OpinionQA, following Hwang et al. [7]. We also present the prompt of the Vanilla method on the LaMP$_{\text{rate}}$ dataset, as well as the prompt used to convert the format of the input prompt by Few-shot (Table 5), in the listings in Appendix C.3.

C.3 Fermi

Figure 8: Detailed prompt example. Example of detailed input prompt for $\mathcal{M}_{\tt opt}$ to generate new prompts, composed of fixed input prompt $\text{p}_{\tt opt}$ (including fixed few-shot demonstrations) and optimization memory $M^{t}$ (Eq. 5) on OpinionQA dataset.

As noted in Section 3.2, we need to provide an initial input prompt set $\text{P}^{0}=\{\text{p}^{0}\}$. To this end, we use the heuristically designed input prompts presented in Appendix C.2. Specifically, when the user profile $U_{\tt pro}$ is available (both OpinionQA and GlobalOpinionQA), we adopt the prompts used for Profile, tuned for each dataset. Since our framework only utilizes the given few-shot previous opinions during the optimization, this way of initial prompting naturally enables us to fully utilize all the user information. When the user profile is not available, we adopt the prompts used for Vanilla. In addition, we present a more detailed version of the prompt $\text{p}_{\tt opt}$ used to generate new input prompts with $\mathcal{M}_{\tt opt}$ in Figure 8. We remark that $\text{p}_{\tt opt}$ is minimally adjusted across datasets to match the different task and user information of each dataset.

```python
f'''
Choose the proper answer to the given question among the given answer choices. Your answer should be a single alphabet among given answer choices:

Question: {question}

Answer choices: {answer choice}

Answer: '''
```

Listing: Input prompt used for Vanilla method on GlobalOpinionQA.

```python
f'''
Choose the proper answer to the given question among the given answer choices, as if you currently reside in {user profile}. Your answer should be a single alphabet among given answer choices:

Question: {question}

Answer choices: {answer choice}

Answer: '''
```

Listing: Input prompt used for Profile method on GlobalOpinionQA.

```python
f'''
[1]. Question: {question of 1st retrieval among previous opinions}

Answer choices: {answer choice of 1st retrieval among previous opinions}

Answer: {answer of 1st retrieval among previous opinions}

[N]. Question: {question of Nth retrieval among previous opinions}

Answer choices: {answer choice of Nth retrieval among previous opinions}

Answer: {answer of Nth retrieval among previous opinions}

Based on the above previous questions and answers, choose the proper answer to the given question among the given answer choices. Your answer should be a single alphabet among given answer choices:

Question: {question}

Answer choices: {answer choice}

Answer: '''
```

Listing: Input prompt used for Few-shot method.

```python
f'''
[1]. Question: {question of 1st retrieval among previous opinions}

Answer choices: {answer choice of 1st retrieval among previous opinions}

Answer: {answer of 1st retrieval among previous opinions}

[N]. Question: {question of Nth retrieval among previous opinions}

Answer choices: {answer choice of Nth retrieval among previous opinions}

Answer: {answer of Nth retrieval among previous opinions}

Based on the above previous questions and answers, choose the proper answer to the given question among the given answer choices, as if you currently reside in {explicit_profile}. Your answer should be a single alphabet among given answer choices:

Question: {question}

Answer choices: {answer choice}

Answer: '''
```

Listing: Input prompt used for All Info method.
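As an illustration of how the Few-shot template above could be filled in programmatically, the following sketch assembles the prompt from a list of retrieved previous opinions. The instruction text is copied from the Few-shot listing; the function name `build_few_shot_prompt` and the `(question, choices, answer)` tuple layout are assumptions made for this example, and the retrieval step itself (BM25 or Contriever) is not shown.

```python
from typing import List, Tuple

# Hypothetical record layout: (question, answer_choices, answer).
Opinion = Tuple[str, str, str]

# Instruction text taken from the Few-shot prompt listing.
INSTRUCTION = (
    "Based on the above previous questions and answers, choose the proper "
    "answer to the given question among the given answer choices. Your answer "
    "should be a single alphabet among given answer choices:"
)


def build_few_shot_prompt(retrieved: List[Opinion], question: str, choices: str) -> str:
    """Format retrieved opinions as numbered demonstrations, then append the query."""
    demos = []
    for i, (q, c, a) in enumerate(retrieved, start=1):
        demos.append(f"[{i}]. Question: {q}\n\nAnswer choices: {c}\n\nAnswer: {a}")
    demo_block = "\n\n".join(demos)
    query_block = f"Question: {question}\n\nAnswer choices: {choices}\n\nAnswer: "
    return f"{demo_block}\n\n{INSTRUCTION}\n\n{query_block}"
```

The All Info variant would differ only in the instruction sentence, which additionally conditions on the user profile.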

Figure 9: Prompt of OPRO. Prompt $\text{p}_{\tt opt}$ used for prompt optimization by OPRO [35].
```python
f'''
A person can be described as follows:
Age: {age in user profile}
Citizenship in America: {citizenship in America in user profile}
Region: {region in user profile}
Education: {education in user profile}
Income: {income in user profile}
Marital status: {marital status in user profile}
Political ideology: {political ideology in user profile}
Political party: {political party in user profile}
Race: {race in user profile}
Religion: {religion in user profile}
Frequency of religious attendance: {frequency of religious attendance in user profile}
Gender: {gender in user profile}

Based on the demographic information, choose the proper answer to the given question among the given answer choices. Your answer should be a single alphabet among given answer choices:

Question: {question}

Answer choices: {answer choice}

Answer: '''
```

Listing: Input prompt used for Profile method on OpinionQA.

```python
f'''
Answer to the given question. Just answer with 1, 2, 3, 4, or 5 without further explanation:

Question: {question}

Answer choices: {answer choice}

Answer: '''
```

Listing: Input prompt used for Vanilla method on LaMP$_{\text{rate}}$.

```python
f'''
The followings are two different prompts used to answer the question.

[Input prompt]: {prompt by Few-shot}

[Target prompt]: {prompt optimized by Fermi}

You need to convert the input prompt to the format of the target prompt while preserving the original contexts in the input prompt.

Converted prompt: '''
```

Listing: Prompt used to convert the format of input prompt by Few-shot to be instruction with multiple sentences.

Appendix D Additional Quantitative Results

In this section, we provide additional quantitative results that could not be presented in the main text due to limited space. First, in Table 11, we present the average and standard deviation of topic-wise accuracy, i.e., the average and standard deviation are calculated across 35 users where each user receives 30 test questions in the same topic. Next, in Table 12, we present the test performance of the Few-shot method in Section 4 under different numbers of retrieved opinions. Lastly, we present the test performance under different numbers of considered training questions $\tilde{N}$ (Eq. 7). As one can see in Table 13, $\tilde{N}=3$, which is commonly used in our experiments, shows consistent improvements in general, although the optimal values differ across the datasets.
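The topic-wise statistic described above (per-user accuracy, then mean and standard deviation across the users of one topic) can be computed as in the sketch below. The input layout and the function name are assumptions made for illustration, and the text does not state whether the population or the sample standard deviation is reported; the population form (`pstdev`) is used here as an assumption.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def topic_accuracy(per_user_correct: Dict[str, List[bool]]) -> Tuple[float, float]:
    """Return (mean, std) of per-user accuracy in percent for one topic.

    per_user_correct maps a (hypothetical) user id to one boolean per test
    question, True when the model's answer matched the user's answer.
    """
    accs = [100.0 * sum(flags) / len(flags) for flags in per_user_correct.values()]
    # Assumption: population standard deviation; swap in statistics.stdev
    # for the sample standard deviation.
    return mean(accs), pstdev(accs)
```

Averaging per-user accuracies, rather than pooling all questions, gives every user equal weight regardless of how many test questions they have.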

Table 11: Detailed topic-wise accuracy. Average topic-wise accuracy and standard deviation with different methods on OpinionQA.

| Topics \ Methods | Vanilla | Few-shot$_{\text{cont}}$ | OPRO | Fermi |
|---|---|---|---|---|
| Guns | 45.3 ± 9.6 | 54.2 ± 13.7 | 54.7 ± 9.0 | 57.4 ± 14.5 |
| Auto. vehicles | 46.0 ± 10.9 | 48.7 ± 10.0 | 50.2 ± 9.5 | 53.2 ± 10.6 |
| Views on gender | 39.7 ± 10.4 | 49.0 ± 7.8 | 52.9 ± 11.5 | 58.9 ± 8.8 |
| Sex. harassment | 38.0 ± 10.9 | 40.4 ± 10.4 | 46.1 ± 9.4 | 47.7 ± 10.4 |
| Biomedical & food | 54.8 ± 10.6 | 59.9 ± 11.9 | 61.0 ± 11.1 | 63.7 ± 10.4 |
| Gender & Leadership | 49.9 ± 12.5 | 53.0 ± 10.6 | 54.9 ± 11.7 | 59.5 ± 9.0 |
| America in 2050 | 48.6 ± 12.2 | 46.4 ± 10.8 | 44.6 ± 10.5 | 49.8 ± 10.8 |
| Trust in science | 49.0 ± 9.9 | 56.1 ± 10.8 | 54.8 ± 10.4 | 60.7 ± 7.8 |
| Race | 38.8 ± 7.8 | 46.8 ± 6.9 | 43.4 ± 11.0 | 49.3 ± 13.7 |
| Misinformation | 49.7 ± 11.7 | 50.5 ± 7.4 | 46.6 ± 9.2 | 52.3 ± 9.0 |
| Privacy & Surveillance | 41.5 ± 10.4 | 49.5 ± 9.2 | 46.6 ± 9.9 | 50.6 ± 10.6 |
| Family & Relationships | 51.4 ± 10.2 | 53.2 ± 12.1 | 50.9 ± 13.3 | 56.3 ± 11.9 |
| Economic inequality | 40.9 ± 9.2 | 47.0 ± 9.4 | 49.3 ± 12.7 | 53.5 ± 9.0 |
| Global attitudes | 46.3 ± 13.6 | 49.7 ± 12.3 | 47.9 ± 12.0 | 50.8 ± 13.9 |
| Political views | 43.2 ± 12.6 | 42.4 ± 9.2 | 48.9 ± 9.8 | 53.9 ± 11.8 |

Table 12: Different number of retrieval. Test performance of ChatGPT under different configurations for Few-shot method. Here, $k$ denotes the number of retrieved opinions. The best scores are highlighted in bold.

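| Methods | OpinionQA (Acc.) | GlobalOpinionQA (Acc.) | LaMP$_{\text{tag}}$ (Acc.) | LaMP$_{\text{rate}}$ (MAE) |
|---|---|---|---|---|
| Few-shot$_{\text{bm25}}$ ($k=3$) | 49.8 | 59.1 | 34.9 | 0.40 |
| Few-shot$_{\text{bm25}}$ ($k=8$) | 48.3 | 59.1 | 35.9 | 0.41 |
| Few-shot$_{\text{cont}}$ ($k=3$) | 49.3 | 61.2 | 35.6 | 0.36 |
| Few-shot$_{\text{cont}}$ ($k=8$) | 48.7 | 58.2 | 36.2 | 0.38 |
| Few-shot$_{\text{all}}$ ($k=20$) | 47.9 | 56.3 | 35.8 | 0.46 |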

Table 13: Different $\tilde{N}$ for RoP. Test performance of ChatGPT under different $\tilde{N}$ for RoP (Eq. 7).

| $\tilde{N}$ | OpinionQA (Acc.) | GlobalOpinionQA (Acc.) | LaMP$_{\text{tag}}$ (Acc.) | LaMP$_{\text{rate}}$ (MAE) |
|---|---|---|---|---|
| $\tilde{N}=1$ | 54.6 | 74.8 | 37.8 | 0.341 |
| $\tilde{N}=3$ | 54.5 | 74.8 | 37.8 | 0.343 |
| $\tilde{N}=5$ | 54.5 | 74.4 | 37.5 | 0.341 |
| $\tilde{N}=10$ | 54.1 | 74.1 | 37.7 | 0.347 |
| $\tilde{N}=20$ | 54.3 | 74.2 | 36.7 | 0.338 |

Appendix E More Comparison Examples between Personalized Prompts

In this section, we present more qualitative comparisons between the prompts from different methods for the personalization of LLMs. Specifically, we present a test query from each dataset and three corresponding prompts from the heuristic design, OPRO, and Fermi. Figures 10-17 show the comparison results on the four datasets used in Section 4. Interestingly, one can observe that the personalized prompts by Fermi incorporate user information in non-trivial ways. In addition, we present examples of format-converted versions of few-shot prompting with previous user opinions (i.e., Few-shot$_{\text{format}}$ in Table 5) in Figures 18 and 19. Here, one can observe that the converted prompts have a form similar to the personalized prompts by Fermi, which is more natural for LLMs to understand and follow, and hence they improve the performance by up to 10.1%, as shown in Table 5.

Figure 10: Comparison of prompts on OpinionQA. Example of question from OpinionQA (1st row), and the prompts used to answer this question with All Info (2nd row), OPRO (3rd row), and Fermi (4th row).

Figure 11: Comparison of prompts on OpinionQA. Example of question from OpinionQA (1st row), and the prompts used to answer this question with All Info (2nd row), OPRO (3rd row), and Fermi (4th row).

Figure 12: Comparison of prompts on GlobalOpinionQA. Example of question from GlobalOpinionQA (1st row), and the prompts used to answer this question with All Info (2nd row), OPRO (3rd row), and Fermi (4th row).

Figure 13: Comparison of prompts on GlobalOpinionQA. Example of question from GlobalOpinionQA (1st row), and the prompts used to answer this question with All Info (2nd row), OPRO (3rd row), and Fermi (4th row).

Figure 14: Comparison of prompts on LaMP$_{\text{tag}}$. Example of question from LaMP$_{\text{tag}}$ (1st row), and the prompts used to answer this question with Few-shot$_{\text{cont}}$ (2nd row), OPRO (3rd row), and Fermi (4th row).

Figure 15: Comparison of prompts on LaMP$_{\text{tag}}$. Example of question from LaMP$_{\text{tag}}$ (1st row), and the prompts used to answer this question with Few-shot$_{\text{cont}}$ (2nd row), OPRO (3rd row), and Fermi (4th row).

Figure 16: Comparison of prompts on LaMP$_{\text{rate}}$. Example of question from LaMP$_{\text{rate}}$ (1st row), and the prompts used to answer this question with Few-shot$_{\text{cont}}$ (2nd row), OPRO (3rd row), and Fermi (4th row).

Figure 17: Comparison of prompts on LaMP$_{\text{rate}}$. Example of question from LaMP$_{\text{rate}}$ (1st row), and the prompts used to answer this question with Few-shot$_{\text{cont}}$ (2nd row), OPRO (3rd row), and Fermi (4th row).

Figure 18: Example of format-converted prompts. Example of question from GlobalOpinionQA (1st row), and the prompts used to answer this question with Few-shot$_{\text{all}}$ (2nd row) and Format-converted prompts (Few-shot$_{\text{format}}$) by prompting GPT-4 to convert the format using the personalized prompts by Fermi as reference (3rd row).

Figure 19: Example of format-converted prompts. Example of question from GlobalOpinionQA (1st row), and the prompts used to answer this question with Few-shot$_{\text{all}}$ (2nd row) and Format-converted prompts (Few-shot$_{\text{format}}$) by prompting GPT-4 to convert the format using the personalized prompts by Fermi as reference (3rd row).