When Search Engine Services meet Large Language Models: Visions and Challenges

Haoyi Xiong, Senior Member, IEEE, Jiang Bian, Member, IEEE, Yuchen Li, Xuhong Li, Mengnan Du, Member, IEEE, Shuaiqiang Wang, Dawei Yin, Senior Member, IEEE, and Sumi Helal, Fellow, IEEE
Abstract

Combining Large Language Models (LLMs) with search engine services marks a significant shift in the field of services computing, opening up new possibilities to enhance how we search for and retrieve information, understand content, and interact with internet services. This paper conducts an in-depth examination of how integrating LLMs with search engines can mutually benefit both technologies. We focus on two main areas: using search engines to improve LLMs (Search4LLM) and enhancing search engine functions using LLMs (LLM4Search). For Search4LLM, we investigate how search engines can provide diverse high-quality datasets for pre-training of LLMs, how they can use the most relevant documents to help LLMs learn to answer queries more accurately, how training LLMs with Learning-To-Rank (LTR) tasks can enhance their ability to respond with greater precision, and how incorporating recent search results can make LLM-generated content more accurate and current. In terms of LLM4Search, we examine how LLMs can be used to summarize content for better indexing by search engines, improve query outcomes through optimization, enhance the ranking of search results by analyzing document relevance, and help in annotating data for learning-to-rank tasks in various learning contexts. However, this promising integration comes with its challenges, which include addressing potential biases and ethical issues in training models, managing the computational and other costs of incorporating LLMs into search services, and continuously updating LLM training with the ever-changing web content. We discuss these challenges and chart out required research directions to address them. We also discuss broader implications for service computing, such as scalability, privacy concerns, and the need to adapt search engine architectures for these advanced models.

Index Terms:
Large Language Models (LLMs), Search Engines, Learning-to-Rank (LTR), and Retrieval-Augmented Generation (RAG)

1 Introduction

The dawn of the Internet services age has brought forth a deluge of information, making the role of search engines more critical than ever in navigating this vast digital landscape [65, 10]. For instance, as of January 2024, the total number of websites worldwide had reached 1.079 billion, a roughly sixfold increase over the 185 million websites recorded 15 years earlier (footnote 1: https://siteefy.com/how-many-websites-are-there/). However, as the complexity of user queries and the expectation for precise, contextually relevant, and up-to-date responses grow, traditional search technologies face mounting challenges in meeting these demands. Considerable advancements have been made in natural language processing (NLP) and information retrieval (IR) technologies [19, 30]. These efforts aim to enhance the ability of machines to accurately fetch content from the vast number of websites available online, efficiently store and index this content, comprehend user queries with higher precision, and deliver relevant, accurate, and current content in an organized manner [39, 40].

On the other hand, Large Language Models (LLMs), the cornerstones of generative artificial intelligence (GenAI), have shown remarkable capabilities in understanding, generating, and augmenting human language [11, 78]. The potential integration of LLMs with search engine services presents an exciting frontier in services computing, promising to significantly enhance search functionalities and redefine user interaction with digital information systems. For example, the new Bing utilizes ChatGPT to perform Retrieval-Augmented Generation (RAG) [60] by injecting search results into the context of the LLM, generating comprehensive responses based on the most relevant and current information retrieved from its database (footnote 2: https://www.microsoft.com/en-us/edge/features/the-new-bing). From the perspective of LLMs, this integration enhances their accuracy and informativeness by allowing them to access and incorporate real-time data and diverse content from the web, thereby expanding their knowledge base beyond pre-training/fine-tuning datasets and enabling them to provide more accurate, contextually relevant, and up-to-date responses to user queries. In particular, search engines could help LLMs counter hallucinations, an innate tendency of almost every LLM [107, 29]. From the perspective of search engines, leveraging LLMs equipped with RAG capabilities enriches the user experience by offering more meticulous and contextually aware responses. This not only improves search accuracy but also helps users handle complex queries, thereby increasing user satisfaction and engagement with the platform [60, 29].

In this work, we aim to explore the symbiotic relationship between LLMs and search engines, investigating how each can leverage the strengths of the other to overcome their respective limitations and to enhance their capabilities. As shown in Fig. 1, the technologies driving the development of search engines and AI models have historically co-evolved. Both revolutionary streams of technology emerged around the same time, starting with the conceptualization of the Memex by Vannevar Bush and the pioneering work on artificial neurons by McCulloch and Pitts [94, 87]. These foundational technologies paved the way for significant advancements, such as the World Wide Web (WWW) and PageRank for search engines, and Backpropagation, Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and Long Short-Term Memory (LSTM) models in AI from the 1980s to the 1990s [8, 10, 56]. Following the historic win of AlexNet at the ImageNet competition in 2012 [56], Google elevated its retrieval and ranking components by integrating neural networks (footnote 3: https://blog.google/products/search/how-ai-powers-great-search-results/). This propelled the continuous advancement of AI with the introduction of Transformer models with self-attention mechanisms, and BERT for enhanced query and content understanding in the late 2010s [118, 25]. More recently, OpenAI introduced the Generative Pre-trained Transformer (GPT) to start GenAI, and launched its groundbreaking online chatbot service, known as ChatGPT [95]. Subsequently, in 2023, Microsoft integrated ChatGPT into its search engine, forming the new Bing, which offers an advanced chat-based search experience.

Figure 1: Technological Evolution of AI Models and Search Engine Technologies: Some of the key milestones achieved by AI and search engine (information retrieval) technologies.

In the context of services computing, the integration of LLMs and search engines is not merely an augmentation of existing capabilities but a paradigm shift towards creating more intelligent, efficient, and user-centric search services. The exploration is divided into two main themes: the benefits of enriching LLMs with search engine data and functionalities (Search4LLM) and the enhancement of search performance through the capabilities of LLMs (LLM4Search).

  • Search4LLM: Under this theme, we examine the process of leveraging the vast, diverse data repositories of search engines for the pre-training and progressive fine-tuning of LLMs [140]. This includes an examination of how high-quality, ranked documents can serve as an excellent source for training data, assisting LLMs in developing a better understanding of query contexts and improving their accuracy in generating relevant responses. Additionally, we focus on the potential of learning-to-rank (LTR) algorithms [76] in refining capabilities of LLMs to understand and prioritize information relevance.

  • LLM4Search: Conversely, this part highlights the impact that LLMs can have on search engine operations. This encompasses the utilization of LLMs for more effective content summarization [37], aiding in the indexing process [74], and providing fine-grained query optimization techniques for superior search outcomes [142]. Moreover, the potential of LLMs in analyzing document relevance for ranking purposes and facilitating data annotation in various LTR frameworks [123, 67, 68] is explored.

While the integration of LLMs with search engine services is promising, it is beset by numerous challenges. These include the technical demands of deploying advanced models, ethical concerns, biases in model training, and the need for continuous updates to training datasets due to the evolving nature of web content. Through a thorough investigation, this study aims to offer insights and a systematic framework for future research and development in merging LLMs with search engines. Our exploration endeavors to advance the field of services computing, striving to develop smarter, more adaptive, and user-centric search services capable of managing the complexities of today’s digital information landscape and offering users a superior search experience. The key technical contributions of this research are summarized as follows:

  • Exploration of Innovative Utilization of Search Engine Data: Investigates the potential of using broad and diverse datasets from search engines for the initial pre-training and subsequent fine-tuning of LLMs, enhancing their comprehension of queries and improving their accuracy in generating responses.

  • Leveraging High-Quality Ranked Documents in Training: Examines the use of high-quality, ranked documents as superior sources of training data for LLMs, with the goal of improving their capability to deliver relevant and precise responses to user queries.

  • Advancement in LTR Technologies: Investigates the application of LTR algorithms to augment the effectiveness of LLMs in assessing and prioritizing the relevance of information, thereby enhancing the precision of search results and response generation.

These contributions collectively represent significant advancements in both Search4LLM and LLM4Search themes.

2 Backgrounds and Preliminaries

In this section, we present the fundamentals of search engine services and LLMs to lay the groundwork for our research.

2.1 Search Engine Services

In this section, we provide a concise review on search engine services. Referencing Figure 2, our analysis specifically concentrates on the architectural configuration of systems, the strategic implementation of algorithms, and the administration of evaluative experiments within a search engine.

Figure 2: Architectural Design, Essential Components with Functionalities of a Common Search Engine Service

2.1.1 Data Collection

The performance of search engine services heavily relies on the gathering and examination of expansive online content. For this process, the use of efficient web crawlers is paramount. They systematically browse the World Wide Web to gather a wide variety of web resources including web pages, images, videos, and other multimedia content, which are crucial for maintaining the search engine’s ability to provide comprehensive responses to user queries [53].

The data collection process is complemented by a term extraction module, which extracts key terms and phrases from the content. Term extraction leverages advanced text analysis and NLP techniques that identify and categorize important information, thereby refining the search engine’s match between user queries and relevant documents. This process is further optimized by utilizing metadata such as titles, descriptions, and tags, along with algorithms for entity recognition and for semantic and sentiment analysis [108, 53].

2.1.2 Storage and Indexing

Document storage and indexing form the backbone of a search engine’s ability to quickly and accurately match and deliver search results. A critical part of this indexing process is the creation of an inverted index, a fundamental data structure that associates terms with the documents they are found in. This significantly reduces search time by narrowing down the search to documents containing the queried keywords [97].

Additionally, term weighting strategies such as TF-IDF are implemented to rank terms within documents based on their frequency and relevance, improving the accuracy and relevance of search results by prioritizing highly informative terms [102]. These techniques ensure a precise match between user queries and indexed materials, significantly enhancing the search experience.
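The two structures described above can be illustrated with a minimal sketch. It assumes whitespace tokenization, a tiny in-memory corpus (`DOCS`), and one common TF-IDF variant (length-normalized term frequency times the logarithm of inverse document frequency); the function names are hypothetical, and production search engines use far more elaborate tokenization, weighting, and index compression.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def tf_idf(term, doc_id, docs, index):
    """Length-normalized term frequency times log inverse document frequency."""
    tokens = tokenize(docs[doc_id])
    tf = Counter(tokens)[term] / len(tokens)
    df = len(index.get(term, []))
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)


def search(query, docs, index):
    """Rank documents containing at least one query term by summed TF-IDF."""
    terms = tokenize(query)
    candidates = set()
    for term in terms:
        # Only documents listed in the index for a query term are scored.
        candidates.update(index.get(term, []))
    scored = [(sum(tf_idf(t, d, docs, index) for t in terms), d) for d in candidates]
    return [d for _, d in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Hypothetical toy corpus used only for illustration.
DOCS = {
    1: "search engines index web pages",
    2: "language models generate text",
    3: "search engines rank pages with language models",
}
INDEX = build_inverted_index(DOCS)
```

Here the index restricts scoring to documents that contain a query term, which is the source of the search-time reduction mentioned above; terms that occur in fewer documents receive a higher weight.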

2.1.3 Retrieval and Ranking

Efficient document retrieval and ranking are important for delivering relevant and valuable search engine results. The primary stages in this process include:

  • Query Processing: The first step involves analyzing and potentially reformulating the user’s query using advanced NLP techniques. This phase is crucial for understanding search intent and improving document retrieval effectiveness [114].

  • Relevance Scoring: Each document is assessed and given a relevance score, based on criteria like query term frequency, document structure, and semantic content. This step quantifies the document’s relevance to the query [144].

  • Document Ranking: Utilizing relevance scores and other factors (e.g., user metrics, site authority), algorithms like PageRank and machine learning models determine the document order, prioritizing the most pertinent results [76, 117].

  • Search Results Personalization: Personalization of results based on users’ profile, search history, locations, and devices aims to enhance user satisfaction by tailoring outcomes adaptively [110].

  • Continuous Optimization: The process is dynamically refined through A/B testing, user feedback, and technological progress to align with user preferences and content changes [88].

The ranking algorithms, particularly those based on Learning-to-Rank (LTR) models, are fundamental for search engines to sequence results with precision. LTR models, informed by user interactions and feedback, employ different approaches to ranking:

  • Pointwise approaches: View ranking as a regression or classification to predict individual document scores.

  • Pairwise approaches: Focus on the relative ranking between document pairs [12].

  • Listwise approaches: Aim to optimize the entire result list’s order [134].

Listwise methods are notably effective in achieving user satisfaction [46].
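To make the distinction between the three families concrete, the sketch below gives one representative loss for each. The specific choices are assumptions made for illustration (squared error for pointwise, a logistic loss over preference pairs for pairwise, and a softmax cross-entropy over the whole list for listwise); they are common instances of each family rather than the particular formulations of the cited works.

```python
import math


def pointwise_loss(scores, labels):
    """Mean squared error between each predicted score and its label."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def pairwise_loss(scores, labels):
    """Logistic loss over all pairs (i, j) where document i is more relevant than j."""
    losses = []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                losses.append(math.log(1.0 + math.exp(-(scores[i] - scores[j]))))
    return sum(losses) / len(losses) if losses else 0.0


def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_loss(scores, labels):
    """Cross-entropy between softmax(labels) and softmax(scores) over the full list."""
    target = softmax(labels)
    predicted = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))
```

For graded labels `[2, 1, 0]`, scores that preserve the label order (e.g., `[3, 2, 1]`) yield a lower value under all three losses than scores that reverse it (e.g., `[1, 2, 3]`); the losses differ in whether they look at one document, one pair, or the entire list at a time.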

The development of LTR models is influenced by the availability of human-annotated data, leading to various methodologies:

  • Active LTR prioritizes annotating query-document pairs with uncertain predictions for efficient model training with fewer examples [123].

  • Semi-Supervised LTR combines limited labeled data with larger unlabeled datasets to enhance model training, employing strategies like self-training for effective use of annotations [68, 67].

  • Pretrain-Finetuned LTR involves pre-training on vast datasets followed by fine-tuning with annotated query-document pairs. This approach significantly improves ranking accuracy and data usage efficiency [61, 66].

Selecting among these methods is dictated by the specific LTR challenges and objectives within search engines [130].
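As an illustration of the active LTR idea above, the following sketch selects the query-document pairs whose predicted relevance probability is closest to 0.5 for annotation. Treating distance from 0.5 as the uncertainty measure, as well as the function name and input format, are assumptions made here for illustration; other uncertainty criteria are possible.

```python
def select_for_annotation(predictions, budget):
    """Pick the `budget` most uncertain items.

    `predictions` is a list of (pair_id, probability_of_relevance) tuples;
    uncertainty is taken to be closeness of the probability to 0.5.
    """
    ranked = sorted(predictions, key=lambda item: abs(item[1] - 0.5))
    return [pair_id for pair_id, _ in ranked[:budget]]
```

With predictions `[("a", 0.9), ("b", 0.52), ("c", 0.1), ("d", 0.45)]` and a budget of two, the pairs `b` and `d` would be sent to annotators, since the model is least certain about them.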

2.1.4 Evaluation for Search Engine Services

We outline a technical framework for assessing search engine performance, incorporating critical methodologies and metrics for comprehensive experimentation and analysis.

A/B testing, which is crucial for ongoing enhancement in search engine performance, involves comparing two variations of a search engine to determine the one that performs better [101]. The A/B testing protocol involves:

  • Establishing Objectives: Defining measurable goals, such as boosting click-through rates or search result relevance [82].

  • Creating Variants: Developing a control version (A) versus an experimental version (B), with the latter introducing a new ranking model or feature [52].

  • Segmenting Users: Randomly allocating users to either variant to ensure statistical comparability and isolate the effects of changes [82].

  • Test Implementation: Running the test until reaching statistical significance, while collecting performance data for both variants [14].
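The protocol above can be sketched as follows, under two assumptions that are not taken from the cited works: users are assigned to variants by hashing a user identifier (a deterministic stand-in for random allocation), and the objective is click-through rate compared with a two-sided two-proportion z-test. The experiment name and function names are hypothetical.

```python
import hashlib
import math


def assign_variant(user_id, experiment="ranker-test"):
    """Deterministically assign a user to variant A or B via a hash bucket."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return "B" if int(digest, 16) % 2 else "A"


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click-through rate."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / views_a + 1.0 / views_b))
    if std_err == 0.0:
        return 0.0, 1.0  # no variation in outcomes, so no detectable difference
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

Hash-based assignment keeps a returning user in the same variant across sessions, which helps isolate the effect of the change; the test would typically be run until the p-value falls below a significance threshold fixed in advance.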

Evaluating search engine performance necessitates metrics that accurately reflect user satisfaction with search results. Essential KPIs include:

  • Precision at Top-k Results (P@k) and Normalized Discounted Cumulative Gain (NDCG) for ranking accuracy of the search outcomes [111, 126].

  • Mean Reciprocal Rank (MRR) for how early the first relevant result appears in the ranking [17].

  • Click-through Rate (CTR) and User Satisfaction for gauging engagement and satisfaction [21].

  • Conversion Metrics for assessing the economic impact of changes on commercial search engines [33].

Through A/B testing and rigorous evaluation with these KPIs, search engine developers can make informed decisions to enhance user experience, relevance, and achieve business objectives, ensuring ongoing improvement and optimization of search technology.
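For reference, the offline ranking metrics listed above can be computed as in the following sketch. It assumes that each ranked list is given as relevance labels in ranked order (0 meaning not relevant) and uses the exponential-gain form of DCG, which is one of several conventions in use.

```python
import math


def precision_at_k(relevances, k):
    """Fraction of the top-k results that are relevant (label > 0)."""
    return sum(1 for r in relevances[:k] if r > 0) / k


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with gain 2^rel - 1 and a log2 position discount."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (label-sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def mean_reciprocal_rank(ranked_lists):
    """Average of 1 / rank of the first relevant result over all queries."""
    total = 0.0
    for relevances in ranked_lists:
        for rank, r in enumerate(relevances, start=1):
            if r > 0:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

A list already sorted by relevance has an NDCG of 1, and a query whose first relevant result appears at rank 2 contributes 0.5 to the MRR average.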

Figure 3: The Life-cycle of LLMs: Pre-training, supervised fine-tuning, model alignment with human feedback, and building applications with agents.

2.2 Large Language Models (LLMs)

Large Language Models (LLMs) represent a significant advancement in the field of natural language processing (NLP) and artificial intelligence (AI). These models have fundamentally altered the landscape of computational linguistics, enabling a wide array of applications that range from text generation to complex question-answering systems [152]. This section delves into the foundational architectures of LLMs and the full life-cycle (shown in Fig. 3), ranging from pre-training, to supervised fine-tuning, to model alignments, and to agent-based applications, which elevate the capabilities of LLMs.

2.2.1 Foundation Models of LLMs

Introduced in “Attention is All You Need” [118], transformers revolutionized NLP by replacing recurrent layers with self-attention mechanisms. This innovation allows all words in a sentence to be processed simultaneously, enhancing efficiency and linguistic comprehension. Here, we compare encoder-only, decoder-only, and encoder-decoder transformer models, exemplified by BERT, GPT, and BART, respectively [32, 71].

  • Encoder-Only Models: BERT exemplifies this category with its bidirectional training enhancing context understanding. Its encoder transforms input sequences into contextualized representations, aiding in various NLP tasks [49].

  • Decoder-Only Models: GPT and related models emphasize text generation through stacked decoder layers. They predict subsequent words based on previous ones, enabling coherent text generation [1].

  • Encoder-Decoder Models: BART combines both approaches for robust language understanding and generation. This architecture supports a wide range of tasks including summarization and translation [59].

Each model type, from encoder-only to encoder-decoder, offers unique capabilities for specific NLP applications. The evolution from basic transformer models to specialized ones like BERT, GPT, and BART highlights rapid advancements in NLP technology [32, 71].
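The self-attention operation shared by these architectures can be sketched in a few lines. The version below is a simplification: it operates on plain Python lists, uses a single head, and omits the learned query, key, and value projections. The `causal` flag is an illustrative assumption showing the main difference between encoder-style attention (every position sees every other position) and decoder-style attention (each position sees only itself and earlier positions).

```python
import math


def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    With causal=True, position i attends only to positions 0..i, which
    requires queries and keys to come from the same sequence.
    """
    d_k = len(keys[0])
    outputs = []
    for i, query in enumerate(queries):
        visible = range(i + 1) if causal else range(len(keys))
        scores = [
            sum(a * b for a, b in zip(query, keys[j])) / math.sqrt(d_k)
            for j in visible
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * values[j][c] for w, j in zip(weights, visible))
            for c in range(len(values[0]))
        ])
    return outputs
```

Calling `attention(x, x, x)` on a sequence `x` corresponds to bidirectional self-attention as in BERT-style encoders, while `attention(x, x, x, causal=True)` corresponds to the masked self-attention used for left-to-right generation in GPT-style decoders.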

2.2.2 Pre-training Models

Pre-training models undergo essential tasks to understand and generate natural language effectively. A condensed overview of these tasks is as follows:

  • Masked Language Modeling (MLM): This method conceals specific words in a text, provoking the model to predict the omitted words based on the context. This process is critical for comprehending the bidirectional context [49].

  • Next Token Prediction: It involves predicting subsequent words in a sequence, teaching the model the likelihood of word sequences. This is essential for models aimed at text generation like GPT [6, 78].

  • Next Sentence Prediction (NSP): With this task, models are trained to assess whether a sentence logically follows another, enhancing sentence-level comprehension for tasks like text classification [113].

  • Permutation Language Modeling (PLM): This task is unique to models like XLNet where word order is scrambled for the model to predict the original arrangement, aiding in non-linear understanding of contexts [140].

  • Sentence Order Prediction (SOP): An advancement of NSP, where models reorder shuffled sentences in a text, improving their grasp on narrative flow and long-range dependencies [54].

  • Contrastive Learning: This task focuses on differentiating between correct and corrupted input versions, refining the models’ semantic comprehension [18].

These tasks collectively prepare Large Language Models (LLMs) for a wide array of NLP applications by fostering a robust linguistic foundation.
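As a minimal sketch of how training examples for the first two objectives can be constructed, the code below builds a masked-language-modeling example and a set of next-token-prediction examples from a token list. The masking probability of 0.15, the `[MASK]` symbol, and the function names are assumptions for illustration; the sketch also omits refinements used in practice, such as occasionally replacing a selected token with a random one.

```python
import random

MASK = "[MASK]"


def make_mlm_example(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns (inputs, targets); targets holds the original token at masked
    positions and None elsewhere, so the loss is computed only on masked slots.
    """
    rng = random.Random(seed)
    inputs, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(token)
        else:
            inputs.append(token)
            targets.append(None)
    return inputs, targets


def make_next_token_examples(tokens):
    """Pair every prefix of the sequence with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
```

The MLM example lets the model use context on both sides of a masked position, whereas each next-token example exposes only the left context, mirroring the bidirectional and left-to-right objectives described above.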

2.2.3 Supervised Fine-tuning (SFT) and Alignments

Following the broad-based learning in the pre-training stage, LLMs undergo Supervised Fine-tuning (SFT) to enhance their capabilities for particular applications. Later, model alignments, such as Reinforcement Learning from Human Feedback (RLHF), would be carried out to adjust the model’s outputs to closely match human expectations and norms, thereby improving the model’s efficacy, accuracy, and even ethical considerations in its applications [5].

SFT is a powerful technique for optimizing LLMs to perform specific tasks with enhanced accuracy and performance. This process involves leveraging the pre-existing knowledge of the LLM, gained through pre-training on extensive datasets, and adapting it to excel in targeted applications.

  • Data Preparation: The process begins by selecting task-specific, labeled datasets that align with the intended application of the LLM. This data could range from specialized corpora in sectors like healthcare or finance to structured question-answer pairs for tasks such as question-answering (QA) [113].

  • Training Procedure: SFT capitalizes on token sequence prediction tasks (e.g., question-answering), enhancing the model’s adaptability and accuracy without sacrificing generalizability [113].

RLHF involves several steps designed to align the model’s outputs with qualitative judgments or desired behaviors as determined by human feedback [5, 96]. Key components include:

  1. Reward Modeling: Training a model to predict preferred outcomes by evaluating model outputs against human judgments, aligning predictions with human values [93].

  2. Proximal Policy Optimization (PPO): Employing PPO to update decision-making policies towards maximum reward outcomes, ensuring effective learning from complex feedback [133].

  3. Fine-tuning with Human Feedback: Continual fine-tuning using human feedback on new samples to refine both the LLM and the reward model, enhancing model alignment with human expectations [5].

By integrating model SFT and alignments, LLMs achieve superior performance, ethical soundness, and practical value in applications [20, 13].
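Two quantities at the heart of the RLHF steps listed above can be written compactly. The sketch below assumes a pairwise (Bradley-Terry style) preference loss for reward modeling, which is a common choice but is not stated in the text, together with the standard clipped surrogate objective of PPO for a single action; the function names and the clipping range of 0.2 are illustrative.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))


def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one action.

    `ratio` is pi_new(a|s) / pi_old(a|s); clipping it to [1 - eps, 1 + eps]
    removes the incentive to move the policy far from the old one in one update.
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

The preference loss decreases as the reward model scores the human-preferred response increasingly above the rejected one, and the clipped objective caps how much a single update can gain from changing the probability of an action.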

2.2.4 LLM Extensions and Usages

In the era of foundation models, LLMs have emerged as versatile tools with impactful applications across different domains. Harnessing the power of LLMs, notable advancements have been witnessed in the domains of Prompts, Reasoners, and Agents. Let’s delve into each of these perspectives to explore the diverse applications of LLMs.

  • Prompts: Prompting techniques are essential for effectively utilizing LLMs, enabling them to comprehend and react to user needs. Prompt engineering allows for the customization of LLM output through advanced techniques such as few-shot and zero-shot (in-context) learning, thus improving task adaptability [136, 27]. Additionally, prompts open up possibilities for innovative and dynamic interactions with LLMs, enhancing user engagement [22, 89].

  • Reasoners: LLMs, powered by reasoning techniques like chain-of-thought (CoT) and tree-of-thought (ToT), excel in complex problem-solving by mimicking human reasoning processes [129, 141]. These methods enable LLMs to extend their knowledge base, stay current with information, and address bias and fairness in their responses [73, 16, 83].

  • Agents: Acting as autonomous agents, LLMs perform tasks, interact with external tools, and learn to improve their performance over time with minimal human intervention [2, 69]. These agents are notable for their memory and planning abilities, collaboration potential, and the capacity for customization, making them versatile across various applications [69, 120, 42, 139, 137, 106].

In essence, the applications of LLMs through prompts, reasoning frameworks, and autonomous agents showcase their broad capabilities and potential for innovation across different domains. Continual advancements in this sphere promise to further enhance LLM utility and versatility.
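As a small illustration of the prompting techniques mentioned above, the sketch below assembles a few-shot prompt from an instruction, a list of worked examples, and a new query; passing an empty example list yields a zero-shot prompt. The `Q:`/`A:` layout and the function name are assumptions made for illustration and are not tied to any particular model or API.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Build a prompt of the form: instruction, worked Q/A pairs, then the new question."""
    lines = [instruction, ""]
    for question, answer in examples:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
        lines.append("")
    lines.append(f"Q: {query}")
    lines.append("A:")
    return "\n".join(lines)
```

The resulting string would be sent to an LLM as its input; the in-context examples steer the format and style of the answer without any change to the model's weights.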

3 Search4LLM: Enhancing LLMs with Search Engine Services

In this section, we present our vision under the theme of Search4LLM, where we specifically examine how search engine services can significantly enhance the full life-cycle of LLMs from pre-training, to fine-tuning and model alignments, and to applications of LLMs. An overview of this theme has been illustrated in Fig. 4.

Figure 4: An Overview of Search4LLM Theme: leveraging the search engine functionalities to process the data crawled from web and responses from LLMs, providing datasets for pre-training, supervised fine-tuning, and model alignments.

3.1 Enhanced LLM Pre-training

Search engines play a critical role in the pre-training phase of LLMs. This initial phase is foundational, setting the groundwork upon which further model-specific training is built. The utility of search engines in this context cannot be overstated, as they provide a unique and powerful means of collecting, categorizing, and indexing vast swathes of online content. Such capabilities directly impact the quality and efficacy of LLM pre-training in several key ways.

3.1.1 Collection of Massive Online Contents as Corpus

At the core of LLM pre-training is the need for extensive and varied datasets. The functionality of search engines in scouring the internet enables the collection of a vast array of data from multiple sources, encapsulating a diverse range of languages, formats – including HTML pages, PDFs, and text files – and topics from scientific research papers to literary works and current news articles [47, 53].

The wide spectrum of content, collected and aggregated by search engines, serves as an ideal corpus for LLM pre-training. It allows the resulting language models to develop a comprehensive understanding of language patterns, semantics, and syntax. Such a comprehensive corpus is instrumental in spanning the vast landscape of human language and applications, thereby ensuring the model’s broad applicability across different areas. This strategic compilation not only fosters a deeper comprehension of language intricacies but also solidifies the foundation for creating models that can adeptly navigate the complexities and subtleties of human expression found across web-scale content [98].

3.1.2 Indexing Corpus by Domains and Quality of Texts

During the pre-training phase of LLMs, categorizing the corpus by domain and evaluating text quality play a pivotal role in guaranteeing a balanced data distribution, which is integral to the development of a comprehensive and impartial model. This methodology involves a rigorous selection of data spanning a myriad of domains such as news, research reports, literature, and colloquial internet language, while emphasizing the diversity, reliability, and authority of contents through specific quality indicators [57].

In this way, LLMs not only benefit from a diversified learning experience, minimizing the risk of domain biases and the over-representation of certain linguistic styles, but also are able to understand and generate language with a remarkable balance and broad applicability [81]. This method does not merely serve as a strategic preference but emerges as an essential strategy in the development of comprehensive, equitable LLMs capable of navigating through the extensive yet unique landscape of human communication [9].
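A minimal sketch of this selection step is given below. It assumes that each indexed document already carries a `domain` label and a `quality` score in [0, 1] produced by the search engine's own classifiers; these field names, the threshold, and the equal per-domain quota are hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def balanced_sample(documents, per_domain, min_quality=0.5, seed=0):
    """Drop low-quality documents, then draw up to `per_domain` documents per domain."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for doc in documents:
        if doc["quality"] >= min_quality:
            by_domain[doc["domain"]].append(doc)
    sample = []
    for domain in sorted(by_domain):
        pool = by_domain[domain]
        # Cap each domain's contribution so that no single domain dominates.
        sample.extend(rng.sample(pool, min(per_domain, len(pool))))
    return sample
```

Capping the contribution of each domain limits the over-representation of any single source of text, while the quality threshold removes documents that the quality indicators flag as unreliable.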

3.1.3 Supporting Continuous Model Improvement

Search engines operate as dynamic repositories of information, continuously updated by web crawlers that traverse the internet to index new and revised content. This ever-evolving corpus of information serves as a vital resource for the continuous improvement of LLMs, particularly in keeping these models relevant, accurate, and reflective of current language usage and trends [58].

Take an LLM-backed chatbot as an example. As global events unfold or new discoveries are made, the chatbot must understand and provide information on these topical events. By regularly updating the LLM with content indexed by search engines (such as news articles and reports on these recent events), the model remains competent in delivering timely and relevant responses to user inquiries [119].

3.2 Enhanced LLM Fine-tuning

Search engines play a key role in the fine-tuning process of LLMs, enhancing their ability to interact with users and provide accurate, contextually relevant responses in specific domains. This process leverages the advanced capabilities of search engines, including query rewriting, the analysis of user interactions, and the utilization of domain-specific content. By integrating these elements into the fine-tuning of LLMs, the models can improve significantly in the following three aspects.

3.2.1 Learning to Follow User Instructions

One of the primary enhancements search engines offer to LLM fine-tuning involves teaching the model to recognize and interpret users’ intentions. The capability of instruction-following could be achieved through the mechanism of query rewriting, where search algorithms adjust or reformulate user queries to better capture the user’s intent [116].

By analyzing patterns in query rewriting, LLMs can learn to infer the underlying intentions behind users’ queries, enabling them to respond more accurately and helpfully. This technique not only improves the model’s comprehension of user requests but also its ability to engage in more intuitive and efficient dialogue [96].

3.2.2 Learning to Answer Questions

The fine-tuning process also capitalizes on structuring datasets that simulate the question-answering dynamic, utilizing actual search queries as the basis for generating questions and selecting top-relevant content or user-most-clicked items as the corresponding answers. By doing so, the model is trained on real-world examples of how users phrase queries and what information they find most useful, based on search engine results and click-through data. This approach provides the LLM with a rich dataset reflective of genuine user interactions, enabling the model to better understand and structure its responses in a manner that aligns with user expectations and the typical flow of information retrieval [100, 51]. Fig. 5 illustrates an example of leveraging search queries and top search results, collected from a search engine, to synthesize question-answer (QA) pairs for SFT. Classic NLP techniques or prompt-based tuning with LLMs could be used to organize search results into an answer to the question in the query.

Figure 5: Extracting Questions and Answers for SFT from Search Queries and Top Search Results
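The construction of QA pairs from search logs can be sketched as follows. The log schema (`query`, `results`, `snippet`, `clicks`) and the rule of concatenating the snippets of the most-clicked results are assumptions made for illustration; as noted above, classic NLP techniques or an LLM could instead be used to organize the selected results into a fluent answer.

```python
def build_sft_pairs(search_log, top_k=1, min_clicks=1):
    """Turn logged queries and their most-clicked results into prompt/response pairs."""
    pairs = []
    for entry in search_log:
        # Keep only results with enough user clicks, most clicked first.
        clicked = [r for r in entry["results"] if r["clicks"] >= min_clicks]
        clicked.sort(key=lambda r: r["clicks"], reverse=True)
        if not clicked:
            continue  # skip queries without sufficient click evidence
        answer = "\n".join(r["snippet"] for r in clicked[:top_k])
        pairs.append({"prompt": entry["query"], "response": answer})
    return pairs
```

Queries whose results received no clicks are skipped, so the resulting SFT dataset contains only pairs for which user behavior provides some evidence that the selected content answers the question.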

3.2.3 Incorporating Domain-specific Knowledge

Finally, the utilization of domain-specific queries and content curated by search engines serves as a cornerstone for fine-tuning LLMs with specialized knowledge. This involves leveraging search engine capabilities to gather and categorize information specific to distinct fields or industries, such as medicine, law, or technology [99].

By fine-tuning the model on datasets comprised of domain-relevant queries and authoritative content, the LLM acquires an in-depth understanding of sector-specific terminologies, concepts, and commonly sought information. This process not only enhances the LLM’s expertise in various domains but also its ability to deliver precise, expert-level answers to queries within those specific areas [44, 31].

3.3 Enhanced LLM Alignment

Search engines, with their web crawling technologies and advanced semantic algorithms, offer invaluable tools for enhancing the alignment of LLMs with human values and improving the relevance and quality of their outputs. These technologies, developed and refined through long-term operations, provide a framework for ensuring that LLMs can represent human values accurately, prioritize content relevance, and maintain high-quality output. Fig. 6 illustrates the framework of leveraging these components to provide feedback for model alignment. By integrating the following search engine functionalities, LLMs can achieve a greater degree of alignment.

3.3.1 Semantic Relevance Alignment

The LTR system, an integral component of search engines, is engineered to organize and display search results according to their semantic significance in relation to the user’s search query [76, 130]. This functionality can be particularly beneficial when LLMs produce multiple outputs in response to a single input, a common occurrence with the application of expert or decoding methods aimed at enhancing response diversity.

By applying the LTR system to these sets of results, it is possible to rank the outputs in order of their relevance and utility regarding the initial query. This practice ensures that the most relevant information is prioritized, helping users to access the most accurate and helpful content more efficiently [28].
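As a rough sketch of this re-ranking step, the snippet below orders several candidate LLM outputs by a relevance score. The `relevance_score` function is only a stand-in (token-level Jaccard overlap) for a trained LTR model, and both function names are hypothetical.

```python
from typing import List


def relevance_score(query: str, text: str) -> float:
    """Stand-in scorer: Jaccard overlap of lowercase tokens.

    Assumption: a production system would call a trained LTR model here.
    """
    q = set(query.lower().split())
    t = set(text.lower().split())
    union = q | t
    return len(q & t) / len(union) if union else 0.0


def rank_outputs(query: str, candidates: List[str]) -> List[str]:
    """Return candidate LLM outputs ordered from most to least relevant."""
    return sorted(candidates, key=lambda c: relevance_score(query, c), reverse=True)


ranked = rank_outputs(
    "capital of france",
    ["Bananas are yellow", "Paris is the capital of France"],
)
```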

3.3.2 Content Value Alignment

Search engines deploy elaborate crawling algorithms capable of identifying content that may be harmful, such as hate speech, pornography, or violence, even when the content does not clearly seem sensitive or offensive at first glance. This capability stems from long-term exposure to vast quantities of online media and the continuous refinement of content evaluation models [128].

Integrating the above modules or functionalities (which already exist in search engines) into the LLM fine-tuning process allows for the incorporation of human values at the core of model alignment. By leveraging the search engine’s ability to discern and filter out undesirable content, LLMs can be trained or corrected to avoid generating or promoting material that contradicts widely accepted human values, thus ensuring the model’s outputs are aligned with ethical and societal norms [80, 148].

3.3.3 Content Quality Alignment

Search engines are commonly equipped with models that evaluate the quality of online content, often trained using extensive datasets of users’ click-through data. These models assess various aspects of content, such as its credibility, informativeness, and user engagement for quality-based search or ranking [84, 105].

By applying the above evaluation models to review and rate the content generated by LLMs, search engines can provide critical feedback for the continuous alignment and improvement of the models. This feedback loop enables the identification of content quality issues, guiding subsequent fine-tuning efforts to enhance the overall quality of LLM outputs. In turn, this process contributes to the optimization of LLMs, ensuring they produce high-quality, relevant information that meets user expectations [79, 109].

Figure 6: Using Relevance, Content Quality, and Value Screening Components to Align Models
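One possible way to turn the three components into an alignment signal is sketched below: candidates flagged by the value screen are discarded, the rest are scored by a weighted mix of relevance and quality, and the best and worst survivors form a preference pair. The function name, the weighting scheme, and the use of preference pairs are assumptions made for illustration, not a method prescribed by the text.

```python
from typing import Callable, List, Optional, Tuple


def preference_pair(
    candidates: List[str],
    relevance: Callable[[str], float],
    quality: Callable[[str], float],
    is_harmful: Callable[[str], bool],
    w_rel: float = 0.5,
) -> Optional[Tuple[str, str]]:
    """Return (preferred, rejected) among safe candidates, or None."""
    # Value screening: drop anything the content filter flags.
    safe = [c for c in candidates if not is_harmful(c)]
    if len(safe) < 2:
        return None
    # Assumption: a simple convex combination of the two scores.
    ranked = sorted(
        safe,
        key=lambda c: w_rel * relevance(c) + (1.0 - w_rel) * quality(c),
        reverse=True,
    )
    return ranked[0], ranked[-1]


# Toy stand-ins for the search engine's scoring and screening models.
pair = preference_pair(
    ["good detailed answer", "bad", "harmful text"],
    relevance=lambda c: len(c.split()) / 3,
    quality=lambda c: 1.0 if "detailed" in c else 0.0,
    is_harmful=lambda c: "harmful" in c,
)
```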

3.4 Enhanced LLM Applications

Incorporating search engine capabilities significantly enhances LLMs by addressing their key limitations from applications’ perspectives, such as the lack of real-time information, difficulty with out-of-distribution questions, and constraints within specific domains.

3.4.1 Real-time Information Provision

LLMs are constrained by the datasets they are trained on, which, due to the extensive time required for pre-training, fine-tuning, and subsequent updates, often lack current information. Search engines, on the other hand, offer a conduit to real-time data across various domains. By leveraging retrieval-augmented generation (RAG) [60], LLMs can dynamically integrate search-engine-sourced information into their responses [48].

Specifically, RAG involves executing real-time searches based on the input query and fusing the retrieved information with the LLM’s generated content, thus enabling the model to provide up-to-date answers and insights [29]. For example, when asked “Today’s weather in Washington DC.” without access to the Internet, GPT-4 would respond with a notice that it cannot access real-time data. However, when the top result for the same query from Google search is included in the prompt context for RAG, GPT-4 would respond with accurate information.
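The prompt-construction step of this example can be sketched as follows. The `build_rag_prompt` helper, the `title`/`snippet` result fields, and the prompt wording are hypothetical; the search call and the LLM call themselves are deliberately left out.

```python
from typing import Dict, List


def build_rag_prompt(question: str, results: List[Dict[str, str]], top_k: int = 1) -> str:
    """Place the top-k search results into the prompt context for the LLM."""
    context = "\n".join(
        f"[{i}] {r['title']}: {r['snippet']}"
        for i, r in enumerate(results[:top_k], start=1)
    )
    return (
        "Answer the question using only the search results below.\n"
        f"Search results:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder results; a real system would obtain these from a search engine.
results = [
    {"title": "Example weather page", "snippet": "<snippet returned by the search engine>"},
    {"title": "Another page", "snippet": "<another snippet>"},
]
prompt = build_rag_prompt("Today's weather in Washington DC.", results, top_k=1)
```

Setting `top_k=1` mirrors the example above, in which only the top search result is added to the context.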

3.4.2 Cross-domain Question Answering

LLMs, whether general-purpose or fine-tuned for specific domains, may struggle with questions that lie outside their trained datasets or knowledge domains, known as out-of-distribution or out-of-domain queries [145, 122].

In such scenarios, search engines can serve as a powerful tool to supplement the LLMs’ responses by providing cross-domain information. Specifically, when an LLM encounters a query it is ill-equipped to answer due to domain limitations, it can utilize search engines to fetch pertinent information from a broader spectrum of knowledge. This not only expands the range of questions the LLM can handle but also enhances the depth and accuracy of its responses, making the model more versatile and capable of tackling a wider array of subjects [48].

3.4.3 Addressing Miscellaneous Limitations

Beyond real-time updates and cross-domain supplementation, search engines can assist LLMs in various other aspects. For instance, they can improve the model’s ability to discern user intent by analyzing search patterns and query refinements, bolster content quality through insights derived from user engagement metrics, and even refine the model’s ethical and factual alignment by filtering out unreliable sources.

In addition, the broad and continuously updated dataset a search engine handles provides a wealth of supplementary information that can be used to train and enhance LLMs in effective ways, addressing a range of miscellaneous limitations that might not be readily apparent during the initial model development stages.

3.5 Summary of Search4LLM Research

The introduction of search engine functionalities into LLMs presents a revolutionary stride in the development of AI, particularly in automating the procedures of massive data collection and fine-grained data production. This synergy provides a robust framework for enhancing the models’ capacity to interpret user intentions, generate relevant responses, and apply specialized knowledge across diverse domains. Below, we delve into several key points that highlight the significant achievements and future prospects of integrating search engine capabilities into LLMs:

  • Enhanced Understanding of User Intentions: The use of query rewriting techniques within LLMs enables a more profound comprehension of what users are actually seeking. This advancement allows for a deep understanding of queries, catering to the specific needs and contextual inquiries of the users.

  • Augmented Answer Structuring: Leveraging real-world search data, LLMs can now structure answers in a more coherent and informative manner. This not only enhances the utility of responses provided to user queries but also ensures that the information is presented in an easily digestible format, making it more accessible to users.

  • Application of Domain-Specific Knowledge: By incorporating domain-specific content and expertise into their frameworks, LLMs can offer precise and contextually relevant answers. This significantly elevates their proficiency in handling inquiries that require specialized knowledge or expertise.

  • Optimization of Model Alignment with Human Values: The integration facilitates a comprehensive approach to aligning LLM outputs with ethical standards and human values. Through content value alignment, learning-to-rank systems for prioritizing outputs, and utilizing quality assessment models for feedback, LLMs can achieve a balance between accuracy, ethical considerations, and user satisfaction.

  • Relevance and Accuracy Adjustment: The collaboration between search engines and LLMs introduces mechanisms like retrieval-augmented generation, which significantly boosts the models’ accuracy and relevance. This is particularly vital in overcoming challenges related to real-time data provision, domain-specific knowledge application, and addressing out-of-distribution queries.

  • Versatility and Dynamic Responsiveness: With the integration of search engine technologies, LLMs exhibit unprecedented versatility and adaptability. They become more adept at navigating the complex and constantly changing landscape of human knowledge and communication, effectively managing cross-domain inquiries and providing up-to-date information.

In summary, it is our unique vision to incorporate search engine functionalities in the full life-cycle of LLMs (pre-training, fine-tuning, alignment and applications). As we move forward, this synergy between search engine capabilities and LLMs heralds a new era in AI, characterized by models that are more dynamic, responsive, and comprehensive, embodying a significant leap towards achieving AGI.

Figure 7: An Overview of LLM4Search Theme: Leveraging LLMs to augment information extraction & indexing, query rewriting & improvement, and information retrieval & ranking in online/offline manners.

4 LLM4Search: Augmenting Search Engines with LLMs

In this section, we present our vision under the theme of LLM4Search, where we specifically examine how large language models (LLMs) can significantly augment search engines in terms of query understanding, information extraction & retrieval, and content ranking for web search. An overview of this theme is illustrated in Fig. 7.

4.1 Augmented Query Rewriting

The adoption of LLMs into search engine services has the potential to augment the rewriting process of search queries, thereby improving user experiences and search result relevance [142, 75, 26], in several ways as follows.

4.1.1 Query Recommendation and Completion

LLMs can significantly enhance query recommendation and completion functionalities in search engines by leveraging their deep understanding of language and context [4]. When a user begins typing a query, LLMs can analyze the partial input and generate highly relevant keyword suggestions and complete query predictions.

  • Query Completion: LLMs can comprehend the semantic meaning behind partial queries, allowing them to predict the user’s intent and suggest relevant keywords or phrases that align with the intended search [23].

  • Query Recommendation: LLMs can be fine-tuned on search query logs and user behavior data to identify trending keywords or popular patterns of queries. This information can be incorporated into the recommendation system, ensuring that suggested keywords and query completions reflect current user interests and preferences [23, 91].

4.1.2 Query Correction and Improvement

LLMs can serve a fundamental function in augmenting the capabilities of query correction within search engine systems. By leveraging their understanding of language and ability to identify and rectify errors, LLMs can assist users in refining their queries, even when faced with misspellings, grammatical errors, or incorrect inputs. Specifically, LLMs can be trained on large-scale text corpora to recognize and correct common spelling and grammatical errors in user queries. By understanding the structure and syntax of language, LLMs can suggest accurate spelling corrections or grammatically correct alternatives, ensuring that the search engine handles the query and retrieves relevant results [23, 26].

4.1.3 Contextualized and Personalized Query Extension

LLMs can significantly enhance the contextualization and personalization of query extensions in search engines. By leveraging information from cookies and browsing/search history, LLMs can tailor query extensions to individual users, providing a more relevant, personalized, and context-aware search experience [121, 63].

Specifically, LLMs can analyze user-specific data, such as browsing history, search patterns, and preferences, to build comprehensive user profiles. These profiles can be used to understand the user’s interests, expertise level, and search behavior, enabling personalized query extensions that align with their specific needs. Furthermore, LLMs can examine the context surrounding a user’s query, including the current browsing session, previous searches, and the content of the web pages visited. By understanding the broader context, LLMs can extend queries that are highly relevant to the user’s current information-seeking task [154, 142, 75].

4.2 Augmented Information Extraction and Indexing

LLMs stand at the forefront of transforming search engines’ approach to information extraction and document indexing. LLMs, with their advanced understanding of natural language processing, can significantly improve the precision and relevance of the indexing process.

4.2.1 Terms Extraction and Summarization for Indexing

LLMs possess the inherent capability to understand and interpret the contextual meaning and detailed information of text on web pages. This comprehension plays a vital role in pulling out an exact set of index terms and succinctly summarizing the content, both of which are key procedures in the task of document indexing [85], as follows.

  • Term Extraction: By deploying LLMs, search engines can comprehend every webpage in depth, distinguishing crucial information from generic data. This discernment allows for the extraction of meaningful and precise index terms that accurately reflect the page’s content [34, 86, 35].

  • Content Summary: LLMs can generate succinct and informative summaries of web content. These summaries provide a quick overview of the webpage, aiding in the efficient categorization and retrieval of documents. This capability is particularly beneficial for users and search engines alike, offering a glimpse into the content without the need to parse through the entire document [135, 147].

Fig. 8 illustrates an example of terms extraction and summarization for indexing purposes. With the original content provided in the context of the prompt, GPT-4 could respond with the extracted terms and a summary snippet. One could also run the prompt with LLMs multiple times to diversify the extraction and summarization results.

Figure 8: An example of Prompts and Responses for Terms Extraction and Summarization for Indexing Purposes

4.2.2 Semantic Labeling and Categorization for Indexing

The capability of LLMs to measure the semantic distance or similarity between web pages is revolutionary, providing a reliable approach to automatically labeling and categorizing web pages based on their content. Specifically, LLMs evaluate the semantics of webpage content, identifying the subject matter and themes within the text. By measuring the semantic distance or similarity between webpages, LLMs can group related documents, enhancing the search engine’s ability to retrieve topically relevant results. This semantic analysis facilitates the automatic labeling and categorization of web pages [115]. LLMs can analyze the content and context of a webpage, assigning it to appropriate categories or labels based on its semantic characteristics. This process not only streamlines the indexing but also improves the user experience by enabling more accurate and thematic search results [131].
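A minimal sketch of similarity-based categorization is given below, assuming each webpage and each category has already been mapped to an embedding vector (for instance by an LLM-based encoder, which is not shown). The `cosine` and `categorize` helpers and the two-dimensional toy vectors are purely illustrative.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def categorize(page_vec: List[float], category_vecs: Dict[str, List[float]]) -> str:
    """Assign the page to the category whose embedding is most similar."""
    return max(category_vecs, key=lambda name: cosine(page_vec, category_vecs[name]))


# Toy embeddings; real ones would come from an encoder model.
category_vecs = {"sports": [1.0, 0.0], "finance": [0.0, 1.0]}
label = categorize([0.9, 0.1], category_vecs)
```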

4.2.3 Query Candidates Generation for Indexing

LLMs can play a critical role in training neural information retrievers and in the cold start phase of a search engine by generating a list of potential queries related to the content of a webpage. This approach ensures that the search engine is primed with relevant queries for new or less-indexed content, in the following two steps.

  • Query Candidate Generation from Contents: By comprehensively analyzing the content of a webpage, LLMs can generate a list of potential queries that users might input when searching for similar information [153]. This capability is essential in deciphering the context and intention behind user inquiries, thereby aligning the responses generated by the search engine more accurately with user anticipations. The generated queries provide valuable training data for neural information retrievers [146].

  • Cold Start New Contents with Generated Candidates: By simulating real-user queries, LLMs can help these models learn to predict and rank relevant web pages more effectively, even in scenarios where direct user query data may be limited. For new or niche content that may not yet have associated user queries, the list of generated queries can kickstart the search engine’s understanding and indexing of such content. This alleviates the “cold start” problem, ensuring that all content, regardless of its current popularity or visibility, can be discovered and retrieved by users [36, 45].

One could use prompts similar to those in Fig. 8 to generate candidate queries from the content of a webpage.

4.3 Augmented Information Retrieval, Document Ranking, and Content Recommendation

LLMs have demonstrated remarkable potential in improving the functionalities of search engines, particularly in the areas of information retrieval (IR), webpage ranking, and content recommendation, as discussed in the following subsections.

4.3.1 Annotation for Retrieval and Ranking

One of the fundamental challenges in training neural networks for information retrieval lies in the necessity to accurately annotate the relevance of query-webpage pairs [3, 55]. LLMs can be involved in the annotation process for LTR [15] in three ways, as follows.

  • Point-wise LTR Annotation: LLMs can assign ranking scores to individual documents relative to a query, based on relevance and user context. These point scores serve as training data for models that aim to replicate such scoring [130].

  • Pair-wise LTR Annotation: For pair-wise approaches, LLMs can determine the relative order between any two webpages in response to a query, considering both content relevance and user-specific information. This relative ranking aids in training algorithms to understand preferences within sets of documents [12].

  • List-wise LTR Annotation: In a more comprehensive capacity, LLMs can generate ranked lists of webpages based on their collective relevance and personalization for a query. This ranked order provides a template for list-wise LTR models to learn how to sequence document sets effectively [134].

Fig. 9 illustrates an example of prompts and responses for Point-wise, Pair-wise, and List-wise LTR annotations, where the LLM predicts the relevance score of every retrieved result, the partial order of every two retrieved results, and the order of all retrieved results, subject to the query. By providing high-quality, relevance-annotated pairs, LLMs can help ensure that the training data for information retrieval neural networks is both accurate and representative of diverse query intents and informational needs [85, 104, 151].

Figure 9: An Example of Prompts and Responses for Annotation of Point-wise, Pair-wise, and List-wise LTR. Panels: (a) Point-wise LTR Annotation; (b) Pair-wise LTR Annotation; (c) List-wise LTR Annotation.
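To make the relationship between the three annotation formats concrete, the sketch below derives pair-wise preferences and a list-wise ordering from point-wise scores. This is only an illustration of the data shapes: in the setting described above the LLM would produce each annotation type directly, and the document identifiers, scores, and helper names here are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def to_pairwise(scores: Dict[str, float]) -> List[Tuple[str, str]]:
    """Turn point-wise scores into (preferred, other) pairs; ties are skipped."""
    pairs = []
    for a, b in combinations(scores, 2):
        if scores[a] != scores[b]:
            pairs.append((a, b) if scores[a] > scores[b] else (b, a))
    return pairs


def to_listwise(scores: Dict[str, float]) -> List[str]:
    """Turn point-wise scores into a ranked list, highest score first."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical point-wise relevance labels for one query.
pointwise = {"d1": 2, "d2": 0, "d3": 1}
pairwise = to_pairwise(pointwise)
listwise = to_listwise(pointwise)
```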

4.3.2 Online Ranking and Recommendation for Contextual, Personalized Search

Upon the establishment of an information retrieval model, LLMs can further enhance the search experience by leveraging users’ browsing/searching history and profiles to perform online ranking of retrieved webpages or the recommendations of contents.

  • Ranking: Given a set of retrieved webpages for a search query, LLMs can evaluate the contextual relevance of each retrieved webpage by considering the specific needs and interests of the user as reflected in their search history and profile. By comparing relevance scores and incorporating personalization factors, LLMs can dynamically adjust the ranking of retrieved webpages, ensuring that the most relevant and personalized results are prioritized for the user [112, 151]. Note that one can use a prompt similar to those in Fig. 9, adding the user’s browsing history or profile to the prompt context, to enable online ranking with contextual personalization.

  • Recommendation: In addition to ranking the retrieved webpages for search, another option is to directly recommend content that the user might be interested in. LLMs can analyze the textual data in user profiles and browsing history, which may include user preferences, demographic information, and real-time interests. For example, during a search session, if a user is looking at sports equipment, the search engine would probably recommend sports-related content or products. To achieve this goal, the so-called LLM4Rec techniques have been proposed with LLMs and prompts [132], where LLMs could be pre-trained and/or fine-tuned to understand users and items described in text [36] and to predict user-item interactions [124] accordingly.

4.3.3 Retrieval-Augmented Generation (RAG) Contents for Conversational Search

The incorporation of Retrieval-Augmented Generation (RAG) into search engines significantly enriches the result output from a generation perspective. Once relevant documents have been retrieved and ranked appropriately, RAG leverages the wealth of factual information within these sources to generate coherent, contextually relevant, and information-rich content for users. Rather than simply returning a list of documents, RAG synthesizes the information extracted from the top-ranking sources to compose responses that effectively combine the retrieved knowledge into a cohesive answer.

Figure 10: An Example of Prompts and Responses for RAG in Search Results Aggregation

As shown in Fig. 10, given the retrieved results ranked in an appropriate order, RAG could synthesize a coherent response (with a summary, details, and references) to the query through simple prompting with LLMs. From a generation standpoint, the benefits of RAG in search engines may include the following.

  • Enhanced Accuracy and Relevance: By drawing directly from the retrieved high-quality documents, RAG ensures that the generated responses not only accurately reflect the content of these sources but also maintain a high degree of relevance to the original query.

  • Overall Coherence: RAG models understand the broader context obtained from the entirety of the retrieved documents, allowing for the production of coherent responses that consider multiple facets of a user’s question.

  • Efficient Summary Generation: They can effectively summarize and condense information from multiple documents, distilling complex data into digestible and accessible formats for the end-user.

  • Data-Rich Responses: RAG-enabled search engines provide detailed, well-informed answers by cross-referencing various sources, leading to a richer informational value compared to search engines that only offer links.

  • Natural Language Output: Leveraging the NLP capabilities of underlying language models, RAG produces answers in a conversational tone, which can improve user engagement and understanding.

By combining the robust data retrieval aspect with the advanced natural language generation capabilities of GenAI, RAG transforms search engines into powerful tools that do not just find information; they also present it in an instantly usable way, making the search process more seamless and the results more actionable for the user.

4.4 Augmented Evaluation for Search Engines

The evolution of search engine technology necessitates equally advanced methods for evaluating performance and user experience. LLMs offer significant potential in augmenting the evaluation of search engines through several innovative approaches, as shown below.

4.4.1 Automated A/B Testing through User Mimicking

LLMs can enhance the efficiency and effectiveness of A/B testing in search engines by acting as agents that mimic user search behaviors. This application allows the direct comparison of different search result sets and their respective ranking orders. Some key features of LLMs are as follows.

  • Traffic Mockup: By generating a diverse range of user queries based on real-world search patterns and intentions, LLMs can simulate the natural variability in search behaviors [45, 146, 154].

  • Automatic Evaluation: LLMs can evaluate two sets of search results (from A/B variants) for the same query, comparing not just the relevance but the ranking order, to gauge which set is more likely to satisfy the user’s needs [15, 112, 104].

  • User Mimicking: Apart from evaluating results, LLMs can mimic user behaviors in interacting with these results, including clicking through links according to the perceived relevance, thus offering deeper insights into the effectiveness of ranking algorithms [50, 7].

Fig. 11 illustrates an example of automatic evaluation that compares the two sets of search results under the same query, from the perspectives of relevance, timeliness, and the ranking order. In this example, by encapsulating the titles and snippets of webpages in the order of search results into the prompt, GPT-4 could automatically return the evaluation results from the desired perspectives and format them in a programming-friendly way. GPT-4 can also generate interpretations of the comparison. Due to the page limit, we have not included the full response here.

Figure 11: An Example of Prompts and Responses for Automatic Evaluation of Search Results
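Because the evaluation results come back in a programming-friendly format, they can be aggregated across many queries with a few lines of code. The sketch below assumes, purely for illustration, that each LLM response is a JSON object with an `overall` field holding `"A"`, `"B"`, or `"tie"`; the actual format shown in Fig. 11 may differ.

```python
import json
from collections import Counter
from typing import List


def tally_verdicts(raw_responses: List[str]) -> Counter:
    """Count per-variant wins from JSON verdicts; malformed ones are 'invalid'."""
    wins: Counter = Counter()
    for raw in raw_responses:
        try:
            verdict = json.loads(raw)
        except json.JSONDecodeError:
            wins["invalid"] += 1
            continue
        # Assumed schema: {"overall": "A" | "B" | "tie", ...}
        winner = verdict.get("overall") if isinstance(verdict, dict) else None
        wins[winner if winner in ("A", "B", "tie") else "invalid"] += 1
    return wins


responses = [
    '{"relevance": "A", "timeliness": "B", "overall": "A"}',
    '{"overall": "B"}',
    "not json",
    '{"overall": "A"}',
]
wins = tally_verdicts(responses)
```

Tracking malformed responses separately keeps parsing failures from silently biasing the comparison between the two variants.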

4.4.2 Decoding User Interactions and Intentions

Through text understanding, LLMs can interpret the user interactions with search engines. This capability allows for a deep understanding of user satisfaction and intent changes throughout their search trajectories. Some key features of LLMs are as follows.

  • Sequential Behavior Modeling: LLMs can analyze patterns in click-throughs and the order of interactions to infer the relevance and quality of the search results provided. By examining the changes in user queries during a search session, LLMs can infer shifts in user intentions or pinpoint when a user’s needs become more specific [125, 70, 50].

  • Satisfaction Assessment: The change in these sequential behaviors can signal how well the search engine accommodates evolving user needs. LLMs can assess whether the search engine helps users find desired information with minimal steps of user interactions (e.g., clicks or queries), indicating the efficiency and effectiveness of the search process [50, 72].
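One simple proxy for the "minimal steps" criterion above is the number of interactions a user needs before reaching a satisfying result. In the sketch below, a session is a list of hypothetical `(action, satisfied)` events, where the `satisfied` flag is assumed to come from an upstream judgment (for example, an LLM reading the session), which is not shown.

```python
from typing import List, Optional, Tuple


def steps_to_success(session: List[Tuple[str, bool]]) -> Optional[int]:
    """Return the 1-based index of the first satisfying step, or None."""
    for step, (_action, satisfied) in enumerate(session, start=1):
        if satisfied:
            return step
    return None


def count_queries(session: List[Tuple[str, bool]]) -> int:
    """Count query events, i.e. the initial query plus any reformulations."""
    return sum(1 for action, _ in session if action == "query")


session = [("query", False), ("click", False), ("query", False), ("click", True)]
effort = steps_to_success(session)
```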

4.4.3 Evaluation with User Experience Dashboards

LLMs can play a vital role in transforming raw evaluation data, including A/B testing outcomes and user interaction analyses, into comprehensive dashboards that highlight key aspects of user experience, as follows.

  • Data Aggregation: By aggregating data from numerous comparisons and user interactions, LLMs can pinpoint critical performance metrics such as click-through rates, query refinement patterns, and satisfaction indicators [60, 43].

  • Data Interpreter: Advanced data synthesis capabilities enable LLMs to identify patterns and trends suggesting areas for improvement, whether in search algorithms, result ranking, or user interface design [41, 77].

  • Summary & Reporting: Leveraging their language generation capabilities, LLMs can generate insightful, understandable narratives around the data, complemented by visual dashboards that highlight search engine performance from the user’s perspective [149].

These dashboards can serve as invaluable resources for developers and designers looking to enhance search engine technologies.

4.5 Summary of LLM4Search Research

Incorporating LLMs with search engines heralds a pivotal shift in the paradigm of information retrieval, query processing, and user interaction. These advanced models offer a suite of capabilities that significantly enhance the efficiency, accuracy, and user experience of search engines. Upon examining their multifaceted contributions, it becomes evident that the potential of LLMs rests on four key abilities, as follows.

  • Content Understanding and Information Extraction: LLMs exhibit an unparalleled proficiency in dissecting and interpreting search queries, web content, browsing history, and user profiles. They adeptly extract relevant information by understanding the semantic meaning contained within queries and documents. This deep understanding enables LLMs to parse web pages for precise index term extraction, categorize content semantically, and tailor query suggestions based on historical user data. Adopting this capability ensures that search engines can process queries with a higher degree of accuracy, leading to improved retrieval that aligns closely with user intent.

  • Semantic Relevance for Content Matching and Ranking: At the heart of LLMs’ functionality is their ability to analyze and evaluate semantic relevance, allowing for more advanced content matching and ranking algorithms. LLMs leverage their extensive knowledge base and language understanding to match queries with the most relevant content, even when the match is not immediately apparent from the query’s keywords alone. This semantic analysis extends to the generation of contextual queries and enhances the search engine’s ability to categorize web pages, contributing to a more comprehensive understanding of content relevance and significantly improving the quality of search results.

  • User Profiling and Context Modeling: LLMs stand out for their capability to offer highly contextualized and personalized search experiences through in-context learning. By analyzing current search queries in conjunction with historical user data, LLMs can craft responses that are tailored to the individual’s specific needs, preferences, and patterns of behavior. This level of personalization not only enhances user satisfaction but also makes information retrieval more efficient by prioritizing results that are most relevant to the user’s context and search history.

  • Comparative Analysis for Ranking and Evaluation: Finally, LLMs excel in their ability to conduct comparative analyses, whether for the purpose of webpage ranking or for evaluating the effectiveness of search results. Through automated relevance annotation, contextually personalized online ranking, and annotating capabilities for learning-to-rank tasks, LLMs can significantly refine webpage ranking processes. Additionally, their role in automating A/B testing and synthesizing user interaction data into actionable insights marks a significant advancement in search engine evaluation. This ability to dynamically adapt and refine ranking parameters ensures that search engines can continuously evolve to meet and exceed user expectations with a high degree of precision.

As technology continues to advance, the partnership between LLMs and search engines will undoubtedly lead to even more innovative solutions that will shape the future of how we interact with information.

5 Challenges and Future Directions

The symbiosis between search engines and LLMs under the research themes of Search4LLM and LLM4Search is very promising. However, there are many technical challenges that must be overcome. We discuss some of these challenges and future research directions along these lines.

5.1 Memory-Decomposable LLMs

Efficiently managing and updating the extensive memory stores of LLMs, whether for enhancing search engines with LLMs or vice versa, poses a significant challenge in delivering accurate, context-aware responses in real-time [90, 92]. We identify several technical issues as follows.

  • Memory at Scale: The scalability of CRUD operations (create, read, update, and delete) within the memory components of LLMs is critical to their effective functioning.

  • Consistency and Integrity: Maintaining data consistency and integrity during CRUD operations in LLMs is challenging, especially when dealing with real-time data updates and deletions, which could render part of the model’s knowledge obsolete or incorrect.

  • Support for Efficient Retrieval and Editing: The LLM must accurately understand and utilize its decomposed memory segments during CRUD operations in order to maintain contextual relevance and coherence in its responses. This requires sophisticated understanding and integration of user queries and stored knowledge.
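
To make the CRUD view of memory concrete, the following minimal Python sketch models a decomposed memory as an external key-value store with versioned entries. It is only an illustration under strong assumptions: the names DecomposedMemory and MemoryEntry and the key format are hypothetical, and a real memory-decomposable LLM would store and edit knowledge in model parameters or retrieval indexes rather than in a dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MemoryEntry:
    """One stored fact plus a version counter for consistency checks."""
    value: str
    version: int = 1


@dataclass
class DecomposedMemory:
    """Toy external memory keyed by a fact identifier (hypothetical design)."""
    entries: Dict[str, MemoryEntry] = field(default_factory=dict)

    def create(self, key: str, value: str) -> None:
        # Refuse to silently overwrite an existing fact.
        if key in self.entries:
            raise KeyError(f"{key!r} already exists; use update()")
        self.entries[key] = MemoryEntry(value)

    def read(self, key: str) -> Optional[str]:
        entry = self.entries.get(key)
        return entry.value if entry is not None else None

    def update(self, key: str, value: str) -> int:
        entry = self.entries[key]  # raises KeyError if the fact is unknown
        entry.value = value
        entry.version += 1  # the version lets callers detect stale reads
        return entry.version

    def delete(self, key: str) -> bool:
        # Returns True only if something was actually removed.
        return self.entries.pop(key, None) is not None
```

The version counter hints at the consistency issue raised above: a component that cached an older version of an entry can detect that the underlying fact has since been edited or removed.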

Considering the technical challenges outlined above, it may be worthwhile to explore the following research directions:

  • Novel Architecture: Developing more advanced memory management algorithms and architectures that permit LLMs to more effectively access and update their knowledge base. This design could involve techniques like dynamic memory allocation or memory networks that can selectively store and retrieve information.

  • Incremental Learning: Introducing techniques for incremental learning that allow LLMs to efficiently update their memory with new information, or even forget specific pieces of (outdated) information, is crucial for implementing CRUD operations effectively.

  • Exact Retrieval by Generation: Current LLMs generate responses by predicting token sequences with maximal probability. These responses are often plausible but incorrect, a phenomenon known as “hallucination” [62]. However, in search engine settings, a memory-decomposable LLM might need to retrieve information from all the data it has seen during pre-training or fine-tuning. Thus, there is a need for new methods that enable “exact recovery” of (part of) the training corpus.
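
One way to read the “exact-recovery” requirement is as a verification problem: given a generated span, decide whether it occurs verbatim in the corpus and, if so, where. The Python sketch below illustrates this reading with a word-level n-gram inverted index. It is not a method proposed in this paper; the function names, whitespace tokenization, and the choice of n = 3 are assumptions made purely for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

NGramIndex = Dict[Tuple[str, ...], Set[int]]


def build_ngram_index(corpus: List[str], n: int = 3) -> NGramIndex:
    """Map every word n-gram to the ids of the documents containing it."""
    index: NGramIndex = defaultdict(set)
    for doc_id, doc in enumerate(corpus):
        tokens = doc.split()
        for i in range(len(tokens) - n + 1):
            index[tuple(tokens[i:i + n])].add(doc_id)
    return index


def contains_span(doc_tokens: List[str], span: List[str]) -> bool:
    """True if span occurs as a contiguous token sequence in doc_tokens."""
    width = len(span)
    return any(doc_tokens[i:i + width] == span
               for i in range(len(doc_tokens) - width + 1))


def exact_sources(generated: str, corpus: List[str],
                  index: NGramIndex, n: int = 3) -> List[int]:
    """Return ids of documents that contain the generated text verbatim."""
    tokens = generated.split()
    if len(tokens) < n:
        # Too short for the index: fall back to scanning every document.
        candidates = set(range(len(corpus)))
    else:
        candidates = None
        for i in range(len(tokens) - n + 1):
            docs = index.get(tuple(tokens[i:i + n]), set())
            candidates = docs if candidates is None else candidates & docs
            if not candidates:
                return []  # some n-gram never appears together with the rest
    # The index only filters candidates; a final scan confirms contiguity.
    return sorted(d for d in candidates
                  if contains_span(corpus[d].split(), tokens))
```

An empty result indicates that the generated span cannot be attributed verbatim to any indexed document, which is one possible signal that the output is a paraphrase or a hallucination rather than an exact recovery.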

5.2 Explainability of LLMs for Web Search

LLMs often operate as “black boxes”, making it difficult to understand how they arrive at their outputs. This lack of explainability can be problematic when using LLMs to augment search engines, as it may be challenging to interpret or trust the results [64, 138].

  • Model Complexity: LLMs are usually built on over-parameterized architectures with hundreds of millions, or even billions, of parameters. This complexity makes it difficult to pinpoint the exact reasoning behind a given output. The inability to understand the inner workings complicates the task of ensuring the reliability and trustworthiness of LLMs in query understanding, rewriting and so on [150].

  • Opaque Decision-Making Process: LLMs provide limited insight into their decision-making process. This opaqueness hinders the ability to identify the source of errors or biases in the model’s outputs. When LLMs are used to augment search engines (LLM4Search), or when search engines are used to improve the performance of LLMs (Search4LLM), the lack of transparency can erode user trust in the search results [64].

  • Scale of Data: LLMs are trained on massive datasets. Tracing back the influence of specific data points on the output becomes practically impossible. The use of web-scale datasets scales up the challenge in understanding which part of the data contributed to any misinformation or biased information being relayed through the model [143, 138].

Addressing the explainability challenges of LLMs, under the themes of Search4LLM and LLM4Search, is critical for advancing the trustworthiness and utility of these technologies. Future research would need to balance the trade-offs between explainability, accuracy, and efficiency to build more transparent, accountable, and user-friendly LLMs. Some promising directions are as follows.

  • Development of Interpretable Models: There is a pressing need for research on LLM architectures and training methodologies that inherently support explainability. This includes developing models that can articulate the reasoning behind their responses in a manner that is understandable to users.

  • Explainable AI (XAI) Techniques for LLMs: Investigating and adapting XAI techniques to the specific context of LLM4Search and Search4LLM could offer insights into how these models process and retrieve information. This includes creating visualization tools and summary techniques that can help demystify the model’s internal processes [150].

  • Bias and Fairness in Model Explanations: Ensuring that explanations do not propagate biases present in the training data or amplify unfair representations requires dedicated research. This could involve developing methods to audit and refine model explanations for equity and inclusiveness [64].

  • User-Centric Explanation Frameworks: Developing frameworks that can adapt explanations based on the user’s background knowledge and the context of the query. This could involve personalized explanation systems that adjust the complexity and detail of explanations accordingly.
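
As a simple example of the kind of XAI technique that could be adapted to search, the Python sketch below applies leave-one-out (occlusion) attribution to a query-document relevance score: each query term is credited with the drop in score observed when that term is removed. The scorer here is a toy term-overlap function standing in for an LLM-based relevance model, and all function names are hypothetical.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[List[str], str], float]


def overlap_score(query_terms: List[str], document: str) -> float:
    """Toy relevance score: number of query terms found in the document."""
    doc_terms = set(document.lower().split())
    return float(sum(1 for term in query_terms if term.lower() in doc_terms))


def leave_one_out(query: str, document: str,
                  scorer: Scorer = overlap_score) -> List[Tuple[str, float]]:
    """Attribute the relevance score to each query term by occlusion."""
    terms = query.split()
    base = scorer(terms, document)
    attributions = []
    for i, term in enumerate(terms):
        reduced = terms[:i] + terms[i + 1:]  # the query without term i
        attributions.append((term, base - scorer(reduced, document)))
    return attributions
```

Because the procedure only needs black-box access to a scoring function, the same loop could in principle wrap an LLM-based ranker, at the cost of one additional model call per query term.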

5.3 Agents for Search4LLM and LLM4Search

Both themes require LLMs to understand queries and content while working with other components to fulfill their goals. As defined in [69], an agent is built upon the capabilities of memory, planning, and action with LLMs. Thus, to enable agents for Search4LLM and LLM4Search, the following technical challenges should be addressed.

  • Integration Complexity: Agents need to be seamlessly integrated with the vast and complex submodules, components, and tools of LLMs and search engines. This includes the capability to access and interpret vast datasets, understand context from partial or ambiguous queries, and manage real-time data fetching without compromising response times [103, 137].

  • Long/Short-Term Memory for Interactions: For agents to effectively contribute to both LLM4Search and Search4LLM, they must possess sophisticated memory capabilities. This involves not only storing and retrieving information but understanding the relevance of historical interactions in current contexts. How they adapt their memory systems for dynamic and efficient use is a key challenge [38].

  • Adaptive Planning: Agents must plan their actions in environments that are constantly evolving. In the context of LLM4Search and Search4LLM, this requires adapting to changes in user behavior, search patterns, and the availability of online content. Planning in such an adaptive manner requires continuous learning and adjustment mechanisms [127].

  • Action and Interaction: Taking appropriate actions based on the contextual understanding and planning involves interacting with both internal systems of LLMs and external tools (e.g., crawlers, databases, and search engines). Ensuring these actions are both relevant and timely, while minimizing errors or irrelevant outputs, is challenging [103, 24].

To address the above challenges, promising directions for future research are as follows.

  • Improved Memory Space: Developing advanced memory spaces that allow agents to more effectively store, retrieve, and utilize knowledge over time. This could involve exploring neuromorphic computing models or advanced neural network designs that mimic human memory processes more closely, or leveraging transformer models that can handle extremely long (or unbounded) contexts so that all previous interactions can be retained as text.

  • Dynamic Planning Algorithms: Researching algorithms that enable more flexible and dynamic planning based on real-time data and changing environments. This could include reinforcement learning approaches that adapt based on success/failure feedback loops.

  • Interactive Learning Models: Developing models that allow agents to learn from their interactions not just with users but with other AI systems and online databases. This approach could lead to more comprehensive understanding and action-taking abilities.

  • Cross-domain Knowledge Transfer: Exploring methods for more effective cross-domain knowledge transfer and application. This involves agents not just specializing in one area but being able to apply insights from one domain to another fluently.

  • Real-time Data Processing and Action Taking: As the need for immediate and pertinent information grows, investigating how agents can manage real-time data and make prompt decisions without compromising precision or relevance becomes a focus area.
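
The interplay of memory, planning, and action discussed above can be sketched as a small agent loop. In the Python sketch below, planning is an epsilon-greedy choice among tools based on success/failure feedback, which is one simple instance of the reinforcement-style adaptation mentioned under dynamic planning; memory is a bounded log of recent interactions. The class SearchAgent, the tool names, and the feedback interface are hypothetical and serve only to illustrate the idea.

```python
import random
from collections import deque
from typing import Callable, Deque, Dict, Tuple

Tool = Callable[[str], str]


class SearchAgent:
    """Toy agent with short-term memory, adaptive planning, and actions."""

    def __init__(self, tools: Dict[str, Tool], epsilon: float = 0.1,
                 memory_size: int = 5, seed: int = 0) -> None:
        self.tools = tools
        self.epsilon = epsilon
        # Short-term memory: the most recent (query, tool, result) triples.
        self.memory: Deque[Tuple[str, str, str]] = deque(maxlen=memory_size)
        self.successes = {name: 0 for name in tools}
        self.trials = {name: 0 for name in tools}
        self.rng = random.Random(seed)

    def _success_rate(self, name: str) -> float:
        # Untried tools get an optimistic estimate so each is tried once.
        if self.trials[name] == 0:
            return 1.0
        return self.successes[name] / self.trials[name]

    def plan(self) -> str:
        """Epsilon-greedy choice: mostly exploit, occasionally explore."""
        names = sorted(self.tools)
        if self.rng.random() < self.epsilon:
            return self.rng.choice(names)
        return max(names, key=self._success_rate)

    def act(self, query: str) -> Tuple[str, str]:
        """Pick a tool, run it on the query, and remember the interaction."""
        name = self.plan()
        result = self.tools[name](query)
        self.memory.append((query, name, result))
        return name, result

    def feedback(self, tool_name: str, success: bool) -> None:
        """Record whether the last use of a tool satisfied the user."""
        self.trials[tool_name] += 1
        self.successes[tool_name] += int(success)
```

With epsilon set to zero the agent is purely greedy, which makes its behavior deterministic; a small positive epsilon lets it keep re-testing tools whose usefulness may change as user behavior and online content evolve.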

5.4 Miscellaneous

In addition to the topics discussed above, some miscellaneous technical challenges and promising research directions are as follows.

  • Data Quality and Bias: Ensuring the accuracy and fairness of information retrieved or utilized by LLMs. The inherent biases in training data can skew search results or LLM responses, potentially propagating misinformation.

  • User Satisfaction and Trust: Building and maintaining user trust in the accuracy and reliability of LLM-augmented search engines. Users might be skeptical about algorithmic transparency and the quality of personalized results.

  • Intellectual Property and Privacy Concerns: Using content from the web to train LLMs raises significant concerns over copyright infringement and personal data privacy.

  • Legal and Ethical Considerations: Navigating the complex landscape of regulations governing AI and data use across different jurisdictions. The use of LLMs in decision-making processes further complicates this, requiring ethical and responsible AI systems.

With respect to the above challenges, some promising research directions are as follows.

  • Enhanced Methods for Bias Detection and Correction: Innovating more sophisticated AI algorithms that can detect various forms of bias in data and automatically correct or mitigate these biases.

  • User-Centric Design and Feedback Mechanisms: Implementing design principles that put the user first, including customizable privacy settings and the introduction of mechanisms where users can provide real-time feedback on the relevance and quality of search results.

  • Cross-Disciplinary Research on Legal and Ethical AI Use: Conducting cross-disciplinary research that involves legal scholars, ethicists, technologists, and policymakers to develop standards and guidelines for the ethical use of AI in search and information retrieval.

By focusing on these challenges and research avenues, the development of LLM4Search and Search4LLM can be guided toward more cost-effective, responsible, user-friendly, and legally compliant systems that leverage the strengths of LLMs while addressing their inherent limitations.

6 Discussions and Conclusions

This work endeavors to elucidate the reciprocal relationship between Large Language Models (LLMs) and search engines, dissecting how each entity could potentially enrich and augment the functionalities of the other.

6.1 Core Visions

Under the Search4LLM umbrella, the focus is placed on how the vast, diverse datasets available from search engines could be harnessed to enhance the pre-training and fine-tuning processes of LLMs. This approach aims at bolstering the LLMs’ grasp of query contexts, thereby elevating their precision in generating responses that are both relevant and accurate. The premise hinges on utilizing high-quality, ranked documents as prime training data, underlining the significance of such data in improving the overall understanding and response generation capabilities of LLMs. The exploration of Learning-To-Rank (LTR) algorithms further underscores an attempt to refine the abilities of LLMs in analyzing and prioritizing information relevance, essentially sharpening the accuracy and relevance of their responses.

In contrast, the LLM4Search theme examines the potential influence that Large Language Models (LLMs) may have on improving the functional aspects of search engines. Here, the narrative turns to leveraging LLMs for tasks like effective content summarization, aiding in the indexing process, and employing fine-grained query optimization techniques to yield superior search outcomes. Moreover, the role of LLMs in analyzing document relevance for ranking purposes and facilitating data annotation in various LTR frameworks is underscored. This segment hints at a realm where LLMs do not just passively benefit from search engine data but actively contribute to improving the efficiency, accuracy, and user experience of search engine platforms.

6.2 Challenges and Opportunities

In conclusion, the intersection of LLMs and search engine technologies presents a fertile ground for innovation, offering avenues to transcend current limitations in both domains. The Search4LLM initiative underscores the rich potential that search engine datasets have in refining the operational intelligence of LLMs, enabling these models to more adeptly handle query complexities—a leap towards smarter, more adaptive, and user-centric search services. Meanwhile, LLM4Search showcases the transformative impact that LLMs could have on the search engine ecosystem, enhancing content understanding, search precision, and user satisfaction.

However, the path to fully integrating LLMs with search engines is fraught with challenges, including technical implementation hurdles, ethical considerations, biases in model training, and the need to keep training datasets current with the evolving internet landscape. Despite these challenges, this work illustrates a promising horizon where the synergistic marriage between LLMs and search engines could herald a new era of intelligent, efficient, and user-centric search services. This exploration not only contributes to the advancement of services computing but also lays a systematic framework for future research and development in this dynamic intersection of technologies.

References

  • [1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [2] S. V. Albrecht and P. Stone. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence, 258:66–95, 2018.
  • [3] O. Alonso, D. E. Rose, and B. Stewart. Crowdsourcing for relevance evaluation. In ACM SIGIR Forum, volume 42, pages 9–15. ACM, 2008.
  • [4] A. Anand, A. Anand, V. Setty, et al. Query understanding in the age of large language models. arXiv preprint arXiv:2306.16004, 2023.
  • [5] Y. Bai, A. Jones, K. Ndousse, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  • [6] Y. Bengio, R. Ducharme, and P. Vincent. A neural probabilistic language model. NeurIPS, 13, 2000.
  • [7] N. Bernard. Leveraging user simulation to develop and evaluate conversational information access agents. In WSDM, pages 1136–1138, 2024.
  • [8] T. Berners-Lee, R. Cailliau, A. Luotonen, H. F. Nielsen, and A. Secret. The world-wide web. CACM, 37(8):76–82, 1994.
  • [9] K. Bhardwaj, R. S. Shah, and S. Varma. Pre-training llms using human-like development data corpus. arXiv preprint arXiv:2311.04666, 2023.
  • [10] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems, 30(1-7):107–117, 1998.
  • [11] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
  • [12] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In ICML, pages 129–136, 2007.
  • [13] S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freire, et al. Open problems and fundamental limitations of reinforcement learning from human feedback. TMLR, 2023.
  • [14] N. Chen, M. Liu, and Y. Xu. How a/b tests could go wrong: Automatic diagnosis of invalid online experiments. In WSDM, pages 501–509, 2019.
  • [15] C.-H. Chiang and H.-y. Lee. A closer look into using large language models for automatic evaluation. In EMNLP, 2023.
  • [16] A.-V. Chisca, A.-C. Rad, and C. Lemnaru. Prompting fairness: Learning prompts for debiasing large language models. In Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion, pages 52–62, 2024.
  • [17] A. Chowdhury and I. Soboroff. Automatic evaluation of world wide web search services. In SIGIR, pages 421–422, 2002.
  • [18] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning. Electra: Pre-training text encoders as discriminators rather than generators. In ICLR, 2019.
  • [19] W. B. Croft, D. Metzler, and T. Strohman. Search engines: Information retrieval in practice, volume 520. Addison-Wesley, 2010.
  • [20] J. Dai, X. Pan, R. Sun, et al. Safe rlhf: Safe reinforcement learning from human feedback. In ICLR, 2023.
  • [21] O. Dan and B. D. Davison. Measuring and predicting search engine users’ satisfaction. ACM CSUR, 49(1):1–35, 2016.
  • [22] H. Dang, L. Mecke, F. Lehmann, et al. How to prompt? opportunities and challenges of zero-and few-shot learning for human-ai interaction in creative applications of generative models. arXiv preprint arXiv:2209.01390, 2022.
  • [23] C. De Silva and T. Halloluwa. Human-centered artificial intelligence: The solution to fear of ai. Blog of ACM Interactions, 2024.
  • [24] X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su. Mind2web: Towards a generalist agent for the web. NeurIPS, 36, 2024.
  • [25] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [26] K. D. Dhole and E. Agichtein. Genqrensemble: Zero-shot llm ensemble prompting for generative query reformulation. In ECIR, pages 326–335. Springer, 2024.
  • [27] Q. Dong, L. Li, D. Dai, et al. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022.
  • [28] G. Dupret and C. Liao. A model to estimate intrinsic document relevance from the clickthrough logs of a web search engine. In WSDM, pages 181–190, 2010.
  • [29] P. F. Foulds, R. James, and S. Pan. Ragged edges: The double-edged sword of retrieval-augmented chatbots. arXiv preprint arXiv:2403.01193, 2024.
  • [30] J. Gao, X. He, and L. Deng. Deep learning for web search and natural language processing. Technical Report MSR-TR-2015-7, January 2015.
  • [31] Y. Ge, W. Hua, K. Mei, J. Tan, S. Xu, Z. Li, Y. Zhang, et al. Openagi: When llm meets domain experts. NeurIPS, 36, 2024.
  • [32] B. Ghojogh and A. Ghodsi. Attention mechanism, transformers, bert, and gpt: tutorial and survey. 2020.
  • [33] A. Ghose and S. Yang. An empirical analysis of sponsored search performance in search engine advertising. In WSDM, pages 241–250, 2008.
  • [34] J. Giguere. Leveraging large language models to extract terminology. In Proceedings of the First Workshop on NLP Tools and Resources for Translation and Interpreting Applications, pages 57–60, 2023.
  • [35] A. Goel, A. Gueta, O. Gilon, et al. Llms accelerate annotation for medical information extraction. In ML4H, pages 82–100. PMLR, 2023.
  • [36] Y. Gong, X. Ding, Y. Su, K. Shen, Z. Liu, and G. Zhang. An unified search and recommendation foundation model for cold-start scenario. In CIKM, pages 4595–4601, 2023.
  • [37] T. Goyal, J. J. Li, and G. Durrett. News summarization and evaluation in the era of gpt-3. arXiv preprint arXiv:2209.12356, 2022.
  • [38] K. Hatalis, D. Christou, J. Myers, S. Jones, K. Lambert, A. Amos-Binks, Z. Dannenhauer, and D. Dannenhauer. Memory matters: The need to improve long-term memory in llm-agents. In AAAI 2024 Spring Symposium, volume 2, pages 277–280, 2023.
  • [39] D. Hawking, N. Craswell, P. Bailey, and K. Griffiths. Measuring search engine quality. Information retrieval, 4:33–59, 2001.
  • [40] D. He, A. Kannan, T.-Y. Liu, et al. Scale effects in web search. In WINE, pages 294–310. Springer, 2017.
  • [41] S. Hong, Y. Lin, B. Liu, B. Wu, D. Li, J. Chen, J. Zhang, J. Wang, L. Zhang, M. Zhuge, et al. Data interpreter: An llm agent for data science. arXiv preprint arXiv:2402.18679, 2024.
  • [42] S. Hong, M. Zhuge, et al. Metagpt: Meta programming for multi-agent collaborative framework. In ICLR, 2023.
  • [43] P. Hosseini, D. A. Broniatowski, and M. Diab. Knowledge-augmented language models for cause-effect relation classification. In Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022), pages 43–48, 2022.
  • [44] D. Huang, Z. Wei, A. Yue, X. Zhao, et al. Dsqa-llm: Domain-specific intelligent question answering based on large language model. In AIGC, pages 170–180. Springer, 2023.
  • [45] F. Huang, Z. Yang, J. Jiang, Y. Bei, Y. Zhang, and H. Chen. Large language model interaction simulator for cold-start item recommendation. arXiv preprint arXiv:2402.09176, 2024.
  • [46] J. Jiang and J. Allan. Correlation between system and user metrics in a session. In CHIIR, pages 285–288, 2016.
  • [47] M. A. Kausar, V. Dhaka, and S. K. Singh. Web crawler: a review. International Journal of Computer Applications, 63(2):31–36, 2013.
  • [48] D. Kelly, Y. Chen, S. E. Cornwell, N. S. Delellis, A. Mayhew, S. Onaolapo, and V. L. Rubin. Bing chat: The future of search engines? Proceedings of ASIS&T, 60(1):1007–1009, 2023.
  • [49] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186, 2019.
  • [50] A. Khandelwal, A. Agrawal, A. Bhattacharyya, Y. Kumar, S. Singh, U. Bhattacharya, I. Dasgupta, S. Petrangeli, R. R. Shah, C. Chen, et al. Large content and behavior models to understand, simulate, and optimize content and behavior. In ICLR, 2023.
  • [51] J. Kim, J. Nam, S. Mo, J. Park, S.-W. Lee, M. Seo, J.-W. Ha, and J. Shin. Sure: Improving open-domain question answering of llms via summarized retrieval. In ICLR, 2023.
  • [52] R. Kohavi and R. Longbotham. Online controlled experiments and a/b tests. Encyclopedia of machine learning and data mining, pages 1–11, 2015.
  • [53] M. Kumar, R. Bhatia, and D. Rattan. A survey of web crawlers for information retrieval. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(6):e1218, 2017.
  • [54] Z. Lan, M. Chen, S. Goodman, et al. Albert: A lite bert for self-supervised learning of language representations. In ICLR, 2019.
  • [55] J. Le, A. Edmonds, V. Hester, and L. Biewald. Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution. In SIGIR Workshop, 2010.
  • [56] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.
  • [57] A. Lee, B. Miranda, S. Koyejo, et al. Beyond scale: the diversity coefficient as a data quality metric demonstrates llms are pre-trained on formally diverse data. Signal, page 1, 2020.
  • [58] D. Lewandowski, H. Wahlig, and G. Meyer-Bautor. The freshness of web search engine databases. Journal of information science, 32(2):131–148, 2006.
  • [59] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL, pages 7871–7880, 2020.
  • [60] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. NeurIPS, 33:9459–9474, 2020.
  • [61] H. Li, J. Chen, W. Su, Q. Ai, and Y. Liu. Towards better web search performance: Pre-training, fine-tuning and learning to rank. arXiv preprint arXiv:2303.04710, 2023.
  • [62] J. Li, X. Cheng, W. X. Zhao, J.-Y. Nie, and J.-R. Wen. Helma: A large-scale hallucination evaluation benchmark for large language models. arXiv preprint arXiv:2305.11747, 2023.
  • [63] X. Li, L. Su, P. Jia, X. Zhao, S. Cheng, J. Wang, and D. Yin. Agent4ranking: Semantic robust ranking via personalized query rewriting using multi-agent llm. arXiv preprint arXiv:2312.15450, 2023.
  • [64] X. Li, H. Xiong, X. Li, X. Wu, X. Zhang, J. Liu, J. Bian, and D. Dou. Interpretable deep learning: Interpretation, interpretability, trustworthiness, and beyond. Knowledge and Information Systems, 64(12):3197–3234, 2022.
  • [65] Y. Li. Toward a qualitative search engine. IEEE Internet Computing, 2(4):24–29, 1998.
  • [66] Y. Li, H. Xiong, L. Kong, J. Bian, S. Wang, G. Chen, and D. Yin. Gs2p: a generative pre-trained learning to rank model with over-parameterization for web-scale search. Machine Learning, pages 1–19, 2024.
  • [67] Y. Li, H. Xiong, L. Kong, Q. Wang, S. Wang, G. Chen, and D. Yin. S2phere: Semi-supervised pre-training for web search over heterogeneous learning to rank data. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4437–4448, 2023.
  • [68] Y. Li, H. Xiong, Q. Wang, L. Kong, H. Liu, H. Li, J. Bian, S. Wang, G. Chen, D. Dou, et al. Coltr: Semi-supervised learning to rank with co-training and over-parameterization for web search. TKDE, 2023.
  • [69] Z. Li and H. Ning. Autonomous gis: the next-generation ai-powered gis. arXiv preprint arXiv:2305.06453, 2023.
  • [70] J. Lin, R. Shan, C. Zhu, K. Du, B. Chen, S. Quan, R. Tang, Y. Yu, and W. Zhang. Rella: Retrieval-enhanced large language models for lifelong sequential behavior comprehension in recommendation. arXiv preprint arXiv:2308.11131, 2023.
  • [71] T. Lin, Y. Wang, X. Liu, and X. Qiu. A survey of transformers. AI Open, 3:111–132, 2022.
  • [72] Y.-C. Lin, J. Neville, J. W. Stokes, L. Yang, T. Safavi, M. Wan, S. Counts, S. Suri, R. Andersen, X. Xu, et al. Interpretable user satisfaction estimation for conversational systems with large language models. arXiv preprint arXiv:2403.12388, 2024.
  • [73] C. Ling, X. Zhang, X. Zhao, Y. Wu, Y. Liu, W. Cheng, H. Chen, and L. Zhao. Knowledge-enhanced prompt for open-domain commonsense reasoning. In 1st AAAI Workshop on Uncertainty Reasoning and Quantification in Decision Making, 2023.
  • [74] J. Liu. Llamaindex, 11 2022. URL https://github.com/jerryjliu/llama_index, 2022.
  • [75] J. Liu and B. Mozafari. Query rewriting via large language models. arXiv preprint arXiv:2403.09060, 2024.
  • [76] T.-Y. Liu et al. Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval, 3(3):225–331, 2009.
  • [77] X. Liu, Z. Wu, X. Wu, P. Lu, K.-W. Chang, and Y. Feng. Are llms capable of data-based statistical and causal reasoning? benchmarking advanced quantitative reasoning with data. arXiv preprint arXiv:2402.17644, 2024.
  • [78] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, and J. Tang. Gpt understands, too. AI Open, 2023.
  • [79] Y. Liu, D. Iter, Y. Xu, et al. G-eval: Nlg evaluation using gpt-4 with better human alignment. In EMNLP, 2023.
  • [80] Y. Liu, Y. Yao, J.-F. Ton, X. Zhang, R. Guo, H. Cheng, Y. Klochkov, M. F. Taufiq, and H. Li. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment. In Socially Responsible Language Modelling Research, 2023.
  • [81] Z. Liu, Y. Xu, Y. Xu, Q. Qian, H. Li, X. Ji, A. Chan, and R. Jin. Improved fine-tuning by better leveraging pre-training data. NeurIPS, 35:32568–32581, 2022.
  • [82] C. Lyon and R. H. Rothman. A/b test configuration environment, Dec. 1 2015. US Patent 9,201,572.
  • [83] H. Ma, C. Zhang, Y. Bian, L. Liu, Z. Zhang, P. Zhao, S. Zhang, H. Fu, Q. Hu, and B. Wu. Fairness-guided few-shot prompting for large language models. NeurIPS, 36, 2024.
  • [84] T. Mandl. Implementation and evaluation of a quality-based search engine. In Hypertext, pages 73–84, 2006.
  • [85] F. Maoro, B. Vehmeyer, and M. Geierhos. Leveraging semantic search and llms for domain-adaptive information retrieval. In ICST, pages 148–159. Springer, 2023.
  • [86] R. Y. Maragheh, C. Fang, C. C. Irugu, P. Parikh, J. Cho, J. Xu, S. Sukumar, M. Patel, E. Korpeoglu, S. Kumar, et al. Llm-take: Theme-aware keyword extraction using large language models. In BigData, pages 4318–4324. IEEE, 2023.
  • [87] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5:115–133, 1943.
  • [88] J. McKeeth. Method and system for updating a search engine, July 12 2011. US Patent 7,979,427.
  • [89] L. Meincke, E. R. Mollick, and C. Terwiesch. Prompting diverse ideas: Increasing ai idea variance. arXiv e-prints, pages arXiv–2402, 2024.
  • [90] K. Meng, A. S. Sharma, A. J. Andonian, Y. Belinkov, and D. Bau. Mass-editing memory in a transformer. In The Eleventh ICLR, 2022.
  • [91] H. Mishra and S. Soundarajan. Balancedqr: A framework for balanced query recommendation. In ECML-PKDD, pages 420–435. Springer, 2023.
  • [92] E. Mitchell, C. Lin, A. Bosselut, C. D. Manning, and C. Finn. Memory-based model editing at scale. In ICML, pages 15817–15831. PMLR, 2022.
  • [93] T. Moskovitz, A. K. Singh, D. Strouse, T. Sandholm, R. Salakhutdinov, A. Dragan, and S. M. McAleer. Confronting reward model overoptimization with constrained rlhf. In ICLR, 2023.
  • [94] J. M. Nyce and P. Kahn. From Memex to hypertext: Vannevar Bush and the mind’s machine. Academic Press Professional, Inc., 1991.
  • [95] OpenAI. Gpt-4 technical report, 2023.
  • [96] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. NeurIPS, 35:27730–27744, 2022.
  • [97] M. Patil, S. V. Thankachan, R. Shah, W.-K. Hon, J. S. Vitter, and S. Chandrasekaran. Inverted indexes for phrases and strings. In SIGIR, pages 555–564, 2011.
  • [98] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, H. Alobeidli, A. Cappelli, B. Pannier, E. Almazrouei, and J. Launay. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data only. NeurIPS, 36, 2024.
  • [99] K. Pham, A. Santos, and J. Freire. Learning to discover domain-specific web content. In WSDM, pages 432–440, 2018.
  • [100] R. Puri, R. Spring, M. Shoeybi, M. Patwary, and B. Catanzaro. Training question answering models from synthetic data. In EMNLP, pages 5811–5826, 2020.
  • [101] F. Quin, D. Weyns, M. Galster, and C. C. Silva. A/b testing: a systematic literature review. JSS, page 112011, 2024.
  • [102] P. Raghavan and H. Schütze. Scoring, term weighting and the vector space model, 2008.
  • [103] S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-maron, M. Giménez, Y. Sulsky, J. Kay, J. T. Springenberg, T. Eccles, J. Bruce, A. Razavi, A. Edwards, N. Heess, Y. Chen, R. Hadsell, O. Vinyals, M. Bordbar, and N. de Freitas. A generalist agent. TMLR, 2022.
  • [104] J. Saad-Falcon, O. Khattab, C. Potts, and M. Zaharia. Ares: An automated evaluation framework for retrieval-augmented generation systems. arXiv preprint arXiv:2311.09476, 2023.
  • [105] S. Schultheiß, H. Häußler, and D. Lewandowski. Does search engine optimization come along with high-quality content? a comparison between optimized and non-optimized health-related web pages. In CHIIR, pages 123–134, 2022.
  • [106] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. NeurIPS, 36, 2024.
  • [107] K. Shuster, S. Poff, M. Chen, D. Kiela, and J. Weston. Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567, 2021.
  • [108] T. J. Skluzacek, R. Kumar, R. Chard, G. Harrison, P. Beckman, K. Chard, and I. T. Foster. Skluma: An extensible metadata extraction pipeline for disorganized data. In e-Science, pages 256–266. IEEE, 2018.
  • [109] F. Song, B. Yu, M. Li, et al. Preference ranking optimization for human alignment. AAAI, 2024.
  • [110] M. Speretta and S. Gauch. Personalized search based on user search histories. In WI, pages 622–628. IEEE, 2005.
  • [111] P. Sujatha and P. Dhavachelvan. Precision at k in multilingual information retrieval. Int J Comput Appl, 24:40–43, 2011.
  • [112] W. Sun, L. Yan, X. Ma, S. Wang, P. Ren, Z. Chen, D. Yin, and Z. Ren. Is chatgpt good at search? investigating large language models as re-ranking agents. In EMNLP, pages 14918–14937, 2023.
  • [113] Y. Sun, Y. Zheng, C. Hao, and H. Qiu. Nsp-bert: A prompt-based few-shot learner through an original pre-training task: next sentence prediction. In ACL, pages 3233–3250, 2022.
  • [114] N. Tonellotto, C. Macdonald, I. Ounis, et al. Efficient query processing for scalable web search. Foundations and Trends® in Information Retrieval, 12(4-5):319–500, 2018.
  • [115] P. Törnberg. Best practices for text annotation with large language models. arXiv preprint arXiv:2402.05129, 2024.
  • [116] P. Tucker, A. Singhal, and E. Jackson. Methods and systems for efficient query rewriting, Nov. 23 2010. US Patent 7,840,547.
  • [117] H. Valizadegan, R. Jin, R. Zhang, and J. Mao. Learning to rank by optimizing ndcg measure. NeurIPS, 22, 2009.
  • [118] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. NeurIPS, 30, 2017.
  • [119] T. Vu, M. Iyyer, X. Wang, N. Constant, J. Wei, J. Wei, C. Tar, Y.-H. Sung, D. Zhou, Q. Le, et al. Freshllms: Refreshing large language models with search engine augmentation. arXiv preprint arXiv:2310.03214, 2023.
  • [120] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):1–26, 2024.
  • [121] L. Wang, N. Yang, and F. Wei. Query2doc: Query expansion with large language models. In EMNLP, pages 9414–9423, 2023.
  • [122] P. Wang, K. He, Y. Wang, X. Song, Y. Mou, J. Wang, Y. Xian, X. Cai, and W. Xu. Beyond the known: Investigating llms performance on out-of-domain intent detection. arXiv preprint arXiv:2402.17256, 2024.
  • [123] Q. Wang, H. Li, H. Xiong, W. Wang, J. Bian, Y. Lu, S. Wang, Z. Cheng, D. Dou, and D. Yin. A simple yet effective framework for active learning to rank. Machine Intelligence Research, 21(1):169–183, 2024.
  • [124] X. Wang, L. Wu, L. Hong, H. Liu, and Y. Fu. Llm-enhanced user-item interactions: Leveraging edge information for optimized recommendations. arXiv preprint arXiv:2402.09617, 2024.
  • [125] Y. Wang, Z. Liu, J. Zhang, W. Yao, S. Heinecke, and P. S. Yu. Drdt: Dynamic reflection with divergent thinking for llm-based sequential recommendation. arXiv preprint arXiv:2312.11336, 2023.
  • [126] Y. Wang, L. Wang, Y. Li, D. He, and T.-Y. Liu. A theoretical analysis of ndcg type ranking measures. In COLT, pages 25–54. PMLR, 2013.
  • [127] Z. Wang, S. Cai, G. Chen, A. Liu, X. S. Ma, and Y. Liang. Describe, explain, plan and select: interactive planning with llms enables open-world multi-task agents. NeurIPS, 36, 2023.
  • [128] W. Warner and J. Hirschberg. Detecting hate speech on the world wide web. In Proceedings of the second workshop on language in social media, pages 19–26, 2012.
  • [129] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 35:24824–24837, 2022.
  • [130] T. Werner. A review on instance ranking problems in statistical learning. Machine Learning, 111(2):415–463, 2022.
  • [131] K. Wu, E. Wu, A. Cassasola, A. Zhang, K. Wei, T. Nguyen, S. Riantawan, P. S. Riantawan, D. E. Ho, and J. Zou. How well do llms cite relevant medical references? an evaluation framework and analyses. arXiv preprint arXiv:2402.02008, 2024.
  • [132] L. Wu, Z. Zheng, Z. Qiu, H. Wang, H. Gu, T. Shen, C. Qin, C. Zhu, H. Zhu, Q. Liu, et al. A survey on large language models for recommendation. arXiv preprint arXiv:2305.19860, 2023.
  • [133] T. Wu, B. Zhu, R. Zhang, Z. Wen, K. Ramchandran, and J. Jiao. Pairwise proximal policy optimization: Harnessing relative feedback for llm alignment. arXiv preprint arXiv:2310.00212, 2023.
  • [134] F. Xia, T.-Y. Liu, J. Wang, W. Zhang, and H. Li. Listwise approach to learning to rank: theory and algorithm. In ICML, pages 1192–1199, 2008.
  • [135] L. Xiao and X. Chen. Enhancing llm with evolutionary fine tuning for news summary generation. arXiv preprint arXiv:2307.02839, 2023.
  • [136] S. M. Xie, A. Raghunathan, P. Liang, and T. Ma. An explanation of in-context learning as implicit bayesian inference. In ICLR, 2021.
  • [137] H. Xiong, J. Bian, S. Yang, X. Zhang, L. Kong, and D. Zhang. Natural language based context modeling and reasoning with llms: A tutorial. arXiv preprint arXiv:2309.15074, 2023.
  • [138] H. Xiong, X. Li, X. Zhang, J. Chen, X. Sun, Y. Li, Z. Sun, and M. Du. Towards explainable artificial intelligence (xai): A data mining perspective. arXiv e-prints, arXiv:2401, 2024.
  • [139] H. Yang, S. Yue, and Y. He. Auto-gpt for online decision making: Benchmarks and additional opinions. arXiv preprint arXiv:2306.02224, 2023.
  • [140] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le. Xlnet: Generalized autoregressive pretraining for language understanding. NeurIPS, 32, 2019.
  • [141] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. NeurIPS, 36, 2024.
  • [142] F. Ye, M. Fang, S. Li, and E. Yilmaz. Enhancing conversational search: Large language model-aided informative query rewriting. In EMNLP, 2023.
  • [143] C.-K. Yeh, A. Taly, M. Sundararajan, F. Liu, and P. Ravikumar. First is better than last for language data influence. NeurIPS, 35:32285–32298, 2022.
  • [144] D. Yin, Y. Hu, J. Tang, T. Daly, M. Zhou, H. Ouyang, J. Chen, C. Kang, H. Deng, C. Nobata, et al. Ranking relevance in yahoo search. In SIGKDD, pages 323–332, 2016.
  • [145] L. Yuan, Y. Chen, G. Cui, H. Gao, F. Zou, X. Cheng, H. Ji, Z. Liu, and M. Sun. Revisiting out-of-distribution robustness in nlp: Benchmarks, analysis, and llms evaluations. NeurIPS, 36, 2024.
  • [146] X. Yuan, T. Wang, Y.-H. Wang, E. Fine, R. Abdelghani, H. Sauzéon, and P.-Y. Oudeyer. Selecting better samples from pre-trained llms: A case study on question generation. In Findings of the ACL, pages 12952–12965, 2023.
  • [147] H. Zhang, X. Liu, and J. Zhang. Summit: Iterative text summarization via chatgpt. In EMNLP, 2023.
  • [148] Z. Zhang, C. Zhang, N. Liu, S. Qi, Z. Rong, S.-C. Zhu, S. Cui, and Y. Yang. Heterogeneous value alignment evaluation for large language models. In AAAI-2024 Workshop on Public Sector LLMs: Algorithmic and Sociotechnical Design, 2024.
  • [149] F. Zhao, F. Yu, T. Trull, and Y. Shang. A new method using llms for keypoints generation in qualitative data analysis. In CAI, pages 333–334. IEEE, 2023.
  • [150] H. Zhao, H. Chen, F. Yang, N. Liu, H. Deng, H. Cai, S. Wang, D. Yin, and M. Du. Explainability for large language models: A survey. TIST, 2023.
  • [151] W. X. Zhao, J. Liu, R. Ren, and J.-R. Wen. Dense text retrieval based on pretrained language models: A survey. TOIS, 42(4):1–60, 2024.
  • [152] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
  • [153] Q. Zhou, N. Yang, F. Wei, C. Tan, H. Bao, and M. Zhou. Neural question generation from text: A preliminary study. In NLPCC, pages 662–671. Springer, 2018.
  • [154] Y. Zhou, J. Hao, M. Rungta, Y. Liu, E. Cho, X. Fan, Y. Lu, V. Vasudevan, K. Gillespie, and Z. Raeesy. Unified contextual query rewriting. In ACL, pages 608–615, 2023.