Is It Really Long Context if All You Need Is Retrieval?
Towards Genuinely Difficult Long Context NLP

Omer Goldman*, Alon Jacovi*, Aviv Slobodkin*,
Aviya Maimon, Ido Dagan, Reut Tsarfaty
Bar-Ilan University
omer.goldman@gmail.com
* Equal contribution
Abstract

Improvements in language models’ capabilities have pushed their applications towards longer contexts, making long-context evaluation and development an active research area. However, many disparate use-cases are grouped together under the umbrella term of “long-context”, defined simply by the total length of the model’s input, including – for example – Needle-in-a-Haystack tasks, book summarization, and information aggregation. Given their varied difficulty, in this position paper we argue that conflating different tasks by their context length is unproductive. As a community, we require a more precise vocabulary to understand what makes long-context tasks similar or different. We propose to unpack the taxonomy of long-context tasks based on the properties that make them more difficult with longer contexts. We propose two orthogonal axes of difficulty: (I) Diffusion: How hard is it to find the necessary information in the context? (II) Scope: How much necessary information is there to find? We survey the literature on long-context, provide justification for this taxonomy as an informative descriptor, and situate the literature with respect to it. We conclude that the most difficult and interesting settings, whose necessary information is very long and highly diffused within the input, are severely under-explored. By using a descriptive vocabulary and discussing the relevant properties of difficulty in long-context, we can implement more informed research in this area. We call for a careful design of tasks and benchmarks with distinctly long context, taking into account the characteristics that make it qualitatively different from shorter context.

1 Introduction

Figure 1: A taxonomy of long-context tasks based on the distribution of the needed information in the text. Tasks with larger scope and higher diffusion are more difficult and more indicative of LLMs’ long-context capabilities.

The ability to deal with ever-longer contexts has been one of the most notable trends among the emerging capabilities of large language models (LLMs). Starting with a few hundred tokens as the maximal input length of the first attention-based LLMs Devlin et al. (2019); Raffel et al. (2020), contemporary models are – technically – able to process up to 128k and even 1M tokens (Gemini Team Google, 2024; OpenAI, 2024).

The demand to evaluate LLMs in this setting has led to a line of research on designing long-context tasks and benchmarks, in order to systematically understand models’ capabilities and drive their development. However, the field generally relies on a single recurring descriptor to define such measurements: the length of the context. For example, long-context benchmarks group tasks mostly by length in words (e.g., Shaham et al., 2022; Bai et al., 2023; Zhang et al., 2024). This leads to qualitatively different measurements being conflated, with conclusions about long-context capabilities being extended from one class of tasks to others. The community is, of course, aware that, for example, tasks which require a small part of the input are different from tasks that require a large part of it. But we ask the more general question: What are the properties that differentiate tasks when conditioned on their context length? What can we accomplish with such a distinction?

In this position paper, we claim that the current landscape of works on long-context evaluation will greatly benefit from a more fine-grained characterization of long-context task design. We argue that judging LLMs by their ability to process long sequences, while disregarding the task they process them for, overlooks the characteristics that make longer inputs more difficult, and interesting to research, to begin with (§2).

For example, Needle in a Haystack tasks (NIAH; Ivgi et al., 2023; Mohtashami and Jaggi, 2023) involve queries whose main challenge is finding the relevant information in a long context, without requiring much further processing. Synthetic NIAH datasets are, of course, easier than their natural equivalents (Ivgi et al., 2023), but the “natural vs. artificial” classification is not informative in our setting, since it applies equally to tasks regardless of context length. What, then, is an informative property? What makes long-context tasks different from each other? Consider, for example, multiple-needle variants of NIAH (Hsieh et al., 2024), or variants that position the “needles” closer together or farther apart (Levy et al., 2024): the input length alone does not distinguish them. Evidently, “the number of tokens in the input” is not a sufficient descriptor.

To resolve this roadblock, we present a taxonomy of long-context tasks for the different factors that make them harder when controlling for context length (§3). This taxonomy is derived by surveying the long-context literature and surfacing the most salient points of distinction between various tasks. We focus on (I) how difficult it is to find and extract the required information from the input (its diffusion in the input), and (II) the absolute quantity of required information to solve the task (its scope). See Figure 1 for a summary.

To understand this categorization and its utility, we review the literature on long-context evaluation and position the works with respect to those factors. We find that the most challenging setting, in which a large quantity of required information is present in a diffused manner that is difficult to extract, is significantly under-explored (§4).

Finally, acknowledging the inherent and legitimate reasons behind the focus on context length as the sole descriptor of difficulty, we discuss possible paths forward for designing more reliable measurements of long-context capabilities when utilizing a more nuanced vocabulary (§5).

2 Task Design in Long Context

Evaluating the performance of NLP models over very long contexts is a fast-changing area of research Bishop et al. (2024); Wu et al. (2024). Measurements are regularly updated to account for new capabilities which improve with extrapolation architectures Vaswani et al. (2017); Su et al. (2024) and training data He et al. (2023); Chen et al. (2023). Evaluators were tasked with designing measurements of long-context capabilities cheaply, efficiently, and quickly, while matching realistic use-cases as much as possible. The most common way of differentiating long-context tasks, besides the context’s length, is whether they are naturally-constructed or synthetically-constructed Tay et al. (2020); Bai et al. (2023); Hsieh et al. (2024).

Natural construction.

A simple yet effective way of “moving the goalpost” for context length is by modeling long-context tasks based on short-context tasks. This was done, for example, with QA (Kočiský et al., 2018, cf. Dunn et al., 2017), summarization (Huang et al., 2021, cf. Narayan et al., 2018), and NLI (Koreeda and Manning, 2021, cf. Williams et al., 2018). Specialized domains like legal Bruno and Roth (2022); Nguyen et al. (2024) and literature (Wang et al., 2022; Kryscinski et al., 2022) often involve longer texts, turning typically short-context tasks such as QA and NLI into long-context scenarios. Another more native methodology is to create new tasks which inherently require a long context, such as multi-document summarization Fabbri et al. (2019); Angelidis et al. (2021), survey generation Gao et al. (2024), and structured data aggregation Caciularu et al. (2024). Both methodologies share the constraint that, due to their natural construction (i.e., using natural text), once created, they are difficult to modify for longer contexts as models’ long-context capabilities improve.

Synthetic construction.

A more flexible approach, sacrificing natural construction for length control, is to use distractors to synthetically increase the context length. This method allows for cheap and efficient (in terms of task construction cost) evaluation of models’ full context length capabilities, with difficulty adjusted by controlling the distractors. Tasks like Needle-in-a-Haystack (NIAH; Ivgi et al., 2023; Kamradt, 2023) and PassKey retrieval (Mohtashami and Jaggi, 2023) were created to evaluate a model’s ability to pinpoint specific information amid lengthy distractors. Flexible and effective against existing models, they became standard benchmarks for evaluating new long-context models (GLM Team, 2024; Jiang et al., 2024). Followup studies have complicated these tasks by increasing the number of critical details to locate (Arora et al., 2023; Liu et al., 2024a) and changing their position within the input Liu et al. (2024b); Levy et al. (2024).
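
As a rough illustration of the synthetic construction described above, the following Python sketch places a single “needle” sentence at a chosen relative depth within a list of distractor sentences. The function name `build_niah_example`, the filler sentences, and the pass-key string are all hypothetical and are not taken from any of the cited benchmarks, which may differ in how the haystack is sourced and how length is measured.

```python
import random


def build_niah_example(needle, distractors, depth, rng=None):
    """Insert `needle` among shuffled distractor sentences.

    `depth` is the relative insertion position in [0, 1]:
    0 puts the needle first, 1 puts it last.
    Returns the joined context and the sentence index of the needle.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    haystack = list(distractors)   # copy so the caller's list is untouched
    rng.shuffle(haystack)
    position = round(depth * len(haystack))
    haystack.insert(position, needle)
    return " ".join(haystack), position


# Hypothetical data: 1000 filler sentences and one pass-key needle.
distractors = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The pass key is 71432."
context, position = build_niah_example(needle, distractors, depth=0.5)
```

Context length is controlled by the number of distractors and needle position by `depth`, which is what makes this kind of task cheap to scale as context windows grow.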

Limitations of the status quo.

NIAH-like tasks aim to assess information retrieval capabilities, yet many “naturally constructed” QA and reading-comprehension tasks with trivial questions about a long context accomplish the same goal. At the same time, “multiple needles” NIAH can increase difficulty not by increasing the quantity of needles or length of input, but by adding distractors between needles Levy et al. (2024). What can systematically explain the different variables at play, in order to inform better task design in the future?

Clearly, there are underlying qualitative differences that discriminate between these various tasks besides their natural and synthetic constructions, and besides their actual context length. Therefore, we require a more informative vocabulary to discuss the goals of each task design, what it accomplishes, and what it does not, in terms of measuring long-context capabilities.

3 What Makes Long Context More than Retrieval?

We require a taxonomy to capture task difficulty variations beyond mere “number of tokens”. We focus on the information that is canonically required to solve the task as the conditioning variable. Our classification can be summarized via the following two questions, when asked about a given task:

(I) How difficult is it to find and extract the required information?

(II) How much information is needed to be found?

For instance, consider the task of summarizing a book, in comparison to a NIAH task of identifying a numerical detail in a long financial report (e.g., “how much did the company earn in 2015?”). Although both tasks involve long texts, information requirement and accessibility vary significantly. The NIAH task focuses on localized, identifiable information, while summarization requires extracting key details dispersed throughout the text, tangled together with irrelevant content. Therefore, we can say that the book summarization task is more difficult on both axes (I) and (II).

Below we give more formal descriptions of the two axes characterized by the questions above.

(I) Diffusion.

Although the question above intuitively defines “difficulty of information finding”, we offer a more concrete description. Between two similar tasks, we consider the information harder to find in one task compared to another if: (1) it is more obscured (e.g., linguistically, semantically, contextually, etc.); (2) it is more sparse, such that it is interspersed with non-required information; (3) its indicators are less redundant, such that there are fewer places where the same information is available.

(II) Scope.

The property of scope is simpler, and refers to the minimal quantity of information needed to solve the task. Importantly, we are not concerned with a precise metric for “quantity of information” at this stage – it can refer to quantity of tokens, sentences, relations, cells in a table, etc. Any metric that reliably captures difficulty for an established solver is sufficient for our purposes.
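
The taxonomy deliberately leaves both axes without a fixed metric. Purely to make the definitions tangible, the Python sketch below shows one possible toy instantiation, assuming a document split into sentences and a hypothetical annotation `required` listing the indices of the sentences needed to solve the task. Scope is approximated by the number of required sentences, and only the sparsity criterion (2) of diffusion is approximated, as the fraction of non-required sentences lying between the first and last required ones. Obscurity (1) and redundancy (3) are not captured, and the function names are illustrative assumptions.

```python
def scope_size(required):
    """Toy proxy for scope: the number of required sentences."""
    return len(required)


def sparsity(required):
    """Toy proxy for diffusion criterion (2).

    Fraction of non-required sentences inside the span that runs
    from the first required sentence to the last one.
    """
    if not required:
        raise ValueError("at least one required sentence is needed")
    span = max(required) - min(required) + 1
    return 1 - len(required) / span


# Hypothetical annotations over a 100-sentence document.
single_fact = {3}                  # one localized snippet
scattered_facts = {3, 40, 97}      # few snippets, far apart
whole_document = set(range(100))   # every sentence is needed

for name, req in [("single fact", single_fact),
                  ("scattered facts", scattered_facts),
                  ("whole document", whole_document)]:
    print(f"{name}: scope={scope_size(req)}, sparsity={sparsity(req):.2f}")
```

Under these proxies the whole-document case has the largest scope but zero sparsity, while the scattered case has a small scope and high sparsity, which is one way to see that the two axes can vary independently.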

Illustrative example.

To illustrate, consider the Wikipedia entry for New York City and a simple question: “What is the estimated population of the city?” Since the answer needs a small snippet of information, we say that the task has small scope. And since it is easily accessible, we say that it has low diffusion. Consider, instead, the question “How many syllables are in this document?” Since this question requires the entire document to answer, we say that it has large scope, but if we consider counting syllables as straightforward, then we say its diffusion is still low. Finally, consider the question “Was the city’s mayor elected before or after the city was affected by Hurricane Sandy?” Since it requires snippets from two different areas of the text, its diffusion is higher than that of the question about the city’s population, but not as high as that of the question “What makes the city a prominent place on the world stage?”, which poses a challenge on both axes.
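
The four questions above can be placed on the two axes with coarse ordinal labels, as in the following Python sketch. The `Level` scale, the `Task` record, and the specific labels assigned to each question are our reading of the example for illustration only, not measured values. The comparison function encodes a partial order: one task is at least as hard as another only if it is at least as high on both axes.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Coarse, hypothetical ordinal scale for either axis."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2


@dataclass(frozen=True)
class Task:
    question: str
    scope: Level
    diffusion: Level


def at_least_as_hard(a, b):
    """True if task `a` is at least as high as `b` on both axes."""
    return a.scope >= b.scope and a.diffusion >= b.diffusion


# Labels below are illustrative assumptions based on the example.
population = Task("What is the estimated population of the city?",
                  scope=Level.LOW, diffusion=Level.LOW)
syllables = Task("How many syllables are in this document?",
                 scope=Level.HIGH, diffusion=Level.LOW)
mayor = Task("Was the city's mayor elected before or after the city "
             "was affected by Hurricane Sandy?",
             scope=Level.LOW, diffusion=Level.MEDIUM)
world_stage = Task("What makes the city a prominent place on the world stage?",
                   scope=Level.HIGH, diffusion=Level.HIGH)
```

With these labels, the syllable-counting and mayor questions are incomparable, since each is higher on a different axis, whereas the world-stage question is at least as hard as all of the others.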

4 Challenging Long Context Is Under-Explored

The works surveyed in §2 clearly differ with respect to both scope and diffusion.

With respect to diffusion.

The information needed for tasks ranges from easily accessible to highly diffused and difficult to detect. At the low-diffusion end are NIAH (Kamradt, 2023; Mohtashami and Jaggi, 2023) and a myriad of factual single-hop QA datasets (Tseng et al., 2016; Kočiský et al., 2017; Kwiatkowski et al., 2019; Dasigi et al., 2021, inter alia) in which the answer is relatively accessible. Adding more snippets of information separated by distractors, either in the form of several needles (Arora et al., 2023; Hsieh et al., 2024) or of hops in a multi-hop question (Trivedi et al., 2022; Zhao et al., 2022), complicates the information detection due to the need to find at least two snippets (Levy et al., 2024), thereby increasing diffusion. Diffusion can also be increased by making the detection of the information less straightforward (e.g., Pang et al., 2022) or by requiring aggregation (Shaham et al., 2023). Lastly, summarization tasks have very high diffusion (Huang et al., 2021; Wang et al., 2022), as they require the non-trivial detection of salient facts that are interwoven with irrelevant text.

With respect to scope.

Tasks overwhelmingly target relatively small scope. In addition to the aforementioned NIAH tasks and their variants, many QA datasets apply as well (Li et al., 2023; Zhao et al., 2023; Reddy et al., 2024, inter alia). A somewhat higher scope is achieved by datasets for query-based summarization Zhong et al. (2021); Wang et al. (2022), and QA datasets with more obfuscated answers that require reading the text surrounding the answer for its verification An et al. (2023); He et al. (2023). Although much higher on the scope ladder, book summarization is still limited in its scope as datasets include texts that are only of up to 20k tokens Huang et al. (2021); Chen et al. (2022); Shaham et al. (2023). Currently, tasks with the highest scope, requiring information across the entire input length, are artificial and of low diffusion, like common words extraction Hsieh et al. (2024).

Conclusion.

We conclude that (1) the majority of tasks designed to challenge LLMs in a long-context setting target either scope or diffusion, such that (2) tasks that push current models’ capabilities on both axes are under-represented in the current landscape.

5 Discussion: Towards Genuinely Difficult Long-Context Task Design

Challenges.

Designing meaningful long-context tasks amidst rapid model progress is profoundly challenging, making the deficiency in tasks representing difficulty on both the diffusion and scope axes unsurprising. One source of this challenge is the lack of diverse, coherent long texts, as models’ context windows can now be comparable to the length of the New Testament (www.readinglength.com/book/isbn-0190909005) and the Odyssey (www.readinglength.com/book/isbn-0140268863). The methodologies discussed in §2 for creating long context tasks – lengthening short context tasks and synthetically creating length-adjustable tasks – are preferred for their straightforward definition and the incremental adjustments they require for existing data. They rely on the common understanding of machine comprehension as formulated with short context in mind Dunietz et al. (2020), and therefore they are intuitive and easy to comprehend for NLP researchers without domain expertise (e.g., in law or biomedical fields that have long contexts).

Future work.

The goals laid out in this work are clear: For more durable and robust measurement of long-context capabilities, we must design tasks that explicitly target both the diffusion and scope capabilities of models. How can this be achieved? As mentioned, one possible avenue is to rely more on tasks that require domain expertise, such as legal documents (Bruno and Roth, 2022), financial reports (Reddy et al., 2024), biomedical publications (Stylianou et al., 2021), and so on. In specialized domains, it is common that diffusion will be naturally higher (Zhao et al., 2022). Tasks that involve implicit aggregations over structured data, such as table manipulation (Caciularu et al., 2024), are possible avenues for increasing both scope and diffusion synthetically by leveraging the data structure. In this work, we argue that an explicit vocabulary for such properties of difficulty is what can enable more informed techniques to design difficult tasks in the future.

6 Conclusions

We present a taxonomy of factors that make long-context tasks more challenging compared to short ones. This is in contrast with the existing literature that refers only to the length of the input as the hallmark of long context, and as a result ends up conflating tasks of different character when assessing the ability of models to understand longer text. We reviewed works on evaluation in a long-context setting and found that the most challenging setting, in which the information needed is of large scope and is highly diffused within the input, is under-explored. Finally, we suggested some leads for future work to tackle this imbalance towards a more informative long context evaluation.

7 Limitations

Formality.

In the context of this work, we have deliberately adhered to a taxonomy based on an intuitive description, in the interest of utility to a wide diversity of research and flexibility for future work. Difficulty in searching for and extracting information, and quantity of information, are both vague terms that can only be grounded in the context of a specific family of tasks and use-cases. We intend for this work to serve as a call to action and a tool for a shared vocabulary in the interest of more informed long-context task design in the future, rather than to anchor the taxonomy to a specific and fragile point in time.

Retrieval is still interesting.

Although we argue that small-scope and low-diffusion tasks are the least indicative of a model’s long-context capabilities, tasks that are well-served by implicit retrieval or by traditional retrieval-based pipelines are certainly relevant and useful in a variety of common use-cases (Stylianou et al., 2021; Bruno and Roth, 2022; Gao et al., 2023).

Other uses for a long-context window.

This paper deals only with long contexts that serve as inputs to a task. A long context can, of course, serve other purposes as well, such as containing many in-context examples (Bertsch et al., 2024) or containing other modalities and structures (Jiang et al., 2023).

References