WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models
Authors:
Youssef Benchekroun,
Megi Dervishi,
Mark Ibrahim,
Jean-Baptiste Gaya,
Xavier Martinet,
Grégoire Mialon,
Thomas Scialom,
Emmanuel Dupoux,
Dieuwke Hupkes,
Pascal Vincent
Abstract:
We propose WorldSense, a benchmark designed to assess the extent to which LLMs are consistently able to sustain tacit world models, by testing how they draw simple inferences from descriptions of simple arrangements of entities. WorldSense is a synthetic benchmark with three problem types, each with its own trivial control, which explicitly avoids bias by decorrelating the abstract structure of problems from the vocabulary and expressions, and by decorrelating all problem subparts with the correct response. We run our benchmark on three state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat) and show that these models make errors even with as few as three objects. Furthermore, they have quite heavy response biases, preferring certain responses irrespective of the question. Errors persist even with chain-of-thought prompting and in-context learning. Lastly, we show that while finetuning on similar problems does result in substantial improvements -- within- and out-of-distribution -- the finetuned models do not generalise beyond a constrained problem space.
Submitted 27 November, 2023;
originally announced November 2023.
Near-Optimal Pool Testing under Urgency Constraints
Authors:
Éric Brier,
Megi Dervishi,
Rémi Géraud-Stewart,
David Naccache,
Ofer Yifrach-Stav
Abstract:
Detection of rare traits or diseases in a large population is challenging. Pool testing allows covering larger swathes of population at a reduced cost, while simplifying logistics. However, testing precision decreases as it becomes unclear which member of a pool made the global test positive.
In this paper we discuss testing strategies that provably approach the best possible strategy, which is optimal in the sense that no other strategy can give exact results with fewer tests. Our algorithms guarantee a complete and exact result for every individual, without exceeding $1/0.99$ times the number of tests the optimal strategy would require.
This threshold is arbitrary: algorithms closer to the optimal bound can be described, but their complexity increases, making them less practical.
Moreover, the way the algorithms process input samples means that the status of some individuals is known sooner, which makes it possible to take urgency into account when assigning individuals to tests.
Submitted 21 June, 2021;
originally announced June 2021.
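To make the pool-testing idea in the abstract above concrete, here is a minimal sketch of adaptive pooling by recursive halving. It is not the near-optimal algorithm the paper describes, and it makes no claim about the $1/0.99$ bound; it only illustrates how a positive pooled test can be resolved to exact individual results. The names `PoolTester` and `find_positives` are hypothetical, and the sketch assumes noiseless tests (a pool is positive exactly when it contains at least one positive individual).

```python
from typing import List, Sequence, Set


class PoolTester:
    """Hypothetical noiseless pooled test that counts how often it is used."""

    def __init__(self, positives: Set[int]) -> None:
        self.positives = set(positives)
        self.tests_run = 0

    def __call__(self, pool: Sequence[int]) -> bool:
        # One physical test on the mixed samples of everyone in the pool.
        self.tests_run += 1
        return any(individual in self.positives for individual in pool)


def find_positives(pool: Sequence[int], pool_test: PoolTester) -> List[int]:
    """Return every positive individual in `pool` by recursive halving."""
    if not pool:
        return []
    if not pool_test(pool):
        # A negative pooled test clears every member at once.
        return []
    if len(pool) == 1:
        # A positive pool of size one identifies the individual exactly.
        return [pool[0]]
    mid = len(pool) // 2
    # The left half is resolved first, so its members learn their status
    # sooner; ordering the pool by urgency exploits this.
    return find_positives(pool[:mid], pool_test) + find_positives(pool[mid:], pool_test)


if __name__ == "__main__":
    tester = PoolTester({5})
    print(find_positives(list(range(16)), tester), tester.tests_run)
```

With a rare trait, most sub-pools test negative and are cleared in a single test, which is why pooling can use fewer tests than testing each individual separately. Placing urgent individuals at the front of the pool means their results are settled earlier in the recursion.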