white bar graph with a magnifying glass icon on a green gradient background

Benchmarks, Evaluation and Measurement

Academic research plays such an important role in advancing science, technology, culture, and society. This grant program helps ensure this community has access to the latest and leading AI models.
Brad Smith, Vice Chair and President

green icon of a person standing on a circle with four smaller circles connected

AFMR Goal: Align AI with shared human goals, values, and preferences via research on models

which enhances safety, robustness, sustainability, responsibility, and transparency, while ensuring rapid progress can be measured via new evaluation methods

Evaluating the functionality, efficiency, and reliability of language models is the main theme of this group of research projects. They cover various scenarios and applications, such as assessing the models’ comprehension, processing, and generation of better responses, with topics such as uncertainty quantification, abstraction & reasoning, knowledge distillation, structured pruning, and skills-based framework; developing the models’ instruction-following ability, task-agnostic distillation, and sequential planning skills.

Stanford University: Christopher Ré (PI)

This proposal suggests utilizing a skills-based framework to understand how foundation models acquire different capabilities from training data. This framework will then be used to select and order data to improve the performance of these models. The central question of the research is how best to understand the properties of skills in terms of scaling, data, and model architecture in order to develop a skills-based training paradigm for foundation models.
University of California, Los Angeles: Hongjing Lu (PI)

This research proposal aims to evaluate and improve the reasoning capacities of large-scale AI systems using a cognitive science approach. The project seeks to evaluate these systems in three domains. First, multimodal vision-and-text models (GPT-4) will be evaluated on a series of visual reasoning tasks, to assess the extent to which they can reason about objects and visual relations. Second, generative text-to-image models (Dall-E 3) will be evaluated in tasks that require relational and compositional image generation. Third, language models (GPT-3 and GPT-4) will be evaluated for their ability to perform physical reasoning and problem solving. Finally, in addition to evaluating these capacities, the project also seeks to improve the reasoning and planning abilities of these systems through the development of modular architectures. A significant aspect of this research lies in its dependency on Microsoft Azure services, including language models and generative text-to-image models, as well as its commitment to releasing open-source benchmarks associated with these projects.
University of California, Berkeley: Ion Stoica (PI)

This proposal seeks to develop a scalable, automatic approach for evaluating Foundation Models, specifically LLMs, on open-ended tasks. It proposes using ‘judge’ LLMs, like GPT-4, to assess the quality of AI model responses. It aims to conduct both controlled and crowd-sourced experiments to develop a comprehensive benchmark for evaluation.

Related paper:
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (opens in new tab)
Pennsylvania State University: Qingyun Wu (PI)

In this project, we aim to collaborate with researchers from Microsoft Research on the topic of AI agent evaluation and benchmarking. Recognizing the limitations of current benchmarks in providing a holistic evaluation of agent-based systems, we aim to develop: (1) a comprehensive test suite for debugging and evaluating agent-based systems; (2) construct novel agent-centric benchmarking datasets, and relevant evaluation metrics; and (3) deliver an evaluation framework for AutoGen around (1) and (2), providing an in-depth understanding of AI agent-based systems.
University of Massachusetts Amherst: Chuang Gan (PI)

The proposal introduces innovative techniques to make Large Language Models more efficient and reliable, tackling the problem of their high resource consumption and aligning them better with human values. Advancements include the development of a novel Sparse Transformer incorporating the Mixture of Experts (MoE) and the implementation of a Principle-Driven Self-Alignment approach that reduces the need for extensive human annotations
University of Darmstadt: Iryna Gurevych (PI)

Gain a comprehensive understanding of foundation models by conducting a comparative analysis of models of various sizes and exploring their reasoning, creative potential, and adaptability to new tasks. Our team will focus on evaluating self-explanations and emergent capabilities, including reasoning and creativity, through comparative studies and the design of novel datasets.

Related paper:
- Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs (opens in new tab)
Mohamed bin Zayed University of Artificial Intelligence: Monojit Choudhury (PI)

This research aims to address the cultural bias and awareness of Large Language Models (LLMs), which have been shown to predominantly favor Western culture and values. Current probing techniques relying on blackbox prompting are sensitive to the prompt and are limited to the study of objectives and values across cultures, where data is readily available. This proposal presents two objectives to address these concerns: (1) the development of a systematic probing technique that utilizes the internal states of the models (if available) or constructs and analyzes large-scale response matrices; and (2) creation of large-scale datasets that enable LLMs probing for cultural ‘common ground’ and ‘aboutness’. The research strives to evaluate various popular LLMs using developed techniques and datasets. The proposed outcomes of the project include a multicultural and multilingual dataset assessing cultural aspects, methods and tools for systematic cultural study, and a comprehensive report on cultural awareness and bias in popular LLMs.

Related paper:
- “They are uncultured”: Unveiling Covert Harms and Social Threats in LLM Generated Conversations (opens in new tab)
University of Cambridge: Adrian Weller (PI)

Humans are very effective in modeling functions intuitively such as predicting the trajectory of a ball. Furthermore, humans have good intuition for what types of function forms are common in the real world. Do Large Language Models (LLMs), that are trained with massive human knowledge, have similar abilities? In particular, can LLMs model real-world functions accurately without gradient-based learning, utilizing the prior knowledge learned from the internet? One conjecture is that LLMs can acquire the capability to simulate the output of such systems without going through explicit computations. This is akin to how humans can predict (albeit with error) the output of such systems when modeling simple functions. In this research, we design a new evaluation paradigm to evaluate the ability of LLMs to model functions like humans. This will help to advance the development and the application of LLMs.

Highlights: Using carefully designed evaluation methods, we validate that LLMs indeed possess strong intuitions about real-world functions and can apply this understanding to downstream tasks. Specifically, we discover that LLM can not only identify a wide range of patterns in the data but also leverage domain knowledge to model the underlying function, all without necessitating gradient-based learning or precise computations. In settings where domain knowledge is pivotal or the data is scarce, LLMs can even outperform commonly-used machine learning models. At the same time, we also identify several drawbacks of LLMs for function modelling tasks. Our research highlights both the potentials and limitations of LLMs for data science.
Stanford University: Diyi Yang (PI)

Large Language Models (LLMs) have recently achieved unprecedented performance across diverse tasks. By pinpointing gaps for improvements, evaluation becomes the bedrock that enhances the understanding of current models and ensures AI’s continued progression. Efforts to evaluate LLMs have intensified significantly. Despite the proliferation of LLMs evaluations, current evaluation benchmarks face multiple fundamental challenges such as static datasets and data contamination issues. This work investigates two main research questions: How can we dynamically evaluate the performance of LLMs in diverse domains with different levels of complexities? How can we generate evaluation protocols on the fly to support such dynamic evaluation? Through multiple dynamic evaluation thrusts, this work aims to mitigate potential data contamination issues and provide a holistic, new methodology for dynamic evaluation.
Stanford University: Sanmi Koyejo (PI)

The proposal outlines a project aimed at enhancing the robustness of foundation models through innovative, human-centric comparative oversight mechanisms. It addresses the limitations of current human feedback mechanisms, which often rely on simplistic binary judgments. Instead, it proposes a more nuanced approach called the “oversight agreement mechanism.” This mechanism introduces an additional layer of human evaluators who assess the agreement between the original task outputs, allowing for richer, natural language assessments. The project will use resources such as foundation models (GPT-3 and GPT-4 on Azure) to generate samples for comparative assessment. The outcomes will be evaluated based on agreement rates and predictive power, with successful outcomes expected to improve the models’ accuracy, fairness, and interpretability. Additionally, the project is anticipated to provide insights into the differences between human and model-generated oversight. The potential impact of this project extends to establishing a new paradigm for human-AI collaboration where AI systems are critiqued and refined based on meaningful human oversight, making them more responsive to human values.
University of California, Santa Cruz: Cihang Xie (PI)

This research aims at understanding how reliably LLMs can follow given algorithmic procedures and the effects of adversarial or corrupted prompts on this capability. The project will design environments where AI agents perform multi-step planning and reasoning to accomplish objectives, allowing for a quantitative measure of the reliability of LLMs in real-world applications. The proposed research will also investigate factors impacting the LLMs ability to follow algorithmic procedures.

Related paper:
- AQA-Bench: An Interactive Benchmark for Evaluating LLMs’ Sequential Reasoning Ability (opens in new tab)
Cornell University: Aditya Vashistha (PI)

The project aims to investigate the cultural biases and shortcomings associated with large language models (LLMs) and text-to-image models (T2I models). It intends to probe aspects of non-Western cultures that are overrepresented or underrepresented in these models, and understand the potential harms of lack of cultural representation on users. The proposed research will primarily focus on analyzing these AI technologies’ understanding of Indian culture, in terms of performance in identifying cultural artifacts and norms. In addition, it will evaluate how underrepresentation of non-Western cultures impacts users, through real-world tasks done by users with and without the aid of AI technologies. This project requests USD 30,000 in Azure credits and full access to Azure OpenAI Services and Azure Cognitive Services, and plans to release a public dataset of expanded cultural artifacts and a manuscript for CSCW, a top HCI journal.
University of Arizona: Eduardo Blanco (PI)

This proposal focuses on devising new methods for evaluating foundation models that consider linguistic capabilities rather than specific task benchmarks. Key aspects will include considering the effect of negation and factuality, augmenting existing benchmarks with these linguistic phenomena, and considering alternative solutions.
Santa Fe Institute: Melanie Mitchell (PI)

This proposal focuses on a research to systematically evaluate the performance of GPT-4V. The project aims to test the model on existing benchmarks involving planning and abstract reasoning provided in both text-only and vision-augmented formats. The tests involve assessing vision-augmented planning using spatial maps and vision-augmented abstract reasoning via the ConceptARC benchmark. The researchers anticipate that the study will yield a comprehensive paper discussing the benchmarks and performance results, and an open-source toolbox with these tasks. They also aim to compare the results of the tests involving GPT-4V with human performance to provide baseline data for future research.
University of Waterloo: Charles Clarke (PI)

The research proposes the usage of Microsoft’s AgentEval framework to evaluate retrieval augmented generation (RAG) systems. With an increase in search systems incorporating RAG approaches, which devise entirely new responses, the research identifies the need for new assessment methodologies. AgentEval automates task-specific criteria formulation and evaluation, with two agents, the Critic and the Quantifier, defining and applying the criteria respectively in a RAG setting. The research aims to address RAG’s unique challenges with comprehensive evaluation adapting to generative responses’ dynamic nature. It also seeks to release the methodologies via AutoGen and submit academic papers.
Universitat Politècnica de València: Jose Hernandez-Orallo (PI)

The project will address the challenge of low predictability in foundation models. The overarching aim is to measure their validity (accuracy, originality, non-toxicity, fairness, etc.) both prior to and after running each instance and compare the results with human expectation metrics. The proposed methods include the use of smaller models to predict the validity of larger ones and build scalable oversight mechanisms. The study will particularly focus on setting up predictive demands at instance level and assessing the resultant validity predictors. The outcome will include new methodologies, actual monitors for existing foundation models, and benchmarks comparing human annotations/predictions with those of several models.
Michigan State University: Jiliang Tang (PI)

The research proposal intends to evaluate Large Language Models’ (LLMs) expressive power in graph-related tasks. It aims to conduct a theoretical analysis of LLM predictors to understand their expressive potential for graph tasks, followed by empirical experiments which will assess their practical expressiveness. Additionally, the research will explore the impact of techniques like prompt-tuning, fine-tuning, and retrieval-augmented generation (RAG) on enhancing the capabilities of LLM predictors. The research will particularly focus on testing permutation equivariance to check if LLMs can maintain consistent outputs when the nodes of a given graph are permuted. The results will be provided in a comprehensive evaluation report, detailing the strengths and weaknesses of existing LLM predictors.
Duke University: Neil Gong (PI)

This proposal aims to bridge the gap that the literature lacks a systematic evaluation and understanding of the robustness of LLMs against prompt injection. Prompt injection aims to perturb a prompt such that an LLM performs the intended task incorrectly or performs a different task. The team will first formalize prompt injection to LLMs. Based on the formalization, the team will conduct a systematic evaluation on the robustness of LLMs against prompt injection with various LLMs and tasks. Furthermore, armed with insights derived from the systematic evaluation, our team will develop new methods to improve robustness of LLMs against prompt injection, eventually safeguarding hundreds of millions of users
University of Virginia: Tom Hartvigsen (PI)

Despite their capabilities, deployed Large Language Models (LLMs) can sometimes generate harmful content. This proposal introduces an evaluation framework aimed at continually debiasing LLMs, adjusting for changes in user needs and regulations. The framework focuses on dynamic benchmarks which reflect that biased behaviors and societal norms are recognized and alter over time. It employs realistic shifts broken down by demographic groups for assessment. Existing state-of-the-art models will be evaluated using the framework, applying debiasing techniques reliant on both black-box prompting and finetuning methods. This research builds on previous work including the ToxiGen project and the method GRACE. The outcomes will be concrete evaluations for language models and a range of publicly-available benchmarking tasks, including datasets, metrics, and debiasing strategies.
Morehouse College: Kinnis Gosha (PI)

With the advent of ChatGPT, modern companies (i.e. shareholders) are infatuated with the use of AI in order to maintain competitiveness, expand employee productivity, and increase profit margins. When it comes to analyzing worker performance, trustfulness and transparency in the AI system is critical. Instead of using AI solely to determine employee performance, human feedback can be integrated into algorithms in a way that reduces the presence of various biases that could plague an evaluation. The following proposal will outline the development of a hybrid framework for performance evaluation driven by artificial intelligence for college faculty. Findings from the study can be used in the development of best practices for hybrid artificial intelligence performance evaluation across multiple employment sectors.
University of Central Florida: Yogesh Singh Rawat (PI)

We have witnessed significant advancements in large multimodal foundation models like GPT-4V, LLaVA, CLIP, InstructBLIP, and Gemini, trained on extensive datasets. These models showcase remarkable proficiency in solving various tasks related to visual perception. Leveraging their strong generalization abilities, they demonstrate effectiveness in open-world scenarios, particularly in zero-shot settings for fundamental computer vision tasks. Despite these accomplishments, it remains uncertain whether these models possess conceptual or common-sense understanding inherent to human cognition. The absence of such capabilities raises concerns about the reliability of these models for real-world applications. This research will focus on exploring the conceptual understanding of geometric reasoning and motion perception within these models. The objective is to systematically assess their capability of basic geometric and motion properties in visual data, aiming to determine the extent of their suitability for practical deployment followed by improving this capability.
Carnegie Mellon University: Yiming Yang (PI)

This proposal presents a research plan to improve the functionality of large language models (LLMs) such as GPT-3.5 and GPT-4 by leveraging feedback mechanisms. The research team at Carnegie Mellon University will investigate techniques for making LLMs more responsive to feedback and train an augmented universal reward model to enhance the models’ judgements. Expected outcomes of this research include more adaptable LLMs that can perform better in real-world scenarios and more aligned LLMs that are guided by a powerful reward model
Princeton University: Danqi Chen (PI)

The proposal aims to enhance the capabilities of large language models (LLMs) by focusing on two key areas: efficient pre-training and evaluation of instruction-following models. The first project revolves around creating efficient foundation models through structured pruning. The goal is to derive smaller models from existing LLMs, saving computational resources. The second project aims to develop a robust, unbiased benchmark for assessing LLMs’ instruction-following abilities, hence better aligning them with human goals. The proposal suggests that access to resources, such as Azure compute, OpenAI APIs, and Azure-hosted LLaMA family, would greatly aid the projects.
George Mason University: Antonios Anastasopoulos (PI)

The proposal aims to investigate and quantify the multilingual capabilities of Large Language Models (LLMs), focusing on under-served languages and cultures. An in-depth, fine-grained evaluation is presented, encompassing dialectal evaluation, cultural relevance, and human-centric biases in LLM output. The research envisions a comprehensive evaluation framework that benefits everyone. Three research thrusts are described in the plan: fine-grained dialectal evaluation, cultural relevance assessment, and detailed exploration of downstream biases in LLMs in cross-cultural contexts. These will datasets. The proposal aims to produce new human-centric, culturally-relevant datasets expanding the current LLM evaluation and develop metrics to measure cultural relevance in LLM outputs.
University of Washington: Tanushree Mitra (PI)

The research proposal addresses the socio-cultural limitations of large language models (LLMs). The research will approach this problem through two thrusts. The first will develop methods to conduct systematic audits of LLMs across socio-cultural contexts, particularly in relation to the Global South, focusing on recruitment and loan review. The second thrust will tackle the challenge of determining cultural insensitivity in a quantifiable manner, by developing mixed-initiative systems that combine machine-assisted workflows with crowdsourced and expert contribution. The aim is to improve the capabilities of LLMs in understanding and reflecting socio-cultural factors in language use. An expected outcome is new tools and methods for socio-cultural audits of generative AI, as well as rich datasets and evaluation metrics that can inspire further research into socio-cultural components of generative AI technology.

Related paper:
- “They are uncultured”: Unveiling Covert Harms and Social Threats in LLM Generated Conversations (opens in new tab)
University of Illinois Urbana-Champaign: Dilek Hakkani Tur (PI)

The research proposal focuses on assessing the task completion abilities of large language models (LLMs) through multi-turn conversational interactions within a multi-agent framework. Three agents will be incorporated: a LLM, a user simulator, and an evaluator. The assessment will be based on the performance of the LLM in accomplishing complex tasks via interactions, responses, and negotiations with the user simulator, and the user goal accomplishment evaluated by the third agent. The project will use existing multi-domain interaction corpora like MultiWOZ and SPIDER for back-end resources and user goal formation. Agenda-based user simulators will be created to drive diverse user interactions. The LLM task completions will be prompted using templates in various task domains e.g., GPT-4, LLAMA 2. Additionally, a human evaluation will be carried on to compare with the assessments of the evaluation agent. The project plans to make open-source resources available on a public GitHub repository and write a paper detailing the methodologies, experimentation, and findings.
University of Toronto: Xujie Si (PI)

This research proposal aims to evaluate and amplify the reasoning ability of foundation models for program invariant inference, a key aspect of software verification. The team will initially assess standard foundation models on loop invariant and data invariant inference benchmarks, frequently utilized in the software verification research sector. Post-evaluation, the team plans to architect a foundation model-driven program invariant inference framework by integrating foundation models with static and dynamic program analyzers. This proposed research targets substituting the expertise-reliant and labor-intensive invariant labeling process with the guidance of foundation models, thereby propelling software verification applications. The research is motivated by the potential of foundation models to revolutionize the software development industry by not only creating but also verifying program correctness. The anticipated outcomes include a benchmark suite appropriate for foundation model evaluation and a methodology demonstrating how these models engage with classic dynamic and static program analyzers to generate program invariants.
Cornell University: Matthew Wilkens (PI)

This project seeks to evaluate foundation models such as GPT-4 on a complex task that requires specialized domain knowledge: identifying types of legal interpretation. A large dataset of legal interpretations will be used to explore the efficacy of these models. The project will involve comparing the performance of various foundation models in conjunction with prompts, including those excluding legal jargon and chain-of-thought prompts.
North Carolina State University: Dongkuan Xu (PI)

This research proposal outlines innovative methods for scalable and adaptable assessment of large language models (LLMs) trustworthiness using generative approaches. The team will first analyze current evaluation methods, highlighting their limitations in terms of high evaluation sample requirements, deep domain expertise necessities, and difficulties in generalizing evaluation results across varying domains and applications. To address these issues, the research will develop and refine generative methods for creating diverse and domain-specific evaluation benchmarks, and design a framework that employs these methods to adaptively assess LLMs’ trustworthiness in varied contexts. The study aims to significantly reduce the time and resources needed for comprehensive LLM evaluations, promoting more robust and reliable applications, notably in sensitive areas like healthcare and finance.
Stanford University: Tatsunori Hashimoto (PI)

The proposal focuses on studying and improving the alignment of Reinforcement Learning from Human Feedback (RLHF) methods using simulated human feedback with an emphasis on the usefulness of simulation-based feedback and aspects of robustness such as reward hacking and distribution shift
Stanford University: Chelsea Finn (PI)

The proposal centers on developing efficient distillation techniques for Foundation Models (FMs) to reduce their computational demands and broaden their applicability in scenarios with latency constraints. In this project, knowledge from large FMs will be transferred to smaller, more manageable models, making FMs more accessible.
University of Washington: Yulia Tsvetkov (PI)

This proposal will develop a robust evaluation framework of cultural biases in existing off-the-shelf language models, across languages, and new methods to augment LLMs with culturally- and socially-relevant information. To develop an evaluation framework, we will extend our prior work “From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models” to incorporate new theoretically-grounded tests of cultural biases, drawn from anthropology research. To adapt the models, we will experiment with new approaches to represent cultural and social norms and incorporate them into LLMs via prompting.
Lehigh University: Lichao Sun (PI)

This research proposal aims to construct a comprehensive evaluation ecosystem, TrustLLM, to monitor the trustworthiness of Large Language Models (LLMs) across eight dimensions – truthfulness, safety, fairness, robustness, privacy, machine ethics, transparency, and accountability. The proposal focuses on developing guidelines for evaluating LLM trustworthiness through literature review and subsequently, establishing a benchmark covering these aspects using various datasets. Extensive evaluations of LLMs will be conducted considering performance based on each trustworthiness dimension. The project anticipates outcomes including a detailed benchmark for LLM trustworthiness assessment, an open-source dataset and code, a regularly updated public leaderboard, and an automated evaluation platform.
Georgia Institute of Technology: Chao Zhang (PI)

The proposal focuses on enhancing the reliability of Large Language Models (LLMs) by calibrating their responses’ confidence and leveraging estimated uncertainty for better decision-making and efficient exploration of LLM agents. The researchers aim to utilize a tool-augmented multi-agent debate mechanism to calibrate LLMs’ confidence and to harness uncertainty estimates for improving LLM agents’ planning efficiency.
Carnegie Mellon University: Bhiksha Raj (PI)

The proposal intends to evaluate and enhance the generalization capabilities of multimodal foundation models (like CLIP, Stable Diffusion, GPT-4, and Llama2) through an in-depth examination of their pre-training data. The research aims to explore the effects of data quality, especially the effects of noise and corruption on a model’s generalization. The proposed research involves creating controlled, synthetic datasets with varying levels of corruption/noise for training and analyzing training dynamics and transfer learning capabilities. The study extends beyond traditional performance metrics to calibration, alignment, hallucination tendencies, and failure cases, aiming to provide a comprehensive understanding of pre-training data’s impact on model robustness. The expected outcome includes new pre-training strategies for improved generalization of foundations models across applications.
University of Michigan, Ann Arbor: Rada Mihalcea (PI)

Large language models (LLMs) have been demonstrated to lead to impressive performance and have been adopted for numerous NLP tasks. However, many of the models currently available overly represent certain cultures, at the cost of under-representing others. This can result in “cultural bias” in these models, as they can lack familiarity with certain cultural groups. This limitation is especially problematic when it comes to cultural commonsense — the practical knowledge that is commonly shared among most people in a group. In this project, we plan to plan to pursue the following two main goals:
University of Illinois Urbana-Champaign: Varun Chandrasekaran (PI)

As LLMs are becoming more capable, they can be increasingly used for tasks humans are currently deployed for. One such task is crowdsourcing collective responses. In the future, when LLMs are to be trained and current data runs out, one avenue for data creation and curation stems from the LLM itself. But to perform this task, several challenges need to be overcome. Understanding bias: Since the LLM is trained using data from the public internet, the “persona” it possesses is unclear. We wish to better understand this persona, as it dictates various biases exhibited by the LLM. For example, work by Santurkar et al.~\cite{santurkar2023opinions (opens in new tab)} describes how aligned LLMs are currently left-leaning when quizzed about political preferences; their paper states “In fact, models such as text-davinci-003 fail to model the subtleties of human opinions entirely – they tend to just express the dominant viewpoint of certain groups“), and Tjuatja et al.~\cite{tjuatja2023llms (opens in new tab)} highlights the biased, but misaligned with human responses, of LLMs in a different domain. However, estimating such forms of bias requires an involved methodology which is very task-specific.