GenoTEX: A Benchmark for Evaluating LLM-Based Exploration of Gene Expression Data in Alignment with Bioinformaticians
Abstract
Recent advancements in machine learning have significantly improved the identification of disease-associated genes from gene expression datasets. However, these processes often require extensive expertise and manual effort, limiting their scalability. Large Language Model (LLM)-based agents have shown promise in automating these tasks due to their increasing problem-solving abilities. To support the evaluation and development of such methods, we introduce GenoTEX, a benchmark dataset for the automatic exploration of gene expression data, involving the tasks of dataset selection, preprocessing, and statistical analysis. GenoTEX provides annotated code and results for solving a wide range of gene identification problems, in a full analysis pipeline that follows the standard of computational genomics. These annotations are curated by human bioinformaticians who carefully analyze the datasets to ensure accuracy and reliability. To provide baselines for these tasks, we present GenoAgent, a team of LLM-based agents designed with context-aware planning, iterative correction, and domain expert consultation to collaboratively explore gene datasets. Our experiments with GenoAgent demonstrate the potential of LLM-based approaches in genomics data analysis, while error analysis highlights the challenges and areas for future improvement. We propose GenoTEX as a promising resource for benchmarking and enhancing AI-driven methods for genomics data analysis. We make our benchmark publicly available at https://github.com/Liu-Hy/GenoTex.
1 Introduction
In biomedical research, gene analysis is crucial for understanding biological mechanisms and advancing clinical applications such as disease marker identification and personalized medicine. Advances in next-generation sequencing and other technologies have led to a surge in the volume of transcriptomic data. Genomics research is expected to produce between 2 and 40 exabytes of data in the next decade [26], greatly facilitating research and discoveries in genomics.
Despite the scientific value of gene data analysis, these tasks are often repetitive, labor-intensive, and prone to errors [5]. The rapid increase in transcriptomic data and potentially inefficient workflows lead to considerable financial burden [27]. The genetics research industry incurs an annual expense of around $848.3 million on manual data analysis tasks [44], with costs expected to increase at a compound annual growth rate (CAGR) of 12% [44] to 16% [45] by 2030. Bioinformaticians spend significant effort on these repetitive tasks, valued at around $29 per hour [41]. This high volume of routine tasks greatly impacts job satisfaction among bioinformatics professionals, as surveys show that data scientists, including bioinformaticians, prefer engaging in advanced analytical tasks rather than routine data processing. Currently, up to 45% of their work hours are spent on tasks that could be automated [65]. These financial and workforce challenges highlight the urgent need for more efficient and cost-effective data analysis solutions in genetics research [3].
Meanwhile, the increasing abilities of Large Language Models (LLMs) [39] have enabled methods for automating certain data analysis tasks [33, 2], and relevant benchmarks have been proposed [50, 15]. However, these studies have mostly focused on simplified synthetic datasets, or on specific steps in the analysis pipeline such as missing data imputation or hyper-parameter tuning. In contrast, the analysis of real-world gene expression data involves complex domain-specific procedures, and inherently requires the flexible planning, troubleshooting, and domain knowledge inference typically performed by a human bioinformatician, posing higher demands on automatic methods.
To facilitate the evaluation and development of such methods, we propose the Genomics Data Automatic Exploration Benchmark (GenoTEX), a benchmark dataset for the automated analysis of gene expression datasets to identify disease-associated genes while considering the influence of other biological factors. Following the standards of computational genomics and based on the common practices of skilled human bioinformaticians, we unified the process of analyzing various gene expression datasets for solving different gene identification problems into a standardized pipeline with detailed procedures, documented in a guidelines file (Appendix B). We then trained and organized a group of bioinformaticians to manually perform the data analysis according to these guidelines, creating a benchmark dataset consisting of input data, annotated code, and intermediate and final analysis results. Based on this benchmark, we propose three tasks, namely, dataset selection, data preprocessing, and statistical analysis, with corresponding metrics for evaluating different aspects of the automatic exploration of gene expression data.
Furthermore, to provide baselines for these tasks, we propose GenoAgent, a team of LLM-based agents that simulate the behavior of bioinformaticians in gene data analysis. To tackle the challenges in gene data exploration, GenoAgent employs a collaborative workflow featuring context-aware planning, iterative correction, and domain expert consultation. The agents are instructed with the detailed guidelines to perform the full pipeline of data analysis for solving gene identification problems.
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x1.png)
Our evaluation suggests that GenoAgent is able to automate the process of gene expression data analysis with good overall accuracy, affirming the promise of integrating LLMs into genomics research.
In summary, our contributions are as follows:
- We propose a benchmark dataset, GenoTEX, that presents the analysis pipeline for a rich set of gene identification problems, with documented code and output. We believe it will serve as a useful resource for the evaluation and development of advanced methods for automatic gene expression data analysis.
- We define three challenging tasks: dataset selection, data preprocessing, and statistical analysis, along with corresponding metrics, to support more systematic evaluation.
- We propose a baseline method, GenoAgent, a team of LLM-based agents to collaboratively explore gene expression datasets. Our evaluation demonstrates the promise of LLM-based approaches in genomics data analysis, and error analysis reveals areas for future improvement.
2 Related work
LLMs for collaborative problem-solving
Large Language Models (LLMs) have shown the potential to achieve human-level intelligence [59, 38, 55, 56]. Research has tried to enhance their problem-solving abilities through techniques such as goal decomposition [64, 75, 16, 37], tree and graph structures [71, 23, 4], consistency [60], self-refinement [67, 34, 62, 10], and the use of external tools [32, 74, 43].
The collaboration of multiple agents can further enhance problem-solving capacities [63, 52, 14, 58], often through role-playing with distinct expertise [69, 13]. MetaGPT [24] promotes collaboration among various agent roles, and studies have shown the effectiveness of role-playing in software development [42, 13]. Other works explore sociological phenomena [48, 51, 76, 63, 31], such as virtual towns for interactions among AI agents [40]. Recent research emphasizes task management and feedback for performance improvement [25, 68, 20, 72], with task management shown to enhance multi-agent systems [52, 69].
LLMs for scientific discovery
Researchers have also been incorporating LLMs into scientific discovery in fields such as chemistry [6, 21], biotechnology [35], and medicine [49, 70] by training or fine-tuning LLMs on domain-specific data. In contrast to these works, we leverage current state-of-the-art LLMs without additional training. We employ structured prompting and communication strategies to equip LLM-based agents with the planning, analysis, and coding abilities required for scientific exploration.
To tackle the challenging tasks in our benchmark, we propose a baseline method that employs a team of LLM-based agents, each contributing their own expertise, to collaboratively conduct gene expression data analysis.
3 Benchmark
This section describes our GenoTEX benchmark. Specifically, we introduce our proposed standardized pipeline for gene expression data analysis, the process for creating and ensuring the quality of the benchmark, and the tasks and metrics defined for evaluation.
3.1 Standardized pipeline of gene expression data analysis
Our study aims to automate the gene expression data analysis process to address a class of important problems: What are the significant genes associated with a specific trait, given the influence of some condition? Here, a “trait” refers to a characteristic such as a disease (e.g., diabetes), and a “condition” refers to a factor like age, gender, or a co-existing trait (e.g., hypertension). This problem is scientifically important because the key genes linked to traits often vary based on the diverse physical conditions of patients. By incorporating these factors into our analysis, we aim to gain a more comprehensive understanding of the genetic underpinnings of these traits.
Evaluating the automatic exploration of this kind of problem is inherently difficult. The combination of different traits and conditions leads to a multitude of gene identification scenarios, many of which remain understudied in biomedical literature. This absence of a clear “ground truth” complicates the evaluation of our analysis results. Moreover, while data-driven approaches provide valuable insights, they must ultimately be combined with interventional biological experiments or clinical trials to confirm the significance of identified genes. Defining the exact insights that should be extracted from our data analysis is therefore complex.
Given these challenges, instead of creating a benchmark to test whether the automated discoveries can align with a “ground truth” that can only be discovered through interventional methods, we designed the benchmark to evaluate how well the automatic analysis process and results align with those of a skilled bioinformatician following standard procedures.
Thus, to enhance the reliability of our benchmark, we have developed a standardized analysis pipeline. This pipeline mirrors the steps a skilled bioinformatician would follow, enabling systematic evaluation of the automated methods against established human expertise. By adhering to this standardized approach, we aim to facilitate not only the evaluation of our method but also the future development of more advanced methods. In the following subsection, we introduce this pipeline in detail and provide the necessary background knowledge to understand its significance and application in our research.
3.1.1 Data preprocessing
The preprocessing of gene expression data involves a comprehensive pipeline with several main steps such as dataset filtering and selection, gene data preprocessing, trait data extraction, and data linking. Below we introduce the preprocessing steps for gene expression data within our pipeline.
Dataset filtering and selection
In our paper, unless otherwise specified, a “dataset” refers to a cohort dataset, which is the overall collection of samples and their associated genetic and clinical information from a biomedical study. The selection of datasets involves three steps:
1. Initial filtering. At the beginning of analysis, we start with a list of potentially useful datasets, and determine the relevance of each dataset to the problem by reading the metadata. This involves verifying the availability of gene expression data (as opposed to miRNA or methylation data) and the traits of interest.
2. Quality verification. In case there are abnormalities in the dataset that were not handled successfully in the preprocessing step, we discard the dataset to ensure quality.
3. Dataset selection. As gene expression data are often high-dimensional and scarce, the analysis can be bottlenecked by sample size. Therefore, if multiple preprocessed datasets are available for statistical analysis about a trait, we select the one with the largest sample size. If the analysis requires integrating datasets about two traits, we sort the possible pairs of datasets for both traits by the product of their sample sizes, and select the pair with the largest product.
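The selection rule in the last step can be illustrated with a minimal sketch. The cohort identifiers and sample sizes below are hypothetical, and ties are resolved by whichever candidate is encountered first, which is an assumption rather than part of the guidelines.

```python
from itertools import product


def select_single(datasets):
    """Return the ID of the dataset with the largest sample size."""
    return max(datasets, key=datasets.get)


def select_pair(trait_datasets, condition_datasets):
    """Return the (trait, condition) dataset pair whose sample sizes have the largest product."""
    pairs = product(trait_datasets, condition_datasets)
    return max(pairs, key=lambda p: trait_datasets[p[0]] * condition_datasets[p[1]])


# Hypothetical cohort IDs mapped to their sample sizes after preprocessing.
trait_cohorts = {"cohort_A": 120, "cohort_B": 310, "cohort_C": 95}
condition_cohorts = {"cohort_D": 60, "cohort_E": 210}

best_single = select_single(trait_cohorts)
best_pair = select_pair(trait_cohorts, condition_cohorts)
print(best_single, best_pair)
```

Taking the maximum is equivalent to sorting the candidates and keeping the first one, and avoids materializing the sorted list.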
Gene data preprocessing
In this step, we prepare a data table where each attribute represents the expression level of a specific gene within a sample. The preprocessing steps vary depending on the measurement technique. For microarray data, we start with raw datasets identified by probe IDs, which are DNA sequences complementary to the target RNA sequences used to measure gene expression. For RNA-seq data, we handle sequence reads that require alignment to a reference genome. In both cases, we map the initial identifiers to gene symbols using platform-specific gene annotation data. We then normalize and deduplicate these gene symbols by querying gene databases via APIs, to prevent potential inaccuracies due to different gene naming conventions. This process requires flexible planning and proficient use of bioinformatics tools to ensure accuracy and consistency.
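As an illustration of the mapping and deduplication described above, the following sketch collapses probe-level measurements into gene-level measurements. The probe IDs, the alias table (a stand-in for the gene database query), and the choice to average probes that map to the same gene are all assumptions made for this example, not a description of the benchmark code.

```python
from collections import defaultdict
from statistics import mean


def map_probes_to_genes(expression, probe_to_symbol, aliases):
    """Collapse probe-level rows into gene-level rows.

    expression: {probe_id: [value per sample]}
    probe_to_symbol: {probe_id: gene symbol}; probes without a symbol are dropped.
    aliases: {alternative symbol: normalized symbol}, standing in for a database lookup.
    """
    grouped = defaultdict(list)
    for probe_id, values in expression.items():
        symbol = probe_to_symbol.get(probe_id)
        if not symbol:
            continue
        symbol = symbol.strip()
        grouped[aliases.get(symbol, symbol)].append(values)
    # Average, sample by sample, all probes that ended up under the same gene symbol.
    return {gene: [mean(col) for col in zip(*rows)] for gene, rows in grouped.items()}


# Hypothetical toy input: four probes measured on two samples.
expression = {
    "probe_1": [2.0, 4.0],
    "probe_2": [4.0, 6.0],
    "probe_3": [1.0, 1.5],
    "probe_4": [9.0, 9.0],  # no annotation available, so it is dropped
}
probe_to_symbol = {"probe_1": "TP53", "probe_2": "p53", "probe_3": "BRCA1"}
aliases = {"p53": "TP53"}

gene_table = map_probes_to_genes(expression, probe_to_symbol, aliases)
print(gene_table)
```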
Trait data extraction
The clinical information of samples is recorded in certain rows or columns of the raw data table with indefinite attribute names specific to each dataset. In this step, we identify the attributes containing the trait or condition information of interest, design conversion rules, and write functions to encode the attributes into binary, ordinal, or categorical variables. Often this information is indirectly given, requiring us to infer it based on domain knowledge about acronyms or jargon related to the trait, combined with an understanding of the data measurement and collection process described in the metadata. Some examples of this step are shown in Appendix C.
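A conversion rule of this kind might look like the sketch below. It assumes clinical cells written as "key: value" strings; the specific vocabulary (for example "t2d" or "control") and the 0/1 encoding of gender are hypothetical choices for illustration, since the actual rules depend on each dataset's metadata.

```python
def value_after_colon(cell):
    """Return the lower-cased text after the first ':' in a 'key: value' cell, or None."""
    if cell is None or ":" not in cell:
        return None
    return cell.split(":", 1)[1].strip().lower()


def convert_trait(cell):
    """Encode a hypothetical diabetes trait as 1 (case), 0 (control), or None (unknown)."""
    value = value_after_colon(cell)
    if value in {"t2d", "type 2 diabetes", "diabetic"}:
        return 1
    if value in {"control", "healthy", "non-diabetic"}:
        return 0
    return None


def convert_age(cell):
    """Encode age as a float, or None when it is missing or not numeric."""
    try:
        return float(value_after_colon(cell))
    except (TypeError, ValueError):
        return None


def convert_gender(cell):
    """Encode gender as 0 (female) or 1 (male); the coding itself is an arbitrary choice."""
    return {"female": 0, "f": 0, "male": 1, "m": 1}.get(value_after_colon(cell))


cells = ["disease state: T2D", "disease state: Control", "tissue: liver"]
print([convert_trait(c) for c in cells], convert_age("age: 54"), convert_gender("Sex: F"))
```

Returning `None` for unrecognized values keeps unknown entries explicit, so that they can be handled later as missing data instead of being silently assigned to a class.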
Data linking
In this step, we merge the preprocessed gene data with the extracted trait data based on the sample IDs. This integration creates a data table containing both genetic and clinical features for the same samples, ready for association studies to identify significant genes.
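The linking step amounts to an inner join on sample IDs. The sketch below assumes both tables are stored as dictionaries keyed by sample ID; the sample IDs and feature names are hypothetical.

```python
def link_data(gene_data, trait_data):
    """Inner-join two {sample_id: {feature: value}} tables on their sample IDs."""
    shared = sorted(set(gene_data) & set(trait_data))
    # Clinical features come first, followed by the gene features of the same sample.
    return {sid: {**trait_data[sid], **gene_data[sid]} for sid in shared}


# Hypothetical toy tables; "sample_3" and "sample_4" each appear in only one of them.
gene_data = {
    "sample_1": {"TP53": 3.0, "BRCA1": 1.0},
    "sample_2": {"TP53": 5.0, "BRCA1": 1.5},
    "sample_3": {"TP53": 4.2, "BRCA1": 0.9},
}
trait_data = {
    "sample_1": {"trait": 1, "age": 54.0},
    "sample_2": {"trait": 0, "age": 61.0},
    "sample_4": {"trait": 1, "age": 47.0},
}

linked = link_data(gene_data, trait_data)
print(linked)
```

Only samples present in both tables survive the join, which is why the choice of samples kept here directly affects the sample-level agreement with a reference analysis.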
The preprocessing also involves common operations such as missing value imputation and column matching, some of which are substeps of the main steps. Please refer to our guidelines file in Appendix B for more details. Fig. 2 shows the pipeline of preprocessing a series dataset from the GEO database.
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x2.png)
3.1.2 Statistical analysis
After preprocessing, one can perform basic regression analysis to identify the genes that are predictive of the disease (or trait) [18, 66]. Lasso [53] is often chosen as the model due to its ability to identify a sparse set of genes. In addition to directly using a regression model, some other steps are often taken.
Confounding factor correction
To ensure reliable identification of genes, the pipeline often involves steps to correct potential confounding factors [30, 8]. One type of confounding factor arises when the distribution of gene expressions varies across subgroups within the data due to different background distributions rather than the disease itself [73]. This variation can introduce significant bias, leading to incorrect conclusions where the association between certain genes and the disease might be mistakenly attributed to differences in gene expression distributions across groups, rather than a true link to the disease [57].
Incorporating conditions in regression
Additionally, one can include additional covariates in the regression model, such as patient demographics and co-occurrence of other diseases [29]. Including these conditions allows for identifying gene expression patterns that are not only associated with the disease status but also modulated by these conditions. This nuanced analysis supports the development of more personalized treatment strategies by identifying how different conditions affect gene-disease relationships [46]. This practice is encouraged due to the need for “precision medicine” [22, 9].
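To make the regression step of this subsection concrete, the following is a minimal sketch of Lasso fitted by cyclic coordinate descent, with a condition entering the model as one more column next to the gene features. It is a toy illustration under stated assumptions: the features and the trait are already centred (so no intercept is fitted), the condition is penalized in the same way as the genes, the data are hypothetical, and no confounding correction is performed. It is not the benchmark's analysis code.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1 / 2n) * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent.

    X is a list of sample rows (already centred, no intercept); y is the centred trait.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_scale = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_scale[j] == 0.0:
                continue  # a constant column carries no signal
            # Correlation of column j with the residual that excludes column j's own fit.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_wj = soft_threshold(rho, lam) / col_scale[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_wj
    return w


# Hypothetical toy data: columns are gene_1, gene_2 and a centred "age" condition.
feature_names = ["gene_1", "gene_2", "age"]
X = [
    [-1.5, 0.1, -0.5],
    [-0.5, -0.1, 1.5],
    [0.5, -0.1, -1.5],
    [1.5, 0.1, 0.5],
]
y = [-3.5, 0.5, -0.5, 3.5]  # constructed as 2 * gene_1 + 1 * age

weights = lasso_coordinate_descent(X, y, lam=0.1)
selected = [name for name, w in zip(feature_names, weights) if w != 0.0]
print(dict(zip(feature_names, weights)), selected)
```

Features whose coefficient is driven exactly to zero by the L1 penalty are dropped, which is what makes the model suitable for reporting a sparse set of genes.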
3.2 Benchmark creation
This subsection describes our process of building the benchmark, including the design of gene identification problems, downloading data from open gene expression databases, the collection of manual analysis data, and quality control and assessment.
Gene identification problem design
To ensure the scientific relevance of our benchmark, we began by curating a list of human traits that are either important to public health or interesting to genomics research. A computational biologist compiled this list, resulting in 82 traits spanning 9 main categories such as cardiovascular diseases and neurological disorders. This yields 82 problems in the form: What are the significant genes related to the trait? (hereafter referred to as "unconditional gene identification").
Next, each trait was paired with a condition, which could be another trait from the list or demographic attributes like age or gender, generating 6806 possible trait-condition pairs. We selected some of the more scientifically important pairs to frame problems in the form: What are the significant genes related to the trait when considering the influence of the condition? (hereafter referred to as "conditional gene identification").
To choose these pairs, we first applied manually designed criteria about which pairs must or must not be chosen based on the main categories or grouping of traits (Appendix D). For each undecided pair, we measured its trait-condition association by calculating the Jaccard similarity between the sets of genes related to the trait and the condition retrieved from the NCBI Gene database [7]. Pairs with a Jaccard similarity greater than 0.1 were chosen, as these pairs are more likely to share underlying genetic mechanisms, making them particularly valuable for understanding the complex interactions between traits and conditions in our gene identification analysis. This selection process resulted in 1064 pairs of significant scientific interest. Together with the 82 unconditional gene identification problems, this collectively forms the problem set of our benchmark.
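The Jaccard-based filter for undecided pairs can be sketched as follows. The trait names and gene sets are placeholders invented for the example and do not correspond to real NCBI Gene associations. The count of 6806 possible pairs is consistent with each of the 82 traits being paired with the 81 other traits plus age and gender.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|, defined as 0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_pairs(undecided_pairs, related_genes, threshold=0.1):
    """Keep the trait-condition pairs whose related gene sets overlap above the threshold."""
    return [
        (trait, condition)
        for trait, condition in undecided_pairs
        if jaccard(related_genes[trait], related_genes[condition]) > threshold
    ]


# 82 traits, each paired with the 81 other traits plus age and gender.
possible_pairs = 82 * (81 + 2)

# Placeholder gene sets (hypothetical).
related_genes = {
    "trait_X": {"G1", "G2", "G3", "G4"},
    "trait_Y": {"G3", "G4", "G5"},
    "trait_Z": {"G1", "G6", "G7", "G8", "G9", "G10", "G11", "G12", "G13"},
}
undecided = [("trait_X", "trait_Y"), ("trait_X", "trait_Z")]

chosen = select_pairs(undecided, related_genes)
print(possible_pairs, chosen)
```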
Gene Identification Problems | |
---|---|
Total problems | 1146 |
Unconditional problems | 82 |
Conditional problems | 1064 |
Input Dataset | |
Total size | 32.22 GB |
Datasets | 795 |
Samples per dataset | 167 ± 121 |
Total samples | 132,673 |
Manual Analysis and Results | |
Relevant datasets | 181 |
Datasets successfully preprocessed | 163 |
Lines of code for analyzing per dataset | 90 ± 32 |
Total lines of code for analysis | 71,669 |
Normalized gene features per dataset | 14174 ± 5851 |
Significant genes identified per problem | 42 ± 65 |
Input Dataset
To address the formulated research problems, we downloaded cohort datasets containing gene expression and corresponding clinical data from public databases: (1) The Gene Expression Omnibus (GEO) [11], the largest gene expression database currently available; and (2) The Cancer Genome Atlas (TCGA) [54], the largest gene expression database focused on cancer. The TCGA data were acquired via the UCSC Xena platform [19]. Additionally, domain knowledge regarding gene symbols associated with traits was sourced from the NCBI Gene database [7]. For more detailed information about these data sources, please refer to Appendix E.
Manual analysis
Four researchers in our team curated the problem list and extracted relevant input data from public sources. During the pilot stage, a computational biologist collaborated with a doctoral student to develop a guidelines file for the standardized pipeline and example code for solving problems related to two traits. This initial work was iteratively refined based on their experience with manual analysis on a subset of 200 problems.
In the subsequent phase of manual curation, nine bioinformaticians developed the gold standard for analyzing the input data for all problems in our benchmark. This involved writing code for data preprocessing and regression analysis, and compiling the analysis results. Equipped with detailed instructions in the guidelines file and example analysis code, these researchers crafted the gold standard over 12 weeks. The data for each trait were independently analyzed by two researchers, with an experienced researcher adjudicating the annotation by selecting the better analysis and making further refinements.
To evaluate the consistency of annotations, we measured the Inter-Annotator Agreement (IAA) between the two annotation versions. The results indicate high annotation quality, with an F1 score of 94.73% for the task of dataset filtering. We also used IAA as a baseline for human performance in gene data analysis, with additional results presented in Section 5.
3.3 Tasks and metrics
Dataset selection and filtering
We evaluate the performance of Dataset Filtering and Dataset Selection separately. The former is a binary classification task, for which we use F1 as the primary metric. For the latter, we use accuracy to measure the percentage of problems for which the method chooses the same dataset (or pair of datasets) as the bioinformaticians did in our benchmark.
Preprocessing
Due to the complexity of gene expression data preprocessing, both the attributes and samples of the resulting preprocessed data depend largely on the decisions made during this process. To evaluate the performance of different methods, we adopted the following metrics: (i) Attribute Jaccard (AJ) is the Jaccard similarity between the sets of attributes of two datasets. It evaluates how well the method extracts attributes from the dataset by encoding clinical features and normalizing gene symbols. (ii) Sample Jaccard (SJ) is the Jaccard similarity between the sets of sample IDs of two datasets. It measures how well the method integrates features of the same samples and handles missing values. Based on these metrics, we define (iii) Composite Similarity Correlation (CSC) as the product of the Attribute Jaccard, the Sample Jaccard, and the Pearson correlation of the common feature vectors (common rows and columns) between the datasets. This metric captures both the structural and content similarity of the resulting datasets, so we use it as the primary metric for evaluating preprocessing alignment.
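The three preprocessing metrics can be computed as in the sketch below. Two points are assumptions of this example rather than details stated above: each table is rectangular (every sample has every attribute), and the Pearson correlation is taken over all common cells flattened into a single vector. The toy tables are hypothetical.

```python
from math import sqrt


def jaccard(a, b):
    """Jaccard similarity between two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def preprocessing_metrics(pred, ref):
    """Compute AJ, SJ and CSC for two {sample_id: {attribute: value}} tables."""
    pred_attrs = {a for row in pred.values() for a in row}
    ref_attrs = {a for row in ref.values() for a in row}
    aj = jaccard(pred_attrs, ref_attrs)
    sj = jaccard(pred.keys(), ref.keys())
    rows = sorted(set(pred) & set(ref))
    cols = sorted(pred_attrs & ref_attrs)
    xs = [pred[r][c] for r in rows for c in cols]
    ys = [ref[r][c] for r in rows for c in cols]
    corr = pearson(xs, ys) if len(xs) > 1 else 0.0
    return {"AJ": aj, "SJ": sj, "CSC": aj * sj * corr}


# Hypothetical toy tables: the prediction misses one sample and one attribute.
ref = {
    "s1": {"g1": 1.0, "g2": 2.0, "trait": 1.0},
    "s2": {"g1": 3.0, "g2": 5.0, "trait": 0.0},
    "s3": {"g1": 2.0, "g2": 4.0, "trait": 1.0},
}
pred = {
    "s1": {"g1": 1.0, "g2": 2.0},
    "s2": {"g1": 3.0, "g2": 5.0},
}

scores = preprocessing_metrics(pred, ref)
print(scores)
```

Because CSC is a product, a method is penalized for a mismatch in either the attributes, the samples, or the values of the cells it shares with the reference.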
Statistical analysis
The goal of statistical analysis is to identify significant genes related to traits. To evaluate this process, we adopt multiple metrics. The Jaccard index evaluates the similarity between the set of genes identified by a method and the gold standard. We also treat gene identification as a binary classification problem of predicting whether a gene is related to the trait, and use precision, recall, and F1 to measure the performance.
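For a single problem, these metrics reduce to set operations on the predicted and gold-standard gene lists, as in the following sketch. The gene names are placeholders, and how per-problem scores are aggregated over the benchmark is not specified here.

```python
def gene_set_metrics(predicted, reference):
    """Precision, recall, F1 and Jaccard index between two sets of gene symbols."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    union = predicted | reference
    jaccard = true_positives / len(union) if union else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "jaccard": jaccard}


# Placeholder gene names (hypothetical).
predicted_genes = {"G1", "G2", "G3", "G4"}
reference_genes = {"G2", "G3", "G5"}

metrics = gene_set_metrics(predicted_genes, reference_genes)
print(metrics)
```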
4 Method
Recent studies have attempted to leverage LLM-based agents to tackle challenging problems [25, 72], including a range of data analysis tasks [33, 2]. While these methods each have their own novelties and strengths, our preliminary experiments reveal that none of them can generate functional code that runs data analysis on our benchmark. This is not surprising given the full complexity of the analysis required for solving real-world gene data analysis problems; a more tailored approach is probably needed. This section describes our method for exploring and setting up a baseline for this task.
4.1 Motivation and role design
When a human expert writes programs for gene expression data analysis, they exhibit the following abilities: (i) Context-aware planning. They complete a task step by step, planning the next action based on the overall goal and the results of previous steps; (ii) Tool utilization. They select and use library functions to assist with data preprocessing and statistical analysis; (iii) Domain knowledge inference. They observe the metadata of the dataset and intermediate processing results, using domain knowledge to infer the desired information from the data and using these observations to check whether their code works as expected; (iv) Error correction. They analyze the errors in program execution and correct them.
We believe that integrating these components is essential for enabling agent systems to effectively tackle the complex task of gene expression data analysis. Thus, to provide reasonable baselines for our benchmark, and inspired by the workflow of human bioinformaticians in gene data analysis, we propose GenoAgent, a team of LLM-based agents, each playing a different role in a genomic data science team and contributing its own expertise to the problem. A Project Manager coordinates the analysis process for solving each gene identification problem, assigning tasks to agents with the standardized pipeline in our benchmark as instructions. Two programming agents, the Data Engineer and the Statistician, focus on the data preprocessing and statistical analysis tasks, respectively. To enable context-aware planning, the agents maintain a task context recording the text instruction, code, and execution output for each of the previous steps. Before taking a step, an agent observes the current task context and then decides whether to perform or skip the next step, or to revert to a previous step if necessary. If it chooses to write code to perform a step, it can read the source code of function tools in a library file and use them as needed. A Code Reviewer agent helps the programming agents debug their code and verifies that the code follows the instructions. A Domain Expert agent provides professional knowledge consultation to the programming agents when required for data processing.
4.2 Collaboration among LLM agents
This subsection introduces the two main patterns of collaboration between agents.
Code review and iterative debugging
This process involves the interaction between the Code Reviewer and a programming agent (Statistician or Data Engineer). If the execution fails, the Code Reviewer evaluates the code based on its execution result, error-free status, and compliance with the given instructions. It then decides either to approve the code or to reject it with detailed feedback for revision and improvement, as shown in Figure 4 in the Appendix. Based on the feedback, the programming agent iteratively refines the code, extending the context with new versions until the code is approved or the maximum number of debugging rounds is reached. This mechanism facilitates troubleshooting and also improves adherence to task instructions.
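The control flow of this review loop can be sketched as follows. Everything here is a simplified stand-in: the `write_code` and `review` functions are hypothetical stubs in place of LLM calls, the reviewer is consulted after every execution for simplicity, and the default of three rounds is an arbitrary choice for the example.

```python
import contextlib
import io
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attempt:
    code: str
    output: str
    approved: bool
    feedback: str


@dataclass
class StepContext:
    instruction: str
    attempts: List[Attempt] = field(default_factory=list)


def execute(code):
    """Run the code and return (succeeded, captured stdout or error message)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, buffer.getvalue()


def review_loop(context, write_code, review, max_rounds=3):
    """Alternate between writing and reviewing code until approval or the round limit."""
    feedback = ""
    for _ in range(max_rounds):
        code = write_code(context, feedback)
        succeeded, output = execute(code)
        approved, feedback = review(context.instruction, code, succeeded, output)
        # Every version is appended, so later rounds can see the full history.
        context.attempts.append(Attempt(code, output, approved, feedback))
        if approved:
            return True
    return False


def stub_write_code(context, feedback):
    """Stub programmer: the first draft has a bug, the revision fixes it."""
    return "print(total)" if not feedback else "total = 1 + 1\nprint(total)"


def stub_review(instruction, code, succeeded, output):
    """Stub reviewer: approve any code that ran without an error."""
    if succeeded:
        return True, ""
    return False, f"Execution failed with {output}; please fix and resubmit."


context = StepContext(instruction="Print the sum of 1 and 1.")
approved = review_loop(context, stub_write_code, stub_review)
print(approved, len(context.attempts))
```

Keeping rejected attempts in the context, instead of overwriting them, mirrors the idea of extending the context with each new version of the code.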
Domain-guided programming
The second collaboration pattern involves a Data Engineer consulting a Domain Expert for data preprocessing tasks that require specialized knowledge. The Data Engineer sends questions to the Domain Expert, providing the necessary context such as metadata, summary information about a dataset, or other intermediate results in data processing. The Domain Expert then provides answers in the form of executable code. This type of programming also undergoes a debugging process, but the execution results are sent back to the same Domain Expert. Some questions are complex enough that the Domain Expert may not provide the correct answer immediately, necessitating further refinement based on the execution results.
5 Experiment
This section describes our experiments to evaluate GenoAgent and other baseline methods on the GenoTEX benchmark. We conducted an end-to-end evaluation where methods process raw input data to complete the full analysis for solving gene identification problems. Additionally, we assessed the performance of each task individually to gain a deeper understanding of their strengths and weaknesses. The tasks and metrics used are defined in Section 3.3. All experiments were conducted on a RunPod cluster [47] with two 16-core CPUs and 62 GB RAM. GenoAgent utilizes GPT-4o [39] models accessed via the OpenAI API.
5.1 Results
End-to-end performance
We evaluated the end-to-end data analysis capabilities of GenoAgent and baseline methods by measuring their performance in gene identification from raw input data. The results in Table 3 show that GenoAgent achieved an F1 score of 51.19%. While this is promising given the task difficulty, there is still a significant gap compared to human inter-annotator agreement scores, indicating substantial room for improvement. Ablation results demonstrated the importance of the collaborative approach involving the Code Reviewer and Domain Expert agents, as well as the number of review rounds. Additionally, we included a simple baseline where GPT-4o was directly asked to name the significant genes for each problem, resulting in low performance (2.4% F1), which highlights the difficulty of this task. For completeness, we also reported the trait prediction accuracy of the agents’ models, reflecting the validity of the data and models they used.
Methods | DF (%) | DS (%) |
---|---|---|
GenoAgent (Ours) | 87.32 | 80.25 |
GenoAgent (Rounds=1) | 85.29 | 76.04 |
GenoAgent (No Reviewer) | 82.13 | 69.57 |
GenoAgent (No Domain Expert) | 84.28 | 78.63 |
Inter-Annotator Agreement | 94.73 | 90.26 |
Dataset filtering and selection
The performance of dataset filtering and selection is shown in Table 2. The agents show decent performance, likely because determining dataset relevance based on metadata often does not require complex inference. However, errors in this step can propagate to subsequent steps, impacting overall performance.
Dataset preprocessing
We evaluated the preprocessing performance of GenoAgent by comparing its output with that of human bioinformaticians in our benchmark. The results are presented in Table 4. GenoAgent generally performed well in preprocessing gene expression and merged data, achieving high CSC scores (80.63% for genes). However, preprocessing of trait data was significantly weaker, with a CSC score of 32.28%, due to the complexity of clinical data extraction and the need for nuanced knowledge inference.
Methods | Benchmark Prec.(%) | Benchmark Rec.(%) | Benchmark F1(%) | Benchmark Jac.(%) | Trait Pred. Acc.(%) | Trait Pred. Prec.(%) | Trait Pred. Rec.(%) | Trait Pred. F1(%) | Tk.(k) | Time(s) |
---|---|---|---|---|---|---|---|---|---|---|
GenoAgent (Ours) | 54.64 | 52.28 | 51.19 | 48.07 | 94.40 | 91.97 | 89.48 | 86.26 | 31.90 | 183.36 |
GenoAgent (Round=1) | 50.38 | 49.48 | 48.37 | 43.18 | 89.82 | 79.26 | 81.78 | 82.84 | 26.44 | 152.47 |
GenoAgent (No Reviewer) | 21.35 | 20.20 | 20.10 | 18.77 | 62.81 | 57.76 | 62.58 | 59.31 | 23.85 | 128.63 |
GenoAgent (No Domain Expert) | 47.94 | 43.80 | 41.33 | 37.19 | 27.82 | 24.68 | 26.59 | 24.79 | 29.23 | 158.37 |
Inter-Annotator Agreement | 75.58 | 70.64 | 69.66 | 68.64 | - | - | - | - | - | 10.74 |
GPT-4o zero-shot | 8.47 | 0.12 | 2.41 | 2.69 | - | - | - | - | 0.06 | 8.32 |
Methods | Merged AJ(%) | Merged SJ(%) | Merged CSC(%) | Gene AJ(%) | Gene SJ(%) | Gene CSC(%) | Trait AJ(%) | Trait SJ(%) | Trait CSC(%) |
---|---|---|---|---|---|---|---|---|---|
GenoAgent (Ours) | 89.82 | 86.98 | 79.71 | 92.80 | 89.87 | 80.63 | 46.81 | 63.71 | 32.28 |
GenoAgent (Round=1) | 87.04 | 82.15 | 74.43 | 88.04 | 82.34 | 76.11 | 45.04 | 59.25 | 30.74 |
GenoAgent (No Reviewer) | 35.18 | 35.06 | 32.73 | 36.01 | 35.7 | 33.62 | 24.02 | 32.58 | 6.45 |
GenoAgent (No Domain Expert) | 78.54 | 75.93 | 70.01 | 80.79 | 76.38 | 69.67 | 25.14 | 23.48 | 4.68 |
Statistical analysis
For the statistical analysis task, we used datasets preprocessed by human bioinformaticians and instructed various baseline methods to perform statistical analysis following our standardized pipeline. The results are shown in Table 5. Unlike data preprocessing, this task primarily involves leveraging Python libraries for generic statistical modeling, allowing several LLMs or agent-based models to achieve decent performance.
Methods | Benchmark Prec.(%) | Benchmark Rec.(%) | Benchmark F1(%) | Benchmark Jac.(%) | Trait Pred. Acc.(%) | Trait Pred. Prec.(%) | Trait Pred. Rec.(%) | Trait Pred. F1(%) |
---|---|---|---|---|---|---|---|---|
GenoAgent (Ours) | 68.18 | 62.84 | 67.08 | 68.67 | 57.7 | 57.73 | 58.67 | 57.42 |
MetaGPT [24] | 64.90 | 67.20 | 70.28 | 67.14 | 60.63 | 60.85 | 57.04 | 58.55 |
GPT-4o [39] | 61.61 | 62.75 | 60.48 | 63.85 | 55.39 | 50.72 | 52.50 | 50.42 |
Llama 3 (8B) [36] | 8.29 | 10.42 | 8.58 | 12.68 | 8.36 | 8.90 | 5.54 | 5.45 |
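The benchmark-performance columns compare the gene set identified by a method with the reference gene set. As a rough illustration, the sketch below assumes the usual set-based definitions of precision, recall, F1, and Jaccard; the benchmark's exact definitions and averaging scheme may differ, and the gene symbols are hypothetical inputs chosen only for the example.

```python
def gene_set_metrics(predicted, reference):
    """Set-based precision, recall, F1 and Jaccard between two gene lists."""
    pred, ref = set(predicted), set(reference)
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    # F1 is the harmonic mean of precision and recall (0 when nothing overlaps).
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    union = len(pred | ref)
    jaccard = overlap / union if union else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "jaccard": jaccard}


# Hypothetical gene symbols, for illustration only.
metrics = gene_set_metrics(
    predicted=["TP53", "BRCA1", "EGFR", "MYC"],
    reference=["TP53", "BRCA1", "ESR1"],
)
print(metrics)
```

Working on sets rather than ranked lists keeps the four scores directly comparable, since all of them derive from the same overlap count.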
5.2 Discussions
While the results demonstrate the potential of LLM-based methods in gene analysis, they also highlight the limitations of current approaches.
Instability of the feedback mechanism
For complex tasks, the ideal scenario is for the agent to iteratively improve its code based on feedback until it reaches the correct solution. However, the results in Table 3 indicate that while a single round of feedback significantly improves performance compared to no feedback, additional rounds provide only marginal benefits, leaving a notable gap to human performance. By examining the agents’ operations (Appendix G), we found that feedback from the Code Reviewer agent often varied randomly and was sometimes incorrect. RLHF-tuned large models appear susceptible to being misled by such feedback rather than adhering to their initial insights. A promising direction is to design collaborative modes that encourage agents to discuss differing opinions iteratively to enhance their understanding of the task.
6 Conclusion
In this work, we introduced GenoTEX, a benchmark dataset designed to facilitate the automatic exploration of gene expression data for identifying disease-associated genes. GenoTEX encompasses a comprehensive analysis pipeline, reflecting the standards of computational genomics, and includes annotated code and results curated by expert bioinformaticians. By defining three core tasks—dataset selection, data preprocessing, and statistical analysis—we provide a robust framework for evaluating and developing automated methods. Furthermore, our proposed GenoAgent, a team of LLM-based agents, demonstrates the potential of integrating large language models into the field of genomics. Our experiments highlight both the strengths and limitations of these agents, underscoring the need for further research to address challenges in nuanced human judgment and data anomalies. GenoTEX is poised to be a useful resource in advancing AI-driven genomics data analysis, promoting efficiency, accuracy, and scalability in biomedical research.
Acknowledgments and Disclosure of Funding
This research was supported by the Accelerate Foundation Models Research (AFMR) initiative funded by Microsoft Research. We used the Microsoft Azure OpenAI service for our experiments and are thankful for the computation credits and technical support provided.
References
- Akhtar et al. [2024] M. Akhtar, O. Benjelloun, C. Conforti, P. Gijsbers, J. Giner-Miguelez, N. Jain, M. Kuchnik, Q. Lhoest, P. Marcenac, M. Maskey, P. Mattson, L. Oala, P. Ruyssen, R. Shinde, E. Simperl, G. Thomas, S. Tykhonov, J. Vanschoren, J. van der Velde, S. Vogler, and C.-J. Wu. Croissant: A metadata format for ml-ready datasets. In Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning, SIGMOD/PODS ’24. ACM, June 2024. doi: 10.1145/3650203.3663326. URL http://dx.doi.org/10.1145/3650203.3663326.
- Arasteh et al. [2024] S. T. Arasteh, T. Han, M. Lotfinia, C. Kuhl, J. N. Kather, D. Truhn, and S. Nebelung. Large language models streamline automated machine learning for clinical studies. Nature Communications, 15(1603), 2024. doi: 10.1038/s41467-024-45879-8. URL https://www.nature.com/articles/s41467-024-45879-8.
- Bartley [2023] K. Bartley. Big data statistics: How much data is there in the world?, 2023. URL https://rivery.io/blog/big-data-statistics-how-much-data-is-there-in-the-world/.
- Besta et al. [2023] M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, L. Gianinazzi, J. Gajda, T. Lehmann, M. Podstawski, H. Niewiadomski, P. Nyczyk, and T. Hoefler. Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv: 2308.09687, 2023.
- BPC [2023] R. BPC. Navigating the intersection of biostatistics, bioinformatics, and machine learning., 2023. URL https://medium.com/@RR-BPC/navigating-the-intersection-of-biostatistics-bioinformatics-and-machine-learning-d1b1337757b9.
- Bran et al. [2023] A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, and P. Schwaller. Chemcrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv: 2304.05376, 2023.
- Brown et al. [2015] G. R. Brown, V. Hem, K. S. Katz, M. Ovetsky, C. Wallin, O. Ermolaeva, I. Tolstoy, T. Tatusova, K. D. Pruitt, and D. R. Maglott. Gene: a gene-centered information resource at NCBI. Nucleic Acids Research, 43(D1):D36–D42, 2015. doi: 10.1093/nar/gku1055. URL https://doi.org/10.1093/nar/gku1055.
- Bruning et al. [2016] O. Bruning, W. Rodenburg, P. F. Wackers, C. Van Oostrom, M. J. Jonker, R. J. Dekker, H. Rauwerda, W. A. Ensink, A. De Vries, and T. M. Breit. Confounding factors in the transcriptome analysis of an in-vivo exposure experiment. PLoS One, 11(1):e0145252, 2016.
- Chan and Ginsburg [2011] I. S. Chan and G. S. Ginsburg. Personalized medicine: progress and promise. Annual review of genomics and human genetics, 12:217–244, 2011.
- Chen et al. [2023] X. Chen, M. Lin, N. Schärli, and D. Zhou. Teaching large language models to self-debug. arXiv preprint arXiv: 2304.05128, 2023.
- Clough and Barrett [2016] E. Clough and T. Barrett. The gene expression omnibus database. Methods in Molecular Biology, 1418:93–110, 2016. doi: 10.1007/978-1-4939-3578-9_5.
- Commons [2013] C. Commons. Creative commons attribution 4.0 international public license, 2013. URL https://creativecommons.org/licenses/by/4.0/.
- Dong et al. [2023] Y. Dong, X. Jiang, Z. Jin, and G. Li. Self-collaboration code generation via chatgpt. arXiv preprint arXiv: 2304.07590, 2023.
- Du et al. [2023] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv: 2305.14325, 2023.
- Eldeeb et al. [2024] H. Eldeeb, M. Maher, R. Elshawi, and S. Sakr. Automlbench: A comprehensive experimental evaluation of automated machine learning frameworks. Expert Systems with Applications, 243:122877, 2024.
- Feng et al. [2023] G. Feng, B. Zhang, Y. Gu, H. Ye, D. He, and L. Wang. Towards revealing the mystery behind chain of thought: A theoretical perspective. NEURIPS, 2023.
- Gebru et al. [2020] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2020.
- Ghosh and Chinnaiyan [2005] D. Ghosh and A. M. Chinnaiyan. Classification and selection of biomarkers in genomic data using lasso. Journal of Biomedicine and Biotechnology, 2005(2):147, 2005.
- Goldman et al. [2020] M. J. Goldman, B. Craft, M. Hastie, et al. Visualizing and interpreting cancer genomics data via the xena platform. Nature Biotechnology, 2020. doi: 10.1038/s41587-020-0546-8. URL https://doi.org/10.1038/s41587-020-0546-8.
- Gou et al. [2023] Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan, and W. Chen. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738, 2023.
- Guo et al. [2023] T. Guo, K. Guo, B. Nan, Z. Liang, Z. Guo, N. V. Chawla, O. Wiest, and X. Zhang. What can large language models do in chemistry? a comprehensive benchmark on eight tasks. arXiv preprint arXiv:2305.18365, 2023.
- Hamburg and Collins [2010] M. A. Hamburg and F. S. Collins. The path to personalized medicine. New England Journal of Medicine, 363(4):301–304, 2010.
- Hao et al. [2023] S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Wang, and Z. Hu. Reasoning with language model is planning with world model. Conference on Empirical Methods in Natural Language Processing, 2023. doi: 10.48550/arXiv.2305.14992.
- Hong et al. [2023] S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework. arXiv preprint arXiv: 2308.00352, 2023.
- Huang et al. [2023] J. Huang, X. Chen, S. Mishra, H. S. Zheng, A. W. Yu, X. Song, and D. Zhou. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798, 2023.
- Institute [2024] N. H. G. R. Institute. Genomic data science, 2024. URL https://www.genome.gov/about-genomics/fact-sheets/Genomic-Data-Science. Accessed: 2024-06-03.
- Intelligence [2023] M. Intelligence. Bioinformatics market size & share analysis - growth trends & forecasts source, 2023. URL https://www.mordorintelligence.com/industry-reports/global-bioinformatics-market-industry.
- Kellogg and Lanthaler [2020] G. Kellogg and M. Lanthaler. Json-ld 1.1: A json-based serialization for linked data. https://www.w3.org/TR/json-ld11/, July 2020. World Wide Web Consortium (W3C) Recommendation.
- Kyalwazi et al. [2023] B. Kyalwazi, C. Yau, M. J. Campbell, T. F. Yoshimatsu, A. J. Chien, A. M. Wallace, A. Forero-Torres, L. Pusztai, E. D. Ellis, K. S. Albain, et al. Race, gene expression signatures, and clinical outcomes of patients with high-risk early breast cancer. JAMA Network Open, 6(12):e2349646–e2349646, 2023.
- Leek et al. [2010] J. T. Leek, R. B. Scharpf, H. C. Bravo, D. Simcha, B. Langmead, W. E. Johnson, D. Geman, K. Baggerly, and R. A. Irizarry. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics, 11(10):733–739, 2010.
- Li et al. [2023] H. Li, Y. Q. Chong, S. Stepputtis, J. Campbell, D. Hughes, M. Lewis, and K. Sycara. Theory of mind for multi-agent collaboration via large language models. arXiv preprint arXiv:2310.10701, 2023.
- Liu et al. [2023] B. Liu, Y. Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas, and P. Stone. Llm+p: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv: 2304.11477, 2023.
- Ma et al. [2023] P. Ma, R. Ding, S. Wang, S. Han, and D. Zhang. Insightpilot: An llm-empowered automated data exploration system. arXiv preprint arXiv:2304.00477, 2023. URL https://arxiv.org/abs/2304.00477.
- Madaan et al. [2023] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023.
- Madani et al. [2023] A. Madani, B. Krause, E. R. Greene, S. Subramanian, B. P. Mohr, J. M. Holton, J. L. Olmos Jr, C. Xiong, Z. Z. Sun, R. Socher, et al. Large language models generate functional protein sequences across diverse families. Nature Biotechnology, pages 1–8, 2023.
- Meta [2024] Meta. Llama-3, 2024. URL https://ai.meta.com/blog/meta-llama-3/. The state-of-the-art open source large language model of Meta.
- Ning et al. [2023] X. Ning, Z. Lin, Z. Zhou, H. Yang, and Y. Wang. Skeleton-of-thought: Large language models can do parallel decoding. arXiv preprint arXiv:2307.15337, 2023.
- OpenAI [2023] OpenAI. Gpt-4 technical report. PREPRINT, 2023.
- OpenAI [2024] OpenAI. Gpt-4o, 2024. URL https://openai.com/index/hello-gpt-4o/. Latest Large language model of OpenAI.
- Park et al. [2023] J. Park, J. C. O’Brien, C. J. Cai, M. Morris, P. Liang, and M. S. Bernstein. Generative agents: Interactive simulacra of human behavior. ACM Symposium on User Interface Software and Technology, 2023. doi: 10.1145/3586183.3606763.
- [41] I. Payscale. Bioinformatics hourly rate. URL https://www.payscale.com/research/US/Skill=Bioinformatics/Hourly_Rate. Accessed: 2024-06-20.
- Qian et al. [2023] C. Qian, X. Cong, W. Liu, C. Yang, W. Chen, Y. Su, Y. Dang, J. Li, J. Xu, D. Li, Z. Liu, and M. Sun. Communicative agents for software development. arXiv preprint arXiv: 2307.07924, 2023.
- Qin et al. [2023] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv: 2307.16789, 2023.
- Research and Markets [2024] Research and Markets. Next generation sequencing (ngs) data analysis - global strategic business report, 2024. URL https://www.researchandmarkets.com/reports/5303640/next-generation-sequencing-ngs-data-analysis. Accessed: 2024-06-03.
- Research [2024] D. B. M. Research. Global next generation sequencing data analysis market – industry trends and forecast to 2030, 2024. URL https://www.databridgemarketresearch.com/reports/global-next-generation-sequencing-data-analysis-market. Accessed: 2024-06-03.
- Rosenquist et al. [2023] R. Rosenquist, E. Bernard, T. Erkers, D. W. Scott, R. Itzykson, P. Rousselot, J. Soulier, M. Hutchings, P. Östling, L. Cavelier, et al. Novel precision medicine approaches and treatment strategies in hematological malignancies. Journal of Internal Medicine, 294(4):413–436, 2023.
- RunPod [2024] RunPod. Runpod: The cloud built for ai. https://www.runpod.io/, 2024. Accessed: 2024-06-06.
- Shapiro et al. [2023] D. Shapiro, W. Li, M. Delaflor, and C. Toxtli. Conceptual framework for autonomous cognitive entities. arXiv preprint arXiv: 2310.06775, 2023.
- Singhal et al. [2023] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
- Stühler et al. [2023] H. Stühler, M.-A. Zöller, D. Klau, A. Beiderwellen-Bedrikow, and C. Tutschku. Benchmarking automated machine learning methods for price forecasting applications. arXiv preprint arXiv:2304.14735, 2023.
- Sumers et al. [2023] T. R. Sumers, S. Yao, K. Narasimhan, and T. L. Griffiths. Cognitive architectures for language agents. arXiv preprint arXiv: 2309.02427, 2023.
- Talebirad and Nadiri [2023] Y. Talebirad and A. Nadiri. Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv: 2306.03314, 2023.
- Tibshirani [1996] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996.
- Tomczak et al. [2015] K. Tomczak, P. Czerwińska, and M. Wiznerowicz. The cancer genome atlas (tcga): an immeasurable source of knowledge. Contemporary Oncology (Poznan), 19(1A):A68–77, 2015. doi: 10.5114/wo.2014.47136.
- Touvron et al. [2023a] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv: 2302.13971, 2023a.
- Touvron et al. [2023b] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv: 2307.09288, 2023b.
- Wang et al. [2022a] H. Wang, B. Aragam, and E. P. Xing. Trade-offs of linear mixed models in genome-wide association studies. Journal of Computational Biology, 29(3):233–242, 2022a.
- Wang et al. [2023a] K. Wang, Y. Lu, M. Santacroce, Y. Gong, C. Zhang, and Y. Shen. Adapting llm agents through communication. arXiv preprint arXiv: 2310.01444, 2023a.
- Wang et al. [2023b] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, et al. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432, 2023b.
- Wang et al. [2022b] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022b.
- Wang et al. [2024] X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji. Executable code actions elicit better llm agents. arXiv preprint arXiv:2402.01030, 2024.
- Wang et al. [2023c] Y. Wang, Z. Jiang, Z. Chen, F. Yang, Y. Zhou, E. Cho, X. Fan, X. Huang, Y. Lu, and Y. Yang. Recmind: Large language model powered agent for recommendation. arXiv preprint arXiv:2308.14296, 2023c.
- Wang et al. [2023d] Z. Wang, S. Mao, W. Wu, T. Ge, F. Wei, and H. Ji. Unleashing the emergent cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration. arXiv preprint arXiv:2307.05300, 2023d.
- Wei et al. [2022] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- Woodie [2020] A. Woodie. Data prep still dominates data scientists’ time, survey finds, 2020. URL https://www.datanami.com/2020/07/06/data-prep-still-dominates-data-scientists-time-survey-finds/.
- Wu et al. [2009] T. T. Wu, Y. F. Chen, T. Hastie, E. Sobel, and K. Lange. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714–721, 2009.
- Xi et al. [2023] Z. Xi, S. Jin, Y. Zhou, R. Zheng, S. Gao, T. Gui, Q. Zhang, and X. Huang. Self-polish: Enhance reasoning in large language models via problem refinement. arXiv preprint arXiv:2305.14497, 2023.
- Xu et al. [2023] Z. Xu, S. Shi, B. Hu, J. Yu, D. Li, M. Zhang, and Y. Wu. Towards reasoning in large language models via multi-agent peer review collaboration. arXiv preprint arXiv: 2311.08152, 2023.
- Yang et al. [2023a] H. Yang, S. Yue, and Y. He. Auto-gpt for online decision making: Benchmarks and additional opinions. arXiv preprint arXiv: 2306.02224, 2023a.
- Yang et al. [2023b] R. Yang, T. F. Tan, W. Lu, A. J. Thirunavukarasu, D. S. W. Ting, and N. Liu. Large language models in health care: Development, applications, and challenges. Health Care Science, 2(4):255–263, 2023b.
- Yao et al. [2023] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
- Yin et al. [2023] Z. Yin, Q. Sun, C. Chang, Q. Guo, J. Dai, X.-J. Huang, and X. Qiu. Exchange-of-thought: Enhancing large language model capabilities through cross-model communication. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 15135–15153, 2023.
- Yu et al. [2006] J. Yu, G. Pressoir, W. H. Briggs, I. Vroh Bi, M. Yamasaki, J. F. Doebley, M. D. McMullen, B. S. Gaut, D. M. Nielsen, J. B. Holland, et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature genetics, 38(2):203–208, 2006.
- Zhao et al. [2023] R. Zhao, X. Li, S. R. Joty, C. Qin, and L. Bing. Verify-and-edit: A knowledge-enhanced chain-of-thought framework. Annual Meeting of the Association for Computational Linguistics, 2023. doi: 10.48550/arXiv.2305.03268.
- Zheng et al. [2023] L. Zheng, R. Wang, and B. An. Synapse: Leveraging few-shot exemplars for human-level computer control. arXiv preprint arXiv:2306.07863, 2023.
- Zhou et al. [2023] P. Zhou, A. Madaan, S. P. Potharaju, A. Gupta, K. R. McKee, A. Holtzman, J. Pujara, X. Ren, S. Mishra, A. Nematzadeh, S. Upadhyay, and M. Faruqui. How far are large language models from agents with theory-of-mind? arXiv preprint arXiv: 2310.03051, 2023.
Supplementary Material
The supplementary material is organized as follows:
- Appendix A describes the dataset accessibility, documentation, and maintenance of our benchmark.
- Appendix B introduces the guidelines file used to standardize the manual curation of our benchmark.
- Appendix C provides examples of manual analysis on trait data extraction.
- Appendix D outlines the criteria for forming trait-condition pairs for gene identification problems in our benchmark.
- Appendix E describes our data acquisition process.
- Appendix F presents our preliminary experiments highlighting the challenges faced by existing LLMs and agent-based methods on our benchmark.
- Appendix G discusses the limitations of GenoAgent on our benchmark.
Appendix A Dataset accessibility, documentation, and maintenance
A.1 Documentation and intended uses
GenoTEX is documented following the Datasheets for Datasets [17] framework, providing a comprehensive description of the data collection process, preprocessing steps, and statistical analysis methods employed. The detailed datasheet is available here. Each step of the pipeline aims to mirror the standards of computational genomics, ensuring that the dataset is both accurate and reliable. The intended uses of GenoTEX are broad, encompassing the evaluation and development of AI-driven methods for genomics data analysis. By providing a standardized benchmark, GenoTEX aims to facilitate the advancement of machine learning models capable of automating the complex task of gene expression analysis. Researchers in bioinformatics and related fields can leverage this dataset to benchmark their algorithms, fostering innovation and improving the scalability of gene identification processes.
A.2 Open access and maintenance
To ensure the accessibility and usability of GenoTEX, we have made the dataset publicly available at https://github.com/Liu-Hy/GenoTex. The dataset is hosted on GitHub, ensuring long-term availability and ease of access. The metadata associated with the dataset is documented using the Croissant Metadata Record [1], providing a structured and detailed overview of the dataset’s features and attributes. We have structured the metadata according to the JSON-LD standard [28] to enhance interoperability and organization. We will maintain the dataset with regular updates and ongoing support to address any issues or improvements that may arise.
A.3 Licensing and responsibility
The GenoTEX dataset is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) license [12], which allows for broad usage while protecting the rights of the creators. The authors bear full responsibility for ensuring that the dataset adheres to this license and for any potential violations of rights. This responsibility includes ensuring that all data included in GenoTEX is ethically sourced and legally compliant. Throughout the curation process, we engaged in extensive discussions and consultations to address ethical considerations and legal requirements. This involved careful examination of each dataset to ensure the absence of personally identifiable information and compliance with all relevant standards. Despite our best efforts, we remain aware that ethical landscapes can be complex and evolving, and we continually ask ourselves whether we are meeting the highest standards. Our approach aims to ensure that GenoTEX meets the ethical and legal standards expected in the field of machine learning and computational genomics research.
A.4 Data format and persistent identifiers
GenoTEX is provided in open and widely used data formats, including CSV and JSON, ensuring compatibility with a wide range of analytical tools and platforms. Detailed instructions on how to read and use the dataset are included in the documentation, making it accessible to both novice and experienced researchers. To enhance the dataset’s stability and ease of reference, we have minted a DOI for GenoTEX. This DOI will serve as a reliable means of access and citation for the dataset, promoting its use in academic and professional research.
A.5 Reproducibility
To support reproducibility, we have included all the necessary datasets, code, and evaluation procedures in our documentation. We have worked diligently to ensure that others can replicate the results of our analyses. Through this commitment to transparency and reproducibility, we hope to facilitate the wider adoption and validation of AI-driven methods in genomics research.
Appendix B Guidelines for gene expression data analysis
To tackle the complexities of gene expression data analysis, we have established a set of comprehensive guidelines shown below. These guidelines try to replicate the detailed processes of a skilled bioinformatician, covering dataset preprocessing, selection, and statistical analysis. By following these standardized procedures, we seek to improve consistency and reliability in our manual benchmark curation.
Appendix C Examples of manual analysis
In addition to the guidelines file, we provide example files to the participants of our benchmark curation. These examples include code and results for analyzing gene identification problems related to traits such as Breast Cancer and Epilepsy. These illustrations have proven helpful in familiarizing participants with these tasks quickly. Among the many steps in the analysis pipeline, a key step is the trait data extraction during the preprocessing of GEO data. This step requires biomedical knowledge and an understanding of the dataset collection process described in the metadata. In this section, we will introduce the part of the manual analysis examples related to this crucial step.
C.1 Problem statement
Our goal was to extract clinical traits from GEO datasets. For each trait of interest, we aimed to determine its availability and develop encoding rules to automate the extraction process. Below are two examples focusing on Breast Cancer and Epilepsy, respectively.
C.2 Breast Cancer example
C.2.1 Input data
C.2.2 Inference process
The dataset summary indicated that tissue samples from primary breast cancer (BC) and lung adenocarcinoma (LUAD), along with their matched-paired brain metastases, were included. By examining the sample characteristics dictionary, combined with domain knowledge, we identified subtypes such as 'TNBC', 'ER+', 'PR+', and 'HER2+' associated with breast cancer, and 'adenocarcinoma' associated with lung cancer. Based on this, we developed a rule: tissues labeled with 'TNBC', 'ER+', 'PR+', or 'HER2+' are coded as having breast cancer (1), while 'adenocarcinoma' is coded as not having breast cancer (0).
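The rule above can be sketched as a small encoding function. This is a minimal illustration, not the code from the benchmark: the "field: label" layout of the input strings and the field name "histology" are assumptions, and unrecognized labels are mapped to None so that they can be dropped later.

```python
# Subtype labels that the rule treats as breast cancer.
BREAST_CANCER_SUBTYPES = ("TNBC", "ER+", "PR+", "HER2+")


def convert_trait(value):
    """Return 1 (breast cancer), 0 (not breast cancer) or None (unrecognized)."""
    if not isinstance(value, str):
        return None
    # Assumed layout "field: label"; keep only the label if a colon is present.
    label = value.split(":", 1)[-1].strip()
    if any(subtype in label for subtype in BREAST_CANCER_SUBTYPES):
        return 1
    if "adenocarcinoma" in label.lower():
        return 0
    return None


# Hypothetical sample-characteristics strings.
print(convert_trait("histology: TNBC"))            # breast cancer subtype
print(convert_trait("histology: adenocarcinoma"))  # lung adenocarcinoma
print(convert_trait("histology: not available"))   # unrecognized label
```

Returning None instead of guessing keeps ambiguous samples out of the trait column, at the cost of a smaller sample size.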
C.3 Epilepsy example
C.3.1 Input data
C.3.2 Inference process
The dataset summary indicated that brain tissues from patients with temporal lobe epilepsy with hippocampal sclerosis (TLE+HS) and control samples were included. By examining the sample characteristics dictionary, we identified tissue types such as 'Hippocampus', 'Temporal lobe', and 'Parietal lobe'. We inferred that 'Hippocampus' and 'Temporal lobe' tissues are associated with TLE+HS (epilepsy), while 'Parietal lobe' tissues are from control samples. Based on this, we developed a rule: tissues labeled with 'Hippocampus' or 'Temporal lobe' are coded as having epilepsy (1), while 'Parietal lobe' is coded as control (0).
C.4 Validation and conclusion
By executing the provided Python functions, we confirmed the accuracy of our trait extraction process. For instance, applying the convert_trait function for the epilepsy dataset, we verified the presence of exactly six samples with the positive Epilepsy trait, consistent with the metadata description. Similarly, for the breast cancer dataset, the function accurately identified 22 samples with the Breast Cancer trait. These examples highlight the dataset context understanding and domain knowledge inference required for the accurate preprocessing of gene expression data.
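A check of this kind can be sketched as follows. The snippet re-implements the epilepsy rule from C.3.2 and counts the resulting labels so that they can be compared with the counts stated in the metadata. Only the function name convert_trait comes from the text; the sample strings, the "tissue:" field name, and the expected count are made up for illustration.

```python
from collections import Counter


def convert_trait(value):
    """Epilepsy rule from C.3.2: 1 = TLE+HS tissue, 0 = control, None = unrecognized."""
    label = value.split(":", 1)[-1].strip().lower()
    if label in ("hippocampus", "temporal lobe"):
        return 1
    if label == "parietal lobe":
        return 0
    return None


# Made-up sample characteristics; a real check would use the GEO series values.
samples = [
    "tissue: Hippocampus",
    "tissue: Temporal lobe",
    "tissue: Parietal lobe",
    "tissue: Hippocampus",
    "tissue: Cerebellum",
]
label_counts = Counter(convert_trait(s) for s in samples)

# Compare with the count stated in the (here hypothetical) metadata description.
expected_positive = 3
counts_match = label_counts[1] == expected_positive
print(label_counts, counts_match)
```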
Appendix D Criteria for manual correction of trait-condition pairs
To ensure the scientific validity of our benchmark questions, we apply specific rules for including and excluding certain trait-condition pairs. Each biomedical entity in our list can be considered a trait and paired with a condition, where the condition is either another entity from the list or a demographic attribute like "age" or "gender." The following criteria are designed to maintain scientific relevance and robustness:
- Trait-Condition Role Assignment: Entities such as language abilities, Vitamin D levels, and bone density are included only as conditions and not as traits. This distinction ensures that the primary focus remains on traits with more direct clinical implications, while these entities serve as influential factors that could affect those traits.
- Universal Conditions: Entities such as obesity, hypertension, and mental disorders like anxiety disorder and bipolar disorder are designated as conditions to be paired with all other traits. This is because these conditions are widespread and significantly impact various health outcomes, making them critical factors to consider in any genetic analysis.
- Gender-Specific Considerations: Gender-specific entities such as prostate cancer, endometriosis, and breast cancer are not conditioned on gender. Furthermore, entities from different genders are not paired. This approach respects the biological distinctions between genders and ensures that the resulting questions remain relevant and meaningful.
- Cancer Category Exclusion: Pairs where both the trait and the condition belong to the cancer category are excluded. This is because investigating genetic factors behind one type of cancer conditioned on another type of cancer is often less scientifically important. The focus is placed on broader, more impactful genetic relationships that offer greater insight into cancer biology.
These criteria are used in combination with the Jaccard similarity of related genes (Section 3.2) to uphold the scientific integrity and relevance of the benchmark questions, facilitating meaningful and insightful gene expression analysis.
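The exclusion rules among these criteria can be expressed as a simple pair filter. The sketch below is one possible encoding, not the procedure used to build the benchmark: the entity annotations (the cancer, sex, and condition_only fields) are hypothetical, and the universal-conditions rule and the Jaccard criterion are left out because they depend on information not modeled here.

```python
# Hypothetical annotations for a handful of entities.
ENTITIES = {
    "Breast Cancer":   {"cancer": True,  "sex": "female", "condition_only": False},
    "Prostate Cancer": {"cancer": True,  "sex": "male",   "condition_only": False},
    "Endometriosis":   {"cancer": False, "sex": "female", "condition_only": False},
    "Epilepsy":        {"cancer": False, "sex": None,     "condition_only": False},
    "Bone Density":    {"cancer": False, "sex": None,     "condition_only": True},
}
DEMOGRAPHICS = {"Age", "Gender"}


def is_valid_pair(trait, condition):
    """Apply the role, gender, and cancer-category exclusions to one pair."""
    if trait == condition:
        return False
    t = ENTITIES[trait]
    if t["condition_only"]:
        return False  # condition-only entities never act as traits
    if condition in DEMOGRAPHICS:
        # Gender-specific traits are not conditioned on gender.
        return not (condition == "Gender" and t["sex"] is not None)
    c = ENTITIES[condition]
    if t["cancer"] and c["cancer"]:
        return False  # cancer-cancer pairs are excluded
    if t["sex"] and c["sex"] and t["sex"] != c["sex"]:
        return False  # entities from different genders are not paired
    return True


conditions = list(ENTITIES) + sorted(DEMOGRAPHICS)
valid_pairs = [(t, c) for t in ENTITIES for c in conditions if is_valid_pair(t, c)]
print(len(valid_pairs), "pairs kept")
```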
Appendix E Details about the data sources
GEO
The Gene Expression Omnibus (GEO) [11] is a public archive for high-throughput gene expression data and various other types of genomic data. We leveraged the Entrez programming utility to perform a systematic search of the GEO database for human series data relevant to each trait on our list, prioritizing datasets with large sample sizes. We downloaded both SOFT and matrix files for each series and used heuristic evaluations of file sizes to pinpoint datasets likely containing gene expression data. When automated searches failed to yield results for specific traits, we conducted manual searches using expanded synonyms from Medical Subject Headings (MeSH) terms.
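To make the search step concrete, the sketch below builds an NCBI E-utilities ESearch request against the GEO DataSets database (db=gds), restricted to human series records. The query string is an assumption about how such a search could be phrased, not the query actually used for the benchmark, and the function name is hypothetical. The snippet only constructs the URL; sending the request, ranking results by sample size, and downloading the SOFT and matrix files are omitted.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_search_url(trait, max_results=20):
    """Build an ESearch URL for human GEO series that mention the trait."""
    # Field tags follow the GEO DataSets query syntax; the combination is illustrative.
    term = f'"{trait}" AND "Homo sapiens"[Organism] AND gse[Entry Type]'
    params = {"db": "gds", "term": term, "retmax": max_results, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_geo_search_url("Epilepsy")
print(url)
```

For traits with no hits, the same function could be called again with each MeSH synonym in place of the trait name.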
TCGA-Xena
The Cancer Genome Atlas (TCGA) [54], accessed through the Xena platform [19], offers a rich repository of RNAseq gene expression and clinical data for numerous cancer types. We obtained data for 36 traits from the TCGA cohort using the UCSC Xena platform, which provides high-quality, cancer-related gene expression and clinical data linked by patient IDs.
NCBI Gene
The NCBI Gene database [7] is an important resource for comprehensive information on gene sequences, functions, and their links to diseases and conditions. For each trait, we queried the database to compile sets of gene symbols associated with the trait. This data was crucial for identifying disease-disease associations for question generation and for selecting common regressors in two-step regression analyses.
Appendix F Challenges faced by existing methods on our benchmark
Gene expression data analysis is a complex and specialized task. Despite their problem-solving abilities, state-of-the-art LLMs and agent-based methods struggle with gene expression data. Our evaluations of methods such as GPT-4o [39], MetaGPT [24], and CodeAct [61] revealed consistent failures across various settings.
We tested these methods under three different settings: (i) providing general task instructions, (ii) providing detailed task instructions used by GenoAgent, and (iii) providing detailed task instructions and all necessary library functions as in GenoAgent. Each setting was tested on a subset of 50 gene identification problems. Our results show that none of the methods generated runnable code for preprocessing datasets downloaded from GEO. Persistent errors in the generated code prevented testable outputs, regardless of the level of detail provided.
First, we find that when preprocessing GEO data, these methods often fail at dataset loading in the initial steps. The gene expression data files follow special formats: the tabular data is embedded in a text file, and extracting it requires identifying special markers, skipping metadata rows, and setting other parsing parameters correctly. The baseline methods struggle with these requirements, resulting in data reading failures.
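The loading difficulty can be illustrated with a small parser for the GEO series matrix layout, in which metadata lines start with `!` and the expression table sits between the `!series_matrix_table_begin` and `!series_matrix_table_end` markers. This is a minimal sketch, not the library function provided with GenoAgent: the function name and the toy file content are assumptions, and real files are typically gzip-compressed and much larger.

```python
import csv
import io

TABLE_BEGIN = "!series_matrix_table_begin"
TABLE_END = "!series_matrix_table_end"


def read_series_matrix_table(text):
    """Return (header, rows) for the tab-separated table between the markers."""
    table_lines = []
    inside = False
    for line in io.StringIO(text):
        stripped = line.strip()
        if stripped == TABLE_BEGIN:
            inside = True  # everything before this marker is metadata
            continue
        if stripped == TABLE_END:
            break
        if inside and stripped:
            table_lines.append(line)
    if not table_lines:
        raise ValueError("no expression table found between the markers")
    reader = csv.reader(table_lines, delimiter="\t", quotechar='"')
    header = next(reader)  # "ID_REF" followed by sample accessions
    rows = list(reader)    # one row per probe or gene identifier
    return header, rows


# Toy file content, invented for this example.
SAMPLE = (
    '!Series_title\t"Example series"\n'
    '!Sample_characteristics_ch1\t"disease: control"\t"disease: case"\n'
    "!series_matrix_table_begin\n"
    '"ID_REF"\t"GSM1"\t"GSM2"\n'
    '"1007_s_at"\t7.21\t6.98\n'
    '"1053_at"\t5.10\t5.32\n'
    "!series_matrix_table_end\n"
)

header, rows = read_series_matrix_table(SAMPLE)
```

Values are returned as strings; converting them to floats and handling missing entries is left out of this sketch.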
We manually corrected the data loading code for the baseline methods and continued with the tasks. However, they were still unable to conduct the inference required to extract clinical features. This step is inherently difficult: even in our GenoAgent method, it often requires at least one round of debugging by the Domain Expert agent to achieve a higher success rate.
The challenges faced by methods like MetaGPT and CodeAct in processing gene expression data primarily stem from their difficulty in handling specialized data formats and the absence of flexible feedback mechanisms. MetaGPT, primarily designed for software engineering tasks, operates with an independent execution model and limited context-awareness, which can impede dynamic adaptation during task execution and lead to errors when dealing with the nuanced formats of gene expression datasets. CodeAct, while effective at generating executable code through structured prompts, lacks the context-aware planning and iterative refinement necessary for the intricate steps involved in gene expression data preprocessing. Its static approach does not easily accommodate the dynamic adjustments required for diverse and complex gene expression data, leading to errors during initial data loading and clinical feature extraction.
In contrast, GenoAgent employs a team of specialized agents that maintain a comprehensive task context and leverage expert consultation, allowing for context-aware planning and iterative correction. This enables GenoAgent to handle the complexities of genomics data analysis more effectively, improving its reliability in data preprocessing.
Appendix G Discussion on the limitations of GenoAgent
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x3.png)
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x4.png)
This section discusses the observed limitations of our baseline method, GenoAgent, on the GenoTEX benchmark. We found that certain steps are inherently challenging and that instability in the feedback mechanism may hinder the agents’ iterative improvement process. The two figures in this appendix illustrate the two types of feedback mechanisms in GenoAgent.
G.1 Error example in preprocessing
The results in Table 4 of the main paper indicate that the preprocessing performance of GenoAgent is primarily constrained by the clinical feature extraction step, which shows a CSC of only 32.28%. This step is conducted through Domain-Guided Programming (Section 4.2), where the Domain Expert iteratively improves its output based on feedback from the execution environment. Although one round of self-review significantly enhances performance, increasing the maximum review rounds from 1 to 2 yields only marginal benefits. Detailed examination of the agent system’s operation log at this step across different experimental runs reveals that the Domain Expert’s answers to the same question can vary randomly. During multiple rounds of self-review, it often provides feedback that contradicts previous suggestions, making it difficult to achieve consistent task performance.
For example, consider the following function used to encode the Breast Cancer trait:
In one run, the code review provided the following feedback:
However, in another run with the identical setting, the code review provided different feedback:
G.2 Error example in statistical analysis
Analysis of failure cases during the statistical analysis task reveals various low-frequency random failures, including errors in extracting data matrices from dataframes and incorrect parameter passing to the regression model. Although no single bottleneck was identified, the cumulative error risk significantly impacts performance, resulting in a suboptimal F1 score of 67.08%. This task involves collaboration between the Statistician and Code Reviewer (Section 4.2). Similar to preprocessing, we observed unstable and inconsistent feedback from the Code Reviewer.
The following is an example of erroneous code generated by the Statistician agent:
Discussion
The randomness observed may stem from the LLM itself, suggesting a need to prevent one agent from misleading another. During the development of our baseline methods, we implemented several prompt engineering techniques to mitigate this issue: (i) limiting the Reviewer’s feedback to three main suggestions, so that it focuses on problem-solving rather than on numerous distracting comments about code quality, and (ii) encouraging the agent receiving the review to critically evaluate the feedback and possibly retain its original code. While these measures have alleviated some issues, inconsistent feedback persists to some extent in our GenoAgent baseline. A promising future direction involves designing collaborative modes that foster iterative discussions among agents to reconcile differing opinions and improve their task performance.
We hope this discussion highlights the challenges of our benchmark tasks and encourages future work to address these issues.