doi 10.1038/s42256-023-00735-0

Predicting the Future of AI with AI: High-quality link prediction in an exponentially growing knowledge network

Authors: Mario Krenn, Lorenzo Buffoni, Bruno Coutinho, Sagi Eppel, Jacob Gates Foster, Andrew Gritsevskiy, Harlin Lee, Yichao Lu, Joao P. Moutinho, Nima Sanjabi, Rishi Sonthalia, Ngoc Mai Tran, Francisco Valente, Yangxinyu Xie, Rose Yu, Michael Kopp

Abstract: A tool that could suggest new personalized research directions and ideas by taking insights from the scientific literature could significantly accelerate the progress of science. A field that might benefit from such an approach is artificial intelligence (AI) research, where the number of scientific publications has been growing exponentially over the last years, making it challenging for human researchers to keep track of the progress. Here, we use AI techniques to predict the future research directions of AI itself. We develop a new graph-based benchmark based on real-world data -- the Science4Cast benchmark, which aims to predict the future state of an evolving semantic network of AI. For that, we use more than 100,000 research papers and build up a knowledge network with more than 64,000 concept nodes. We then present ten diverse methods to tackle this task, ranging from pure statistical to pure learning methods. Surprisingly, the most powerful methods use a carefully curated set of network features, rather than an end-to-end AI approach. It indicates a great potential that can be unleashed for purely ML approaches without human knowledge. Ultimately, better predictions of new future research directions will be a crucial component of more advanced research suggestion tools.

Submitted 23 September, 2022; originally announced October 2022.

Comments: 13 pages, 7 figures. Comments welcome!

Journal ref: Nature Machine Intelligence 5, 1326 (2023)

arXiv:2112.01716 [pdf, other]

Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research

Authors: Bernard Koch, Emily Denton, Alex Hanna, Jacob G. Foster

Abstract: Benchmark datasets play a central role in the organization of machine learning research. They coordinate researchers around shared research problems and serve as a measure of progress towards shared goals. Despite the foundational role of benchmarking practices in this field, relatively little attention has been paid to the dynamics of benchmark dataset use and reuse, within or across machine learning subcommunities. In this paper, we dig into these dynamics. We study how dataset usage patterns differ across machine learning subcommunities and across time from 2015-2020. We find increasing concentration on fewer and fewer datasets within task communities, significant adoption of datasets from other tasks, and concentration across the field on datasets that have been introduced by researchers situated within a small number of elite institutions. Our results have implications for scientific evaluation, AI ethics, and equity/access within the field.

Submitted 3 December, 2021; originally announced December 2021.

Comments: 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Sydney, Australia

arXiv:2110.04442 [pdf, other]

A Primer on Deep Learning for Causal Inference

Authors: Bernard Koch, Tim Sainburg, Pablo Geraldo, Song Jiang, Yizhou Sun, Jacob Gates Foster

Abstract: This review systematizes the emerging literature for causal inference using deep neural networks under the potential outcomes framework. It provides an intuitive introduction on how deep learning can be used to estimate/predict heterogeneous treatment effects and extend causal inference to settings where confounding is non-linear, time varying, or encoded in text, networks, and images. To maximize accessibility, we also introduce prerequisite concepts from causal inference and deep learning. The survey differs from other treatments of deep learning and causal inference in its sharp focus on observational causal estimation, its extended exposition of key algorithms, and its detailed tutorials for implementing, training, and selecting among deep estimators in Tensorflow 2 available at github.com/kochbj/Deep-Learning-for-Causal-Inference.

Submitted 28 November, 2023; v1 submitted 8 October, 2021; originally announced October 2021.

Comments: Forthcoming in Sociological Methods and Research

arXiv:2106.14365 [pdf, other]

doi 10.1073/pnas.2108801119

Integrating topic modeling and word embedding to characterize violent deaths

Authors: Alina Arseniev-Koehler, Susan D. Cochran, Vickie M. Mays, Kai-Wei Chang, Jacob Gates Foster

Abstract: There is an escalating need for methods to identify latent patterns in text data from many domains. We introduce a new method to identify topics in a corpus and represent documents as topic sequences. Discourse Atom Topic Modeling draws on advances in theoretical machine learning to integrate topic modeling and word embedding, capitalizing on the distinct capabilities of each. We first identify a set of vectors ("discourse atoms") that provide a sparse representation of an embedding space. Atom vectors can be interpreted as latent topics: Through a generative model, atoms map onto distributions over words; one can also infer the topic that generated a sequence of words. We illustrate our method with a prominent example of underutilized text: the U.S. National Violent Death Reporting System (NVDRS). The NVDRS summarizes violent death incidents with structured variables and unstructured narratives. We identify 225 latent topics in the narratives (e.g., preparation for death and physical aggression); many of these topics are not captured by existing structured variables. Motivated by known patterns in suicide and homicide by gender, and recent research on gender biases in semantic space, we identify the gender bias of our topics (e.g., a topic about pain medication is feminine). We then compare the gender bias of topics to their prevalence in narratives of female versus male victims. Results provide a detailed quantitative picture of reporting about lethal violence and its gendered nature. Our method offers a flexible and broadly applicable approach to model topics in text data.

Submitted 27 June, 2021; originally announced June 2021.

Journal ref: PNAS 2022

arXiv:2104.14703 [pdf, other]

Adapting Coreference Resolution for Processing Violent Death Narratives

Authors: Ankith Uppunda, Susan D. Cochran, Jacob G. Foster, Alina Arseniev-Koehler, Vickie M. Mays, Kai-Wei Chang

Abstract: Coreference resolution is an important component in analyzing narrative text from administrative data (e.g., clinical or police sources). However, existing coreference models trained on general language corpora suffer from poor transferability due to domain gaps, especially when they are applied to gender-inclusive data with lesbian, gay, bisexual, and transgender (LGBT) individuals. In this paper, we analyzed the challenges of coreference resolution in an exemplary form of administrative text written in English: violent death narratives from the USA's Centers for Disease Control's (CDC) National Violent Death Reporting System. We developed a set of data augmentation rules to improve model performance using a probabilistic data programming framework. Experiments on narratives from an administrative database, as well as existing gender-inclusive coreference datasets, demonstrate the effectiveness of data augmentation in training coreference models that can better handle text data about LGBT individuals.

Submitted 29 April, 2021; originally announced April 2021.

Comments: NAACL 2021

arXiv:2003.12133 [pdf, other]

Machine learning as a model for cultural learning: Teaching an algorithm what it means to be fat

Authors: Alina Arseniev-Koehler, Jacob G. Foster

Abstract: As we navigate our cultural environment, we learn cultural biases, like those around gender, social class, health, and body weight. It is unclear, however, exactly how public culture becomes private culture. In this paper, we provide a theoretical account of such cultural learning. We propose that neural word embeddings provide a parsimonious and cognitively plausible model of the representations learned from natural language. Using neural word embeddings, we extract cultural schemata about body weight from New York Times articles. We identify several cultural schemata that link obesity to gender, immorality, poor health, and low socioeconomic class. Such schemata may be subtly but pervasively activated in public culture; thus, language can chronically reproduce biases. Our findings reinforce ongoing concerns that machine learning can also encode, and reproduce, harmful human biases.

Submitted 13 June, 2020; v1 submitted 23 March, 2020; originally announced March 2020.

arXiv:1302.6906 [pdf]

Tradition and Innovation in Scientists' Research Strategies

Authors: Jacob G. Foster, Andrey Rzhetsky, James A. Evans

Abstract: What factors affect a scientist's choice of research problem? Qualitative research in the history, philosophy, and sociology of science suggests that this choice is shaped by an "essential tension" between the professional demand for productivity and a conflicting drive toward risky innovation. We examine this tension empirically in the context of biomedical chemistry. We use complex networks to represent the evolving state of scientific knowledge, as expressed in publications. We then define research strategies relative to these networks. Scientists can introduce novel chemicals or chemical relationships--or delve deeper into known ones. They can consolidate existing knowledge clusters, or bridge distant ones. Analyzing such choices in aggregate, we find that the distribution of strategies remains remarkably stable, even as chemical knowledge grows dramatically. High-risk strategies, which explore new chemical relationships, are less prevalent in the literature, reflecting a growing focus on established knowledge at the expense of new opportunities. Research following a risky strategy is more likely to be ignored but also more likely to achieve high impact and recognition. While the outcome of a risky strategy has a higher expected reward than the outcome of a conservative strategy, the additional reward is insufficient to compensate for the additional risk. By studying the winners of 137 different prizes in biomedicine and chemistry, we show that the occasional "gamble" for extraordinary impact is the most plausible explanation for observed levels of risk-taking. Our empirical demonstration and unpacking of the "essential tension" suggests policy interventions that may foster more innovative research.

Submitted 27 February, 2013; originally announced February 2013.

Comments: 18 pages, 4 figures

arXiv:1012.2384 [pdf, other]

doi 10.1103/PhysRevE.84.066117

Clustering Drives Assortativity and Community Structure in Ensembles of Networks

Authors: David V. Foster, Jacob G. Foster, Peter Grassberger, Maya Paczuski

Abstract: Clustering, assortativity, and communities are key features of complex networks. We probe dependencies between these attributes and find that ensembles with strong clustering display both high assortativity by degree and prominent community structure, while ensembles with high assortativity are much less biased towards clustering or community structure. Further, clustered networks can amplify small homophilic bias for trait assortativity. This marked asymmetry suggests that transitivity, rather than homophily, drives the standard nonsocial/social network dichotomy.

Submitted 5 January, 2011; v1 submitted 10 December, 2010; originally announced December 2010.

Comments: 4 pages, 4 figures

arXiv:0908.4288 [pdf, other]

doi 10.1073/pnas.0912671107

Edge direction and the structure of networks

Authors: Jacob G. Foster, David V. Foster, Peter Grassberger, Maya Paczuski

Abstract: Directed networks are ubiquitous and are necessary to represent complex systems with asymmetric interactions---from food webs to the World Wide Web. Despite the importance of edge direction for detecting local and community structure, it has been disregarded in studying a basic type of global diversity in networks: the tendency of nodes with similar numbers of edges to connect. This tendency, called assortativity, affects crucial structural and dynamic properties of real-world networks, such as error tolerance or epidemic spreading. Here we demonstrate that edge direction has profound effects on assortativity. We define a set of four directed assortativity measures and assign statistical significance by comparison to randomized networks. We apply these measures to three network classes---online/social networks, food webs, and word-adjacency networks. Our measures (i) reveal patterns common to each class, (ii) separate networks that have been previously classified together, and (iii) expose limitations of several existing theoretical models. We reject the standard classification of directed networks as purely assortative or disassortative. Many display a class-specific mixture, likely reflecting functional or historical constraints, contingencies, and forces guiding the system's evolution.

Submitted 7 November, 2010; v1 submitted 28 August, 2009; originally announced August 2009.

Comments: 13 pages, 6 figures, 3 tables

Journal ref: Proceedings of the National Academy of Sciences of the United States of America 2010, Vol. 107, No. 24

Showing 1–9 of 9 results for author: Foster, J G