-
Predicting the Future of AI with AI: High-quality link prediction in an exponentially growing knowledge network
Authors:
Mario Krenn,
Lorenzo Buffoni,
Bruno Coutinho,
Sagi Eppel,
Jacob Gates Foster,
Andrew Gritsevskiy,
Harlin Lee,
Yichao Lu,
Joao P. Moutinho,
Nima Sanjabi,
Rishi Sonthalia,
Ngoc Mai Tran,
Francisco Valente,
Yangxinyu Xie,
Rose Yu,
Michael Kopp
Abstract:
A tool that could suggest new personalized research directions and ideas by taking insights from the scientific literature could significantly accelerate the progress of science. A field that might benefit from such an approach is artificial intelligence (AI) research, where the number of scientific publications has been growing exponentially over the last years, making it challenging for human researchers to keep track of the progress. Here, we use AI techniques to predict the future research directions of AI itself. We develop a new graph-based benchmark based on real-world data -- the Science4Cast benchmark, which aims to predict the future state of an evolving semantic network of AI. For that, we use more than 100,000 research papers and build up a knowledge network with more than 64,000 concept nodes. We then present ten diverse methods to tackle this task, ranging from pure statistical to pure learning methods. Surprisingly, the most powerful methods use a carefully curated set of network features, rather than an end-to-end AI approach. It indicates a great potential that can be unleashed for purely ML approaches without human knowledge. Ultimately, better predictions of new future research directions will be a crucial component of more advanced research suggestion tools.
Submitted 23 September, 2022;
originally announced October 2022.
-
Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research
Authors:
Bernard Koch,
Emily Denton,
Alex Hanna,
Jacob G. Foster
Abstract:
Benchmark datasets play a central role in the organization of machine learning research. They coordinate researchers around shared research problems and serve as a measure of progress towards shared goals. Despite the foundational role of benchmarking practices in this field, relatively little attention has been paid to the dynamics of benchmark dataset use and reuse, within or across machine learning subcommunities. In this paper, we dig into these dynamics. We study how dataset usage patterns differ across machine learning subcommunities and across time from 2015-2020. We find increasing concentration on fewer and fewer datasets within task communities, significant adoption of datasets from other tasks, and concentration across the field on datasets that have been introduced by researchers situated within a small number of elite institutions. Our results have implications for scientific evaluation, AI ethics, and equity/access within the field.
Submitted 3 December, 2021;
originally announced December 2021.
-
A Primer on Deep Learning for Causal Inference
Authors:
Bernard Koch,
Tim Sainburg,
Pablo Geraldo,
Song Jiang,
Yizhou Sun,
Jacob Gates Foster
Abstract:
This review systematizes the emerging literature for causal inference using deep neural networks under the potential outcomes framework. It provides an intuitive introduction to how deep learning can be used to estimate/predict heterogeneous treatment effects and extend causal inference to settings where confounding is non-linear, time-varying, or encoded in text, networks, and images. To maximize accessibility, we also introduce prerequisite concepts from causal inference and deep learning. The survey differs from other treatments of deep learning and causal inference in its sharp focus on observational causal estimation, its extended exposition of key algorithms, and its detailed tutorials for implementing, training, and selecting among deep estimators in TensorFlow 2, available at github.com/kochbj/Deep-Learning-for-Causal-Inference.
Submitted 28 November, 2023; v1 submitted 8 October, 2021;
originally announced October 2021.
-
Integrating topic modeling and word embedding to characterize violent deaths
Authors:
Alina Arseniev-Koehler,
Susan D. Cochran,
Vickie M. Mays,
Kai-Wei Chang,
Jacob Gates Foster
Abstract:
There is an escalating need for methods to identify latent patterns in text data from many domains. We introduce a new method to identify topics in a corpus and represent documents as topic sequences. Discourse Atom Topic Modeling draws on advances in theoretical machine learning to integrate topic modeling and word embedding, capitalizing on the distinct capabilities of each. We first identify a set of vectors ("discourse atoms") that provide a sparse representation of an embedding space. Atom vectors can be interpreted as latent topics: Through a generative model, atoms map onto distributions over words; one can also infer the topic that generated a sequence of words. We illustrate our method with a prominent example of underutilized text: the U.S. National Violent Death Reporting System (NVDRS). The NVDRS summarizes violent death incidents with structured variables and unstructured narratives. We identify 225 latent topics in the narratives (e.g., preparation for death and physical aggression); many of these topics are not captured by existing structured variables. Motivated by known patterns in suicide and homicide by gender, and recent research on gender biases in semantic space, we identify the gender bias of our topics (e.g., a topic about pain medication is feminine). We then compare the gender bias of topics to their prevalence in narratives of female versus male victims. Results provide a detailed quantitative picture of reporting about lethal violence and its gendered nature. Our method offers a flexible and broadly applicable approach to model topics in text data.
Submitted 27 June, 2021;
originally announced June 2021.
-
Adapting Coreference Resolution for Processing Violent Death Narratives
Authors:
Ankith Uppunda,
Susan D. Cochran,
Jacob G. Foster,
Alina Arseniev-Koehler,
Vickie M. Mays,
Kai-Wei Chang
Abstract:
Coreference resolution is an important component in analyzing narrative text from administrative data (e.g., clinical or police sources). However, existing coreference models trained on general language corpora suffer from poor transferability due to domain gaps, especially when they are applied to gender-inclusive data with lesbian, gay, bisexual, and transgender (LGBT) individuals. In this paper, we analyzed the challenges of coreference resolution in an exemplary form of administrative text written in English: violent death narratives from the USA's Centers for Disease Control's (CDC) National Violent Death Reporting System. We developed a set of data augmentation rules to improve model performance using a probabilistic data programming framework. Experiments on narratives from an administrative database, as well as existing gender-inclusive coreference datasets, demonstrate the effectiveness of data augmentation in training coreference models that can better handle text data about LGBT individuals.
Submitted 29 April, 2021;
originally announced April 2021.
-
Machine learning as a model for cultural learning: Teaching an algorithm what it means to be fat
Authors:
Alina Arseniev-Koehler,
Jacob G. Foster
Abstract:
As we navigate our cultural environment, we learn cultural biases, like those around gender, social class, health, and body weight. It is unclear, however, exactly how public culture becomes private culture. In this paper, we provide a theoretical account of such cultural learning. We propose that neural word embeddings provide a parsimonious and cognitively plausible model of the representations learned from natural language. Using neural word embeddings, we extract cultural schemata about body weight from New York Times articles. We identify several cultural schemata that link obesity to gender, immorality, poor health, and low socioeconomic class. Such schemata may be subtly but pervasively activated in public culture; thus, language can chronically reproduce biases. Our findings reinforce ongoing concerns that machine learning can also encode, and reproduce, harmful human biases.
Submitted 13 June, 2020; v1 submitted 23 March, 2020;
originally announced March 2020.
-
Tradition and Innovation in Scientists' Research Strategies
Authors:
Jacob G. Foster,
Andrey Rzhetsky,
James A. Evans
Abstract:
What factors affect a scientist's choice of research problem? Qualitative research in the history, philosophy, and sociology of science suggests that this choice is shaped by an "essential tension" between the professional demand for productivity and a conflicting drive toward risky innovation. We examine this tension empirically in the context of biomedical chemistry. We use complex networks to represent the evolving state of scientific knowledge, as expressed in publications. We then define research strategies relative to these networks. Scientists can introduce novel chemicals or chemical relationships--or delve deeper into known ones. They can consolidate existing knowledge clusters, or bridge distant ones. Analyzing such choices in aggregate, we find that the distribution of strategies remains remarkably stable, even as chemical knowledge grows dramatically. High-risk strategies, which explore new chemical relationships, are less prevalent in the literature, reflecting a growing focus on established knowledge at the expense of new opportunities. Research following a risky strategy is more likely to be ignored but also more likely to achieve high impact and recognition. While the outcome of a risky strategy has a higher expected reward than the outcome of a conservative strategy, the additional reward is insufficient to compensate for the additional risk. By studying the winners of 137 different prizes in biomedicine and chemistry, we show that the occasional "gamble" for extraordinary impact is the most plausible explanation for observed levels of risk-taking. Our empirical demonstration and unpacking of the "essential tension" suggests policy interventions that may foster more innovative research.
Submitted 27 February, 2013;
originally announced February 2013.
-
Clustering Drives Assortativity and Community Structure in Ensembles of Networks
Authors:
David V. Foster,
Jacob G. Foster,
Peter Grassberger,
Maya Paczuski
Abstract:
Clustering, assortativity, and communities are key features of complex networks. We probe dependencies between these attributes and find that ensembles with strong clustering display both high assortativity by degree and prominent community structure, while ensembles with high assortativity are much less biased towards clustering or community structure. Further, clustered networks can amplify small homophilic bias for trait assortativity. This marked asymmetry suggests that transitivity, rather than homophily, drives the standard nonsocial/social network dichotomy.
Submitted 5 January, 2011; v1 submitted 10 December, 2010;
originally announced December 2010.
-
Edge direction and the structure of networks
Authors:
Jacob G. Foster,
David V. Foster,
Peter Grassberger,
Maya Paczuski
Abstract:
Directed networks are ubiquitous and are necessary to represent complex systems with asymmetric interactions---from food webs to the World Wide Web. Despite the importance of edge direction for detecting local and community structure, it has been disregarded in studying a basic type of global diversity in networks: the tendency of nodes with similar numbers of edges to connect. This tendency, called assortativity, affects crucial structural and dynamic properties of real-world networks, such as error tolerance or epidemic spreading. Here we demonstrate that edge direction has profound effects on assortativity. We define a set of four directed assortativity measures and assign statistical significance by comparison to randomized networks. We apply these measures to three network classes---online/social networks, food webs, and word-adjacency networks. Our measures (i) reveal patterns common to each class, (ii) separate networks that have been previously classified together, and (iii) expose limitations of several existing theoretical models. We reject the standard classification of directed networks as purely assortative or disassortative. Many display a class-specific mixture, likely reflecting functional or historical constraints, contingencies, and forces guiding the system's evolution.
Submitted 7 November, 2010; v1 submitted 28 August, 2009;
originally announced August 2009.