Density Ratio Estimation via Sampling along Generalized Geodesics on Statistical Manifolds

Authors: Masanari Kimura, Howard Bondell

Abstract: The density ratio of two probability distributions is one of the fundamental tools in mathematical and computational statistics and machine learning, and it has a variety of known applications. Therefore, density ratio estimation from finite samples is a very important task, but it is known to be unstable when the distributions are distant from each other. One approach to address this problem is density ratio estimation using incremental mixtures of the two distributions. We geometrically reinterpret existing methods for density ratio estimation based on incremental mixtures. We show that these methods can be regarded as iterating on the Riemannian manifold along a particular curve between the two probability distributions. Making use of the geometry of the manifold, we propose to consider incremental density ratio estimation along generalized geodesics on this manifold. To achieve such a method requires Monte Carlo sampling along geodesics via transformations of the two distributions. We show how to implement an iterative algorithm to sample along these geodesics and show how changing the distances along the geodesic affect the variance and accuracy of the estimation of the density ratio. Our experiments demonstrate that the proposed approach outperforms the existing approaches using incremental mixtures that do not take the geometry of the…

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.16773 [pdf]

Valuation methods for professional sports clubs: A historical review, a model development, and the application to Japanese football clubs

Authors: Masaaki Kimura, Zen Walsh, Takuo Inoue, Toshiya Takahashi, Hideki Koizumi

Abstract: In the trend towards the globalization of football and the increasing commercialization of professional football clubs, a methodology for calculating the firm value of clubs in non-western countries has yet to be established. This study reviews the valuation methods for the club firm values in Europe and North America and how values are calculated at the time of changing ownership of Japanese clubs and develops regression models with higher explanatory power than before to estimate the more accurate firm value of Japanese football clubs. A review of the existing literature on methods for calculating the firm value of professional sports clubs in Europe and North America, as well as financial statements and registers relating to changes of ownership of Japanese clubs, was conducted. After that, multiple regression analyses were conducted using the firm value of European clubs as the explained variable. From the literature review and the Japanese case studies, it has become clear that European clubs' standard valuation methods are based on revenue and other factors, while in Japan, valuation is based solely on the par value of stocks or net assets. Multiple regression analysis revealed that the firm value of European clubs over the past three years is best explained by revenue or player market value and the number of SNS followers. Two models with high explanatory power were developed. The estimated firm value using the revenue-based formula was higher than the one based on player market value.
However, in the J.League, the former was more than three times higher than the latter, while the former was only 1.2 times higher for European clubs. The discrepancy relates to differences in European and J.League clubs' revenues and asset structures. In either formula, the firm value of J.League clubs exceeded the actual transaction price when the change of ownership occurred in the past.

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2405.14522 [pdf, other]

Explaining Black-box Model Predictions via Two-level Nested Feature Attributions with Consistency Property

Authors: Yuya Yoshikawa, Masanari Kimura, Ryotaro Shimizu, Yuki Saito

Abstract: Techniques that explain the predictions of black-box machine learning models are crucial to make the models transparent, thereby increasing trust in AI systems. The input features to the models often have a nested structure that consists of high- and low-level features, and each high-level feature is decomposed into multiple low-level features. For such inputs, both high-level feature attributions (HiFAs) and low-level feature attributions (LoFAs) are important for better understanding the model's decision. In this paper, we propose a model-agnostic local explanation method that effectively exploits the nested structure of the input to estimate the two-level feature attributions simultaneously. A key idea of the proposed method is to introduce the consistency property that should exist between the HiFAs and LoFAs, thereby bridging the separate optimization problems for estimating them. Thanks to this consistency property, the proposed method can produce HiFAs and LoFAs that are both faithful to the black-box models and consistent with each other, using a smaller number of queries to the models. In experiments on image classification in multiple instance learning and text classification using language models, we demonstrate that the HiFAs and LoFAs estimated by the proposed method are accurate, faithful to the behaviors of the black-box models, and provide consistent explanations.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.00442 [pdf, other]

Geometric Insights into Focal Loss: Reducing Curvature for Enhanced Model Calibration

Authors: Masanari Kimura, Hiroki Naganuma

Abstract: The key factor in implementing machine learning algorithms in decision-making situations is not only the accuracy of the model but also its confidence level. The confidence level of a model in a classification problem is often given by the output vector of a softmax function for convenience. However, these values are known to deviate significantly from the actual expected model confidence. This problem is called model calibration and has been studied extensively. One of the simplest techniques to tackle this task is focal loss, a generalization of cross-entropy by introducing one positive parameter. Although many related studies exist because of the simplicity of the idea and its formalization, the theoretical analysis of its behavior is still insufficient. In this study, our objective is to understand the behavior of focal loss by reinterpreting this function geometrically. Our analysis suggests that focal loss reduces the curvature of the loss surface in training the model. This indicates that curvature may be one of the essential factors in achieving model calibration. We design numerical experiments to support this conjecture to reveal the behavior of focal loss and the relationship between calibration performance and curvature.

Submitted 1 May, 2024; originally announced May 2024.

Comments: This paper is under consideration at Pattern Recognition Letters
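
The abstract above describes focal loss as a generalization of cross-entropy with one positive parameter. A minimal sketch of that relationship, assuming the commonly used form FL(p) = -(1 - p)^gamma * log(p) for the probability p assigned to the true class (the function names and the default gamma = 2.0 are illustrative choices, not taken from the paper):

```python
import math


def cross_entropy(p_true: float) -> float:
    """Cross-entropy for the probability assigned to the true class."""
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must lie in (0, 1]")
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss in its common form: cross-entropy scaled by (1 - p_true) ** gamma.

    gamma is the single extra parameter; gamma = 0 recovers cross-entropy.
    """
    if gamma < 0.0:
        raise ValueError("gamma must be non-negative")
    return (1.0 - p_true) ** gamma * cross_entropy(p_true)
```

With gamma = 0 the scaling factor is 1 and the two losses coincide; for gamma > 0 the loss of confidently correct predictions (p close to 1) is down-weighted relative to cross-entropy.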

arXiv:2403.17410 [pdf, other]

On permutation-invariant neural networks

Authors: Masanari Kimura, Ryotaro Shimizu, Yuki Hirakawa, Ryosuke Goto, Yuki Saito

Abstract: Conventional machine learning algorithms have traditionally been designed under the assumption that input data follows a vector-based format, with an emphasis on vector-centric paradigms. However, as the demand for tasks involving set-based inputs has grown, there has been a paradigm shift in the research community towards addressing these challenges. In recent years, the emergence of neural network architectures such as Deep Sets and Transformers has presented a significant advancement in the treatment of set-based data. These architectures are specifically engineered to naturally accommodate sets as input, enabling more effective representation and processing of set structures. Consequently, there has been a surge of research endeavors dedicated to exploring and harnessing the capabilities of these architectures for various tasks involving the approximation of set functions. This comprehensive survey aims to provide an overview of the diverse problem settings and ongoing research efforts pertaining to neural networks that approximate set functions. By delving into the intricacies of these approaches and elucidating the associated challenges, the survey aims to equip readers with a comprehensive understanding of the field. Through this comprehensive perspective, we hope that researchers can gain valuable insights into the potential applications, inherent limitations, and future directions of set-based neural networks.
Indeed, from this survey we gain two insights: i) Deep Sets and its variants can be generalized by differences in the aggregation function, and ii) the behavior of Deep Sets is sensitive to the choice of the aggregation function. From these observations, we show that Deep Sets, one of the well-known permutation-invariant neural networks, can be generalized in the sense of a quasi-arithmetic mean.

Submitted 28 March, 2024; v1 submitted 26 March, 2024; originally announced March 2024.
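
The abstract above mentions aggregation functions and the quasi-arithmetic mean. As a rough illustration only, the sketch below implements the standard quasi-arithmetic mean M_f(x_1, ..., x_n) = f^{-1}((1/n) * sum f(x_i)); it is not the architecture from the survey, and the helper names are hypothetical. Different choices of f give different permutation-invariant aggregations.

```python
import math
from typing import Callable, Sequence


def quasi_arithmetic_mean(
    xs: Sequence[float],
    f: Callable[[float], float],
    f_inv: Callable[[float], float],
) -> float:
    """Return f_inv(mean(f(x) for x in xs)); f must be invertible on the data."""
    if not xs:
        raise ValueError("xs must be non-empty")
    return f_inv(sum(f(x) for x in xs) / len(xs))


def arithmetic_mean(xs: Sequence[float]) -> float:
    # f is the identity, so this is the ordinary mean.
    return quasi_arithmetic_mean(xs, lambda x: x, lambda y: y)


def geometric_mean(xs: Sequence[float]) -> float:
    # f = log, f_inv = exp; requires strictly positive inputs.
    return quasi_arithmetic_mean(xs, math.log, math.exp)
```

Because the inner sum does not depend on the order of the elements (up to floating-point rounding), any aggregation of this form is permutation-invariant.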

arXiv:2403.10175 [pdf, other]

A Short Survey on Importance Weighting for Machine Learning

Authors: Masanari Kimura, Hideitsu Hino

Abstract: Importance weighting is a fundamental procedure in statistics and machine learning that weights the objective function or probability distribution based on the importance of the instance in some sense. The simplicity and usefulness of the idea has led to many applications of importance weighting. For example, it is known that supervised learning under an assumption about the difference between the training and test distributions, called distribution shift, can guarantee statistically desirable properties through importance weighting by their density ratio. This survey summarizes the broad applications of importance weighting in machine learning and related research.

Submitted 14 May, 2024; v1 submitted 15 March, 2024; originally announced March 2024.
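
The abstract above refers to importance weighting by the density ratio under distribution shift. A minimal sketch of the idea on a finite sample space, assuming both distributions are known exactly (in practice the ratio has to be estimated); the function and argument names are hypothetical:

```python
from typing import Callable, Hashable, Mapping, Sequence


def importance_weighted_risk(
    samples: Sequence[Hashable],
    loss: Callable[[Hashable], float],
    p_train: Mapping[Hashable, float],
    p_test: Mapping[Hashable, float],
) -> float:
    """Average the loss over training samples, weighting each by p_test(x) / p_train(x)."""
    if not samples:
        raise ValueError("samples must be non-empty")
    total = 0.0
    for x in samples:
        if p_train[x] <= 0.0:
            raise ValueError("p_train must be positive on every sample")
        total += (p_test[x] / p_train[x]) * loss(x)
    return total / len(samples)
```

For example, with training samples drawn evenly from two outcomes while the test distribution favours one of them, the weighted average matches the expected loss under the test distribution rather than under the training distribution.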

arXiv:2302.12991 [pdf, other]

Generalization Bounds for Set-to-Set Matching with Negative Sampling

Authors: Masanari Kimura

Abstract: The problem of matching two sets of multiple elements, namely set-to-set matching, has received a great deal of attention in recent years. In particular, it has been reported that good experimental results can be obtained by preparing a neural network as a matching function, especially in complex cases where, for example, each element of the set is an image. However, theoretical analysis of set-to-set matching with such black-box functions is lacking. This paper aims to perform a generalization error analysis in set-to-set matching to reveal the behavior of the model in that task.

Submitted 25 February, 2023; originally announced February 2023.

Comments: This paper is accepted at the International Conference on Neural Information Processing (ICONIP2022)

arXiv:2206.10936 [pdf, other]

Information Geometry of Dropout Training

Authors: Masanari Kimura, Hideitsu Hino

Abstract: Dropout is one of the most popular regularization techniques in neural network training. Because of its power and simplicity of idea, dropout has been analyzed extensively and many variants have been proposed. In this paper, several properties of dropout are discussed in a unified manner from the viewpoint of information geometry. We showed that dropout flattens the model manifold and that their regularization performance depends on the amount of the curvature. Then, we showed that dropout essentially corresponds to a regularization that depends on the Fisher information, and support this result from numerical experiments. Such a theoretical analysis of the technique from a different perspective is expected to greatly assist in the understanding of neural networks, which are still in their infancy.

Submitted 22 June, 2022; originally announced June 2022.

arXiv:2103.17060 [pdf, other]

doi 10.3390/e23050528

$α$-Geodesical Skew Divergence

Authors: Masanari Kimura, Hideitsu Hino

Abstract: The asymmetric skew divergence smooths one of the distributions by mixing it, to a degree determined by the parameter $λ$, with the other distribution. Such divergence is an approximation of the KL divergence that does not require the target distribution to be absolutely continuous with respect to the source distribution. In this paper, an information geometric generalization of the skew divergence called the $α$-geodesical skew divergence is proposed, and its properties are studied.

Submitted 25 April, 2021; v1 submitted 31 March, 2021; originally announced March 2021.

Journal ref: Entropy. 2021; 23(5):528
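
The abstract above describes the skew divergence as a KL divergence in which one distribution is smoothed by mixing it with the other. A minimal sketch for discrete distributions, assuming the form KL(p || (1 - lam) * p + lam * q); the direction of mixing and the function names are assumptions made for illustration, and the $α$-geodesical generalization proposed in the paper is not implemented here:

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; infinite if q is zero where p is positive."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


def skew_divergence(p: Sequence[float], q: Sequence[float], lam: float) -> float:
    """KL divergence from p to the mixture (1 - lam) * p + lam * q."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixture = [(1.0 - lam) * pi + lam * qi for pi, qi in zip(p, q)]
    return kl_divergence(p, mixture)
```

For lam strictly below 1 the mixture is positive wherever p is, so the value stays finite even when KL(p || q) itself is infinite.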

arXiv:2101.10229 [pdf, other]

Universal Approximation Properties for an ODENet and a ResNet: Mathematical Analysis and Numerical Experiments

Authors: Yuto Aizawa, Masato Kimura, Kazunori Matsui

Abstract: We prove a universal approximation property (UAP) for a class of ODENet and a class of ResNet, which are simplified mathematical models for deep learning systems with skip connections. The UAP can be stated as follows. Let $n$ and $m$ be the dimension of input and output data, and assume $m\leq n$. Then we show that ODENet of width $n+m$ with any non-polynomial continuous activation function can approximate any continuous function on a compact subset on $\mathbb{R}^n$. We also show that ResNet has the same property as the depth tends to infinity. Furthermore, we derive the gradient of a loss function explicitly with respect to a certain tuning variable. We use this to construct a learning algorithm for ODENet. To demonstrate the usefulness of this algorithm, we apply it to a regression problem, a binary classification, and a multinomial classification in MNIST.

Submitted 17 May, 2023; v1 submitted 22 December, 2020; originally announced January 2021.

arXiv:2007.03899 [pdf, other]

Density Fixing: Simple yet Effective Regularization Method based on the Class Prior

Authors: Masanari Kimura, Ryohei Izawa

Abstract: Machine learning models suffer from overfitting, which is caused by a lack of labeled data. To tackle this problem, we proposed a framework of regularization methods, called density-fixing, that can be used commonly for supervised and semi-supervised learning. Our proposed regularization method improves the generalization performance by forcing the model to approximate the class's prior distribution or the frequency of occurrence. This regularization term is naturally derived from the formula of maximum likelihood estimation and is theoretically justified. We further provide the several theoretical analyses of the proposed method including asymptotic behavior. Our experimental results on multiple benchmark datasets are sufficient to support our argument, and we suggest that this simple and effective regularization method is useful in real-world machine learning problems.

Submitted 6 September, 2020; v1 submitted 8 July, 2020; originally announced July 2020.

arXiv:2006.06231 [pdf, other]

Why Mixup Improves the Model Performance

Authors: Masanari Kimura

Abstract: Machine learning techniques are used in a wide range of domains. However, machine learning models often suffer from the problem of over-fitting. Many data augmentation methods have been proposed to tackle such a problem, and one of them is called mixup. Mixup is a recently proposed regularization procedure, which linearly interpolates a random pair of training examples. This regularization method works very well experimentally, but its theoretical guarantee is not adequately discussed. In this study, we aim to discover why mixup works well from the aspect of the statistical learning theory.

Submitted 17 June, 2021; v1 submitted 11 June, 2020; originally announced June 2020.
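
The abstract above describes mixup as linearly interpolating a random pair of training examples. A minimal sketch of that interpolation, assuming one-hot label vectors and a mixing coefficient drawn from a Beta(alpha, alpha) distribution as in the original mixup proposal; the function names and the default alpha = 1.0 are illustrative assumptions:

```python
import random
from typing import List, Optional, Sequence, Tuple


def mixup_pair(
    x1: Sequence[float], y1: Sequence[float],
    x2: Sequence[float], y2: Sequence[float],
    lam: float,
) -> Tuple[List[float], List[float]]:
    """Interpolate inputs and labels with the same coefficient lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


def sample_mixup(
    x1: Sequence[float], y1: Sequence[float],
    x2: Sequence[float], y2: Sequence[float],
    alpha: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Tuple[List[float], List[float]]:
    """Draw lam from Beta(alpha, alpha) and return the mixed example."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    return mixup_pair(x1, y1, x2, y2, lam)
```

The mixed label is a convex combination of the two one-hot labels, so its entries still sum to one.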

arXiv:1912.02945 [pdf]

doi 10.1145/3388176.3388187

A pedestrian path-planning model in accordance with obstacle's danger with reinforcement learning

Authors: Thanh-Trung Trinh, Dinh-Minh Vu, Masaomi Kimura

Abstract: Most microscopic pedestrian navigation models use the concept of "forces" applied to the pedestrian agents to replicate the navigation environment. While the approach could provide believable results in regular situations, it does not always resemble natural pedestrian navigation behaviour in many typical settings. In our research, we proposed a novel approach using reinforcement learning for simulation of pedestrian agent path planning and collision avoidance problem. The primary focus of this approach is using human perception of the environment and danger awareness of interferences. The implementation of our model has shown that the path planned by the agent shares many similarities with a human pedestrian in several aspects such as following common walking conventions and human behaviours.

Submitted 5 December, 2019; originally announced December 2019.

arXiv:1909.07156 [pdf, other]

New Perspective of Interpretability of Deep Neural Networks

Authors: Masanari Kimura, Masayuki Tanaka

Abstract: Deep neural networks (DNNs) are known as black-box models. In other words, it is difficult to interpret the internal state of the model. Improving the interpretability of DNNs is one of the hot research topics. However, at present, the definition of interpretability for DNNs is vague, and the question of what is a highly explanatory model is still controversial. To address this issue, we provide the definition of the human predictability of the model, as a part of the interpretability of the DNNs. The human predictability proposed in this paper is defined by easiness to predict the change of the inference when perturbating the model of the DNNs. In addition, we introduce one example of high human-predictable DNNs. We discuss that our definition will help to the research of the interpretability of the DNNs considering various types of applications.

Submitted 12 September, 2019; originally announced September 2019.

arXiv:1906.10822 [pdf, other]

Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD

Authors: Kosuke Haruki, Taiji Suzuki, Yohei Hamakawa, Takeshi Toda, Ryuji Sakai, Masahiro Ozawa, Mitsuhiro Kimura

Abstract: Large-batch stochastic gradient descent (SGD) is widely used for training in distributed deep learning because of its training-time efficiency, however, extremely large-batch SGD leads to poor generalization and easily converges to sharp minima, which prevents naive large-scale data-parallel SGD (DP-SGD) from converging to good minima. To overcome this difficulty, we propose gradient noise convolution (GNC), which effectively smooths sharper minima of the loss function. For DP-SGD, GNC utilizes so-called gradient noise, which is induced by stochastic gradient variation and convolved to the loss function as a smoothing effect. GNC computation can be performed by simply computing the stochastic gradient on each parallel worker and merging them, and is therefore extremely easy to implement. Due to convolving with the gradient noise, which tends to spread along a sharper direction of the loss function, GNC can effectively smooth sharp minima and achieve better generalization, whereas isotropic random noise cannot. We empirically show this effect by comparing GNC with isotropic random noise, and show that it achieves state-of-the-art generalization performance for large-scale deep neural network optimization.

Submitted 25 June, 2019; originally announced June 2019.

Comments: 19 pages, 11 figures, 7 tables

arXiv:1802.06368 [pdf, other]

Node Centralities and Classification Performance for Characterizing Node Embedding Algorithms

Authors: Kento Nozawa, Masanari Kimura, Atsunori Kanemura

Abstract: Embedding graph nodes into a vector space can allow the use of machine learning to e.g. predict node classes, but the study of node embedding algorithms is immature compared to the natural language processing field because of a diverse nature of graphs. We examine the performance of node embedding algorithms with respect to graph centrality measures that characterize diverse graphs, through systematic experiments with four node embedding algorithms, four or five graph centralities, and six datasets. Experimental results give insights into the properties of node embedding algorithms, which can be a basis for further research on this topic.

Submitted 18 February, 2018; originally announced February 2018.

Comments: Under review at ICLR 2018 workshop track

arXiv:1112.0611 [pdf, ps, other]

Information-Maximization Clustering based on Squared-Loss Mutual Information

Authors: Masashi Sugiyama, Makoto Yamada, Manabu Kimura, Hirotaka Hachiya

Abstract: Information-maximization clustering learns a probabilistic classifier in an unsupervised manner so that mutual information between feature vectors and cluster assignments is maximized. A notable advantage of this approach is that it only involves continuous optimization of model parameters, which is substantially easier to solve than discrete optimization of cluster assignments. However, existing methods still involve non-convex optimization problems, and therefore finding a good local optimal solution is not straightforward in practice. In this paper, we propose an alternative information-maximization clustering method based on a squared-loss variant of mutual information. This novel approach gives a clustering solution analytically in a computationally efficient way via kernel eigenvalue decomposition. Furthermore, we provide a practical model selection procedure that allows us to objectively optimize tuning parameters included in the kernel function. Through experiments, we demonstrate the usefulness of the proposed approach.

Submitted 2 December, 2011; originally announced December 2011.

Showing 1–17 of 17 results for author: Kimura, M