-
mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning
Authors:
Jingxuan Wei,
Nan Xu,
Guiyong Chang,
Yin Luo,
BiHui Yu,
Ruifeng Guo
Abstract:
In the fields of computer vision and natural language processing, multimodal chart question-answering, especially involving color, structure, and textless charts, poses significant challenges. Traditional methods, which typically involve either direct multimodal processing or a table-to-text conversion followed by language model analysis, have limitations in effectively handling these complex scenarios. This paper introduces a novel multimodal chart question-answering model, specifically designed to address these intricate tasks. Our model integrates visual and linguistic processing, overcoming the constraints of existing methods. We adopt a dual-phase training approach: the initial phase focuses on aligning image and text representations, while the subsequent phase concentrates on optimizing the model's interpretative and analytical abilities in chart-related queries. This approach has demonstrated superior performance on multiple public datasets, particularly in handling color, structure, and textless chart questions, indicating its effectiveness in complex multimodal tasks.
Submitted 1 April, 2024;
originally announced April 2024.
-
CMDA: Cross-Modal and Domain Adversarial Adaptation for LiDAR-Based 3D Object Detection
Authors:
Gyusam Chang,
Wonseok Roh,
Sujin Jang,
Dongwook Lee,
Daehyun Ji,
Gyeongrok Oh,
Jinsun Park,
Jinkyu Kim,
Sangpil Kim
Abstract:
Recent LiDAR-based 3D Object Detection (3DOD) methods show promising results, but they often do not generalize well to target domains outside the source (or training) data distribution. To reduce such domain gaps and thus make 3DOD models more generalizable, we introduce a novel unsupervised domain adaptation (UDA) method, called CMDA, which (i) leverages visual semantic cues from an image modality (i.e., camera images) as an effective semantic bridge to close the domain gap in the cross-modal Bird's Eye View (BEV) representations. Further, (ii) we introduce a self-training-based learning strategy, wherein a model is adversarially trained to generate domain-invariant features, which disrupt the discrimination of whether a feature instance comes from a source or an unseen target domain. Overall, our CMDA framework guides the 3DOD model to generate highly informative and domain-adaptive features for novel data distributions. In our extensive experiments with large-scale benchmarks, such as nuScenes, Waymo, and KITTI, the two components above provide significant performance gains for UDA tasks, achieving state-of-the-art performance.
Submitted 6 March, 2024; v1 submitted 6 March, 2024;
originally announced March 2024.
-
A Survey on Image-text Multimodal Models
Authors:
Ruifeng Guo,
Jingxuan Wei,
Linzhuang Sun,
Bihui Yu,
Guiyong Chang,
Dawei Liu,
Sibo Zhang,
Zhengbing Yao,
Mingjun Xu,
Liping Bu
Abstract:
With the significant advancements of Large Language Models (LLMs) in the field of Natural Language Processing (NLP), the development of image-text multimodal models has garnered widespread attention. Current surveys on image-text multimodal models mainly focus on representative models or application domains, but lack a review on how general technical models influence the development of domain-specific models, which is crucial for domain researchers. Based on this, this paper first reviews the technological evolution of image-text multimodal models, from early explorations of feature space to visual language encoding structures, and then to the latest large model architectures. Next, from the perspective of technological evolution, we explain how the development of general image-text multimodal technologies promotes the progress of multimodal technologies in the biomedical field, as well as the importance and complexity of specific datasets in the biomedical domain. Then, centered on the tasks of image-text multimodal models, we analyze their common components and challenges. After that, we summarize the architecture, components, and data of general image-text multimodal models, and introduce the applications and improvements of image-text multimodal models in the biomedical field. Finally, we categorize the challenges faced in the development and application of general models into external factors and intrinsic factors, further refining them into 2 external factors and 5 intrinsic factors, and propose targeted solutions, providing guidance for future research directions. For more details and data, please visit our GitHub page: https://github.com/i2vec/A-survey-on-image-text-multimodal-models.
Submitted 18 June, 2024; v1 submitted 23 September, 2023;
originally announced September 2023.
-
Low-rank extended Kalman filtering for online learning of neural networks from streaming data
Authors:
Peter G. Chang,
Gerardo Durán-Martín,
Alexander Y Shestopaloff,
Matt Jones,
Kevin Murphy
Abstract:
We propose an efficient online approximate Bayesian inference algorithm for estimating the parameters of a nonlinear function from a potentially non-stationary data stream. The method is based on the extended Kalman filter (EKF), but uses a novel low-rank plus diagonal decomposition of the posterior precision matrix, which gives a cost per step which is linear in the number of model parameters. In contrast to methods based on stochastic variational inference, our method is fully deterministic, and does not require step-size tuning. We show experimentally that this results in much faster (more sample efficient) learning, which results in more rapid adaptation to changing distributions, and faster accumulation of reward when used as part of a contextual bandit algorithm.
Submitted 27 June, 2023; v1 submitted 30 May, 2023;
originally announced May 2023.
-
ORA3D: Overlap Region Aware Multi-view 3D Object Detection
Authors:
Wonseok Roh,
Gyusam Chang,
Seokha Moon,
Giljoo Nam,
Chanyoung Kim,
Younghyun Kim,
Jinkyu Kim,
Sangpil Kim
Abstract:
Current multi-view 3D object detection methods often fail to detect objects in the overlap region properly, and the networks' understanding of the scene is often limited to that of a monocular detection network. Moreover, objects in the overlap region are often largely occluded or suffer from deformation due to camera distortion, causing a domain shift. To mitigate this issue, we propose using the following two main modules: (1) Stereo Disparity Estimation for Weak Depth Supervision and (2) Adversarial Overlap Region Discriminator. The former utilizes the traditional stereo disparity estimation method to obtain reliable disparity information from the overlap region. Given the disparity estimates as supervision, we propose regularizing the network to fully utilize the geometric potential of binocular images and improve the overall detection accuracy accordingly. Further, the latter module minimizes the representational gap between non-overlap and overlapping regions. We demonstrate the effectiveness of the proposed method with the nuScenes large-scale multi-view 3D object detection data. Our experiments show that our proposed method outperforms current state-of-the-art models, i.e., DETR3D and BEVDet.
Submitted 29 June, 2023; v1 submitted 2 July, 2022;
originally announced July 2022.
-
Sustainable AI: Environmental Implications, Challenges and Opportunities
Authors:
Carole-Jean Wu,
Ramya Raghavendra,
Udit Gupta,
Bilge Acun,
Newsha Ardalani,
Kiwan Maeng,
Gloria Chang,
Fiona Aga Behram,
James Huang,
Charles Bai,
Michael Gschwind,
Anurag Gupta,
Myle Ott,
Anastasia Melnikov,
Salvatore Candido,
David Brooks,
Geeta Chauhan,
Benjamin Lee,
Hsien-Hsin S. Lee,
Bugra Akyildiz,
Maximilian Balandat,
Joe Spisak,
Ravi Jain,
Mike Rabbat,
Kim Hazelwood
Abstract:
This paper explores the environmental impact of the super-linear growth trends for AI from a holistic perspective, spanning Data, Algorithms, and System Hardware. We characterize the carbon footprint of AI computing by examining the model development cycle across industry-scale machine learning use cases and, at the same time, considering the life cycle of system hardware. Taking a step further, we capture the operational and manufacturing carbon footprint of AI computing and present an end-to-end analysis for what and how hardware-software design and at-scale optimization can help reduce the overall carbon footprint of AI. Based on the industry experience and lessons learned, we share the key challenges and chart out important development directions across the many dimensions of AI. We hope the key messages and insights presented in this paper can inspire the community to advance the field of AI in an environmentally-responsible manner.
Submitted 9 January, 2022; v1 submitted 30 October, 2021;
originally announced November 2021.
-
Intelligent Bandwidth Allocation for Latency Management in NG-EPON using Reinforcement Learning Methods
Authors:
Qi Zhou,
Jingjie Zhu,
Junwen Zhang,
Zhensheng Jia,
Bernardo Huberman,
Gee-Kung Chang
Abstract:
A novel intelligent bandwidth allocation (IBA) scheme in NG-EPON using reinforcement learning (RL) is proposed and demonstrated for latency management. We verify the capability of the proposed scheme to achieve <1 ms average latency under both fixed and dynamic traffic-load scenarios. The RL agent demonstrates an efficient intelligent mechanism to manage the latency, which provides a promising IBA solution for the next-generation access network.
Submitted 21 January, 2020;
originally announced January 2020.
-
Ground Truth Simulation for Deep Learning Classification of Mid-Resolution Venus Images Via Unmixing of High-Resolution Hyperspectral Fenix Data
Authors:
Ido Faran,
Nathan S. Netanyahu,
Eli David,
Maxim Shoshany,
Fadi Kizel,
Jisung Geba Chang,
Ronit Rud
Abstract:
Training a deep neural network for classification constitutes a major problem in remote sensing due to the lack of adequate field data. Acquiring high-resolution ground truth (GT) by human interpretation is both cost-ineffective and inconsistent. We propose, instead, to utilize high-resolution, hyperspectral images for solving this problem, by unmixing these images to obtain reliable GT for training a deep network. Specifically, we simulate GT from high-resolution, hyperspectral FENIX images, and use it for training a convolutional neural network (CNN) for pixel-based classification. We show how the model can be transferred successfully to classify new mid-resolution VENuS imagery.
Submitted 23 November, 2019;
originally announced November 2019.
-
Power Loading based on Portfolio Theory for Densified Millimeter-Wave Small-Cell Communications
Authors:
Shuyi Shen,
Bernardo A. Huberman,
Lin Cheng,
Gee-Kung Chang
Abstract:
We experimentally demonstrate a novel scheme of power loading based on portfolio theory for millimeter-wave small-cell densification. By exploiting the statistical characteristics of interference, this approach improves the average throughput by 91% and reduces the variance.
Submitted 31 January, 2019;
originally announced February 2019.
-
Design Considerations of a Sub-50 μW Receiver Front-end for Implantable Devices in MedRadio Band
Authors:
Gregory Chang,
Shovan Maity,
Baibhab Chatterjee,
Shreyas Sen
Abstract:
Emerging health-monitor applications, such as information transmission through multi-channel neural implants and image and video communication from inside the body, call for ultra-low active power (<50$μ$W), high data-rate, energy-scalable, highly energy-efficient (pJ/bit) radios. Previous literature has strongly focused on low-average-power duty-cycled radios or low-power but low-data-rate radios. In this paper, we investigate the power-performance trade-off of each front-end component in a conventional radio, including active matching, down-conversion, and RF/IF amplification, and prioritize them based on the highest performance/energy metric. The analysis reveals that 50$Ω$ active matching and RF gain are prohibitive for a 50$μ$W power budget. A mixer-first architecture with an N-path mixer and a self-biased inverter-based baseband LNA, designed in TSMC 65nm technology, shows that sub-50$μ$W performance can be achieved up to 10Mbps (<5pJ/b) with OOK modulation.
Submitted 17 October, 2017;
originally announced October 2017.
-
Smart Wireless Communication is the Cornerstone of Smart Infrastructures
Authors:
Mary Ann Weitnauer,
Jennifer Rexford,
Nicholas Laneman,
Matthieu Bloch,
Santiago Griljava,
Catherine Ross,
Gee-Kung Chang
Abstract:
Emerging smart infrastructures, such as Smart City, Smart Grid, Smart Health, and Smart Transportation, need smart wireless connectivity. However, the requirements of these smart infrastructures cannot be met with today's wireless networks. A new wireless infrastructure is needed to meet unprecedented needs in terms of agility, reliability, security, scalability, and partnerships.
We are at the beginning of a revolution in how we live with technology, resulting from a convergence of machine learning (ML), the Internet-of-Things (IoT), and robotics. A smart infrastructure monitors and processes a vast amount of data, collected from a dense and wide distribution of heterogeneous sensors (e.g., the IoT), as well as from web applications like social media. In real time, using machine learning, patterns and relationships in the data over space, time, and application can be detected and predictions can be made; on the basis of these, resources can be managed, decisions can be made, and devices can be actuated to optimize metrics, such as cost, health, safety, and convenience.
Submitted 22 June, 2017;
originally announced June 2017.
-
Segmentation of the Proximal Femur from MR Images using Deep Convolutional Neural Networks
Authors:
Cem M. Deniz,
Siyuan Xiang,
Spencer Hallyburton,
Arakua Welbeck,
James S. Babb,
Stephen Honig,
Kyunghyun Cho,
Gregory Chang
Abstract:
Magnetic resonance imaging (MRI) has been proposed as a complementary method to measure bone quality and assess fracture risk. However, manual segmentation of MR images of bone is time-consuming, limiting the use of MRI measurements in clinical practice. The purpose of this paper is to present an automatic proximal femur segmentation method that is based on deep convolutional neural networks (CNNs). This study had institutional review board approval and written informed consent was obtained from all subjects. A dataset of volumetric structural MR images of the proximal femur from 86 subjects was manually segmented by an expert. We performed experiments by training two different CNN architectures with varying numbers of initial feature maps and layers, and tested their segmentation performance against the gold standard of manual segmentations using four-fold cross-validation. Automatic segmentation of the proximal femur achieved a high Dice similarity score of 0.94$\pm$0.05, with precision = 0.95$\pm$0.02 and recall = 0.94$\pm$0.08, using a CNN architecture based on 3D convolution, exceeding the performance of 2D CNNs. The high segmentation accuracy provided by CNNs has the potential to help bring the use of structural MRI measurements of bone quality into clinical practice for management of osteoporosis.
Submitted 5 February, 2019; v1 submitted 20 April, 2017;
originally announced April 2017.
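For readers unfamiliar with the overlap metrics quoted in the abstract above, the following is a minimal sketch of how a Dice similarity score, precision, and recall are commonly computed from a predicted and a reference binary mask. It is not the authors' evaluation code; the function name `overlap_metrics` and the flat-list mask representation are assumptions made for illustration.

```python
def overlap_metrics(pred, truth):
    """Compute Dice, precision, and recall for two binary masks.

    `pred` and `truth` are equal-length flat sequences of 0/1 voxel labels
    (a hypothetical representation chosen for this sketch).
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)

    # Dice = 2|A ∩ B| / (|A| + |B|); guard against empty masks.
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return dice, precision, recall


if __name__ == "__main__":
    # Toy example: one overlapping voxel, one false positive, one false negative.
    print(overlap_metrics([1, 1, 0, 0], [1, 0, 1, 0]))
```

Returning 1.0 for the Dice score when both masks are empty is one common convention; other evaluations treat that case as undefined.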
-
Locally connected spanning trees on graphs
Authors:
Ching-Chi Lin,
Gerard J. Chang,
Gen-Huey Chen
Abstract:
A locally connected spanning tree of a graph $G$ is a spanning tree $T$ of $G$ such that the set of all neighbors of $v$ in $T$ induces a connected subgraph of $G$ for every $v\in V(G)$. The purpose of this paper is to give linear-time algorithms for finding locally connected spanning trees on strongly chordal graphs and proper circular-arc graphs, respectively.
Submitted 8 September, 2004;
originally announced September 2004.
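The definition in the abstract above can be made concrete with a small checker. This is a brute-force sketch that only verifies whether a given tree is a locally connected spanning tree of a graph; it is not the linear-time construction algorithm the paper describes. The function names and the adjacency-dictionary representation are assumptions for illustration.

```python
from collections import defaultdict


def induces_connected(vertices, adj):
    """Return True if `vertices` induce a connected subgraph under `adj`.

    The empty set is treated as connected (a convention for this sketch).
    """
    vertices = set(vertices)
    if not vertices:
        return True
    start = next(iter(vertices))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w in vertices and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices


def is_locally_connected_spanning_tree(graph_adj, tree_edges):
    """Check that `tree_edges` form a spanning tree T of G such that, for
    every vertex v, the neighbors of v in T induce a connected subgraph of G.

    `graph_adj` maps each vertex of G to the set of its neighbors.
    """
    vertices = set(graph_adj)
    tree_adj = defaultdict(set)
    for u, v in tree_edges:
        # Every tree edge must be an edge of G.
        if v not in graph_adj.get(u, ()):
            return False
        tree_adj[u].add(v)
        tree_adj[v].add(u)

    # A spanning tree has exactly |V| - 1 distinct edges and is connected.
    if len({frozenset(e) for e in tree_edges}) != len(vertices) - 1:
        return False
    if not induces_connected(vertices, tree_adj):
        return False

    # Local connectivity: T-neighbors of each vertex are connected in G.
    return all(induces_connected(tree_adj[v], graph_adj) for v in vertices)


if __name__ == "__main__":
    triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
    path = {0: {1}, 1: {0, 2}, 2: {1}}
    print(is_locally_connected_spanning_tree(triangle, [(0, 1), (0, 2)]))
    print(is_locally_connected_spanning_tree(path, [(0, 1), (1, 2)]))
```

In the triangle, the tree neighbors of vertex 0 are adjacent in the graph, so the tree qualifies; in the path, the two tree neighbors of the middle vertex are not adjacent, so it does not.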
-
Diagnosabilities of regular networks
Authors:
Guey-Yun Chang,
Gerard J. Chang,
Gen-Huey Chen
Abstract:
In this paper, we study diagnosabilities of multiprocessor systems under two diagnosis models: the PMC model and the comparison model. In each model, we further consider two different diagnosis strategies: the precise diagnosis strategy proposed by Preparata et al. and the pessimistic diagnosis strategy proposed by Friedman. The main result of this paper is to determine diagnosabilities of regular networks with certain conditions, which include several widely used multiprocessor systems such as variants of hypercubes and many others.
Submitted 9 August, 2004;
originally announced August 2004.