
mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning

Authors: Jingxuan Wei, Nan Xu, Guiyong Chang, Yin Luo, BiHui Yu, Ruifeng Guo

Abstract: In the fields of computer vision and natural language processing, multimodal chart question-answering, especially involving color, structure, and textless charts, poses significant challenges. Traditional methods, which typically involve either direct multimodal processing or a table-to-text conversion followed by language model analysis, have limitations in effectively handling these complex scenarios. This paper introduces a novel multimodal chart question-answering model, specifically designed to address these intricate tasks. Our model integrates visual and linguistic processing, overcoming the constraints of existing methods. We adopt a dual-phase training approach: the initial phase focuses on aligning image and text representations, while the subsequent phase concentrates on optimizing the model's interpretative and analytical abilities in chart-related queries. This approach has demonstrated superior performance on multiple public datasets, particularly in handling color, structure, and textless chart questions, indicating its effectiveness in complex multimodal tasks.

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2403.03721 [pdf, other]

CMDA: Cross-Modal and Domain Adversarial Adaptation for LiDAR-Based 3D Object Detection

Authors: Gyusam Chang, Wonseok Roh, Sujin Jang, Dongwook Lee, Daehyun Ji, Gyeongrok Oh, Jinsun Park, Jinkyu Kim, Sangpil Kim

Abstract: Recent LiDAR-based 3D Object Detection (3DOD) methods show promising results, but they often do not generalize well to target domains outside the source (or training) data distribution. To reduce such domain gaps and thus to make 3DOD models more generalizable, we introduce a novel unsupervised domain adaptation (UDA) method, called CMDA, which (i) leverages visual semantic cues from an image modality (i.e., camera images) as an effective semantic bridge to close the domain gap in the cross-modal Bird's Eye View (BEV) representations. Further, (ii) we also introduce a self-training-based learning strategy, wherein a model is adversarially trained to generate domain-invariant features, which disrupt the discrimination of whether a feature instance comes from a source or an unseen target domain. Overall, our CMDA framework guides the 3DOD model to generate highly informative and domain-adaptive features for novel data distributions. In our extensive experiments with large-scale benchmarks, such as nuScenes, Waymo, and KITTI, those mentioned above provide significant performance gains for UDA tasks, achieving state-of-the-art performance.

Submitted 6 March, 2024; v1 submitted 6 March, 2024; originally announced March 2024.

Comments: Accepted by AAAI 2024

arXiv:2309.15857 [pdf, other]

A Survey on Image-text Multimodal Models

Authors: Ruifeng Guo, Jingxuan Wei, Linzhuang Sun, Bihui Yu, Guiyong Chang, Dawei Liu, Sibo Zhang, Zhengbing Yao, Mingjun Xu, Liping Bu

Abstract: With the significant advancements of Large Language Models (LLMs) in the field of Natural Language Processing (NLP), the development of image-text multimodal models has garnered widespread attention. Current surveys on image-text multimodal models mainly focus on representative models or application domains, but lack a review on how general technical models influence the development of domain-specific models, which is crucial for domain researchers. Based on this, this paper first reviews the technological evolution of image-text multimodal models, from early explorations of feature space to visual language encoding structures, and then to the latest large model architectures. Next, from the perspective of technological evolution, we explain how the development of general image-text multimodal technologies promotes the progress of multimodal technologies in the biomedical field, as well as the importance and complexity of specific datasets in the biomedical domain. Then, centered on the tasks of image-text multimodal models, we analyze their common components and challenges. After that, we summarize the architecture, components, and data of general image-text multimodal models, and introduce the applications and improvements of image-text multimodal models in the biomedical field. Finally, we categorize the challenges faced in the development and application of general models into external factors and intrinsic factors, further refining them into 2 external factors and 5 intrinsic factors, and propose targeted solutions, providing guidance for future research directions. For more details and data, please visit our GitHub page: https://github.com/i2vec/A-survey-on-image-text-multimodal-models.

Submitted 18 June, 2024; v1 submitted 23 September, 2023; originally announced September 2023.

arXiv:2305.19535 [pdf, other]

Low-rank extended Kalman filtering for online learning of neural networks from streaming data

Authors: Peter G. Chang, Gerardo Durán-Martín, Alexander Y Shestopaloff, Matt Jones, Kevin Murphy

Abstract: We propose an efficient online approximate Bayesian inference algorithm for estimating the parameters of a nonlinear function from a potentially non-stationary data stream. The method is based on the extended Kalman filter (EKF), but uses a novel low-rank plus diagonal decomposition of the posterior precision matrix, which gives a cost per step which is linear in the number of model parameters. In contrast to methods based on stochastic variational inference, our method is fully deterministic, and does not require step-size tuning. We show experimentally that this results in much faster (more sample efficient) learning, which results in more rapid adaptation to changing distributions, and faster accumulation of reward when used as part of a contextual bandit algorithm.

Submitted 27 June, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

Journal ref: COLLAS conference 2023

arXiv:2207.00865 [pdf, other]

ORA3D: Overlap Region Aware Multi-view 3D Object Detection

Authors: Wonseok Roh, Gyusam Chang, Seokha Moon, Giljoo Nam, Chanyoung Kim, Younghyun Kim, Jinkyu Kim, Sangpil Kim

Abstract: Current multi-view 3D object detection methods often fail to detect objects in the overlap region properly, and the networks' understanding of the scene is often limited to that of a monocular detection network. Moreover, objects in the overlap region are often largely occluded or suffer from deformation due to camera distortion, causing a domain shift. To mitigate this issue, we propose using the following two main modules: (1) Stereo Disparity Estimation for Weak Depth Supervision and (2) Adversarial Overlap Region Discriminator. The former utilizes the traditional stereo disparity estimation method to obtain reliable disparity information from the overlap region. Given the disparity estimates as supervision, we propose regularizing the network to fully utilize the geometric potential of binocular images and improve the overall detection accuracy accordingly. Further, the latter module minimizes the representational gap between non-overlap and overlapping regions. We demonstrate the effectiveness of the proposed method with the nuScenes large-scale multi-view 3D object detection data. Our experiments show that our proposed method outperforms current state-of-the-art models, i.e., DETR3D and BEVDet.

Submitted 29 June, 2023; v1 submitted 2 July, 2022; originally announced July 2022.

Comments: BMVC2022

arXiv:2111.00364 [pdf, other]

Sustainable AI: Environmental Implications, Challenges and Opportunities

Authors: Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga Behram, James Huang, Charles Bai, Michael Gschwind, Anurag Gupta, Myle Ott, Anastasia Melnikov, Salvatore Candido, David Brooks, Geeta Chauhan, Benjamin Lee, Hsien-Hsin S. Lee, Bugra Akyildiz, Maximilian Balandat, Joe Spisak, Ravi Jain, Mike Rabbat, Kim Hazelwood

Abstract: This paper explores the environmental impact of the super-linear growth trends for AI from a holistic perspective, spanning Data, Algorithms, and System Hardware. We characterize the carbon footprint of AI computing by examining the model development cycle across industry-scale machine learning use cases and, at the same time, considering the life cycle of system hardware. Taking a step further, we capture the operational and manufacturing carbon footprint of AI computing and present an end-to-end analysis for what and how hardware-software design and at-scale optimization can help reduce the overall carbon footprint of AI. Based on the industry experience and lessons learned, we share the key challenges and chart out important development directions across the many dimensions of AI. We hope the key messages and insights presented in this paper can inspire the community to advance the field of AI in an environmentally-responsible manner.

Submitted 9 January, 2022; v1 submitted 30 October, 2021; originally announced November 2021.

arXiv:2001.07698 [pdf]

Intelligent Bandwidth Allocation for Latency Management in NG-EPON using Reinforcement Learning Methods

Authors: Qi Zhou, Jingjie Zhu, Junwen Zhang, Zhensheng Jia, Bernardo Huberman, Gee-Kung Chang

Abstract: A novel intelligent bandwidth allocation scheme in NG-EPON using reinforcement learning is proposed and demonstrated for latency management. We verify the capability of the proposed scheme under both fixed and dynamic traffic loads scenarios to achieve <1ms average latency. The RL agent demonstrates an efficient intelligent mechanism to manage the latency, which provides a promising IBA solution for the next-generation access network.

Submitted 21 January, 2020; originally announced January 2020.

arXiv:1911.10442 [pdf, other]

doi 10.1109/IGARSS.2019.8900186

Ground Truth Simulation for Deep Learning Classification of Mid-Resolution Venus Images Via Unmixing of High-Resolution Hyperspectral Fenix Data

Authors: Ido Faran, Nathan S. Netanyahu, Eli David, Maxim Shoshany, Fadi Kizel, Jisung Geba Chang, Ronit Rud

Abstract: Training a deep neural network for classification constitutes a major problem in remote sensing due to the lack of adequate field data. Acquiring high-resolution ground truth (GT) by human interpretation is both cost-ineffective and inconsistent. We propose, instead, to utilize high-resolution, hyperspectral images for solving this problem, by unmixing these images to obtain reliable GT for training a deep network. Specifically, we simulate GT from high-resolution, hyperspectral FENIX images, and use it for training a convolutional neural network (CNN) for pixel-based classification. We show how the model can be transferred successfully to classify new mid-resolution VENuS imagery.

Submitted 23 November, 2019; originally announced November 2019.

Journal ref: IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 807-810, Yokohama, Japan, July 2019

arXiv:1902.00046 [pdf]

Power Loading based on Portfolio Theory for Densified Millimeter-Wave Small-Cell Communications

Authors: Shuyi Shen, Bernardo A. Huberman, Lin Cheng, Gee-Kung Chang

Abstract: We experimentally demonstrate a novel scheme of power loading based on portfolio theory for millimeter-wave small-cell densification. By exploiting the statistical characteristics of interference, this approach improves the average throughput by 91% and reduces the variance.

Submitted 31 January, 2019; originally announced February 2019.

arXiv:1710.06541 [pdf]

Design Considerations of a Sub-50 μW Receiver Front-end for Implantable Devices in MedRadio Band

Authors: Gregory Chang, Shovan Maity, Baibhab Chatterjee, Shreyas Sen

Abstract: Emerging health-monitor applications, such as information transmission through multi-channel neural implants, image and video communication from inside the body etc., call for ultra-low active power (<50$μ$W), high data-rate, energy-scalable, highly energy-efficient (pJ/bit) radios. Previous literature has strongly focused on low average power duty-cycled radios or low-power but low-data-rate radios. In this paper, we investigate the power-performance trade-off of each front-end component in a conventional radio, including active matching, down-conversion and RF/IF amplification, and prioritize them based on highest performance/energy metric. The analysis reveals that 50$Ω$ active matching and RF gain are prohibitive for a 50$μ$W power budget. A mixer-first architecture with an N-path mixer and a self-biased inverter-based baseband LNA, designed in TSMC 65nm technology, shows that sub-50$μ$W performance can be achieved up to 10Mbps (< 5pJ/b) with OOK modulation.

Submitted 17 October, 2017; originally announced October 2017.

Comments: Accepted to appear on International Conference on VLSI Design 2018 (VLSID)

arXiv:1706.07363 [pdf]

Smart Wireless Communication is the Cornerstone of Smart Infrastructures

Authors: Mary Ann Weitnauer, Jennifer Rexford, Nicholas Laneman, Matthieu Bloch, Santiago Griljava, Catherine Ross, Gee-Kung Chang

Abstract: Emerging smart infrastructures, such as Smart City, Smart Grid, Smart Health, and Smart Transportation, need smart wireless connectivity. However, the requirements of these smart infrastructures cannot be met with today's wireless networks. A new wireless infrastructure is needed to meet unprecedented needs in terms of agility, reliability, security, scalability, and partnerships. We are at the beginning of a revolution in how we live with technology, resulting from a convergence of machine learning (ML), the Internet-of-Things (IoT), and robotics. A smart infrastructure monitors and processes a vast amount of data, collected from a dense and wide distribution of heterogeneous sensors (e.g., the IoT), as well as from web applications like social media. In real time, using machine learning, patterns and relationships in the data over space, time, and application can be detected and predictions can be made; on the basis of these, resources can be managed, decisions can be made, and devices can be actuated to optimize metrics, such as cost, health, safety, and convenience.

Submitted 22 June, 2017; originally announced June 2017.

Comments: A Computing Community Consortium (CCC) white paper, 5 pages

arXiv:1704.06176 [pdf, other]

doi 10.1038/s41598-018-34817-6

Segmentation of the Proximal Femur from MR Images using Deep Convolutional Neural Networks

Authors: Cem M. Deniz, Siyuan Xiang, Spencer Hallyburton, Arakua Welbeck, James S. Babb, Stephen Honig, Kyunghyun Cho, Gregory Chang

Abstract: Magnetic resonance imaging (MRI) has been proposed as a complementary method to measure bone quality and assess fracture risk. However, manual segmentation of MR images of bone is time-consuming, limiting the use of MRI measurements in clinical practice. The purpose of this paper is to present an automatic proximal femur segmentation method that is based on deep convolutional neural networks (CNNs). This study had institutional review board approval and written informed consent was obtained from all subjects. A dataset of volumetric structural MR images of the proximal femur from 86 subjects was manually segmented by an expert. We performed experiments by training two different CNN architectures with multiple number of initial feature maps and layers, and tested their segmentation performance against the gold standard of manual segmentations using four-fold cross-validation. Automatic segmentation of the proximal femur achieved a high dice similarity score of 0.94$\pm$0.05 with precision = 0.95$\pm$0.02, and recall = 0.94$\pm$0.08 using a CNN architecture based on 3D convolution exceeding the performance of 2D CNNs. The high segmentation accuracy provided by CNNs has the potential to help bring the use of structural MRI measurements of bone quality into clinical practice for management of osteoporosis.

Submitted 5 February, 2019; v1 submitted 20 April, 2017; originally announced April 2017.

Comments: This is a pre-print of an article published in Scientific Reports. The final authenticated version is available online at: https://doi.org/10.1038/s41598-018-34817-6

Journal ref: Scientific Reports, volume 8, Article number: 16485 (2018)

arXiv:cs/0409013 [pdf, ps, other]

Locally connected spanning trees on graphs

Authors: Ching-Chi Lin, Gerard J. Chang, Gen-Huey Chen

Abstract: A locally connected spanning tree of a graph $G$ is a spanning tree $T$ of $G$ such that the set of all neighbors of $v$ in $T$ induces a connected subgraph of $G$ for every $v\in V(G)$. The purpose of this paper is to give linear-time algorithms for finding locally connected spanning trees on strongly chordal graphs and proper circular-arc graphs, respectively.

Submitted 8 September, 2004; originally announced September 2004.

Comments: 14 pages, 3 figures

ACM Class: F.2.2; G.2.2

arXiv:cs/0408022 [pdf, ps, other]

Diagnosabilities of regular networks

Authors: Guey-Yun Chang, Gerard J. Chang, Gen-Huey Chen

Abstract: In this paper, we study diagnosabilities of multiprocessor systems under two diagnosis models: the PMC model and the comparison model. In each model, we further consider two different diagnosis strategies: the precise diagnosis strategy proposed by Preparata et al. and the pessimistic diagnosis strategy proposed by Friedman. The main result of this paper is to determine diagnosabilities of regular networks with certain conditions, which include several widely used multiprocessor systems such as variants of hypercubes and many others.

Submitted 9 August, 2004; originally announced August 2004.

Comments: 26 pages

Report number: NCTS/TPE-Math Technical Report 2004-013

Showing 1–14 of 14 results for author: Chang, G