subscribe to arXiv mailings

Hallucination Mitigation Prompts Long-term Video Understanding

Authors: Yiwei Sun, Zhihang Liu, Chuanbin Liu, Bowei Pu, Zhihan Zhang, Hongtao Xie

Abstract: Recently, multimodal large language models have made significant advancements in video understanding tasks. However, their ability to understand unprocessed long videos is very limited, primarily due to the difficulty in supporting the enormous memory overhead. Although existing methods achieve a balance between memory and information by aggregating frames, they inevitably introduce the severe hal… ▽ More Recently, multimodal large language models have made significant advancements in video understanding tasks. However, their ability to understand unprocessed long videos is very limited, primarily due to the difficulty in supporting the enormous memory overhead. Although existing methods achieve a balance between memory and information by aggregating frames, they inevitably introduce the severe hallucination issue. To address this issue, this paper constructs a comprehensive hallucination mitigation pipeline based on existing MLLMs. Specifically, we use the CLIP Score to guide the frame sampling process with questions, selecting key frames relevant to the question. Then, We inject question information into the queries of the image Q-former to obtain more important visual features. Finally, during the answer generation stage, we utilize chain-of-thought and in-context learning techniques to explicitly control the generation of answers. It is worth mentioning that for the breakpoint mode, we found that image understanding models achieved better results than video understanding models. Therefore, we aggregated the answers from both types of models using a comparison mechanism. Ultimately, We achieved 84.2\% and 62.9\% for the global and breakpoint modes respectively on the MovieChat dataset, surpassing the official baseline model by 29.1\% and 24.1\%. Moreover the proposed method won the third place in the CVPR LOVEU 2024 Long-Term Video Question Answering Challenge. The code is avaiable at https://github.com/lntzm/CVPR24Track-LongVideo △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2403.09195 [pdf, other]

SAM-Lightening: A Lightweight Segment Anything Model with Dilated Flash Attention to Achieve 30 times Acceleration

Authors: Yanfei Song, Bangzheng Pu, Peng Wang, Hongxu Jiang, Dong Dong, Yongxiang Cao, Yiqing Shen

Abstract: Segment Anything Model (SAM) has garnered significant attention in segmentation tasks due to their zero-shot generalization ability. However, a broader application of SAMs to real-world practice has been restricted by their low inference speed and high computational memory demands, which mainly stem from the attention mechanism. Existing work concentrated on optimizing the encoder, yet has not ade… ▽ More Segment Anything Model (SAM) has garnered significant attention in segmentation tasks due to their zero-shot generalization ability. However, a broader application of SAMs to real-world practice has been restricted by their low inference speed and high computational memory demands, which mainly stem from the attention mechanism. Existing work concentrated on optimizing the encoder, yet has not adequately addressed the inefficiency of the attention mechanism itself, even when distilled to a smaller model, which thus leaves space for further improvement. In response, we introduce SAM-Lightening, a variant of SAM, that features a re-engineered attention mechanism, termed Dilated Flash Attention. It not only facilitates higher parallelism, enhancing processing efficiency but also retains compatibility with the existing FlashAttention. Correspondingly, we propose a progressive distillation to enable an efficient knowledge transfer from the vanilla SAM without costly training from scratch. Experiments on COCO and LVIS reveal that SAM-Lightening significantly outperforms the state-of-the-art methods in both run-time efficiency and segmentation accuracy. Specifically, it can achieve an inference speed of 7 milliseconds (ms) per image, for images of size 1024*1024 pixels, which is 30.1 times faster than the vanilla SAM and 2.1 times than the state-of-the-art. Moreover, it takes only 244MB memory, which is 3.5\% of the vanilla SAM. The code and weights are available at https://anonymous.4open.science/r/SAM-LIGHTENING-BC25/. △ Less

Submitted 17 March, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

arXiv:2308.07717 [pdf, other]

Real-time Automatic M-mode Echocardiography Measurement with Panel Attention from Local-to-Global Pixels

Authors: Ching-Hsun Tseng, Shao-Ju Chien, Po-Shen Wang, Shin-Jye Lee, Wei-Huan Hu, Bin Pu, Xiao-jun Zeng

Abstract: Motion mode (M-mode) recording is an essential part of echocardiography to measure cardiac dimension and function. However, the current diagnosis cannot build an automatic scheme, as there are three fundamental obstructs: Firstly, there is no open dataset available to build the automation for ensuring constant results and bridging M-mode echocardiography with real-time instance segmentation (RIS);… ▽ More Motion mode (M-mode) recording is an essential part of echocardiography to measure cardiac dimension and function. However, the current diagnosis cannot build an automatic scheme, as there are three fundamental obstructs: Firstly, there is no open dataset available to build the automation for ensuring constant results and bridging M-mode echocardiography with real-time instance segmentation (RIS); Secondly, the examination is involving the time-consuming manual labelling upon M-mode echocardiograms; Thirdly, as objects in echocardiograms occupy a significant portion of pixels, the limited receptive field in existing backbones (e.g., ResNet) composed from multiple convolution layers are inefficient to cover the period of a valve movement. Existing non-local attentions (NL) compromise being unable real-time with a high computation overhead or losing information from a simplified version of the non-local block. Therefore, we proposed RAMEM, a real-time automatic M-mode echocardiography measurement scheme, contributes three aspects to answer the problems: 1) provide MEIS, a dataset of M-mode echocardiograms for instance segmentation, to enable consistent results and support the development of an automatic scheme; 2) propose panel attention, local-to-global efficient attention by pixel-unshuffling, embedding with updated UPANets V2 in a RIS scheme toward big object detection with global receptive field; 3) develop and implement AMEM, an efficient algorithm of automatic M-mode echocardiography measurement enabling fast and accurate automatic labelling among diagnosis. The experimental results show that RAMEM surpasses existing RIS backbones (with non-local attention) in PASCAL 2012 SBD and human performances in real-time MEIS tested. The code of MEIS and dataset are available at https://github.com/hanktseng131415go/RAME. △ Less

Submitted 15 August, 2023; originally announced August 2023.

arXiv:2303.09858 [pdf, other]

Preventing Unauthorized AI Over-Analysis by Medical Image Adversarial Watermarking

Authors: Xingxing Wei, Bangzheng Pu, Shiji Zhao, Chen Chi, Huazhu Fu

Abstract: The advancement of deep learning has facilitated the integration of Artificial Intelligence (AI) into clinical practices, particularly in computer-aided diagnosis. Given the pivotal role of medical images in various diagnostic procedures, it becomes imperative to ensure the responsible and secure utilization of AI techniques. However, the unauthorized utilization of AI for image analysis raises si… ▽ More The advancement of deep learning has facilitated the integration of Artificial Intelligence (AI) into clinical practices, particularly in computer-aided diagnosis. Given the pivotal role of medical images in various diagnostic procedures, it becomes imperative to ensure the responsible and secure utilization of AI techniques. However, the unauthorized utilization of AI for image analysis raises significant concerns regarding patient privacy and potential infringement on the proprietary rights of data custodians. Consequently, the development of pragmatic and cost-effective strategies that safeguard patient privacy and uphold medical image copyrights emerges as a critical necessity. In direct response to this pressing demand, we present a pioneering solution named Medical Image Adversarial watermarking (MIAD-MARK). Our approach introduces watermarks that strategically mislead unauthorized AI diagnostic models, inducing erroneous predictions without compromising the integrity of the visual content. Importantly, our method integrates an authorization protocol tailored for legitimate users, enabling the removal of the MIAD-MARK through encryption-generated keys. Through extensive experiments, we validate the efficacy of MIAD-MARK across three prominent medical image datasets. The empirical outcomes demonstrate the substantial impact of our approach, notably reducing the accuracy of standard AI diagnostic models to a mere 8.57% under white box conditions and 45.83% in the more challenging black box scenario. Additionally, our solution effectively mitigates unauthorized exploitation of medical images even in the presence of sophisticated watermark removal networks. Notably, those AI diagnosis networks exhibit a meager average accuracy of 38.59% when applied to images protected by MIAD-MARK, underscoring the robustness of our safeguarding mechanism. △ Less

Submitted 13 September, 2023; v1 submitted 17 March, 2023; originally announced March 2023.

arXiv:2211.01671 [pdf, other]

Visually Adversarial Attacks and Defenses in the Physical World: A Survey

Authors: Xingxing Wei, Bangzheng Pu, Jiefan Lu, Baoyuan Wu

Abstract: Although Deep Neural Networks (DNNs) have been widely applied in various real-world scenarios, they are vulnerable to adversarial examples. The current adversarial attacks in computer vision can be divided into digital attacks and physical attacks according to their different attack forms. Compared with digital attacks, which generate perturbations in the digital pixels, physical attacks are more… ▽ More Although Deep Neural Networks (DNNs) have been widely applied in various real-world scenarios, they are vulnerable to adversarial examples. The current adversarial attacks in computer vision can be divided into digital attacks and physical attacks according to their different attack forms. Compared with digital attacks, which generate perturbations in the digital pixels, physical attacks are more practical in the real world. Owing to the serious security problem caused by physically adversarial examples, many works have been proposed to evaluate the physically adversarial robustness of DNNs in the past years. In this paper, we summarize a survey versus the current physically adversarial attacks and physically adversarial defenses in computer vision. To establish a taxonomy, we organize the current physical attacks from attack tasks, attack forms, and attack methods, respectively. Thus, readers can have a systematic knowledge of this topic from different aspects. For the physical defenses, we establish the taxonomy from pre-processing, in-processing, and post-processing for the DNN models to achieve full coverage of the adversarial defenses. Based on the above survey, we finally discuss the challenges of this research field and further outlook on the future direction. △ Less

Submitted 13 July, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

arXiv:2206.12340 [pdf]

doi 10.1007/s11356-023-27119-6

How to hide your voice: Noise-cancelling bird photography blind

Authors: Caner Baydur, Baojing Pu, Xiaoqing Xu

Abstract: Getting close to birds is a great challenge in wildlife photography. Bird photography blinds may be the most effective and least intrusive way if properly designed. However, the acoustic design of the blinds has been overlooked so far. Herein, we present noise-cancelling blinds which allow photographing birds at close range. Firstly, we conduct a questionnaire in the eco-tourism centre located in… ▽ More Getting close to birds is a great challenge in wildlife photography. Bird photography blinds may be the most effective and least intrusive way if properly designed. However, the acoustic design of the blinds has been overlooked so far. Herein, we present noise-cancelling blinds which allow photographing birds at close range. Firstly, we conduct a questionnaire in the eco-tourism centre located in Yunnan, China. Thus, we determine the birders' expectations of the indoor sound environment. We then identify diverse variables to examine the impact of architectural and acoustic decisions on noise propagation. Finally, we examine the acoustic performance of the blinds by considering the birds' hearing threshold. The numerical simulations are performed in the acoustics module of Comsol MultiPhysics. Our study demonstrated that photography blinds require a strong and thorough acoustic design for both human and bird well-being. △ Less

Submitted 27 November, 2022; v1 submitted 24 June, 2022; originally announced June 2022.

Comments: 26 pages, 11 figures. Revised argument in sections 2 and 4, results unchanged, references added

Journal ref: Environmental Science and Pollution Research, 1-14; 2023

arXiv:2106.10693 [pdf, other]

Fast PDN Impedance Prediction Using Deep Learning

Authors: Ling Zhang, Jack Juang, Zurab Kiguradze, Bo Pu, Shuai Jin, Songping Wu, Zhiping Yang, Chulsoon Hwang

Abstract: Modeling and simulating a power distribution network (PDN) for printed circuit boards (PCBs) with irregular board shapes and multi-layer stackup is computationally inefficient using full-wave simulations. This paper presents a new concept of using deep learning for PDN impedance prediction. A boundary element method (BEM) is applied to efficiently calculate the impedance for arbitrary board shape… ▽ More Modeling and simulating a power distribution network (PDN) for printed circuit boards (PCBs) with irregular board shapes and multi-layer stackup is computationally inefficient using full-wave simulations. This paper presents a new concept of using deep learning for PDN impedance prediction. A boundary element method (BEM) is applied to efficiently calculate the impedance for arbitrary board shape and stackup. Then over one million boards with different shapes, stackup, IC location, and decap placement are randomly generated to train a deep neural network (DNN). The trained DNN can predict the impedance accurately for new board configurations that have not been used for training. The consumed time using the trained DNN is only 0.1 seconds, which is over 100 times faster than the BEM method and 5000 times faster than full-wave simulations. △ Less

Submitted 20 June, 2021; originally announced June 2021.

arXiv:1905.00565 [pdf, other]

Parallelizing Convergent Cross Mapping Using Apache Spark

Authors: Bo Pu, Lujie Duan, Nathaniel Osgood

Abstract: Identifying the causal relationships between subjects or variables remains an important problem across various scientific fields. This is particularly important but challenging in complex systems, such as those involving human behavior, sociotechnical contexts, and natural ecosystems. By exploiting state space reconstruction via lagged embedding of time series, convergent cross mapping (CCM) serve… ▽ More Identifying the causal relationships between subjects or variables remains an important problem across various scientific fields. This is particularly important but challenging in complex systems, such as those involving human behavior, sociotechnical contexts, and natural ecosystems. By exploiting state space reconstruction via lagged embedding of time series, convergent cross mapping (CCM) serves as an important method for addressing this problem. While powerful, CCM is computationally costly; moreover, CCM results are highly sensitive to several parameter values. While best practice entails exploring a range of parameter settings when assessing casual relationships, the resulting computational burden can raise barriers to practical use, especially for long time series exhibiting weak causal linkages. We demonstrate here several means of accelerating CCM by harnessing the distributed Apache Spark platform. We characterize and report on results of several experiments with parallelized solutions that demonstrate high scalability and a capacity for over an order of magnitude performance improvement for the baseline configuration. Such economies in computation time can speed learning and robust identification of causal drivers in complex systems. △ Less

Submitted 1 May, 2019; originally announced May 2019.

Comments: 11 pages, 5 figures, SBP conference paper

Showing 1–8 of 8 results for author: Pu, B