-
Information dissemination and confusion in signed networks
Authors:
Ligang Jin,
Eckhard Steffen
Abstract:
We introduce a model of information dissemination in signed networks. It is a discrete-time process in which uninformed actors incrementally receive information from their informed neighbors or from the outside. Our goal is to minimize the number of confused actors - that is, the number of actors who receive contradictory information. We prove upper bounds for the number of confused actors in sign…
▽ More
We introduce a model of information dissemination in signed networks. It is a discrete-time process in which uninformed actors incrementally receive information from their informed neighbors or from the outside. Our goal is to minimize the number of confused actors - that is, the number of actors who receive contradictory information. We prove upper bounds for the number of confused actors in signed networks and in equivalence classes of signed networks. In particular, we show that there are signed networks where, for any information placement strategy, almost 60\% of the actors are confused. Furthermore, this is also the case when considering the minimum number of confused actors within an equivalence class of signed graphs.
△ Less
Submitted 13 July, 2024;
originally announced July 2024.
-
Towards Chapter-to-Chapter Context-Aware Literary Translation via Large Language Models
Authors:
Linghao Jin,
Li An,
Xuezhe Ma
Abstract:
Discourse phenomena in existing document-level translation datasets are sparse, which has been a fundamental obstacle in the development of context-aware machine translation models. Moreover, most existing document-level corpora and context-aware machine translation methods rely on an unrealistic assumption on sentence-level alignments. To mitigate these issues, we first curate a novel dataset of…
▽ More
Discourse phenomena in existing document-level translation datasets are sparse, which has been a fundamental obstacle in the development of context-aware machine translation models. Moreover, most existing document-level corpora and context-aware machine translation methods rely on an unrealistic assumption on sentence-level alignments. To mitigate these issues, we first curate a novel dataset of Chinese-English literature, which consists of 160 books with intricate discourse structures. Then, we propose a more pragmatic and challenging setting for context-aware translation, termed chapter-to-chapter (Ch2Ch) translation, and investigate the performance of commonly-used machine translation models under this setting. Furthermore, we introduce a potential approach of finetuning large language models (LLMs) within the domain of Ch2Ch literary translation, yielding impressive improvements over baselines. Through our comprehensive analysis, we unveil that literary translation under the Ch2Ch setting is challenging in nature, with respect to both model learning methods and translation decoding algorithms.
△ Less
Submitted 12 July, 2024;
originally announced July 2024.
-
Infer Induced Sentiment of Comment Response to Video: A New Task, Dataset and Baseline
Authors:
Qi Jia,
Baoyu Fan,
Cong Xu,
Lu Liu,
Liang Jin,
Guoguang Du,
Zhenhua Guo,
Yaqian Zhao,
Xuanjing Huang,
Rengang Li
Abstract:
Existing video multi-modal sentiment analysis mainly focuses on the sentiment expression of people within the video, yet often neglects the induced sentiment of viewers while watching the videos. Induced sentiment of viewers is essential for inferring the public response to videos, has broad application in analyzing public societal sentiment, effectiveness of advertising and other areas. The micro…
▽ More
Existing video multi-modal sentiment analysis mainly focuses on the sentiment expression of people within the video, yet often neglects the induced sentiment of viewers while watching the videos. Induced sentiment of viewers is essential for inferring the public response to videos, has broad application in analyzing public societal sentiment, effectiveness of advertising and other areas. The micro videos and the related comments provide a rich application scenario for viewers induced sentiment analysis. In light of this, we introduces a novel research task, Multi-modal Sentiment Analysis for Comment Response of Video Induced(MSA-CRVI), aims to inferring opinions and emotions according to the comments response to micro video. Meanwhile, we manually annotate a dataset named Comment Sentiment toward to Micro Video (CSMV) to support this research. It is the largest video multi-modal sentiment dataset in terms of scale and video duration to our knowledge, containing 107,267 comments and 8,210 micro videos with a video duration of 68.83 hours. To infer the induced sentiment of comment should leverage the video content, so we propose the Video Content-aware Comment Sentiment Analysis (VC-CSA) method as baseline to address the challenges inherent in this new task. Extensive experiments demonstrate that our method is showing significant improvements over other established baselines.
△ Less
Submitted 15 May, 2024;
originally announced July 2024.
-
Pathfinder: Exploring Path Diversity for Assessing Internet Censorship Inconsistency
Authors:
Xiaoqin Liang,
Guannan Liu,
Lin Jin,
Shuai Hao,
Haining Wang
Abstract:
Internet censorship is typically enforced by authorities to achieve information control for a certain group of Internet users. So far existing censorship studies have primarily focused on country-level characterization because (1) in many cases, censorship is enabled by governments with nationwide policies and (2) it is usually hard to control how the probing packets are routed to trigger censorsh…
▽ More
Internet censorship is typically enforced by authorities to achieve information control for a certain group of Internet users. So far existing censorship studies have primarily focused on country-level characterization because (1) in many cases, censorship is enabled by governments with nationwide policies and (2) it is usually hard to control how the probing packets are routed to trigger censorship in different networks inside a country. However, the deployment and implementation of censorship could be highly diverse at the ISP level. In this paper, we investigate Internet censorship from a different perspective by scrutinizing the diverse censorship deployment inside a country. Specifically, by leveraging an end-to-end measurement framework, we deploy multiple geo-distributed back-end control servers to explore various paths from one single vantage point. The generated traffic with the same domain but different control servers' IPs could be forced to traverse different transit networks, thereby being examined by different censorship devices if present. Through our large-scale experiments and in-depth investigation, we reveal that the diversity of Internet censorship caused by different routing paths inside a country is prevalent, implying that (1) the implementations of centralized censorship are commonly incomplete or flawed and (2) decentralized censorship is also common. Moreover, we identify that different hosting platforms also result in inconsistent censorship activities due to different peering relationships with the ISPs in a country. Finally, we present extensive case studies in detail to illustrate the configurations that lead to censorship inconsistency and explore the causes.
△ Less
Submitted 4 July, 2024;
originally announced July 2024.
-
TongGu: Mastering Classical Chinese Understanding with Knowledge-Grounded Large Language Models
Authors:
Jiahuan Cao,
Dezhi Peng,
Peirong Zhang,
Yongxin Shi,
Yang Liu,
Kai Ding,
Lianwen Jin
Abstract:
Classical Chinese is a gateway to the rich heritage and wisdom of ancient China, yet its complexities pose formidable comprehension barriers for most modern people without specialized knowledge. While Large Language Models (LLMs) have shown remarkable capabilities in Natural Language Processing (NLP), they struggle with Classical Chinese Understanding (CCU), especially in data-demanding and knowle…
▽ More
Classical Chinese is a gateway to the rich heritage and wisdom of ancient China, yet its complexities pose formidable comprehension barriers for most modern people without specialized knowledge. While Large Language Models (LLMs) have shown remarkable capabilities in Natural Language Processing (NLP), they struggle with Classical Chinese Understanding (CCU), especially in data-demanding and knowledge-intensive tasks. In response to this dilemma, we propose \textbf{TongGu} (mean understanding ancient and modern), the first CCU-specific LLM, underpinned by three core contributions. First, we construct a two-stage instruction-tuning dataset ACCN-INS derived from rich classical Chinese corpora, aiming to unlock the full CCU potential of LLMs. Second, we propose Redundancy-Aware Tuning (RAT) to prevent catastrophic forgetting, enabling TongGu to acquire new capabilities while preserving its foundational knowledge. Third, we present a CCU Retrieval-Augmented Generation (CCU-RAG) technique to reduce hallucinations based on knowledge-grounding. Extensive experiments across 24 diverse CCU tasks validate TongGu's superior ability, underscoring the effectiveness of RAT and CCU-RAG. The model and dataset will be public available.
△ Less
Submitted 4 July, 2024;
originally announced July 2024.
-
CLASH: Complementary Learning with Neural Architecture Search for Gait Recognition
Authors:
Huanzhang Dou,
Pengyi Zhang,
Yuhan Zhao,
Lu Jin,
Xi Li
Abstract:
Gait recognition, which aims at identifying individuals by their walking patterns, has achieved great success based on silhouette. The binary silhouette sequence encodes the walking pattern within the sparse boundary representation. Therefore, most pixels in the silhouette are under-sensitive to the walking pattern since the sparse boundary lacks dense spatial-temporal information, which is suitab…
▽ More
Gait recognition, which aims at identifying individuals by their walking patterns, has achieved great success based on silhouette. The binary silhouette sequence encodes the walking pattern within the sparse boundary representation. Therefore, most pixels in the silhouette are under-sensitive to the walking pattern since the sparse boundary lacks dense spatial-temporal information, which is suitable to be represented with dense texture. To enhance the sensitivity to the walking pattern while maintaining the robustness of recognition, we present a Complementary Learning with neural Architecture Search (CLASH) framework, consisting of walking pattern sensitive gait descriptor named dense spatial-temporal field (DSTF) and neural architecture search based complementary learning (NCL). Specifically, DSTF transforms the representation from the sparse binary boundary into the dense distance-based texture, which is sensitive to the walking pattern at the pixel level. Further, NCL presents a task-specific search space for complementary learning, which mutually complements the sensitivity of DSTF and the robustness of the silhouette to represent the walking pattern effectively. Extensive experiments demonstrate the effectiveness of the proposed methods under both in-the-lab and in-the-wild scenarios. On CASIA-B, we achieve rank-1 accuracy of 98.8%, 96.5%, and 89.3% under three conditions. On OU-MVLP, we achieve rank-1 accuracy of 91.9%. Under the latest in-the-wild datasets, we outperform the latest silhouette-based methods by 16.3% and 19.7% on Gait3D and GREW, respectively.
△ Less
Submitted 4 July, 2024;
originally announced July 2024.
-
Light-weight Fine-tuning Method for Defending Adversarial Noise in Pre-trained Medical Vision-Language Models
Authors:
Xu Han,
Linghao Jin,
Xuezhe Ma,
Xiaofeng Liu
Abstract:
Fine-tuning pre-trained Vision-Language Models (VLMs) has shown remarkable capabilities in medical image and textual depiction synergy. Nevertheless, many pre-training datasets are restricted by patient privacy concerns, potentially containing noise that can adversely affect downstream performance. Moreover, the growing reliance on multi-modal generation exacerbates this issue because of its susce…
▽ More
Fine-tuning pre-trained Vision-Language Models (VLMs) has shown remarkable capabilities in medical image and textual depiction synergy. Nevertheless, many pre-training datasets are restricted by patient privacy concerns, potentially containing noise that can adversely affect downstream performance. Moreover, the growing reliance on multi-modal generation exacerbates this issue because of its susceptibility to adversarial attacks. To investigate how VLMs trained on adversarial noisy data perform on downstream medical tasks, we first craft noisy upstream datasets using multi-modal adversarial attacks. Through our comprehensive analysis, we unveil that moderate noise enhances model robustness and transferability, but increasing noise levels negatively impact downstream task performance. To mitigate this issue, we propose rectify adversarial noise (RAN) framework, a recipe designed to effectively defend adversarial attacks and rectify the influence of upstream noise during fine-tuning.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
Authors:
Jiaxin Zhang,
Wentao Yang,
Songxuan Lai,
Zecheng Xie,
Lianwen Jin
Abstract:
Current multimodal large language models (MLLMs) face significant challenges in visual document understanding (VDU) tasks due to the high resolution, dense text, and complex layouts typical of document images. These characteristics demand a high level of detail perception ability from MLLMs. While increasing input resolution improves detail perception, it also leads to longer sequences of visual t…
▽ More
Current multimodal large language models (MLLMs) face significant challenges in visual document understanding (VDU) tasks due to the high resolution, dense text, and complex layouts typical of document images. These characteristics demand a high level of detail perception ability from MLLMs. While increasing input resolution improves detail perception, it also leads to longer sequences of visual tokens, increasing computational costs and straining the models' ability to handle long contexts. To address these challenges, we introduce DocKylin, a document-centric MLLM that performs visual content slimming at both the pixel and token levels, thereby reducing token sequence length in VDU scenarios. DocKylin utilizes an Adaptive Pixel Slimming (APS) preprocessing module to perform pixel-level slimming, increasing the proportion of informative pixels. Moreover, DocKylin incorporates a novel Dynamic Token Slimming (DTS) module to conduct token-level slimming, filtering essential tokens and removing others to create a compressed, adaptive visual sequence. Experiments demonstrate DocKylin's promising performance across various VDU benchmarks. Notably, both the proposed APS and DTS are parameter-free, facilitating easy integration into existing MLLMs, and our experiments indicate their potential for broader applications.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
TrustUQA: A Trustful Framework for Unified Structured Data Question Answering
Authors:
Wen Zhang,
Long Jin,
Yushan Zhu,
Jiaoyan Chen,
Zhiwei Huang,
Junjie Wang,
Yin Hua,
Lei Liang,
Huajun Chen
Abstract:
Natural language question answering (QA) over structured data sources such as tables and knowledge graphs (KGs) have been widely investigated, for example with Large Language Models (LLMs). The main solutions include question to formal query parsing and retrieval-based answer generation. However, current methods of the former often suffer from weak generalization, failing to dealing with multiple…
▽ More
Natural language question answering (QA) over structured data sources such as tables and knowledge graphs (KGs) have been widely investigated, for example with Large Language Models (LLMs). The main solutions include question to formal query parsing and retrieval-based answer generation. However, current methods of the former often suffer from weak generalization, failing to dealing with multiple sources simultaneously, while the later is limited in trustfulness. In this paper, we propose UnifiedTQA, a trustful QA framework that can simultaneously support multiple types of structured data in a unified way. To this end, it adopts an LLM-friendly and unified knowledge representation method called Condition Graph (CG), and uses an LLM and demonstration-based two-level method for CG querying. For enhancement, it is also equipped with dynamic demonstration retrieval. We have evaluated UnifiedTQA with 5 benchmarks covering 3 types of structured data. It outperforms 2 existing unified structured data QA methods and in comparison with the baselines that are specific to a data type, it achieves state-of-the-art on 2 of them. Further more, we demonstrates potential of our method for more general QA tasks, QA over mixed structured data and QA across structured data.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
PharmaGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry
Authors:
Linqing Chen,
Weilei Wang,
Zilong Bai,
Peng Xu,
Yan Fang,
Jie Fang,
Wentao Wu,
Lizhi Zhou,
Ruiji Zhang,
Yubin Xia,
Chaobo Xu,
Ran Hu,
Licong Xu,
Qijun Cai,
Haoran Hua,
Jing Sun,
Jin Liu,
Tian Qiu,
Haowen Liu,
Meng Hu,
Xiuwen Li,
Fei Gao,
Yufu Wang,
Lin Tie,
Chaochao Wang
, et al. (11 additional authors not shown)
Abstract:
Large language models (LLMs) have revolutionized Natural Language Processing (NLP) by minimizing the need for complex feature engineering. However, the application of LLMs in specialized domains like biopharmaceuticals and chemistry remains largely unexplored. These fields are characterized by intricate terminologies, specialized knowledge, and a high demand for precision areas where general purpo…
▽ More
Large language models (LLMs) have revolutionized Natural Language Processing (NLP) by minimizing the need for complex feature engineering. However, the application of LLMs in specialized domains like biopharmaceuticals and chemistry remains largely unexplored. These fields are characterized by intricate terminologies, specialized knowledge, and a high demand for precision areas where general purpose LLMs often fall short. In this study, we introduce PharmaGPT, a suite of domain specilized LLMs with 13 billion and 70 billion parameters, specifically trained on a comprehensive corpus tailored to the Bio-Pharmaceutical and Chemical domains. Our evaluation shows that PharmaGPT surpasses existing general models on specific-domain benchmarks such as NAPLEX, demonstrating its exceptional capability in domain-specific tasks. Remarkably, this performance is achieved with a model that has only a fraction, sometimes just one-tenth-of the parameters of general-purpose large models. This advancement establishes a new benchmark for LLMs in the bio-pharmaceutical and chemical fields, addressing the existing gap in specialized language modeling. It also suggests a promising path for enhanced research and development, paving the way for more precise and effective NLP applications in these areas.
△ Less
Submitted 9 July, 2024; v1 submitted 25 June, 2024;
originally announced June 2024.
-
COT: A Generative Approach for Hate Speech Counter-Narratives via Contrastive Optimal Transport
Authors:
Linhao Zhang,
Li Jin,
Guangluan Xu,
Xiaoyu Li,
Xian Sun
Abstract:
Counter-narratives, which are direct responses consisting of non-aggressive fact-based arguments, have emerged as a highly effective approach to combat the proliferation of hate speech. Previous methodologies have primarily focused on fine-tuning and post-editing techniques to ensure the fluency of generated contents, while overlooking the critical aspects of individualization and relevance concer…
▽ More
Counter-narratives, which are direct responses consisting of non-aggressive fact-based arguments, have emerged as a highly effective approach to combat the proliferation of hate speech. Previous methodologies have primarily focused on fine-tuning and post-editing techniques to ensure the fluency of generated contents, while overlooking the critical aspects of individualization and relevance concerning the specific hatred targets, such as LGBT groups, immigrants, etc. This research paper introduces a novel framework based on contrastive optimal transport, which effectively addresses the challenges of maintaining target interaction and promoting diversification in generating counter-narratives. Firstly, an Optimal Transport Kernel (OTK) module is leveraged to incorporate hatred target information in the token representations, in which the comparison pairs are extracted between original and transported features. Secondly, a self-contrastive learning module is employed to address the issue of model degeneration. This module achieves this by generating an anisotropic distribution of token representations. Finally, a target-oriented search method is integrated as an improved decoding strategy to explicitly promote domain relevance and diversification in the inference process. This strategy modifies the model's confidence score by considering both token similarity and target relevance. Quantitative and qualitative experiments have been evaluated on two benchmark datasets, which demonstrate that our proposed model significantly outperforms current methods evaluated by metrics from multiple aspects.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
DVOS: Self-Supervised Dense-Pattern Video Object Segmentation
Authors:
Keyhan Najafian,
Farhad Maleki,
Ian Stavness,
Lingling Jin
Abstract:
Video object segmentation approaches primarily rely on large-scale pixel-accurate human-annotated datasets for model development. In Dense Video Object Segmentation (DVOS) scenarios, each video frame encompasses hundreds of small, dense, and partially occluded objects. Accordingly, the labor-intensive manual annotation of even a single frame often takes hours, which hinders the development of DVOS…
▽ More
Video object segmentation approaches primarily rely on large-scale pixel-accurate human-annotated datasets for model development. In Dense Video Object Segmentation (DVOS) scenarios, each video frame encompasses hundreds of small, dense, and partially occluded objects. Accordingly, the labor-intensive manual annotation of even a single frame often takes hours, which hinders the development of DVOS for many applications. Furthermore, in videos with dense patterns, following a large number of objects that move in different directions poses additional challenges. To address these challenges, we proposed a semi-self-supervised spatiotemporal approach for DVOS utilizing a diffusion-based method through multi-task learning. Emulating real videos' optical flow and simulating their motion, we developed a methodology to synthesize computationally annotated videos that can be used for training DVOS models; The model performance was further improved by utilizing weakly labeled (computationally generated but imprecise) data. To demonstrate the utility and efficacy of the proposed approach, we developed DVOS models for wheat head segmentation of handheld and drone-captured videos, capturing wheat crops in fields of different locations across various growth stages, spanning from heading to maturity. Despite using only a few manually annotated video frames, the proposed approach yielded high-performing models, achieving a Dice score of 0.82 when tested on a drone-captured external test set. While we showed the efficacy of the proposed approach for wheat head segmentation, its application can be extended to other crops or DVOS in other domains, such as crowd analysis or microscopic image analysis.
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction
Authors:
Pengjie Wang,
Kaile Zhang,
Xinyu Wang,
Shengwei Han,
Yongge Liu,
Lianwen Jin,
Xiang Bai,
Yuliang Liu
Abstract:
Oracle Bone Inscriptions is one of the oldest existing forms of writing in the world. However, due to the great antiquity of the era, a large number of Oracle Bone Inscriptions (OBI) remain undeciphered, making it one of the global challenges in the field of paleography today. This paper introduces a novel approach, namely Puzzle Pieces Picker (P$^3$), to decipher these enigmatic characters throug…
▽ More
Oracle Bone Inscriptions is one of the oldest existing forms of writing in the world. However, due to the great antiquity of the era, a large number of Oracle Bone Inscriptions (OBI) remain undeciphered, making it one of the global challenges in the field of paleography today. This paper introduces a novel approach, namely Puzzle Pieces Picker (P$^3$), to decipher these enigmatic characters through radical reconstruction. We deconstruct OBI into foundational strokes and radicals, then employ a Transformer model to reconstruct them into their modern (conterpart)\textcolor{blue}{counterparts}, offering a groundbreaking solution to ancient script analysis. To further this endeavor, a new Ancient Chinese Character Puzzles (ACCP) dataset was developed, comprising an extensive collection of character images from seven key historical stages, annotated with detailed radical sequences. The experiments have showcased considerable promising insights, underscoring the potential and effectiveness of our approach in deciphering the intricacies of ancient Chinese scripts. Through this novel dataset and methodology, we aim to bridge the gap between traditional philology and modern document analysis techniques, offering new insights into the rich history of Chinese linguistic heritage.
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
Deciphering Oracle Bone Language with Diffusion Models
Authors:
Haisu Guan,
Huanxin Yang,
Xinyu Wang,
Shengwei Han,
Yongge Liu,
Lianwen Jin,
Xiang Bai,
Yuliang Liu
Abstract:
Originating from China's Shang Dynasty approximately 3,000 years ago, the Oracle Bone Script (OBS) is a cornerstone in the annals of linguistic history, predating many established writing systems. Despite the discovery of thousands of inscriptions, a vast expanse of OBS remains undeciphered, casting a veil of mystery over this ancient language. The emergence of modern AI technologies presents a no…
▽ More
Originating from China's Shang Dynasty approximately 3,000 years ago, the Oracle Bone Script (OBS) is a cornerstone in the annals of linguistic history, predating many established writing systems. Despite the discovery of thousands of inscriptions, a vast expanse of OBS remains undeciphered, casting a veil of mystery over this ancient language. The emergence of modern AI technologies presents a novel frontier for OBS decipherment, challenging traditional NLP methods that rely heavily on large textual corpora, a luxury not afforded by historical languages. This paper introduces a novel approach by adopting image generation techniques, specifically through the development of Oracle Bone Script Decipher (OBSD). Utilizing a conditional diffusion-based strategy, OBSD generates vital clues for decipherment, charting a new course for AI-assisted analysis of ancient languages. To validate its efficacy, extensive experiments were conducted on an oracle bone script dataset, with quantitative results demonstrating the effectiveness of OBSD. Code and decipherment results will be made available at https://github.com/guanhaisu/OBSD.
△ Less
Submitted 2 June, 2024;
originally announced June 2024.
-
A Multimodal Dangerous State Recognition and Early Warning System for Elderly with Intermittent Dementia
Authors:
Liyun Deng,
Lei Jin,
Guangcheng Wang,
Quan Shi,
Han Wang
Abstract:
In response to the social issue of the increasing number of elderly vulnerable groups going missing due to the aggravating aging population in China, our team has developed a wearable anti-loss device and intelligent early warning system for elderly individuals with intermittent dementia using artificial intelligence and IoT technology. This system comprises an anti-loss smart helmet, a cloud comp…
▽ More
In response to the social issue of the increasing number of elderly vulnerable groups going missing due to the aggravating aging population in China, our team has developed a wearable anti-loss device and intelligent early warning system for elderly individuals with intermittent dementia using artificial intelligence and IoT technology. This system comprises an anti-loss smart helmet, a cloud computing module, and an intelligent early warning application on the caregiver's mobile device. The smart helmet integrates a miniature camera module, a GPS module, and a 5G communication module to collect first-person images and location information of the elderly. Data is transmitted remotely via 5G, FTP, and TCP protocols. In the cloud computing module, our team has proposed for the first time a multimodal dangerous state recognition network based on scene and location information to accurately assess the risk of elderly individuals going missing. Finally, the application software interface designed for the caregiver's mobile device implements multi-level early warnings. The system developed by our team requires no operation or response from the elderly, achieving fully automatic environmental perception, risk assessment, and proactive alarming. This overcomes the limitations of traditional monitoring devices, which require active operation and response, thus avoiding the issue of the digital divide for the elderly. It effectively prevents accidental loss and potential dangers for elderly individuals with dementia.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
C$^{3}$Bench: A Comprehensive Classical Chinese Understanding Benchmark for Large Language Models
Authors:
Jiahuan Cao,
Yongxin Shi,
Dezhi Peng,
Yang Liu,
Lianwen Jin
Abstract:
Classical Chinese Understanding (CCU) holds significant value in preserving and exploration of the outstanding traditional Chinese culture. Recently, researchers have attempted to leverage the potential of Large Language Models (LLMs) for CCU by capitalizing on their remarkable comprehension and semantic capabilities. However, no comprehensive benchmark is available to assess the CCU capabilities…
▽ More
Classical Chinese Understanding (CCU) holds significant value in preserving and exploration of the outstanding traditional Chinese culture. Recently, researchers have attempted to leverage the potential of Large Language Models (LLMs) for CCU by capitalizing on their remarkable comprehension and semantic capabilities. However, no comprehensive benchmark is available to assess the CCU capabilities of LLMs. To fill this gap, this paper introduces C$^{3}$bench, a Comprehensive Classical Chinese understanding benchmark, which comprises 50,000 text pairs for five primary CCU tasks, including classification, retrieval, named entity recognition, punctuation, and translation. Furthermore, the data in C$^{3}$bench originates from ten different domains, covering most of the categories in classical Chinese. Leveraging the proposed C$^{3}$bench, we extensively evaluate the quantitative performance of 15 representative LLMs on all five CCU tasks. Our results not only establish a public leaderboard of LLMs' CCU capabilities but also gain some findings. Specifically, existing LLMs are struggle with CCU tasks and still inferior to supervised models. Additionally, the results indicate that CCU is a task that requires special attention. We believe this study could provide a standard benchmark, comprehensive baselines, and valuable insights for the future advancement of LLM-based CCU research. The evaluation pipeline and dataset are available at \url{https://github.com/SCUT-DLVCLab/C3bench}.
△ Less
Submitted 30 May, 2024; v1 submitted 27 May, 2024;
originally announced May 2024.
-
The SkatingVerse Workshop & Challenge: Methods and Results
Authors:
Jian Zhao,
Lei Jin,
Jianshu Li,
Zheng Zhu,
Yinglei Teng,
Jiaojiao Zhao,
Sadaf Gulshad,
Zheng Wang,
Bo Zhao,
Xiangbo Shu,
Yunchao Wei,
Xuecheng Nie,
Xiaojie Jin,
Xiaodan Liang,
Shin'ichi Satoh,
Yandong Guo,
Cewu Lu,
Junliang Xing,
Jane Shen Shengmei
Abstract:
The SkatingVerse Workshop & Challenge aims to encourage research in developing novel and accurate methods for human action understanding. The SkatingVerse dataset used for the SkatingVerse Challenge has been publicly released. There are two subsets in the dataset, i.e., the training subset and testing subset. The training subsets consists of 19,993 RGB video sequences, and the testing subsets cons…
▽ More
The SkatingVerse Workshop & Challenge aims to encourage research in developing novel and accurate methods for human action understanding. The SkatingVerse dataset used for the SkatingVerse Challenge has been publicly released. There are two subsets in the dataset, i.e., the training subset and testing subset. The training subsets consists of 19,993 RGB video sequences, and the testing subsets consists of 8,586 RGB video sequences. Around 10 participating teams from the globe competed in the SkatingVerse Challenge. In this paper, we provide a brief summary of the SkatingVerse Workshop & Challenge including brief introductions to the top three methods. The submission leaderboard will be reopened for researchers that are interested in the human action understanding challenge. The benchmark dataset and other information can be found at: https://skatingverse.github.io/.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
CTGNN: Crystal Transformer Graph Neural Network for Crystal Material Property Prediction
Authors:
Zijian Du,
Luozhijie Jin,
Le Shu,
Yan Cen,
Yuanfeng Xu,
Yongfeng Mei,
Hao Zhang
Abstract:
The combination of deep learning algorithm and materials science has made significant progress in predicting novel materials and understanding various behaviours of materials. Here, we introduced a new model called as the Crystal Transformer Graph Neural Network (CTGNN), which combines the advantages of Transformer model and graph neural networks to address the complexity of structure-properties r…
▽ More
The combination of deep learning algorithm and materials science has made significant progress in predicting novel materials and understanding various behaviours of materials. Here, we introduced a new model called as the Crystal Transformer Graph Neural Network (CTGNN), which combines the advantages of Transformer model and graph neural networks to address the complexity of structure-properties relation of material data. Compared to the state-of-the-art models, CTGNN incorporates the graph network structure for capturing local atomic interactions and the dual-Transformer structures to model intra-crystal and inter-atomic relationships comprehensively. The benchmark carried on by the proposed CTGNN indicates that CTGNN significantly outperforms existing models like CGCNN and MEGNET in the prediction of formation energy and bandgap properties. Our work highlights the potential of CTGNN to enhance the performance of properties prediction and accelerates the discovery of new materials, particularly for perovskite materials.
△ Less
Submitted 19 May, 2024;
originally announced May 2024.
-
BEVRender: Vision-based Cross-view Vehicle Registration in Off-road GNSS-denied Environment
Authors:
Lihong Jin,
Wei Dong,
Michael Kaess
Abstract:
We introduce BEVRender, a novel learning-based approach for the localization of ground vehicles in Global Navigation Satellite System (GNSS)-denied off-road scenarios. These environments are typically challenging for conventional vision-based state estimation due to the lack of distinct visual landmarks and the instability of vehicle poses. To address this, BEVRender generates high-quality local b…
▽ More
We introduce BEVRender, a novel learning-based approach for the localization of ground vehicles in Global Navigation Satellite System (GNSS)-denied off-road scenarios. These environments are typically challenging for conventional vision-based state estimation due to the lack of distinct visual landmarks and the instability of vehicle poses. To address this, BEVRender generates high-quality local bird's eye view (BEV) images of the local terrain. Subsequently, these images are aligned with a geo-referenced aerial map via template-matching to achieve accurate cross-view registration. Our approach overcomes the inherent limitations of visual inertial odometry systems and the substantial storage requirements of image-retrieval localization strategies, which are susceptible to drift and scalability issues, respectively. Extensive experimentation validates BEVRender's advancement over existing GNSS-denied visual localization methods, demonstrating notable enhancements in both localization accuracy and update frequency. The code for BEVRender will be made available soon.
△ Less
Submitted 14 May, 2024;
originally announced May 2024.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Authors:
DeepSeek-AI,
Aixin Liu,
Bei Feng,
Bin Wang,
Bingxuan Wang,
Bo Liu,
Chenggang Zhao,
Chengqi Dengr,
Chong Ruan,
Damai Dai,
Daya Guo,
Dejian Yang,
Deli Chen,
Dongjie Ji,
Erhang Li,
Fangyun Lin,
Fuli Luo,
Guangbo Hao,
Guanting Chen,
Guowei Li,
H. Zhang,
Hanwei Xu,
Hao Yang,
Haowei Zhang,
Honghui Ding
, et al. (132 additional authors not shown)
Abstract:
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference…
▽ More
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
△ Less
Submitted 19 June, 2024; v1 submitted 7 May, 2024;
originally announced May 2024.
-
DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks
Authors:
Jiaxin Zhang,
Dezhi Peng,
Chongyu Liu,
Peirong Zhang,
Lianwen Jin
Abstract:
Document image restoration is a crucial aspect of Document AI systems, as the quality of document images significantly influences the overall performance. Prevailing methods address distinct restoration tasks independently, leading to intricate systems and the incapability to harness the potential synergies of multi-task learning. To overcome this challenge, we propose DocRes, a generalist model t…
▽ More
Document image restoration is a crucial aspect of Document AI systems, as the quality of document images significantly influences the overall performance. Prevailing methods address distinct restoration tasks independently, leading to intricate systems and the incapability to harness the potential synergies of multi-task learning. To overcome this challenge, we propose DocRes, a generalist model that unifies five document image restoration tasks including dewarping, deshadowing, appearance enhancement, deblurring, and binarization. To instruct DocRes to perform various restoration tasks, we propose a novel visual prompt approach called Dynamic Task-Specific Prompt (DTSPrompt). The DTSPrompt for different tasks comprises distinct prior features, which are additional characteristics extracted from the input image. Beyond its role as a cue for task-specific execution, DTSPrompt can also serve as supplementary information to enhance the model's performance. Moreover, DTSPrompt is more flexible than prior visual prompt approaches as it can be seamlessly applied and adapted to inputs with high and variable resolutions. Experimental results demonstrate that DocRes achieves competitive or superior performance compared to existing state-of-the-art task-specific models. This underscores the potential of DocRes across a broader spectrum of document image restoration tasks. The source code is publicly available at https://github.com/ZZZHANG-jx/DocRes
△ Less
Submitted 7 May, 2024;
originally announced May 2024.
-
DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving
Authors:
Chen Min,
Dawei Zhao,
Liang Xiao,
Jian Zhao,
Xinli Xu,
Zheng Zhu,
Lei Jin,
Jianshu Li,
Yulan Guo,
Junliang Xing,
Liping Jing,
Yiming Nie,
Bin Dai
Abstract:
Vision-centric autonomous driving has recently raised wide attention due to its lower cost. Pre-training is essential for extracting a universal representation. However, current vision-centric pre-training typically relies on either 2D or 3D pre-text tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task. In this paper, we address this challenge by i…
▽ More
Vision-centric autonomous driving has recently raised wide attention due to its lower cost. Pre-training is essential for extracting a universal representation. However, current vision-centric pre-training typically relies on either 2D or 3D pre-text tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task. In this paper, we address this challenge by introducing a world model-based autonomous driving 4D representation learning framework, dubbed \emph{DriveWorld}, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion. Specifically, we propose a Memory State-Space Model for spatio-temporal modelling, which consists of a Dynamic Memory Bank module for learning temporal-aware latent dynamics to predict future changes and a Static Scene Propagation module for learning spatial-aware latent statics to offer comprehensive scene contexts. We additionally introduce a Task Prompt to decouple task-aware features for various downstream tasks. The experiments demonstrate that DriveWorld delivers promising results on various autonomous driving tasks. When pre-trained with the OpenScene dataset, DriveWorld achieves a 7.5% increase in mAP for 3D object detection, a 3.0% increase in IoU for online mapping, a 5.0% increase in AMOTA for multi-object tracking, a 0.1m decrease in minADE for motion forecasting, a 3.0% increase in IoU for occupancy prediction, and a 0.34m reduction in average L2 error for planning.
△ Less
Submitted 7 May, 2024;
originally announced May 2024.
-
Dyna-Style Learning with A Macroscopic Model for Vehicle Platooning in Mixed-Autonomy Traffic
Authors:
Yichuan Zou,
Li Jin,
Xi Xiong
Abstract:
Platooning of connected and autonomous vehicles (CAVs) plays a vital role in modernizing highways, ushering in enhanced efficiency and safety. This paper explores the significance of platooning in smart highways, employing a coupled partial differential equation (PDE) and ordinary differential equation (ODE) model to elucidate the complex interaction between bulk traffic flow and CAV platoons. Our…
▽ More
Platooning of connected and autonomous vehicles (CAVs) plays a vital role in modernizing highways, ushering in enhanced efficiency and safety. This paper explores the significance of platooning in smart highways, employing a coupled partial differential equation (PDE) and ordinary differential equation (ODE) model to elucidate the complex interaction between bulk traffic flow and CAV platoons. Our study focuses on developing a Dyna-style planning and learning framework tailored for platoon control, with a specific goal of reducing fuel consumption. By harnessing the coupled PDE-ODE model, we improve data efficiency in Dyna-style learning through virtual experiences. Simulation results validate the effectiveness of our macroscopic model in modeling platoons within mixed-autonomy settings, demonstrating a notable $10.11\%$ reduction in vehicular fuel consumption compared to conventional approaches.
△ Less
Submitted 3 May, 2024;
originally announced May 2024.
-
VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization
Authors:
Yuliang Liu,
Mingxin Huang,
Hao Yan,
Linger Deng,
Weijia Wu,
Hao Lu,
Chunhua Shen,
Lianwen Jin,
Xiang Bai
Abstract:
Text spotting, a task involving the extraction of textual information from image or video sequences, faces challenges in cross-domain adaption, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Typically, we propose a Prompt Queri…
▽ More
Text spotting, a task involving the extraction of textual information from image or video sequences, faces challenges in cross-domain adaption, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Typically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters. The Prompt Queries Generation Module facilitates explicit interaction between different tasks, while the Tasks-aware Adapter helps the model dynamically learn suitable features for each task. Additionally, to further enable the model to learn temporal information at a lower cost, we propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm. Notably, our method outperforms the state-of-the-art method by an average of 2.6% in six cross-domain benchmarks such as TT-to-IC15, CTW1500-to-TT, and TT-to-CTW1500. For video-level cross-domain adaption, our method even surpasses the previous end-to-end video spotting method in ICDAR2015 video and DSText v2 by an average of 5.5% on the MOTA metric, using only image-level data. We further demonstrate that existing Large Multimodal Models exhibit limitations in generating cross-domain scene text spotting, in contrast to our VimTS model which requires significantly fewer parameters and data. The code and datasets will be made available at the https://VimTextSpotter.github.io.
△ Less
Submitted 14 May, 2024; v1 submitted 30 April, 2024;
originally announced April 2024.
-
Insights into the defect-driven heterogeneous structural evolution of Ni-rich layered cathode in lithium-ion batteries
Authors:
Zhongyuan Huang,
Ziwei Chen,
Maolin Yang,
Mihai Chu,
Zenan Li,
Sihao Deng,
Lunhua He,
Lei Jin,
Rafal E. Dunin-Borkowski,
Rui Wang,
Jun Wang,
Tingting Yang,
Yinguo Xiao
Abstract:
Recently, considerable efforts have been made on research and improvement for Ni-rich lithium-ion batteries to meet the demand from vehicles and grid-level large-scale energy storage. Development of next-generation high-performance lithium-ion batteries requires a comprehensive understanding on the underlying electrochemical mechanisms associated with its structural evolution. In this work, advanc…
▽ More
Recently, considerable efforts have been made on research and improvement for Ni-rich lithium-ion batteries to meet the demand from vehicles and grid-level large-scale energy storage. Development of next-generation high-performance lithium-ion batteries requires a comprehensive understanding on the underlying electrochemical mechanisms associated with its structural evolution. In this work, advanced operando neutron diffraction and four-dimensional scanning transmission electron microscopy techniques are applied to clarify the structural evolution of electrodes in two distinct full cells with identical LiNi0.8Co0.1Mn0.1O2 cathode but different anode counterparts. It is found that both of cathodes in two cells exhibit non-intrinsic two-phase-like behavior at the early charge stage, indicating selective Li+ extraction from cathodes. But the heterogeneous evolution of cathode is less serious with graphite-silicon blended anode than that with graphite anode due to the different delithiation rate. Moreover, it is revealed that the formation of heterogeneous structure is led by the distribution of defects including Li/Ni disordering and microcracks, which should be inhibited by assembling appropriate anode to avoid potential threaten on cell performance. The present work unveils the origin of inhomogeneity in Ni-rich lithium-ion batteries and highlights the significance of kinetics control in electrodes for batteries with higher capacity and longer life.
△ Less
Submitted 23 April, 2024;
originally announced April 2024.
-
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
Authors:
Ye Tian,
Baolin Peng,
Linfeng Song,
Lifeng Jin,
Dian Yu,
Haitao Mi,
Dong Yu
Abstract:
Despite the impressive capabilities of Large Language Models (LLMs) on various tasks, they still struggle with scenarios that involves complex reasoning and planning. Recent work proposed advanced prompting techniques and the necessity of fine-tuning with high-quality data to augment LLMs' reasoning abilities. However, these approaches are inherently constrained by data availability and quality. I…
▽ More
Despite the impressive capabilities of Large Language Models (LLMs) on various tasks, they still struggle with scenarios that involves complex reasoning and planning. Recent work proposed advanced prompting techniques and the necessity of fine-tuning with high-quality data to augment LLMs' reasoning abilities. However, these approaches are inherently constrained by data availability and quality. In light of this, self-correction and self-learning emerge as viable solutions, employing strategies that allow LLMs to refine their outputs and learn from self-assessed rewards. Yet, the efficacy of LLMs in self-refining its response, particularly in complex reasoning and planning task, remains dubious. In this paper, we introduce AlphaLLM for the self-improvements of LLMs, which integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop, thereby enhancing the capabilities of LLMs without additional annotations. Drawing inspiration from the success of AlphaGo, AlphaLLM addresses the unique challenges of combining MCTS with LLM for self-improvement, including data scarcity, the vastness search spaces of language tasks, and the subjective nature of feedback in language tasks. AlphaLLM is comprised of prompt synthesis component, an efficient MCTS approach tailored for language tasks, and a trio of critic models for precise feedback. Our experimental results in mathematical reasoning tasks demonstrate that AlphaLLM significantly enhances the performance of LLMs without additional annotations, showing the potential for self-improvement in LLMs.
△ Less
Submitted 18 April, 2024;
originally announced April 2024.
-
Matching Hadronization and Perturbative Evolution: The Cluster Model in Light of Infrared Shower Cutoff Dependence
Authors:
André H. Hoang,
Oliver L. Jin,
Simon Plätzer,
Daniel Samitz
Abstract:
In the context of Monte Carlo (MC) generators with parton showers that have next-to-leading-logarithmic (NLL) precision, the cutoff $Q_0$ terminating the shower evolution should be viewed as an infrared factorization scale so that parameters or non-perturbative effects of the MC generator may have a field theoretic interpretation with a controllable scheme dependence. This implies that the generat…
▽ More
In the context of Monte Carlo (MC) generators with parton showers that have next-to-leading-logarithmic (NLL) precision, the cutoff $Q_0$ terminating the shower evolution should be viewed as an infrared factorization scale so that parameters or non-perturbative effects of the MC generator may have a field theoretic interpretation with a controllable scheme dependence. This implies that the generator's parton level should be carefully defined within QCD perturbation theory with subleading order precision. Furthermore, it entails that the shower cut $Q_0$ is not treated as one of the generator's tuning parameters, but that the tuning can be carried out reliably for a range of $Q_0$ values and that the hadron level description is $Q_0$-invariant. This in turn imposes non-trival constraints on the behavior of the generator's hadronization model, so that its parameters can adapt accordingly when the $Q_0$ value is changed. We investigate these features using the angular ordered parton shower and the cluster hadronization model implemented in the Herwig~7.2 MC generator focusing in particular on the $e^+e^-$ 2-jettiness distribution, where the shower is known to be NLL precise and where QCD factorization imposes stringent constraints on the hadronization corrections. We show that the Herwig default cluster hadronization model does not exhibit these features or consistency with QCD factorization with a satisfying precision. We design a modification of the cluster hadronization model, where some dynamical parton shower aspects are added that are missing in the default model. For this novel dynamical cluster hadronization model these features and consistency with QCD factorization are realized much more accurately.
△ Less
Submitted 15 April, 2024;
originally announced April 2024.
-
Entropy Guided Extrapolative Decoding to Improve Factuality in Large Language Models
Authors:
Souvik Das,
Lifeng Jin,
Linfeng Song,
Haitao Mi,
Baolin Peng,
Dong Yu
Abstract:
Large language models (LLMs) exhibit impressive natural language capabilities but suffer from hallucination -- generating content ungrounded in the realities of training data. Recent work has focused on decoding techniques to improve factuality during inference by leveraging LLMs' hierarchical representation of factual knowledge, manipulating the predicted distributions at inference time. Current…
▽ More
Large language models (LLMs) exhibit impressive natural language capabilities but suffer from hallucination -- generating content ungrounded in the realities of training data. Recent work has focused on decoding techniques to improve factuality during inference by leveraging LLMs' hierarchical representation of factual knowledge, manipulating the predicted distributions at inference time. Current state-of-the-art approaches refine decoding by contrasting early-exit distributions from a lower layer with the final layer to exploit information related to factuality within the model forward procedure. However, such methods often assume the final layer is the most reliable and the lower layer selection process depends on it. In this work, we first propose extrapolation of critical token probabilities beyond the last layer for more accurate contrasting. We additionally employ layer-wise entropy-guided lower layer selection, decoupling the selection process from the final layer. Experiments demonstrate strong performance - surpassing state-of-the-art on multiple different datasets by large margins. Analyses show different kinds of prompts respond to different selection strategies.
△ Less
Submitted 14 April, 2024;
originally announced April 2024.
-
On Joint Convergence of Traffic State and Weight Vector in Learning-Based Dynamic Routing with Value Function Approximation
Authors:
Yidan Wu,
Jianan Zhang,
Li Jin
Abstract:
Learning-based approaches are increasingly popular for traffic control problems. However, these approaches are applied typically as black boxes with limited theoretical guarantees and interpretability. In this paper, we consider the theory of dynamic routing over parallel servers, a representative traffic control task, using semi-gradient on-policy control algorithm, a representative reinforcement…
▽ More
Learning-based approaches are increasingly popular for traffic control problems. However, these approaches are applied typically as black boxes with limited theoretical guarantees and interpretability. In this paper, we consider the theory of dynamic routing over parallel servers, a representative traffic control task, using semi-gradient on-policy control algorithm, a representative reinforcement learning method. We consider a linear value function approximation on an infinite state space; a Lyapunov function is also derived from the approximator. In particular, the structure of the approximator naturally makes possible idling policies, which is an interesting and useful advantage over existing dynamic routing schemes. We show that the convergence of the approximation weights is coupled with the convergence of the traffic state. We show that if the system is stabilizable, then (i) the weight vector converges to a bounded region, and (ii) the traffic state is bounded in the mean. We also empirically show that the proposed algorithm is computationally efficient with an insignificant optimality gap.
△ Less
Submitted 14 April, 2024;
originally announced April 2024.
-
WaveMo: Learning Wavefront Modulations to See Through Scattering
Authors:
Mingyang Xie,
Haiyun Guo,
Brandon Y. Feng,
Lingbo Jin,
Ashok Veeraraghavan,
Christopher A. Metzler
Abstract:
Imaging through scattering media is a fundamental and pervasive challenge in fields ranging from medical diagnostics to astronomy. A promising strategy to overcome this challenge is wavefront modulation, which induces measurement diversity during image acquisition. Despite its importance, designing optimal wavefront modulations to image through scattering remains under-explored. This paper introdu…
▽ More
Imaging through scattering media is a fundamental and pervasive challenge in fields ranging from medical diagnostics to astronomy. A promising strategy to overcome this challenge is wavefront modulation, which induces measurement diversity during image acquisition. Despite its importance, designing optimal wavefront modulations to image through scattering remains under-explored. This paper introduces a novel learning-based framework to address the gap. Our approach jointly optimizes wavefront modulations and a computationally lightweight feedforward "proxy" reconstruction network. This network is trained to recover scenes obscured by scattering, using measurements that are modified by these modulations. The learned modulations produced by our framework generalize effectively to unseen scattering scenarios and exhibit remarkable versatility. During deployment, the learned modulations can be decoupled from the proxy network to augment other more computationally expensive restoration algorithms. Through extensive experiments, we demonstrate our approach significantly advances the state of the art in imaging through scattering media. Our project webpage is at https://wavemo-2024.github.io/.
△ Less
Submitted 11 April, 2024;
originally announced April 2024.
-
Minimax Least-Square Policy Iteration for Cost-Aware Defense of Traffic Routing against Unknown Threats
Authors:
Yuzhen Zhan,
Li Jin
Abstract:
Dynamic routing is one of the representative control scheme in transportation, production lines, and data transmission. In the modern context of connectivity and autonomy, routing decisions are potentially vulnerable to malicious attacks. In this paper, we consider the dynamic routing problem over parallel traffic links in the face of such threats. An attacker is capable of increasing or destabili…
▽ More
Dynamic routing is one of the representative control scheme in transportation, production lines, and data transmission. In the modern context of connectivity and autonomy, routing decisions are potentially vulnerable to malicious attacks. In this paper, we consider the dynamic routing problem over parallel traffic links in the face of such threats. An attacker is capable of increasing or destabilizing traffic queues by strategic manipulating the nominally optimal routing decisions. A defender is capable of securing the correct routing decision. Attacking and defensive actions induce technological costs. The defender has no prior information about the attacker's strategy. We develop an least-square policy iteration algorithm for the defender to compute a cost-aware and threat-adaptive defensive strategy. The policy evaluation step computes a weight vector that minimizes the sampled temporal-difference error. We derive a concrete theoretical upper bound on the evaluation error based on the theory of value function approximation. The policy improvement step solves a minimax problem and thus iteratively computes the Markov perfect equilibrium of the security game. We also discuss the training error of the entire policy iteration process.
△ Less
Submitted 7 April, 2024;
originally announced April 2024.
-
Bridging the Gap Between End-to-End and Two-Step Text Spotting
Authors:
Mingxin Huang,
Hongliang Li,
Yuliang Liu,
Xiang Bai,
Lianwen Jin
Abstract:
Modularity plays a crucial role in the development and maintenance of complex systems. While end-to-end text spotting efficiently mitigates the issues of error accumulation and sub-optimal performance seen in traditional two-step methodologies, the two-step methods continue to be favored in many competitions and practical settings due to their superior modularity. In this paper, we introduce Bridg…
▽ More
Modularity plays a crucial role in the development and maintenance of complex systems. While end-to-end text spotting efficiently mitigates the issues of error accumulation and sub-optimal performance seen in traditional two-step methodologies, the two-step methods continue to be favored in many competitions and practical settings due to their superior modularity. In this paper, we introduce Bridging Text Spotting, a novel approach that resolves the error accumulation and suboptimal performance issues in two-step methods while retaining modularity. To achieve this, we adopt a well-trained detector and recognizer that are developed and trained independently and then lock their parameters to preserve their already acquired capabilities. Subsequently, we introduce a Bridge that connects the locked detector and recognizer through a zero-initialized neural network. This zero-initialized neural network, initialized with weights set to zeros, ensures seamless integration of the large receptive field features in detection into the locked recognizer. Furthermore, since the fixed detector and recognizer cannot naturally acquire end-to-end optimization features, we adopt the Adapter to facilitate their efficient learning of these features. We demonstrate the effectiveness of the proposed method through extensive experiments: Connecting the latest detector and recognizer through Bridging Text Spotting, we achieved an accuracy of 83.3% on Total-Text, 69.8% on CTW1500, and 89.5% on ICDAR 2015. The code is available at https://github.com/mxin262/Bridging-Text-Spotting.
△ Less
Submitted 6 April, 2024;
originally announced April 2024.
-
Robust incoherent perfect absorption
Authors:
H. S. Xu,
L. Jin
Abstract:
A coherent perfect absorber is capable of completely absorbing input waves. However, the coherent perfect absorption severely depends on the superposition of the input waves, and the perfect absorption is sensitive to the disorder of the absorber. Thus, a robust incoherent perfect absorption, being insensitive to the superposition of input waves and the system disorder, is desirable for practical…
▽ More
A coherent perfect absorber is capable of completely absorbing input waves. However, the coherent perfect absorption severely depends on the superposition of the input waves, and the perfect absorption is sensitive to the disorder of the absorber. Thus, a robust incoherent perfect absorption, being insensitive to the superposition of input waves and the system disorder, is desirable for practical applications. Here, we demonstrate that the linearly independent destructive interference at the port connections removes the constraint on the coherent input. We propose an approach using the interplay between the loss and localization to form the incoherent perfect absorption. The resonant incidence from either port is completely absorbed. Furthermore, we utilize the lattice configuration supporting the flat band to demonstrate the disorder-immune incoherent perfect absorption. Our findings provide insight into the fundamentals and applications for the perfect absorption of light, microwaves, sound, mechanical waves, and beyond.
△ Less
Submitted 5 April, 2024;
originally announced April 2024.
-
Exploiting Priors from 3D Diffusion Models for RGB-Based One-Shot View Planning
Authors:
Sicong Pan,
Liren Jin,
Xuying Huang,
Cyrill Stachniss,
Marija Popović,
Maren Bennewitz
Abstract:
Object reconstruction is relevant for many autonomous robotic tasks that require interaction with the environment. A key challenge in such scenarios is planning view configurations to collect informative measurements for reconstructing an initially unknown object. One-shot view planning enables efficient data collection by predicting view configurations and planning the globally shortest path conn…
▽ More
Object reconstruction is relevant for many autonomous robotic tasks that require interaction with the environment. A key challenge in such scenarios is planning view configurations to collect informative measurements for reconstructing an initially unknown object. One-shot view planning enables efficient data collection by predicting view configurations and planning the globally shortest path connecting all views at once. However, geometric priors about the object are required to conduct one-shot view planning. In this work, we propose a novel one-shot view planning approach that utilizes the powerful 3D generation capabilities of diffusion models as priors. By incorporating such geometric priors into our pipeline, we achieve effective one-shot view planning starting with only a single RGB image of the object to be reconstructed. Our planning experiments in simulation and real-world setups indicate that our approach balances well between object reconstruction quality and movement cost.
△ Less
Submitted 25 March, 2024;
originally announced March 2024.
-
V2X-Real: a Largs-Scale Dataset for Vehicle-to-Everything Cooperative Perception
Authors:
Hao Xiang,
Zhaoliang Zheng,
Xin Xia,
Runsheng Xu,
Letian Gao,
Zewei Zhou,
Xu Han,
Xinkai Ji,
Mingxi Li,
Zonglin Meng,
Li Jin,
Mingyue Lei,
Zhaoyang Ma,
Zihang He,
Haoxuan Ma,
Yunshuang Yuan,
Yingqian Zhao,
Jiaqi Ma
Abstract:
Recent advancements in Vehicle-to-Everything (V2X) technologies have enabled autonomous vehicles to share sensing information to see through occlusions, greatly boosting the perception capability. However, there are no real-world datasets to facilitate the real V2X cooperative perception research -- existing datasets either only support Vehicle-to-Infrastructure cooperation or Vehicle-to-Vehicle c…
▽ More
Recent advancements in Vehicle-to-Everything (V2X) technologies have enabled autonomous vehicles to share sensing information to see through occlusions, greatly boosting the perception capability. However, there are no real-world datasets to facilitate the real V2X cooperative perception research -- existing datasets either only support Vehicle-to-Infrastructure cooperation or Vehicle-to-Vehicle cooperation. In this paper, we propose a dataset that has a mixture of multiple vehicles and smart infrastructure simultaneously to facilitate the V2X cooperative perception development with multi-modality sensing data. Our V2X-Real is collected using two connected automated vehicles and two smart infrastructures, which are all equipped with multi-modal sensors including LiDAR sensors and multi-view cameras. The whole dataset contains 33K LiDAR frames and 171K camera data with over 1.2M annotated bounding boxes of 10 categories in very challenging urban scenarios. According to the collaboration mode and ego perspective, we derive four types of datasets for Vehicle-Centric, Infrastructure-Centric, Vehicle-to-Vehicle, and Infrastructure-to-Infrastructure cooperative perception. Comprehensive multi-class multi-agent benchmarks of SOTA cooperative perception methods are provided. The V2X-Real dataset and benchmark codes will be released.
△ Less
Submitted 24 March, 2024;
originally announced March 2024.
-
HierCode: A Lightweight Hierarchical Codebook for Zero-shot Chinese Text Recognition
Authors:
Yuyi Zhang,
Yuanzhi Zhu,
Dezhi Peng,
Peirong Zhang,
Zhenhua Yang,
Zhibo Yang,
Cong Yao,
Lianwen Jin
Abstract:
Text recognition, especially for complex scripts like Chinese, faces unique challenges due to its intricate character structures and vast vocabulary. Traditional one-hot encoding methods struggle with the representation of hierarchical radicals, recognition of Out-Of-Vocabulary (OOV) characters, and on-device deployment due to their computational intensity. To address these challenges, we propose…
▽ More
Text recognition, especially for complex scripts like Chinese, faces unique challenges due to its intricate character structures and vast vocabulary. Traditional one-hot encoding methods struggle with the representation of hierarchical radicals, recognition of Out-Of-Vocabulary (OOV) characters, and on-device deployment due to their computational intensity. To address these challenges, we propose HierCode, a novel and lightweight codebook that exploits the innate hierarchical nature of Chinese characters. HierCode employs a multi-hot encoding strategy, leveraging hierarchical binary tree encoding and prototype learning to create distinctive, informative representations for each character. This approach not only facilitates zero-shot recognition of OOV characters by utilizing shared radicals and structures but also excels in line-level recognition tasks by computing similarity with visual features, a notable advantage over existing methods. Extensive experiments across diverse benchmarks, including handwritten, scene, document, web, and ancient text, have showcased HierCode's superiority for both conventional and zero-shot Chinese character or text recognition, exhibiting state-of-the-art performance with significantly fewer parameters and fast inference speed.
△ Less
Submitted 20 March, 2024;
originally announced March 2024.
-
STAIR: Semantic-Targeted Active Implicit Reconstruction
Authors:
Liren Jin,
Haofei Kuang,
Yue Pan,
Cyrill Stachniss,
Marija Popović
Abstract:
Many autonomous robotic applications require object-level understanding when deployed. Actively reconstructing objects of interest, i.e. objects with specific semantic meanings, is therefore relevant for a robot to perform downstream tasks in an initially unknown environment. In this work, we propose a novel framework for semantic-targeted active reconstruction using posed RGB-D measurements and 2…
▽ More
Many autonomous robotic applications require object-level understanding when deployed. Actively reconstructing objects of interest, i.e. objects with specific semantic meanings, is therefore relevant for a robot to perform downstream tasks in an initially unknown environment. In this work, we propose a novel framework for semantic-targeted active reconstruction using posed RGB-D measurements and 2D semantic labels as input. The key components of our framework are a semantic implicit neural representation and a compatible planning utility function based on semantic rendering and uncertainty estimation, enabling adaptive view planning to target objects of interest. Our planning approach achieves better reconstruction performance in terms of mesh and novel view rendering quality compared to implicit reconstruction baselines that do not consider semantics for view planning. Our framework further outperforms a state-of-the-art semantic-targeted active reconstruction pipeline based on explicit maps, justifying our choice of utilising implicit neural representations to tackle semantic-targeted active reconstruction problems.
△ Less
Submitted 17 March, 2024;
originally announced March 2024.
-
Properties of the Strong Data Processing Constant for Rényi Divergence
Authors:
Lifu Jin,
Amedeo Roberto Esposito,
Michael Gastpar
Abstract:
Strong data processing inequalities (SDPI) are an important object of study in Information Theory and have been well studied for $f$-divergences. Universal upper and lower bounds have been provided along with several applications, connecting them to impossibility (converse) results, concentration of measure, hypercontractivity, and so on. In this paper, we study Rényi divergence and the correspond…
▽ More
Strong data processing inequalities (SDPI) are an important object of study in Information Theory and have been well studied for $f$-divergences. Universal upper and lower bounds have been provided along with several applications, connecting them to impossibility (converse) results, concentration of measure, hypercontractivity, and so on. In this paper, we study Rényi divergence and the corresponding SDPI constant whose behavior seems to deviate from that of ordinary $Φ$-divergences. In particular, one can find examples showing that the universal upper bound relating its SDPI constant to the one of Total Variation does not hold in general. In this work, we prove, however, that the universal lower bound involving the SDPI constant of the Chi-square divergence does indeed hold. Furthermore, we also provide a characterization of the distribution that achieves the supremum when $α$ is equal to $2$ and consequently compute the SDPI constant for Rényi divergence of the general binary channel.
△ Less
Submitted 14 May, 2024; v1 submitted 15 March, 2024;
originally announced March 2024.
-
Self-Consistency Boosts Calibration for Math Reasoning
Authors:
Ante Wang,
Linfeng Song,
Ye Tian,
Baolin Peng,
Lifeng Jin,
Haitao Mi,
Jinsong Su,
Dong Yu
Abstract:
Calibration, which establishes the correlation between accuracy and model confidence, is important for LLM development. We design three off-the-shelf calibration methods based on self-consistency (Wang et al., 2022) for math reasoning tasks. Evaluation on two popular benchmarks (GSM8K and MathQA) using strong open-source LLMs (Mistral and LLaMA2), our methods better bridge model confidence and acc…
▽ More
Calibration, which establishes the correlation between accuracy and model confidence, is important for LLM development. We design three off-the-shelf calibration methods based on self-consistency (Wang et al., 2022) for math reasoning tasks. Evaluation on two popular benchmarks (GSM8K and MathQA) using strong open-source LLMs (Mistral and LLaMA2), our methods better bridge model confidence and accuracy than existing methods based on p(True) (Kadavath et al., 2022) or logit (Kadavath et al., 2022).
△ Less
Submitted 14 March, 2024;
originally announced March 2024.
-
3DFIRES: Few Image 3D REconstruction for Scenes with Hidden Surface
Authors:
Linyi Jin,
Nilesh Kulkarni,
David Fouhey
Abstract:
This paper introduces 3DFIRES, a novel system for scene-level 3D reconstruction from posed images. Designed to work with as few as one view, 3DFIRES reconstructs the complete geometry of unseen scenes, including hidden surfaces. With multiple view inputs, our method produces full reconstruction within all camera frustums. A key feature of our approach is the fusion of multi-view information at the…
▽ More
This paper introduces 3DFIRES, a novel system for scene-level 3D reconstruction from posed images. Designed to work with as few as one view, 3DFIRES reconstructs the complete geometry of unseen scenes, including hidden surfaces. With multiple view inputs, our method produces full reconstruction within all camera frustums. A key feature of our approach is the fusion of multi-view information at the feature level, enabling the production of coherent and comprehensive 3D reconstruction. We train our system on non-watertight scans from large-scale real scene dataset. We show it matches the efficacy of single-view reconstruction methods with only one input and surpasses existing techniques in both quantitative and qualitative measures for sparse-view 3D reconstruction.
△ Less
Submitted 13 March, 2024;
originally announced March 2024.
-
DiffChat: Learning to Chat with Text-to-Image Synthesis Models for Interactive Image Creation
Authors:
Jiapeng Wang,
Chengyu Wang,
Tingfeng Cao,
Jun Huang,
Lianwen Jin
Abstract:
We present DiffChat, a novel method to align Large Language Models (LLMs) to "chat" with prompt-as-input Text-to-Image Synthesis (TIS) models (e.g., Stable Diffusion) for interactive image creation. Given a raw prompt/image and a user-specified instruction, DiffChat can effectively make appropriate modifications and generate the target prompt, which can be leveraged to create the target image of h…
▽ More
We present DiffChat, a novel method to align Large Language Models (LLMs) to "chat" with prompt-as-input Text-to-Image Synthesis (TIS) models (e.g., Stable Diffusion) for interactive image creation. Given a raw prompt/image and a user-specified instruction, DiffChat can effectively make appropriate modifications and generate the target prompt, which can be leveraged to create the target image of high quality. To achieve this, we first collect an instruction-following prompt engineering dataset named InstructPE for the supervised training of DiffChat. Next, we propose a reinforcement learning framework with the feedback of three core criteria for image creation, i.e., aesthetics, user preference, and content integrity. It involves an action-space dynamic modification technique to obtain more relevant positive samples and harder negative samples during the off-policy sampling. Content integrity is also introduced into the value estimation function for further improvement of produced images. Our method can exhibit superior performance than baseline models and strong competitors based on both automatic and human evaluations, which fully demonstrates its effectiveness.
△ Less
Submitted 7 March, 2024;
originally announced March 2024.
-
A Crosstalk-Aware Timing Prediction Method in Routing
Authors:
Leilei Jin,
Jiajie Xu,
Wenjie Fu,
Hao Yan,
Longxing Shi
Abstract:
With shrinking interconnect spacing in advanced technology nodes, existing timing predictions become less precise due to the challenging quantification of crosstalk-induced delay. During the routing, the crosstalk effect is typically modeled by predicting coupling capacitance with congestion information. However, the timing estimation tends to be overly pessimistic, as the crosstalk-induced delay…
▽ More
With shrinking interconnect spacing in advanced technology nodes, existing timing predictions become less precise due to the challenging quantification of crosstalk-induced delay. During the routing, the crosstalk effect is typically modeled by predicting coupling capacitance with congestion information. However, the timing estimation tends to be overly pessimistic, as the crosstalk-induced delay depends not only on the coupling capacitance but also on the signal arrival time. This work presents a crosstalk-aware timing estimation method using a two-step machine learning approach. Interconnects that are physically adjacent and overlap in signal timing windows are filtered first. Crosstalk delay is predicted by integrating physical topology and timing features without relying on post-routing results and the parasitic extraction. Experimental results show a match rate of over 99% for identifying crosstalk nets compared to the commercial tool on the OpenCores benchmarks, with prediction results being more accurate than those of other state-of-the-art methods.
△ Less
Submitted 6 March, 2024;
originally announced March 2024.
-
A Knowledge Plug-and-Play Test Bed for Open-domain Dialogue Generation
Authors:
Xiangci Li,
Linfeng Song,
Lifeng Jin,
Haitao Mi,
Jessica Ouyang,
Dong Yu
Abstract:
Knowledge-based, open-domain dialogue generation aims to build chit-chat systems that talk to humans using mined support knowledge. Many types and sources of knowledge have previously been shown to be useful as support knowledge. Even in the era of large language models, response generation grounded in knowledge retrieved from additional up-to-date sources remains a practically important approach.…
▽ More
Knowledge-based, open-domain dialogue generation aims to build chit-chat systems that talk to humans using mined support knowledge. Many types and sources of knowledge have previously been shown to be useful as support knowledge. Even in the era of large language models, response generation grounded in knowledge retrieved from additional up-to-date sources remains a practically important approach. While prior work using single-source knowledge has shown a clear positive correlation between the performances of knowledge selection and response generation, there are no existing multi-source datasets for evaluating support knowledge retrieval. Further, prior work has assumed that the knowledge sources available at test time are the same as during training. This unrealistic assumption unnecessarily handicaps models, as new knowledge sources can become available after a model is trained. In this paper, we present a high-quality benchmark named multi-source Wizard of Wikipedia (Ms.WoW) for evaluating multi-source dialogue knowledge selection and response generation. Unlike existing datasets, it contains clean support knowledge, grounded at the utterance level and partitioned into multiple knowledge sources. We further propose a new challenge, dialogue knowledge plug-and-play, which aims to test an already trained dialogue model on using new support knowledge from previously unseen sources in a zero-shot fashion.
△ Less
Submitted 6 March, 2024;
originally announced March 2024.
-
FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation
Authors:
Chris Rockwell,
Nilesh Kulkarni,
Linyi Jin,
Jeong Joon Park,
Justin Johnson,
David F. Fouhey
Abstract:
Estimating relative camera poses between images has been a central problem in computer vision. Methods that find correspondences and solve for the fundamental matrix offer high precision in most cases. Conversely, methods predicting pose directly using neural networks are more robust to limited overlap and can infer absolute translation scale, but at the expense of reduced precision. We show how t…
▽ More
Estimating relative camera poses between images has been a central problem in computer vision. Methods that find correspondences and solve for the fundamental matrix offer high precision in most cases. Conversely, methods predicting pose directly using neural networks are more robust to limited overlap and can infer absolute translation scale, but at the expense of reduced precision. We show how to combine the best of both methods; our approach yields results that are both precise and robust, while also accurately inferring translation scales. At the heart of our model lies a Transformer that (1) learns to balance between solved and learned pose estimations, and (2) provides a prior to guide a solver. A comprehensive analysis supports our design choices and demonstrates that our method adapts flexibly to various feature extractors and correspondence estimators, showing state-of-the-art performance in 6DoF pose estimation on Matterport3D, InteriorNet, StreetLearn, and Map-free Relocalization.
△ Less
Submitted 5 March, 2024;
originally announced March 2024.
-
Prospects for measuring time variation of astrophysical neutrino sources at dark matter detectors
Authors:
Yi Zhuang,
Louis E. Strigari,
Lei Jin,
Samiran Sinha
Abstract:
We study the prospects for measuring the time variation of solar and atmospheric neutrino fluxes at future large-scale Xenon and Argon dark matter detectors. For solar neutrinos, a yearly time variation arises from the eccentricity of the Earth's orbit, and, for charged current interactions, from a smaller energy-dependent day-night variation to due flavor regeneration as neutrinos travel through…
▽ More
We study the prospects for measuring the time variation of solar and atmospheric neutrino fluxes at future large-scale Xenon and Argon dark matter detectors. For solar neutrinos, a yearly time variation arises from the eccentricity of the Earth's orbit, and, for charged current interactions, from a smaller energy-dependent day-night variation to due flavor regeneration as neutrinos travel through the Earth. For a 100-ton Xenon detector running for 10 years with a Xenon-136 fraction of $\lesssim 0.1\%$, in the electron recoil channel a time-variation amplitude of about 0.8\% is detectable with a power of 90\% and the level of significance of 10\%. This is sufficient to detect time variation due to eccentricity, which has amplitude of $\sim 3\%$. In the nuclear recoil channel, the detectable amplitude is about 10\% under current detector resolution and efficiency conditions, and this generally reduces to about 1\% for improved detector resolution and efficiency, the latter of which is sufficient to detect time variation due to eccentricity. Our analysis assumes both known and unknown periods. We provide scalings to determine the sensitivity to an arbitrary time-varying amplitude as a function of detector parameters. Identifying the time variation of the neutrino fluxes will be important for distinguishing neutrinos from dark matter signals and other detector-related backgrounds, and extracting properties of neutrinos that can be uniquely studied in dark matter experiments.
△ Less
Submitted 28 February, 2024;
originally announced February 2024.
-
Datasets for Large Language Models: A Comprehensive Survey
Authors:
Yang Liu,
Jiahuan Cao,
Chongyu Liu,
Kai Ding,
Lianwen Jin
Abstract:
This paper embarks on an exploration into the Large Language Model (LLM) datasets, which play a crucial role in the remarkable advancements of LLMs. The datasets serve as the foundational infrastructure analogous to a root system that sustains and nurtures the development of LLMs. Consequently, examination of these datasets emerges as a critical topic in research. In order to address the current l…
▽ More
This paper embarks on an exploration into the Large Language Model (LLM) datasets, which play a crucial role in the remarkable advancements of LLMs. The datasets serve as the foundational infrastructure analogous to a root system that sustains and nurtures the development of LLMs. Consequently, examination of these datasets emerges as a critical topic in research. In order to address the current lack of a comprehensive overview and thorough analysis of LLM datasets, and to gain insights into their current status and future trends, this survey consolidates and categorizes the fundamental aspects of LLM datasets from five perspectives: (1) Pre-training Corpora; (2) Instruction Fine-tuning Datasets; (3) Preference Datasets; (4) Evaluation Datasets; (5) Traditional Natural Language Processing (NLP) Datasets. The survey sheds light on the prevailing challenges and points out potential avenues for future investigation. Additionally, a comprehensive review of the existing available dataset resources is also provided, including statistics from 444 datasets, covering 8 language categories and spanning 32 domains. Information from 20 dimensions is incorporated into the dataset statistics. The total data size surveyed surpasses 774.5 TB for pre-training corpora and 700M instances for other datasets. We aim to present the entire landscape of LLM text datasets, serving as a comprehensive reference for researchers in this field and contributing to future studies. Related resources are available at: https://github.com/lmmlzn/Awesome-LLMs-Datasets.
△ Less
Submitted 27 February, 2024;
originally announced February 2024.
-
Collaborative decoding of critical tokens for boosting factuality of large language models
Authors:
Lifeng Jin,
Baolin Peng,
Linfeng Song,
Haitao Mi,
Ye Tian,
Dong Yu
Abstract:
The most common training pipeline for large language models includes pretraining, finetuning and aligning phases, with their respective resulting models, such as the pretrained model and the finetuned model. Finetuned and aligned models show improved abilities of instruction following and safe generation, however their abilities to stay factual about the world are impacted by the finetuning proces…
▽ More
The most common training pipeline for large language models includes pretraining, finetuning and aligning phases, with their respective resulting models, such as the pretrained model and the finetuned model. Finetuned and aligned models show improved abilities of instruction following and safe generation, however their abilities to stay factual about the world are impacted by the finetuning process. Furthermore, the common practice of using sampling during generation also increases chances of hallucination. In this work, we introduce a collaborative decoding framework to harness the high factuality within pretrained models through the concept of critical tokens. We first design a critical token classifier to decide which model to use for the next token, and subsequently generates the next token using different decoding strategies. Experiments with different models and datasets show that our decoding framework is able to reduce model hallucination significantly, showcasing the importance of the collaborative decoding framework.
△ Less
Submitted 27 February, 2024;
originally announced February 2024.
-
Scribble Hides Class: Promoting Scribble-Based Weakly-Supervised Semantic Segmentation with Its Class Label
Authors:
Xinliang Zhang,
Lei Zhu,
Hangzhou He,
Lujia Jin,
Yanye Lu
Abstract:
Scribble-based weakly-supervised semantic segmentation using sparse scribble supervision is gaining traction as it reduces annotation costs when compared to fully annotated alternatives. Existing methods primarily generate pseudo-labels by diffusing labeled pixels to unlabeled ones with local cues for supervision. However, this diffusion process fails to exploit global semantics and class-specific…
▽ More
Scribble-based weakly-supervised semantic segmentation using sparse scribble supervision is gaining traction as it reduces annotation costs when compared to fully annotated alternatives. Existing methods primarily generate pseudo-labels by diffusing labeled pixels to unlabeled ones with local cues for supervision. However, this diffusion process fails to exploit global semantics and class-specific cues, which are important for semantic segmentation. In this study, we propose a class-driven scribble promotion network, which utilizes both scribble annotations and pseudo-labels informed by image-level classes and global semantics for supervision. Directly adopting pseudo-labels might misguide the segmentation model, thus we design a localization rectification module to correct foreground representations in the feature space. To further combine the advantages of both supervisions, we also introduce a distance entropy loss for uncertainty reduction, which adapts per-pixel confidence weights according to the reliable region determined by the scribble and pseudo-label's boundary. Experiments on the ScribbleSup dataset with different qualities of scribble annotations outperform all the previous methods, demonstrating the superiority and robustness of our method.The code is available at https://github.com/Zxl19990529/Class-driven-Scribble-Promotion-Network.
△ Less
Submitted 27 February, 2024;
originally announced February 2024.
-
On the causal discontinuity of Morse spacetimes
Authors:
Lucas Dahinden,
Liang Jin
Abstract:
Morse spacetime is a model of singular Lorentzian manifold, built upon a Morse function which serves as a global time function outside its critical points. The Borde-Sorkin conjecture states that a Morse spacetime is causally continuous if and only if the index and coindex of critical points of the corresponding Morse function are both different from 1. The conjecture has recently been confirmed b…
▽ More
Morse spacetime is a model of singular Lorentzian manifold, built upon a Morse function which serves as a global time function outside its critical points. The Borde-Sorkin conjecture states that a Morse spacetime is causally continuous if and only if the index and coindex of critical points of the corresponding Morse function are both different from 1. The conjecture has recently been confirmed by Garcia Heveling for the case of small anisotropy and Euclidean background metric. Here, we provide a complementary counterexample: a four dimensional Morse spacetime whose critical point has index 2 and large enough anisotropy is causally discontinuous and thus the Borde-Sorkin conjecture does not hold. The proof features a low regularity causal structure and causal bubbling.
△ Less
Submitted 26 February, 2024;
originally announced February 2024.
-
Fine-Grained Self-Endorsement Improves Factuality and Reasoning
Authors:
Ante Wang,
Linfeng Song,
Baolin Peng,
Ye Tian,
Lifeng Jin,
Haitao Mi,
Jinsong Su,
Dong Yu
Abstract:
This work studies improving large language model (LLM) generations at inference time by mitigating fact-conflicting hallucinations. Particularly, we propose a self-endorsement framework that leverages the fine-grained fact-level comparisons across multiple sampled responses. Compared with prior ensemble methods (Wang et al., 2022;Chen et al., 2023)) that perform response-level selection, our appro…
▽ More
This work studies improving large language model (LLM) generations at inference time by mitigating fact-conflicting hallucinations. Particularly, we propose a self-endorsement framework that leverages the fine-grained fact-level comparisons across multiple sampled responses. Compared with prior ensemble methods (Wang et al., 2022;Chen et al., 2023)) that perform response-level selection, our approach can better alleviate hallucinations, especially for longform generation tasks. Our approach can broadly benefit smaller and open-source LLMs as it mainly conducts simple content-based comparisons. Experiments on Biographies show that our method can effectively improve the factuality of generations with simple and intuitive prompts across different scales of LLMs. Besides, comprehensive analyses on TriviaQA and GSM8K demonstrate the potential of self-endorsement for broader application.
△ Less
Submitted 23 February, 2024;
originally announced February 2024.