EHR-Based Mobile and Web Platform for Chronic Disease Risk Prediction Using Large Language Multimodal Models

Authors: Chun-Chieh Liao, Wei-Ting Kuo, I-Hsuan Hu, Yen-Chen Shih, Jun-En Ding, Feng Liu, Fang-Ming Hung

Abstract: Traditional diagnosis of chronic diseases involves in-person consultations with physicians to identify the disease. However, there is a lack of research focused on predicting and developing application systems using clinical notes and blood test values. We collected five years of Electronic Health Records (EHRs) from Taiwan's hospital database between 2017 and 2021 as an AI database. Furthermore, we developed an EHR-based chronic disease prediction platform utilizing Large Language Multimodal Models (LLMMs), successfully integrating with frontend web and mobile applications for prediction. This prediction platform can also connect to the hospital's backend database, providing physicians with real-time risk assessment diagnostics. The demonstration link can be found at https://www.youtube.com/watch?v=oqmL9DEDFgA.

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2403.15004 [pdf]

ParFormer: Vision Transformer Baseline with Parallel Local Global Token Mixer and Convolution Attention Patch Embedding

Authors: Novendra Setyawan, Ghufron Wahyu Kurniawan, Chi-Chia Sun, Jun-Wei Hsieh, Hui-Kai Su, Wen-Kai Kuo

Abstract: This work presents ParFormer as an enhanced transformer architecture that allows the incorporation of different token mixers into a single stage, hence improving feature extraction capabilities. Integrating both local and global data allows for precise representation of short- and long-range spatial relationships without the need for computationally intensive methods such as shifting windows. Along with the parallel token mixer encoder, we offer the Convolutional Attention Patch Embedding (CAPE) as an enhancement of standard patch embedding to improve token mixer extraction with a convolutional attention module. Our comprehensive evaluation demonstrates that our ParFormer outperforms CNN-based and state-of-the-art transformer-based architectures in image classification and several complex tasks such as object recognition. The proposed CAPE has been demonstrated to benefit the overall MetaFormer architecture, even while utilizing the Identity Mapping Token Mixer, resulting in a 0.5% increase in accuracy. The ParFormer models outperformed ConvNeXt and Swin Transformer for the pure convolution and transformer model in accuracy. Furthermore, our model surpasses the current leading hybrid transformer by reaching competitive Top-1 scores in the ImageNet-1K classification test. Specifically, our model variants with 11M, 23M, and 34M parameters achieve scores of 80.4%, 82.1%, and 83.1%, respectively. Code: https://github.com/novendrastywn/ParFormer-CAPE-2024

Submitted 22 March, 2024; originally announced March 2024.

arXiv:2401.02402 [pdf, other]

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Authors: Zihao Xiao, Longlong Jing, Shangxuan Wu, Alex Zihao Zhu, Jingwei Ji, Chiyu Max Jiang, Wei-Chih Hung, Thomas Funkhouser, Weicheng Kuo, Anelia Angelova, Yin Zhou, Shiwei Sheng

Abstract: 3D panoptic segmentation is a challenging perception task, especially in autonomous driving. It aims to predict both semantic and instance annotations for 3D points in a scene. Although prior 3D panoptic segmentation approaches have achieved great performance on closed-set benchmarks, generalizing these approaches to unseen things and unseen stuff categories remains an open problem. For unseen object categories, 2D open-vocabulary segmentation has achieved promising results that solely rely on frozen CLIP backbones and ensembling multiple classification outputs. However, we find that simply extending these 2D models to 3D does not guarantee good performance due to poor per-mask classification quality, especially for novel stuff categories. In this paper, we propose the first method to tackle 3D open-vocabulary panoptic segmentation. Our model takes advantage of the fusion between learnable LiDAR features and dense frozen vision CLIP features, using a single classification head to make predictions for both base and novel classes. To further improve the classification performance on novel classes and leverage the CLIP model, we propose two novel loss functions: object-level distillation loss and voxel-level distillation loss. Our experiments on the nuScenes and SemanticKITTI datasets show that our method outperforms the strong baseline by a large margin.

Submitted 2 April, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

arXiv:2310.00161 [pdf, other]

Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection

Authors: Dahun Kim, Anelia Angelova, Weicheng Kuo

Abstract: We present a new open-vocabulary detection approach based on detection-oriented image-text pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we replace the commonly used classification architecture with the detector architecture, which better serves the region-level recognition needs of detection by enabling the detector heads to learn from noisy image-text pairs. Using only standard contrastive loss and no pseudo-labeling, our approach is a simple yet effective extension of the contrastive learning method to learn emergent object-semantic cues. In addition, we propose a shifted-window learning approach upon window attention to make the backbone representation more robust, translation-invariant, and less biased by the window pattern. On the popular LVIS open-vocabulary detection benchmark, our approach sets a new state of the art of 40.4 mask AP$_r$ using the common ViT-L backbone, significantly outperforming the best existing approach by +6.5 mask AP$_r$ at system level. On the COCO benchmark, we achieve very competitive 40.8 novel AP without pseudo labeling or weak supervision. In addition, we evaluate our approach on the transfer detection setup, where ours outperforms the baseline significantly. Visualization reveals emerging object locality from the pretraining recipes compared to the baseline. Code and models will be publicly released.

Submitted 29 September, 2023; originally announced October 2023.

Comments: Tech report

arXiv:2309.00775 [pdf, other]

Contrastive Feature Masking Open-Vocabulary Vision Transformer

Authors: Dahun Kim, Anelia Angelova, Weicheng Kuo

Abstract: We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an image-text pretraining methodology that achieves simultaneous learning of image- and region-level representation for open-vocabulary object detection (OVD). Our approach combines the masked autoencoder (MAE) objective into the contrastive learning objective to improve the representation for localization tasks. Unlike standard MAE, we perform reconstruction in the joint image-text embedding space, rather than the pixel space as is customary with the classical MAE method, which causes the model to better learn region-level semantics. Moreover, we introduce Positional Embedding Dropout (PED) to address scale variation between image-text pretraining and detection finetuning by randomly dropping out the positional embeddings during pretraining. PED improves detection performance and enables the use of a frozen ViT backbone as a region classifier, preventing the forgetting of open-vocabulary knowledge during detection finetuning. On the LVIS open-vocabulary detection benchmark, CFM-ViT achieves a state-of-the-art 33.9 AP$_r$, surpassing the best approach by 7.6 points and achieving better zero-shot detection transfer. Finally, CFM-ViT acquires strong image-level representation, outperforming the state of the art on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks.

Submitted 1 September, 2023; originally announced September 2023.

Comments: Accepted to ICCV 2023

arXiv:2306.01736 [pdf, other]

DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model

Authors: Xiuye Gu, Yin Cui, Jonathan Huang, Abdullah Rashwan, Xuan Yang, Xingyi Zhou, Golnaz Ghiasi, Weicheng Kuo, Huizhong Chen, Liang-Chieh Chen, David A Ross

Abstract: Observing the close relationship among panoptic, semantic and instance segmentation tasks, we propose to train a universal multi-dataset multi-task segmentation model: DaTaSeg. We use a shared representation (mask proposals with class predictions) for all tasks. To tackle task discrepancy, we adopt different merge operations and post-processing for different tasks. We also leverage weak supervision, allowing our segmentation model to benefit from cheaper bounding box annotations. To share knowledge across datasets, we use text embeddings from the same semantic embedding space as classifiers and share all network parameters among datasets. We train DaTaSeg on ADE semantic, COCO panoptic, and Objects365 detection datasets. DaTaSeg improves performance on all datasets, especially small-scale datasets, achieving 54.0 mIoU on ADE semantic and 53.5 PQ on COCO panoptic. DaTaSeg also enables weakly-supervised knowledge transfer on ADE panoptic and Objects365 instance segmentation. Experiments show DaTaSeg scales with the number of training datasets and enables open-vocabulary segmentation through direct transfer. In addition, we annotate an Objects365 instance segmentation set of 1,000 images and will release it as a public benchmark.

Submitted 2 June, 2023; originally announced June 2023.

arXiv:2305.07011 [pdf, other]

Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

Authors: Dahun Kim, Anelia Angelova, Weicheng Kuo

Abstract: We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we propose to randomly crop and resize regions of positional embeddings instead of using the whole image positional embeddings. This better matches the use of positional embeddings at region-level in the detection finetuning phase. In addition, we replace the common softmax cross entropy loss in contrastive learning with focal loss to better learn the informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and zero-shot transfer. RO-ViT achieves a state-of-the-art 34.1 AP$_r$ on LVIS, surpassing the best existing approach by +7.8 points in addition to competitive zero-shot transfer detection. Surprisingly, RO-ViT improves the image-level representation as well and achieves the state of the art on 9 out of 12 metrics on COCO and Flickr image-text retrieval benchmarks, outperforming competitive approaches with larger models.

Submitted 28 August, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

Comments: CVPR 2023 Highlight - https://github.com/mcahny/rovit ; adds LAION-2B result
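
The RO-ViT abstract mentions replacing softmax cross entropy with focal loss in contrastive learning but does not give the formula. The following is a minimal sketch of one generic way to apply focal modulation to a symmetric image-text contrastive loss; the function name `focal_contrastive_loss`, the `gamma` default, and the exact form of the modulation are assumptions for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_contrastive_loss(sim, gamma=2.0):
    """Focal-modulated symmetric contrastive loss (illustrative sketch).

    sim[i][j] is the similarity logit between image i and text j; matching
    pairs are assumed to lie on the diagonal. With gamma = 0 this reduces to
    the usual softmax cross entropy averaged over both directions.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        # Probability of the correct text given image i (row softmax).
        p_i2t = softmax(sim[i])[i]
        # Probability of the correct image given text i (column softmax).
        p_t2i = softmax([sim[k][i] for k in range(n)])[i]
        for p in (p_i2t, p_t2i):
            # (1 - p)^gamma down-weights pairs that are already easy.
            total += -((1.0 - p) ** gamma) * math.log(p)
    return total / (2 * n)
```

Setting `gamma=0.0` recovers the plain cross-entropy form, which makes the sketch easy to compare against a standard contrastive baseline.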

arXiv:2304.06028 [pdf, other]

RECLIP: Resource-efficient CLIP by Training with Small Images

Authors: Runze Li, Dahun Kim, Bir Bhanu, Weicheng Kuo

Abstract: We present RECLIP (Resource-efficient CLIP), a simple method that minimizes computational resource footprint for CLIP (Contrastive Language Image Pretraining). Inspired by the notion of coarse-to-fine in computer vision, we leverage small images to learn from large-scale language supervision efficiently, and finetune the model with high-resolution data in the end. Since the complexity of the vision transformer heavily depends on input image size, our approach significantly reduces the training resource requirements both in theory and in practice. Using the same batch size and training epoch, RECLIP achieves highly competitive zero-shot classification and image-text retrieval accuracy with 6 to 8x fewer computational resources and 7 to 9x fewer FLOPs than the baseline. Compared to the state-of-the-art contrastive learning methods, RECLIP demonstrates 5 to 59x training resource savings while maintaining highly competitive zero-shot classification and retrieval performance. Finally, RECLIP matches the state of the art in transfer learning to open-vocabulary detection tasks, achieving 32 AP$_r$ on LVIS. We hope this work will pave the path for the broader research community to explore language supervised pretraining in resource-friendly settings.

Submitted 31 August, 2023; v1 submitted 12 April, 2023; originally announced April 2023.

Comments: Published at Transactions on Machine Learning Research
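
The RECLIP abstract notes that vision transformer complexity depends heavily on input image size. A rough sketch of why: the number of patch tokens grows with the square of the image side, and self-attention cost grows with the square of the token count. The image sizes (64 and 224) and patch size (16) below are illustrative assumptions, not settings taken from the paper, and the ratios are not the paper's reported savings.

```python
def num_patch_tokens(image_size, patch_size=16):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


def relative_cost(small_size, large_size, patch_size=16):
    """Compare token counts and pairwise-attention cost for two image sizes."""
    n_small = num_patch_tokens(small_size, patch_size)
    n_large = num_patch_tokens(large_size, patch_size)
    token_ratio = n_large / n_small
    return {
        "tokens_small": n_small,
        "tokens_large": n_large,
        # Per-token layers (e.g. MLPs) scale linearly with the token count.
        "token_ratio": token_ratio,
        # Attention compares every token with every other token.
        "attention_ratio": token_ratio ** 2,
    }


if __name__ == "__main__":
    print(relative_cost(64, 224))
```

Real training cost also depends on the text encoder, batch size, and hardware, so these ratios only bound the vision-side token arithmetic.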

arXiv:2303.16839 [pdf, other]

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

Authors: Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, Anelia Angelova

Abstract: The development of language models has moved from encoder-decoder to decoder-only designs. In addition, we observe that the two most popular multimodal tasks, the generative and contrastive tasks, are nontrivial to accommodate in one architecture, and further need adaptations for downstream tasks. We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective in joint learning of these disparate vision-language tasks. This is done with a simple model, called MaMMUT. It consists of a single vision encoder and a text decoder, and is able to accommodate contrastive and generative learning by a novel two-pass approach on the text decoder. We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks. Furthermore, the same architecture enables straightforward extensions to open-vocabulary object detection and video-language tasks. The model tackles a diverse range of tasks, while being modest in capacity. Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models. It shows very competitive results on VQA and Video Captioning, especially considering its capacity. Ablations confirm the flexibility and advantages of our approach.

Submitted 9 August, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

Comments: Published in Transactions on Machine Learning Research ( https://jmlr.org/tmlr/ ). 18 pages, 4 figures

arXiv:2212.03229 [pdf, other]

Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning

Authors: AJ Piergiovanni, Weicheng Kuo, Anelia Angelova

Abstract: We present a simple approach which can turn a ViT encoder into an efficient video model, which can seamlessly work with both image and video inputs. By sparsely sampling the inputs, the model is able to do training and inference from both inputs. The model is easily scalable and can be adapted to large-scale pre-trained ViTs without requiring full finetuning. The model achieves SOTA results and the code will be open-sourced.

Submitted 6 December, 2022; originally announced December 2022.

arXiv:2209.15639 [pdf, other]

F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

Authors: Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova

Abstract: We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models. F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1) retains the locality-sensitive features necessary for detection, and 2) is a strong region classifier. We finetune only the detector head and combine the detector and VLM outputs for each region at inference time. F-VLM shows compelling scaling behavior and achieves +6.5 mask AP improvement over the previous state of the art on novel categories of the LVIS open-vocabulary detection benchmark. In addition, we demonstrate very competitive results on the COCO open-vocabulary detection benchmark and cross-dataset transfer detection, along with significant training speed-up and compute savings. Code will be released at https://sites.google.com/view/f-vlm/home

Submitted 23 February, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

Comments: Accepted to ICLR 2023 (https://iclr.cc/Conferences/2023). 20 pages, 7 figures

arXiv:2209.06794 [pdf, other]

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Authors: Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, et al. (4 additional authors not shown)

Abstract: Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.

Submitted 5 June, 2023; v1 submitted 14 September, 2022; originally announced September 2022.

Comments: ICLR 2023 (Notable-top-5%)

arXiv:2209.04372 [pdf, other]

Pre-training image-language transformers for open-vocabulary tasks

Authors: AJ Piergiovanni, Weicheng Kuo, Anelia Angelova

Abstract: We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks. We explore both the use of image-text captioning data in pre-training, which does not need additional supervision, as well as object-aware strategies to pre-train the model. We evaluate the method on a number of text-generative vision+language tasks, such as Visual Question Answering, visual entailment and captioning, and demonstrate large gains over standard pre-training methods.

Submitted 9 September, 2022; originally announced September 2022.

arXiv:2208.00934 [pdf, other]

Video Question Answering with Iterative Video-Text Co-Tokenization

Authors: AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova

Abstract: Video question answering is a challenging task that requires understanding jointly the language input, the visual information in individual video frames, as well as the temporal information about the events occurring in the video. In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization approach to answer a variety of questions related to videos. We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, IVQA, outperforming the previous state-of-the-art by large margins. Simultaneously, our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.

Submitted 1 August, 2022; originally announced August 2022.

Comments: ECCV 2022

arXiv:2205.00949 [pdf, other]

Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering

Authors: AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch, Anelia Angelova

Abstract: We present Answer-Me, a task-aware multi-task framework which unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning. In contrast to previous works using contrastive or generative captioning training, we propose a novel and simple recipe to pre-train a vision-language joint model, which is multi-task as well. The pre-training uses only noisy image captioning data, and is formulated to use the entire architecture end-to-end with both a strong language encoder and decoder. Our results show state-of-the-art performance, zero-shot generalization, robustness to forgetting, and competitive single-task results across a variety of question answering tasks. Our multi-task mixture training learns from tasks of various question intents and thus generalizes better, including on zero-shot vision-language tasks. We conduct experiments in the challenging multi-task and open-vocabulary settings and across a variety of datasets and tasks, such as VQA2.0, SNLI-VE, NLVR2, GQA. We observe that the proposed approach is able to generalize to unseen tasks and that more diverse mixtures lead to higher accuracy in both known and novel tasks.

Submitted 30 November, 2022; v1 submitted 2 May, 2022; originally announced May 2022.

arXiv:2203.17273 [pdf, other]

FindIt: Generalized Localization with Natural Language Queries

Authors: Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, Anelia Angelova

Abstract: We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks including referring expression comprehension, text-based localization, and object detection. Key to our architecture is an efficient multi-scale fusion module that unifies the disparate localization requirements across the tasks. In addition, we discover that a standard object detector is surprisingly effective in unifying these tasks without a need for task-specific design, losses, or pre-computed detections. Our end-to-end trainable framework responds flexibly and accurately to a wide range of referring expression, localization or detection queries for zero, one, or multiple objects. Jointly trained on these tasks, FindIt outperforms the state of the art on both referring expression and text-based localization, and shows competitive performance on object detection. Finally, FindIt generalizes better to out-of-distribution data and novel categories compared to strong single-task baselines. All of these are accomplished by a single, unified and efficient model. The code will be released.

Submitted 8 August, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

Comments: Accepted to ECCV 2022 (European Conference on Computer Vision)

arXiv:2110.06413 [pdf]

3LSAA: A Secure And Privacy-preserving Zero-knowledge-based Data-sharing Approach Under An Untrusted Environment

Authors: Wei-Yi Kuo, Ren-Song Tsay

Abstract: As data collection and analysis become critical functions for many cloud applications, proper data sharing with approved parties is required. However, the traditional data sharing scheme through centralized data escrow servers may sacrifice owners' privacy and is weak in security. Mainly, the servers physically own all data while the original data owners have only virtual ownership and lose actual access control. Therefore, we propose a 3-layer SSE-ABE-AES (3LSAA) cryptography-based privacy-protected data-sharing protocol based on the assumption that servers are honest-but-curious. The 3LSAA protocol realizes automatic access control management and convenient file search even if the server is not trustable. Besides achieving data self-sovereignty, our approach also improves system usability, eliminates the defects in the traditional SSE and ABE approaches, and provides a local AES key recovery method for user's availability.

Submitted 12 October, 2021; originally announced October 2021.

Comments: 21 pages, 7 figures

arXiv:2108.09368 [pdf, other]

Patch2CAD: Patchwise Embedding Learning for In-the-Wild Shape Retrieval from a Single Image

Authors: Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, Angela Dai

Abstract: 3D perception of object shapes from RGB image input is fundamental towards semantic scene understanding, grounding image-based perception in our spatially 3-dimensional real-world environments. To achieve a mapping between image views of objects and 3D shapes, we leverage CAD model priors from existing large-scale databases, and propose a novel approach towards constructing a joint embedding space between 2D images and 3D CAD models in a patch-wise fashion -- establishing correspondences between patches of an image view of an object and patches of CAD geometry. This enables part similarity reasoning for retrieving similar CADs to a new image view without exact matches in the database. Our patch embedding provides more robust CAD retrieval for shape estimation in our end-to-end estimation of CAD model shape and pose for detected objects in a single input image. Experiments on in-the-wild, complex imagery from ScanNet show that our approach is more robust than state of the art in real-world scenarios without any exact CAD matches.

Submitted 20 August, 2021; originally announced August 2021.

Comments: To appear at ICCV 2021 (IEEE/CVF International Conference on Computer Vision)

arXiv:2108.06753 [pdf, other]

Learning Open-World Object Proposals without Learning to Classify

Authors: Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, Weicheng Kuo

Abstract: Object proposals have become an integral preprocessing step of many vision pipelines including object detection, weakly supervised detection, object discovery, tracking, etc. Compared to the learning-free methods, learning-based proposals have become popular recently due to the growing interest in object detection. The common paradigm is to learn object proposals from data labeled with a set of object regions and their corresponding categories. However, this approach often struggles with novel objects in the open world that are absent in the training set. In this paper, we identify that the problem is that the binary classifiers in existing proposal methods tend to overfit to the training categories. Therefore, we propose a classification-free Object Localization Network (OLN) which estimates the objectness of each region purely by how well the location and shape of a region overlap with any ground-truth object (e.g., centerness and IoU). This simple strategy learns generalizable objectness and outperforms existing proposals on cross-category generalization on COCO, as well as cross-dataset evaluation on RoboNet, Objects365, and EpicKitchens. Finally, we demonstrate the merit of OLN for long-tail object detection on the large-vocabulary dataset LVIS, where we notice clear improvement in rare and common categories.

Submitted 15 August, 2021; originally announced August 2021.

arXiv:2108.02226 [pdf]

doi 10.1038/s41597-023-02407-5

Terabyte-scale supervised 3D training and benchmarking dataset of the mouse kidney

Authors: Willy Kuo, Diego Rossinelli, Georg Schulz, Roland H. Wenger, Simone Hieber, Bert Müller, Vartan Kurtcuoglu

Abstract: The performance of machine learning algorithms, when used for segmenting 3D biomedical images, does not reach the level expected based on results achieved with 2D photos. This may be explained by the comparative lack of high-volume, high-quality training datasets, which require state-of-the-art imaging facilities, domain experts for annotation and large computational and personal resources. The HR-Kidney dataset presented in this work bridges this gap by providing 1.7 TB of artefact-corrected synchrotron radiation-based X-ray phase-contrast microtomography images of whole mouse kidneys and validated segmentations of 33 729 glomeruli, which corresponds to a one to two orders of magnitude increase over currently available biomedical datasets. The image sets also contain the underlying raw data, threshold- and morphology-based semi-automatic segmentations of renal vasculature and uriniferous tubules, as well as true 3D manual annotations. We therewith provide a broad basis for the scientific community to build upon and expand in the fields of image processing, data augmentation and machine learning, in particular unsupervised and semi-supervised learning investigations, as well as transfer learning and generative adversarial networks.

Submitted 28 July, 2023; v1 submitted 4 August, 2021; originally announced August 2021.

Journal ref: Scientific Data 10, 510 (2023)

arXiv:2104.13921 [pdf, other]

Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

Authors: Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui

Abstract: We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly to further scale up the number of classes contained in existing object detection datasets. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and image regions of object proposals. Then we train a student detector, whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher. We benchmark on LVIS by holding out all rare categories as novel categories that are not seen during training. ViLD obtains 16.1 mask AP$_r$ with a ResNet-50 backbone, even outperforming the supervised counterpart by 3.8. When trained with a stronger teacher model ALIGN, ViLD achieves 26.3 AP$_r$. The model can directly transfer to other datasets without finetuning, achieving 72.2 AP$_{50}$ on PASCAL VOC, 36.6 AP on COCO and 11.8 AP on Objects365. On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP. Code and demo are open-sourced at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild.

Submitted 11 May, 2022; v1 submitted 28 April, 2021; originally announced April 2021.

Comments: ICLR Camera Ready

Journal ref: ICLR 2022

arXiv:2011.10433 [pdf, other]

doi 10.1371/journal.pone.0249580

Crowdsourcing Airway Annotations in Chest Computed Tomography Images

Authors: Veronika Cheplygina, Adria Perez-Rovira, Wieying Kuo, Harm A. W. M. Tiddens, Marleen de Bruijne

Abstract: Measuring airways in chest computed tomography (CT) scans is important for characterizing diseases such as cystic fibrosis, yet very time-consuming to perform manually. Machine learning algorithms offer an alternative, but need large sets of annotated scans for good performance. We investigate whether crowdsourcing can be used to gather airway annotations. We generate image slices at known locations of airways in 24 subjects and request the crowd workers to outline the airway lumen and airway wall. After combining multiple crowd workers, we compare the measurements to those made by the experts in the original scans. Similar to our preliminary study, a large portion of the annotations were excluded, possibly due to workers misunderstanding the instructions. After excluding such annotations, moderate to strong correlations with the expert can be observed, although these correlations are slightly lower than inter-expert correlations. Furthermore, the results across subjects in this study are quite variable. Although the crowd has potential in annotating airways, further development is needed for it to be robust enough for gathering annotations in practice. For reproducibility, data and code are available online: http://github.com/adriapr/crowdairway.git.

Submitted 20 November, 2020; originally announced November 2020.

arXiv:2007.13034 [pdf, other]

Mask2CAD: 3D Shape Prediction by Learning to Segment and Retrieve

Authors: Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, Angela Dai

Abstract: Object recognition has seen significant progress in the image domain, with focus primarily on 2D perception. We propose to leverage existing large-scale datasets of 3D models to understand the underlying 3D structure of objects seen in an image by constructing a CAD-based representation of the objects and their poses. We present Mask2CAD, which jointly detects objects in real-world images and for each detected object, optimizes for the most similar CAD model and its pose. We construct a joint embedding space between the detected regions of an image corresponding to an object and 3D CAD models, enabling retrieval of CAD models for an input RGB image. This produces a clean, lightweight representation of the objects in an image; this CAD-based representation ensures a valid, efficient shape representation for applications such as content creation or interactive scenarios, and makes a step towards understanding the transformation of real-world imagery to a synthetic domain. Experiments on real-world images from Pix3D demonstrate the advantage of our approach in comparison to state of the art. To facilitate future research, we additionally propose a new image-to-3D baseline on ScanNet which features larger shape diversity, real-world occlusions, and challenging image views.

Submitted 25 July, 2020; originally announced July 2020.

Comments: ECCV 2020 (Spotlight)

arXiv:1904.03239 [pdf, other]

ShapeMask: Learning to Segment Novel Objects by Refining Shape Priors

Authors: Weicheng Kuo, Anelia Angelova, Jitendra Malik, Tsung-Yi Lin

Abstract: Instance segmentation aims to detect and segment individual objects in a scene. Most existing methods rely on precise mask annotations of every category. However, it is difficult and costly to segment objects in novel categories because a large number of mask annotations is required. We introduce ShapeMask, which learns the intermediate concept of object shape to address the problem of generalization in instance segmentation to novel categories. ShapeMask starts with a bounding box detection and gradually refines it by first estimating the shape of the detected object through a collection of shape priors. Next, ShapeMask refines the coarse shape into an instance level mask by learning instance embeddings. The shape priors provide a strong cue for object-like prediction, and the instance embeddings model the instance specific appearance information. ShapeMask significantly outperforms the state-of-the-art by 6.4 and 3.8 AP when learning across categories, and obtains competitive performance in the fully supervised setting. It is also robust to inaccurate detections, decreased model capacity, and small training data. Moreover, it runs efficiently with 150ms inference time and trains within 11 hours on TPUs. With a larger backbone model, ShapeMask increases the gap with state-of-the-art to 9.4 and 6.2 AP across categories. Code will be released.

Submitted 5 April, 2019; originally announced April 2019.

Journal ref: The IEEE International Conference on Computer Vision (ICCV), 2019, pp. 9207-9216

arXiv:1809.02882 [pdf, other]

Cost-Sensitive Active Learning for Intracranial Hemorrhage Detection

Authors: Weicheng Kuo, Christian Häne, Esther Yuh, Pratik Mukherjee, Jitendra Malik

Abstract: Deep learning for clinical applications is subject to stringent performance requirements, which raises a need for large labeled datasets. However, the enormous cost of labeling medical data makes this challenging. In this paper, we build a cost-sensitive active learning system for the problem of intracranial hemorrhage detection and segmentation on head computed tomography (CT). We show that our ensemble method compares favorably with the state-of-the-art, while running faster and using less memory. Moreover, our experiments are done using a substantially larger dataset than earlier papers on this topic. Since the labeling time could vary tremendously across examples, we model the labeling time and optimize the return on investment. We validate this idea by core-set selection on our large labeled dataset and by growing it with data from the wild.

Submitted 8 September, 2018; originally announced September 2018.

arXiv:1806.03265 [pdf, other]

PatchFCN for Intracranial Hemorrhage Detection

Authors: Weicheng Kuo, Christian Häne, Esther Yuh, Pratik Mukherjee, Jitendra Malik

Abstract: This paper studies the problem of detecting and segmenting acute intracranial hemorrhage on head computed tomography (CT) scans. We propose to solve both tasks as a semantic segmentation problem using a patch-based fully convolutional network (PatchFCN). This formulation allows us to accurately localize hemorrhages while bypassing the complexity of object detection. Our system demonstrates competitive performance with a human expert and the state-of-the-art on classification tasks (0.976, 0.966 AUC of ROC on retrospective and prospective test sets) and on segmentation tasks (0.785 pixel AP, 0.766 Dice score), while using much less data and a simpler system. In addition, we conduct a series of controlled experiments to understand "why" PatchFCN outperforms standard FCN. Our studies show that PatchFCN finds a good trade-off between batch diversity and the amount of context during training. These findings may also apply to other medical segmentation tasks.

Submitted 14 April, 2019; v1 submitted 8 June, 2018; originally announced June 2018.

arXiv:1712.02310 [pdf, other]

From Lifestyle Vlogs to Everyday Interactions

Authors: David F. Fouhey, Wei-cheng Kuo, Alexei A. Efros, Jitendra Malik

Abstract: A major stumbling block to progress in understanding basic human interactions, such as getting out of bed or opening a refrigerator, is lack of good training data. Most past efforts have gathered this data explicitly: starting with a laundry list of action labels, and then querying search engines for videos tagged with each label. In this work, we do the reverse and search implicitly: we start with a large collection of interaction-rich video data and then annotate and analyze it. We use Internet Lifestyle Vlogs as the source of surprisingly large and diverse interaction data. We show that by collecting the data first, we are able to achieve greater scale and far greater diversity in terms of actions and actors. Additionally, our data exposes biases built into common explicitly gathered data. We make sense of our data by analyzing the central component of interaction -- hands. We benchmark two tasks: identifying semantic object contact at the video level and non-semantic contact state at the frame level. We additionally demonstrate future prediction of hands.

Submitted 6 December, 2017; originally announced December 2017.

Comments: Project page at: http://people.eecs.berkeley.edu/~dfouhey/2017/VLOG/

arXiv:1706.02055 [pdf, other]

doi 10.1007/978-3-319-46976-8_22

Early Experiences with Crowdsourcing Airway Annotations in Chest CT

Authors: Veronika Cheplygina, Adria Perez-Rovira, Wieying Kuo, Harm A. W. M. Tiddens, Marleen de Bruijne

Abstract: Measuring airways in chest computed tomography (CT) images is important for characterizing diseases such as cystic fibrosis, yet very time-consuming to perform manually. Machine learning algorithms offer an alternative, but need large sets of annotated data to perform well. We investigate whether crowdsourcing can be used to gather airway annotations which can serve directly for measuring the airways, or as training data for the algorithms. We generate image slices at known locations of airways and request untrained crowd workers to outline the airway lumen and airway wall. Our results show that the workers are able to interpret the images, but that the instructions are too complex, leading to many unusable annotations. After excluding unusable annotations, quantitative results show medium to high correlations with expert measurements of the airways. Based on this positive experience, we describe a number of further research directions and provide insight into the challenges of crowdsourcing in medical images from the perspective of first-time users.

Submitted 7 June, 2017; originally announced June 2017.

Journal ref: LABELS 2016, DLMIA 2016: Deep Learning and Data Labeling for Medical Applications pp 209-218

arXiv:1606.04205 [pdf, ps, other]

Robust And Optimal Opportunistic Scheduling For Downlink 2-Flow Network Coding With Varying Channel Quality and Rate Adaptation (New Simulation Figures)

Authors: Wei-Cheng Kuo, Chih-Chun Wang

Abstract: This paper considers the downlink traffic from a base station to two different clients. When assuming infinite backlog, it is known that inter-session network coding (INC) can significantly increase the throughput. However, the corresponding scheduling solution (when assuming dynamic arrivals instead and requiring bounded delay) is still nascent. For the 2-flow downlink scenario, we propose the first opportunistic INC + scheduling solution that is provably optimal for time-varying channels, i.e., the corresponding stability region matches the optimal Shannon capacity. Specifically, we first introduce a new binary INC operation, which is distinctly different from the traditional wisdom of XORing two overheard packets. We then develop a queue-length-based scheduling scheme and prove that it, with the help of the new INC operation, achieves the optimal stability region with time-varying channel quality. The proposed algorithm is later generalized to include the capability of rate adaptation. Simulation results show that it again achieves the optimal throughput with rate adaptation. A byproduct of our results is a scheduling scheme for stochastic processing networks (SPNs) with random departure, which relaxes the assumption of deterministic departure in the existing results.

Submitted 14 June, 2016; originally announced June 2016.

Comments: Extended version for the ToN journal paper

arXiv:1505.02146 [pdf, other]

DeepBox: Learning Objectness with Convolutional Networks

Authors: Weicheng Kuo, Bharath Hariharan, Jitendra Malik

Abstract: Existing object proposal approaches use primarily bottom-up cues to rank proposals, while we believe that objectness is in fact a high level construct. We argue for a data-driven, semantic approach for ranking object proposals. Our framework, which we call DeepBox, uses convolutional neural networks (CNNs) to rerank proposals from a bottom-up method. We use a novel four-layer CNN architecture that is as good as much larger networks on the task of evaluating objectness while being much faster. We show that DeepBox significantly improves over the bottom-up ranking, achieving the same recall with 500 proposals as achieved by bottom-up methods with 2000. This improvement generalizes to categories the CNN has never seen before and leads to a 4.5-point gain in detection mAP. Our implementation achieves this performance while running at 260 ms per image.

Submitted 26 September, 2015; v1 submitted 8 May, 2015; originally announced May 2015.

Comments: ICCV 2015 Camera-ready version

arXiv:1410.1851 [pdf, ps, other]

Robust And Optimal Opportunistic Scheduling For Downlink 2-Flow Network Coding With Varying Channel Quality and Rate Adaptation

Authors: Wei-Cheng Kuo, Chih-Chun Wang

Abstract: This paper considers the downlink traffic from a base station to two different clients. When assuming infinite backlog, it is known that inter-session network coding (INC) can significantly increase the throughput of each flow. However, the corresponding scheduling solution (when assuming dynamic arrivals instead and requiring bounded delay) is still nascent. For the 2-flow downlink scenario, we propose the first opportunistic INC + scheduling solution that is provably optimal for time-varying channels, i.e., the corresponding stability region matches the optimal Shannon capacity. Specifically, we first introduce a new binary INC operation, which is distinctly different from the traditional wisdom of XORing two overheard packets. We then develop a queue-length-based scheduling scheme, which, with the help of the new INC operation, can robustly and optimally adapt to time-varying channel quality. We then show that the proposed algorithm can be easily extended for rate adaptation and it again robustly achieves the optimal throughput. A byproduct of our results is a scheduling scheme for stochastic processing networks (SPNs) with random departure, which relaxes the assumption of deterministic departure in the existing results. The new SPN scheduler could thus further broaden the applications of SPN scheduling to other real-world scenarios.

Submitted 7 October, 2014; originally announced October 2014.

Showing 1–31 of 31 results for author: Kuo, W