Showing 1–31 of 31 results for author: Kuo, W

  1. arXiv:2406.18087  [pdf, other]

    cs.SE cs.AI cs.CL

    EHR-Based Mobile and Web Platform for Chronic Disease Risk Prediction Using Large Language Multimodal Models

    Authors: Chun-Chieh Liao, Wei-Ting Kuo, I-Hsuan Hu, Yen-Chen Shih, Jun-En Ding, Feng Liu, Fang-Ming Hung

    Abstract: Traditional diagnosis of chronic diseases involves in-person consultations with physicians to identify the disease. However, there is a lack of research focused on predicting and developing application systems using clinical notes and blood test values. We collected five years of Electronic Health Records (EHRs) from Taiwan's hospital database between 2017 and 2021 as an AI database. Furthermore,…

    Submitted 26 June, 2024; originally announced June 2024.

  2. arXiv:2403.15004  [pdf]

    cs.CV cs.LG

    ParFormer: Vision Transformer Baseline with Parallel Local Global Token Mixer and Convolution Attention Patch Embedding

    Authors: Novendra Setyawan, Ghufron Wahyu Kurniawan, Chi-Chia Sun, Jun-Wei Hsieh, Hui-Kai Su, Wen-Kai Kuo

    Abstract: This work presents ParFormer as an enhanced transformer architecture that allows the incorporation of different token mixers into a single stage, hence improving feature extraction capabilities. Integrating both local and global data allows for precise representation of short- and long-range spatial relationships without the need for computationally intensive methods such as shifting windows. Alon…

    Submitted 22 March, 2024; originally announced March 2024.

  3. arXiv:2401.02402  [pdf, other]

    cs.CV

    3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

    Authors: Zihao Xiao, Longlong Jing, Shangxuan Wu, Alex Zihao Zhu, Jingwei Ji, Chiyu Max Jiang, Wei-Chih Hung, Thomas Funkhouser, Weicheng Kuo, Anelia Angelova, Yin Zhou, Shiwei Sheng

    Abstract: 3D panoptic segmentation is a challenging perception task, especially in autonomous driving. It aims to predict both semantic and instance annotations for 3D points in a scene. Although prior 3D panoptic segmentation approaches have achieved great performance on closed-set benchmarks, generalizing these approaches to unseen things and unseen stuff categories remains an open problem. For unseen obj…

    Submitted 2 April, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

  4. arXiv:2310.00161  [pdf, other]

    cs.CV cs.AI cs.LG

    Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection

    Authors: Dahun Kim, Anelia Angelova, Weicheng Kuo

    Abstract: We present a new open-vocabulary detection approach based on detection-oriented image-text pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we replace the commonly used classification architecture with the detector architecture, which better serves the region-level recognition needs of detection by enabling the detector h…

    Submitted 29 September, 2023; originally announced October 2023.

    Comments: Tech report

  5. arXiv:2309.00775  [pdf, other]

    cs.CV cs.AI cs.LG

    Contrastive Feature Masking Open-Vocabulary Vision Transformer

    Authors: Dahun Kim, Anelia Angelova, Weicheng Kuo

    Abstract: We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an image-text pretraining methodology that achieves simultaneous learning of image- and region-level representation for open-vocabulary object detection (OVD). Our approach combines the masked autoencoder (MAE) objective into the contrastive learning objective to improve the representation for localization tasks. Unlike standard…

    Submitted 1 September, 2023; originally announced September 2023.

    Comments: Accepted to ICCV 2023

  6. arXiv:2306.01736  [pdf, other]

    cs.CV cs.AI cs.LG

    DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model

    Authors: Xiuye Gu, Yin Cui, Jonathan Huang, Abdullah Rashwan, Xuan Yang, Xingyi Zhou, Golnaz Ghiasi, Weicheng Kuo, Huizhong Chen, Liang-Chieh Chen, David A Ross

    Abstract: Observing the close relationship among panoptic, semantic and instance segmentation tasks, we propose to train a universal multi-dataset multi-task segmentation model: DaTaSeg. We use a shared representation (mask proposals with class predictions) for all tasks. To tackle task discrepancy, we adopt different merge operations and post-processing for different tasks. We also leverage weak-supervision…

    Submitted 2 June, 2023; originally announced June 2023.

  7. arXiv:2305.07011  [pdf, other]

    cs.CV cs.AI cs.CL

    Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

    Authors: Dahun Kim, Anelia Angelova, Weicheng Kuo

    Abstract: We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we propose to randomly crop and resize regions of positional embeddings instead of using the whole image positional embeddings. This better matches the use of positional e…

    Submitted 28 August, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

    Comments: CVPR 2023 Highlight - https://github.com/mcahny/rovit ; adds LAION-2B result

  8. arXiv:2304.06028  [pdf, other]

    cs.CV

    RECLIP: Resource-efficient CLIP by Training with Small Images

    Authors: Runze Li, Dahun Kim, Bir Bhanu, Weicheng Kuo

    Abstract: We present RECLIP (Resource-efficient CLIP), a simple method that minimizes computational resource footprint for CLIP (Contrastive Language Image Pretraining). Inspired by the notion of coarse-to-fine in computer vision, we leverage small images to learn from large-scale language supervision efficiently, and finetune the model with high-resolution data in the end. Since the complexity of the visio…

    Submitted 31 August, 2023; v1 submitted 12 April, 2023; originally announced April 2023.

    Comments: Published at Transactions on Machine Learning Research

  9. arXiv:2303.16839  [pdf, other]

    cs.CV cs.CL cs.LG

    MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

    Authors: Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, Anelia Angelova

    Abstract: The development of language models has moved from encoder-decoder to decoder-only designs. In addition, we observe that the two most popular multimodal tasks, the generative and contrastive tasks, are nontrivial to accommodate in one architecture, and further need adaptations for downstream tasks. We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is sur…

    Submitted 9 August, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

    Comments: Published in Transactions on Machine Learning Research (https://jmlr.org/tmlr/). 18 pages, 4 figures

  10. arXiv:2212.03229  [pdf, other]

    cs.CV

    Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning

    Authors: AJ Piergiovanni, Weicheng Kuo, Anelia Angelova

    Abstract: We present a simple approach which can turn a ViT encoder into an efficient video model, which can seamlessly work with both image and video inputs. By sparsely sampling the inputs, the model is able to do training and inference from both inputs. The model is easily scalable and can be adapted to large-scale pre-trained ViTs without requiring full finetuning. The model achieves SOTA results and th…

    Submitted 6 December, 2022; originally announced December 2022.

  11. arXiv:2209.15639  [pdf, other]

    cs.CV

    F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

    Authors: Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova

    Abstract: We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models. F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1) retains the locality-sensitive features necessary for detection, and 2) is a strong region clas…

    Submitted 23 February, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

    Comments: Accepted to ICLR 2023 (https://iclr.cc/Conferences/2023). 20 pages, 7 figures

  12. arXiv:2209.06794  [pdf, other]

    cs.CV cs.CL

    PaLI: A Jointly-Scaled Multilingual Language-Image Model

    Authors: Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, et al. (4 additional authors not shown)

    Abstract: Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaL…

    Submitted 5 June, 2023; v1 submitted 14 September, 2022; originally announced September 2022.

    Comments: ICLR 2023 (Notable-top-5%)

  13. arXiv:2209.04372  [pdf, other]

    cs.CV

    Pre-training image-language transformers for open-vocabulary tasks

    Authors: AJ Piergiovanni, Weicheng Kuo, Anelia Angelova

    Abstract: We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks. We explore both the use of image-text captioning data in pre-training, which does not need additional supervision, as well as object-aware strategies to pre-train the model. We evaluate the method on a number of text-generative vision+language tasks, such as Visual Question A…

    Submitted 9 September, 2022; originally announced September 2022.

  14. arXiv:2208.00934  [pdf, other]

    cs.CV

    Video Question Answering with Iterative Video-Text Co-Tokenization

    Authors: AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova

    Abstract: Video question answering is a challenging task that requires understanding jointly the language input, the visual information in individual video frames, as well as the temporal information about the events occurring in the video. In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization…

    Submitted 1 August, 2022; originally announced August 2022.

    Comments: ECCV 2022

  15. arXiv:2205.00949  [pdf, other]

    cs.CV

    Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering

    Authors: AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch, Anelia Angelova

    Abstract: We present Answer-Me, a task-aware multi-task framework which unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning. In contrast to previous works using contrastive or generative captioning training, we propose a novel and simple recipe to pre-train a vision-language joint model, which is multi-task as well. The pre-training uses onl…

    Submitted 30 November, 2022; v1 submitted 2 May, 2022; originally announced May 2022.

  16. arXiv:2203.17273  [pdf, other]

    cs.CV

    FindIt: Generalized Localization with Natural Language Queries

    Authors: Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, Anelia Angelova

    Abstract: We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks including referring expression comprehension, text-based localization, and object detection. Key to our architecture is an efficient multi-scale fusion module that unifies the disparate localization requirements across the tasks. In addition, we discover that a standard object dete…

    Submitted 8 August, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

    Comments: Accepted to ECCV 2022 (European Conference on Computer Vision)

  17. arXiv:2110.06413  [pdf]

    cs.CR

    3LSAA: A Secure And Privacy-preserving Zero-knowledge-based Data-sharing Approach Under An Untrusted Environment

    Authors: Wei-Yi Kuo, Ren-Song Tsay

    Abstract: As data collection and analysis become critical functions for many cloud applications, proper data sharing with approved parties is required. However, the traditional data sharing scheme through centralized data escrow servers may sacrifice owners' privacy and is weak in security. Mainly, the servers physically own all data while the original data owners have only virtual ownership and lose actual…

    Submitted 12 October, 2021; originally announced October 2021.

    Comments: 21 pages, 7 figures

  18. arXiv:2108.09368  [pdf, other]

    cs.CV

    Patch2CAD: Patchwise Embedding Learning for In-the-Wild Shape Retrieval from a Single Image

    Authors: Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, Angela Dai

    Abstract: 3D perception of object shapes from RGB image input is fundamental towards semantic scene understanding, grounding image-based perception in our spatially 3-dimensional real-world environments. To achieve a mapping between image views of objects and 3D shapes, we leverage CAD model priors from existing large-scale databases, and propose a novel approach towards constructing a joint embedding space…

    Submitted 20 August, 2021; originally announced August 2021.

    Comments: To appear at ICCV 2021 (IEEE/CVF International Conference on Computer Vision)

  19. arXiv:2108.06753  [pdf, other]

    cs.CV

    Learning Open-World Object Proposals without Learning to Classify

    Authors: Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, Weicheng Kuo

    Abstract: Object proposals have become an integral preprocessing step of many vision pipelines including object detection, weakly supervised detection, object discovery, tracking, etc. Compared to the learning-free methods, learning-based proposals have become popular recently due to the growing interest in object detection. The common paradigm is to learn object proposals from data labeled with a set of o…

    Submitted 15 August, 2021; originally announced August 2021.

  20. arXiv:2108.02226  [pdf]

    cs.CV physics.med-ph q-bio.TO

    Terabyte-scale supervised 3D training and benchmarking dataset of the mouse kidney

    Authors: Willy Kuo, Diego Rossinelli, Georg Schulz, Roland H. Wenger, Simone Hieber, Bert Müller, Vartan Kurtcuoglu

    Abstract: The performance of machine learning algorithms, when used for segmenting 3D biomedical images, does not reach the level expected based on results achieved with 2D photos. This may be explained by the comparative lack of high-volume, high-quality training datasets, which require state-of-the-art imaging facilities, domain experts for annotation and large computational and personal resources. The HR…

    Submitted 28 July, 2023; v1 submitted 4 August, 2021; originally announced August 2021.

    Journal ref: Scientific Data 10, 510 (2023)

  21. arXiv:2104.13921  [pdf, other]

    cs.CV cs.AI cs.LG

    Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

    Authors: Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui

    Abstract: We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly to further scale up the number of classes contained in existing object detection datasets. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our metho…

    Submitted 11 May, 2022; v1 submitted 28 April, 2021; originally announced April 2021.

    Comments: ICLR Camera Ready

    Journal ref: ICLR 2022

  22. Crowdsourcing Airway Annotations in Chest Computed Tomography Images

    Authors: Veronika Cheplygina, Adria Perez-Rovira, Wieying Kuo, Harm A. W. M. Tiddens, Marleen de Bruijne

    Abstract: Measuring airways in chest computed tomography (CT) scans is important for characterizing diseases such as cystic fibrosis, yet very time-consuming to perform manually. Machine learning algorithms offer an alternative, but need large sets of annotated scans for good performance. We investigate whether crowdsourcing can be used to gather airway annotations. We generate image slices at known locatio…

    Submitted 20 November, 2020; originally announced November 2020.

  23. arXiv:2007.13034  [pdf, other]

    cs.CV cs.LG eess.IV

    Mask2CAD: 3D Shape Prediction by Learning to Segment and Retrieve

    Authors: Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, Angela Dai

    Abstract: Object recognition has seen significant progress in the image domain, with focus primarily on 2D perception. We propose to leverage existing large-scale datasets of 3D models to understand the underlying 3D structure of objects seen in an image by constructing a CAD-based representation of the objects and their poses. We present Mask2CAD, which jointly detects objects in real-world images and for…

    Submitted 25 July, 2020; originally announced July 2020.

    Comments: ECCV 2020 (Spotlight)

  24. arXiv:1904.03239  [pdf, other]

    cs.CV

    ShapeMask: Learning to Segment Novel Objects by Refining Shape Priors

    Authors: Weicheng Kuo, Anelia Angelova, Jitendra Malik, Tsung-Yi Lin

    Abstract: Instance segmentation aims to detect and segment individual objects in a scene. Most existing methods rely on precise mask annotations of every category. However, it is difficult and costly to segment objects in novel categories because a large number of mask annotations is required. We introduce ShapeMask, which learns the intermediate concept of object shape to address the problem of generalizat…

    Submitted 5 April, 2019; originally announced April 2019.

    Journal ref: The IEEE International Conference on Computer Vision (ICCV), 2019, pp. 9207-9216

  25. arXiv:1809.02882  [pdf, other]

    cs.CV cs.LG

    Cost-Sensitive Active Learning for Intracranial Hemorrhage Detection

    Authors: Weicheng Kuo, Christian Häne, Esther Yuh, Pratik Mukherjee, Jitendra Malik

    Abstract: Deep learning for clinical applications is subject to stringent performance requirements, which raises a need for large labeled datasets. However, the enormous cost of labeling medical data makes this challenging. In this paper, we build a cost-sensitive active learning system for the problem of intracranial hemorrhage detection and segmentation on head computed tomography (CT). We show that our e…

    Submitted 8 September, 2018; originally announced September 2018.

  26. arXiv:1806.03265  [pdf, other]

    cs.CV

    PatchFCN for Intracranial Hemorrhage Detection

    Authors: Weicheng Kuo, Christian Häne, Esther Yuh, Pratik Mukherjee, Jitendra Malik

    Abstract: This paper studies the problem of detecting and segmenting acute intracranial hemorrhage on head computed tomography (CT) scans. We propose to solve both tasks as a semantic segmentation problem using a patch-based fully convolutional network (PatchFCN). This formulation allows us to accurately localize hemorrhages while bypassing the complexity of object detection. Our system demonstrates competi…

    Submitted 14 April, 2019; v1 submitted 8 June, 2018; originally announced June 2018.

  27. arXiv:1712.02310  [pdf, other]

    cs.CV

    From Lifestyle Vlogs to Everyday Interactions

    Authors: David F. Fouhey, Wei-cheng Kuo, Alexei A. Efros, Jitendra Malik

    Abstract: A major stumbling block to progress in understanding basic human interactions, such as getting out of bed or opening a refrigerator, is lack of good training data. Most past efforts have gathered this data explicitly: starting with a laundry list of action labels, and then querying search engines for videos tagged with each label. In this work, we do the reverse and search implicitly: we start wit…

    Submitted 6 December, 2017; originally announced December 2017.

    Comments: Project page at: http://people.eecs.berkeley.edu/~dfouhey/2017/VLOG/

  28. Early Experiences with Crowdsourcing Airway Annotations in Chest CT

    Authors: Veronika Cheplygina, Adria Perez-Rovira, Wieying Kuo, Harm A. W. M. Tiddens, Marleen de Bruijne

    Abstract: Measuring airways in chest computed tomography (CT) images is important for characterizing diseases such as cystic fibrosis, yet very time-consuming to perform manually. Machine learning algorithms offer an alternative, but need large sets of annotated data to perform well. We investigate whether crowdsourcing can be used to gather airway annotations which can serve directly for measuring the airw…

    Submitted 7 June, 2017; originally announced June 2017.

    Journal ref: LABELS 2016, DLMIA 2016: Deep Learning and Data Labeling for Medical Applications pp 209-218

  29. arXiv:1606.04205  [pdf, ps, other]

    cs.NI

    Robust And Optimal Opportunistic Scheduling For Downlink 2-Flow Network Coding With Varying Channel Quality and Rate Adaptation (New Simulation Figures)

    Authors: Wei-Cheng Kuo, Chih-Chun Wang

    Abstract: This paper considers the downlink traffic from a base station to two different clients. When assuming infinite backlog, it is known that inter-session network coding (INC) can significantly increase the throughput. However, the corresponding scheduling solution (when assuming dynamic arrivals instead and requiring bounded delay) is still nascent. For the 2-flow downlink scenario, we propose the fi…

    Submitted 14 June, 2016; originally announced June 2016.

    Comments: Extended version for the ToN journal paper

  30. arXiv:1505.02146  [pdf, other]

    cs.CV

    DeepBox: Learning Objectness with Convolutional Networks

    Authors: Weicheng Kuo, Bharath Hariharan, Jitendra Malik

    Abstract: Existing object proposal approaches use primarily bottom-up cues to rank proposals, while we believe that objectness is in fact a high level construct. We argue for a data-driven, semantic approach for ranking object proposals. Our framework, which we call DeepBox, uses convolutional neural networks (CNNs) to rerank proposals from a bottom-up method. We use a novel four-layer CNN architecture that…

    Submitted 26 September, 2015; v1 submitted 8 May, 2015; originally announced May 2015.

    Comments: ICCV 2015 Camera-ready version

  31. arXiv:1410.1851  [pdf, ps, other]

    cs.NI cs.IT

    Robust And Optimal Opportunistic Scheduling For Downlink 2-Flow Network Coding With Varying Channel Quality and Rate Adaptation

    Authors: Wei-Cheng Kuo, Chih-Chun Wang

    Abstract: This paper considers the downlink traffic from a base station to two different clients. When assuming infinite backlog, it is known that inter-session network coding (INC) can significantly increase the throughput of each flow. However, the corresponding scheduling solution (when assuming dynamic arrivals instead and requiring bounded delay) is still nascent. For the 2-flow downlink scenario, we…

    Submitted 7 October, 2014; originally announced October 2014.